
How AI Learns to Speak

Speech-to-text and text-to-speech are two of the most technically sophisticated things AI does every day. Here is what is actually happening under the hood.


Ask Siri something and it answers in under a second. Dictate a message and it appears on screen. None of it feels like engineering. But the gap between “you speak” and “it hears” is one of the more technically involved problems in modern AI, and understanding what’s actually happening there changes how you reason about what these systems can and cannot do.

The core problem is simple to state: computers don’t have ears or mouths. They only process numbers. Everything else is translation.

The First Translation: Sound into Numbers

When you speak, you create sound waves: rapid changes in air pressure that travel outward from your vocal cords. A microphone contains a thin membrane called a diaphragm that vibrates when those pressure waves hit it. That mechanical vibration becomes an electrical signal, which is then converted into numbers through a process called sampling.

The core idea

Think of sampling like taking snapshots of a moving car. If you take 16,000 photographs per second, you can reconstruct the car’s movement accurately. Audio works the same way. A typical speech recognition system samples audio 16,000 times per second. Each sample is a measurement of air pressure at that exact moment.

One second of your voice becomes a list of 16,000 numbers: [0.23, 0.45, -0.12, 0.67, …]

Why 16,000? Human speech contains frequencies up to roughly 8,000 Hz. The Nyquist theorem says you need to sample at twice the highest frequency you want to capture. So 2 × 8,000 gives you 16,000 samples per second, just enough to reconstruct everything meaningful in human speech.
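To make that concrete, here is a minimal sketch in Python with NumPy. The 440 Hz tone stands in for a voice; a real system reads these values from a microphone instead of generating them.

```python
import numpy as np

sample_rate = 16_000                          # samples per second
t = np.arange(sample_rate) / sample_rate      # 16,000 evenly spaced moments in one second
samples = np.sin(2 * np.pi * 440 * t)         # "air pressure" measured at each moment (a 440 Hz tone)

print(len(samples))   # 16000
print(samples[:4])    # roughly [0., 0.172, 0.339, 0.495] -- the list of numbers the computer sees
```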

The Spectrogram: Seeing Sound

Raw numbers are not useful for recognition. Looking at 16,000 amplitude measurements per second does not tell you what sounds were made. It only tells you how loud it was at each moment. What you actually need to know is which frequencies were present and when. That is where spectrograms come in.

A spectrogram transforms audio from the time domain into the frequency domain. Instead of asking “how loud is it now?” it asks “which pitches are present now?”

The mathematical tool that creates spectrograms is called the Fast Fourier Transform. The intuitive way to think about it: a prism for sound. White light passed through a prism splits into its component colors. The FFT takes a complex sound wave and splits it into its component frequencies. Your voice saying “hello” is not one thing. It’s a specific combination of 200 Hz, 800 Hz, 1,200 Hz, and dozens of other frequencies all happening simultaneously. The FFT separates them out.
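Here is the prism idea in a few lines of NumPy: mix two tones, run the FFT, and read off which frequencies come back. The tones and the threshold are arbitrary choices for illustration.

```python
import numpy as np

sr = 16_000
t = np.arange(sr) / sr                                                      # one second of time stamps
signal = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 800 * t)   # two tones mixed together

spectrum = np.abs(np.fft.rfft(signal))            # strength of each frequency component
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)    # which frequency each bin corresponds to

for f in freqs[spectrum > spectrum.max() * 0.3]:
    print(f"strong component near {f:.0f} Hz")    # prints 200 Hz and 800 Hz
```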

It analyzes small windows of audio, typically 25 milliseconds, and reports which frequencies were present and at what intensity. Stack those windows across time and you get a spectrogram: a two-dimensional image where the horizontal axis is time, the vertical axis is frequency, and brightness indicates intensity. It’s genuinely one of the more beautiful ways to visualize information I’ve come across in signal processing.

Why mel-scale matters

The raw frequency scale is not how human hearing works. The difference between 100 Hz and 200 Hz sounds enormous. The difference between 8,000 Hz and 8,100 Hz is barely noticeable. The mel-scale adjusts frequency resolution to match human perception: more detail at lower frequencies, less at higher ones. Mel-spectrograms are what modern speech systems actually use.
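If you want to compute one yourself, a sketch with the librosa library looks roughly like this. The file path is a placeholder, and the 25 ms window, 10 ms hop, and 80 mel bands are typical choices rather than requirements.

```python
import librosa
import numpy as np

y, sr = librosa.load("recording.wav", sr=16_000)   # load the audio and resample to 16 kHz

mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=400,        # 25 ms analysis window at 16 kHz
    hop_length=160,   # slide the window forward 10 ms at a time
    n_mels=80,        # 80 perceptually spaced frequency bands
)
mel_db = librosa.power_to_db(mel, ref=np.max)      # log scale, closer to how we perceive loudness

print(mel_db.shape)   # (80, number_of_frames) -- the "image" the recognizer actually sees
```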

Speech to Text: Pattern Recognition at Scale

Now you have a mel-spectrogram. A visual fingerprint of sound. The word “hello” always creates a similar pattern in that fingerprint. The job of speech recognition is to learn those patterns well enough to reverse-engineer the words from the image.

Modern systems use transformer models, the same architecture behind large language models, trained on millions of examples of spectrograms paired with their corresponding text. The model adjusts its internal mathematics until it can reliably predict the right text when it sees a given pattern. After enough examples, it generalizes to new speakers, new accents, new recording conditions.

Speech to Text Pipeline: Microphone → Sampling → FFT → Mel-Spectrogram → Transformer → Text

The attention mechanism in transformers is what makes context possible. The model does not just look at the current sound in isolation. It attends to what came before. If it hears “I saw a fire ___”, it knows the next word is more likely “truck” than “brick” based on surrounding context. This is why modern speech recognition handles homophones and ambiguous phrases far better than older systems.
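At its core, attention is a weighted blend: every position scores its similarity to every other position, turns those scores into weights, and mixes in information accordingly. A toy version in NumPy, with made-up shapes and none of the masking or multiple heads a real model uses, looks roughly like this.

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # how similar each query is to each key
    scores -= scores.max(axis=-1, keepdims=True)   # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)   # softmax -> attention weights
    return weights @ V                             # weighted blend of the values

frames = np.random.randn(4, 8)                     # e.g. four audio frames, 8 features each
out = attention(frames, frames, frames)            # self-attention: each frame looks at all the others
print(out.shape)                                   # (4, 8)
```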

OpenAI’s Whisper is the most widely used open-source model built on exactly this architecture. It was trained on 680,000 hours of audio pulled from the internet, which is why it handles accents and background noise unusually well. Deepgram takes a similar transformer-based approach but optimizes heavily for low latency, which matters when you need transcription in real time rather than after the fact. Both are running this same pipeline under the hood.
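Running Whisper yourself takes a few lines with the open-source openai-whisper package (pip install openai-whisper). The file name here is just a placeholder.

```python
import whisper

model = whisper.load_model("base")            # a small multilingual checkpoint
result = model.transcribe("recording.wav")    # sampling -> mel-spectrogram -> transformer, handled internally
print(result["text"])
```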

Text to Speech: The Harder Problem

Going from text to speech is harder than going from speech to text. The reason is asymmetry. When you hear speech, there is one correct transcription. When you read text, there is no single correct way to say it. The letters give you no information about pitch, duration, emphasis, rhythm, or emotion. Yet natural-sounding speech requires all of those.

Modern text-to-speech systems solve this in two stages.

The first stage, handled by models like Tacotron 2 or FastSpeech, takes text as input and predicts what the mel-spectrogram should look like. It learned from thousands of recordings paired with their text: what pitch a question mark implies, how long vowels last compared to consonants, where emphasis falls in a phrase. The output is not audio. It is the imagined spectrogram of what the audio should sound like.

The second stage is a vocoder, a model that converts spectrograms into actual audio waveforms. This is where a subtle problem arises. The spectrogram tells you which frequencies are present, but not where each frequency is in its cycle at any given moment. That information, called phase, is essential for producing clean audio. It is missing from the spectrogram entirely.

The phase problem

Imagine two identical sine waves playing simultaneously. If their peaks align, the sound doubles in volume. If one peak aligns with the other’s trough, they cancel out completely and you hear nothing. The spectrogram only tells you “800 Hz is present”, not whether that wave’s peak is up or down right now.
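You can watch that happen in a few lines of NumPy; the 800 Hz tone is an arbitrary example.

```python
import numpy as np

sr = 16_000
t = np.arange(sr) / sr
wave_a = np.sin(2 * np.pi * 800 * t)                    # an 800 Hz tone
wave_b_aligned = np.sin(2 * np.pi * 800 * t)            # same tone, peaks aligned
wave_b_shifted = np.sin(2 * np.pi * 800 * t + np.pi)    # same tone, shifted half a cycle

print(np.abs(wave_a + wave_b_aligned).max())    # ~2.0: twice as loud
print(np.abs(wave_a + wave_b_shifted).max())    # ~0.0: silence

# A spectrogram of either version of wave_b says the same thing: "800 Hz is present."
# The phase offset that decides loud-versus-silent is exactly what it throws away.
```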

Modern vocoders like HiFiGAN solve this using Generative Adversarial Networks. Two neural networks compete: a generator tries to produce realistic audio from spectrograms, and a discriminator tries to detect whether audio is real or generated. The generator improves to fool the discriminator, the discriminator improves to catch it, and the cycle repeats millions of times. What emerges is a generator good enough that its output is indistinguishable from human speech in most conditions.
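A minimal sketch of that adversarial loop in PyTorch looks something like this. The tiny generator and discriminator are stand-ins, nothing like the real HiFi-GAN architecture, and the data is random noise; the point is the two competing objectives.

```python
import torch
import torch.nn as nn

generator = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 400))      # mel frame -> audio chunk (toy)
discriminator = nn.Sequential(nn.Linear(400, 256), nn.ReLU(), nn.Linear(256, 1))   # audio chunk -> real/fake logit

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(1_000):
    mel = torch.randn(32, 80)           # placeholder mel-spectrogram frames
    real_audio = torch.randn(32, 400)   # placeholder chunks of real recorded audio

    # 1) Train the discriminator to tell real audio from generated audio
    fake_audio = generator(mel).detach()
    d_loss = bce(discriminator(real_audio), torch.ones(32, 1)) + \
             bce(discriminator(fake_audio), torch.zeros(32, 1))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # 2) Train the generator to fool the discriminator
    fake_audio = generator(mel)
    g_loss = bce(discriminator(fake_audio), torch.ones(32, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```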

ElevenLabs is currently the most capable consumer product built on this stack. It adds voice cloning on top of the base architecture, which lets the model learn the specific acoustic characteristics of a particular voice and reproduce them. Murf.ai takes a similar approach optimized for professional narration. The voice you hear isn’t a recording. It’s a neural network synthesizing audio it has never produced before, shaped to match a voice it studied.

Text to Speech Pipeline: Text → TTS Model → Mel-Spectrogram → Vocoder → Audio → Speaker

What These Systems Still Cannot Do

This technology works well enough that most people never notice when it fails. The failure modes are consistent, though, and worth knowing. Accents and dialects underrepresented in training data get lower recognition accuracy: the model learned from what it saw, and if certain speakers were rare in the training set, those speakers get worse results. Background noise, multiple overlapping speakers, and low-quality microphones all degrade performance in ways that human hearing handles effortlessly. Sarcasm, subtle emotional nuance, and deliberately ambiguous phrasing remain genuinely difficult.

This is also why different products perform differently on the same audio. Google’s Speech-to-Text and Azure AI Speech are trained on different datasets with different optimization targets. Whisper prioritizes breadth across accents and languages. Deepgram prioritizes speed. ElevenLabs prioritizes naturalness in synthesis. They’re all running variations of the same pipeline, but the training data and the specific choices made during training produce meaningfully different results in practice.

Common assumption: AI understands language like humans do
The more accurate picture is sophisticated pattern matching. The model learned that certain acoustic patterns correspond to certain text strings, trained on vast amounts of labeled data. It does not understand meaning. It learned statistical correlations between patterns well enough to be useful in most situations.
Common assumption: TTS just stitches together recorded words
Modern text-to-speech generates entirely new audio that has never existed before. It learned the patterns of human speech well enough to synthesize novel, natural-sounding speech for text it has never encountered.
•   •   •

The next time a voice assistant answers you, something genuinely strange has happened. Your voice became 16,000 numbers per second. Those numbers became a visual image of frequency over time. That image got pattern-matched against millions of training examples. Words got predicted. Those words got converted into an imagined spectrogram. A second neural network synthesized audio from that spectrogram in real time. All of it in under a second.

None of it is magic. But it is worth sitting with the strangeness for a moment before reaching for that word.

