Speech to Data
To convert speech to on-screen text or a computer command, a computer has to go through several complex steps. When you speak, you create vibrations in the air. The analog-to-digital converter (ADC) translates this analog wave into digital data that the computer can understand. To do this, it samples, or digitizes, the sound by taking precise measurements of the wave at frequent intervals. The system filters the digitized sound to remove unwanted noise, and sometimes to separate it into different bands of frequency (frequency is the number of sound-wave vibrations per second, heard by humans as differences in pitch; the higher the frequency, the shorter the wavelength). It also normalizes the sound, or adjusts it to a constant volume level. The sound may also have to be temporally aligned: people don’t always speak at the same speed, so it must be adjusted to match the speed of the template sound samples already stored in the system’s memory.
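To make the sampling and normalizing steps concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any particular recognizer works: the function names, the 16 kHz sample rate, the 16-bit depth and the 440 Hz test tone are all our own choices, and a pure sine wave stands in for a real voice.

```python
import math

def sample_wave(freq_hz, duration_s, sample_rate_hz, amplitude=1.0):
    """Sampling: measure a sine wave at evenly spaced instants."""
    n = round(duration_s * sample_rate_hz)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate_hz)
            for i in range(n)]

def normalize_peak(samples, target_peak=1.0):
    """Normalizing: rescale so the loudest sample reaches target_peak."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # silence stays silence
    scale = target_peak / peak
    return [s * scale for s in samples]

def quantize(samples, bits=16):
    """Round each measurement to a signed integer level, roughly as an ADC would."""
    max_level = 2 ** (bits - 1) - 1
    return [round(max(-1.0, min(1.0, s)) * max_level) for s in samples]

# Assumed values: a quiet 440 Hz tone, 10 ms long, sampled 16,000 times per second.
quiet = sample_wave(440.0, 0.01, 16000, amplitude=0.2)
loud = normalize_peak(quiet)   # same shape, constant peak volume
digital = quantize(loud)       # integers the computer can store
```

A real system would apply noise filtering and speed alignment as well; those steps are left out here to keep the sketch short.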
Next, the signal is divided into small segments as short as a few hundredths of a second, or even thousandths in the case of plosive consonant sounds (consonant stops produced by obstructing airflow in the vocal tract) like “p” or “t.” The program then matches these segments to known phonemes in the appropriate language. A phoneme is the smallest unit of sound in a language that can distinguish one word from another; phonemes are the sounds we put together to form meaningful expressions. There are roughly 40 phonemes in the English language (different linguists have different opinions on the exact number), while other languages have more or fewer phonemes.
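The segmenting step can be sketched in a few lines of Python. The 25-millisecond segment length and 10-millisecond step are assumptions on our part (they are common choices, and they fit the "few hundredths of a second" described above), and the function names are hypothetical. Overlapping the segments is also our choice; the article does not say whether a given system does this.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=25, hop_ms=10):
    """Cut a list of samples into short, overlapping segments (frames)."""
    frame_len = int(sample_rate_hz * frame_ms / 1000)  # samples per segment
    hop_len = int(sample_rate_hz * hop_ms / 1000)      # samples between segment starts
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames

def frame_energy(frame):
    """Average squared amplitude: one simple per-segment measurement."""
    return sum(s * s for s in frame) / len(frame)

# One second of silence at an assumed 16,000 samples per second.
signal = [0.0] * 16000
frames = split_into_frames(signal, 16000)
energies = [frame_energy(f) for f in frames]
```

Each segment would then be compared against stored phoneme patterns. The energy value here is only a stand-in for the richer measurements a real recognizer would take.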
The next step seems simple, but it is actually the most difficult to accomplish and is the focus of most speech recognition research. The program examines phonemes in the context of the other phonemes around them. It runs the contextual phoneme plot through a complex statistical model and compares it to a large library of known words, phrases and sentences. The program then determines what the user was probably saying and either outputs it as text or issues a computer command.
We’ll take a closer look at exactly how it does this next.
Speech Recognition and Statistical Modeling
Early speech recognition systems tried to apply a set of grammatical and syntactical rules to speech. If the words spoken fit into a certain set of rules, the program could determine what the words were. However, human language has numerous exceptions to its own rules, even when it’s spoken consistently. Accents, dialects and mannerisms can vastly change the way certain words or phrases are spoken. Imagine someone from Boston saying the word “barn.” He wouldn’t pronounce the “r” at all, and the word comes out rhyming with “John.” Or consider the sentence, “I’m going to see the ocean.” Most people don’t enunciate their words very carefully. The result might come out as “I’m goin’ da see tha ocean.” They run several of the words together with no noticeable break, such as “I’m goin'” and “the ocean.” Rules-based systems were unsuccessful because they couldn’t handle these variations. This also explains why earlier systems could not handle continuous speech — you had to speak each word separately, with a brief pause in between them.
Today’s speech recognition systems use powerful and complicated statistical modeling systems. These systems use probability and mathematical functions to determine the most likely outcome. According to John Garofolo, Speech Group Manager at the Information Technology Laboratory of the National Institute of Standards and Technology, the two models that dominate the field today are the Hidden Markov Model and neural networks. These methods involve complex mathematical functions, but essentially, they take the information known to the system to figure out the information hidden from it.
The Hidden Markov Model is the most common, so we’ll take a closer look at that process. In this model, each phoneme is like a link in a chain, and the completed chain is a word. However, the chain branches off in different directions as the program attempts to match the digital sound with the phoneme that’s most likely to come next. During this process, the program assigns a probability score to each phoneme, based on its built-in dictionary and user training.
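One standard way to find the most probable chain in a Hidden Markov Model is the Viterbi algorithm. The article's source does not name it, so treat the sketch below as an assumption about how such a search could work, not a description of any specific product. The three phonemes, the three sound labels and every probability are made up for illustration.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state sequence and its probability."""
    # best[t][s] = (probability of the best path ending in s at step t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            column[s] = max(
                ((prev[p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states),
                key=lambda pair: pair[0],
            )
        best.append(column)
    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for t in range(len(best) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]

# Hypothetical toy model for the word "cat" (phonemes k, ae, t).
states = ["k", "ae", "t"]
start_p = {"k": 0.6, "ae": 0.1, "t": 0.3}
trans_p = {  # chance that one phoneme follows another
    "k": {"k": 0.2, "ae": 0.7, "t": 0.1},
    "ae": {"k": 0.1, "ae": 0.3, "t": 0.6},
    "t": {"k": 0.3, "ae": 0.4, "t": 0.3},
}
emit_p = {  # chance that a phoneme produces each kind of sound segment
    "k": {"burst_low": 0.7, "voiced": 0.1, "burst_high": 0.2},
    "ae": {"burst_low": 0.1, "voiced": 0.8, "burst_high": 0.1},
    "t": {"burst_low": 0.2, "voiced": 0.1, "burst_high": 0.7},
}

path, prob = viterbi(["burst_low", "voiced", "burst_high"],
                     states, start_p, trans_p, emit_p)
print(path)  # ['k', 'ae', 't']
```

The sound segments are what the system knows; the phonemes are what is hidden from it. In a trained recognizer, the probability tables would come from the built-in dictionary and user training rather than being typed in by hand.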
This process is even more complicated for phrases and sentences, because the system has to figure out where each word stops and starts. The classic example is the phrase “recognize speech,” which sounds a lot like “wreck a nice beach” when you say it very quickly. To get it right, the program has to analyze the phonemes in light of the phrase that came before them.
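The role of the preceding phrase can be sketched as a simple scoring exercise. All the numbers below are hypothetical: we assume the sound alone leaves the two phrases nearly tied, and that the system has some estimate of how likely each phrase is after the words that came before. Real systems use far larger models, so this only shows the principle.

```python
import math

# Hypothetical: how well each phrase matches the sound alone (nearly a tie).
acoustic_p = {"recognize speech": 0.48, "wreck a nice beach": 0.52}

# Hypothetical: how likely each phrase is after the preceding words.
context_p = {
    "computers can": {"recognize speech": 0.02, "wreck a nice beach": 0.00001},
    "storms can": {"recognize speech": 0.0001, "wreck a nice beach": 0.005},
}

def pick_phrase(previous_words):
    """Combine sound and context scores (as log probabilities) and pick the best."""
    scores = {
        phrase: math.log(acoustic_p[phrase]) + math.log(context_p[previous_words][phrase])
        for phrase in acoustic_p
    }
    return max(scores, key=scores.get)

print(pick_phrase("computers can"))  # recognize speech
print(pick_phrase("storms can"))     # wreck a nice beach
```

Adding log probabilities is the same as multiplying the probabilities themselves, but it avoids very small numbers when many words are chained together.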