Interactive Introduction to Speech Processing
Sam Ezeh
What is sound?

Sound is a vibration that travels through the air as waves of pressure caused by some vibrating object.

These waves create pockets of high and low pressure that move through the air in every direction.

When these pressure waves reach our ears, they cause our eardrums to vibrate in response.

Our brains then interpret these vibrations as sound, allowing us to hear.

Below is a recording of a man saying "hello" and a graph of how the air pressure changes over time.

Tap the beginning of the graph to listen to the man speak.


How do we perceive sound?

The word "frequency" refers to how often something repeats in a given amount of time.

Simply put, the faster a sound wave repeats, the higher its pitch sounds to us.

Tap the button below to hear something that vibrates about 260 times a second, that is, at a frequency of 260 hertz (Hz).
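To make the idea of "260 times a second" concrete, here is a minimal Python sketch that builds one second of such a tone and counts how often it repeats. It assumes the tone is a pure sine wave and uses a sample rate of 44,100 samples per second; both the sample rate and the function names are choices made for this illustration, not taken from the page's own code.

```python
import math

SAMPLE_RATE = 44_100  # samples per second (an assumed, common audio rate)
FREQUENCY = 260.0     # vibrations per second, in hertz


def sine_wave(frequency, duration, sample_rate=SAMPLE_RATE):
    """Return pressure samples of a pure tone with amplitude 1."""
    n_samples = int(duration * sample_rate)
    return [math.sin(2 * math.pi * frequency * n / sample_rate)
            for n in range(n_samples)]


def count_cycles(samples):
    """Count upward zero crossings, one for each repetition of the wave."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a <= 0 < b)


one_second = sine_wave(FREQUENCY, duration=1.0)
print(count_cycles(one_second))  # roughly 260 repetitions in one second
```

Doubling FREQUENCY doubles the number of repetitions in the same second, which we hear as a higher sound.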



Once sound enters our ears, our eardrums pass the vibration on to another part of the ear called the "cochlea".

The cochlea is lined with tiny hairs that each start to wiggle in response to different frequencies, and our brain then decodes these wiggles into sound.

The sound from earlier is called a sine wave and we can combine different sine waves to make new sounds.

If we zoom in on the sound waves then we can see a clear repeating pattern.

Tap on the different frequencies to add and remove them from the sound wave.
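Combining sine waves just means adding their pressure values together, sample by sample. The sketch below shows this in Python; the frequencies (100, 200 and 300 Hz) and the sample rate are hypothetical values picked for the example. Because every frequency is a multiple of 100 Hz, the combined wave repeats every 1/100 of a second, which is the repeating pattern we see when zooming in.

```python
import math

SAMPLE_RATE = 8_000  # samples per second (assumed for this sketch)


def mix_sines(frequencies, duration, sample_rate=SAMPLE_RATE):
    """Add together unit-amplitude sine waves, one per frequency in hertz."""
    n_samples = int(duration * sample_rate)
    return [
        sum(math.sin(2 * math.pi * f * n / sample_rate) for f in frequencies)
        for n in range(n_samples)
    ]


# Hypothetical choice of frequencies: 100 Hz plus two of its multiples.
wave = mix_sines([100, 200, 300], duration=0.05)

# The mix repeats every 1/100 s, which is 80 samples at 8000 samples per second.
period = SAMPLE_RATE // 100
repeats = all(abs(wave[n] - wave[n + period]) < 1e-9
              for n in range(len(wave) - period))
print(repeats)
```

Removing a frequency from the list changes the shape of each repetition, but not the fact that the wave repeats.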


How do we make sounds with our mouths?

We make sounds through our mouths by pushing air from our lungs through our vocal cords.

Our vocal cords vibrate to control pitch. Tightening our vocal cords causes them to vibrate faster and produce a higher sound.

We can make different sounds by changing the shape of our mouths and moving our tongue to alter the flow of air.

For example, we can make the sounds "ee" and "uhh" by changing the position of our tongue in our mouth.

Tap in different locations in the mouth to simulate moving the tongue to emulate different vowel sounds with the synthesiser.

The source-filter model

The source-filter model approximates the sound-making process in our mouth by separating out the sound we make into two parts: the "source" and the "filter".

The source corresponds to the air flow from the lungs supplying the sound wave with a pulse of sound energy.

The filter corresponds to our vocal tract modifying the sound wave to produce the different noises we hear when we speak.

Mathematically, the filter is applied to the source by an operation called a "convolution".
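As a rough illustration of what a convolution does, the Python sketch below treats the source as a train of pulses and the filter as a short decaying response. Every pulse in the source launches its own scaled copy of the filter's response, and the copies are added up. The pulse spacing and the filter values here are made up for the example and are not meant to model a real voice.

```python
def convolve(source, impulse_response):
    """Discrete convolution: each source sample adds a scaled, shifted
    copy of the filter's response to the output."""
    output = [0.0] * (len(source) + len(impulse_response) - 1)
    for i, s in enumerate(source):
        for j, h in enumerate(impulse_response):
            output[i + j] += s * h
    return output


# Hypothetical source: one pulse of energy every four samples.
source = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]

# Hypothetical filter: a response that halves at every step.
impulse_response = [1.0, 0.5, 0.25]

speech = convolve(source, impulse_response)
print(speech)
# [1.0, 0.5, 0.25, 0.0, 1.0, 0.5, 0.25, 0.0, 0.0, 0.0]
```

Changing the spacing of the pulses changes how often the pattern repeats, while changing the filter changes the shape of each repetition.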

Tap to select different sources and filters to create different sound waves.

DECtalk is an example of a source-filter synthesiser.

Tap the play button to listen to the DECtalk synthesiser sing Daisy Bell.



Linear Predictive Coding

Linear predictive coding (LPC) is a method for separating the effects of the source and the filter in a sound wave.

It works by modelling the filter as a linear autoregressive function. This means we assume that we can work out the next value of the speech wave by multiplying each of the previous few pressure values by a fixed weight and adding up the results.

The animation below shows that we only need to look at two previous values to work out the next value of a sine wave. φ₁ is the weight we multiply the previous value by and φ₂ is the weight we multiply the value before that by.
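For a sine wave the two weights can be written down exactly, using the identity sin(ωn) = 2cos(ω)·sin(ω(n−1)) − sin(ω(n−2)). So φ₁ = 2cos(ω) and φ₂ = −1, where ω is the frequency in radians per sample. The Python sketch below builds a sine wave using nothing but this rule and compares it with one computed directly; the frequency, sample rate and function name are assumptions made for the example.

```python
import math


def predict_sine(frequency, sample_rate, n_samples):
    """Build a sine wave where every new value is
    phi_1 * (previous value) + phi_2 * (value before that)."""
    omega = 2 * math.pi * frequency / sample_rate  # radians per sample
    phi_1 = 2 * math.cos(omega)
    phi_2 = -1.0
    samples = [0.0, math.sin(omega)]  # the first two values start things off
    while len(samples) < n_samples:
        samples.append(phi_1 * samples[-1] + phi_2 * samples[-2])
    return samples[:n_samples]


# Assumed values for the example: a 260 Hz tone at 8000 samples per second.
predicted = predict_sine(260, 8_000, 200)
direct = [math.sin(2 * math.pi * 260 * n / 8_000) for n in range(200)]

max_error = max(abs(p - d) for p, d in zip(predicted, direct))
print(max_error < 1e-9)
```

The two waves agree up to tiny rounding errors, even though the predicted one never calls the sine function after its first two values.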

Modify the model parameters below to create different sound waves.
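Linear predictive coding runs this in reverse: given a sound wave, it looks for the weights that predict it best. The Python sketch below is one simple way to do that for two weights, using an ordinary least-squares fit. This is a stand-in chosen to keep the example short and is not necessarily how any particular LPC implementation works; the function name and test tone are hypothetical.

```python
import math


def fit_two_weights(samples):
    """Least-squares fit of x[n] ~ phi_1 * x[n-1] + phi_2 * x[n-2]."""
    s11 = s12 = s22 = r1 = r2 = 0.0
    for n in range(2, len(samples)):
        x0, x1, x2 = samples[n], samples[n - 1], samples[n - 2]
        s11 += x1 * x1
        s12 += x1 * x2
        s22 += x2 * x2
        r1 += x0 * x1
        r2 += x0 * x2
    # Solve the 2x2 normal equations with Cramer's rule.
    det = s11 * s22 - s12 * s12
    phi_1 = (r1 * s22 - r2 * s12) / det
    phi_2 = (r2 * s11 - r1 * s12) / det
    return phi_1, phi_2


# Assumed test signal: a 260 Hz sine wave at 8000 samples per second.
omega = 2 * math.pi * 260 / 8_000
tone = [math.sin(omega * n) for n in range(400)]

phi_1, phi_2 = fit_two_weights(tone)
print(phi_1, 2 * math.cos(omega))  # the two numbers should nearly match
print(phi_2)                       # should be very close to -1
```

Real speech needs more than two weights, but the idea is the same: choose the weights that make the prediction error as small as possible.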

Formants

Formants are the frequencies that our vocal tract resonates at and boosts, and they give vowels their unique sound.

Speak into your microphone to see the sound wave and the formants change as you change the shape of your mouth.

Try saying "ee" like in "please" then pulling your tongue back to say "oo" to see how the formants change.

If you're on mobile you might have to speak directly into your phone's microphone.
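One common way to turn linear prediction weights into a formant estimate is to find the roots (the "poles") of the prediction polynomial: the angle of a pole gives a resonant frequency and its distance from the unit circle gives the width of the resonance. The Python sketch below assumes a model with only two weights, so the quadratic formula is enough. Real formant estimators use more weights and a general polynomial root-finder, and the function name and numbers here are hypothetical.

```python
import cmath
import math


def resonance_from_weights(phi_1, phi_2, sample_rate):
    """Return (frequency, bandwidth) in hertz for a two-weight predictor
    x[n] = phi_1 * x[n-1] + phi_2 * x[n-2]."""
    # The poles are the roots of z**2 - phi_1 * z - phi_2 = 0.
    pole = (phi_1 + cmath.sqrt(phi_1 ** 2 + 4 * phi_2)) / 2
    frequency = abs(cmath.phase(pole)) * sample_rate / (2 * math.pi)
    bandwidth = -math.log(abs(pole)) * sample_rate / math.pi
    return frequency, bandwidth


# Assumed example: weights for a resonance at 700 Hz whose pole sits at
# radius 0.95, with 8000 samples per second.
sample_rate = 8_000
radius = 0.95
omega = 2 * math.pi * 700 / sample_rate
phi_1 = 2 * radius * math.cos(omega)
phi_2 = -radius ** 2

frequency, bandwidth = resonance_from_weights(phi_1, phi_2, sample_rate)
print(round(frequency, 3), round(bandwidth, 1))
```

A pole closer to the unit circle gives a narrower, more sharply tuned resonance.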

The end

You can find my source code for this page on GitHub. I also wrote a Flappy Bird-style game that uses formants, where you control the bird by making sounds and saying "ee" and "oo". You can find it here.

I wrote the formant estimation code in Zig, where I implemented Burg's algorithm, and you can find my source code here. I use zpoly for polynomial root-finding.

I also wrote a Python program to create animated vowel formant plots and you can find that here.

I used Christian d'Heureuse's implementation of Dennis Klatt's Klatt Synthesiser to synthesise speech in the vowel sounds demonstration.
