Speech acoustics and field recording

Definition of speech

Let us define speech as a product of human communication that can be described in terms of acoustics. This is a rather narrow definition, but it is adequate in the context of this article.

Because recording speech these days almost always involves digitization, I am going to assume that the product of your capture will be a digital audio file, such as a Microsoft WAVE (.wav) file. We need to make sure that the audio file preserves as much information about the speech signal as possible. This is true whether we are recording speech with a microphone or digitizing an existing analog recording. In terms of acoustics, the digital audio file must be able to reproduce the frequency range and dynamic range of speech as faithfully as possible. Table 1 shows the effects of sample rate and bit depth on the frequency response and dynamic range of digital speech files.

[Audio examples: 16-bit, 48,000 Hz and 8-bit, 8,000 Hz]

Table 1. Effects of sample rate and bit-depth on the frequency response and dynamic range of digital speech files

Frequency

Sampling

Human speech spans a wide range of frequencies. Adult males typically produce speech sounds at lower frequencies than adult females and children. The entire range spans from approximately 50 to 20,000 Hz. Only the highest-frequency fricative consonants, such as /s/ and /sh/, can reach up to 20,000 Hz, while most speech-relevant information is contained considerably lower on the frequency scale.

Figure 1 shows a spectrogram of an adult female and an adult male talker saying the phrase "to some smooth jazz" (click buttons to hear audio). Note that the female talker's fricative /s/ reaches almost all the way up to 20,000 Hz. By contrast, the male pronunciation shows the maximum frequencies reaching only approximately 15,000 Hz.

Figure 1. Spectrograms of female and male talkers showing a wide range of frequencies

The obvious conclusion is that we need to have all of the frequencies up to 20,000 Hz adequately represented in a digital audio file. Sample rate is the parameter that determines the frequency response of an audio file: by the Nyquist sampling theorem, the highest frequency that can be represented is just under half the sample rate (the Nyquist frequency). Figure 2 shows the effect of sample rate reduction on the range of speech frequencies (in the phrase "smooth jazz") reproduced in an audio file. As you listen to each file, please note the decrease in perceived quality. This is largely due to the reduction in frequency response. You can read more about this in my post on analog-to-digital conversion.

Figure 2. The effect of sample rate reduction on the range of speech frequencies reproduced in an audio file
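
The relationship between sample rate and the highest representable frequency can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a measurement of any real recorder: the function names and the list of sample rates are my own choices for the example, and the 20,000 Hz ceiling is the figure quoted earlier in this article.

```python
# Minimal sketch of the Nyquist limit. All names here are illustrative.
SPEECH_MAX_HZ = 20_000  # upper bound of speech energy cited in this article


def nyquist_limit_hz(sample_rate_hz: float) -> float:
    """Return half the sample rate, the upper bound on representable frequency."""
    return sample_rate_hz / 2.0


def covers_speech(sample_rate_hz: float, max_hz: float = SPEECH_MAX_HZ) -> bool:
    """True if the Nyquist limit lies strictly above the highest speech frequency."""
    return nyquist_limit_hz(sample_rate_hz) > max_hz


if __name__ == "__main__":
    # Example sample rates (assumed for illustration, not taken from Figure 2).
    for rate in (8_000, 16_000, 22_050, 44_100, 48_000):
        print(f"{rate:>6} Hz -> limit {nyquist_limit_hz(rate):>8.1f} Hz, "
              f"covers speech: {covers_speech(rate)}")
```

In practice the usable bandwidth is somewhat narrower than this limit, because the anti-aliasing filter in the converter needs room to roll off below half the sample rate.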

Keeping the frequency range broad and flat

The sampling theorem assumes no hardware limitations. In practice, however, the electrical and acoustical performance of real hardware limits what is actually captured. The low end of the frequency scale is particularly problematic for modern hardware, especially microphones. You should try to find a microphone that has a flat frequency response throughout its range. You can read more about this in the section on microphones.

Dynamic range

Speech contains sounds spanning a wide range of amplitudes. Some sounds are naturally softer (carry less energy) than others. The typical range between the softest and the loudest sound in speech (dynamic range) is about 40 dB. In A/D conversion, the available dynamic range is determined by quantization, that is, by the bit depth. Each bit contributes roughly 6 dB, so 16-bit quantization (e.g., the audio CD standard) offers about 96 dB, which is more than enough to capture the entire dynamic range of speech, whereas 8-bit quantization offers only about 48 dB. Figure 3 shows the waveform (a plot of amplitude over time) of the phrase "Cathy just wanted to go to the pet store." Note how the amplitude varies from phoneme to phoneme, as indicated by the little red ball (click Play button to hear audio).

Figure 3. An illustration of the changing amplitude of speech sounds over time
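
To make the link between bit depth and dynamic range concrete, here is a minimal Python sketch of the standard textbook estimate: the ratio between full scale and a single quantization step, expressed in decibels. The function names are hypothetical, and the result is a theoretical ceiling that ignores dither and the noise of real converters and preamplifiers, so actual hardware will deliver less.

```python
import math

SPEECH_DYNAMIC_RANGE_DB = 40.0  # typical figure quoted in this article


def quantization_levels(bit_depth: int) -> int:
    """Number of distinct amplitude values available at a given bit depth."""
    return 2 ** bit_depth


def theoretical_dynamic_range_db(bit_depth: int) -> float:
    """Full scale relative to one quantization step: 20 * log10(2 ** bits)."""
    return 20.0 * math.log10(quantization_levels(bit_depth))


if __name__ == "__main__":
    for bits in (8, 16, 24):
        dr = theoretical_dynamic_range_db(bits)
        headroom = dr - SPEECH_DYNAMIC_RANGE_DB
        print(f"{bits:>2}-bit: {quantization_levels(bits):>10,} levels, "
              f"about {dr:.1f} dB ({headroom:.1f} dB beyond the range of speech)")
```

The margin beyond 40 dB matters in the field: it is what lets you set a conservative recording level to avoid clipping and still keep the softest speech sounds above the quantization floor.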

The limitations of digitization are only one aspect of controlling dynamic range. The best way to reproduce the subtle changes in amplitude is to place the microphone close to the talker's lips and try to maximize the signal-to-noise ratio.
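
One rough way to check signal-to-noise ratio in a recording is to compare the RMS level of a stretch of speech with the RMS level of a stretch where nobody is talking. The sketch below assumes you have already extracted both stretches as lists of sample values; the function names and the toy sample values are hypothetical. Because the speech stretch also contains the background noise, this is only an approximation.

```python
import math
from typing import Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude of a sequence of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """Approximate signal-to-noise ratio in dB from two segments of one recording."""
    noise_rms = rms(noise)
    if noise_rms == 0.0:
        raise ValueError("noise segment is digital silence; SNR is undefined")
    return 20.0 * math.log10(rms(speech) / noise_rms)


if __name__ == "__main__":
    # Toy values for illustration only: speech 100 times stronger than the noise.
    speech_segment = [0.5, -0.5, 0.5, -0.5]
    noise_segment = [0.005, -0.005, 0.005, -0.005]
    print(f"Estimated SNR: {snr_db(speech_segment, noise_segment):.1f} dB")
```

Comparing this estimate across test takes, for example with the microphone at different distances from the talker, gives a quick indication of which placement keeps the speech furthest above the background noise.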

Conclusion

As we saw in the examples above, the digital audio file must have a sample rate of at least 48,000 Hz and a bit depth of at least 16 bits in order to capture the entire frequency range and dynamic range of speech. The good news is that most digital recorders these days meet these specifications. You should, however, bear in mind that specifications alone do not guarantee good recordings. I encourage you to browse through the articles on this site, as they might be helpful in learning how to capture high-quality speech signals by means of portable field recording equipment and technique.