In order to perform acoustic analysis on recorded speech data, the audio signal has to be stored in a digital PCM audio file format, such as WAVE. Analog recordings have to be digitized, and digital recordings need to be transferred to a personal computer via a digital audio transfer interface (S/PDIF, AES/EBU, USB, FireWire, etc.). This is an important, yet often underestimated, stage in preparing audio data for analysis. Many of us take it for granted that analysis is done on WAVE files and give little thought to the audio file format and the digitization process.
A/D conversion fundamentals
Goals of A/D conversion
The main goal of A/D conversion (digitization) is to obtain the best possible digital representation of the original analog waveform. Without going into too much technical detail of the digitization process, one should choose a sample rate that will capture a broad range of frequencies and a bit depth that will allow a wide dynamic range and a negligible amount of quantization noise. These goals can be achieved by means of a premium-quality, stand-alone A/D converter operating at a sample rate of at least 48,000 Hz and a resolution of 24 bits. It is absolutely crucial not to use a consumer-grade PCI multimedia sound card: such cards are built from inferior-quality electronic components and, more importantly, allow electrostatic noise and distortion to leak into the captured acoustic signal (Figure 1).

Figure 1. Spectrum of typical electrostatic noise generated by computer circuitry.
| | AES/EBU | S/PDIF (IEC-958) |
|---|---|---|
| Cabling | 110 ohm shielded TP | 75 ohm coaxial or fiber |
| Connector | 3-pin XLR | RCA (or BNC) |
| Signal level | 3..10 V | 0.5..1 V |
| Modulation | biphase-mark-code | biphase-mark-code |
| Max. Resolution | 24 bits | 24 bits |
Table 1. Audio data transfer standards
Sample rate
It is not obvious that an exact reconstruction of an analog signal should be possible, since a complete continuous signal is converted to a finite set of numerical values. The solution to this problem lies in the sampling theorem. In short, the sampling theorem states that if a band-limited signal is sampled at a rate greater than twice the highest frequency in the signal (the so-called Nyquist rate), no information is lost and the original signal can be unambiguously reconstructed from the samples. Acoustic signals that humans can hear lie in a limited range of about 20 to 20,000 Hz. Thus, in order to reconstruct the audible content of the original analog signal exactly, one should use a sample rate of more than 40,000 Hz. Therefore, if a sample rate of 48,000 Hz is used (e.g., the DAT standard), the resulting digital audio file will correctly represent frequencies in the entire human hearing range.
There has been a debate in the audiophile and archival world whether anything is gained by using sample rates higher than 48,000 Hz, particularly for speech recordings. There seems to be a general consensus that even though humans might not hear frequency components above 20,000 Hz, the original analog signal tends to be represented more accurately and with less quantization noise (digitization artifacts) at the sample rate of 96,000 Hz, which has now become the standard in the professional audio and motion picture industries.
Figure 2 illustrates the way in which digital sampling attempts to reproduce sound. You can imagine any complex sound (e.g., speech) as consisting of individual sine waves at various frequencies. The higher the sample rate, the higher the frequencies that can be reproduced. The illustration shows a 1,000 Hz sine wave (in gray). The top panel shows the effect of using a sample rate that is too low to reproduce the sound faithfully, while the bottom panel contains a properly sampled sine wave, at a high enough sample rate.

Figure 2. Effects of sample rate on reproducing a 1,000 Hz sine wave
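The effect shown in the top panel of Figure 2 can be reproduced numerically. The following is a minimal sketch, not taken from the original figure: the 1,500 Hz "too low" sample rate and the helper names `alias_frequency` and `sample_sine` are assumptions chosen for illustration. It shows that a 1,000 Hz sine sampled at 1,500 Hz produces exactly the same sample values as an inverted 500 Hz sine, so the two cannot be told apart after sampling.

```python
import math

def alias_frequency(f_hz, fs_hz):
    """Frequency at which a tone of f_hz appears after sampling at fs_hz."""
    f = f_hz % fs_hz
    return min(f, fs_hz - f)

def sample_sine(f_hz, fs_hz, n_samples):
    """Sample values of a unit-amplitude sine of f_hz at sample rate fs_hz."""
    return [math.sin(2 * math.pi * f_hz * n / fs_hz) for n in range(n_samples)]

# Assumed rates: 1,500 Hz is too low for a 1,000 Hz tone, 48,000 Hz is ample.
print(alias_frequency(1000, 1500))   # tone is folded down below 750 Hz
print(alias_frequency(1000, 48000))  # tone is preserved

# Samples of the undersampled 1,000 Hz tone versus a 500 Hz tone.
low = sample_sine(1000, 1500, 12)
ghost = sample_sine(500, 1500, 12)

# The two sample sets are equal up to a sign flip (phase inversion).
max_diff = max(abs(a + b) for a, b in zip(low, ghost))
print(max_diff)
```

At 48,000 Hz the same tone is well below half the sample rate, so no folding occurs.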
Bit depth
For the sampling theorem to apply exactly, each sampled amplitude value would have to equal the true signal amplitude at the sampling instant. ADCs do not achieve this level of perfection, as only a fixed number of bits is used to represent a sample value. The difference between the analog signal and the closest digital sample value is known as quantization error (or quantization noise). This inherent limitation of the digitization process is often expressed as a signal-to-noise ratio (SNR), the ratio of the average power in the analog signal to the average power in the quantization noise. In terms of the dB scale, the quantization SNR for uniformly spaced sample levels increases by about 6 dB for each bit used in the sample. Thus, for 16-bit encoding about 91 dB is possible. This is 20 to 30 dB better than the SNR levels achieved by professional analog audio recorders. Increasing the number of bits used to encode a continuously changing signal to 24 further increases the SNR and improves the accuracy of the digital representation. Figure 3 illustrates the benefits of increasing bit depth. At a sample rate of 15,000 Hz, the 4-bit system (right, 16 levels) represents the amplitude values of the sine wave more precisely than the 3-bit system (8 levels), thus producing a more accurate digital representation of the analog original.

Figure 3. Benefits of high bit depth in digital recording
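The "about 6 dB per bit" behaviour can be checked empirically. The sketch below is an illustration under stated assumptions rather than a description of any particular converter: it uses an ideal uniform mid-rise quantizer, a full-scale 997 Hz test sine, and a 48,000 Hz sample rate, all chosen here for convenience. The absolute SNR values it prints depend on the level of the test signal, so they should be read mainly for the difference between bit depths.

```python
import math

def quantize(x, bits):
    """Ideal uniform mid-rise quantizer for x in [-1, 1] with 2**bits levels."""
    levels = 2 ** bits
    step = 2.0 / levels
    index = math.floor(x / step)
    # Clamp to the available codes so x == 1.0 does not overflow.
    index = max(-levels // 2, min(levels // 2 - 1, index))
    return (index + 0.5) * step

def quantization_snr_db(bits, freq_hz=997.0, fs_hz=48000, n=48000):
    """Measured SNR (dB) of a full-scale sine after quantization to `bits`."""
    signal_power = 0.0
    noise_power = 0.0
    for i in range(n):
        x = math.sin(2 * math.pi * freq_hz * i / fs_hz)
        err = quantize(x, bits) - x
        signal_power += x * x
        noise_power += err * err
    return 10 * math.log10(signal_power / noise_power)

for bits in (3, 4, 16, 24):
    print(bits, "bits:", round(quantization_snr_db(bits), 1), "dB")
```

Comparing the 3-bit and 4-bit lines corresponds to the two panels of Figure 3; the 16-bit and 24-bit lines correspond to the bit depths discussed in the text.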
A/D conversion workflow
The analog playback device (such as a TASCAM 122 mkIII) should be connected to the A/D converter. One should make sure that the output levels of the tape deck match the input levels of the A/D converter. It is recommended to use a balanced XLR line-level interface (+24 dBu at min. gain, +7 dBu at max. gain, 65k ohm impedance). If the tape deck does not have this kind of output interface, a signal level transformer (such as the Ebtech Line shifter) and a pre-amplifier should be used.
The A/D converter needs to be connected to the computer either through a PCI digital audio I/O card (e.g., a Midiman Delta DiO 2496 via the S/PDIF interface) or, as is becoming common, directly via the USB or FireWire bus. The digital I/O card should be selected as the recording interface in the audio recording software (such as SONY SoundForge on a PC or BIAS Peak VST on a Mac). The digital audio signal should be captured with this software and saved as either a WAVE (PC) or an AIFF (Mac) file at the sample rate and bit depth that the A/D converter was set to. It is also possible to capture the digital audio signal directly into acoustic analysis software, such as CSL or Praat, though this is not recommended because specialized recording and processing software offers considerably more control over the incoming signal. It should also be mentioned that the USB Pre may be used as a high-quality, stand-alone A/D converter.
In this case the digital audio signal is transferred to a PC via the USB interface, which eliminates the need to install a separate PCI digital I/O card and makes it possible to capture digital audio on a laptop. In addition, the USB Pre has a pair of tape-level inputs, to which a cassette deck can be connected directly.
| Input | Sensitivity (typical, for 0 dB FS), min. gain | Sensitivity (typical, for 0 dB FS), max. gain | Clip Level (1% THD) | Impedance (actual) |
|---|---|---|---|---|
| MIC | -10 dBu | -53 dBu | -12 dBu (195 mV rms) | 2k ohm active-balanced |
| LINE | +24 dBu | +7 dBu | +24 dBu (12.3 V rms) | 65k ohm active-balanced |
| DI | +8 dBu | -9 dBu | +9 dBu (2.2 V rms) | 10k ohm unbalanced |
| TAPE | +8 dBu | -9 dBu | +9 dBu (2.2 V rms) | 110k ohm unbalanced |
Table 2. Summary of typical signal level types of the Sound Devices USBPre unit
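When matching the output level of a playback device to the input of the converter, it can help to convert between dBu and volts. The sketch below assumes the usual definition of 0 dBu as about 0.775 V rms; the function names are hypothetical and chosen for illustration. It reproduces the voltage figures given in parentheses in the Clip Level column of Table 2.

```python
import math

# Reference voltage for 0 dBu: sqrt(0.6) V rms, about 0.775 V.
DBU_REFERENCE_VRMS = math.sqrt(0.6)

def dbu_to_vrms(level_dbu):
    """Convert a level in dBu to volts rms."""
    return DBU_REFERENCE_VRMS * 10 ** (level_dbu / 20)

def vrms_to_dbu(volts_rms):
    """Convert volts rms to a level in dBu."""
    return 20 * math.log10(volts_rms / DBU_REFERENCE_VRMS)

# Clip levels listed in Table 2 for the MIC, LINE, and DI/TAPE inputs.
for name, dbu in (("MIC", -12), ("LINE", 24), ("DI/TAPE", 9)):
    print(name, dbu, "dBu =", round(dbu_to_vrms(dbu), 3), "V rms")
```

The same conversion can be used to check whether the nominal output level of a tape deck falls inside the sensitivity range of the chosen input.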

Figure 4. A schematic overview of a minimalist A/D conversion workflow
Improving A/D conversion
There are a few simple, yet important ways in which the quality of the digital representation of an analog waveform can be improved.
1. Use a sample rate of 96,000 Hz.
In principle, if frequency response were the only issue, there would be no advantage in moving to formats with higher sampling rates. However, the evidence is otherwise. Direct psychoacoustic comparisons of the same source material, recorded and reproduced at 44.1 kS/s, 96 kS/s, and 192 kS/s, show that there is an advantage in going to the higher rates - it sounds better! The most common comment is that such recordings have better spatial resolution. What mechanism can be at work? It seems unlikely that we have all suddenly developed ultrasonic hearing capabilities.
Energy dispersion and anti-alias filtering.
Sharp filtering inevitably causes a ringing transient response - the effect is referred to as the Gibbs phenomenon. The ringing contains energy, and although the energy in the input transient is concentrated at one time, the energy from the anti-alias filter is spread over a much longer time - the audio picture is "defocused." We might argue that the energy is ultrasonic, but this is certainly not the case at 44.1 or 48 kS/s - our bandwidth constraints mean that to get good anti-aliasing, we must filter as fast as we can, and only pass the audio bandwidth. A high sample rate gives us the extra bandwidth to contain the ringing (energy defocusing).
The audio DVD standard.
In addition to improved anti-aliasing and energy defocusing handling, the 96,000 Hz sample rate is part of the new, emerging digital audio standard, used in present-day recording studios, consumer PCs (e.g., the new Sound Blaster Audigy cards), and the audio DVD format.
2. Use 24-bit quantization
For the sampling theorem to apply exactly, each sampled amplitude value must exactly equal the true signal amplitude at the sampling instant. Real ADCs do not achieve this level of perfection. Normally, a fixed number of bits (binary digits) is used to represent a sample value. Therefore, the infinite set of values possible in the analog signal is not available for the samples. In fact, if there are R bits in each sample, exactly 2^R sample values are possible. For high-fidelity applications, such as archival copies of analog recordings, 24 bits per sample, or a so-called 24-bit resolution, should be used. The difference between the analog signal and the closest sample value is known as quantization error. Since it can be regarded as noise added to an otherwise perfect sample value, it is also often called quantization noise. The effect of quantization noise is to limit the precision with which a real sampled signal can represent the original analog signal. This inherent limitation of the ADC process is often expressed as a signal-to-noise ratio (SNR), the ratio of the average power in the analog signal to the average power in the quantization noise. In terms of the dB scale, the quantization SNR for uniformly spaced sample levels increases by about 6 dB for each bit used in the sample. For ADCs using R bits per sample and uniformly spaced quantization levels, SNR = 6R - 5 dB (approximately). Thus, for 16-bit encoding about 91 dB is possible. This is 20 to 30 dB better than the 60 dB to 70 dB that can be achieved in analog audio cassette players using special noise reduction techniques. A 24-bit encoding yields a theoretical SNR of about 139 dB; in practice, the achievable figure is limited by the electronics of the hardware itself.
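The two rules of thumb in this paragraph, the number of sample values available with R bits and the approximate SNR of 6R - 5 dB, are easy to tabulate. The helper names below are hypothetical, and the SNR function simply encodes the approximation quoted above rather than a measured property of any converter.

```python
def quantization_levels(bits):
    """Number of distinct sample values available with `bits` bits."""
    return 2 ** bits

def approx_snr_db(bits):
    """Rule-of-thumb quantization SNR in dB (SNR = 6R - 5, approximately)."""
    return 6 * bits - 5

for bits in (8, 16, 24):
    print(bits, "bits:", quantization_levels(bits), "levels,",
          approx_snr_db(bits), "dB")
```

Each additional bit doubles the number of levels and adds about 6 dB to the estimate.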
3. Use appropriate anti-aliasing filters
Simply put, aliasing is a kind of sampling confusion that can occur during the digitization process. It is a direct consequence of violating the sampling theorem: the highest frequency in the input signal must not exceed the Nyquist frequency (half the sample rate). When the input contains frequencies above Nyquist, the sampler continues to produce samples at its fixed rate, but these samples create false information in the form of alias frequencies. In practice, aliasing can and should be prevented, and the solution is rather straightforward: the input signal must be band-limited with a low-pass (anti-aliasing) filter that provides significant attenuation at the Nyquist frequency. The "archetypal" anti-aliasing filter has "brick-wall" characteristics, with a very steep slope and near-instantaneous attenuation. Such a filter produces unwanted ringing-type effects and should be avoided. In practice, the system should use an oversampling A/D converter (see below) with a mild low-pass filter, a high initial sampling frequency, and decimation processing to obtain the desired output sampling frequency.
4. Dither
Dither is a small amount of noise added to the audio signal before sampling. This causes the audio signal to shift with respect to the quantization levels. The quantization error is thus decorrelated from the signal, and its effects become negligible. Dither does not prevent the quantization error; instead, it allows the system to encode amplitudes smaller than the least significant bit.
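The claim that dither lets a system encode amplitudes smaller than the least significant bit (LSB) can be illustrated with a constant input of 0.3 LSB. This is a minimal sketch under assumptions not stated in the text: it uses triangular (TPDF) dither spanning plus or minus 1 LSB, a rounding quantizer, and a fixed random seed, and the function names are hypothetical.

```python
import math
import random

def quantize(x):
    """Round to the nearest integer quantization level (1 unit = 1 LSB)."""
    return math.floor(x + 0.5)

def mean_quantized(level, n, dither, rng):
    """Average of n quantized samples of a constant input `level`."""
    total = 0
    for _ in range(n):
        if dither:
            # TPDF dither: sum of two uniform values, spanning +/-1 LSB.
            noise = (rng.random() - 0.5) + (rng.random() - 0.5)
        else:
            noise = 0.0
        total += quantize(level + noise)
    return total / n

rng = random.Random(0)  # fixed seed so the run is repeatable
level = 0.3             # input amplitude, well below 1 LSB

plain = mean_quantized(level, 20000, False, rng)
dithered = mean_quantized(level, 20000, True, rng)
print("without dither:", plain)
print("with dither:   ", dithered)
```

Without dither every sample rounds to zero and the input is lost; with dither the individual samples are noisy, but their average recovers a value close to the original 0.3 LSB.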
5. Oversampling
Oversampling is another technique aimed at improving the results of the digitization process. As noted above, a brick-wall filter may produce unwanted acoustic effects. In oversampling A/D conversion, the input signal is first passed through a mild low-pass filter, which provides sufficient attenuation at high frequencies. To extend the Nyquist frequency, the signal is then sampled at a high frequency and quantized. Afterwards, a digital low-pass filter (e.g., a linear-phase FIR filter) prevents aliasing when the signal is downsampled to the desired output sampling frequency (e.g., 44,100 Hz). In addition to eliminating the unwanted effects of a brick-wall analog filter, oversampling helps achieve increased resolution by spreading the spectrum of the quantization error far beyond the audio base-band, rendering the in-band noise relatively insignificant.
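The digital filtering and downsampling stage described above can be sketched in a few lines. All parameters here are assumptions for illustration only: a 192,000 Hz initial rate, a decimation factor of 4 (48,000 Hz output), a 101-tap Hamming-windowed sinc filter with a 20 kHz cutoff, and a test signal made of a 1 kHz tone plus an ultrasonic 60 kHz tone. A real converter would use its own filter design.

```python
import math

def lowpass_taps(num_taps, cutoff_hz, fs_hz):
    """Hamming-windowed sinc low-pass FIR taps, normalized to unity DC gain."""
    fc = cutoff_hz / fs_hz
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        k = n - mid
        if k == 0:
            ideal = 2 * fc
        else:
            ideal = math.sin(2 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]

def fir_filter(x, taps):
    """'Valid' convolution: outputs only where the filter fully overlaps x."""
    m = len(taps)
    return [sum(taps[j] * x[i + m - 1 - j] for j in range(m))
            for i in range(len(x) - m + 1)]

def tone_amplitude(x, freq_hz, fs_hz):
    """Amplitude of one frequency component of x (single-bin DFT)."""
    w = 2 * math.pi * freq_hz / fs_hz
    re = sum(v * math.cos(w * n) for n, v in enumerate(x))
    im = sum(v * math.sin(w * n) for n, v in enumerate(x))
    return 2 * math.hypot(re, im) / len(x)

FS_HIGH, FACTOR = 192_000, 4
FS_OUT = FS_HIGH // FACTOR  # 48,000 Hz output rate

# 1 kHz tone plus a 60 kHz tone that would alias to 12 kHz at 48,000 Hz.
x = [math.sin(2 * math.pi * 1000 * n / FS_HIGH)
     + 0.5 * math.sin(2 * math.pi * 60000 * n / FS_HIGH)
     for n in range(3748)]

naive = x[::FACTOR][:912]  # downsampling with no filter
filtered = fir_filter(x, lowpass_taps(101, 20000, FS_HIGH))[::FACTOR][:912]

print("12 kHz alias, no filter:  ", tone_amplitude(naive, 12000, FS_OUT))
print("12 kHz alias, with filter:", tone_amplitude(filtered, 12000, FS_OUT))
print("1 kHz tone, with filter:  ", tone_amplitude(filtered, 1000, FS_OUT))
```

Without the digital filter the ultrasonic tone folds into the audio band at nearly full strength; with it, the alias is strongly attenuated while the 1 kHz tone passes essentially unchanged.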
6. Use high-quality, no-compromise hardware and software.
This goes without saying. Get the best you can afford or use a well-thought-out compromise.
Direct-Stream Digital (DSD)
Direct quote from Wikipedia: "Direct-Stream Digital (DSD) is the trademark name used by Sony and Philips for their system of recreating audible signals which uses pulse-density modulation encoding, a technology to store audio signals on digital storage media which is used for the Super Audio CD (SACD)."
DSD is an alternative to PCM, which is what I have described on this page so far. PCM digitization is so widespread that it is unlikely to be replaced by DSD. DSD has its own fans who swear by it. The unfortunate thing is that these days most audio material is born in PCM and edited in PCM, and only then mixed down or converted to DSD, in much the same way as analog mixes would be digitized onto DAT tape in the early days of digital recording. From my experience in the audio archival community a few years back, there was some interest in DSD as an alternative digitization and storage medium. I must say that I somewhat agree, because DSD shines when it comes to direct, stereo A/D conversions. I am not sure, though, whether it is going to garner enough support to replace PCM. Figure 5 shows a schematic comparison of the sampling methods used by PCM and DSD. I used the open-source Wikipedia figure and changed its color scheme for better legibility (original author: Paweł Zdziarski).

Figure 5. A schematic comparison of the sampling methods used by PCM and DSD (original author: Paweł Zdziarski)