
Design Feature: March 3, 1994

Ratio detection precisely characterizes signals' amplitude and frequency

Robert H McEachern,
Consultant

Straightforward models of how humans perceive sound and color lead to amazingly simple techniques that use Gaussian filters and ratio detectors to extract signals' information content.

The fast Fourier transform (FFT) is a well-known technique for estimating the frequency and amplitude of discrete tones within a signal's power spectrum. What is not so well known is that often, a simple modification of this technique can increase the accuracy of the frequency and amplitude estimates by several orders of magnitude. This modified technique, which involves a form of FM demodulation known as ratio detection, seems to be the basis for our visual and acoustic perceptions of color and pitch.


Ratio-detecting filter banks

In his classic paper on the use of windows in Fourier analysis (Ref 1), Fredric J Harris notes that Gaussian windows have the minimum possible time-bandwidth product, at least until the windows are truncated to a finite length. This property makes Gaussian windows a good choice for spectral analysis. However, Gaussian windows possess another property that suits them even better for signal analysis—you can use them to construct ideal ratio detectors for performing modulation analysis.

When you pass a sinusoidal signal of amplitude A and frequency f through a pair of Gaussian bandpass filters, each filter's output is an attenuated copy of the signal. The amplitudes a(f) and b(f) are Gaussian functions of frequency:

a(f) = A·exp(−(f − NΔf)²/(cΔf²)),
b(f) = A·exp(−(f − (N+1)Δf)²/(cΔf²)),

where c is a constant, Δf is the spacing between adjacent filter center frequencies, and N identifies the Nth filter in a set of evenly spaced filters, such as those created by a Gaussian-windowed FFT. (Recall that the Fourier transform of a Gaussian function of time is a Gaussian function of frequency, so using a Gaussian window with an FFT amounts to creating a filter bank in which each filter's response is a Gaussian function of frequency.) If no other signal is present within the passbands of a pair of these filters and the signal frequency varies slowly enough, you can use the filters' output amplitudes to accurately determine the input signal's instantaneous frequency. Taking the natural logarithm of the ratio a(f)/b(f) yields:

c·ln(a(f)/b(f)) = −(f − NΔf)²/Δf² + (f − (N+1)Δf)²/Δf².

Expanding the squared terms shows that the right side reduces to 1 − 2(f − NΔf)/Δf; solving for f yields:

f = NΔf + Δf/2 − (Δf/2)·c·(ln(a(f)) − ln(b(f))).

This equation says that a tone's frequency—if it varies slowly—is a simple function of the difference between the logarithms of the output amplitudes of successive filters (FFT bins). Fig 1 shows how these ratio detectors yield multiple, accurate estimates of signal frequencies and amplitudes. As you can see, when there is little noise or interference from other signals, you can determine the frequencies of tones to within a tiny fraction of the FFT bins' frequency spacing.

To create the figure, a digital signal generator produced 64 samples of a signal comprising the sum of three sinusoids. For the sake of comparison, the two closely spaced tones had the same amplitudes and frequencies as those Harris uses to test the responses of various window functions. Windowing these samples with a Gaussian function and using a 64-point, real FFT produced the 33-point power spectrum shown. Since the Fourier transform of a Gaussian function of time is a Gaussian function of frequency, the Gaussian-windowed FFT acts as a filter bank comprising 33 Gaussian bandpass filters. Each consecutive pair of filter power estimates forms a ratio detector. The 32 resulting ratio detectors yield the frequencies and amplitudes shown. Listing 1 contains a short computer program that performs the operations described.

Listing 1—C-language example of the ratio-detecting filter bank used to produce the results in Fig 1.

#include <math.h>                                         /* exp(), log(), sqrt(), cos() */

void GetRealPowerSpectrum(float Data[], int n);           /* external routine: real FFT, assumed to replace Data[0..n/2] with power */

void RatioDetectorDemo(void)
{
        int i,halfFFT,FFTSIZE;
        float pi,f1,f2,f3,z,F[256],A[256],Window[256],Data[256],LogData[256];
        float c=3.0;                                              /* determines window's main-lobe width */
        pi=3.1415926;
        FFTSIZE=64;
        halfFFT=FFTSIZE/2;
        for(i=0;i<halfFFT;i++)                                    /* create a Gaussian window */
        {                                                         /* transform of a Gaussian is a Gaussian */
                z=(0.5+halfFFT-1-i)*pi/FFTSIZE;
                Window[i]=exp(-c*z*z);
                Window[FFTSIZE-1-i]=Window[i];
        }
        f1=10.5;                                                 /* test tone frequencies in "FFT bins" */
        f2=16.0;
        f3=25.21;
        for(i=0;i<FFTSIZE;i++)                                   /* generate sum of 3 sinusoids */
        {
                Data[i]=  1.00*cos(2.0*pi*f1*i/FFTSIZE);                        /*    0 dB */
                Data[i]+=0.01*cos(2.0*pi*f2*i/FFTSIZE);	                        /* -40 dB */
                Data[i]+=0.10*cos(2.0*pi*f3*i/FFTSIZE);                         /* -20 dB */
                Data[i]*=Window[i];                              /* multiply the signal by the window */
        }
        GetRealPowerSpectrum(Data,FFTSIZE);                      /* perform real FFT and compute power */
        for (i=0;i<=halfFFT;i++) LogData[i]=c*0.5*log(Data[i]);  /* log(amplitude) = 0.5*log(power) */
        for (i=0;i<halfFFT;i++)
        {                                                        /* frequencies F & amplitudes A */
                F[i]=i+0.5+0.5*(LogData[i+1]-LogData[i]);
                A[i]=sqrt(c)*exp((LogData[i]+(F[i]-i)*(F[i]-i))/c)*2.0*sqrt(pi)/FFTSIZE;
        }
}

A recently developed theory of sensory perception (Ref 2) proposes this technique as the basis for our perception of color and the reason that colors mix in the way they do. By forming ratio detectors (log-amplitude-detected bandpass filters) from the retina's three types of visual-receptor cells (cone cells), the visual system effectively measures the instantaneous frequencies of light entering the eye. When two frequencies, such as those of red and green light, are present simultaneously, we perceive yellow because the ratio of the amplitudes produced when red and green light are present together is identical to that produced by yellow light alone. A modification of this mechanism also appears to be responsible for our perception of sounds (Ref 3).
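
To see numerically why a red-plus-green mixture can be indistinguishable from a single intermediate wavelength, consider the short C routine below. It is only a sketch: the two Gaussian receptor responses, their centers, and their widths are illustrative stand-ins, not the eye's measured cone sensitivities.

#include <stdio.h>
#include <math.h>

/* Idealized Gaussian receptor response; the centers and width used below are
   illustrative stand-ins, not the eye's measured cone sensitivities. */
double Response(double f, double center, double width)
{
        double z = (f - center)/width;
        return exp(-z*z);
}

int main(void)
{
        double centerA = 540.0, centerB = 570.0, width = 40.0;  /* arbitrary units */
        double fGreen = 540.0, fRed = 610.0;                    /* the two input "frequencies" */
        double a, b, mixtureRatio, f, r, err, bestF = 0.0, bestErr = 1.0e30;

        /* Ratio of the two receptor outputs for the red-plus-green mixture. */
        a = Response(fGreen, centerA, width) + Response(fRed, centerA, width);
        b = Response(fGreen, centerB, width) + Response(fRed, centerB, width);
        mixtureRatio = a/b;

        /* Find the single frequency that produces the same ratio. */
        for (f = fGreen; f <= fRed; f += 0.01)
        {
                r = Response(f, centerA, width)/Response(f, centerB, width);
                err = fabs(r - mixtureRatio);
                if (err < bestErr) { bestErr = err; bestF = f; }
        }
        printf("mixture ratio %.4f matches a single tone near %.2f\n", mixtureRatio, bestF);
        return 0;
}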


Speech analysis

If you change variables (from frequency to the log of a frequency ratio), the equations above also describe a ratio detector. However, instead of computing frequency, this detector directly computes the logarithms of instantaneous-frequency ratios. It does this from differences in the log-detected output amplitudes of a pair of constant-Q filters. These filters have bandwidths proportional to their center frequencies. The fact that you can use ratio detectors and constant-Q filters seems to explain our logarithmic perception of sound frequencies and the well-known constant-Q characteristic of the auditory system at frequencies above about 500 Hz.
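
The following sketch illustrates this log-frequency variant. It models a pair of constant-Q filters as Gaussian functions of log frequency whose center frequencies differ by a fixed ratio; the ratio, the constant c, and the test frequencies are illustrative choices, not values taken from the auditory system.

#include <stdio.h>
#include <math.h>

int main(void)
{
        double c  = 3.0;                  /* same role as c in Listing 1 */
        double r  = pow(2.0, 1.0/12.0);   /* filter centers spaced a constant ratio apart */
        double dx = log(r);               /* that spacing, expressed in log frequency */
        double fN = 440.0;                /* lower filter's center frequency, Hz (illustrative) */
        double f  = 452.3;                /* input frequency to be recovered, Hz */
        double xa, xb, lnA, lnB, lnFest;

        /* Log-detected outputs of the constant-Q filter pair (unit-amplitude input);
           each filter is modeled as a Gaussian function of log frequency. */
        xa  = (log(f) - log(fN))/dx;
        xb  = (log(f) - log(fN*r))/dx;
        lnA = -xa*xa/c;
        lnB = -xb*xb/c;

        /* Ratio detection, exactly as before, but in the log-frequency variable. */
        lnFest = log(fN) + dx/2.0 - (dx/2.0)*c*(lnA - lnB);
        printf("true f = %.3f Hz, estimated f = %.3f Hz\n", f, exp(lnFest));
        return 0;
}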

Below 500 Hz, the filter bandwidths within the auditory system appear to be approximately constant. If the constant-Q characteristic were continued to these low frequencies, the filters would become too narrow to measure the slow frequency modulations of speech harmonics. You can see this by considering Carson's rule-of-thumb relationship for the bandwidth of a frequency-modulated signal (Ref 4):

B = 2Δf + 2M,

where B is the bandwidth of the transmitted signal, Δf is the frequency deviation, and M is the bandwidth of the modulating waveform.

For frequency-modulated harmonics, such as those in speech, the first term on the right side depends on the harmonic number and the pitch; higher harmonics and higher-pitched voices possess larger frequency deviations than lower ones. Indeed, the Nth harmonic's frequency deviation is precisely equal to N times the fundamental's frequency deviation. On the other hand, the second term depends on how fast people talk (modulate the pitch) and on the frequency spacing between harmonics. This spacing determines the maximum filter bandwidth that is possible without mutual interference from adjacent harmonics. Thus, for a receiving system designed to characterize the frequency modulation of individual harmonics, the first term dictates the bandwidths at high frequencies. At low frequencies, the bandwidths asymptotically approach the lower limit given by the second term. This yields the same peculiar bandwidth-vs-frequency characteristic as that found in the cochlea of the ear.
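
The following short C program tabulates Carson's rule for the harmonics of a frequency-modulated fundamental; the pitch, deviation, and modulating-bandwidth values are only illustrative. The printout shows the behavior just described: the low harmonics' bandwidths sit near the constant floor of 2M, while the high harmonics' bandwidths grow in proportion to harmonic number, which amounts to an approximately constant Q.

#include <stdio.h>

int main(void)
{
        double pitch = 100.0;   /* fundamental frequency, Hz (illustrative) */
        double dev1  = 5.0;     /* fundamental's frequency deviation, Hz (illustrative) */
        double M     = 20.0;    /* bandwidth of the modulating (pitch) waveform, Hz (illustrative) */
        double devN, B;
        int n;

        for (n = 1; n <= 12; n++)
        {
                devN = n*dev1;            /* the Nth harmonic's deviation is N times the fundamental's */
                B    = 2.0*devN + 2.0*M;  /* Carson's rule for the Nth harmonic */
                printf("harmonic %2d at %6.0f Hz: bandwidth %5.1f Hz, Q = %4.1f\n",
                       n, n*pitch, B, n*pitch/B);
        }
        return 0;
}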

A simple modification of the constant-Q filter bank produces a ratio-detecting filter bank whose bandwidths match those within the ear.


Frequency-diversity signaling

For true harmonic components, such as those found in speech, the information content of each harmonic's instantaneous frequency is identical to the information content of the fundamental. Simultaneously transmitting the same information at several frequencies, called frequency-diversity signaling, ensures that the needed information is received, even when noise or destructive interference in the environment filters out some frequencies or obliterates others.

Even if the fundamental's instantaneous frequency varies as a function of time, f(t), the log of the Nth harmonic's instantaneous frequency is simply ln(N)+ln(f(t)). If you were to graph the log of each harmonic's instantaneous frequency vs time, the curves would all look identical except for a vertical offset. Indeed, if you were to subtract the center frequency from each harmonic (actually, if you just didn't add it in), you would obtain identical functions. Other things being equal, the output from this operation is independent of the pitch. That is one of the main reasons why you can easily recognize the same word spoken by speakers who have different-pitched voices.
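
A few lines of C make this concrete; the vibrato-like pitch track is purely illustrative. After the constant ln(N) offset is removed, every harmonic's log-frequency track is the same function of time:

#include <stdio.h>
#include <math.h>

int main(void)
{
        double pi = 3.1415926;
        double t, f;
        int i, n;

        for (i = 0; i < 5; i++)
        {
                t = i*0.01;                                     /* time, sec */
                f = 100.0*(1.0 + 0.05*sin(2.0*pi*5.0*t));       /* illustrative pitch track, Hz */
                printf("t=%.2f:", t);
                for (n = 1; n <= 4; n++)                        /* harmonics 1 through 4 */
                        printf("  ln(%d*f)-ln(%d)=%.5f", n, n, log(n*f) - log((double)n));
                printf("\n");
        }
        return 0;
}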

Since the FM information content of all of the harmonics is identical, the auditory system can estimate the signal's frequency-modulation content by taking a weighted average of the harmonics. In a weighting scheme optimized to suppress additive white Gaussian noise, the weighting of each harmonic is a function of each harmonic's S/N ratio. If you design the filter bank so that the filters to be combined are spaced at precise harmonic-frequency intervals, you can easily make the weighting self-normalizing.

In this case, all of the harmonics' amplitude ratios, a(f)/b(f), are identical. Hence, to use the ratios to compute each harmonic's frequency, you need only sum all of the amplitudes from the upper (b) or lower (a) filters of each detector's filter pair. Each term in the two sums is thus automatically proportional to the amplitude of the respective harmonic (or the power, if you use the sums of the amplitudes squared). Furthermore, because the noise power is proportional to each filter's bandwidth, weighting each term in the two sums by the respective filter's bandwidth effectively weights by the S/N ratio. Then, computing the frequency via ratio detection from the two weighted composite amplitudes yields a weighted-frequency estimate of the fundamental. Since the noise power density and the harmonics' relative power levels cancel out when you take the ratio, you can perform optimal weighting in this way without knowing the noise power density or the harmonic power.
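
The following C sketch illustrates this self-normalizing combination. It models each harmonic's filter pair as a frequency-scaled copy of the fundamental's pair, so that every pair produces the same amplitude ratio; the filter placement, harmonic amplitudes, and pitch are illustrative values, not a description of any particular auditory model or implementation.

#include <stdio.h>
#include <math.h>

#define NHARM 5

int main(void)
{
        double c = 3.0;
        double F_LO = 95.0, F_HI = 105.0;                   /* the fundamental's filter pair, Hz */
        double DF = F_HI - F_LO;
        double amp[NHARM] = {1.0, 0.6, 0.8, 0.3, 0.5};      /* illustrative harmonic amplitudes */
        double f0 = 98.7;                                   /* true fundamental to be recovered, Hz */
        double sumA = 0.0, sumB = 0.0;
        double fc, za, zb, a, b, w, estimate;
        int n;

        for (n = 1; n <= NHARM; n++)
        {
                fc = n*f0;                                  /* the Nth harmonic's frequency */
                za = (fc - n*F_LO)/(n*DF);                  /* the Nth pair is a scaled copy of the first, */
                zb = (fc - n*F_HI)/(n*DF);                  /* so every pair yields the same ratio */
                a  = amp[n-1]*exp(-za*za/c);                /* lower-filter output */
                b  = amp[n-1]*exp(-zb*zb/c);                /* upper-filter output */
                w  = n*DF;                                  /* weight each term by the pair's bandwidth */
                sumA += w*a;
                sumB += w*b;
        }
        /* Ratio detection on the composite amplitudes recovers the fundamental. */
        estimate = F_LO + DF/2.0 - (DF/2.0)*c*(log(sumA) - log(sumB));
        printf("estimated fundamental = %.3f Hz (true %.3f Hz)\n", estimate, f0);
        return 0;
}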


Picking information out of babble

You can also extend this technique to separate interleaved harmonics from several simultaneous speakers. Because the system can determine precise harmonic relationships between signal components, it can recognize and filter out interfering signals. Collapsing the multiple harmonics into a single FM-vs-time waveform and log-encoding the FM and AM from the individual harmonics also illustrates how you can obtain the large data-compression factors long known to be theoretically possible with speech. The extracted AM and FM waveforms have bandwidths that are orders of magnitude smaller than that of the speech waveform itself, so you can encode them with far fewer bits.

Changes in amplitude and pitch do not affect the log-encoded AM and FM, and thus log encoding is a significant step toward speaker-independent speech analysis. Moreover, because the modulation is the only part of a signal that carries information, characterizing just the modulation preserves all of the information.

You can find additional details on these processes in Ref 2, which describes the functioning and evolution of the brain as a sensory-signal processor, and in the US Patent disclosure for the speech-processing techniques described in Ref 5.


Author's biography

Robert H McEachern, based in Edgewater, MD, has been a consultant for more than five years. His work involves the development of signal-processing algorithms and systems. Recently, he received a patent for a speech-information extractor. He holds a BS in physics and astronomy from the University of Michigan and an MS in physics from Michigan State University. His hobbies include traditional dance, hiking, bicycling, and cross-country skiing.

References

  1. Harris, Fredric J, "On the Use of Windows for Harmonic Analysis with the Discrete Fourier Transform," Proceedings of the IEEE, Vol 66, No 1, January 1978, pg 51.
  2. McEachern, R H, Human and Machine Intelligence—An Evolutionary View, R&E Publishers, Saratoga, CA, 1993.
  3. McEachern, R H, "How the Ear Really Works," Proceedings of the IEEE International Symposium on Time-Frequency and Time-Scale Analysis, October 4-6, 1992, Victoria, BC, Canada, pg 437, IEEE, Piscataway, NJ, 1992.
  4. Schwartz, M, Information Transmission, Modulation and Noise, McGraw-Hill, New York, NY, 1980, pg 275-276.
  5. McEachern, R H, "Speech Information Extractor," US Patent 5,214,708, May 25, 1993.





Copyright © 1996 EDN Magazine. EDN is a registered trademark of Reed Properties Inc, used under license.