Was just doing some tidying up on my webserver and I came across an old demo I created quite a while ago, but never demoed publicly (as far as I can remember): a real-time monophonic pitch detector, written in Flash. It looks like this (click to run it!):
The pitch detector listens to the microphone input, displays the waveform, and shows the detected pitch as a red dot on a keyboard. It updates continuously, so as you sing, whistle, or play an instrument, you can see the red dot move around.
I’m not planning on releasing the source code to this. I earn a living writing audio code for other people, so if I give away all my secrets, I’d be putting myself out of business 🙂 [Update: I’ve released non-optimized C++ source code for a monophonic pitch detection algorithm. Consider it a public service].
But I can say a bit about how it works. Basically there are two main approaches to pitch detection: time-domain approaches, which typically use an autocorrelation; and spectral approaches, which typically use Fourier transforms and some simple pattern matching. This demo uses a time-domain approach.
Time-domain approaches are only useful for monophonic cases: that is, where there’s at most one pitched source at any given instant. The idea behind autocorrelation is basically to see how well a signal lines up with a delayed version of itself for varying amounts of delay (or “lag”). The very best alignment is for zero lag, but that’s not very interesting. What is interesting is that you also get very good alignment (auto-correlation) at a lag that corresponds to the period of the waveform; the reciprocal of that is the fundamental frequency (i.e. the pitch or f0 ‘F naught’) of the sound.
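To make the idea concrete, here's a minimal brute-force sketch in Python. It is not the code from the demo; the function names, the 8 kHz sample rate, and the 50 to 1000 Hz search range are all assumptions chosen to keep the example small. It sidesteps the uninteresting zero-lag peak by only searching lags that correspond to plausible pitches.

```python
import math

def autocorrelation(x, max_lag):
    """Brute-force autocorrelation: r[lag] = sum over n of x[n] * x[n + lag]."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def estimate_f0(x, sample_rate, f_min=50.0, f_max=1000.0):
    """Pick the lag with the strongest autocorrelation and convert it to Hz.

    f_min and f_max (assumed values) bound the search, which also keeps
    the trivial peak at lag zero out of the running.
    """
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)
    r = autocorrelation(x, max_lag)
    best_lag = max(range(min_lag, max_lag + 1), key=lambda lag: r[lag])
    return sample_rate / best_lag  # frequency is the reciprocal of the period

# Synthetic test tone: a 200 Hz sine, i.e. a period of 40 samples at 8 kHz.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 200.0 * n / sample_rate) for n in range(1024)]
print(estimate_f0(signal, sample_rate))
```

For this clean synthetic tone the estimate should come out very close to 200 Hz. Real signals are messier, for the reasons that follow.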
That’s basically it, but as always with anything audio, there are loads of subtleties and tricks. First off, the autocorrelation will be very strong not just for a lag of one period, but also two periods, three periods, etc. What that means is that a middle C (C4) can easily be ‘mis-heard’ as a note one octave below (C3, whose period is twice that of C4) or an octave-and-a-fifth below (F2, whose period is roughly three times that of C4). Depending on the signal and what sort of normalization you use, the autocorrelation peaks for multiples of the real period may be stronger than the peak at a lag of one period.
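One common heuristic for this octave problem (an assumption for illustration, not necessarily what the demo does) is to take the shortest lag whose peak is nearly as strong as the strongest one, instead of the global maximum. The sketch below uses a synthetic tone with a 40-sample period; the 0.9 threshold and the helper names are hypothetical.

```python
import math

def normalized_autocorrelation(x, max_lag):
    """Average product of overlapping samples at each lag, scaled so r[0] == 1."""
    n = len(x)
    r = [sum(x[i] * x[i + lag] for i in range(n - lag)) / (n - lag)
         for lag in range(max_lag + 1)]
    return [value / r[0] for value in r]

def first_strong_peak(r, min_lag, threshold=0.9):
    """Smallest lag >= min_lag that is a local peak within `threshold` of the best value."""
    best = max(r[min_lag:])
    for lag in range(max(min_lag, 1), len(r) - 1):
        is_local_peak = r[lag - 1] < r[lag] >= r[lag + 1]
        if is_local_peak and r[lag] >= threshold * best:
            return lag
    return None

# Synthetic tone: three harmonics of 200 Hz at 8 kHz, so one period is 40 samples.
sample_rate = 8000
period = 40
tone = [sum(amp * math.sin(2 * math.pi * k * n / period)
            for k, amp in ((1, 1.0), (2, 0.5), (3, 0.25)))
        for n in range(2000)]

r = normalized_autocorrelation(tone, max_lag=160)
# With this normalization the peaks at one, two and three periods are
# almost equally strong, so a plain "take the maximum" could pick any of them.
print([round(r[lag], 3) for lag in (40, 80, 120)])

lag = first_strong_peak(r, min_lag=8)
print(lag, sample_rate / lag)
```

The threshold is a trade-off: set it too low and a strong second harmonic can cause an octave error in the other direction.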
Next, there’s a problem of resolution: for high pitches, the period is really not very long. For the highest note on a piano (C8, ~4186 Hz), the period is less than a dozen samples (if the sample rate is 44.1 kHz). To get a musically accurate measurement of the pitch, you need to upsample (say, to 88.2 kHz), interpolate the peaks of the autocorrelation, or both.
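To put numbers on that: at 44.1 kHz, C8 has a period of about 10.5 samples, and the two nearest integer lags (10 and 11) are more than a semitone apart. Here's a rough sketch of parabolic interpolation, one standard way to refine a peak location. The function names and the test tone are assumptions for illustration, not code from the demo.

```python
import math

def autocorrelation(x, max_lag):
    """Autocorrelation averaged over the overlapping samples at each lag."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag)) / (n - lag)
            for lag in range(max_lag + 1)]

def parabolic_peak(r, lag):
    """Fit a parabola through r[lag-1], r[lag], r[lag+1]; return its vertex position."""
    y0, y1, y2 = r[lag - 1], r[lag], r[lag + 1]
    curvature = y0 - 2.0 * y1 + y2
    if curvature == 0.0:
        return float(lag)  # flat neighbourhood: nothing to refine
    return lag + 0.5 * (y0 - y2) / curvature

def cents(f, reference):
    """Pitch difference between f and reference in cents (100 cents = 1 semitone)."""
    return 1200 * math.log2(f / reference)

sample_rate = 44100
true_f0 = 4186.0  # roughly C8
tone = [math.sin(2 * math.pi * true_f0 * n / sample_rate) for n in range(4096)]

r = autocorrelation(tone, max_lag=17)
# Search only around one period so the peak at two periods is excluded.
coarse_lag = max(range(5, 17), key=lambda lag: r[lag])
fine_lag = parabolic_peak(r, coarse_lag)

coarse_f0 = sample_rate / coarse_lag
fine_f0 = sample_rate / fine_lag
print(f"integer lag {coarse_lag}: {coarse_f0:.1f} Hz ({cents(coarse_f0, true_f0):+.0f} cents)")
print(f"interpolated lag {fine_lag:.3f}: {fine_f0:.1f} Hz ({cents(fine_f0, true_f0):+.1f} cents)")
```

The integer-lag estimate is off by a large fraction of a semitone, while the interpolated one should land within a few cents for this clean tone.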
There’s also a matter of CPU load. Brute-force autocorrelation is pretty expensive computationally. Fortunately autocorrelation can be performed more efficiently using Fourier transforms, as the autocorrelation of a signal is equal to the inverse Fourier transform of the product of the signal’s Fourier transform and its complex conjugate: AC(x) = IFFT(FFT(x) · FFT(x)*), where * denotes complex conjugation. There are some subtleties there – you need to zero-pad the signals, otherwise you’ll compute the circular autocorrelation, which is far less useful.
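As a rough sketch of that identity, here's a pure-Python version with a small textbook radix-2 FFT so the example is self-contained (in practice you'd use an optimized FFT library). The function names and the `zero_pad` flag are assumptions for illustration; the flag exists only to show what goes wrong without padding.

```python
import cmath

def fft(a, inverse=False):
    """Recursive radix-2 FFT; len(a) must be a power of two. The inverse is unscaled."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def autocorrelation_fft(x, zero_pad=True):
    """AC(x) = IFFT(FFT(x) * conj(FFT(x))), returned for lags 0..len(x)-1."""
    n = len(x)
    # Padding to at least 2n keeps the wrapped-around products out of the result.
    target = 2 * n if zero_pad else n
    size = 1
    while size < target:
        size *= 2
    padded = [complex(v) for v in x] + [0j] * (size - n)
    spectrum = fft(padded)
    power = [s * s.conjugate() for s in spectrum]
    r = fft(power, inverse=True)
    return [v.real / size for v in r[:n]]  # divide by size to finish the inverse

def autocorrelation_direct(x):
    """Brute-force reference for comparison."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag)) for lag in range(n)]

x = [1.0, 2.0, 3.0, 4.0]
print([round(v, 6) for v in autocorrelation_direct(x)])
print([round(v, 6) for v in autocorrelation_fft(x)])
print([round(v, 6) for v in autocorrelation_fft(x, zero_pad=False)])
```

The first two lines should agree (30, 20, 11, 4, up to rounding). The unpadded version gives the circular result (30, 24, 22, 24), where samples from the end of the signal wrap around and get multiplied with samples from the start.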
Even if you get all this stuff right, monophonic pitch detectors using autocorrelation can be thrown off pretty easily, as real-world signals tend not to be as monophonic as we’d like. Even with an instrument that physically only produces one note at a time (say a clarinet – ignoring advanced playing with multiphonics), if you record it in a highly reverberant space, at any given time-slice the recorded signal will contain not just the current note, but also the echoes/reverberation of notes played slightly earlier.
In my more recent experiments with pitch detection, I generally use spectral approaches, as they can be applied to polyphonic pitch detection, and can be tweaked to deal with reverb in the monophonic case.
If this all seems like Greek – even without the mathematical notation, which tends to use Greek letters a lot! – well, that’s the nature of this domain… and that’s why it’s worth hiring experts 🙂
Incidentally, my AudioStretch app for iOS includes a spectrum analyzer graphically aligned with a keyboard display (which is playable). While it doesn’t do note-recognition per se, it shows you the spectrum of whatever notes are playing; by playing the keyboard you can audibly and graphically figure out which note(s) best line up with the spectrum. Eventually I’ll have the app automatically identify the notes.
For example, here’s AudioStretch displaying the spectrum for a major third played on a piano, specifically middle-C (C4) and the E just above it (E4). The spectrum clearly shows the peaks at the C4 and E4, as well as at the harmonics of those notes.