Automatic tempo estimation is a useful tool for musicians transcribing music, and also for audio researchers, who can use the tempo of a piece to inform other types of analysis such as pitch detection or chord detection. Since musically significant events such as chord changes tend to occur on the beat, knowing when the beats occur should improve the accuracy of those detection algorithms.
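An incredibly useful tool in tempo estimation is the autocorrelation function. In signal processing, cross-correlation is a method for estimating how similar two signals are. It is defined (in the discrete case) as

$$(f \star g)[n] = \sum_{m=-\infty}^{\infty} f^{*}[m] \, g[m + n]$$

Here n is the lag (or delay) applied to g before the two signals are compared.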
f* is the complex conjugate of f. In practice, cross-correlation is very similar to convolution, except that neither signal is time-reversed. Since we are working with audio signals, f will always be real and therefore f* = f.
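To make the definition concrete, here is a minimal sketch for real, finite signals, where the sum is just f[m] multiplied by g[m + lag] over all m. The function name `cross_correlate` and the toy signals are my own for illustration, and samples outside either list are assumed to be zero.

```python
def cross_correlate(f, g, lag):
    """Sum over m of f[m] * g[m + lag] for real, finite sequences.

    Samples that fall outside either sequence are treated as zero
    (an assumption of this sketch).
    """
    total = 0.0
    for m in range(len(f)):
        k = m + lag
        if 0 <= k < len(g):
            total += f[m] * g[k]
    return total


# g is f delayed by 3 samples, so the best match should be at lag 3.
f = [0.0, 1.0, 0.5, -0.5, 0.0, 0.0, 0.0, 0.0]
g = [0.0, 0.0, 0.0] + f[:5]

scores = {lag: cross_correlate(f, g, lag) for lag in range(-4, 5)}
best_lag = max(scores, key=scores.get)
print(best_lag)  # 3
```

The lag with the largest score is the delay at which the two signals line up best, which is exactly the property the tempo estimate relies on.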
Autocorrelation is therefore the cross-correlation of a signal with itself at a given lag or delay. By computing the autocorrelation of a signal we can identify at what lag the signal is most similar to itself. For most periodic music with a strong beat, the strongest correlation away from zero lag will occur at a lag of one beat period, so a quick calculation converts the lag index (the number of samples of delay) of the greatest correlation to a bpm value. Dividing the lag index by the sampling frequency gives the beat period in seconds, and the tempo is the reciprocal of that period multiplied by 60, i.e.
Tempo = Fs x 60 / Lag_index
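As a rough sketch of the whole idea, the snippet below autocorrelates a synthetic click track and converts the best lag to bpm using the formula above. The toy sample rate of 100 Hz, the `min_lag` cutoff used to skip the peak at lag 0, and the function names are all assumptions made for this example; real audio would need far more care.

```python
def autocorrelation(x, max_lag):
    """Return r[lag] = sum over i of x[i] * x[i + lag], for lag = 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def lag_to_bpm(lag_index, fs):
    """Tempo = Fs x 60 / Lag_index."""
    return fs * 60.0 / lag_index


fs = 100       # toy sample rate in Hz (assumed, far below real audio rates)
period = 50    # one click every 0.5 s, i.e. 120 bpm
x = [1.0 if i % period == 0 else 0.0 for i in range(8 * fs)]

r = autocorrelation(x, max_lag=2 * fs)

min_lag = 20   # ignore the trivial peak around lag 0 (assumed cutoff)
best_lag = max(range(min_lag, len(r)), key=lambda lag: r[lag])
tempo = lag_to_bpm(best_lag, fs)
print(best_lag, tempo)  # 50 120.0
```

The peak at twice the beat period (lag 100) is only slightly weaker than the one at lag 50, which hints at why real estimators so often land on half or double the true tempo.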
As an illustration, I computed the autocorrelation of a couple of seconds of an audio track (I used “Molly’s Chambers” again), which you can see below.
The highest correlation naturally occurs at index 0: since we are cross-correlating identical signals, they line up perfectly. We therefore look for a strong correlation away from the origin, and in this case there is a good candidate at approximately index 36560. Plugging this value into the formula (with Fs = 44.1 kHz) gives a tempo of 72.37 bpm. According to BPMdatabase.com, “Molly’s Chambers” has a tempo of 146 bpm, which seems about right to me. So our crude example gave a result that is just about half the actual tempo of the song (doubling it gives roughly 144.7 bpm), without any pre-processing carried out on the audio.
In reality, a lot of pre-processing is done on the signal before the autocorrelation is carried out. Typical processing includes running the audio through a filter bank and processing the sub-bands separately (this helps to isolate the bass drum or the hi-hats), envelope tracking to smooth the signal, down-sampling to reduce the computational load, and other more advanced filtering to create a detection function that shows a stronger correlation on the beat.
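Two of those steps, envelope tracking and down-sampling, can be sketched in a few lines. This is only an illustration under my own assumptions: the one-pole smoother, the 10 Hz cutoff, the down-sampling factor of 80 and the synthetic tone bursts are all made up for the example, and the filter bank stage is left out entirely.

```python
import math


def envelope(x, fs, cutoff_hz=10.0):
    """Full-wave rectify, then smooth with a one-pole low-pass filter."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / fs)
    env, y = [], 0.0
    for sample in x:
        y += alpha * (abs(sample) - y)
        env.append(y)
    return env


def autocorrelation(x, max_lag):
    """Return r[lag] = sum over i of x[i] * x[i + lag], for lag = 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


# Synthetic test signal: a 50 ms burst of a 440 Hz tone every 0.5 s (120 bpm).
fs = 8000
beat_period = fs // 2
burst_len = fs // 20
x = []
for i in range(4 * fs):
    pos = i % beat_period
    x.append(math.sin(2 * math.pi * 440 * pos / fs) if pos < burst_len else 0.0)

# Envelope tracking; the smoothing also acts as a crude anti-aliasing
# filter before down-sampling by simply keeping every 80th sample.
factor = 80
env = envelope(x, fs)[::factor]
env_fs = fs // factor          # 100 Hz detection-function rate

r = autocorrelation(env, max_lag=150)
min_lag = 20                   # skip the broad peak around lag 0 (assumed cutoff)
best_lag = max(range(min_lag, len(r)), key=lambda lag: r[lag])
tempo = env_fs * 60.0 / best_lag
print(best_lag, tempo)  # 50 120.0
```

Working on the down-sampled envelope means the autocorrelation runs over 400 samples instead of 32000, which is the computational saving mentioned above.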
(Unrelated to everything above, but I still tend to think of this comic every time I hear the word correlation.)
