Honestly It’s Not Magic – How Shazam Works

The music identification service Shazam has been with us for a while now. This great service/app let’s you identify a piece of music playing in the background or on the radio just by holding up your phone and letting it ‘listen’ to the music for about 10-15 seconds. The app more often than not successfully gives you the title, artist and a link to buy the track on iTunes or other music services. This works even in noisy environments like bars.

The service makes use of ‘fingerprinting‘ of recorded music to allow it to identify individual  tracks. What this means is that new music is entered into their database and processed. The sample you record on your phone goes through the same processing and they search for a match in the database. The concept is simple enough, but how do you reduce complex audio to a simple fingerprint which is easily searchable in a database of over 11 million songs.

The answer comes more or less directly from its creator, Avery Li-Chun Wang, in the paper he wrote describing the algorithms used. I came across the paper via this excellent blog post which itself goes into some of the details.

Diagrams explaining the hash generation principle of operation for Shazam

Hash Generation from Spectogram of recorded music for Shazam

The processing involves taking the spectogram of the music clip, and mapping out ‘constellation points’, representing the frequencies in time of highest intensity. The tricky part here is deciding how many points are needed. There needs to be enough to be able to quickly generate matches, but few enough that the fingerprints are still unique. These constellation points are mapped as pairs by taking one of the points as an anchor point so many seconds from the start of the file, pairing with another constellation point and recording the time difference between them. The pairs are mapped within a ‘target zone’ which limits the number of pairs per anchor point. These constellation pairs are used to generate the entries of a hash table, which is used to speed up the searching and matching. The video below explains briefly how hash tables work.

Since the constellation pairs are stored with a relative time difference between them, the recorded sample you want to match doesn’t need to include the start of the file and can begin at any point in the song. The diagram below shows the constellation pairs that were matched from the target file with those of the stored fingerprints.

Diagram showing a scatterplot of matching hash locations

An Industrial-Strength Audio Search Algorithm – Scatterplot of matching hash locations

The tell-tale diagonal shown in the diagram indicates a positive match and voila – you no longer have to wrack your brain for the name of the song that’s playing in the background of the noisy bar!

For more of a deep dive, here is an implementation of the algorithm written in Java with a nice step-by-step description.

One of the the drawbacks of this method of course is that it only works on recorded music and not live music with few exceptions as noted in my favourite line from the paper.

“The algorithm was designed specifically to target recognition of sound files that are already present in the  database. It is not expected to generalize to live recordings.  That said, we have anecdotally discovered several artists in  concert who apparently either have extremely accurate and  reproducible timing (with millisecond precision), or are more plausibly lip synching.”

Share this post
  , , , ,

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.