Inside an Audio-to-Midi Engine: Building Audio Analysis Infrastructure (Part 4)

Overview

We've been building up to one big topic: how to map energy to musical content, and it's finally here. We don't need to be DSP experts to dive headlong into the audio world. What we're really trying to solve is how to transform uniformly spaced spectral energy into meaningful musical information.


Updated Architecture

nyx.a2m.architecture.png

As you can see, we keep expanding as we find new problems and have to create new solutions. You can see that our MIDI Mapping pipeline has grown. Let's dig into why.


FFT to Meaningful Musical Information

For illustrative purposes, here's a buffer of data output from our FFT:

fft[   0 ] = 0.0
fft[   1 ] = 0.0
...
fft[ 340 ] = 0.2
fft[ 341 ] = 0.3
fft[ 342 ] = 1.0
fft[ 343 ] = 0.4
fft[ 344 ] = 0.1
fft[ 345 ] = 0.0
...

That's energy across uniform, discrete data points. If our sample rate is 44,100 Hz and our FFT window (i.e., its output/buffer size) is 4,096, then each of those data points (bins) spans ~10.76 Hz.

$$\Delta f = \frac{44100}{4096} \approx 10.76\ \text{Hz}$$

In the example above, bin 340 sits at roughly 10.76 * 340 = 3658.4 Hz. Doing that for bins [340, 344] gives [3658.4 Hz, 3701.44 Hz]. It's our job now to understand what that means, both in terms of energy and music.
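As a sanity check, here's that bin-to-frequency math in a minimal Python sketch. (Note that using the exact Δf of ~10.7666 Hz, instead of the rounded 10.76 used in the prose, shifts the results by a couple of Hz.)

```python
# Frequency resolution of the FFT: each bin spans sample_rate / fft_size Hz.
SAMPLE_RATE = 44100
FFT_SIZE = 4096
DELTA_F = SAMPLE_RATE / FFT_SIZE  # ~10.7666 Hz per bin

def bin_to_hz(bin_index: int) -> float:
    """Frequency represented by an FFT bin index."""
    return bin_index * DELTA_F

# The bins from the example buffer above:
freqs = [bin_to_hz(i) for i in range(340, 345)]
```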


Why a Radial Bin Approach Isn't the Solution

If you read article 3, you'll know that when I first approached this problem, I thought I could use a bin radius: select a number of surrounding bins, sum their energy, and map the result to MIDI pitches. That mapping technique leaves dead zones in the upper frequencies, though. Have a look at the image below:

nyx.a2m.harmonic.dead.zone.png

You can see there's plenty of energy in the FFT bins, but no MIDI notes are active!

Here are the important parameters:

Sample Rate = 44100 
FFT Window Size = 4096 
Windowing Algorithm = Hann 
Oscillator Frequency = 3661.239 
Radius Bins = 4
Sigma Scale = 0.261

The frequency should map to MIDI pitch 105 or 106, since it falls in the range [A7 @ 3520 Hz, A#7 @ 3729.31 Hz).

I can increase the radius bins and get a reaction, but that's neither stable nor intuitive. It also becomes increasingly unstable the higher in frequency we go.

Gaussian Bin Radius

Let's break this down:

$$\sigma = binRadius \cdot sigmaScale\ \text{(bins)}$$

This is what I've been using to compute how much spread across the surrounding bins to take into account. It also explains why only increasing the binRadius parameter made any real difference:

$$\sigma = 4 \cdot 0.261 = 1.044\ \text{bins}$$
Meaning that only bins very close to the exact note center count. In the higher frequencies, that's going to be challenging (see article 3 for more details). So, increasing the binRadius helps, but there are two problems:

  1. One sigma calculation doesn't fit across the spectrum
  2. We don't want to set up a bunch of sigmas that we manually set across the spectrum
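To make the falloff concrete, here's a minimal Python sketch of the fixed radial-bin weighting. The function names mirror the parameters above; they're my simplification, not the engine's actual code.

```python
import math

def fixed_sigma(bin_radius: int, sigma_scale: float) -> float:
    # sigma = binRadius * sigmaScale, expressed in bins
    return bin_radius * sigma_scale

def bin_weight(distance_bins: float, sigma: float) -> float:
    # Standard Gaussian falloff over bin distance
    return math.exp(-0.5 * (distance_bins / sigma) ** 2)

sigma = fixed_sigma(4, 0.261)  # 1.044 bins
```

With σ ≈ 1.044 bins, a bin just two steps from note center already contributes under 16% of full weight, which is why only cranking binRadius moved the needle.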

Building a Musically Scale-Aware Mapper

We gotta dump the following fixed approach (for now):

  • radius = fixed 4 bins
  • sigma = fixed fraction of radius

There are three options we're going to consider:

  1. Semitone width
  2. Note-to-note Spacing
  3. Cents

I'll briefly discuss each one and choose one. Just note that each one pairs naturally with a Gaussian scale whose coverage shrinks at the lower end and widens as frequency increases.

Option A: Semitone width

A semitone is the gap between two adjacent notes, e.g., C2 = 65.40639 Hz and C#2 = 69.29566 Hz, so 69.29566 - 65.40639 = 3.88927 Hz. But that semitone gets larger and larger as frequency rises, as we've discussed. So what's the formula?

$$semitone = f \cdot (2^{1/12} - 1)$$

We know our starting point because we limit the MIDI pitch to 24 (C1). The formula for converting a MIDI note to frequency is:

$$f = 2^{(midi-69)/12} \cdot tuning$$
$$2^{(24-69)/12} \cdot 440 = 32.7032\ \text{Hz}$$

Now we have all the components we need to calculate the semitones for our range. Our range, btw, is 88 keys, just like on a full piano [C1 24 - E8 112].
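Here's a small Python sketch of both formulas (the function names are mine, not the engine's):

```python
def midi_to_freq(midi: int, tuning: float = 440.0) -> float:
    """Standard MIDI-note-to-frequency conversion."""
    return 2 ** ((midi - 69) / 12) * tuning

def semitone_width(f: float) -> float:
    """Width in Hz of the semitone starting at frequency f."""
    return f * (2 ** (1 / 12) - 1)

c1 = midi_to_freq(24)           # ~32.7032 Hz, the bottom of our range
width_at_c1 = semitone_width(c1)
```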

Option B: Note Boundaries

This is basically the same as the semitone approach, except we view the world in terms of midi notes rather than adjacent frequency differences:

$$f_{lower} = \sqrt{f_{m-1}f_m}$$
$$f_{upper} = \sqrt{f_{m}f_{m+1}}$$

The menu interface already does this, and the mapper would line up 1-to-1 with it, making debugging easy.
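A quick Python sketch of the boundary math (hypothetical helper names), reusing the MIDI-to-frequency formula from Option A:

```python
import math

def midi_to_freq(m: int, tuning: float = 440.0) -> float:
    return 2 ** ((m - 69) / 12) * tuning

def note_bounds(m: int) -> tuple[float, float]:
    """Geometric-mean boundaries between note m and its neighbors."""
    f = midi_to_freq(m)
    lower = math.sqrt(midi_to_freq(m - 1) * f)
    upper = math.sqrt(f * midi_to_freq(m + 1))
    return lower, upper
```

For A7 (105) this gives roughly [3419.8 Hz, 3623.1 Hz], which already tells us our 3661.239 Hz oscillator belongs to A#7's band.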

Option C: Cents

This one scores distances in cents instead of bins.

We can take any frequency and convert it using this formula:

$$c = 1200\log_2\left(\frac{f_k}{f_0}\right)$$

It's important to note here that f0 is NOT 32.703 Hz. We want the center frequency of a candidate MIDI note. I made that mistake at first.

Find the Candidate Note

With fk = 3661.239 Hz as our frequency, we find our candidate note by converting 3661.239 to a MIDI value first:

$$m = 69 + 12 \cdot \log_2(f_k / tuning)$$
$$m = 69 + 12 \cdot \log_2(3661.239 / 440) = 105.68$$

That lands between MIDI notes 105 (A7) and 106 (A#7), so both are candidate note centers. The worked example below measures against 105, because we need a note center to compute a cents distance.

And lastly, convert MIDI note 105 to its frequency, which is 3520 Hz (see the formula above).

Calculate the cents

$$1200\log_2\left(\frac{3661.239}{3520}\right) = 68.108$$

What that means is that our frequency is ~68 cents sharp of 3520 Hz (A7). Because we're between two notes (1 semitone = 100 cents), we also know that 68.108 - 100 = -31.892, meaning we're 31.892 cents flat of A#7 (3729.31 Hz).
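The whole candidate-plus-cents computation can be sketched like so (Python, hypothetical function names):

```python
import math

TUNING = 440.0

def midi_to_freq(m: int) -> float:
    return 2 ** ((m - 69) / 12) * TUNING

def cents(f: float, f0: float) -> float:
    """Signed cents distance of f from a note center f0."""
    return 1200 * math.log2(f / f0)

f_k = 3661.239
m_exact = 69 + 12 * math.log2(f_k / TUNING)             # ~105.68
candidates = (math.floor(m_exact), math.ceil(m_exact))  # A7 (105), A#7 (106)
c_a7 = cents(f_k, midi_to_freq(105))                    # ~ +68.1 (sharp of A7)
c_as7 = cents(f_k, midi_to_freq(106))                   # ~ -31.9 (flat of A#7)
```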

Use Gaussian Weighting

Now use Gaussian Weights, but in cents:

nyx.a2m.gaussian.weight.png

NOTE: this is a standard Gaussian kernel over the cents distance, centered at the note frequency (sorry for the image; the LaTeX renderer failed on this one):

$$w = e^{-\frac{1}{2}\left(\frac{c}{\sigma}\right)^2}$$

Where:

  • c = cents distance from the candidate note
  • σ = Gaussian width in cents!

σ tells us how fast the weight falls off as pitch moves away from the note center, or better put: "how tolerant should this note be to pitch deviation?"

We can set our sigma to be [1, 100] (cents). Let's run a few scenarios:

  1. σ = 25

$$e^{-\frac{1}{2}(68.108 / 25)^2} = 0.0245$$
$$e^{-\frac{1}{2}(-31.892 / 25)^2} = 0.443$$

* σ = 25 is tight. Contributions from neighboring semitone centers are low.

  2. σ = 50

$$e^{-\frac{1}{2}(68.108 / 50)^2} = 0.395$$
$$e^{-\frac{1}{2}(-31.892 / 50)^2} = 0.816$$

* σ = 50 is very forgiving. Neighboring semitone centers bleed into each other.

  3. σ = 100

$$e^{-\frac{1}{2}(68.108 / 100)^2} = 0.793$$
$$e^{-\frac{1}{2}(-31.892 / 100)^2} = 0.950$$

* σ = 100 basically says "make sure the neighbors are included".
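All three scenarios can be reproduced with a one-line weight function (Python):

```python
import math

def gaussian_weight(cents_dist: float, sigma: float) -> float:
    """Gaussian tolerance to pitch deviation; both arguments in cents."""
    return math.exp(-0.5 * (cents_dist / sigma) ** 2)

# Weights for A7 (+68.108 cents away) and A#7 (-31.892 cents away):
for sigma in (25, 50, 100):
    w_a7 = gaussian_weight(68.108, sigma)
    w_as7 = gaussian_weight(-31.892, sigma)
```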

The solution: Option C

It had to be, right? It's the coolest and the most musical option. I don't really like grouping by raw frequency, because we still run into problems at the lower end. Btw, the same problem shows up in Reaper's ReaTune: with a down-tuned guitar, it fails to read the tuning. I say that with love and respect for Reaper; I'm just excited that I now understand why it's so difficult to do.

The Steps

  1. Loop through the FFT bins
  2. If fft[ i ] >= power filter, then:
    1. Find the frequency of fft[ i ]
    2. Find candidate note centers based on that frequency
    3. Compute the cents distance
    4. Compute the Gaussian weight
    5. Accumulate the score and weight sum:

$$score = score + w \cdot power_k$$
$$wSum = wSum + w$$

  3. Calculate the result

$$result = \frac{score}{wSum}$$

nyx.a2m.harmonic.live.zone.png

As you can see in the image, we now have a working model that handles the upper frequencies without any dead zones (same oscillator frequency as above).
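Putting the steps together, here's a self-contained Python sketch of the whole mapper. The power filter value, the two-candidate selection, and the helper names are my simplifications, not the engine's actual code.

```python
import math

SAMPLE_RATE = 44100
FFT_SIZE = 4096
DELTA_F = SAMPLE_RATE / FFT_SIZE
MIDI_LOW, MIDI_HIGH = 24, 112  # C1 .. E8

def midi_to_freq(m: int, tuning: float = 440.0) -> float:
    return 2 ** ((m - 69) / 12) * tuning

def map_fft_to_midi(power, power_filter=0.05, sigma_cents=50.0):
    """Accumulate Gaussian-weighted scores per MIDI note, then normalize."""
    scores, wsums = {}, {}
    for i, p in enumerate(power):
        if i == 0 or p < power_filter:
            continue  # skip DC and anything under the power filter
        f = i * DELTA_F
        m_exact = 69 + 12 * math.log2(f / 440.0)
        # The note centers on either side of the exact pitch are candidates.
        for m in {math.floor(m_exact), math.ceil(m_exact)}:
            if not MIDI_LOW <= m <= MIDI_HIGH:
                continue
            c = 1200 * math.log2(f / midi_to_freq(m))    # cents distance
            w = math.exp(-0.5 * (c / sigma_cents) ** 2)  # Gaussian weight
            scores[m] = scores.get(m, 0.0) + w * p       # score += w * power_k
            wsums[m] = wsums.get(m, 0.0) + w             # wSum += w
    return {m: scores[m] / wsums[m] for m in scores}

# Feed it the example buffer from the top of the article:
power = [0.0] * (FFT_SIZE // 2)
for i, p in zip(range(340, 345), (0.2, 0.3, 1.0, 0.4, 0.1)):
    power[i] = p
notes = map_fft_to_midi(power)  # energy lands on notes 105 and 106
```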

And for a full demonstration, you can see the tool in action sweeping across the mid-to-upper range.


Why Cents Work

Musical pitch is logarithmic: every octave doubles in frequency and every semitone represents a constant ratio. FFT bins, however, are linearly spaced in Hz. That mismatch means a fixed bin distance cannot represent pitch deviation consistently across the spectrum (again, see article 3 for a complete discussion of this).

Cents solve this by expressing pitch distance on a logarithmic scale, where one semitone always equals 100 cents regardless of frequency. This gives us a uniform way to measure how far a spectral peak is from a note center.

Once distances are expressed in cents, Gaussian weighting becomes a natural fit. The Gaussian curve simply defines how tolerant a note should be to pitch deviation, allowing nearby frequencies to contribute smoothly to the note's energy instead of producing unstable or missing detections.

Having said all that, it doesn't work well in all circumstances.


The Future is Multi-Regional

The Gaussian cents mapping works really well for mid-to-high frequencies, but it absolutely bombs at the lower frequencies. We could split up the spectrum and use the same technique with different parameters per split, but for the low end we actually want the power of the first algorithm, the one I said created dead zones, because it works wonders down there. So, our new approach is:

  1. Divide the spectrum in n regions
  2. Each region has its own:
    1. assignable midi mapping algorithm
    2. customizable parameters
    3. customizable crossover regions
  3. Crossover section midi emitting

My brother-in-law is an audiophile and he's always talking about how high-end speakers manage crossover points; some of them even let you choose width and shape of crossovers between ranges. We need to apply that same concept, because the lower Hz rely on larger FFT sizes and/or harmonics, while higher frequencies can get away with smaller FFT sizes and rely on fundamentals.
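As a teaser for article 5, here's one way to sketch an audio-style crossover between two mapping regions in Python. Everything here, the boundary, the width, and the linear shape, is a placeholder, not the final design.

```python
def crossfade_weights(f: float, boundary: float, width: float) -> tuple[float, float]:
    """Linear crossfade between a low region and a high region.

    Below the crossover band the low algorithm gets full weight; above it
    the high algorithm does; inside the band, both contribute.
    """
    lo, hi = boundary - width / 2, boundary + width / 2
    if f <= lo:
        return 1.0, 0.0
    if f >= hi:
        return 0.0, 1.0
    t = (f - lo) / (hi - lo)
    return 1.0 - t, t
```

The two weights always sum to 1, so notes emitted inside the crossover band blend smoothly instead of switching algorithms abruptly.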

In article 5, I'll specifically cover how we create regions and resolve crossover/border problems.

If you've made it this far, then thanks for sticking around.