Inside an Audio-to-MIDI Engine: Building Audio Analysis Infrastructure (Part 1)

Overview

In this series, I'm going to build an Audio-to-MIDI plugin. There are several utilities that do this already (and do it well), but I have a narrow scope in mind: create MIDI data points from audio data for forwarding to the nyx.vfx plugin. In addition to having a specific use case, I want to model this on the same ethos as nyx.vfx's "Everything Is Adjustable!" I want to expose as many fine-grained controls as possible to really be able to dial in all types of scenarios.

Additionally, I want a few perks for myself, such as "sympathetic" MIDI notes, where we can trigger multiple surrounding MIDI notes at different reduced velocities based on the audio data. This means the lines of faithful accuracy can be blurred in favor of aesthetics, if the user chooses. We should be able to consume those into fractal-like creations in nyx.vfx.
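To make "sympathetic" notes a little more concrete, here's a rough sketch of the kind of mapping I have in mind. Everything here is hypothetical at this stage — the struct, the function name, and the spread/falloff parameters are invented for illustration, and the real plugin would expose them as adjustable controls (range clamping to [0, 127] is omitted for brevity):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: for a detected main note, also emit neighboring
// notes at velocities that fall off with distance in semitones.
struct MidiEvent_t
{
  uint8_t note;     // MIDI note number, e.g., 60 == C4
  uint8_t velocity; // [0, 127]
};

std::vector< MidiEvent_t > sympatheticNotes(
  const uint8_t mainNote,
  const uint8_t mainVelocity,
  const int spread,       // semitones on each side of the main note
  const double falloff )  // velocity multiplier per semitone, e.g., 0.4
{
  std::vector< MidiEvent_t > events;
  events.push_back( { mainNote, mainVelocity } );

  double scale = 1.0;
  for ( int d = 1; d <= spread; ++d )
  {
    scale *= falloff;
    const auto v = static_cast< uint8_t >( mainVelocity * scale );
    if ( v == 0 )
      break; // nothing audible left to emit

    events.push_back( { static_cast< uint8_t >( mainNote + d ), v } );
    events.push_back( { static_cast< uint8_t >( mainNote - d ), v } );
  }
  return events;
}
```

With `spread = 2` and `falloff = 0.4`, a C4 at velocity 100 would also trigger B3/C#4 at 40 and A#3/D4 at 16 — the kind of "bleed" shown in the image below.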

sympathetic_midi_trigger.png

Don't get bogged down in the note identification. The big takeaway from the image is that C4 is the main note and then there's some bleeding into the surrounding notes that we want to capture along with their magnitude. We can use a simple nyx.vfx layout to group them and then blur or liquify them into some cohesive shape that would look really cool.

Creating Testing Infrastructure

nyx.a2m.test.flow.png

Before we can even get started on building a VST3 plugin, we need to start building out our infrastructure for testing and analysis. Since we're not DSP PhDs (at least I'm not), but C++ software engineers, we're going to do this using only C++, the standard library, and then eventually an FFT library.

  1. Simulation Oscillators
    1. Pure-tone generators
    2. Ability to configure the oscillators live
  2. Oscillator validators
    1. Zero-cross estimator
    2. Goertzel algorithm
  3. DFT Overview
    1. Naive implementation
  4. kissfft usage

Build Simulation Oscillators

We want a flexible oscillator that generates one note for us, and we want it to generate notes similarly to how the VST3's Processor hands us data.

using FPType = double;

// requires <cstdint>
struct OscillatorData_t
{
  FPType frequency = 440.0;    // A4
  FPType sampleRate = 44100.0;
  FPType phase = 0.0;
  FPType time = 0.0;
  FPType amplitude = 0.8;      // [0, 1]
  uint32_t blockSize = 128;
};

What would it look like to generate a single sample point based on this data? We need to be able to generate data regardless of the wall clock.

// requires <cmath> and <numbers>
FPType renderSample( OscillatorData_t & o )
{
  // sin( tau * phase ) * amplitude
  const FPType s =
    std::sin( o.phase * 2.0 * std::numbers::pi_v< FPType > ) * o.amplitude;

  // phase advance per sample, e.g., 440 / 44100 = 0.009977...
  // we're not moving time forward, we're stepping in phase.
  const FPType phaseInc = o.frequency / o.sampleRate;

  // increase the phase by the incremental step above and store it,
  // because we'll need to keep stepping forward on the next call
  o.phase += phaseInc;

  // wrap phase to [0, 1) to prevent floating-point drift as it grows
  o.phase -= std::floor( o.phase );

  return s;
}

Now we just need to stuff generated samples into a buffer, so that whenever a simulation process function gets called, we can hand back a block of data.

void renderBlock( FPType * out, const size_t szOut, OscillatorData_t & o )  
{  
  for ( size_t i = 0; i < szOut; ++i )  
    out[ i ] = renderSample( o );  
}

Now we have an adjustable oscillator. Notice that we're not interested in square or triangles or saw waves. We can add those later, if we want.

Validating Oscillation Output

This took me some reading to find because I didn't want to use our DFT in the pre-analysis stage. There are two ways we can validate a simple frequency:

  1. Zero-cross estimation
  2. The Goertzel algorithm

We're going to implement both for practice.

The Zero-Cross Estimation Method

In very simple terms, we're just going to look at changes in sign (hence, crossing zero). A sine wave crosses the zero line on the x-axis twice within a cycle. We count rising zero crossings only (i.e., going from negative to non-negative), so one crossing ≈ one cycle. The only issue is that crossings land on whole sample positions rather than on the sine's actual period, so the estimate tends to be off a little.

FPType estimateFreqZeroCrossing(
  const FPType * buffer,
  const size_t szBuffer,
  const FPType sampleRate )
{
  // number of rising crossings
  int crossings = 0;

  // loop across our buffer whose total time we do know
  for ( size_t i = 1; i < szBuffer; ++i )
  {
    // check the signs of prev & curr samples to see if they differ
    if ( buffer[ i - 1 ] < 0.0 && buffer[ i ] >= 0.0 )
      ++crossings;
  }

  // duration in seconds, e.g., 128 / 44100 in this example
  const auto seconds = static_cast< FPType >( szBuffer ) / sampleRate;
  return crossings / seconds; // ~Hz for a sine
}

128 samples @ 44.1kHz ≈ 2.9 ms window, where 440 Hz is only ~1.27 cycles. It's pretty noisy. Here's what I get back when I run it:

Zero-cross: 439.8 Hz (err 0.2)

Improvement

We might be able to improve the accuracy using linear interpolation, which gives us fractional crossing positions instead of whole sample indices. Although, I'm only interested in revisiting this if the Goertzel algorithm doesn't pan out.

The Goertzel Algorithm

Goertzel can be thought of as asking: how much of a specific frequency, e.g., 440 Hz, exists in a signal?

That's very similar to a DFT, except we're asking about a single bin instead of calculating the entire spectrum.

Wikipedia describes the formula as this:

$${s[n]=x[n]+2\cos(\omega _{0})s[n-1]-s[n-2]}$$
Where

$$\omega_0 = \frac{2\pi f_{\text{target}}}{f_s}$$
All that is to say:

FPType findGoertzelPower(
  const FPType * buffer,
  const size_t szBuffer,
  const FPType sampleRate,
  const FPType targetHz )
{
  // targetHz = the frequency we're trying to find, e.g., 440
  const FPType w = 2.0 * std::numbers::pi_v< FPType > * targetHz / sampleRate;

  // this becomes our coefficient
  const FPType coeff = 2.0 * std::cos( w );

  // these are s[ n ], s[ n - 1 ], and s[ n - 2 ] respectively --
  // the filter state, not the input samples
  FPType s0 = 0.0, s1 = 0.0, s2 = 0.0;

  // we'll perform that formula on all samples in the buffer
  for ( size_t i = 0; i < szBuffer; ++i )
  {
    // s[ n ] = x[ n ] + coeff * s[ n - 1 ] - s[ n - 2 ]
    s0 = buffer[ i ] + coeff * s1 - s2;

    s2 = s1; // s1 becomes s2
    s1 = s0; // s0 becomes s1 for the next iteration
  }

  // power at the target frequency, e.g., 440, using magnitude^2
  const FPType power = s1 * s1 + s2 * s2 - coeff * s1 * s2;
  return power;
}

This exact formula can be found in pseudocode on Wikipedia https://en.wikipedia.org/wiki/Goertzel_algorithm

We get the output:

Goertzel: P(f)=7.77924e+09

What does the output mean?

It's the squared magnitude of the complex DFT bin, which should not be confused with its amplitude, RMS, dB, or normalization. It's just raw energy accumulated over all the samples.

Let's break down our current scenario and then we'll look into converting it into something humans understand.

targetHz   = 440
sampleRate = 44100
seconds    = 5     (the duration of our entire sample window)
amplitude  = 0.8

N = 44100 * 5 = 220500 samples (a perfect sine wave)

$$|X| = \frac{NA}{2}$$
$$|X| = \frac{220500 \times 0.8}{2} = 88200$$
$$|X|^2 = 88200^2 ≈ 7.77924 \times 10^9$$

So we get our expected outcome, but it's pretty careless to compare raw power values directly in our test harness. We need to convert it to amplitude or something human-readable.

$$A = 2 \frac{\sqrt{p}}{N}$$
Where p = the power we get back from our Goertzel algorithm. An important note is that perfect bin alignment is only required for exact amplitude formulas. Goertzel still detects off-bin frequencies reliably, though leakage affects magnitude scaling. So be aware of this if you decide to start throwing this into practical applications.

// nSamples = total samples, i.e., sampleRate * seconds
FPType getGoertzelAmplitude( const FPType power, const int32_t nSamples )
{
  return 2.0 * std::sqrt( power ) / static_cast< FPType >( nSamples );
}
Goertzel: P(f)=7.77924e+09
Amplitude estimate: 0.8000000119
Relative error: 0

And now we can do the following for validation:

assert( relErr < 0.01 && "Amplitude estimate error is too large" );

In the end, I'm going to keep both approaches.

DFT Overview

A DFT projects a signal onto sinusoidal basis functions corresponding to discrete frequency bins (that's a bit more formal than I usually like to state things, but it's the most accurate).

excel.dft.png

The image above shows a 128-sample block produced by summing two sine waves, with the DFT worked out in Excel so I could see all the steps. The first thing to note is that the spectrum is mirrored because the input is real-valued. The second thing to note is that there are two humps that leak across a few bins; leakage occurs because the tones are not bin-aligned. The energy is primarily in bins 5 and 21. If we want to convert those to frequencies, then we need the following formula:
$$f_k = \frac{k f_s}{N}$$

where k is the bin index, f_s is the sample rate, and N is the block size:

$$f_5 = 5 \cdot \frac{44100}{128} ≈ 1723 Hz$$
Between G#6 and A6

$$f_{21} = 21 \cdot \frac{44100}{128} ≈ 7235 Hz$$

Between A8 and A#8

Quick Recap

  • Don't confuse power, magnitude, and amplitude. They are all different DSP quantities.
  • Frequency accuracy depends on analysis window length
  • Perfect bin alignment eliminates leakage

Coming Up

The next steps are where it gets really exciting. We need to set up a naive DFT and then incorporate a good FFT. Then we'll start testing them to make sure we get back what we want. There are ways of setting up a "perfect" DFT by ensuring the tone aligns squarely within a bin, i.e., no leakage (unlike in the graph above). And then we'll start summing our waveforms, connecting those to the DFT/FFT implementations, do more testing, and then start mapping MIDI notes in fun ways.