Inside an Audio-to-Midi Engine: Building Audio Analysis Infrastructure (Part 2)

Overview

In the last article, we built our Oscillator simulators for our Audio-to-Midi engine. We then implemented some verification algorithms for those oscillators. Now it's time to design our data structures for managing audio buffers, set up our FFT wrapper, sum some waves, and then validate the data.

We have one more article after this one before we can begin moving out of the test infrastructure and into VST3 land.

Let's have a look at what we've done and what we'll do in this article:

a2m.article2.testflow.png


Summing Waves

If you look online, you'd think this is the most complicated process in the history of sine wave analysis: supposedly, if the amplitudes or frequencies differ, you have to sum them one way versus another. In the end, it's simply:

$$f(t) = \sin(2\pi\omega_1 t) + \sin(2\pi\omega_2 t)$$
where ω is frequency in Hz and t is time in seconds.

So, let's say that I set one of my oscillators to 440 Hz (A4) and one to 1046.5 Hz (C6); then we do the following:

$$f(t) = \sin(440 \cdot 2\pi t) + \sin(1046.5 \cdot 2\pi t)$$
Pretty easy.
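To make that concrete, here's a minimal sketch of the summation (sumTwoSines is a hypothetical helper for illustration, not the engine's oscillator code): each output sample is just the two sine values at that instant, added together.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sum two sines sample-by-sample, e.g. A4 (440 Hz) + C6 (1046.5 Hz).
// Each sine contributes [-1, 1], so the sum ranges over [-2, 2].
std::vector< double > sumTwoSines( double f1, double f2,
                                   double sampleRate, size_t numSamples )
{
    const double PI = std::acos( -1.0 );
    std::vector< double > out( numSamples );

    for ( size_t n = 0; n < numSamples; ++n )
    {
        const double t = static_cast< double >( n ) / sampleRate; // seconds
        out[ n ] = std::sin( 2.0 * PI * f1 * t ) + std::sin( 2.0 * PI * f2 * t );
    }
    return out;
}
```

If you need normalized audio you can scale the sum by 0.5 afterward to stay within [-1, 1]; for FFT validation the raw sum is fine.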

If we want aliasing (and we very well may), we can reduce our sample rate below 2 × 1046.5 Hz, the minimum the Nyquist theorem requires for the higher tone. This will give us some distortion to play with to see what happens when we begin mapping audio data to MIDI notes. That's another feature we'll add once we get through the next section.


FFT of Two Summed Sines

a2m.sum.128.2048.png

The image above represents an FFT of the following:

FFT Window Size: 2048
FFT Bins       : 2048 / 2 + 1 = 1025 (≈ 21.53 Hz bin width)
Sample Rate    : 44,100
Frequencies    : 440 + 1046.5

You can clearly see the two dominant peaks. This validates that our summing logic behaves exactly as expected.


FFT of Am Triad

Now let's do an A minor triad:

  • A4 @ 440 Hz
  • C6 @ 1046.5 Hz
  • E7 @ 2637 Hz

a2m.sum.128.2048.Am.png

This time, we increase the FFT window size to 4096 (≈ 10.77 Hz bin width). Obviously, we still have leakage; we'll discuss that more below. For now, this is just validation that summing multiple oscillators behaves correctly.


Simulating VST3's process()

At the moment, we generate a chunk of data through our Oscillators:

Oscillator o;  
auto& odata      = o.getOptions();  
odata.frequency  = freqHz;        // e.g., 440
odata.sampleRate = sampleRate;    // e.g., 44100
odata.blockSize  = blockSize;     // e.g., 128
  
// total samples = seconds * sampleRate
// 5 seconds @ 44100 = 220,500 samples
const auto oResult = o.generate( seconds );

We generate way more data than process() will operate on at a time. Our entire goal here is to create realistic data streams, so we'll address that shortly.


Oscillator as an Audio Interface Abstraction

Think of our oscillator like an audio interface. You plug in your guitar and start running analog data into the interface. The interface converts it to digital using a configured sample rate and block size. It buffers n samples, pushes them through the driver, and hands them off to the OS.

The only difference here is that we buffer way more than an audio interface ever should, because we don't have to worry about optimized hardware performance. That's for interface manufacturers and driver engineers.

We just need realism in structure.


Ring Buffer

A ring buffer is straightforward in theory: you have a fixed-size buffer that overwrites the oldest data with the newest data.

ringbuffer 1.png

If green is the oldest sample and red is the newest, you get the idea. But the implementation causes problems.

If m_buffer[5] is actually the oldest sample and m_buffer[4] is the newest, we can't just return m_buffer[0] unless we only care about magnitude. If we care about phase (and we're going to pretend we do!), order matters.


Ring Buffer: The Mirror Technique

There are a few solutions, but I'm only going to focus on one of them: the mirror technique. We double the memory size and perform two copies on each write.

ringbuffer.mirror.excel.png

The image of the spreadsheet should clarify what the mirror technique does. The light-green stripes show you what index we need to return to get time-consecutive data after each write.

This convenience comes at a cost, but it's simple, fast enough, and the tradeoff is worth it. Let's look at the basic implementation:

///
/// @tparam T the data type (float, double)
/// @tparam N the number of blocks the buffer will contain
template < typename T, size_t N >
class RingBuffer
{
public:
  ///
  /// @param sectionSize the size of a block
  /// @param hopSections the number of blocks to skip (determines our overlap %)
  RingBuffer( const size_t sectionSize,
              const size_t hopSections )
    : m_sectionSize( sectionSize )
  {
    // ... error-checking elided ...

    // clamp this down to prevent bizarre errors
    m_hopSections = std::clamp< size_t >(
      hopSections, 1, N );

    m_mirrorBase = N * sectionSize;

    // primary half + mirror half
    m_buffer.resize( 2 * m_mirrorBase );
  }

  // return a pointer to our buffer starting at the oldest section
  const T * getBuffer() const
  { return m_buffer.data() + m_oldestIndex * m_sectionSize; }

  void emplace( std::span< const T > block )
  {
    const size_t writePosition       = m_oldestIndex * m_sectionSize;
    const size_t writeMirrorPosition = m_mirrorBase + writePosition;

    // we copy the incoming block into both the primary and mirror halves
    std::copy( block.begin(), block.end(),
               m_buffer.begin() + writePosition );
    std::copy( block.begin(), block.end(),
               m_buffer.begin() + writeMirrorPosition );

    // now we always point at the oldest section of the buffer
    m_oldestIndex = ( m_oldestIndex + 1 ) % N;
  }

private:
  // sized to 2 * N * sectionSize in the constructor
  // (sectionSize is a runtime value, so std::array won't do here)
  std::vector< T > m_buffer;

  size_t m_sectionSize { 0 };
  size_t m_hopSections { 0 };
  size_t m_oldestIndex { 0 };
  size_t m_mirrorBase  { 0 };
};

That's the heart of the implementation. The rest is just configuration and record-keeping, such as indicating whether the buffer is full.


RingBuffer Overlap Options

Depending on how we set up our RingBuffer, we can get a percentage of overlap between FFT calls, which can make for smoother analysis. A realistic setting might look like the following:


/*
Ring Segments        = 32
Ring Segment Size    = 128
RingBuffer Total     = 4096 (32 * 128)

process() block size = 128 => the ring segment size should equal this
 */

RingBuffer< FPType, 32 > rb( 128, 16 ); // 50% overlap

As a test example, here's RingBuffer< int, 8 > rb( 1, 3 ):


RingSegments          : 8
RingSegmentSize       : 1
NonOverlappingSections: 3
------------------------------
0: 0 <-- buffer is full for the first time
1: 1
2: 2
3: 3
4: 4
5: 5
6: 6
7: 7
------------------------------
0: 3 <-- we've replaced 0, 1, 2
1: 4
2: 5
3: 6
4: 7
5: 8
6: 9
7: 10
------------------------------
0: 6 <-- we've replaced 3, 4, 5
1: 7
2: 8
3: 9
4: 10
5: 11
6: 12
7: 13

DFT

I'm not going to spend time explaining what a DFT does. There are plenty of real DSP professors (e.g., Steve Brunton) who are far more qualified than I am and can give you a million examples of how a DFT is useful. We only want to understand why we need one.

We need an FFT because we're going to have complex waveforms to analyze. An FFT will tell us where the energy in the frequency spectrum lies, so that we can map the frequencies carrying energy to MIDI pitches.


kissfft

A few things to note about kissfft:

  1. it is barebones! It's easy to use and it's fast enough for us, but you get just the real and imaginary parts and that's it
  2. it does NOT normalize or provide power or phase
  3. it does NOT have any windowing
  4. it does let you choose which data type you want (I use double because VST3 uses double).

In CMake, you can easily set the type by using:

find_package(kissfft CONFIG REQUIRED COMPONENTS double)

Before we write out our FFT wrapper, let's talk about normalization and windowing briefly.


Normalization

We use inverse FFT normalization:

mag = sqrt( r^2 + i^2 ) * 1/N

We also apply one-sided amplitude correction (i.e., we multiply by 2 for every bin except DC and Nyquist).

Without normalization, comparing different FFT window sizes becomes messy and misleading. There are other normalization techniques you can read about.


Windowing

We're not going to apply windowing yet because I want to create hot-swappable windowing algorithms. Whenever you read about "leakage" here, just know that we didn't attempt to reduce it at all, because we didn't apply windowing before running the FFT.


FFT Wrapper

We'll use a wrapper because kissfft doesn't provide any buffering mechanisms either, and we don't want the user to need to know the internals of kissfft and our buffering technique in order to use this.

class FFT
{
public:

  FFT()
  {
    m_fft_cfg =
      kiss_fftr_alloc(
        FFT_WINDOW_SIZE, // 1024, 2048, 4096, etc.
        0,               // 0 = forward FFT, 1 = inverse FFT
        nullptr,         // user-supplied buffer (we let kissfft allocate)
        nullptr );       // length of user-supplied buffer
  }

  ~FFT()
  {
    if ( m_fft_cfg )
      kiss_fftr_free( m_fft_cfg );
  }

  bool processBlock( const FPType * samples, const size_t szSamples )
  {
    m_sampleBuffer.emplace( std::span< const FPType >( samples, szSamples ) );
    return hasFrame();
  }

  bool hasFrame() const { return m_sampleBuffer.isBufferFull(); }

  std::span< const FPType > computeFFT()
  {
    // ensure our RingBuffer has enough data
    if ( !hasFrame() ) return {};

    // this is all we need to get the real & imaginary parts
    // from the library
    kiss_fftr( m_fft_cfg, m_sampleBuffer.getBuffer(), m_fftOut.data() );

    // now we need to get the magnitude
    setMagnitude();

    // return our result
    return m_magnitude;
  }

private:

  void setMagnitude()
  {
    // this is the inverse FFT normalization method
    const FPType invN = FPType { 1 } / static_cast< FPType >( FFT_WINDOW_SIZE );

    for ( size_t i = 0; i < FFT_BINS; ++i )
    {
      // the following real & imag are what kissfft gets us
      const FPType r  = m_fftOut[ i ].r;
      const FPType im = m_fftOut[ i ].i;

      // now apply the magnitude formula
      FPType mag = std::sqrt( r * r + im * im ) * invN;

      // one-sided amplitude correction (except for DC & Nyquist)
      if ( i != 0 && i != ( FFT_BINS - 1 ) ) mag *= FPType { 2 };

      // save the magnitude
      m_magnitude[ i ] = mag;
    }
  }

private:

  static constexpr size_t FFT_BINS = FFT_WINDOW_SIZE / 2 + 1;

  // our kissfft instance
  kiss_fftr_cfg m_fft_cfg { nullptr };

  // this contains our FFT bin data
  std::array< kiss_fft_cpx, FFT_BINS > m_fftOut;

  // this is the result we'll hand back out that contains our magnitude
  std::array< FPType, FFT_BINS > m_magnitude {};

  // this is the RingBuffer we implemented above
  // (e.g., 32 segments of 128 samples with a hop of 16 = 50% overlap)
  RingBuffer< FPType, 32 > m_sampleBuffer { 128, 16 };
};

Test Outputs

Let's have a look at some outputs now that we've gotten everything connected:

Case a440 { 2048, 440.0, 44100.0, 128, 5.0, 0.1 };

We have a pure A440 sine wave running for 5 seconds with a process block size of 128 and an FFT window size of 2048. Any FFT bin whose magnitude is >= the magnitudeThreshold (0.1 here) will get printed. Looking at the following results, we can see how much leakage there is given our small process block size and FFT window size.

Hz     : Magnitude
387.220: 0.108163
408.732: 0.179408
430.244: 0.580069
451.756: 0.434474
473.268: 0.15385

a440-fft-2048-window-128-block.png

That's an incredible amount of leakage that will be fun to play with. You can see how that would be undesirable, however, in different contexts. So if we adjust our FFT window to 4096 with the same block size, we get:

Hz     : Magnitude
430.454: 0.118272
441.215: 0.777753

a440-fft-4096-window-128-block.png

That's a big difference, because the number of bins has increased from 2048 / 2 + 1 = 1025 to 4096 / 2 + 1 = 2049. We've managed to reduce leakage by narrowing the width of each bin across the frequency spectrum.


If you made it this far, then thanks for reading. Even though it's a bit of a departure from Graphics and VST3 talk, I've been having a blast thinking about this stuff and trying to design it.