FFT1D – Fixed-Point Fast Fourier Transform

Tinymind provides FFT1D, a standalone radix-2 decimation-in-time FFT for power-of-two lengths. It is designed for embedded signal processing where frequency-domain features feed into a neural network classifier. All dimensions are compile-time template parameters with zero dynamic allocation.

Why FFT for Embedded ML?

Many embedded ML tasks involve classifying signals that are better separated in the frequency domain than in the time domain:

Vibration analysis – bearing fault frequencies, motor harmonics
Audio keyword spotting – spectral features before a small classifier
IMU/accelerometer – gait analysis, gesture recognition via frequency signatures
ECG/PPG – heart rate extraction, arrhythmia frequency patterns

A time-domain signal fed directly into a neural network requires the network to learn the Fourier basis – wasting capacity. Computing the FFT first gives the network frequency bins as input, dramatically reducing the model size needed for the same accuracy.

Template Declaration

template<typename ValueType, size_t N>
class FFT1D

Template Parameters:

ValueType – Numeric type (QValue, float, or double)
N – FFT length (must be a power of two, >= 2)

Static Properties:

Length = N
HalfLength = N / 2
NumStages = log2(N)

Compile-Time Constraints:

static_assert((N & (N - 1)) == 0, "FFT length must be a power of two.");
static_assert(N >= 2, "FFT length must be at least 2.");

Design for Fixed-Point

Three design choices make FFT1D practical on MCUs without an FPU:

1. Scaled Butterfly

Each butterfly stage divides results by 2, giving a total scaling of 1/N across the forward transform. This prevents intermediate overflow in fixed-point arithmetic – a Q8.8 value can never exceed the representable range during computation, regardless of input magnitude.

The inverse FFT uses unscaled butterflies internally, so inverse(forward(x)) == x with no additional division step.

2. Compile-Time Bit-Reversal Table

The radix-2 FFT requires a bit-reversal permutation before the butterfly stages. On ARM Cortex-M0, which lacks a barrel shifter (each shift is a single-bit operation), computing bit-reversal at runtime is expensive.

Since N is a template parameter, FFT1D generates the entire permutation table at compile time using constexpr:

namespace detail {
    template<size_t N, size_t NumStages>
    struct BitReversalTable
    {
        size_t indices[N];
        constexpr BitReversalTable() : indices{}
        {
            for (size_t i = 0; i < N; ++i)
            {
                size_t x = i;
                size_t result = 0;
                for (size_t b = 0; b < NumStages; ++b)
                {
                    result = (result << 1) | (x & 1);
                    x >>= 1;
                }
                indices[i] = result;
            }
        }
    };
}

The table is stored in flash as a constant array. At runtime, the permutation is a linear scan with swaps – no shifts at all.

FFT Size	Table Size (bytes)	Shifts Eliminated
64	256	~384
128	512	~896
256	1,024	~2,048

3. Externally-Provided Twiddle Factors

The FFT twiddle factors are cos(-2*pi*k/N) and sin(-2*pi*k/N). For float/double, these can be computed with std::cos/std::sin. For Q-format, there are no built-in trig functions, so the values must be pre-computed externally and loaded via setTwiddleFactors().

Tinymind includes sin/cos lookup tables generated by the activation table generator (see Activation Lookup Tables), or you can compute exact twiddle values offline and embed them in your application.

Example: Floating-Point

#include "fft1d.hpp"
#include <cmath>

const size_t N = 256;
tinymind::FFT1D<double, N> fft;

// Compute twiddle factors
double cosTable[N / 2], sinTable[N / 2];
for (size_t k = 0; k < N / 2; ++k)
{
    cosTable[k] = std::cos(-2.0 * M_PI * k / N);
    sinTable[k] = std::sin(-2.0 * M_PI * k / N);
}
fft.setTwiddleFactors(cosTable, sinTable);

// Transform a signal
double real[N] = { /* sensor samples */ };
double imag[N] = {}; // zero for real-valued input

fft.forward(real, imag);

// Compute power spectrum (no sqrt needed)
double magSq[N];
tinymind::FFT1D<double, N>::magnitudeSquared(real, imag, magSq);

// Feed first N/2 bins into a classifier
classifier.feedForward(magSq);

Example: Fixed-Point (Q8.8)

#include "fft1d.hpp"

typedef tinymind::QValue<8, 8, true, tinymind::RoundUpPolicy> Q8_8;
const size_t N = 64;
tinymind::FFT1D<Q8_8, N> fft;

// Pre-computed twiddle factors for N=64
// cos(-2*pi*k/64) and sin(-2*pi*k/64) for k=0..31
// These can be generated offline and stored in flash
Q8_8 cosTable[N / 2];
Q8_8 sinTable[N / 2];

// Example: k=0 -> cos(0)=1, sin(0)=0
cosTable[0] = Q8_8(1, 0);
sinTable[0] = Q8_8(0, 0);
// k=1 -> cos(-pi/32)=0.995, sin(-pi/32)=-0.098
cosTable[1] = Q8_8(0, 254); // ~0.995
sinTable[1] = Q8_8(-1, 231); // ~-0.098
// ... remaining values pre-computed similarly ...

fft.setTwiddleFactors(cosTable, sinTable);

Q8_8 real[N]; // fill with sensor data
Q8_8 imag[N]; // zero-initialized for real input
for (size_t i = 0; i < N; ++i)
    imag[i] = Q8_8(0);

fft.forward(real, imag);

// Magnitude squared avoids sqrt (unavailable in fixed-point)
Q8_8 magSq[N];
tinymind::FFT1D<Q8_8, N>::magnitudeSquared(real, imag, magSq);

Pipeline: FFT + Neural Network

A typical embedded ML pipeline: sample sensor data, compute FFT, classify frequency bins:

#include "fft1d.hpp"
#include "neuralnet.hpp"

// 128-point FFT -> 64 magnitude bins -> small classifier
typedef tinymind::FFT1D<double, 128> FFTType;
// Neural network: 64 inputs (N/2 useful bins) -> 16 hidden -> 4 outputs
// (Configure via properties struct as usual)

FFTType fft;
// ... set twiddle factors ...

double sensorData[128];
double real[128], imag[128];
double magSq[128];
double features[64]; // first N/2 bins (Nyquist symmetry)

// Sample -> FFT -> classify
for (size_t i = 0; i < 128; ++i)
{
    real[i] = sensorData[i];
    imag[i] = 0.0;
}

fft.forward(real, imag);
FFTType::magnitudeSquared(real, imag, magSq);

// Only first N/2 bins are unique (real input symmetry)
for (size_t i = 0; i < 64; ++i)
    features[i] = magSq[i];

classifier.feedForward(features);

Inverse FFT

The inverse transform recovers time-domain data from frequency-domain representation:

fft.forward(real, imag);
// ... modify spectrum (filtering, denoising) ...
fft.inverse(real, imag);
// real[] now contains the filtered time-domain signal

The inverse uses the conjugate trick internally: IFFT(X) = conj(DFT(conj(X))). It uses unscaled butterflies so that inverse(forward(x)) == x – no manual scaling needed.

Memory Footprint

All storage is statically allocated. The FFT object stores only the twiddle factor tables:

Component	Count	Purpose
Twiddle cosines	N/2	cos(-2pik/N)
Twiddle sines	N/2	sin(-2pik/N)
Bit-reversal table	N	Compile-time, stored in flash

Working memory (caller-provided arrays):

Array	Count	Purpose
real[]	N	Real parts (input/output)
imag[]	N	Imaginary parts (input/output)
magSq[]	N	Magnitude squared (optional)

Example Footprints

FFT Size	ValueType	Object RAM	Working RAM	Flash (bit-rev table)	Total
64	Q8.8	128 B	384 B	256 B	768 B
128	Q8.8	256 B	768 B	512 B	1,536 B
256	Q8.8	512 B	1,536 B	1,024 B	3,072 B
64	Q16.16	256 B	768 B	256 B	1,280 B
128	Q16.16	512 B	1,536 B	512 B	2,560 B

A 64-point Q8.8 FFT + a small 32-input classifier fits comfortably in 2 KB of RAM – feasible on any Cortex-M0 device.

Q-Format Precision Considerations

Twiddle Factor Precision

Twiddle factors are sine/cosine values in [-1, 1]. Q8.8 gives 8 fractional bits (~0.004 resolution), which introduces ~0.2% error per butterfly. After log2(N) stages:

FFT Size	Stages	Approximate Precision Loss
64	6	~1-2 fractional bits
128	7	~2-3 fractional bits
256	8	~3 fractional bits

Recommendation: For FFT sizes above 128, use Q16.16 for better precision. Q8.8 is best for N <= 128.

Scaled Butterfly Overflow Safety

The divide-by-2 scaling at each stage means intermediate values never exceed the input range. A Q8.8 value with fixed part in [-128, 127] stays within bounds regardless of FFT size. This eliminates the need for saturation checks in the butterfly – the WrapPolicy is sufficient.

ARM Cortex-M0 Performance Estimates

For Q8.8 with single-cycle 32-bit MUL:

FFT Size	Butterfly MACs	Estimated Cycles	Time @ 48 MHz
64	192	~4,000	~83 us
128	448	~9,500	~198 us
256	1,024	~22,000	~458 us

These estimates include twiddle multiplication (2 multiplies + 2 adds per butterfly) plus bit-reversal permutation (table lookup, no shifts). Real-time processing at 1 kHz sample rate is feasible up to 256-point FFTs on a 48 MHz M0.

Use Cases

Vibration monitoring – compute FFT of accelerometer data, classify bearing fault frequencies with a small Q8.8 network
Audio feature extraction – FFT -> magnitude bins -> keyword spotting classifier, replacing MFCC pipelines that need log/DCT
Motor current analysis – detect harmonic patterns indicating motor degradation
Structural health monitoring – frequency-domain features from strain gauges or piezoelectric sensors