A Keyword-Spotting CNN for a Cortex-M
Keyword spotting (KWS) is the workload that put the “TinyML” moniker on the map: always-on wake-word detection running on a sub-dollar microcontroller from a 100 mAh battery. The examples/kws_cortex_m/ example builds a KWS-style convolutional pipeline out of TinyMind’s 2D layers, measures per-layer cycles and bytes with the new bench harness, and ships a portable port stub so the same pipeline can be moved onto a real MCU.
This tutorial walks through the pipeline architecture, the layer composition in C++, how to read the benchmark CSV, and what changes when you move from a host run to a Cortex-M target.
Why Keyword Spotting?
KWS is a canonical TinyML workload because:
- Data shape is 2D. The front-end produces an MFCC (or log-mel) spectrogram: a time-by-frequency tile. A 1D pipeline over raw audio is feasible but wastes the rich 2D structure that small CNNs are good at.
- Latency budget is tight. A human expects the wake word to trigger within about 100 ms, so the CNN runs at 10 Hz – one inference per 100 ms window. Keeping the CPU duty cycle near 10% for battery life leaves each inference a ~10 ms compute budget: roughly 800 k cycles on a Cortex-M4 at 80 MHz.
- Flash and RAM are tiny. The reference MLPerf Tiny KWS model fits in ~50 KB flash and ~20 KB RAM.
The whole pipeline – including TinyMind’s training code – fits easily inside those budgets.
Pipeline Architecture
The example uses a MobileNet-style depthwise-separable block sandwiched between a small regular Conv2D front-end and a Global-Average-Pool + 1x1 dense classifier at the tail:
input [20 x 20 x 1] (synthetic MFCC-like tile)
-> Conv2D 3x3, 8 filters -> [18 x 18 x 8]
-> MaxPool2D 2x2 -> [9 x 9 x 8]
-> DepthwiseConv2D 3x3 -> [7 x 7 x 8]
-> PointwiseConv2D 8 -> 16 -> [7 x 7 x 16]
-> GlobalAvgPool2D -> [16]
-> PointwiseConv2D (dense) 16 -> 10 -> [10] (class logits)
Why this shape:
- Regular Conv2D first. The first convolution extracts generic edge/frequency-band features and is cheap enough that full cross-channel mixing is worth it.
- MaxPool halves the spatial dim. This keeps the activation volume small for the next stage.
- Depthwise-separable block. DepthwiseConv2D + PointwiseConv2D together replace a full Conv2D 8->16 block at a fraction of the MACs – roughly a 6x reduction for this K=3 block (see the worked comparison after this list).
- Global Average Pool + 1x1 dense. GAP collapses the 7x7x16 feature map to a 16-vector. A PointwiseConv2D<..., 1, 1, 16, 10> then acts as the final dense classifier. This combination replaces the big flatten-to-dense matrix that usually dominates flash on small CNNs.
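To make that MAC claim concrete, here is a small constexpr sketch at this pipeline’s dimensions. It is not part of the example’s source; the names are illustrative:

// MAC counts for the 7x7 stage: full convolution vs. depthwise-separable.
constexpr unsigned H = 7, W = 7, Cin = 8, Cout = 16, K = 3;

// Full Conv2D 8->16: every output pixel mixes all input channels.
constexpr unsigned fullConvMacs = H * W * Cout * K * K * Cin; // 56,448

// Depthwise 3x3 (one filter per channel) + pointwise 1x1 (8->16).
constexpr unsigned dwMacs = H * W * Cin * K * K;              // 3,528
constexpr unsigned pwMacs = H * W * Cin * Cout;               // 6,272

static_assert(fullConvMacs / (dwMacs + pwMacs) == 5,
              "depthwise-separable: ~5.8x fewer MACs at these dimensions");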
All sizes are compile-time template parameters. Change the using aliases at the top of kws_cortex_m.cpp and the rest of the pipeline follows automatically.
Declaring the Layer Types
Here is the layer type chain, directly from the example:
using Value = float;
using Conv1Type = tinymind::Conv2D<Value, 20, 20, 1, 3, 3, 1, 1, 8>;
using Pool1Type = tinymind::MaxPool2D<Value,
Conv1Type::OutputHeight,
Conv1Type::OutputWidth,
8, 2, 2, 2, 2>;
using DwType = tinymind::DepthwiseConv2D<Value,
Pool1Type::OutputHeight,
Pool1Type::OutputWidth,
8, 3, 3, 1, 1>;
using PwType = tinymind::PointwiseConv2D<Value,
DwType::OutputHeight,
DwType::OutputWidth,
8, 16>;
using GapType = tinymind::GlobalAvgPool2D<Value,
PwType::OutputHeight,
PwType::OutputWidth,
16>;
using DenseType = tinymind::PointwiseConv2D<Value, 1, 1, 16, 10>;
Each layer’s output dimensions feed the next layer’s input dimensions through compile-time constants like Conv1Type::OutputHeight. If you tweak the input size from 20x20 to 40x49 (a real MFCC tile), the rest of the chain recomputes itself.
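A few static_asserts (not in the shipped example) make that propagation visible and catch a bad resize at compile time. These hold for the 20x20x1 input above:

// Compile-time shape checks against the using aliases declared earlier.
static_assert(Conv1Type::OutputHeight == 18, "3x3 valid conv: 20 - 3 + 1");
static_assert(Pool1Type::OutputHeight == 9,  "2x2 pool, stride 2: 18 / 2");
static_assert(DwType::OutputHeight == 7,     "3x3 valid conv: 9 - 3 + 1");
static_assert(GapType::OutputSize == 16,     "GAP emits one value per channel");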
Static Allocation
Every buffer is statically allocated – no heap, no new:
Conv1Type gConv1;
Pool1Type gPool1;
DwType gDw;
PwType gPw;
GapType gGap;
DenseType gDense;
Value gInput[20 * 20 * 1];
Value gBufConv1[Conv1Type::OutputSize];
Value gBufPool1[Pool1Type::OutputSize];
Value gBufDw[DwType::OutputSize];
Value gBufPw[PwType::OutputSize];
Value gBufGap[GapType::OutputSize];
Value gBufDense[DenseType::OutputSize];
On an MCU, these land in .bss at link time. There’s no malloc, no RTOS dependency, and no possibility of a late-night out-of-memory in the field.
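Because every size is a compile-time constant, a RAM budget can be enforced at build time as well. A minimal sketch placed in the same file as the globals; the 32 KB figure is an illustrative target, not a shipped constraint:

// Fail the build if the static buffers ever outgrow the RAM budget.
constexpr size_t kLayerBytes =
    sizeof(gConv1) + sizeof(gPool1) + sizeof(gDw) +
    sizeof(gPw) + sizeof(gGap) + sizeof(gDense);
constexpr size_t kActivationBytes =
    sizeof(gBufConv1) + sizeof(gBufPool1) + sizeof(gBufDw) +
    sizeof(gBufPw) + sizeof(gBufGap) + sizeof(gBufDense);
static_assert(kLayerBytes + kActivationBytes < 32 * 1024,
              "KWS pipeline no longer fits the static RAM budget");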
Forward Pass with Per-Layer Timing
The bench harness wraps each forward() call with a cycle counter read. The exact sequence:
tinymind::bench::enableCycleCounter(); // one-time DWT init on Cortex-M
tinymind::bench::writeHeader(std::cout);
const auto t0 = tinymind::bench::readCycleCounter();
gConv1.forward(gInput, gBufConv1);
const auto t1 = tinymind::bench::readCycleCounter();
gPool1.forward(gBufConv1, gBufPool1);
const auto t2 = tinymind::bench::readCycleCounter();
gDw.forward(gBufPool1, gBufDw);
const auto t3 = tinymind::bench::readCycleCounter();
// ...continues through gPw, gGap, gDense
tinymind::bench::writeRow(std::cout,
{"conv2d_3x3_8", sizeof(gConv1), sizeof(gBufConv1), t1 - t0});
// ...one row per layer
On a host build readCycleCounter() returns elapsed nanoseconds from std::chrono::steady_clock. On a Cortex-M target compiled with -DTINYMIND_BENCH_CORTEX_M, the same call reads DWT->CYCCNT for true hardware cycle counts.
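On Cortex-M the counter setup is a few register writes. A minimal sketch of what the two calls can look like, assuming CMSIS core headers are available (the shipped bench header may differ in detail):

#if defined(TINYMIND_BENCH_CORTEX_M)
#include "core_cm4.h" // CMSIS definitions for DWT and CoreDebug

inline void enableCycleCounter()
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk; // enable the trace block
    DWT->CYCCNT = 0;                                // reset the counter
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;            // start counting cycles
}

inline uint32_t readCycleCounter()
{
    return DWT->CYCCNT; // wraps every 2^32 cycles (~53 s at 80 MHz)
}
#endif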
Building and Running
cd examples/kws_cortex_m
make release
make run
Sample output on a typical x86 host (units shown as “cycles” are nanoseconds on the host):
name,weight_bytes,activation_bytes,cycles
conv2d_3x3_8,640,10368,28185
maxpool2d_2x2,5184,2592,716
dwconv2d_3x3,640,1568,1018
pwconv2d_8x16,1152,3136,1788
global_avgpool2d,0,64,61
dense_16x10,1360,40,79
Summary:
total weight bytes : 8976
peak activation bytes : 10368
total inference cycles : 31847
The weights are random, so the predicted class is meaningless – the point is the cycle profile and the footprint report.
Reading the CSV
- weight_bytes includes both the trained weights and the gradient buffer used for on-device training. If you only need inference, roughly half this number disappears with a future inference-only layer variant.
- activation_bytes is the size of the layer’s output buffer. On a tight MCU build the activation buffers can overlap (layer N’s input buffer is reusable as layer N+1’s output once its values have been consumed) – only the peak activation needs to be provisioned. For this pipeline the peak is the first conv’s ~10 KB output; a sketch of that overlap follows below.
- cycles is nanoseconds on the host and hardware cycle counts from DWT on Cortex-M.
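That overlap is a standard ping-pong arrangement. A minimal sketch, not how the shipped example is written (it keeps one buffer per layer for clarity); gScratchA, gScratchB, and forwardPingPong are illustrative names:

// Two shared scratch buffers. A holds the conv1/dw/gap outputs (largest:
// conv1 at 2,592 floats); B holds the pool1/pw outputs (largest: pw at 784).
static Value gScratchA[Conv1Type::OutputSize];
static Value gScratchB[PwType::OutputSize];

void forwardPingPong(const Value* input, Value* logits)
{
    gConv1.forward(input, gScratchA);      // input -> A
    gPool1.forward(gScratchA, gScratchB);  // A -> B (A is now reusable)
    gDw.forward(gScratchB, gScratchA);     // B -> A
    gPw.forward(gScratchA, gScratchB);     // A -> B
    gGap.forward(gScratchB, gScratchA);    // B -> A (16 values)
    gDense.forward(gScratchA, logits);     // A -> caller-provided logits
}

At float32 this provisions 13,504 bytes of activation memory instead of the 17,768 bytes of the one-buffer-per-layer layout.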
On a Cortex-M4 at 80 MHz, 800 k cycles = 10 ms of wall time. For this pipeline size, a reasonable target is well under that budget even in pure C++.
Footprint Summary
| Component | Bytes (float32) |
|---|---|
| All layer weights + gradients | 8,976 |
| Peak activation buffer | 10,368 |
| Total static allocation | ~19 KB |
That fits in the RAM of a mid-range Cortex-M4 such as an STM32L4 (64-128 KB of SRAM) several times over. Swapping the value type to a 16-bit Q8.8 would roughly halve both the weight and activation terms (16-bit storage versus 32-bit float); the MaxPool2D argmax array (used for backprop) is the next-largest term and is value-type-independent, so it does not shrink.
Porting to a Real Cortex-M
The example ships a vendor-neutral port_stub.hpp with three functions to fill in:
namespace kws_port {
bool readMicSample(int16_t& sample); // microphone front-end
void putChar(char c); // UART TX (for CSV output)
void platformInit(); // DWT init, UART baud, etc.
}
Plus a minimal UartSink that implements the operator<< overloads the bench harness needs, without pulling in <iostream>:
struct UartSink {
UartSink& operator<<(const char* s);
UartSink& operator<<(size_t v);
UartSink& operator<<(uint32_t v);
};
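A minimal sketch of those overloads built on kws_port::putChar (assuming port_stub.hpp is in scope; the shipped stub may differ):

#include <cstddef>
#include <cstdint>
#include "port_stub.hpp" // kws_port::putChar

struct UartSink {
    UartSink& operator<<(const char* s)
    {
        while (*s) { kws_port::putChar(*s++); }
        return *this;
    }
    UartSink& operator<<(uint32_t v)
    {
        char digits[10]; // 2^32 - 1 has at most 10 decimal digits
        int n = 0;
        do { digits[n++] = static_cast<char>('0' + (v % 10)); v /= 10; } while (v != 0);
        while (n > 0) { kws_port::putChar(digits[--n]); } // most-significant first
        return *this;
    }
    UartSink& operator<<(size_t v)
    {
        return *this << static_cast<uint32_t>(v); // byte counts fit in 32 bits here
    }
};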
Build with -DTINYMIND_BENCH_CORTEX_M and point the bench harness at a UartSink instead of std::cout. The model pipeline itself is unchanged.
Replacing the synthetic input with a real MFCC extractor is a short hop: TinyMind already ships cpp/fft1d.hpp with sin/cos tables for fixed-point FFTs – that’s the expensive part of MFCC. The remaining mel-filterbank + log step is straightforward to add.
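As a sketch of that remaining step (the filter-weight layout and names here are illustrative, not TinyMind APIs): each mel bin is a triangular-weighted sum over the FFT power spectrum, followed by a log:

#include <cmath>
#include <cstddef>

// Illustrative mel-filterbank + log step over an FFT power spectrum.
// Real implementations exploit the sparsity of the triangular filters;
// this dense version just shows the math.
template<size_t NumBins, size_t NumMels>
void melLogEnergies(const float (&powerSpectrum)[NumBins],
                    const float (&filterWeights)[NumMels][NumBins],
                    float (&logMel)[NumMels])
{
    for (size_t m = 0; m < NumMels; ++m)
    {
        float acc = 0.0f;
        for (size_t k = 0; k < NumBins; ++k)
        {
            acc += filterWeights[m][k] * powerSpectrum[k];
        }
        logMel[m] = std::log(acc + 1e-6f); // epsilon guards log(0) on silence
    }
}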
Next Steps
- Train a model with real data. Google’s Speech Commands dataset is the canonical KWS training set. Train in PyTorch, export weights via the pattern in examples/pytorch/, and load via setFilterWeight / setChannelWeight.
- Resize the model. A real KWS CNN uses a ~40x49 MFCC tile and 4-5 depthwise-separable blocks. Swap the using aliases and the compile-time constants do the rest.
- Add activation + batch norm. The raw layers in this tutorial don’t include an activation function or batch norm between them – they are there to measure the linear-algebra cost. For a real model, insert a ReLU (an element-wise loop over each activation buffer – see the sketch after this list) and a BatchNorm2D (not yet shipped – same pattern as BatchNorm1D) between blocks.
- Try Q-format. The same pipeline works unchanged with typedef tinymind::QValue<16, 16, true> Value;. On a part without an FPU you’ll get a large speedup; note that Q16.16 is still 32 bits wide, so for the memory savings described above you want a 16-bit format such as Q8.8.
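The ReLU mentioned above is a short element-wise loop. A minimal sketch, applied in place to any of the activation buffers (relu is an illustrative name, not a TinyMind API):

// In-place ReLU over an activation buffer, e.g. relu(gBufConv1).
// ValueType(0) assumes the value type is constructible from an integer,
// which holds for float and for typical Q-format types.
template<typename ValueType, size_t N>
void relu(ValueType (&buffer)[N])
{
    for (size_t i = 0; i < N; ++i)
    {
        if (buffer[i] < ValueType(0))
        {
            buffer[i] = ValueType(0); // clamp negatives to zero
        }
    }
}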