Elman RNN: Same connection weights as MLP, plus a recurrent layer that stores previous hidden outputs and feeds them back as additional inputs.
LSTM: 4 gate computations (input, forget, and output gates plus the cell candidate) multiply the connection weights by 4x, plus recurrent state and cell memory.
GRU: 3 gate computations (update and reset gates plus the candidate state) multiply the connection weights by 3x, plus recurrent state. ~20% smaller than LSTM.
KAN: Each edge stores GridSize + SplineDegree B-spline coefficients (6 per edge with G=5, k=1), plus a base weight and a spline weight. Training adds a gradient, a delta weight, and a previous delta weight for every learnable parameter.
All architectures remain well under 1.2 KB in Q8.8 fixed-point even in trainable form, making them suitable for embedded deployment.
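To make the trainable-form multiplier concrete, here is a minimal sketch of the per-parameter record it implies, assuming Q8.8 is stored as a 16-bit integer; the type and field names are illustrative, not the library's actual API.

```cpp
#include <cstdint>

// Illustrative only: in trainable form each learnable parameter carries a
// gradient, a delta weight, and the previous delta weight (e.g. for
// momentum), quadrupling per-parameter storage relative to inference-only.
using q8_8 = std::int16_t;  // Q8.8: 8 integer bits, 8 fractional bits

struct TrainableParam {
    q8_8 weight;      // the parameter itself
    q8_8 gradient;    // accumulated error gradient
    q8_8 delta;       // current update step
    q8_8 prev_delta;  // previous update step, kept for momentum
};

static_assert(sizeof(TrainableParam) == 8, "four 2-byte Q8.8 values");
```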
Signal Processing Pipeline Sizes
Instance sizes in bytes for Conv1D, Pool1D, and Dropout layers. These are standalone composable layers that sit outside the neural network template.
MaxPool1D
MaxPool1D stores argmax indices (one size_t per output) for backpropagation gradient routing. Its size is independent of the value type.
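A hedged, single-channel sketch of what that bookkeeping looks like in practice; the function names and signatures are illustrative, not the library's API:

```cpp
#include <cstddef>

// Illustrative sketch: the forward pass records the winning input index per
// output; backprop then routes each output gradient to exactly that input.
void maxpool1d_forward(const float* in, std::size_t in_len,
                       std::size_t pool, std::size_t stride,
                       float* out, std::size_t* argmax) {
    std::size_t o = 0;
    for (std::size_t i = 0; i + pool <= in_len; i += stride, ++o) {
        std::size_t best = i;
        for (std::size_t j = i + 1; j < i + pool; ++j)
            if (in[j] > in[best]) best = j;
        out[o] = in[best];
        argmax[o] = best;  // one size_t per output, regardless of value type
    }
}

void maxpool1d_backward(const float* grad_out, std::size_t out_len,
                        const std::size_t* argmax, float* grad_in) {
    for (std::size_t o = 0; o < out_len; ++o)
        grad_in[argmax[o]] += grad_out[o];  // only the max position gets gradient
}
```

On a 64-bit host, 192 pooled outputs times 8 bytes per size_t is exactly the 1,536 bytes in the pipeline table below.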
AvgPool1D
AvgPool1D is stateless (1 byte, the minimum C++ object size). It has no per-instance storage since average gradients are computed directly from the pool size.
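The contrast with MaxPool1D is easiest to see in the backward pass; a minimal single-channel sketch with illustrative naming:

```cpp
#include <cstddef>

// Illustrative sketch: every input in a window receives an equal 1/pool share
// of the output gradient, so nothing needs to be remembered from the forward
// pass and the layer can stay stateless.
void avgpool1d_backward(const float* grad_out, std::size_t out_len,
                        std::size_t pool, std::size_t stride, float* grad_in) {
    for (std::size_t o = 0; o < out_len; ++o)
        for (std::size_t j = 0; j < pool; ++j)
            grad_in[o * stride + j] += grad_out[o] / static_cast<float>(pool);
}
```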
Dropout
| Configuration | Size (bytes) |
| --- | --- |
| Dropout (192 elements, 50%) | 193 |
| Dropout (32 elements, 50%) | 33 |
| Dropout (5 elements, 50%) | 6 |
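The N+1 pattern in these sizes suggests one mask byte per element plus a single byte of fixed state. Below is a sketch of a layout consistent with those numbers; the field semantics are our assumption, not the library's documented layout:

```cpp
#include <cstddef>
#include <cstdint>

// Assumed layout, consistent with the N+1 sizes above (not confirmed by the
// library): one keep/drop mask byte per element plus one byte of fixed state.
template <std::size_t N>
struct Dropout1D {
    std::uint8_t mask[N];  // 1 = keep, 0 = drop; reused by backprop
    std::uint8_t state;    // e.g. the drop rate or an enabled flag (assumed)
};

static_assert(sizeof(Dropout1D<192>) == 193, "matches the table above");
static_assert(sizeof(Dropout1D<32>) == 33, "");
static_assert(sizeof(Dropout1D<5>) == 6, "");
```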
Full Pipeline: Conv1D -> MaxPool1D -> Dropout
| Value Type | Conv1D (100, k=5, s=1, 4 filters) | MaxPool1D (96, p=2, s=2, 4 ch) | Dropout (192, 50%) | Total |
| --- | --- | --- | --- | --- |
| double | 384 bytes | 1,536 bytes | 193 bytes | 2,113 bytes |
| Q8.8 | 96 bytes | 1,536 bytes | 193 bytes | 1,825 bytes |
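These totals follow directly from the per-layer rules above; here is a hedged compile-time check, with helper names that are ours rather than the library's (assumes a 64-bit size_t and weights-plus-gradients storage for Conv1D):

```cpp
#include <cstddef>

// Hypothetical size formulas matching the table above.
constexpr std::size_t conv1d_bytes(std::size_t k, std::size_t ch_in,
                                   std::size_t filters, std::size_t elem) {
    // (k * ch_in * filters weights + filters biases), doubled for gradients
    return (k * ch_in * filters + filters) * 2 * elem;
}
constexpr std::size_t maxpool1d_bytes(std::size_t outputs) {
    return outputs * sizeof(std::size_t);  // one argmax index per output
}
constexpr std::size_t dropout_bytes(std::size_t elements) {
    return elements + 1;                   // mask byte per element, plus one
}

static_assert(conv1d_bytes(5, 1, 4, sizeof(double)) == 384, "double row");
static_assert(conv1d_bytes(5, 1, 4, 2) == 96, "Q8.8 row (2-byte values)");
static_assert(maxpool1d_bytes(48 * 4) == 1536, "48 outputs x 4 channels");
static_assert(dropout_bytes(192) == 193, "");
```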
2D Layers
All 2D layers use NHWC layout. Sizes below are in float (the type used by the examples/kws_cortex_m/ runner). Q8.8 sizes are ~4x smaller.
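As a reminder of what NHWC means for a single instance (batch size 1), the channel index varies fastest; a one-line index helper, with our naming:

```cpp
#include <cstddef>

// NHWC with N = 1 collapses to HWC: channels fastest, then width, then height.
constexpr std::size_t hwc_offset(std::size_t h, std::size_t w, std::size_t c,
                                 std::size_t W, std::size_t C) {
    return (h * W + w) * C + c;
}
```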
| Configuration | float |
| --- | --- |
| Conv2D (20x20x1, k=3, s=1, 8 filters) | 640 bytes |
| DepthwiseConv2D (9x9x8, k=3) | 640 bytes |
| PointwiseConv2D (7x7, 8 -> 16) | 1,152 bytes |
| PointwiseConv2D (1x1, 16 -> 10; dense classifier) | 1,360 bytes |
| MaxPool2D (18x18x8, pool=2, stride=2) | 5,184 bytes |
| GlobalAvgPool2D (7x7x16) | 1 byte |
Conv2D / DepthwiseConv2D / PointwiseConv2D store both weights and gradients (2x overhead) to support on-device training. MaxPool2D stores one size_t argmax index per output. GlobalAvgPool2D is stateless.
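Two rows of the table, worked out under those rules (float = 4 bytes, 64-bit size_t); the constant names are ours:

```cpp
#include <cstddef>

// Conv2D (20x20x1, k=3, 8 filters): 3*3*1*8 weights + 8 biases = 80 params,
// doubled for gradients, 4 bytes each.
constexpr std::size_t conv2d_row = (3 * 3 * 1 * 8 + 8) * 2 * 4;
static_assert(conv2d_row == 640, "matches the Conv2D row");

// MaxPool2D (18x18x8, pool=2, stride=2): 9x9x8 = 648 outputs, one size_t each.
constexpr std::size_t maxpool2d_row = 9 * 9 * 8 * sizeof(std::size_t);
static_assert(maxpool2d_row == 5184, "matches the MaxPool2D row");
```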
Running the pipeline in examples/kws_cortex_m/ (20x20x1 input, 10 output classes, all float):
| Section | Bytes |
| --- | --- |
| Total weight + state | 8,976 |
| Peak activation buffer | 10,368 (first conv output) |
Swapping the layer value type from float to Q8.8 roughly quarters the Conv / DW / PW / Dense storage and halves each activation buffer. The MaxPool2D argmax array is size_t-indexed and independent of value type; on a tight MCU target it becomes the dominant term, making it the best candidate for the next round of footprint work.
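For concreteness, Q8.8 here denotes a 16-bit fixed-point value with 8 fractional bits; a minimal sketch of the representation and its basic arithmetic (our naming, not the library's API):

```cpp
#include <cstdint>

// Q8.8: int16_t with 8 fractional bits, so resolution 1/256 and range
// [-128, 128). Two bytes per value versus four for float is what halves
// each activation buffer when the value type is swapped.
using q8_8 = std::int16_t;

constexpr q8_8 from_float(float x) { return static_cast<q8_8>(x * 256.0f); }
constexpr float to_float(q8_8 x)   { return static_cast<float>(x) / 256.0f; }

// Fixed-point multiply: the raw product carries 16 fractional bits, so it is
// widened to 32 bits and shifted back down to 8.
constexpr q8_8 mul(q8_8 a, q8_8 b) {
    return static_cast<q8_8>((static_cast<std::int32_t>(a) * b) >> 8);
}
```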