# Network Instance Size Comparison

## All Architectures: MLP vs LSTM vs GRU vs KAN

### Using double (2->N->1)

Instance sizes in bytes for XOR network configurations using double as the value type.

| Architecture | Hidden Neurons | Trainable | Non-trainable | Training Overhead |
|---|---|---|---|---|
| MLP (2->5->1) | 5 | 1,008 bytes | 360 bytes | +648 bytes (+180%) |
| Elman RNN (2->3->1) | 3 | 1,056 bytes | 384 bytes | +672 bytes (+175%) |
| LSTM (2->3->1) | 3 | 3,024 bytes | 960 bytes | +2,064 bytes (+215%) |
| GRU (2->3->1) | 3 | 2,400 bytes | 792 bytes | +1,608 bytes (+203%) |
| KAN (2->5->1, G=5, k=1) | 5 | 4,208 bytes | 1,256 bytes | +2,952 bytes (+235%) |

### Using Q8.8 Fixed-Point (XOR Configuration)

Instance sizes in bytes for XOR network configurations using QValue<8,8,true> (signed Q8.8 fixed-point, 2 bytes per value).

| Architecture | Hidden Neurons | Trainable | Non-trainable | Training Overhead |
|---|---|---|---|---|
| MLP (2->3->1) | 3 | 328 bytes | 144 bytes | +184 bytes (+128%) |
| Elman RNN (2->3->1) | 3 | 472 bytes | 192 bytes | +280 bytes (+146%) |
| LSTM (2->3->1) | 3 | 952 bytes | 384 bytes | +568 bytes (+148%) |
| GRU (2->3->1) | 3 | 808 bytes | 336 bytes | +472 bytes (+140%) |
| KAN (2->5->1, G=5, k=1) | 5 | 1,192 bytes | 416 bytes | +776 bytes (+187%) |
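Where the 2-bytes-per-value figure comes from: a minimal sketch of a signed Q8.8 type (illustration only; the library's QValue<8,8,true> carries more functionality):

```cpp
#include <cstdint>

// Minimal signed Q8.8 sketch: 8 integer bits, 8 fractional bits,
// stored in a single int16_t (2 bytes).
struct Q8_8 {
    std::int16_t raw; // value = raw / 256.0

    static constexpr Q8_8 from_double(double v) {
        return Q8_8{static_cast<std::int16_t>(v * 256.0)};
    }
    constexpr double to_double() const { return raw / 256.0; }
};

static_assert(sizeof(Q8_8) == 2, "one Q8.8 value occupies 2 bytes");
```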

### Relative Size (vs MLP)

| Ratio | Trainable | Non-trainable |
|---|---|---|
| **double** | | |
| Elman / MLP | 1.0x | 1.1x |
| LSTM / MLP | 3.0x | 2.7x |
| GRU / MLP | 2.4x | 2.2x |
| KAN / MLP | 4.2x | 3.5x |
| **Q8.8** | | |
| Elman / MLP | 1.4x | 1.3x |
| LSTM / MLP | 2.9x | 2.7x |
| GRU / MLP | 2.5x | 2.3x |
| KAN / MLP | 3.6x | 2.9x |

### GRU vs LSTM

| Ratio | Trainable | Non-trainable |
|---|---|---|
| double: GRU / LSTM | 79% | 83% |
| Q8.8: GRU / LSTM | 85% | 88% |

GRU uses 3 gates (update, reset, candidate) versus LSTM's 4 gates (input, forget, output, cell candidate), saving roughly 12-21% of instance memory per the ratios above, depending on value type and whether training state is included.
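The gate-count argument can be made concrete with a little parameter arithmetic (the helper below is illustrative, not the library's internals):

```cpp
#include <cstddef>

// Illustrative gate-parameter arithmetic: each gate of a recurrent cell
// owns one weight per input connection, one per recurrent connection,
// and one bias.
constexpr std::size_t gate_params(std::size_t inputs, std::size_t hidden) {
    return hidden * (inputs + hidden + 1);
}

// XOR configuration: 2 inputs, 3 hidden neurons.
constexpr std::size_t lstm_params = 4 * gate_params(2, 3); // 4 gates -> 72
constexpr std::size_t gru_params  = 3 * gate_params(2, 3); // 3 gates -> 54

static_assert(lstm_params == 72 && gru_params == 54, "");

// Gate weights alone give a 3/4 ratio (25% savings); shared state such
// as hidden outputs and per-neuron bookkeeping dilutes this to the
// observed 12-21% whole-instance savings.
```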

### Why Each Architecture is Larger

- **MLP**: One weight per connection. Minimal storage.
- **Elman RNN**: Same connection weights as MLP, plus a recurrent layer that stores previous hidden outputs and feeds them back as additional inputs.
- **LSTM**: 4 gates (input, forget, output, cell) multiply connection weights by 4x, plus recurrent state and cell memory.
- **GRU**: 3 gates (update, reset, candidate) multiply connection weights by 3x, plus recurrent state. ~20% smaller than LSTM.
- **KAN**: Each edge stores B-spline coefficients (GridSize + SplineDegree = 6 per edge with G=5, k=1), plus a base weight and spline weight. Training adds gradient, delta weight, and previous delta weight for every learnable parameter (see the sketch after this list).
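A rough reconstruction of the KAN figure under the storage rules above (illustrative arithmetic, assuming fully connected KAN layers):

```cpp
#include <cstddef>

// Rough per-edge accounting for the KAN figures (illustrative only).
constexpr std::size_t grid_size       = 5;                          // G
constexpr std::size_t spline_degree   = 1;                          // k
constexpr std::size_t coeffs_per_edge = grid_size + spline_degree;  // 6

// Each edge: 6 B-spline coefficients + 1 base weight + 1 spline weight.
constexpr std::size_t params_per_edge = coeffs_per_edge + 2;        // 8

// XOR topology 2->5->1 has 2*5 + 5*1 = 15 edges.
constexpr std::size_t edges            = 2 * 5 + 5 * 1;
constexpr std::size_t inference_params = edges * params_per_edge;   // 120

// Training adds gradient, delta weight, and previous delta weight per
// learnable parameter, i.e. roughly 4x the per-parameter storage.
constexpr std::size_t training_params = 4 * inference_params;       // 480
static_assert(training_params == 480, "");

// At 8 bytes per double that is ~3,840 bytes of parameter storage,
// in the same ballpark as the 4,208-byte trainable figure; the
// remainder is activations and other per-instance state.
```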

All architectures fit in roughly 1.2 KB or less in Q8.8 fixed-point, even in trainable form, making them suitable for embedded deployment.

## Signal Processing Pipeline Sizes

Instance sizes in bytes for Conv1D, Pool1D, and Dropout layers. These are standalone composable layers that sit outside the neural network template.

### Conv1D

| Configuration | double | Q8.8 |
|---|---|---|
| Conv1D (100, kernel=5, stride=2, 8 filters) | 768 bytes | 192 bytes |
| Conv1D (100, kernel=5, stride=1, 4 filters) | 384 bytes | 96 bytes |

### MaxPool1D

| Configuration | Size (bytes) |
|---|---|
| MaxPool1D (96 input, pool=2, stride=2, 4 channels) | 1,536 bytes |
| MaxPool1D (48 input, pool=2, stride=2, 1 channel) | 192 bytes |
| MaxPool1D (6 input, pool=2, stride=2, 1 channel) | 24 bytes |

MaxPool1D stores argmax indices (size_t per output) for backpropagation gradient routing. Size is independent of value type.
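A minimal sketch of why the argmax storage scales this way (hypothetical class names, assuming the usual `(InputLen - Pool) / Stride + 1` output-length formula):

```cpp
#include <array>
#include <cstddef>

// Sketch of MaxPool1D's per-instance storage: one argmax index per
// output element, used to route gradients straight back to the winning
// input during backpropagation. The Value type is deliberately unused;
// size depends only on the output count and sizeof(std::size_t).
template <std::size_t InputLen, std::size_t Pool, std::size_t Stride,
          std::size_t Channels, typename Value>
struct MaxPool1DSketch {
    static constexpr std::size_t OutputLen = (InputLen - Pool) / Stride + 1;
    std::array<std::size_t, OutputLen * Channels> argmax{};
};

// 96-sample input, pool=2, stride=2, 4 channels -> 48 * 4 = 192 indices.
// With an 8-byte size_t that is 1,536 bytes, matching the table above.
static_assert(sizeof(MaxPool1DSketch<96, 2, 2, 4, float>) ==
              192 * sizeof(std::size_t), "");
```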

### AvgPool1D

AvgPool1D is stateless (1 byte). It has no per-instance storage since average gradients are computed directly from the pool size.
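A sketch of the stateless design (hypothetical names); an empty C++ class still has sizeof equal to 1, which is where the 1-byte figure comes from:

```cpp
#include <cstddef>

// Sketch of a stateless average-pool backward pass.
struct AvgPool1DSketch {
    // The gradient of an average over `pool` inputs is simply
    // upstream_grad / pool for each input -- no stored state required.
    template <typename Value>
    static Value input_grad(Value upstream_grad, std::size_t pool) {
        return upstream_grad / static_cast<Value>(pool);
    }
};

static_assert(sizeof(AvgPool1DSketch) == 1, "empty class occupies one byte");
```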

### Dropout

| Configuration | Size (bytes) |
|---|---|
| Dropout (192 elements, 50%) | 193 bytes |
| Dropout (32 elements, 50%) | 33 bytes |
| Dropout (5 elements, 50%) | 6 bytes |

### Full Pipeline: Conv1D -> MaxPool1D -> Dropout

| Value Type | Conv1D (100, k=5, s=1, 4 filters) | MaxPool1D (96, p=2, s=2, 4 ch) | Dropout (192, 50%) | Total |
|---|---|---|---|---|
| double | 384 bytes | 1,536 bytes | 193 bytes | 2,113 bytes |
| Q8.8 | 96 bytes | 1,536 bytes | 193 bytes | 1,825 bytes |
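Because every size here is a compile-time constant, a pipeline's RAM footprint can be checked with static_assert. A sketch of the pattern, using dummy stand-ins sized from the Q8.8 row above (substitute the library's actual layer templates):

```cpp
#include <cstddef>

// Sum the instance sizes of a layer pack at compile time (C++17 fold).
template <typename... Layers>
constexpr std::size_t pipeline_bytes() {
    return (sizeof(Layers) + ... + std::size_t{0});
}

// Stand-ins sized from the Q8.8 pipeline row above.
struct DummyConv { unsigned char storage[96];   }; // Conv1D, Q8.8
struct DummyPool { unsigned char storage[1536]; }; // MaxPool1D argmax indices
struct DummyDrop { unsigned char storage[193];  }; // Dropout (192 elements)

static_assert(pipeline_bytes<DummyConv, DummyPool, DummyDrop>() == 1825,
              "matches the Q8.8 pipeline total above");
```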

## 2D Layers

All 2D layers use NHWC layout. Sizes below are in float (the type used by the examples/kws_cortex_m/ runner). Q8.8 sizes are ~2x smaller, since a Q8.8 value is 2 bytes versus 4 for float.

| Configuration | float |
|---|---|
| Conv2D (20x20x1, k=3, s=1, 8 filters) | 640 bytes |
| DepthwiseConv2D (9x9x8, k=3) | 640 bytes |
| PointwiseConv2D (7x7, 8 -> 16) | 1,152 bytes |
| PointwiseConv2D (1x1, 16 -> 10; dense classifier) | 1,360 bytes |
| MaxPool2D (18x18x8, pool=2, stride=2) | 5,184 bytes |
| GlobalAvgPool2D (7x7x16) | 1 byte |

Conv2D / DepthwiseConv2D / PointwiseConv2D store both weights and gradients (2x overhead) to support on-device training. MaxPool2D stores size_t argmax indices per output. GlobalAvgPool2D is stateless.
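For reference, a sketch of NHWC addressing, in which the channel index moves fastest (hypothetical helper, not the library's API):

```cpp
#include <cstddef>

// NHWC indexing: for a single image, element (h, w, c) of a WxC-strided
// tensor lives at (h * W + w) * C + c, so channels are contiguous.
constexpr std::size_t nhwc_offset(std::size_t h, std::size_t w, std::size_t c,
                                  std::size_t W, std::size_t C) {
    return (h * W + w) * C + c;
}

// 20x20x1 input: pixel (row 3, col 4, channel 0) lives at offset 64.
static_assert(nhwc_offset(3, 4, 0, 20, 1) == 64, "");
```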

### Full Pipeline: Conv2D -> MaxPool2D -> DepthwiseConv2D -> PointwiseConv2D -> GlobalAvgPool2D -> Dense

Running the pipeline in examples/kws_cortex_m/ (20x20x1 input, 10 output classes, all float):

| Section | Bytes |
|---|---|
| Total weight + state bytes | 8,976 |
| Peak activation buffer | 10,368 (first conv output) |

Swapping the layer value type from float to Q8.8 roughly halves the Conv / DW / PW / Dense storage and each activation buffer (2 bytes per value versus 4). The MaxPool2D argmax array is size_t-indexed and independent of value type; on a tight MCU target it becomes the dominant term and is the best candidate for the next round of footprint work.
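The arithmetic behind that claim (illustrative):

```cpp
#include <cstddef>

// MaxPool2D argmax term: 18x18x8 input, pool=2, stride=2 -> 9x9x8 outputs.
constexpr std::size_t pool_outputs = 9 * 9 * 8;
static_assert(pool_outputs == 648, "");

// One size_t index per output: 648 * 8 = 5,184 bytes on a 64-bit host,
// roughly 58% of the 8,976-byte total, and it does not shrink when the
// value type changes. On a 32-bit MCU, size_t is typically 4 bytes,
// which already halves this term to 2,592 bytes.
constexpr std::size_t argmax_bytes_64bit = pool_outputs * 8;
static_assert(argmax_bytes_64bit == 5184, "");
```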

## Binary and Ternary Dense Layer Sizes

### BinaryDense (Trainable)

| Configuration | double | Q8.8 |
|---|---|---|
| BinaryDense (4, 2) | 168 bytes | 44 bytes |
| BinaryDense (16, 8) | 2,192 bytes | 560 bytes |
| BinaryDense (64, 16) | 16,768 bytes | 4,288 bytes |

### TernaryDense (Trainable)

| Configuration | double | Q8.8 |
|---|---|---|
| TernaryDense (4, 2, 50%) | 168 bytes | 44 bytes |
| TernaryDense (16, 8, 50%) | 2,208 bytes | 576 bytes |
| TernaryDense (64, 16, 50%) | 16,896 bytes | 4,416 bytes |

### Inference-Only Packed Weight Storage

| Configuration | Binary (packed) | Ternary (packed) | Full double | Full Q8.8 |
|---|---|---|---|---|
| 64x16 weights + biases | 256 bytes | 384 bytes | 8,320 bytes | 2,080 bytes |
| 32x32 weights + biases | 384 bytes | 512 bytes | 8,448 bytes | 2,112 bytes |

### Weight Storage Compression Ratios (64x16 layer)

| Storage | Bytes | vs double | vs Q8.8 |
|---|---|---|---|
| Full double | 8,192 bytes | 1x | — |
| Full Q8.8 | 2,048 bytes | 4x | 1x |
| Packed binary (1-bit) | 128 bytes | 64x | 16x |
| Packed ternary (2-bit) | 256 bytes | 32x | 8x |
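A sketch of how packed storage reaches these numbers (hypothetical layout; the library's actual packing format may differ):

```cpp
#include <cstddef>
#include <cstdint>

// Binary weights {-1, +1} need 1 bit each; ternary weights {-1, 0, +1}
// need 2 bits each.
constexpr std::size_t weights = 64 * 16; // 1,024 weights

constexpr std::size_t binary_bytes  = weights / 8; // 1 bit  -> 128 bytes
constexpr std::size_t ternary_bytes = weights / 4; // 2 bits -> 256 bytes

static_assert(binary_bytes == 128 && ternary_bytes == 256, "");

// Unpacking a binary weight: bit set -> +1, bit clear -> -1.
inline int binary_weight(const std::uint8_t* packed, std::size_t i) {
    return ((packed[i / 8] >> (i % 8)) & 1) ? +1 : -1;
}
```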
