Quantized Neural Networks
Tinymind provides two extreme-quantization layer types for ultra-low-power inference: BinaryDense and TernaryDense. These layers replace full-precision multiply-accumulate operations with bitwise logic (XNOR + popcount for binary) or conditional add/subtract/skip (for ternary). This cuts weight storage by 8x to 64x, depending on the baseline value type, and eliminates multiplication entirely from the forward pass.
- BinaryDense: Weights and activations constrained to {-1, +1}. 1-bit packing reduces weight storage 32x relative to 32-bit floats (64x relative to double).
- TernaryDense: Weights constrained to {-1, 0, +1}. 2-bit packing reduces weight storage 16x relative to 32-bit floats (32x relative to double). Zero weights are skipped, providing sparsity.
Both layers support training via the Straight-Through Estimator (STE) and work with both fixed-point and floating-point value types.
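To make the 2-bit packing concrete, here is a minimal sketch of storing ternary weights four to a byte. The helper names (`setTernary`, `getTernary`, `packedTernaryBytes`) and the code assignment (00 for 0, 01 for +1, 10 for -1) are assumptions made for illustration; they are not taken from the Tinymind sources, and the library's actual encoding may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 2-bit codes (an assumption of this sketch):
//   0b00 -> 0, 0b01 -> +1, 0b10 -> -1, 0b11 unused.

// Bytes needed to hold weightCount ternary weights at four weights per byte.
constexpr std::size_t packedTernaryBytes(std::size_t weightCount)
{
    return (weightCount + 3) / 4;
}

// Store one ternary value (-1, 0 or +1) at position index of the packed buffer.
inline void setTernary(std::uint8_t* packed, std::size_t index, int value)
{
    assert(value >= -1 && value <= 1);
    const unsigned code = (value == 0) ? 0u : (value > 0 ? 1u : 2u);
    const unsigned shift = static_cast<unsigned>((index % 4) * 2);
    const unsigned cleared = packed[index / 4] & ~(0x3u << shift); // wipe the old code
    packed[index / 4] = static_cast<std::uint8_t>(cleared | (code << shift));
}

// Read back the ternary value stored at position index.
inline int getTernary(const std::uint8_t* packed, std::size_t index)
{
    const unsigned shift = static_cast<unsigned>((index % 4) * 2);
    const unsigned code = (packed[index / 4] >> shift) & 0x3u;
    return (code == 1u) ? 1 : (code == 2u ? -1 : 0);
}
```

With this layout, `packedTernaryBytes(64 * 16)` evaluates to 256. Binary packing works the same way with one bit per weight and eight weights per byte.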
Why Extreme Quantization on Embedded?
On the smallest microcontrollers, even Q8.8 fixed-point may be too large for wide layers. Binary and ternary quantization make previously impossible deployments feasible:
| 64x16 layer weights | double | Q8.8 | Binary (1-bit) | Ternary (2-bit) |
|---|---|---|---|---|
| Weight storage | 8,192 bytes | 2,048 bytes | 128 bytes | 256 bytes |
A layer that would consume 8 KB as double fits in 128 bytes with binary packing, small enough for an ARM Cortex-M0+ with 4 KB of RAM to hold several such layers at once.
BinaryDense
Template Declaration
template<
typename ValueType,
size_t InputSize,
size_t OutputSize>
class BinaryDense
How It Works
Forward Pass
- Binarize inputs: sign(x) maps each input to +1 or -1
- Pack: Both inputs and weights are stored as single bits
- XNOR: Bitwise XNOR gives 1 where input and weight have the same sign
- Popcount: Count the set bits to get the number of agreements
- Dot product: output = 2 * popcount(XNOR(input, weight)) - InputSize + bias
No multiplication is performed.
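The dot-product formula follows from counting: if m of the n positions agree, the {-1, +1} dot product is m - (n - m) = 2m - n. The sketch below illustrates the steps for a single neuron with up to 32 inputs. The function names, the 32-bit word size, and the choice to treat zero as +1 are assumptions of this sketch, not the library's implementation.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Pack the signs of n values (n <= 32) into one word: bit i is 1 for +1, 0 for -1.
// Zero is treated as +1 here; that tie-break is an assumption of this sketch.
inline std::uint32_t packSigns(const double* values, std::size_t n)
{
    assert(n <= 32);
    std::uint32_t bits = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (values[i] >= 0.0) {
            bits |= (std::uint32_t{1} << i);
        }
    }
    return bits;
}

// Dot product of two {-1, +1} vectors of length n stored as packed sign bits.
inline int binaryDot(std::uint32_t inputBits, std::uint32_t weightBits, std::size_t n)
{
    assert(n >= 1 && n <= 32);
    // Mask off the bits above position n - 1 so they never count as agreements.
    const std::uint32_t mask =
        (n == 32) ? 0xFFFFFFFFu : ((std::uint32_t{1} << n) - 1u);
    const std::uint32_t agree = ~(inputBits ^ weightBits) & mask;          // XNOR
    const int matches = static_cast<int>(std::bitset<32>(agree).count()); // popcount
    return 2 * matches - static_cast<int>(n);
}
```

The bias would be added to the returned value afterwards. On targets without a hardware popcount instruction, `std::bitset::count` typically falls back to a software routine, which is still free of multiplications in the inner loop.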
Training (Straight-Through Estimator)
During training, real-valued “latent” weights are maintained alongside the packed binary weights. The STE passes gradients through the sign() binarization as if it were the identity function, with gradients clipped to zero for latent weights outside [-1, +1].
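A minimal sketch of that gradient rule follows, assuming plain stochastic gradient descent on the latent weights. The function names and the `learningRate` parameter are hypothetical and only illustrate the idea; they do not describe the library's training interface.

```cpp
#include <cassert>
#include <cstddef>

// Forward binarization: sign(w), with 0 mapped to +1 (an assumption of this sketch).
inline double binarize(double latent)
{
    return (latent >= 0.0) ? 1.0 : -1.0;
}

// Straight-through gradient: pass the upstream gradient unchanged while the
// latent weight lies in [-1, +1], and zero it outside that range.
inline double steGradient(double latent, double upstreamGradient)
{
    return (latent >= -1.0 && latent <= 1.0) ? upstreamGradient : 0.0;
}

// One SGD step on the latent weights (hypothetical helper for illustration).
inline void updateLatentWeights(double* latent, const double* upstreamGradients,
                                std::size_t count, double learningRate)
{
    assert(latent != nullptr && upstreamGradients != nullptr);
    for (std::size_t i = 0; i < count; ++i) {
        latent[i] -= learningRate * steGradient(latent[i], upstreamGradients[i]);
    }
}
```

After each update, the latent weights would be binarized again so that the packed bits used by the forward pass stay in sync.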
Example
#include "binarylayer.hpp"
tinymind::BinaryDense<double, 4, 2> layer;
// Set latent weights
layer.setLatentWeight(0, 0, 0.5); // binarizes to +1
layer.setLatentWeight(0, 1, 0.3); // binarizes to +1
layer.setLatentWeight(0, 2, -0.7); // binarizes to -1
layer.setLatentWeight(0, 3, -0.2); // binarizes to -1
layer.setBias(0, 0.0);
layer.binarizeWeights(); // pack into bits
double input[4] = {1.0, -1.0, 1.0, -1.0};
double output[2];
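layer.forward(input, output);
// By the dot-product formula: input and weights agree in 2 of 4 positions,
// so output[0] = 2*2 - 4 + 0 = 0.0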
TernaryDense
Template Declaration
template<
typename ValueType,
size_t InputSize,
size_t OutputSize,
unsigned ThresholdPercent = 50>
class TernaryDense
How It Works
Ternarization
Weights are quantized to {-1, 0, +1} based on a threshold:
- Compute the mean absolute weight: mean_abs = mean(|w|)
- Apply threshold: threshold = ThresholdPercent/100 * mean_abs
- For each weight: if |w| < threshold -> 0, else sign(w) -> +1 or -1
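The ternarization steps can be sketched as a single function. The text does not say whether the mean is taken per neuron or over the whole layer, so this sketch simply operates on whatever slice of weights the caller passes in. The name `ternarize` and its signature are assumptions for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Quantize count latent weights to {-1, 0, +1} using
// threshold = thresholdPercent / 100 * mean(|w|).
inline void ternarize(const double* latent, std::int8_t* ternary,
                      std::size_t count, unsigned thresholdPercent)
{
    assert(count > 0);
    double sumAbs = 0.0;
    for (std::size_t i = 0; i < count; ++i) {
        sumAbs += std::fabs(latent[i]);
    }
    const double meanAbs = sumAbs / static_cast<double>(count);
    const double threshold = (thresholdPercent / 100.0) * meanAbs;

    for (std::size_t i = 0; i < count; ++i) {
        const double magnitude = std::fabs(latent[i]);
        if (magnitude < threshold || magnitude == 0.0) {
            // Below threshold: pruned. An exact zero is also mapped to 0, a choice
            // made in this sketch because sign(0) is otherwise ambiguous.
            ternary[i] = 0;
        } else {
            ternary[i] = static_cast<std::int8_t>(latent[i] > 0.0 ? 1 : -1);
        }
    }
}
```

For the weights {0.9, 0.01, -0.8, 0.02} with a threshold of 50 percent, the mean absolute value is 0.4325 and the threshold is 0.21625, so the result is {+1, 0, -1, 0}.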
Forward Pass
For each output neuron: weight = +1: add input; weight = -1: subtract input; weight = 0: skip.
No multiplication is performed. Zero weights provide natural sparsity.
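As a rough illustration of the add/subtract/skip rule, the function below computes one output neuron from weights that have already been ternarized. It uses unpacked `int8_t` weights for readability; the name `ternaryNeuron` and its signature are assumptions of this sketch rather than part of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Accumulate one output neuron from ternary weights in {-1, 0, +1}.
inline double ternaryNeuron(const std::int8_t* weights, const double* input,
                            std::size_t inputSize, double bias)
{
    assert(weights != nullptr && input != nullptr);
    double sum = bias;
    for (std::size_t i = 0; i < inputSize; ++i) {
        if (weights[i] > 0) {
            sum += input[i];   // weight = +1: add
        } else if (weights[i] < 0) {
            sum -= input[i];   // weight = -1: subtract
        }
        // weight = 0: skip, no work for this input
    }
    return sum;
}
```

With weights {+1, 0, -1, 0}, inputs {2.0, 3.0, 4.0, 5.0} and a zero bias, this returns 2.0 - 4.0 = -2.0.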
Example
#include "ternarylayer.hpp"
tinymind::TernaryDense<double, 4, 2, 50> layer;
layer.setLatentWeight(0, 0, 0.9); // -> +1
layer.setLatentWeight(0, 1, 0.01); // -> 0 (pruned)
layer.setLatentWeight(0, 2, -0.8); // -> -1
layer.setLatentWeight(0, 3, 0.02); // -> 0 (pruned)
layer.setBias(0, 0.0);
layer.ternarizeWeights();
double input[4] = {2.0, 3.0, 4.0, 5.0};
double output[2];
layer.forward(input, output);
// output[0] = (+1)*2 + 0*3 + (-1)*4 + 0*5 = -2.0
Fixed-Point Support
Both layers work with Q-format fixed-point types:
typedef tinymind::QValue<8, 8, true, tinymind::RoundUpPolicy> ValueType;
tinymind::BinaryDense<ValueType, 4, 1> binaryLayer;
tinymind::TernaryDense<ValueType, 4, 1, 10> ternaryLayer;
Compression Ratios (64x16 layer)
| Storage | Bytes | vs double | vs Q8.8 |
|---|---|---|---|
| Full double | 8,192 bytes | 1x | – |
| Full Q8.8 | 2,048 bytes | 4x | 1x |
| Packed binary (1-bit) | 128 bytes | 64x | 16x |
| Packed ternary (2-bit) | 256 bytes | 32x | 8x |
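The figures in this table can be checked at compile time. The sketch below counts weights only (biases and latent training weights are excluded) and uses a hypothetical helper, `storageBytes`, that is not part of the library.

```cpp
#include <cstddef>

// Bytes needed to store `weights` values at `bitsPerWeight` bits each,
// rounded up to whole bytes.
constexpr std::size_t storageBytes(std::size_t weights, std::size_t bitsPerWeight)
{
    return (weights * bitsPerWeight + 7) / 8;
}

constexpr std::size_t kWeights = 64 * 16; // 1,024 weights in a 64x16 layer

static_assert(storageBytes(kWeights, 64) == 8192, "double: 64 bits per weight");
static_assert(storageBytes(kWeights, 16) == 2048, "Q8.8: 16 bits per weight");
static_assert(storageBytes(kWeights, 1) == 128, "packed binary: 1 bit per weight");
static_assert(storageBytes(kWeights, 2) == 256, "packed ternary: 2 bits per weight");

// Ratios relative to double, matching the "vs double" column.
static_assert(storageBytes(kWeights, 64) / storageBytes(kWeights, 1) == 64, "binary");
static_assert(storageBytes(kWeights, 64) / storageBytes(kWeights, 2) == 32, "ternary");
```

Note that during training the real-valued latent weights are kept as well, so these savings apply to inference-time storage.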
When To Use Binary vs Ternary
- BinaryDense: Maximum compression (32x relative to 32-bit floats, 64x relative to double). Best when the problem is simple enough that {-1, +1} weights suffice. XNOR+popcount is extremely fast on hardware with popcount instructions.
- TernaryDense: Half the compression of binary (16x relative to 32-bit floats, 32x relative to double) but supports pruning via zero weights. The sparsity can skip operations entirely, and the ability to “turn off” connections gives more expressiveness than pure binary.
Both are best suited for the later (wider) layers of a network where weight storage dominates memory.