Mixed Precision

TinyMind composes its three numeric pipelines through a small set of pointwise converters that live at layer boundaries, plus a software half-precision storage tier. A single network can run an int8 affine CNN frontend, hand off to a Q-format LSTM head, hand off again to an fp16 attention block, and project back to int8 for the classifier. Every layer keeps the runtime cost of its own grid, and the bridges run only once per tensor crossing.

Three pipelines, one model

| Pipeline | Storage | Where it lives | When it wins |
|---|---|---|---|
| QValue Q-format | int8 / int16 / int32 / int64 with a compile-time binary point | cpp/qformat.hpp + cpp/neuralnet.hpp | Trainable on-MCU, single global grid, no per-tensor metadata |
| Float | float / double | Same templates, different ValueType | Host development, training |
| Int8 affine | int8 weights + int8 activations + per-tensor (scale, zero_point) | cpp/q*.hpp family | TFLite-shape inference, multi-grid (each tensor picks its own range) |

The qbridge converters tie the three together. The simd_neon_fp16.hpp backend adds vector specializations for fp16 storage on Arm hardware that supports it; this page covers the storage tier and the converters.

qbridge.hpp — pointwise converters

cpp/qbridge.hpp provides single-value and buffer-batch converters for use at layer boundaries. The converters compute in float at runtime and do not depend on <cmath>; rounding is done with a sign-aware cast. They are gated on TINYMIND_ENABLE_FLOAT and remain freestanding-safe at STD=0.

Int8 affine ↔ float

| Helper | Direction |
|---|---|
| affineDequantize<Src>(q, scale, zp) | int8 affine → float |
| affineQuantize<Dst>(x, scale, zp, qmin, qmax) | float → int8 affine |
| affineDequantizeBuffer(src, dst, n, scale, zp) | buffer |
| affineQuantizeBuffer(src, dst, n, scale, zp, qmin, qmax) | buffer |
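
As a rough illustration of the arithmetic behind the int8 affine converters, the sketch below implements the usual affine mapping x = scale * (q - zp) and its inverse, with rounding done by a sign-aware cast so that <cmath> is not needed. The names roundSignAware, sketchDequantize, and sketchQuantize are hypothetical and are not the qbridge.hpp signatures; the sketch also assumes x / scale fits in an int32 before clamping.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: round half away from zero using only a cast.
// Assumes v is small enough to fit in int32 after rounding.
inline std::int32_t roundSignAware(float v) {
    return (v >= 0.0f) ? static_cast<std::int32_t>(v + 0.5f)
                       : static_cast<std::int32_t>(v - 0.5f);
}

// int8 affine -> float:  x = scale * (q - zp)
inline float sketchDequantize(std::int8_t q, float scale, std::int32_t zp) {
    return scale * static_cast<float>(static_cast<std::int32_t>(q) - zp);
}

// float -> int8 affine:  q = clamp(round(x / scale) + zp, qmin, qmax)
inline std::int8_t sketchQuantize(float x, float scale, std::int32_t zp,
                                  std::int32_t qmin, std::int32_t qmax) {
    assert(scale > 0.0f);  // a non-positive scale is not a valid grid
    std::int32_t q = roundSignAware(x / scale) + zp;
    if (q < qmin) q = qmin;  // saturate instead of wrapping
    if (q > qmax) q = qmax;
    return static_cast<std::int8_t>(q);
}
```

A buffer-batch form would apply the same single-value function to each of the n elements with one shared (scale, zp) pair.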

Q-format ↔ float

| Helper | Direction |
|---|---|
| qValueToFloat<QV>(q) | QValue<Q, F> → float |
| floatToQValue<QV>(x) | float → QValue<Q, F> |
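
The Q-format direction can be pictured as scaling by 2^F, where F is the number of fractional bits fixed at compile time. The sketch below uses a hypothetical QSketch struct as a stand-in for a Q-format value; it is not the QValue class from cpp/qformat.hpp, and it does not saturate, so the input is assumed to fit the raw integer type.

```cpp
#include <cstdint>

// Hypothetical stand-in for a Q-format value: a signed raw integer with
// FracBits fractional bits. Not the library's QValue type.
template <typename Raw, unsigned FracBits>
struct QSketch {
    static_assert(FracBits < 31, "sketch assumes a small binary point");
    Raw raw;
    // 2^FracBits as a float: the scale between raw and real values.
    static constexpr float one() {
        return static_cast<float>(std::int64_t{1} << FracBits);
    }
};

// Q-format -> float: divide the raw integer by 2^FracBits.
template <typename QV>
float sketchQToFloat(QV q) {
    return static_cast<float>(q.raw) / QV::one();
}

// float -> Q-format: multiply by 2^FracBits, round half away from zero
// with a sign-aware cast. No saturation in this sketch.
template <typename QV>
QV sketchFloatToQ(float x) {
    const float scaled = x * QV::one();
    const std::int64_t r = (scaled >= 0.0f)
        ? static_cast<std::int64_t>(scaled + 0.5f)
        : static_cast<std::int64_t>(scaled - 0.5f);
    QV out{};
    out.raw = static_cast<decltype(out.raw)>(r);
    return out;
}
```

With 8 fractional bits on an int16 raw value, 1.5 maps to a raw value of 384 and back to 1.5 without loss.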

Q-format ↔ int8 affine

| Helper | Direction |
|---|---|
| qValueToAffine<QV, Dst>(q, scale, zp, qmin, qmax) | QValue → int8 affine |
| affineToQValue<QV, Src>(q, scale, zp) | int8 affine → QValue |
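
One plausible way to cross between the two integer grids is to pass through float: decode the source grid to a real value, then encode it on the destination grid. The sketch below does that for a 16-bit raw Q-format value and an int8 affine value. The function names and the runtime fracBits parameter are assumptions made for illustration (the library fixes the binary point at compile time), and whether qbridge.hpp composes the conversion this way internally is not confirmed here.

```cpp
#include <cstdint>

// Hypothetical: 16-bit raw Q-format value -> int8 affine, via float.
inline std::int8_t sketchQToAffine(std::int16_t rawQ, unsigned fracBits,
                                   float scale, std::int32_t zp,
                                   std::int32_t qmin, std::int32_t qmax) {
    // Decode the Q-format grid: real = raw / 2^fracBits.
    const float x = static_cast<float>(rawQ) / static_cast<float>(1 << fracBits);
    // Encode on the affine grid with sign-aware rounding, then saturate.
    const float v = x / scale;
    std::int32_t q = ((v >= 0.0f) ? static_cast<std::int32_t>(v + 0.5f)
                                  : static_cast<std::int32_t>(v - 0.5f)) + zp;
    if (q < qmin) q = qmin;
    if (q > qmax) q = qmax;
    return static_cast<std::int8_t>(q);
}

// Hypothetical: int8 affine -> 16-bit raw Q-format value, via float.
// No saturation; assumes the result fits in int16.
inline std::int16_t sketchAffineToQ(std::int8_t q, unsigned fracBits,
                                    float scale, std::int32_t zp) {
    const float x = scale * static_cast<float>(static_cast<std::int32_t>(q) - zp);
    const float s = x * static_cast<float>(1 << fracBits);
    return static_cast<std::int16_t>((s >= 0.0f) ? (s + 0.5f) : (s - 0.5f));
}
```

A round trip is exact only when the value sits on both grids; otherwise the result lands on the nearest affine step, so the error is bounded by half of scale.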

Half-precision (gated on TINYMIND_ENABLE_FP16=1)

| Helper | Direction |
|---|---|
| affineI8ToFp16(q, scale, zp) | int8 affine → fp16 |
| fp16ToAffineI8(h, scale, zp, qmin, qmax) | fp16 → int8 affine |
| affineI8ToBf16(q, scale, zp) | int8 affine → bf16 |
| bf16ToAffineI8(h, scale, zp, qmin, qmax) | bf16 → int8 affine |

Buffer-batch versions of every variant follow the same naming.

fp16_t and bf16_t storage tier

cpp/include/tinymind_fp16.hpp provides software-only fp16_t (IEEE 754 binary16) and bf16_t (bfloat16) storage structs wrapping uint16_t. The conversion helpers (floatToFp16 / fp16ToFloat, floatToBf16 / bf16ToFloat) handle normals, subnormals, Inf, and NaN.
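
To make the storage idea concrete, here is a minimal sketch of a bfloat16 struct wrapping a uint16_t, with float conversion in both directions. bfloat16 is the upper 16 bits of an IEEE 754 binary32 value, so decoding is a shift and encoding is a rounded truncation. Bf16Sketch and both function names are hypothetical, round-to-nearest-even is an assumption about the rounding mode, and the use of std::memcpy for the bit copy is a host-side convenience rather than a claim about how tinymind_fp16.hpp is written.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical 16-bit storage struct; not the library's bf16_t.
struct Bf16Sketch {
    std::uint16_t bits;
};

// float -> bfloat16 with round-to-nearest-even on the dropped 16 bits.
inline Bf16Sketch sketchFloatToBf16(float x) {
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    const bool isNaN = ((u & 0x7F800000u) == 0x7F800000u) &&
                       ((u & 0x007FFFFFu) != 0u);
    if (isNaN) {
        // Force a mantissa bit so truncation cannot turn NaN into Inf.
        return Bf16Sketch{static_cast<std::uint16_t>((u >> 16) | 0x0040u)};
    }
    // Add 0x7FFF plus the lowest kept bit: ties round to the even result.
    const std::uint32_t bias = 0x7FFFu + ((u >> 16) & 1u);
    return Bf16Sketch{static_cast<std::uint16_t>((u + bias) >> 16)};
}

// bfloat16 -> float: place the stored bits in the upper half of a binary32.
inline float sketchBf16ToFloat(Bf16Sketch h) {
    const std::uint32_t u = static_cast<std::uint32_t>(h.bits) << 16;
    float x;
    std::memcpy(&x, &u, sizeof x);
    return x;
}
```

Normals, subnormals, and Inf need no special cases in this layout because the exponent field has the same width as in binary32; only NaN needs care on the way down.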

This is a storage tier, not an arithmetic tier. The structs are 16-bit; arithmetic happens by promoting to float at the call site. On targets that ship vector fp16 arithmetic (NEON FEAT_FP16 via TINYMIND_ENABLE_SIMD_NEON_FP16=1, AVX-512 fp16 extensions) the promote-then-MAC pattern is a near-noop. On every other target the scalar promote-store pair is the cost of admission for fp16 storage on an MCU.
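
The promote-then-MAC pattern described above might look like the following scalar sketch: operands are stored as 16-bit binary16 bit patterns, each is promoted to float at the point of use, and the accumulator stays in float. The decoder sketchFp16ToFloat and the dot product sketchDotFp16 are hypothetical names written for this page, not the library's fp16ToFloat.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical decoder: IEEE 754 binary16 bit pattern -> float.
inline float sketchFp16ToFloat(std::uint16_t h) {
    const std::uint32_t sign = static_cast<std::uint32_t>(h & 0x8000u) << 16;
    const std::uint32_t exp = (h >> 10) & 0x1Fu;
    std::uint32_t man = h & 0x3FFu;
    std::uint32_t bits;
    if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (man << 13);            // Inf or NaN
    } else if (exp != 0u) {
        bits = sign | ((exp + 112u) << 23) | (man << 13);   // normal: rebias 15 -> 127
    } else if (man == 0u) {
        bits = sign;                                        // signed zero
    } else {
        // Subnormal: shift until the implicit bit appears, adjusting the
        // float exponent so the value man * 2^-24 is preserved.
        std::uint32_t e = 113u;
        while ((man & 0x400u) == 0u) {
            man <<= 1;
            --e;
        }
        bits = sign | (e << 23) | ((man & 0x3FFu) << 13);
    }
    float x;
    std::memcpy(&x, &bits, sizeof x);
    return x;
}

// Promote-then-MAC: 16-bit storage, float multiply-accumulate.
inline float sketchDotFp16(const std::uint16_t* a, const std::uint16_t* b,
                           std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        acc += sketchFp16ToFloat(a[i]) * sketchFp16ToFloat(b[i]);
    }
    return acc;
}
```

On a scalar target the decode inside the loop is the per-element cost mentioned above; the memory footprint of the operands is still half that of float.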

Gates:

  • TINYMIND_ENABLE_FP16=1 — pulls in the storage types and the conversion helpers.
  • Conversion helpers additionally require TINYMIND_ENABLE_FLOAT=1.

The unit_test/embedded/Makefile exercises this corner as fp16_freestanding (FLOAT=1 FP16=1 QUANT=1 STD=0) to confirm the half-precision and bridge headers stay freestanding-clean.

Mixed-precision exemplar — mixed_precision_kws

examples/mixed_precision_kws/ wires the qbridge converters in production shape:

input  [S=8][E=8]   float
   ----[ int8 frontend ]----------------------------
   QDense  E -> E (one call per sequence step)
   qrelu                                  -> [S][E] int8
   ----[ qbridge: affineI8 -> fp16 ]----------------
                                          -> [S][E] fp16
   ----[ fp16 attention head ]----------------------
   Linear (ReLU-kernel) self-attention with residual
   skip from the post-relu feature buffer, then
   mean-pool over S                       -> [E] fp16
   ----[ qbridge: fp16 -> affineI8 ]----------------
                                          -> [E] int8
   ----[ int8 classifier ]--------------------------
   QDense  E -> NUM_CLASSES               -> [NUM_CLASSES] int8 logits

The precision-tier pattern — int8 front + classifier bracketing an fp16 head — is the load-bearing piece. Real KWS deployments have softmax classifiers and richer encoders; the int8 / fp16 / int8 sandwich survives every substitution.
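
The data flow through the two bridges in the diagram can be sketched in a few lines. This is a simplification under stated assumptions: the attention math and residual are omitted, float stands in for the fp16 middle tier (the library promotes fp16 to float for arithmetic anyway), and AffineSketch, kSeq, kEmb, and sketchBridgePoolBridge are hypothetical names, not code from examples/mixed_precision_kws/.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kSeq = 8;  // S in the diagram
constexpr std::size_t kEmb = 8;  // E in the diagram

// Hypothetical per-tensor quantization parameters.
struct AffineSketch {
    float scale;
    std::int32_t zp;
};

// in:  [kSeq][kEmb] int8 features, row-major, from the int8 frontend
// out: [kEmb] int8 vector for the int8 classifier
inline void sketchBridgePoolBridge(const std::int8_t* in, AffineSketch inQ,
                                   AffineSketch outQ, std::int8_t* out) {
    for (std::size_t e = 0; e < kEmb; ++e) {
        float acc = 0.0f;
        for (std::size_t s = 0; s < kSeq; ++s) {
            // Bridge in: int8 affine -> real value (middle tier).
            const std::int32_t q = in[s * kEmb + e];
            acc += inQ.scale * static_cast<float>(q - inQ.zp);
        }
        const float mean = acc / static_cast<float>(kSeq);  // mean-pool over S
        // Bridge out: real value -> int8 affine, sign-aware rounding, saturate.
        const float v = mean / outQ.scale;
        std::int32_t r = ((v >= 0.0f) ? static_cast<std::int32_t>(v + 0.5f)
                                      : static_cast<std::int32_t>(v - 0.5f)) + outQ.zp;
        if (r < -128) r = -128;
        if (r > 127) r = 127;
        out[e] = static_cast<std::int8_t>(r);
    }
}
```

Each bridge touches every element once, so the conversion cost scales with the tensor size at the crossing rather than with the work done inside the middle tier.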

When to bridge

  • int8 → fp16 → int8 around an attention block. Linear self-attention has an inner Q' KV matmul whose dynamic range is hard to pin down at calibration time without losing accuracy. fp16 in the middle absorbs the range, while the surrounding int8 keeps storage and the conv MACs cheap.
  • int8 → float → int8 around a softmax. When deploying on a target with float MACs but limited integer throughput, the softmax can run in float between two int8 layers without disturbing the deployable shape.
  • Q-format ↔ int8 within a hybrid model. When migrating an existing QValue-based network to int8 incrementally, qValueToAffine / affineToQValue let you swap one layer at a time and validate parity at each step.
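
As an illustration of the softmax case in the list above, the sketch below dequantizes int8 logits, runs a numerically stable softmax in float, and requantizes the probabilities to int8. The function name and parameter layout are hypothetical. Unlike the qbridge converters, this sketch uses std::exp from <cmath>, and the output grid of scale 1/256 with zero point -128 used in the note is only a common convention, not something this page specifies.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical: int8 logits -> float softmax -> int8 probabilities.
inline void sketchSoftmaxBridge(const std::int8_t* logits, std::size_t n,
                                float inScale, std::int32_t inZp,
                                float outScale, std::int32_t outZp,
                                std::int8_t* probs) {
    assert(n > 0);
    // Pass 1: find the largest dequantized logit for numerical stability.
    float maxv = inScale * static_cast<float>(logits[0] - inZp);
    for (std::size_t i = 1; i < n; ++i) {
        const float x = inScale * static_cast<float>(logits[i] - inZp);
        if (x > maxv) maxv = x;
    }
    // Pass 2: accumulate the normalizer in float.
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        const float x = inScale * static_cast<float>(logits[i] - inZp);
        sum += std::exp(x - maxv);
    }
    // Pass 3: normalize and bridge back to the int8 grid.
    for (std::size_t i = 0; i < n; ++i) {
        const float x = inScale * static_cast<float>(logits[i] - inZp);
        const float p = std::exp(x - maxv) / sum;  // in [0, 1]
        std::int32_t q = static_cast<std::int32_t>(p / outScale + 0.5f) + outZp;
        if (q < -128) q = -128;
        if (q > 127) q = 127;
        probs[i] = static_cast<std::int8_t>(q);
    }
}
```

With an output scale of 1/256 and zero point -128, a probability of 0.5 lands on 0 and a probability of 1.0 saturates at 127.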

What this is not

  • Not QAT. Mixed precision is a deployment story, not a training story.
  • Not fp16 arithmetic. The library treats fp16 as a storage tier; inner arithmetic promotes to float. The vector fp16 ISA gates (SIMD_NEON_FP16, AVX-512 fp16) provide native fp16 arithmetic on hardware that supports it, but the library does not synthesize fp16 arithmetic in software.
  • Not int4. Storage is int8 / int16 / int32 / fp16 / bf16 / float / double. Sub-byte storage is out of scope.




Dan McLeran — danmcleran@gmail.com — MIT License
