Getting Started with Fixed-Point Arithmetic
Fixed-point arithmetic is floating-point with the binary point nailed to one
place. You trade dynamic range for speed, determinism, and hardware
compatibility. This guide builds intuition from first principles, then shows
every concept applied with doppler.arith.
1. The implied binary point
A binary integer has bits with place values 2⁰, 2¹, 2², …. Fixed-point shifts those place values left or right by agreeing — in software — that a certain bit position is "the one's place." The agreement is implicit: the hardware stores plain integers; your code knows the scale.
Bit index: 7 6 5 4 3 2 1 0
Place value: 128 64 32 16 8 4 2 1 ← ordinary 8-bit integer (unsigned view)
Bit index: 7 6 5 4 3 2 1 0
Place value: ½ ¼ ⅛ 1/16 1/32 1/64 1/128 1/256 ← if the point sits ABOVE bit 7
(integer part is 0 bits wide; all bits are fractional)
The Q-format notation makes the agreement explicit.
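To make the agreement concrete, the sketch below reads one stored bit pattern under three different scale conventions. The helper name `interpret` is hypothetical (it is not part of doppler.arith) and assumes a signed raw integer plus a chosen number of fraction bits.

```python
def interpret(raw: int, frac_bits: int) -> float:
    """Return the real value of a raw integer that has frac_bits fraction bits."""
    return raw / (1 << frac_bits)

raw = 0x60  # bit pattern 0110_0000, i.e. the plain integer 96

with_0_frac = interpret(raw, 0)  # point to the right of bit 0: ordinary integer
with_4_frac = interpret(raw, 4)  # point between bits 4 and 3
with_7_frac = interpret(raw, 7)  # point between bits 7 and 6

print(with_0_frac, with_4_frac, with_7_frac)  # 96.0 6.0 0.75
```

The stored bits never change; only the divisor your code applies does.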
2. Q notation: Qm.n
Qm.n means: m bits for the signed integer part (including the sign bit),
n bits for the fractional part, stored in an (m+n)-bit two's complement
integer.
| Format | Stored type | Integer bits | Fraction bits | Range | Resolution (LSB) |
|---|---|---|---|---|---|
| Q8 (Q1.7) | int8_t | 1 | 7 | [−1, +1 − 2⁻⁷] | 2⁻⁷ ≈ 0.0078 |
| Q15 (Q1.15) | int16_t | 1 | 15 | [−1, +1 − 2⁻¹⁵] | 2⁻¹⁵ ≈ 3.05 × 10⁻⁵ |
The real value of a raw integer k stored in Qm.n is:
$$x = k \cdot 2^{-n}$$
So k = 16384 in Q15 represents 16384 × 2⁻¹⁵ = 0.5.
import numpy as np
from doppler.arith import add_q15
# Encode 0.5 as Q15
half_q15 = np.int16(16384) # 0.5 × 32768
# Verify: raw integer divided by scale
print(half_q15 / 32768) # 0.5
Why not reach +1?
Two's complement gives 2ⁿ negative values but only 2ⁿ − 1 positive values,
because zero takes one of the non-negative codes. int16_t covers −32768 to
+32767, so Q15 spans [−1, +32767/32768]. The missing +1 endpoint is a
fundamental property of the representation, not a bug.
3. Scaling: converting float to fixed-point
To encode a real value x in Qm.n, multiply by 2ⁿ and round:
$$k = \operatorname{round}(x \cdot 2^{n})$$
To decode back to float, divide by 2ⁿ:
$$\hat{x} = k \cdot 2^{-n}$$
# Float → Q15 → float round-trip
x = 0.707 # -3 dBFS
scale = 32768 # 2^15
k = np.int16(round(x * scale)) # 23167
recovered = k / scale # 0.7070007... (quantisation error ≤ ½ LSB)
# Float → Q8 → float
scale8 = 128 # 2^7
k8 = np.int8(round(x * scale8)) # 90
recovered8 = k8 / scale8 # 0.703125 (larger error: only 7 fraction bits)
The difference between x and recovered is the quantisation error,
bounded by ½ LSB when rounding (one full LSB when truncating).
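The encode and decode steps can be sketched in plain Python as follows. The helper names `float_to_q` and `q_to_float` are hypothetical, and the sketch assumes round-half-up via `floor(x · 2ⁿ + 0.5)` followed by a clamp to the storage range; Python's built-in `round` uses round-half-even instead.

```python
import math

def float_to_q(x: float, frac_bits: int, total_bits: int) -> int:
    """Encode x as a raw integer: round half-up, then clamp to the storage range."""
    raw = math.floor(x * (1 << frac_bits) + 0.5)
    raw_min = -(1 << (total_bits - 1))
    raw_max = (1 << (total_bits - 1)) - 1
    return max(raw_min, min(raw_max, raw))

def q_to_float(raw: int, frac_bits: int) -> float:
    """Decode a raw integer back to a real value."""
    return raw / (1 << frac_bits)

k = float_to_q(0.3, 15, 16)
error = abs(0.3 - q_to_float(k, 15))
print(k)                         # 9830
print(error <= 2.0 ** -16)       # True: within half an LSB
print(float_to_q(1.0, 15, 16))   # 32767: +1.0 is not representable, so it clamps
```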
4. Range and representable values
The representable range of a Qm.n integer type is determined entirely by the integer range of the storage type.
| Format | Storage | Integer range | Real range |
|---|---|---|---|
| Q8 | int8_t | [−128, 127] | [−1.0, +127/128] |
| Q15 | int16_t | [−32768, 32767] | [−1.0, +32767/32768] |
Any real value outside this range cannot be represented — it overflows.
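The table rows follow directly from the storage width and the scale. As a minimal sketch, the hypothetical helper `q_range` below derives the endpoints and the LSB for any signed Qm.n, assuming m counts the sign bit as in section 2.

```python
def q_range(m: int, n: int) -> tuple[float, float, float]:
    """Return (min, max, lsb) for a signed Qm.n format; m includes the sign bit."""
    total_bits = m + n
    raw_min = -(1 << (total_bits - 1))
    raw_max = (1 << (total_bits - 1)) - 1
    lsb = 2.0 ** -n
    return raw_min * lsb, raw_max * lsb, lsb

lo, hi, lsb = q_range(1, 15)
print(lo, hi, lsb)  # -1.0 0.999969482421875 3.0517578125e-05
```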
5. Overflow vs saturation
Overflow happens when an arithmetic result exceeds the storage type's range. In two's complement the bits wrap: the result "rolls over" and the sign can flip without warning.
Q15 example: 0.75 + 0.75 = 1.5 (out of range)
0.75 = 0x6000 = 24576
+ 0.75 = 0x6000 = 24576
────────────────────────
0xC000 = -16384 ← int16 overflow! Wrong sign.
Saturation clamps the result at the representable limits instead: the same sum returns 0x7FFF = 32767 (just under +1.0) rather than −16384.
doppler.arith saturates all add, subtract, multiply, and left-shift operations.
import numpy as np
from doppler.arith import add_q15
Q15_MAX = np.int16(32767)
a = np.array([Q15_MAX], dtype=np.int16)
b = np.array([1], dtype=np.int16)
print(add_q15(a, b)[0]) # 32767 — clamped, not wrapped
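If you want to see both behaviours side by side without the library, the following sketch emulates them on plain Python integers. `wrap16` and `sat16` are hypothetical helper names; `wrap16` reproduces what a 16-bit two's complement register does on overflow.

```python
def wrap16(v: int) -> int:
    """Reduce v to the int16 range the way two's complement storage wraps."""
    return ((v + 0x8000) & 0xFFFF) - 0x8000

def sat16(v: int) -> int:
    """Clamp v to the int16 range."""
    return max(-32768, min(32767, v))

a = b = 24576                 # 0.75 in Q15
wrapped = wrap16(a + b)
saturated = sat16(a + b)
print(wrapped, saturated)     # -16384 32767
```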
When NOT to saturate
Integrating accumulators in CIC filters intentionally rely on two's
complement wrap-around: the overflow at each integrator cancels in the
subsequent comb section. AccQ15 and AccQ8 use a wider integer
accumulator (int64_t and int32_t respectively) instead of Q-format
saturation, so the partial sums stay exact until you dump() them.
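The cancellation is easy to check numerically. The sketch below is a deliberately simplified single integrator followed by a single comb with no decimation (an assumption made for brevity, not a full CIC), using a hypothetical `wrap16` helper to emulate 16-bit wrap-around.

```python
def wrap16(v: int) -> int:
    """Emulate int16 two's complement wrap-around."""
    return ((v + 0x8000) & 0xFFFF) - 0x8000

R = 8                      # comb delay
x = [3000] * 64            # constant input; the true moving sum is 8 * 3000 = 24000

# Integrator: running sum that is allowed to overflow and wrap.
acc = 0
integ = []
for sample in x:
    acc = wrap16(acc + sample)
    integ.append(acc)

# Comb: difference of the integrator output R samples apart, also wrapping.
y = [wrap16(integ[n] - integ[n - R]) for n in range(R, len(x))]

print(min(integ) < 0)      # True: the integrator really did wrap
print(set(y))              # {24000}: the comb output is still exact
```

The result is exact only because the true output (24000) fits in int16; the wrap cancels modulo 2¹⁶, so the final word must still be wide enough for the final answer.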
6. Bit growth on arithmetic operations
Every arithmetic operation changes the number of bits you need to represent the exact result.
6a. Addition and subtraction
Adding two n-bit numbers requires n+1 bits to hold every possible result without overflow. In general, adding k numbers of n bits each requires n + ⌈log₂ k⌉ bits.
Example: adding 4 Q15 values
Each value fits in 16 bits.
Exact sum needs 16 + ⌈log₂ 4⌉ = 16 + 2 = 18 bits.
Accumulating 1024 Q15 values needs 16 + 10 = 26 bits.
Accumulating 2³² values needs 16 + 32 = 48 bits (fits in int64).
from doppler.arith import AccQ15
acc = AccQ15() # int64_t internal accumulator
data = np.ones(1024, dtype=np.int16) * 32767 # all at full scale
acc.steps(data)
print(acc.dump()) # 33,553,408 — far outside int16 range, exact in int64
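The n + ⌈log₂ k⌉ rule can be evaluated without floating-point logarithms. In this sketch the hypothetical helper `sum_bits` uses `int.bit_length`, which gives the exact ceiling of log₂ for any positive count.

```python
def sum_bits(operand_bits: int, count: int) -> int:
    """Bits needed to hold the exact sum of `count` values of operand_bits each."""
    guard = (count - 1).bit_length()  # equals ceil(log2(count)) for count >= 1
    return operand_bits + guard

print(sum_bits(16, 4))        # 18
print(sum_bits(16, 1024))     # 26
print(sum_bits(16, 2 ** 32))  # 48
```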
6b. Multiplication
Multiplying a Qm.n value by a Qp.q value produces a Q(m+p).(n+q) result. The product needs m+n+p+q bits in total: m+p for the integer part and n+q for the fraction part.
Q15 × Q15 → Q1.15 × Q1.15 → Q2.30
Each operand: 1 sign + 15 fraction = 16 bits
Product: 2 integer (including sign) + 30 fraction = 32 bits (stored in int32 or int64)
To return to Q15: shift right 15, saturate to int16 (only −1 × −1 = +1 actually clips)
The 15-bit right shift is what mul_q15 does after adding the rounding bias:
from doppler.arith import mul_q15
half = np.array([16384], dtype=np.int16) # 0.5 in Q15
quarter = mul_q15(half, half) # 0.5 × 0.5 = 0.25
print(quarter[0]) # 8192 (= 0.25 × 32768)
print(quarter[0] / 32768) # 0.25
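For reference, the widen, round, shift, and saturate sequence can be written out on plain Python integers. `mul_q15_ref` is a hypothetical scalar stand-in that assumes the round-half-up rule described in section 8; it is not doppler's implementation.

```python
def mul_q15_ref(a: int, b: int) -> int:
    """Scalar Q15 multiply: exact Q30 product, round half-up, shift, clamp."""
    product = a * b                          # Q30, exact in a Python int
    rounded = (product + (1 << 14)) >> 15    # add half an output LSB, drop 15 bits
    return max(-32768, min(32767, rounded))  # saturate to int16

print(mul_q15_ref(16384, 16384))    # 8192  (0.5 * 0.5 = 0.25)
print(mul_q15_ref(-32768, -32768))  # 32767 (-1 * -1 = +1 clips to the maximum)
print(mul_q15_ref(-32768, 32767))   # -32767
```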
6c. The dot product
dot_q15(a, b) returns the raw Q30 accumulation as int64 — no shift, no
saturation. This lets you inspect the full precision before deciding how to
normalise.
from doppler.arith import dot_q15, shr_i64
a = np.full(4, 16384, dtype=np.int16) # [0.5, 0.5, 0.5, 0.5]
b = np.full(4, 16384, dtype=np.int16)
raw = dot_q15(a, b) # Q30: 4 × (0.5 × 0.5 × 32768²) = 1_073_741_824
print(raw) # 1073741824
# Normalise to Q15: shift right 15
q15_scalar = shr_i64(np.array([raw], dtype=np.int64), 15)
print(q15_scalar[0]) # 32768 (= 1.0 × 2¹⁵, correct: Σ 0.5² = 1.0)
# 32768 is one above the int16 maximum, so clamp to 32767 before storing as Q15
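The same accumulate-then-normalise flow is sketched below on plain Python integers, which makes the full-scale corner case visible. `dot_q15_ref` and `q30_to_q15` are hypothetical names; the rounding and clamping steps are assumptions about how you might choose to narrow the Q30 sum.

```python
def dot_q15_ref(a, b):
    """Exact sum of products as a Python int (Q30 scale)."""
    return sum(x * y for x, y in zip(a, b))

def q30_to_q15(raw: int) -> int:
    """Round half-up, shift back to Q15, and clamp to the int16 range."""
    shifted = (raw + (1 << 14)) >> 15
    return max(-32768, min(32767, shifted))

a = [16384] * 4   # four samples of 0.5
b = [16384] * 4
raw = dot_q15_ref(a, b)

print(raw, raw / 2 ** 30)   # 1073741824 1.0
print(q30_to_q15(raw))      # 32767: exactly 1.0 does not fit in Q15
```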
7. Required accumulator width (no precision loss)
The accumulator must be wide enough to hold the sum of all products before any normalising shift.
| Operation | Operand type | Product width | Max terms | Min accumulator |
|---|---|---|---|---|
| dot_q15 | Q15 (int16) | Q30 (int32) | ≤ 2³³ | int64 |
| dot_q8 | Q8 (int8) | Q14 (int16) | ≤ 2¹⁷ | int32 |
| AccQ15.madd | Q15 | Q30 | ≤ 2³³ | int64 |
| AccQ8.madd | Q8 | Q14 | ≤ 2¹⁷ | int32 |
The guard-bits formula:
accumulator bits needed = product bits + ⌈log₂(max_terms)⌉
For dot_q15 with n=1024:
product bits = 30 (Q15×Q15 → Q30, fits in int32)
guard bits = ⌈log₂ 1024⌉ = 10
total needed = 40 bits → use int64 (64 bits — plenty of headroom)
For dot_q8 with n=1024:
product bits = 14 (Q8×Q8 → Q14, fits in int16)
guard bits = 10
total needed = 24 bits → use int32 (32 bits)
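Both directions of this sizing question can be checked with a few lines of integer arithmetic. The helper names below are hypothetical. Note that the strict worst case (every product equal to −1 × −1) comes out one term short of the rounded 2³³ and 2¹⁷ figures in the table.

```python
def accumulator_bits(product_bits: int, max_terms: int) -> int:
    """Guard-bits formula: product bits plus ceil(log2(max_terms))."""
    return product_bits + (max_terms - 1).bit_length()

def worst_case_terms(acc_bits: int, operand_bits: int) -> int:
    """How many largest-magnitude products fit in a signed accumulator."""
    largest_product = (1 << (operand_bits - 1)) ** 2   # (-min) * (-min)
    return ((1 << (acc_bits - 1)) - 1) // largest_product

print(accumulator_bits(30, 1024))   # 40
print(accumulator_bits(14, 1024))   # 24
print(worst_case_terms(64, 16))     # 8589934591 (2**33 - 1)
print(worst_case_terms(32, 8))      # 131071 (2**17 - 1)
```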
8. Truncation vs rounding
After a multiply or shift you must decide how to discard the fractional bits you are throwing away.
| Mode | Rule | Error range | Bias |
|---|---|---|---|
| Truncation (floor) | x >> n | (−1 LSB, 0] | negative |
| Round half-up | (x + 2^(n−1)) >> n | (−½ LSB, +½ LSB] | slight positive |
| Round half-even (banker's) | round to nearest even | [−½ LSB, +½ LSB] | none |
doppler.arith uses round-half-up throughout: the bias of ½ LSB is added
before the shift. This is the standard choice for DSP and matches the
behaviour of most hardware DSPs.
Q15 multiply, round-half-up:
raw product = a × b (int32, Q30)
bias = 2^14 = 16384
rounded = (raw + bias) >> 15 (int16, Q15)
Truncation is faster but introduces a systematic downward bias on all results. For a FIR filter this adds a DC offset of about −0.5 LSB to each output sample.
from doppler.arith import shr_q15
# 24577 is odd, so halving it lands exactly between two Q15 values (12288.5)
a = np.array([24577], dtype=np.int16) # ≈ 0.75 in Q15
# shr_q15 uses round-half-up: (24577 + 2^0) >> 1 = 12289
print(shr_q15(a, 1)[0]) # 12289 (12288.5 rounded up)
# Manual truncation (arithmetic right-shift behaviour):
print(np.int16(np.int32(24577) >> 1)) # 12288 (12288.5 floored: biased low)
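The bias column of the table can be measured directly. The sketch below averages the error of each mode over a symmetric block of integers for a 4-bit right shift; all three rounding functions are hypothetical reference versions on plain Python integers, with the error expressed in output LSBs.

```python
n = 4                          # number of bits discarded
values = range(-1024, 1024)    # covers every 4-bit remainder equally often

def truncate(v: int) -> int:
    return v >> n              # arithmetic shift floors toward minus infinity

def half_up(v: int) -> int:
    return (v + (1 << (n - 1))) >> n

def half_even(v: int) -> int:
    q, r = divmod(v, 1 << n)   # floor quotient, remainder in [0, 2**n)
    half = 1 << (n - 1)
    if r > half or (r == half and q & 1):
        q += 1                 # round up, or break a tie toward the even result
    return q

def mean_error(round_fn) -> float:
    errors = [round_fn(v) - v / (1 << n) for v in values]
    return sum(errors) / len(errors)

print(mean_error(truncate))    # -0.46875
print(mean_error(half_up))     # 0.03125
print(mean_error(half_even))   # 0.0
```

The residual +1/32 LSB for round-half-up comes entirely from the tie case; it shrinks as more bits are discarded.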
9. Saturation arithmetic: a closer look
Saturation replaces the modular wrap of two's complement with a clamp. Every
add, sub, mul, and shl in doppler.arith saturates.
Why shifts can overflow too
A left shift multiplies by a power of two. Shifting a Q15 value left by 1 doubles it, which can overflow:
from doppler.arith import shl_q15
a = np.array([24576, -24576], dtype=np.int16) # ±0.75
# Without saturation: 24576 << 1 = 49152 > 32767 → wraps to -16384
# With saturation: → clamps to 32767
print(shl_q15(a, 1)) # [ 32767, -32768]
The sign-flip hazard
Without saturation, overflow is invisible — the number looks valid but has the wrong sign. Saturation makes clipping audible/visible without the far worse artefact of a full sign flip.
10. Working with the doppler.arith API
Module-level functions (stateless)
All stateless operations take NumPy arrays and return a new array (or scalar).
import numpy as np
from doppler.arith import (
    add_q15, sub_q15, mul_q15, dot_q15,
    shl_q15, shr_q15,
    shr_i64,
)
# --- Q15 example: FIR output sample ---
# Compute one output sample of a symmetric FIR with coefficient vector h
# (Q15) applied to a delay line x (Q15).
def fir_sample(x: np.ndarray, h: np.ndarray) -> np.int16:
    """Compute Σ h[k]·x[k] in Q15 with full precision accumulation."""
    raw_q30 = dot_q15(x, h) # int64, Q30
    q30_arr = np.array([raw_q30], dtype=np.int64)
    q15_arr = shr_i64(q30_arr, 15) # normalise to Q15
    # Saturate to int16 range
    return np.int16(min(max(q15_arr[0], -32768), 32767))
h = np.array([4096, 8192, 16384, 8192, 4096], dtype=np.int16) # symmetric
x = np.array([8192, 8192, 16384, 8192, 4096], dtype=np.int16) # delay line
y = fir_sample(x, h)
print(y, '→', y / 32768)
Stateful accumulators
AccQ15 and AccQ8 maintain a running sum across calls. Use madd for
multiply-accumulate (the MAC operation at the heart of every FIR filter).
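import numpy as np
from doppler.arith import AccQ15
# Hypothetical stand-ins so the loop below runs; substitute your own data source
blocks = [np.full(4, 8192, dtype=np.int16), np.full(4, 4096, dtype=np.int16)]
def get_coeffs_for_block():
    return np.full(4, 16384, dtype=np.int16) # 0.5 in Q15
# --- Running MAC: feed one block at a time ---
mac = AccQ15()
for block in blocks: # streaming blocks of int16
    coeffs = get_coeffs_for_block() # matched filter
    mac.madd(block, coeffs) # acc += Σ block[i] × coeffs[i]
raw_sum = mac.dump() # int64, Q30; resets accumulator
# Normalise to Q15 scalar
q15_out = np.int16(np.clip(
    (raw_sum + (1 << 14)) >> 15, -32768, 32767
))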
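To see why block-wise accumulation gives the same answer as one long dot product, here is a pure-Python stand-in. `RefAcc` is a hypothetical class that only mimics the madd and dump pattern described above; it makes no claim about how AccQ15 is implemented.

```python
class RefAcc:
    """Hypothetical wide accumulator: exact integer running sum of products."""

    def __init__(self) -> None:
        self.total = 0

    def madd(self, xs, cs) -> None:
        self.total += sum(x * c for x, c in zip(xs, cs))

    def dump(self) -> int:
        total, self.total = self.total, 0   # return the sum and reset
        return total

blocks = [[8192] * 4, [4096] * 4]   # 0.25 and 0.125 in Q15
coeffs = [16384] * 4                # 0.5 in Q15

mac = RefAcc()
for block in blocks:
    mac.madd(block, coeffs)
raw_sum = mac.dump()

# Round half-up, shift Q30 back to Q15, clamp to int16.
q15_out = max(-32768, min(32767, (raw_sum + (1 << 14)) >> 15))
print(raw_sum, q15_out, q15_out / 32768)   # 805306368 24576 0.75
```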
Q8 vs Q15: when to choose which
| Criterion | Q8 (int8) | Q15 (int16) |
|---|---|---|
| Resolution | 7 fraction bits (~0.8% of full scale) | 15 fraction bits (~0.003%) |
| SQNR (quantisation-noise limit) | ~48 dB | ~96 dB |
| Memory / bandwidth | 1 byte/sample | 2 bytes/sample |
| AVX2 throughput | 32 int8 per register | 16 int16 per register |
| Typical use | rough quantization, neural-net weights | audio, SDR, precision DSP |
A 6 dB/bit rule of thumb: each additional bit buys ~6 dB of signal-to-quantisation-noise ratio (SQNR).
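The rule of thumb can be checked empirically by quantising a near-full-scale sine and measuring the error power. This sketch assumes round-half-up quantisation and compares against the textbook estimate 6.02·N + 1.76 dB for a full-scale sine, which is why the measured figures sit a little above the table's rounded 48 dB and 96 dB. The function name and the test frequency are arbitrary choices.

```python
import math

def measured_sqnr_db(bits: int, samples: int = 1 << 14) -> float:
    """Quantise a sine to `bits` signed bits and return signal/error power in dB."""
    scale = 1 << (bits - 1)
    amplitude = (scale - 1) / scale          # stay just inside the positive limit
    signal_power = 0.0
    error_power = 0.0
    for i in range(samples):
        x = amplitude * math.sin(2 * math.pi * 0.1234567 * i)
        q = math.floor(x * scale + 0.5) / scale   # round half-up to the grid
        signal_power += x * x
        error_power += (x - q) ** 2
    return 10 * math.log10(signal_power / error_power)

for bits in (8, 16):
    estimate = 6.02 * bits + 1.76
    print(bits, round(measured_sqnr_db(bits), 1), round(estimate, 1))
```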
11. Summary: the five rules
- Know your Q. Track the binary point through every operation. If two operands have different Q, align them before adding.
- Widen on multiply. Q15 × Q15 = Q30. Keep the product in at least int32 before shifting back. Use int64 for accumulators.
- Round, don't truncate. Add 2^(n−1) before the right-shift to get round-half-up. It costs one addition; it buys nearly unbiased output.
- Saturate at the output. Carry full precision through the pipeline; only clamp at the final output word. Intermediate saturation destroys precision.
- Size your accumulator. Accumulating k products of n-bit width needs n + ⌈log₂ k⌉ bits. For k = 1024 Q15 products (30-bit Q30 values) that is 40 bits: int32 overflows; int64 is safe.
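As an illustration of the first rule, the sketch below adds a Q1.7 value to a Q1.15 value by first moving both onto the same binary point. The helper `align_q7_to_q15` is hypothetical; widening by a left shift is exact because it only appends zero fraction bits.

```python
def align_q7_to_q15(raw_q7: int) -> int:
    """Re-express a Q1.7 raw value on the Q1.15 grid (8 more fraction bits)."""
    return raw_q7 << 8

a_q7 = 64        # 0.5 in Q1.7  (64 / 128)
b_q15 = 8192     # 0.25 in Q1.15 (8192 / 32768)

wrong = a_q7 + b_q15                    # mixes two different scales
right = align_q7_to_q15(a_q7) + b_q15   # both operands now in Q1.15

print(wrong / 32768)   # 0.251953125, not a meaningful sum
print(right / 32768)   # 0.75
```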
12. Further reading
- doppler.arith module: add_q15, sub_q15, mul_q15, dot_q15, shl_q15, shr_q15, AccQ15, AccQ8, and Q8 counterparts.
- Quantization design note: doppler's encoding conventions for the cvt module (Q15 ↔ float, UQ15, I16).
- ADC gallery walkthrough: 3–8 bit quantisation noise visualised: time-domain staircase and 6 dB/bit noise-floor descent.
- HBDecimQ15 gallery walkthrough: Q15 halfband decimator: the _mm256_madd_epi16 inner loop and symmetric-fold trick.