Getting Started with Fixed-Point Arithmetic

Fixed-point arithmetic is floating-point with the binary point nailed to one place. You trade dynamic range for speed, determinism, and hardware compatibility. This guide builds intuition from first principles, then shows each concept applied with doppler.arith.

1. The implied binary point

A binary integer has bits with place values 2⁰, 2¹, 2², …. Fixed-point shifts those place values left or right by agreeing — in software — that a certain bit position is "the one's place." The agreement is implicit: the hardware stores plain integers; your code knows the scale.

Bit index:     7    6    5    4    3     2     1     0
Place value: −128   64   32   16    8     4     2     1     ← ordinary int8_t (bit 7 carries the sign)

Bit index:     7    6    5    4    3     2     1     0
Place value:  −1    ½    ¼    ⅛   1/16  1/32  1/64  1/128  ← if the point sits just BELOW bit 7
               ↑
            bit 7 is the sign bit (weight −1);
            the other seven bits are fractional

The Q-format notation makes the agreement explicit.

2. Q notation — Qm.n

Qm.n means: m bits for the signed integer part (including the sign bit), n bits for the fractional part, stored in an (m+n)-bit two's complement integer.

Format	Stored type	Integer bits	Fraction bits	Range	Resolution (LSB)
Q8 (Q1.7)	`int8_t`	1	7	[−1, +1 − 2⁻⁷]	2⁻⁷ ≈ 0.0078
Q15 (Q1.15)	`int16_t`	1	15	[−1, +1 − 2⁻¹⁵]	2⁻¹⁵ ≈ 3.05 × 10⁻⁵

The real value of a raw integer k stored in Qm.n is:

real_value = k × 2⁻ⁿ

So k = 16384 in Q15 represents 16384 × 2⁻¹⁵ = 0.5.

import numpy as np
from doppler.arith import add_q15

# Encode 0.5 as Q15
half_q15 = np.int16(16384)     # 0.5 × 32768

# Verify: raw integer divided by scale
print(half_q15 / 32768)        # 0.5

Why not reach +1?

Two's complement gives 2ⁿ negative values but only 2ⁿ − 1 positive values, because zero takes one of the non-negative codes. int16_t covers −32768 to +32767, so Q15 spans [−1, +32767/32768]. The missing +1 is a fundamental property of the representation, not a bug.
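
The range and resolution of any signed Qm.n format follow from the same reasoning. The sketch below is plain Python and does not use doppler.arith; the helper name q_range is hypothetical, and m is assumed to include the sign bit as in the table above.

```python
def q_range(m: int, n: int) -> tuple[float, float, float]:
    """Return (min, max, resolution) of a signed Qm.n format.

    m counts the sign bit, matching the convention used in this guide.
    """
    total_bits = m + n
    k_min = -(1 << (total_bits - 1))      # most negative raw integer
    k_max = (1 << (total_bits - 1)) - 1   # most positive raw integer
    lsb = 2.0 ** -n                       # real value of one raw step
    return k_min * lsb, k_max * lsb, lsb


lo15, hi15, lsb15 = q_range(1, 15)
print(lo15, hi15, lsb15)   # -1.0 0.999969482421875 3.0517578125e-05
```

The upper limit falls exactly one LSB short of +1, which is the asymmetry described above.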

3. Scaling — converting float to fixed-point

To encode a real value x in Qm.n, multiply by 2ⁿ and round:

k = round(x × 2ⁿ)

To decode back to float, divide by 2ⁿ:

x = k / 2ⁿ

# Float → Q15 → float round-trip
x = 0.707           # -3 dBFS
scale = 32768       # 2^15

k = np.int16(round(x * scale))     # 23167
recovered = k / scale               # 0.7070007... (quantisation error ≤ ½ LSB)

# Float → Q8 → float
scale8 = 128        # 2^7
k8 = np.int8(round(x * scale8))    # 90
recovered8 = k8 / scale8           # 0.703125  (larger error: only 7 fraction bits)

The difference between x and recovered is the quantisation error, bounded by ½ LSB when rounding (one full LSB when truncating).
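
The ½ LSB bound can be checked directly. This is a minimal plain-Python sketch; to_fixed and to_float are hypothetical helper names, not part of doppler.arith. Note that Python's built-in round() resolves exact ties to the nearest even integer, which still satisfies the ½ LSB bound.

```python
def to_fixed(x: float, n: int, bits: int) -> int:
    """Encode x with n fraction bits, clamped to a signed `bits`-bit integer."""
    k = round(x * (1 << n))
    k_min, k_max = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(k_min, min(k_max, k))


def to_float(k: int, n: int) -> float:
    """Decode a raw integer with n fraction bits back to a real value."""
    return k / (1 << n)


x = 0.707
for n, bits in ((15, 16), (7, 8)):
    k = to_fixed(x, n, bits)
    err = to_float(k, n) - x
    half_lsb = 0.5 / (1 << n)
    print(n, k, abs(err) <= half_lsb)   # 15 23167 True, then 7 90 True
```

The clamp in to_fixed also handles inputs such as +1.0, which would otherwise encode to 32768 and fall outside int16.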

4. Range and representable values

The representable range of a Qm.n format follows directly from the integer range of its storage type, scaled by 2⁻ⁿ.

Format	Storage	Integer range	Real range
Q8	`int8_t`	[−128, 127]	[−1.0, +127/128]
Q15	`int16_t`	[−32768, 32767]	[−1.0, +32767/32768]

Any real value outside this range cannot be represented — it overflows.

5. Overflow vs saturation

Overflow happens when an arithmetic result exceeds the storage type's range. In two's complement the bits wrap: the result "rolls over" and the sign can flip without warning.

Q15 example: 0.75 + 0.75 = 1.5  (out of range)

  0.75 = 0x6000 =  24576
+ 0.75 = 0x6000 =  24576
────────────────────────
         0xC000 = -16384  ← int16 overflow!  Wrong sign.

Saturation clamps the result at the representable limits instead:

0.75 + 0.75  →  saturate  →  0x7FFF = 32767 = +0.9999...

doppler.arith saturates all add, subtract, multiply, and shift operations.

import numpy as np
from doppler.arith import add_q15

Q15_MAX = np.int16(32767)

a = np.array([Q15_MAX], dtype=np.int16)
b = np.array([1],       dtype=np.int16)

print(add_q15(a, b)[0])   # 32767 — clamped, not wrapped
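
To see both behaviours side by side without the library, the following plain-Python sketch models a wrapping 16-bit add and a saturating one. wrap16 and sat16 are hypothetical helper names used only for illustration.

```python
def wrap16(v: int) -> int:
    """Reduce v modulo 2**16 into the int16 range, as two's complement storage does."""
    return ((v + 0x8000) & 0xFFFF) - 0x8000


def sat16(v: int) -> int:
    """Clamp v to the int16 range."""
    return max(-32768, min(32767, v))


a = b = 24576            # 0.75 in Q15
print(wrap16(a + b))     # -16384  (wrapped: sign flipped)
print(sat16(a + b))      # 32767   (saturated: clipped at full scale)
```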

When NOT to saturate

Integrating accumulators in CIC filters intentionally rely on two's complement wrap-around: the overflow at each integrator cancels in the subsequent comb section. AccQ15 and AccQ8 use a wider integer accumulator (int64_t and int32_t respectively) instead of Q-format saturation, so the partial sums stay exact until you dump() them.
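
The cancellation can be demonstrated with a deliberately narrow word. The sketch below is an illustrative single-stage integrator plus comb in 8-bit wrapping arithmetic (plain Python; wrap8 and DELAY are hypothetical names, and this is not how doppler implements its filters). The integrator overflows repeatedly, yet the comb output is exact because the true result fits in 8 bits.

```python
def wrap8(v: int) -> int:
    """Wrap v into the signed 8-bit range (two's complement modulo 2**8)."""
    return ((v + 128) & 0xFF) - 128


DELAY = 2                      # comb differential delay (illustrative)
x = [50] * 8                   # constant input; the running sum soon exceeds 127

# Integrator: running sum with deliberate wrap-around
integ, acc = [], 0
for sample in x:
    acc = wrap8(acc + sample)
    integ.append(acc)

# Comb: y[n] = integ[n] - integ[n - DELAY], also wrapped
comb = [wrap8(integ[i] - integ[i - DELAY]) for i in range(DELAY, len(x))]

print(integ)   # [50, 100, -106, -56, -6, 44, 94, -112]
print(comb)    # [100, 100, 100, 100, 100, 100]
```

Every comb output equals 50 + 50 = 100, the sum of the last two inputs, even though the integrator state wrapped twice along the way. Saturating the integrator instead would break this cancellation.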

6. Bit growth on arithmetic operations

Every arithmetic operation changes the number of bits you need to represent the exact result.

6a. Addition and subtraction

Adding two n-bit numbers requires n+1 bits to hold every possible result without overflow. In general, adding k numbers of n bits each requires n + ⌈log₂ k⌉ bits.

Example: adding 4 Q15 values

  Each value fits in 16 bits.
  Exact sum needs 16 + ⌈log₂ 4⌉ = 16 + 2 = 18 bits.
  Accumulating 1024 Q15 values needs 16 + 10 = 26 bits.
  Accumulating 2³² values needs 16 + 32 = 48 bits (fits in int64).

from doppler.arith import AccQ15

acc = AccQ15()                          # int64_t internal accumulator
data = np.ones(1024, dtype=np.int16) * 32767   # all at full scale
acc.steps(data)
print(acc.dump())   # 33,553,408 — far outside int16 range, exact in int64
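
The n + ⌈log₂ k⌉ rule is easy to compute with integers only. In this plain-Python sketch, sum_bits is a hypothetical helper; (k − 1).bit_length() is used as an exact integer form of ⌈log₂ k⌉ for k ≥ 1.

```python
def sum_bits(n_bits: int, k: int) -> int:
    """Bits needed to hold the exact sum of k signed n_bits-wide values."""
    guard = (k - 1).bit_length()     # integer form of ceil(log2(k)) for k >= 1
    return n_bits + guard


for k in (4, 1024, 2**32):
    print(k, sum_bits(16, k))        # 18, 26 and 48 bits respectively

# Check the k = 1024 case against the worst-case sums
worst_neg = 1024 * -32768
worst_pos = 1024 * 32767
limit = 1 << (sum_bits(16, 1024) - 1)
print(-limit <= worst_neg and worst_pos < limit)   # True
```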

6b. Multiplication

Multiplying a Qm.n value by a Qp.q value produces a Q(m+p).(n+q) result. The product needs m+n+p+q bits in total: m+p for the integer part and n+q for the fraction part.

Q15 × Q15 → Q1.15 × Q1.15 → Q2.30

  Each operand: 1 sign + 15 fraction = 16 bits
  Product:      2 integer (incl. sign) + 30 fraction = 32 bits (fits in int32; use int64 when accumulating)
  To return to Q15: shift right 15, saturate to int16

The 15-bit right shift is what mul_q15 does after rounding:

from doppler.arith import mul_q15

half = np.array([16384], dtype=np.int16)    # 0.5 in Q15
quarter = mul_q15(half, half)               # 0.5 × 0.5 = 0.25
print(quarter[0])                           # 8192  (= 0.25 × 32768)
print(quarter[0] / 32768)                   # 0.25
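
The multiply, round, shift, saturate sequence can be written out on plain Python integers. This is an illustrative sketch of the steps described above, not doppler's implementation; mul_q15_sketch is a hypothetical name.

```python
def mul_q15_sketch(a: int, b: int) -> int:
    """Q15 x Q15 -> Q15 with round-half-up and saturation (illustrative only)."""
    product = a * b                          # Q2.30, fits in 32 bits
    rounded = (product + (1 << 14)) >> 15    # add half an output LSB, then drop 15 bits
    return max(-32768, min(32767, rounded))  # clamp to int16


print(mul_q15_sketch(16384, 16384))     # 8192   (0.5 × 0.5 = 0.25)
print(mul_q15_sketch(23170, 23170))     # 16383  (0.7071² ≈ 0.5)
print(mul_q15_sketch(-32768, -32768))   # 32767  ((−1) × (−1) = +1 saturates)
```

The last case is the only operand pair whose product overflows Q15, which is why the saturation step cannot be skipped.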

6c. The dot product

dot_q15(a, b) returns the raw Q30 accumulation as int64 — no shift, no saturation. This lets you inspect the full precision before deciding how to normalise.

from doppler.arith import dot_q15, shr_i64

a = np.full(4, 16384, dtype=np.int16)   # [0.5, 0.5, 0.5, 0.5]
b = np.full(4, 16384, dtype=np.int16)

raw = dot_q15(a, b)          # Q30: 4 × (0.5 × 0.5 × 2³⁰) = 2³⁰ = 1_073_741_824
print(raw)                   # 1073741824

# Normalise to Q15 scaling: shift right 15
q15_scalar = shr_i64(np.array([raw], dtype=np.int64), 15)
print(q15_scalar[0])         # 32768  (= 1.0 at Q15 scale: Σ 0.5² = 1.0; one past int16 max, so saturate before storing as int16)
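
The same accumulate-then-normalise flow, including the final clamp, can be sketched on plain Python integers. The names dot_q15_sketch and q30_to_q15 are hypothetical, and the rounding step is an assumption made for this illustration.

```python
def dot_q15_sketch(a, b) -> int:
    """Sum of raw products, kept as an unshifted Q30 integer."""
    return sum(x * y for x, y in zip(a, b))


def q30_to_q15(acc: int) -> int:
    """Round-half-up shift by 15, then clamp to the int16 range."""
    return max(-32768, min(32767, (acc + (1 << 14)) >> 15))


a = [16384] * 4             # four copies of 0.5
raw = dot_q15_sketch(a, a)
print(raw)                  # 1073741824  (2**30, i.e. 1.0 in Q30)
print(raw >> 15)            # 32768       (1.0 at Q15 scale, one past the int16 maximum)
print(q30_to_q15(raw))      # 32767       (saturated)
```

Keeping the raw Q30 value until the end is what lets the caller notice that the true result is exactly 1.0 before it is clipped.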

7. Required accumulator width (no precision loss)

The accumulator must be wide enough to hold the sum of all products before any normalising shift.

Operation	Operand type	Product width	Max terms	Min accumulator
`dot_q15`	Q15 (int16)	Q30 (int32)	< 2³³	int64
`dot_q8`	Q8 (int8)	Q14 (int16)	< 2¹⁷	int32
`AccQ15.madd`	Q15	Q30	< 2³³	int64
`AccQ8.madd`	Q8	Q14	< 2¹⁷	int32

The guard-bits formula:

accumulator bits needed = product bits + ⌈log₂(max_terms)⌉

For dot_q15 with n=1024:
  product bits = 32 (Q15×Q15 → Q2.30, fits in int32)
  guard bits   = ⌈log₂ 1024⌉ = 10
  total needed = 42 bits → use int64 (64 bits, plenty of headroom)

For dot_q8 with n=1024:
  product bits = 16 (Q8×Q8 → Q2.14, fits in int16)
  guard bits   = 10
  total needed = 26 bits → use int32 (32 bits)
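
The term limits can also be derived exactly from the worst-case product rather than from a bit count. In this plain-Python sketch, max_terms is a hypothetical helper; it assumes every term takes the largest possible value, (−2^(b−1))², for b-bit operands.

```python
def max_terms(operand_bits: int, acc_bits: int) -> int:
    """Largest term count whose worst-case sum of products still fits the accumulator."""
    worst_product = (1 << (operand_bits - 1)) ** 2    # e.g. (-32768)**2 = 2**30 for int16
    acc_max = (1 << (acc_bits - 1)) - 1               # largest positive accumulator value
    return acc_max // worst_product


print(max_terms(16, 64))   # 8589934591  (2**33 - 1)
print(max_terms(8, 32))    # 131071      (2**17 - 1)
print(max_terms(16, 32))   # 1
```

The last line shows why a 32-bit accumulator is not an option for Q15 dot products: it can hold only a single worst-case product.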

8. Truncation vs rounding

After a multiply or shift you must decide how to discard the fractional bits you are throwing away.

Mode	Rule	Error range	Bias
Truncation (floor)	`x >> n`	(−1 LSB, 0]	negative
Round half-up	`(x + 2^(n−1)) >> n`	(−½ LSB, +½ LSB]	slight positive
Round half-even (banker's)	round to nearest even	[−½ LSB, +½ LSB]	none

doppler.arith uses round-half-up throughout: the bias of ½ LSB is added before the shift. This is the standard choice for DSP and matches the behaviour of most hardware DSPs.

Q15 multiply, round-half-up:

  raw product = a × b                      (int32, Q30)
  bias        = 2^14 = 16384
  rounded     = (raw + bias) >> 15         (int16, Q15)

Truncation is faster but introduces a systematic downward bias on all results. In a FIR filter, truncating the output adds a DC offset of about −0.5 LSB to every output sample.

from doppler.arith import shr_q15

# 24577 is odd, so halving it lands exactly between two Q15 codes (12288.5)
a = np.array([24577], dtype=np.int16)  # ≈ 0.75003 in Q15

# shr_q15 uses round-half-up: (24577 + 1) >> 1 = 12289
print(shr_q15(a, 1)[0])                # 12289  (tie rounded up)

# Manual truncation (floor, plain arithmetic right shift):
print(np.int16(np.int32(24577) >> 1))  # 12288  (tie rounded down: biased low)
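
The bias column of the table can be measured by averaging the error of each mode over a full set of inputs. This is a plain-Python sketch; the three shr_* function names are hypothetical and are not the doppler.arith API.

```python
def shr_trunc(x: int, n: int) -> int:
    """Arithmetic right shift: rounds toward negative infinity."""
    return x >> n


def shr_half_up(x: int, n: int) -> int:
    """Add half an output LSB, then shift."""
    return (x + (1 << (n - 1))) >> n


def shr_half_even(x: int, n: int) -> int:
    """Round to nearest; exact ties go to the even result."""
    q, r = x >> n, x & ((1 << n) - 1)
    half = 1 << (n - 1)
    if r > half or (r == half and q & 1):
        q += 1
    return q


n = 4
values = range(-1024, 1024)      # every 4-bit remainder occurs equally often
for name, fn in (("trunc", shr_trunc), ("half-up", shr_half_up), ("half-even", shr_half_even)):
    mean_err = sum(fn(v, n) - v / (1 << n) for v in values) / len(values)
    print(name, mean_err)        # trunc -0.46875, half-up 0.03125, half-even 0.0
```

With only 4 discarded bits the half-up bias (1/32 LSB) is still visible; it halves with each additional discarded bit, while the truncation bias approaches −½ LSB.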

9. Saturation arithmetic — a closer look

Saturation replaces the modular wrap of two's complement with a clamp. Every add, sub, mul, and shl in doppler.arith saturates.

Why shifts can overflow too

A left shift multiplies by a power of two. Shifting a Q15 value left by 1 doubles it, which can overflow:

from doppler.arith import shl_q15

a = np.array([24576, -24576], dtype=np.int16)   # ±0.75

# Without saturation: 24576 << 1 = 49152 > 32767 → wraps to -16384
# With saturation:                            → clamps to 32767
print(shl_q15(a, 1))   # [ 32767, -32768]

The sign-flip hazard

Without saturation, overflow is invisible — the number looks valid but has the wrong sign. Saturation makes clipping audible/visible without the far worse artefact of a full sign flip.

10. Working with the doppler.arith API

Module-level functions (stateless)

All stateless operations take NumPy arrays and return a new array (or scalar).

import numpy as np
from doppler.arith import (
    add_q15, sub_q15, mul_q15, dot_q15,
    shl_q15, shr_q15,
    shr_i64,
)

# --- Q15 example: FIR output sample ---
# Compute one output sample of a symmetric FIR with coefficient vector h
# (Q15) applied to a delay line x (Q15).
def fir_sample(x: np.ndarray, h: np.ndarray) -> np.int16:
    """Compute Σ h[k]·x[k] in Q15 with full precision accumulation."""
    raw_q30 = dot_q15(x, h)                    # int64, Q30
    q30_arr = np.array([raw_q30], dtype=np.int64)
    q15_arr = shr_i64(q30_arr, 15)             # normalise to Q15
    # Saturate to int16 range
    return np.int16(min(max(q15_arr[0], -32768), 32767))

h = np.array([4096, 8192, 16384, 8192, 4096], dtype=np.int16)  # symmetric
x = np.array([8192, 8192, 16384, 8192, 4096], dtype=np.int16)  # delay line
y = fir_sample(x, h)
print(y, '→', y / 32768)

Stateful accumulators

AccQ15 and AccQ8 maintain a running sum across calls. Use madd for multiply-accumulate (the MAC operation at the heart of every FIR filter).

from doppler.arith import AccQ15

# --- Running MAC: feed one block at a time ---
mac = AccQ15()

for block in blocks:                    # placeholder: your stream of int16 arrays
    coeffs = get_coeffs_for_block()     # placeholder: matched-filter taps for this block
    mac.madd(block, coeffs)             # acc += Σ block[i] × coeffs[i]

raw_sum = mac.dump()                    # int64, Q30; resets accumulator

# Normalise to Q15 scalar
q15_out = np.int16(np.clip(
    (raw_sum + (1 << 14)) >> 15, -32768, 32767
))

Q8 vs Q15 — when to choose which

Criterion	Q8 (int8)	Q15 (int16)
Resolution	7 fraction bits (~0.8% of full scale)	15 fraction bits (~0.003%)
SQNR (quantisation-noise limit)	~48 dB	~96 dB
Memory / bandwidth	1 byte/sample	2 bytes/sample
AVX2 throughput	32 int8 per register	16 int16 per register
Typical use	rough quantization, neural-net weights	audio, SDR, precision DSP

A 6 dB/bit rule of thumb: each additional bit buys ~6 dB of signal-to-quantisation-noise ratio (SQNR).
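
The rule of thumb can be checked numerically. The sketch below quantises a near-full-scale sine in plain Python and compares the measured SQNR with the standard full-scale-sine estimate of 6.02·N + 1.76 dB. The function name, test frequency, and sample count are arbitrary choices made for illustration.

```python
import math


def measured_sqnr_db(bits: int, num_samples: int = 20000) -> float:
    """Quantise a near-full-scale sine to `bits`-bit signed fixed point and measure SQNR."""
    scale = 1 << (bits - 1)
    amplitude = (scale - 1) / scale          # just below full scale, so nothing clips
    signal_power = noise_power = 0.0
    for i in range(num_samples):
        x = amplitude * math.sin(0.1234567 * i)   # arbitrary non-repeating phase step
        q = round(x * scale) / scale              # quantise, then decode
        signal_power += x * x
        noise_power += (x - q) ** 2
    return 10 * math.log10(signal_power / noise_power)


for bits in (8, 16):
    # measured value next to the 6.02*N + 1.76 dB estimate
    print(bits, round(measured_sqnr_db(bits), 1), round(6.02 * bits + 1.76, 1))
```

The two columns should agree to within roughly a dB, and the gap between the 8-bit and 16-bit rows should be close to 8 × 6 = 48 dB.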

11. Summary: the five rules

Know your Q. Track the binary point through every operation. If two operands have different Q, align them before adding.
Widen on multiply. Q15 × Q15 = Q30. Keep the product in at least int32 before shifting back. Use int64 for accumulators.
Round, don't truncate. Add 2^(n−1) before the right-shift to get round-half-up. It costs one addition; it removes the −½ LSB bias of truncation, leaving only a slight positive bias from ties.
Saturate at the output. Carry full precision through the pipeline; only clamp at the final output word. Intermediate saturation destroys precision.
Size your accumulator. Accumulating k values of n-bit width needs n + ⌈log₂ k⌉ bits. For k = 1024 Q15×Q15 products (32 bits each) that is 42 bits: int32 overflows; int64 is safe.
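
As an illustration of the first rule, the sketch below adds a Q1.7 value to a Q1.15 value by first moving both to the same binary point. It is plain Python with hypothetical helper names; widening the fraction by a left shift is exact, so no rounding is needed in this direction.

```python
def q7_to_q15(k7: int) -> int:
    """Re-express a Q1.7 raw integer in Q1.15 by appending 8 fraction bits (exact)."""
    return k7 << 8


def sat16(v: int) -> int:
    """Clamp v to the int16 range."""
    return max(-32768, min(32767, v))


a_q15 = 8192          # 0.25 in Q1.15
b_q7 = 64             # 0.5 in Q1.7
total = sat16(a_q15 + q7_to_q15(b_q7))
print(total, total / 32768)    # 24576 0.75
```

Adding the raw integers without alignment (8192 + 64) would give 8256, which reads as about 0.252 in Q15 and is simply wrong.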

12. Further reading

doppler.arith module — add_q15, sub_q15, mul_q15, dot_q15, shl_q15, shr_q15, AccQ15, AccQ8, and Q8 counterparts.
Quantization design note — doppler's encoding conventions for the cvt module (Q15 ↔ float, UQ15, I16).
ADC gallery walkthrough — 3–8 bit quantisation noise visualised: time-domain staircase and 6 dB/bit noise-floor descent.
HBDecimQ15 gallery walkthrough — Q15 halfband decimator: the _mm256_madd_epi16 inner loop and symmetric-fold trick.