Digital Down-Converter

A DDC shifts a signal from a carrier frequency to DC and optionally decimates it. The doppler C library provides dp_ddc_t plus the building blocks to assemble custom pipelines. This page documents the practical architectures, the trade-offs between them, and measured throughput so you can pick the right one.

Signal chain overview

                    ┌─────────────────────────────────────────────┐
  in (fs_in) ──────►│  NCO mix ──► [HB ÷2] ──► DPMFS resample   │──► out (fs_out)
                    └─────────────────────────────────────────────┘

Three stages, each optional or reorderable:

Stage	C type	Purpose
NCO mix	`dp_nco_t`	Multiply by e^{j2πf_n·t} — shift carrier to DC
Halfband ÷2	`dp_hbdecim_cf32_t`	Cheap factor-of-2 decimation
DPMFS resample	`dp_resamp_dpmfs_t`	Continuously-variable rate conversion

The default dp_ddc_create chains NCO + DPMFS with built-in M=3 N=19 Kaiser-DPMFS coefficients (passband ≤ 0.4·fs_out, stopband ≥ 0.6·fs_out, 60 dB rejection). A halfband stage can be inserted before or after to trade filter cost against DPMFS work.

Architecture A — Plain DDC (default)

CF32 in ──► NCO ──► DPMFS (0.4/0.6, M=3 N=19, rate r) ──► CF32 out

dp_ddc_create(norm_freq, num_in, rate) with default coefficients. No design step required. One allocation, no intermediate buffers.

Best for: prototyping, any decimation rate, single-stage simplicity.

Architecture B — Halfband → DDC (complex input)

CF32 in ──► HB ÷2 ──► NCO ──► DPMFS (0.4/0.6, M=3 N=19, rate 2r) ──► CF32 out

The halfband (N=19, 60 dB) decimates by 2 first. The DPMFS then runs on half the samples at twice the effective rate (same filter coefficients, same total output rate).

The halfband is cheap: at ~427 MSa/s for N=19 its throughput is far above that of the DPMFS, so pipeline time is dominated by DPMFS work on the num_in/2 samples that remain.

Best for: complex IQ input, decimation ≥ 2×. Dominant choice.

Architecture C — DDC (wide-band) → Halfband

CF32 in ──► NCO ──► DPMFS (0.2/0.8, M=3 N=7, rate 2r) ──► HB ÷2 ──► CF32 out

A 0.2/0.8 DPMFS (3× wider transition band) needs only N=7 taps instead of N=19. But it runs at rate 2r — double the output — and those extra samples must pass through the halfband. The wider intermediate buffer erases the tap-count savings in practice.

MAC analysis (per input sample):
  Architecture A  0.4/0.6 at r:     (M+1) × N × r   = 4 × 19 × r = 76·r
  Architecture C  0.2/0.8 at 2r:    (M+1) × N × 2r  = 4 ×  7 × 2r = 56·r
                                                                 ─ 26 % fewer MACs
  … but memory traffic for 2× intermediate buffer negates the saving.
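The arithmetic above can be reproduced with a few lines of C. This is only a sketch of the (M+1) × N × rate model used in this section; the function names are hypothetical and it says nothing about memory traffic, which is what decides the outcome in practice.

```c
#include <assert.h>

/* MACs per input sample for a DPMFS stage with polynomial order M and
 * N taps, running at output/input rate `rate`: (M+1) * N * rate. */
static double dpmfs_macs_per_input(int poly_order, int num_taps, double rate)
{
    assert(poly_order >= 0 && num_taps > 0 && rate > 0.0);
    return (double)(poly_order + 1) * (double)num_taps * rate;
}

/* Fractional MAC saving of Architecture C over A at total rate r.
 * C runs its DPMFS at 2r because the trailing halfband supplies the last /2. */
static double arch_c_mac_saving(double r)
{
    double macs_a = dpmfs_macs_per_input(3, 19, r);       /* 76 * r */
    double macs_c = dpmfs_macs_per_input(3, 7, 2.0 * r);  /* 56 * r */
    return (macs_a - macs_c) / macs_a;
}
```

The saving is independent of r (20/76, about 26 %), which is why the MAC count alone cannot explain the measured results.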

Best for: no common case; Architecture B is the better choice for complex input.

Architecture D — Real input: NCO → HB → DPMFS

Real in ──► NCO ──► HB ÷2 ──► DPMFS (0.4/0.6, M=3 N=19, rate 2r) ──► CF32 out

When the ADC produces real samples (not complex IQ), the DDC chain is different in two important ways.

NCO is 2× cheaper

Input type	NCO multiply	MACs/sample
Complex IQ	(I+jQ)(cos+j·sin)	4 mul + 2 add
Real	x·cos + j·x·sin	2 mul

The NCO must come first to convert real → complex.

HB does dual duty: decimation + image rejection

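A real input signal has a conjugate-symmetric spectrum: X(−f) = X*(f). After NCO mixing at frequency f_n, the desired signal lands at DC but a mirror image appears at −2f_n (modulo fs). The halfband's stopband (≥ 3·fs/8 for a standard 60 dB Kaiser design) covers this image when 2f_n, wrapped into [−fs/2, fs/2], has magnitude of at least 3·fs/8, that is for 3·fs/16 ≤ f_n ≤ 5·fs/16. In that range no separate image-rejection filter is needed. Outside it the image falls in the halfband's transition band or passband and is not fully rejected by this stage.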

For complex IQ input there is no real-input image — HB is only decimating. For real input it is decimating and filtering, which means the stage always earns its keep regardless of decimation rate.

Additional 2× saving in the HB (not yet implemented)

After the NCO, the I and Q channels are independently real:

I[n] = x[n] · cos(2πf_n·n)
Q[n] = x[n] · sin(2πf_n·n)

Applying the complex HB to this pair is identical to running two separate real halfbands. A real halfband costs N/4 MACs per output sample (halfband symmetry reduces N taps to N/2 effective, pre-summing reduces again to N/4).

Complex HB on general IQ:          2 × (N/2) real MACs = N   MACs/output
Two real HBs on NCO(real) output:  2 × (N/4) real MACs = N/2 MACs/output
                                                         ─ 2× cheaper

This optimisation requires a dp_hbdecim_r2cf32_t variant (real-pair input, complex output) that does not yet exist in the library.

Architecture D2 — Real input: zero-multiply band capture + fine NCO

Real in ──► Modified HB (fs/4 shift embedded) ──► Fine NCO (at fs/2) ──► DPMFS ──► CF32 out
              zero extra multiplications              arbitrary carrier tune
              entire [0, fs/2] band preserved         2 MACs per original input sample

This is the optimal architecture for any real ADC input, not a special case. It beats Architecture D in two places simultaneously and requires no fixed IF.

Step 1 — Band capture: real → complex via modified halfband

Mixing by fs/4 then decimating by 2 is a real-to-complex conversion that keeps the whole band. A real signal at sample rate fs has its unique content in [0, fs/2]; the fs/4 down-shift maps [0, fs/2] to [−fs/4, +fs/4], which is exactly the span a complex signal at fs/2 can carry. Every carrier in [0, fs/2] is therefore present in the complex output at fs/2, although those close to 0 and fs/2 fall in the halfband's transition band.

The fs/4 mix multiplies by:

e^{−j·(π/2)·n}  =  { 1,  −j,  −1,  +j,  1,  −j,  −1,  +j, … }

No multiplications — only sign negations.
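A sketch of that mix in C, assuming a hypothetical cf32_t sample type and function name (neither is a doppler API). Each output is a copy or a negation of the input routed to I or Q, so no multiplier is involved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the library's complex sample type. */
typedef struct { float i, q; } cf32_t;

/* Multiply real x[n] by e^{-j (pi/2) n} = {1, -j, -1, +j, ...}. */
static void mix_fs4_real(const float *x, size_t n, cf32_t *out)
{
    assert(x != NULL && out != NULL);
    for (size_t k = 0; k < n; ++k) {
        switch (k & 3u) {
        case 0:  out[k].i =  x[k]; out[k].q =  0.0f; break;  /* x * (+1) */
        case 1:  out[k].i =  0.0f; out[k].q = -x[k]; break;  /* x * (-j) */
        case 2:  out[k].i = -x[k]; out[k].q =  0.0f; break;  /* x * (-1) */
        default: out[k].i =  0.0f; out[k].q =  x[k]; break;  /* x * (+j) */
        }
    }
}
```

Half of the I and Q values are structurally zero, which is what lets the mix be folded into the halfband taps instead of being run as a separate pass.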

Embedding in the halfband (no extra coefficient loads)

The combined mix-then-halfband output at decimation index m:

z[m] = Σ_k  h[k] · x[2m−k] · e^{−j·(π/2)·(2m−k)}

      = e^{−jπm} · Σ_k  [h[k] · e^{j·(π/2)·k}]  · x[2m−k]
            ↑                       ↑
      output correction        g[k] = h[k] · e^{j·(π/2)·k}
      (±1 alternating, free)   precomputed at construction

The per-tap rotation applied to the halfband polyphase branches:

Branch	Taps	Rotation	Result
FIR branch	even k = 0, 2, 4, …	`{+1, −1, +1, −1, …}`	sign-flip every other tap, encoded once
Delay branch	odd k = 1, 3, 5, …	`{+j, −j, +j, …}`	pure imaginary → Q channel

The delay branch has only one non-zero tap (centre, gain 0.5) by the halfband property. So Q is a single delayed real sample — one multiply. I is the FIR branch with alternating-sign coefficients — N/4 real MACs.

No extra coefficient loads at runtime. The sign flips are baked into h_fir_modified[] at construction. The output correction e^{−jπm} = (−1)^m is one sign flip per output — free on any SIMD unit.

Step 2 — Fine NCO at fs/2 (arbitrary carrier)

The complex output of the modified halfband contains the full [0, fs/2] band, shifted so that the original fs/4 sits at DC. A conventional NCO running at fs/2 shifts any carrier within that band to DC:

fine_freq = (carrier_freq − fs/4) / (fs/2)     ← residual offset, normalised to output rate

This NCO is complex-input/complex-output (4 multiplies + 2 adds per sample), but it runs at fs/2, so it costs the equivalent of 2 MACs per original input sample. That matches Architecture D's real-input NCO, which spends 2 MACs on every sample at the full rate.

Cost comparison: D vs D2

Per original input sample (N=19 halfband FIR branch):

Stage	Arch D	Arch D2
Full-rate NCO	2 MACs (real→complex)	none
Halfband	N/2 MACs (complex HB on NCO output)	N/4 MACs (real modified HB)
Fine NCO at fs/2	none	4 MACs × ½ = 2 MACs
Total	2 + N/2 ≈ 11.5	N/4 + 2 ≈ 6.75

Architecture D2 is approximately 1.7× cheaper than Architecture D for real input at N=19, approaching 2× as N grows, regardless of carrier frequency or decimation rate.

What a `dp_hbdecim_r2cf32_t` would do

Construction:
  h_fir_modified[k] = h_fir[k] · (−1)^k      ← sign-flip every other tap

Per output sample:
  I[m] = Σ_k  h_fir_modified[k] · x[2m − 2k]   ← N/4 real MACs
  Q[m] = 0.5 · x[2m − centre_offset]            ← 1 multiply
  apply (−1)^m to both I and Q                  ← 1 sign flip each

Output: CF32 at fs/2, full [0, fs/2] band with the original fs/4 moved to DC

Followed by a fine NCO (at fs/2) then DPMFS for remaining decimation.
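The kernel outlined above can be sketched in plain C. Everything here is an assumption made for illustration: the name hb_r2c_fs4, the cf32_t type, and the 7-tap halfband {−1, 0, 9, 16, 9, 0, −1}/32 standing in for the library's N=19 design. Samples outside the block are treated as zero, so no state is carried between calls. Note that the sign of the Q gain depends on the centre-tap index: for a centre at index 3 the rotation is j³ = −j, giving −0.5.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float i, q; } cf32_t;   /* hypothetical complex sample type */

/* Even-index taps of the example halfband h = {-1, 0, 9, 16, 9, 0, -1} / 32. */
static const float H_FIR[4] = { -1.0f / 32, 9.0f / 32, 9.0f / 32, -1.0f / 32 };
#define CENTRE 3   /* index of the 0.5 centre tap in the full 7-tap filter */

static float sample_at(const float *x, size_t n, long idx)
{
    return (idx >= 0 && (size_t)idx < n) ? x[idx] : 0.0f;  /* zero outside block */
}

/* Real input at fs, complex output at fs/2, fs/4 down-shift folded into the taps. */
static size_t hb_r2c_fs4(const float *x, size_t n, cf32_t *out)
{
    float h_mod[4];
    assert(x != NULL && out != NULL);
    for (int k = 0; k < 4; ++k)              /* construction: j^(2k) = (-1)^k */
        h_mod[k] = (k & 1) ? -H_FIR[k] : H_FIR[k];
    /* j^CENTRE is +j when CENTRE % 4 == 1 and -j when CENTRE % 4 == 3. */
    const float q_gain = ((CENTRE & 3) == 1) ? 0.5f : -0.5f;

    size_t n_out = n / 2;
    for (size_t m = 0; m < n_out; ++m) {
        float acc = 0.0f;
        for (int k = 0; k < 4; ++k)          /* FIR branch: real MACs only */
            acc += h_mod[k] * sample_at(x, n, (long)(2 * m) - 2 * k);
        float q = q_gain * sample_at(x, n, (long)(2 * m) - CENTRE);
        float sign = (m & 1) ? -1.0f : 1.0f; /* e^{-j pi m} = (-1)^m */
        out[m].i = sign * acc;
        out[m].q = sign * q;
    }
    return n_out;
}
```

The result matches an explicit fs/4 mix followed by the unmodified halfband and a ÷2 decimation. This sketch does not exploit the (anti)symmetry of the modified taps, so it spends more MACs per output than the count quoted above.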

Architecture E — Coarse/fine NCO split (high decimation)

Real/CF32 in ──► Complex DPMFS (coarse NCO baked in) ──► Fine NCO ──► CF32 out
                   h_rot[k] = h[k]·e^{-j2πf_coarse·k}   at fs_out

The idea

A full-rate NCO multiplies every input sample. For high decimation rates, most of that work is discarded by the filter. The coarse/fine split moves almost all of the NCO work to the output rate, which can be 100× lower.

Derivation. Polyphase DDC with decimation D and prototype taps h[k]:

y[m] = Σ_k  h[k] · x[mD-k] · e^{j2πf_c(mD-k)}

      = e^{j2πf_c·mD}  ·  Σ_k  [h[k]·e^{-j2πf_c·k}]  ·  x[mD-k]
           ↑                              ↑
     fine NCO (once per output)     h_rot: complex filter taps
     runs at fs_out = r·fs_in             precomputed, static between retunes

The prototype filter h[k] is rotated tap-by-tap by e^{-j2πf_c·k} to produce complex coefficients h_rot[k]. The per-output phase correction e^{j2πf_c·mD} is a trivial NCO at the output sample rate — 100× slower than the input rate for 100× decimation.

For the coarse/fine split, let f_c = f_coarse + f_fine:

f_coarse is quantized to a grid (e.g. a multiple of fs_out, or any value that changes only on channel reassignment). The rotated filter h_rot is recomputed once per retune.
f_fine is the residual — a full-precision NCO at fs_out correcting the quantisation error or tracking a slowly drifting carrier.

The DPMFS representation handles this directly: rotate the prototype h[k] before calling fit_dpmfs. The polynomial fit naturally absorbs the complex rotation; the resulting c0/c1 arrays are complex float32.

Cost analysis

Per input sample, real input, HB → DPMFS pipeline:

Architecture B (HB → NCO@fs_in/2 → real DPMFS):
  NCO cost:     4 MACs/sample × N_in/2  =  2 · N_in  MACs
  DPMFS cost:   76 MACs/output × N_out  = 76 · N_in · r  MACs
  Total:        2 + 76r  MACs/input

Architecture E (HB → complex DPMFS, fine NCO@fs_out):
  DPMFS cost:   152 MACs/output × N_out = 152 · N_in · r  MACs
  Fine NCO:     ~6 MACs/output          ≈  6 · N_in · r  MACs
  Total:        158r  MACs/input

Break-even: 2 + 76·r = 158·r → 82·r = 2 → r ≈ 1/41 → D ≈ 41×

Decimation	Arch B	Arch E	Δ
4×	21.0	39.5	−88% (B wins)
10×	9.6	15.8	−65% (B wins)
38×	4.0	4.2	−4% (near break-even)
50×	3.5	3.2	+10% (E wins)
100×	2.8	1.6	+43% (E wins)
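The table rows follow directly from the two totals in the cost analysis. A small sketch, with hypothetical function names, that evaluates both models and solves 2 + 76·r = 158·r for the crossover:

```c
#include <assert.h>

/* MACs per input sample versus total decimation D (rate r = 1/D). */
static double arch_b_macs(double decim)
{
    assert(decim > 0.0);
    return 2.0 + 76.0 / decim;   /* NCO at fs_in/2 plus real DPMFS */
}

static double arch_e_macs(double decim)
{
    assert(decim > 0.0);
    return 158.0 / decim;        /* complex DPMFS plus fine NCO at fs_out */
}

/* 2 + 76/D = 158/D  =>  D = (158 - 76) / 2 */
static double break_even_decim(void)
{
    return (158.0 - 76.0) / 2.0;
}
```

These are MAC counts only; they do not model memory traffic, which the measurements elsewhere on this page show can dominate.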

Retune cost

Retuning means recomputing h_rot[k] = h[k]·e^{-j2πf_coarse·k} and re-fitting c0/c1. For M=3 N=19 this touches (M+1)×N = 76 complex multiplies + the polynomial fit — a one-time cost amortised over every block until the next retune.

For continuous fine tuning (AFC, Doppler tracking) use f_coarse as a coarse channel assignment that rarely changes, and let f_fine absorb all rapid variation at output rate.

Implementation status

Requires a complex-coefficient variant of the DPMFS resampler. The existing dp_resamp_dpmfs_create takes const float *c0, *c1; a dp_resamp_dpmfs_cf32_create taking const dp_cf32_t *c0, *c1 would complete this architecture. The design tooling (Python fit_dpmfs) works on complex-valued banks today — the C runtime is the missing piece.

Performance (Release build, x86-64, WSL2)

Complex IQ input — Architecture A vs B vs C

Block = 65536 samples · 200 iterations · M=3 N=19 (A, B) or N=7 (C).

Total rate	Label	(A) Plain DDC	(B) HB→DDC	(C) DDC→HB
0.50	2× decim	61 MSa/s	335 MSa/s	70 MSa/s
0.25	4× decim	70 MSa/s	76 MSa/s	62 MSa/s
0.125	8× decim	71 MSa/s	102 MSa/s	66 MSa/s
0.10	10× decim	72 MSa/s	97 MSa/s	74 MSa/s
0.01	100× decim	85 MSa/s	116 MSa/s	80 MSa/s

Architecture B wins at every rate. The 2× case is dramatic: after the halfband, the DDC degenerates to a pure NCO bypass on num_in/2 samples, which is essentially free. Architecture C is slower than plain DDC at three of the five rates (4×, 8× and 100×) because the 2× intermediate buffer increases memory traffic more than the reduced tap count saves.

-march=native gives negligible improvement — DPMFS is memory-latency bound, not FLOP-bound.

NCO-only (bypass, no resampler)

Config	MSa/s
Complex NCO (complex IQ input)	~1215
Real NCO (real input, 2× cheaper)	~2400 (projected)

The NCO multiply alone runs at ~1.2 GSa/s for complex input; real input halves the multiply count so expect ~2.4 GSa/s.

Decision guide

Is your input real (single ADC channel)?
  YES ─► Architecture D2: modified HB + fine NCO@fs/2 + DPMFS
  │        ~2× cheaper than Arch D at any carrier, any decimation rate
  │        (zero-multiply band capture; fine NCO at half-rate)
  │        └─ Decimation > 41× after the HB?
  │              ─► Architecture E variant: embed fine NCO into complex DPMFS
  │                   fine NCO at fs_out                 (+43% at 100×)
  │
  NO (complex IQ)
  │
  ├─ Total decimation = 1× (bypass)
  │     ─► Architecture A without resampler  (plain NCO mix)
  │
  ├─ Total decimation 2× – 41×
  │     ─► Architecture B: HB → DDC          (dominant choice)
  │
  └─ Total decimation > 41×
        ─► Architecture E: complex DPMFS (coarse NCO embedded),
             fine NCO at fs_out           (+43% at 100× vs B)

Code examples

Architecture A — one call

dp_ddc_t *ddc = dp_ddc_create(-0.1f, 4096, 0.25);  // 4× decim, shift 0.1 to DC

dp_cf32_t out[dp_ddc_max_out(ddc)];
size_t n = dp_ddc_execute(ddc, in, 4096, out, dp_ddc_max_out(ddc));

dp_ddc_destroy(ddc);

Architecture B — halfband then DDC

/* HB: 60 dB Kaiser, N=19 FIR branch — design with
 *   doppler.polyphase.kaiser_prototype(phases=2)            */
dp_hbdecim_cf32_t *hb  = dp_hbdecim_cf32_create(N_hb, h_fir);
dp_ddc_t          *ddc = dp_ddc_create(norm_freq, num_in / 2, rate * 2.0);

dp_cf32_t mid[num_in / 2 + N_hb + 2];
dp_cf32_t out[dp_ddc_max_out(ddc)];

size_t n_mid = dp_hbdecim_cf32_execute(hb, in, num_in, mid, sizeof mid / sizeof mid[0]);
size_t n_out = dp_ddc_execute(ddc, mid, n_mid, out, dp_ddc_max_out(ddc));

dp_hbdecim_cf32_destroy(hb);
dp_ddc_destroy(ddc);

Architecture D — real input

/* NCO converts real → complex at full rate.
 * dp_nco_execute_r2cf32 not yet available — use execute_cf32
 * with the real part only (Q input = 0) as an interim.      */
dp_nco_t          *nco = dp_nco_create(norm_freq);
dp_hbdecim_cf32_t *hb  = dp_hbdecim_cf32_create(N_hb, h_fir);
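dp_ddc_t          *ddc = dp_ddc_create(0.0f, num_in / 2, rate * 2.0);
  /* norm_freq=0: NCO already applied; DDC used for DPMFS only */

/* Per block: mix with the NCO (interim path described in the note below),
 * then call dp_hbdecim_cf32_execute and dp_ddc_execute as in Architecture B,
 * and destroy each object when done. */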

!!! note "Real-input NCO"
    A dedicated dp_nco_execute_r2cf32 (real → complex, 2 MACs/sample) and dp_hbdecim_r2cf32_t (real-pair HB, N/2 MACs/output) are planned. Until then, load real samples into the .i field of dp_cf32_t and zero the .q field before passing to dp_nco_execute_cf32.