Ohmic Audio

⚙️ ENGINEER LEVEL: Audio Coding Theory

Perceptual Coding Fundamentals

MP3, AAC, and similar codecs don't randomly discard audio data — they use psychoacoustic models to identify what you won't hear and discard that.

Core principle: If a loud sound at one frequency masks a quieter sound at a nearby frequency, code the quiet sound with fewer bits. The ear can't hear the resulting error.

Simultaneous masking:

A masker at frequency f_m with level L_m masks a signal at frequency f_s with level L_s if:

L_s < L_m − spread(f_s − f_m)

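Where spread() is the spread-of-masking function: the amount, in dB, by which the masking threshold drops as the signal moves away from the masker. Masking reaches further upward in frequency than downward, so the drop is roughly:
  - 10 dB per Bark (critical band) above the masker
  - 25 dB per Bark below the masker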
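The masking condition above can be sketched in a few lines. This is a minimal illustration, not a codec-grade model: it assumes the slopes are measured per Bark, uses the Zwicker-Terhardt approximation for the Hz-to-Bark conversion, and ignores the tonality-dependent offset that real psychoacoustic models subtract from the threshold. The function names are made up for this example.

```python
import math

def hz_to_bark(f_hz: float) -> float:
    """Zwicker-Terhardt approximation of the Bark (critical-band) scale."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

def spread_db(f_s: float, f_m: float,
              upper_slope: float = 10.0, lower_slope: float = 25.0) -> float:
    """Drop in masking threshold (dB, always >= 0) at f_s for a masker at f_m.

    The slopes are the rough figures from the text, assumed here to be per Bark.
    """
    dz = hz_to_bark(f_s) - hz_to_bark(f_m)
    return upper_slope * dz if dz >= 0 else lower_slope * -dz

def is_masked(l_s: float, f_s: float, l_m: float, f_m: float) -> bool:
    """True if L_s < L_m - spread(f_s - f_m)."""
    return l_s < l_m - spread_db(f_s, f_m)

# An 80 dB masker at 1 kHz against a 60 dB signal on either side of it.
print(is_masked(60.0, 1200.0, 80.0, 1000.0))  # signal above the masker
print(is_masked(60.0, 800.0, 80.0, 1000.0))   # signal below the masker
```

With these assumed slopes, the same 60 dB signal is hidden when it sits 200 Hz above the masker but not when it sits 200 Hz below it, which is the asymmetry the two slopes describe.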

Temporal masking:

Masking doesn't just happen simultaneously; it also extends in time:
  - Pre-masking: Up to 5 ms before masker onset
  - Post-masking: Up to 200 ms after masker offset

This is why a sudden loud sound can mask quieter sounds that follow it — ears take time to "recover."
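As a rough illustration of those time windows, the sketch below classifies an event relative to a masker. The 5 ms and 200 ms limits are the upper bounds quoted above, so falling inside a window only means masking is possible, not guaranteed. The function and constant names are hypothetical.

```python
PRE_MASK_MS = 5.0     # upper bound for pre-masking, from the text
POST_MASK_MS = 200.0  # upper bound for post-masking, from the text

def temporal_masking_region(t_ms: float, masker_on_ms: float, masker_off_ms: float):
    """Return 'pre', 'simultaneous', 'post', or None for an event at t_ms."""
    if masker_on_ms - PRE_MASK_MS <= t_ms < masker_on_ms:
        return "pre"
    if masker_on_ms <= t_ms <= masker_off_ms:
        return "simultaneous"
    if masker_off_ms < t_ms <= masker_off_ms + POST_MASK_MS:
        return "post"
    return None

# A masker that starts at 100 ms and stops at 150 ms.
for t in (90.0, 97.0, 120.0, 300.0, 360.0):
    print(t, temporal_masking_region(t, 100.0, 150.0))
```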

Encoding steps:

  1. Analysis filterbank: Divide the signal into frequency bands (MP3's hybrid filterbank, a 32-band polyphase stage followed by an MDCT, yields 576 frequency lines per granule)
  2. Psychoacoustic model: Calculate masking threshold for current frame
  3. Bit allocation: Allocate bits so quantization noise stays below masking threshold
  4. Quantization: Apply, check against threshold, re-allocate if needed
  5. Entropy coding: Huffman coding for further compression
  6. Frame packing: Assemble into bitstream
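Steps 3 and 4 can be illustrated with a toy greedy allocator. It is a sketch under stated assumptions, not how any particular encoder works: it assumes each extra bit lowers quantization noise by about 6.02 dB (the rule of thumb for a uniform quantizer), treats each band independently, and uses made-up band levels. Real MP3 encoders use nested rate and distortion loops with scalefactors instead.

```python
def allocate_bits(signal_db, mask_db, bit_budget, db_per_bit=6.02):
    """Give bits to the band whose noise is furthest above its masking threshold.

    signal_db: per-band signal level in dB
    mask_db:   per-band masking threshold in dB
    Returns a list with the number of bits assigned to each band.
    """
    bits = [0] * len(signal_db)

    def noise_to_mask(i):
        # Assumed noise level: signal level minus ~6 dB per allocated bit.
        return (signal_db[i] - db_per_bit * bits[i]) - mask_db[i]

    for _ in range(bit_budget):
        worst = max(range(len(bits)), key=noise_to_mask)
        if noise_to_mask(worst) <= 0:
            break  # all noise is already below the masking threshold
        bits[worst] += 1
    return bits

# Three hypothetical bands; the third is already fully masked.
print(allocate_bits([60.0, 40.0, 20.0], [30.0, 35.0, 25.0], bit_budget=20))
print(allocate_bits([60.0, 40.0, 20.0], [30.0, 35.0, 25.0], bit_budget=3))
```

When the budget runs out before every band is below its threshold, as in the second call, the leftover noise is what becomes audible at low bitrates.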

Why lossy codecs fail:

The masking model is an approximation. Failures occur when:
  - A complex signal defies the simple masking model
  - Transients cause the encoder to overestimate pre-masking
  - A very low bitrate forces quantization noise above the masking threshold
  - Specific frequencies show unusual masking behavior

Result: Pre-echo (artifact before transient), metallic shimmer on complex material, pumping artifacts on sustained tones.

MDCT (Modified Discrete Cosine Transform)

The transform at the heart of MP3, AAC, and most modern audio codecs.

MDCT definition:

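X[k] = Σ_{n=0}^{2N−1} x[n] × cos[(π/N) × (n + 1/2 + N/2) × (k + 1/2)]

For k = 0 to N − 1. Each block takes 2N input samples and produces N coefficients. Consecutive blocks overlap by N samples, so the transform remains critically sampled.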
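A direct, unoptimized implementation makes the definition concrete. This sketch assumes the common convention in which each block holds 2N samples and yields N coefficients, applies a sine window before and after the transform, and uses a 2/N scale factor in the inverse so that overlap-adding neighbouring blocks cancels the time-domain aliasing. Production codecs use FFT-based fast algorithms; the function names here are illustrative only.

```python
import math

def mdct(block):
    """Forward MDCT: 2N (already windowed) samples in, N coefficients out."""
    n = len(block) // 2
    return [
        sum(block[i] * math.cos(math.pi / n * (i + 0.5 + n / 2) * (k + 0.5))
            for i in range(2 * n))
        for k in range(n)
    ]

def imdct(coeffs):
    """Inverse MDCT: N coefficients in, 2N time-aliased samples out."""
    n = len(coeffs)
    return [
        (2.0 / n) * sum(coeffs[k] * math.cos(math.pi / n * (i + 0.5 + n / 2) * (k + 0.5))
                        for k in range(n))
        for i in range(2 * n)
    ]

def sine_window(length):
    return [math.sin(math.pi * (i + 0.5) / length) for i in range(length)]

def analyze_synthesize(x, n):
    """Window, MDCT, IMDCT, window again, then overlap-add with hop size N."""
    w = sine_window(2 * n)
    out = [0.0] * len(x)
    for start in range(0, len(x) - 2 * n + 1, n):
        block = [x[start + i] * w[i] for i in range(2 * n)]
        y = imdct(mdct(block))
        for i in range(2 * n):
            out[start + i] += y[i] * w[i]
    return out

N = 8
x = [math.sin(0.3 * i) + 0.1 * i for i in range(4 * N)]
y = analyze_synthesize(x, N)
# Only samples covered by two overlapping blocks are reconstructed.
err = max(abs(a - b) for a, b in zip(x[N:3 * N], y[N:3 * N]))
print(err < 1e-9)
```

The first and last N samples are covered by a single block, so they keep their aliasing; a real codec handles this by padding the start and end of the stream.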

Properties:

Window functions in MDCT:

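Before the MDCT, each block is multiplied by a window function to reduce spectral leakage. The window must satisfy the Princen-Bradley condition so that the overlapped blocks reconstruct the signal exactly.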

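MP3 uses a sine window for both long blocks (36 samples) and short blocks (12 samples), with asymmetric start and stop windows for the transitions between them. AAC lets the encoder choose between a sine window and a Kaiser-Bessel-derived (KBD) window.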
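To show what an MDCT-compatible window looks like, the sketch below builds a sine window and a Kaiser-Bessel-derived window and checks that both satisfy the Princen-Bradley condition, w[n]² + w[n + N]² = 1, which is what lets overlapping blocks sum back to the original signal. The alpha value and window lengths are illustrative choices, and the helper names are made up for this example.

```python
import math

def sine_window(length):
    return [math.sin(math.pi * (n + 0.5) / length) for n in range(length)]

def bessel_i0(x, terms=40):
    """Modified Bessel function I0, summed from its power series."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (x / (2.0 * k)) ** 2
        total += term
    return total

def kbd_window(length, alpha=4.0):
    """Kaiser-Bessel-derived window built from a Kaiser window of length/2 + 1 points."""
    half = length // 2
    kaiser = [
        bessel_i0(math.pi * alpha * math.sqrt(1.0 - (2.0 * n / half - 1.0) ** 2))
        for n in range(half + 1)
    ]
    total = sum(kaiser)
    rising, running = [], 0.0
    for n in range(half):
        running += kaiser[n]
        rising.append(math.sqrt(running / total))
    return rising + rising[::-1]  # mirror the rising half to get the falling half

def princen_bradley_error(window):
    """Largest deviation of w[n]^2 + w[n + N]^2 from 1."""
    half = len(window) // 2
    return max(abs(window[n] ** 2 + window[n + half] ** 2 - 1.0) for n in range(half))

print(princen_bradley_error(sine_window(36)) < 1e-12)
print(princen_bradley_error(kbd_window(256)) < 1e-12)
```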

Block switching:

Pre-echo artifact:

If a transient occurs near the end of a long block, the entire block is coded as one unit. On decoding, the quantization noise is spread across the whole block, so noise introduced by coding the transient lands in the quiet region before it, where it is audible as a pre-echo artifact.

Short blocks and block switching reduce this significantly; good encoders (LAME at -V0, Apple AAC, FDK-AAC) minimize pre-echo through careful block selection algorithms.
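One simple way to decide when to switch is to look for a sudden jump in energy inside a frame. The heuristic below is only a sketch of the idea: the segment count, the ratio threshold, and the function name are assumptions chosen for illustration, and it does not reproduce the block selection logic of LAME or any AAC encoder, which typically rely on the psychoacoustic model.

```python
import math

def choose_block_type(frame, segments=8, ratio_threshold=10.0, floor=1e-9):
    """Return 'short' if energy jumps sharply between adjacent segments, else 'long'."""
    seg_len = len(frame) // segments
    energies = [
        sum(s * s for s in frame[i * seg_len:(i + 1) * seg_len]) + floor
        for i in range(segments)
    ]
    for prev, cur in zip(energies, energies[1:]):
        if cur / prev > ratio_threshold:
            return "short"  # an attack: keep quantization noise close to it in time
    return "long"           # stationary signal: favour frequency resolution

steady_tone = [math.sin(0.05 * i) for i in range(1024)]
late_attack = [0.001] * 896 + [1.0] * 128  # quiet, then a burst near the end

print(choose_block_type(steady_tone))
print(choose_block_type(late_attack))
```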


5.4 Video Integration: Displays, Cameras, and Navigation