Lisa: Lightweight Yet Superb Neural Speech Coding

Jiankai Huang, Junteng Zhang, Ming Lu, Xun Cao, and Zhan Ma

Nanjing University, Nanjing, China

Neural speech coding has recently achieved remarkable progress at low and ultra-low bitrates, yet its efficiency remains constrained by the limited ability to learn compact representations. To address this challenge, we introduce Lisa, a lightweight neural speech codec that enhances both feature representation and quantization. First, Lisa employs a causal frequency-domain encoder–decoder equipped with Inception Residual Blocks (IRB) to better exploit multi-scale correlations. Second, we propose Regulated Residual Vector Quantization (R-RVQ), which explicitly modulates residuals into quantization-friendly forms, enabling more effective and compact multi-stage representation. Experimental results demonstrate that Lisa surpasses existing neural speech codecs in coding efficiency, while retaining low complexity suitable for real-time speech communication and streaming.
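
The multi-stage quantization mentioned above builds on residual vector quantization (RVQ), in which each stage quantizes the residual left by the previous stage. As a rough illustration only, the sketch below shows plain RVQ, not the regulated R-RVQ variant, whose residual modulation is not detailed on this page. The function names, the toy codebooks, and the frame rate, stage count, and codebook size in the bitrate calculation are all hypothetical and are not taken from Lisa.

```python
import math


def nearest(codebook, vec):
    """Return the index of the codeword closest to vec (squared Euclidean distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((c - v) ** 2 for c, v in zip(code, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def rvq_encode(codebooks, vec):
    """Quantize vec stage by stage; each stage codes the residual of the previous one."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Subtract the chosen codeword so the next stage sees only what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(codebooks, indices):
    """Reconstruct the vector by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def bitrate_bps(frames_per_second, num_stages, codebook_size):
    """Bits per second = frames per second * stages * bits per codebook index."""
    return frames_per_second * num_stages * math.log2(codebook_size)


# Toy two-stage example with hand-written 2-D codebooks (hypothetical values).
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # stage 1: coarse codewords
    [[0.0, 0.0], [0.5, -0.5]],  # stage 2: refines the stage-1 residual
]
toy_indices = rvq_encode(toy_codebooks, [1.4, 0.6])
toy_reconstruction = rvq_decode(toy_codebooks, toy_indices)
print(toy_indices)          # [1, 1]
print(toy_reconstruction)   # [1.5, 0.5]

# Hypothetical configuration: 50 frames/s, 3 stages, 1024-entry codebooks.
print(bitrate_bps(50, 3, 1024))  # 1500.0
```

The bitrate helper shows how a figure such as 1500 bps can arise from a frame rate, a number of quantizer stages, and a codebook size; the actual configuration used by Lisa or by the compared codecs is not stated here.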

Experimental Results: LibriTTS Test-Clean Reconstruction of 16 kHz Audio

Sample 1

Ground Truth

Lisa (Ours) @ 1500 bps

Lisa (Ours) @ 500 bps

Encodec @ 1500 bps

DAC @ 1500 bps

SpeechTokenizer @ 1500 bps

SemantiCodec @ 1400 bps

FunCodec @ 1500 bps

Mimi @ 1000 bps

MUFFIN @ 1350 bps

FunCodec @ 500 bps

SemantiCodec @ 650 bps

WavTokenizer @ 500 bps

Sample 2

Ground Truth

Lisa (Ours) @ 1500 bps

Lisa (Ours) @ 500 bps

Encodec @ 1500 bps

DAC @ 1500 bps

SpeechTokenizer @ 1500 bps

SemantiCodec @ 1400 bps

FunCodec @ 1500 bps

Mimi @ 1000 bps

MUFFIN @ 1350 bps

FunCodec @ 500 bps

SemantiCodec @ 650 bps

WavTokenizer @ 500 bps

Sample 3

Ground Truth

Lisa (Ours) @ 1500 bps

Lisa (Ours) @ 500 bps

Encodec @ 1500 bps

DAC @ 1500 bps

SpeechTokenizer @ 1500 bps

SemantiCodec @ 1400 bps

FunCodec @ 1500 bps

Mimi @ 1000 bps

MUFFIN @ 1350 bps

FunCodec @ 500 bps

SemantiCodec @ 650 bps

WavTokenizer @ 500 bps

Sample 4

Ground Truth

Lisa (Ours) @ 1500 bps

Lisa (Ours) @ 500 bps

Encodec @ 1500 bps

DAC @ 1500 bps

SpeechTokenizer @ 1500 bps

SemantiCodec @ 1400 bps

FunCodec @ 1500 bps

Mimi @ 1000 bps

MUFFIN @ 1350 bps

FunCodec @ 500 bps

SemantiCodec @ 650 bps

WavTokenizer @ 500 bps