Lisa: Lightweight Yet Superb Neural Speech Coding



Jiankai Huang, Junteng Zhang, Ming Lu, Xun Cao, and Zhan Ma

Nanjing University, Nanjing, China

Neural speech coding has recently achieved remarkable progress at low and ultra-low bitrates, yet its efficiency remains constrained by the limited ability to learn compact representations. To address this challenge, we introduce Lisa, a lightweight neural speech codec that enhances both feature representation and quantization. First, Lisa employs a causal frequency-domain encoder–decoder equipped with Inception Residual Blocks (IRB) to better exploit multi-scale correlations. Second, we propose Regulated Residual Vector Quantization (R-RVQ), which explicitly modulates residuals into quantization-friendly forms, enabling more effective and compact multi-stage representation. Experimental results demonstrate that Lisa surpasses existing neural speech codecs in coding efficiency, while retaining low complexity suitable for real-time speech communication and streaming.
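For readers unfamiliar with multi-stage quantization, the following is a minimal sketch of plain residual vector quantization (RVQ), the baseline scheme that R-RVQ builds on. It is not the R-RVQ method itself: the residual regulation step is not specified on this page and is omitted here. The function names, the toy codebooks, and the frame rate, stage count, and codebook size in the bitrate example are all hypothetical illustrations, not Lisa's actual configuration.

```python
import math


def nearest(codebook, x):
    """Return the index of the codeword closest to x (squared Euclidean distance)."""
    best_i, best_d = 0, float("inf")
    for i, c in enumerate(codebook):
        d = sum((a - b) ** 2 for a, b in zip(x, c))
        if d < best_d:
            best_i, best_d = i, d
    return best_i


def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each stage codes the residual left by the earlier stages."""
    residual = list(x)
    indices = []
    for cb in codebooks:
        i = nearest(cb, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, cb[i])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct the vector as the sum of the selected codewords, one per stage."""
    out = [0.0] * len(codebooks[0][0])
    for i, cb in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, cb[i])]
    return out


def bitrate_bps(frame_rate_hz, num_stages, codebook_size):
    """Bits per second for a fixed-rate RVQ stream: frames/s x stages x bits per index."""
    return frame_rate_hz * num_stages * math.log2(codebook_size)


if __name__ == "__main__":
    # Toy two-stage codebooks (hypothetical values, for illustration only).
    codebooks = [
        [[0.0, 0.0], [1.0, 1.0]],    # stage 1: coarse
        [[0.0, 0.0], [0.5, -0.5]],   # stage 2: refines the stage-1 residual
    ]
    idx = rvq_encode([1.4, 0.6], codebooks)
    print(idx, rvq_decode(idx, codebooks))
    # Hypothetical configuration, not taken from the paper.
    print(bitrate_bps(50, 3, 1024))
```

In plain RVQ, later stages receive whatever residual the earlier stages leave behind. According to the abstract, R-RVQ instead modulates those residuals into forms that are easier to quantize before each stage codes them.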



Experimental Results: LibriTTS Test-Clean Reconstruction @ 16 kHz Audio

Sample 1
Ground Truth
Lisa (Ours) @ 1500 bps
Lisa (Ours) @ 500 bps
Encodec @ 1500 bps
DAC @ 1500 bps
SpeechTokenizer @ 1500 bps
SemantiCodec @ 1400 bps
FunCodec @ 1500 bps
Mimi @ 1000 bps
MUFFIN @ 1350 bps
FunCodec @ 500 bps
SemantiCodec @ 650 bps
WavTokenizer @ 500 bps
Sample 2
Ground Truth
Lisa (Ours) @ 1500 bps
Lisa (Ours) @ 500 bps
Encodec @ 1500 bps
DAC @ 1500 bps
SpeechTokenizer @ 1500 bps
SemantiCodec @ 1400 bps
FunCodec @ 1500 bps
Mimi @ 1000 bps
MUFFIN @ 1350 bps
FunCodec @ 500 bps
SemantiCodec @ 650 bps
WavTokenizer @ 500 bps
Sample 3
Ground Truth
Lisa (Ours) @ 1500 bps
Lisa (Ours) @ 500 bps
Encodec @ 1500 bps
DAC @ 1500 bps
SpeechTokenizer @ 1500 bps
SemantiCodec @ 1400 bps
FunCodec @ 1500 bps
Mimi @ 1000 bps
MUFFIN @ 1350 bps
FunCodec @ 500 bps
SemantiCodec @ 650 bps
WavTokenizer @ 500 bps
Sample 4
Ground Truth
Lisa (Ours) @ 1500 bps
Lisa (Ours) @ 500 bps
Encodec @ 1500 bps
DAC @ 1500 bps
SpeechTokenizer @ 1500 bps
SemantiCodec @ 1400 bps
FunCodec @ 1500 bps
Mimi @ 1000 bps
MUFFIN @ 1350 bps
FunCodec @ 500 bps
SemantiCodec @ 650 bps
WavTokenizer @ 500 bps