LongCat-Audio-Codec – Meituan’s Open-Source Speech Codec Solution
What is LongCat-Audio-Codec?
LongCat-Audio-Codec is an open-source speech codec developed by Meituan’s LongCat team, designed specifically for Speech Large Language Models (Speech LLMs). It introduces a dual-token extraction mechanism that processes semantic and acoustic information in parallel, balancing speech understanding against acoustic feature preservation, a trade-off that traditional codecs struggle to handle.
Its low-latency streaming decoder supports real-time interaction, keeping decoding delay within a few hundred milliseconds, which makes it well suited to applications such as in-car voice assistants and real-time translation. Through an integrated super-resolution design, LongCat-Audio-Codec reconstructs high-fidelity audio at ultra-low bitrates, raising both the sampling rate and the naturalness of the output audio.
The system provides a complete token generation and reconstruction toolchain, supporting flexible codebook configurations that can be adjusted for different downstream tasks and scenarios. Moreover, its multi-stage training strategy further optimizes the balance between high compression rates and high audio quality.
Main Features of LongCat-Audio-Codec
1. Parallel Semantic and Acoustic Tokenization:
Maps raw audio signals into parallel semantic and acoustic token sequences, capturing both linguistic content and acoustic features such as tone, timbre, and prosody.
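The parallel tokenization described above can be pictured with a toy vector-quantization model. Everything below is illustrative: the class name, codebook sizes, and frame dimension are assumptions for exposition, not the LongCat-Audio-Codec API.

```python
import numpy as np

# Hypothetical sketch: map audio frames to parallel semantic and acoustic
# token streams via nearest-neighbour vector quantization. Names and sizes
# are illustrative, not the actual LongCat-Audio-Codec interface.
class DualTokenizer:
    def __init__(self, frame_dim=16, semantic_size=64, acoustic_size=256, seed=0):
        rng = np.random.default_rng(seed)
        # Two independent codebooks: one aimed at linguistic content,
        # one at paralinguistic detail (tone, timbre, prosody).
        self.semantic_codebook = rng.standard_normal((semantic_size, frame_dim))
        self.acoustic_codebook = rng.standard_normal((acoustic_size, frame_dim))

    def _quantize(self, frames, codebook):
        # Nearest codebook entry per frame (Euclidean distance).
        dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
        return dists.argmin(axis=1)

    def encode(self, frames):
        # One semantic token and one acoustic token per frame, in parallel.
        return (self._quantize(frames, self.semantic_codebook),
                self._quantize(frames, self.acoustic_codebook))

tok = DualTokenizer()
frames = np.random.default_rng(1).standard_normal((50, 16))  # 50 audio frames
semantic, acoustic = tok.encode(frames)
print(semantic.shape, acoustic.shape)  # → (50,) (50,)
```

The key property mirrored here is that the two token streams are time-aligned: every frame yields one semantic token and one acoustic token, so downstream models can consume them side by side.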
2. Low-Latency Streaming Decoding:
Adopts a frame-level incremental processing mechanism to achieve low-latency decoding, enabling real-time interactive audio experiences.
3. Ultra-Low Bitrate with High Fidelity:
Reconstructs high-fidelity audio even at extremely low bitrates. An integrated super-resolution design enhances the sampling rate and naturalness of the output audio.
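To see why discrete tokenization keeps the bitrate low, a back-of-envelope calculation helps. The frame rate and codebook sizes below are illustrative assumptions, not the codec’s published configuration.

```python
import math

# Back-of-envelope bitrate arithmetic (illustrative numbers): with a fixed
# frame rate, one semantic codebook, and N acoustic codebooks, the bitrate is
#   frame_rate * (log2(semantic_size) + N * log2(acoustic_size)).
def bitrate_bps(frame_rate_hz, semantic_size, acoustic_size, n_acoustic_books):
    bits_per_frame = math.log2(semantic_size) + n_acoustic_books * math.log2(acoustic_size)
    return frame_rate_hz * bits_per_frame

# e.g. 50 frames/s, an 8192-entry semantic codebook, two 1024-entry acoustic books:
print(bitrate_bps(50, 8192, 1024, 2))  # → 1650.0  (50 * (13 + 20) bits/s)
```

Even with generous codebooks, the token stream stays in the low kilobit-per-second range, orders of magnitude below raw PCM audio.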
Technical Principles of LongCat-Audio-Codec
1. Dual Semantic-Acoustic Token Extraction:
Utilizes a bidirectional Transformer architecture to extract semantic tokens that capture core linguistic information, while enhanced quantization techniques extract acoustic tokens that retain paralinguistic elements (e.g., intonation, rhythm, timbre). This dual-token design effectively balances semantic understanding and acoustic quality.
2. Low-Latency Streaming Decoding:
Implements a frame-level incremental decoding approach, minimizing dependency on future tokens and keeping decoding latency at the hundred-millisecond level, suitable for real-time interaction.
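The frame-level incremental idea can be sketched as a causal decode loop that emits audio as each token arrives, so latency is bounded by per-frame compute rather than utterance length. The class and its cross-fade smoothing are hypothetical stand-ins, not the actual decoder.

```python
import numpy as np

# Illustrative sketch of frame-level incremental decoding: each token is
# decoded on arrival using only past context. Names are hypothetical,
# not the LongCat-Audio-Codec API.
class StreamingDecoder:
    def __init__(self, codebook, frame_len=320):  # e.g. 20 ms frames at 16 kHz
        self.codebook = codebook
        self.frame_len = frame_len
        self.prev = np.zeros(frame_len)  # causal state: the last decoded frame only

    def decode_frame(self, token):
        frame = self.codebook[token]
        # Simple blend with the previous frame to smooth frame boundaries.
        out = 0.9 * frame + 0.1 * self.prev
        self.prev = out
        return out

rng = np.random.default_rng(0)
codebook = rng.standard_normal((256, 320))
dec = StreamingDecoder(codebook)
# Tokens decoded one at a time, as they would arrive from a Speech LLM.
audio = np.concatenate([dec.decode_frame(t) for t in [3, 41, 7]])
print(audio.shape)  # → (960,)
```

Because `decode_frame` never looks at future tokens, output can start as soon as the first token is available, which is the property that keeps interactive latency at the hundred-millisecond level.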
3. Ultra-Low Bitrate, High Fidelity, and Super-Resolution Integration:
Optimized model design and training strategies enable high-fidelity audio reconstruction at low bitrates, while super-resolution modules embedded in the decoder raise both the sampling rate and the perceptual quality of the output.
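As a toy stand-in for the learned super-resolution module, plain linear interpolation shows the upsampling step in isolation; the real decoder uses a trained network, so this is only a shape-level sketch under that assumption.

```python
import numpy as np

# Minimal sketch of the super-resolution idea: the decoder produces audio at
# a low internal sampling rate and an upsampler raises it. Linear
# interpolation stands in for the learned module (an assumption, not the
# actual design).
def upsample(audio, factor=2):
    n = len(audio)
    x_old = np.arange(n)
    x_new = np.linspace(0, n - 1, n * factor)
    return np.interp(x_new, x_old, audio)

low_sr = np.sin(np.linspace(0, 2 * np.pi, 160))  # one sine cycle, low-rate
high_sr = upsample(low_sr, factor=3)             # 3x the sample count
print(len(high_sr))  # → 480
```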
4. Flexible Acoustic Codebook Configuration:
Allows customization of the number of acoustic codebooks according to downstream task requirements, supporting various scenarios such as single-speaker or multi-speaker environments.
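A minimal sketch of such a configuration, assuming illustrative preset values rather than the codec’s real defaults, shows how the codebook count directly sets the acoustic bitrate:

```python
import math
from dataclasses import dataclass

# Hypothetical configuration sketch: downstream tasks trade bitrate against
# acoustic detail by varying the number of acoustic codebooks. All values
# are illustrative, not the codec's published settings.
@dataclass
class CodecConfig:
    n_acoustic_codebooks: int
    codebook_size: int = 1024   # entries per acoustic codebook
    frame_rate_hz: int = 50     # tokens emitted per second

    @property
    def acoustic_bitrate_bps(self) -> float:
        # Each codebook contributes log2(size) bits per frame.
        return self.frame_rate_hz * self.n_acoustic_codebooks * math.log2(self.codebook_size)

# Fewer books might serve a constrained single-speaker setup;
# more books help preserve timbre in multi-speaker scenarios.
low = CodecConfig(n_acoustic_codebooks=1)
high = CodecConfig(n_acoustic_codebooks=4)
print(low.acoustic_bitrate_bps, high.acoustic_bitrate_bps)  # → 500.0 2000.0
```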
5. Multi-Stage Training Strategy:
Employs a multi-phase training process to meet diverse objectives: high-quality reconstruction under high compression, natural audio synthesis, and personalized adaptation for different applications.
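One way to picture such a schedule is as an ordered list of stages, each with its own objective and loss weighting. The stage names, goals, and weights below are assumptions for exposition only, not the published training recipe.

```python
# Illustrative outline of a multi-stage training schedule (hypothetical
# stage names and loss weights, not the codec's actual recipe).
STAGES = [
    {"name": "codec_pretrain", "goal": "high-quality reconstruction under high compression",
     "loss_weights": {"reconstruction": 1.0}},
    {"name": "naturalness", "goal": "natural audio synthesis",
     "loss_weights": {"reconstruction": 1.0, "adversarial": 0.5}},
    {"name": "adaptation", "goal": "personalized adaptation for target applications",
     "loss_weights": {"reconstruction": 0.5, "adversarial": 0.5}},
]

def describe(stages):
    # Summarize each stage as "name: goal (k loss terms)".
    return [f"{s['name']}: {s['goal']} ({len(s['loss_weights'])} loss terms)"
            for s in stages]

for line in describe(STAGES):
    print(line)
```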
Project Links
- GitHub Repository: https://github.com/meituan-longcat/LongCat-Audio-Codec
- Hugging Face Model Hub: https://huggingface.co/meituan-longcat/LongCat-Audio-Codec
Application Scenarios of LongCat-Audio-Codec
1. Smart Speakers:
Enhances the real-time responsiveness and naturalness of speech interaction, enabling faster and more accurate voice command processing.
2. In-Car Voice Assistants:
Provides low-latency voice feedback suitable for in-vehicle environments, improving driver experience and safety through responsive speech interactions.
3. Real-Time Translation:
Supports high-quality, real-time speech translation with minimal latency, improving the fluidity of multilingual communication.
4. Speech Recognition and Synthesis:
Offers efficient audio processing for ASR (Automatic Speech Recognition) and TTS (Text-to-Speech) systems, improving recognition accuracy and naturalness of synthesized speech.
5. Long-Form Audio Modeling:
Enables efficient encoding and decoding of long-duration audio content, applicable to use cases such as audiobooks, podcasts, and lectures.
6. Multilingual Speech Processing:
Supports multilingual audio modeling, offering a technical foundation for cross-lingual speech applications such as translation and voice cloning.