FLM-Audio – A full-duplex audio dialogue model open-sourced by BAAI (Beijing Academy of Artificial Intelligence)


What is FLM-Audio?

FLM-Audio is a native full-duplex audio dialogue large model jointly released by the Beijing Academy of Artificial Intelligence (BAAI), Spin Matrix, and Nanyang Technological University, Singapore. It supports both Chinese and English. With its native full-duplex architecture, the model merges listening, speaking, and monologue channels at every timestep, avoiding the high latency of traditional time-division multiplexing schemes that alternate whole turns of listening and speaking. Its natural monologue design and dual-training paradigm produce human-like conversational flow and address the asynchronous alignment problem between text and audio. Despite being trained on only ~1 million hours of audio data, FLM-Audio delivers high-quality, agile, and natural responses while remaining robust to background noise and user interruptions.
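The per-timestep channel merging described above can be sketched conceptually. This is a minimal illustration, not FLM-Audio's actual implementation: the channel names, token strings, and interleaving order are all assumptions made for clarity.

```python
from dataclasses import dataclass


@dataclass
class TimestepFrame:
    """One model step carrying all three channels at once (illustrative)."""
    listen_tokens: list      # audio tokens heard during this step
    monologue_tokens: list   # internal text "monologue" for this step
    speak_tokens: list       # audio tokens emitted during this step


def merge_channels(frame: TimestepFrame) -> list:
    """Interleave the three channels into one input sequence per timestep.

    A native full-duplex model consumes all channels at every timestep,
    instead of alternating whole turns (time-division multiplexing),
    so it never waits for the user to finish before it can react.
    """
    return frame.listen_tokens + frame.monologue_tokens + frame.speak_tokens


frame = TimestepFrame(["<hear:hi>"], ["<think:greet>"], ["<say:hello>"])
merged = merge_channels(frame)
```

In a time-division scheme the same three channels would occupy separate turns, forcing the model to wait; merging them per step is what enables "listening while speaking."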



Key Features of FLM-Audio

  • Full-duplex voice interaction: Enables “listening while speaking.” Users can interrupt at any time, and the model instantly pauses, processes the new input, and responds smoothly with low latency.

  • Multilingual support: Supports both Chinese and English to meet diverse user needs.

  • Natural speech modeling: Simulates human rhythm through “natural monologues,” with dual training to align linguistic and acoustic semantics, balancing low latency with strong language modeling performance.

  • Efficient training with less data: Trained on only ~1 million hours of audio to build a 7B-parameter model that remains robust and natural even in noisy or frequently interrupted environments.

  • Strong robustness: Handles background noise and interruptions effectively, pausing output instantly, accurately interpreting new input, and responding seamlessly.

  • Fully open-source: Paper, model weights, and code are publicly available, supporting local deployment and secondary development for research and applications.
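The "listening while speaking" behavior in the feature list can be sketched as a toy barge-in state machine. This is purely illustrative; the class, method names, and voice-activity flag are assumptions, not part of FLM-Audio's published interface.

```python
class DuplexDialogueState:
    """Toy state machine for barge-in handling (illustrative only).

    Mirrors the behavior described above: while the model is speaking,
    incoming user speech can interrupt it at any time; the model pauses
    its output immediately, listens to the new input, then replies.
    """

    def __init__(self):
        self.speaking = False
        self.pending_reply = None

    def start_reply(self, text: str):
        """Begin speaking a reply."""
        self.speaking = True
        self.pending_reply = text

    def on_user_audio(self, is_speech: bool) -> str:
        """Called every timestep with a voice-activity flag."""
        if is_speech and self.speaking:
            # Barge-in: stop talking at once and switch to listening.
            self.speaking = False
            self.pending_reply = None
            return "paused"
        return "speaking" if self.speaking else "listening"


state = DuplexDialogueState()
state.start_reply("The weather today is...")
status = state.on_user_audio(is_speech=True)  # user interrupts mid-reply
```

A real full-duplex model does this implicitly inside the network at every timestep rather than with an explicit state machine, but the observable behavior is the same: the reply pauses the moment the user speaks.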


Technical Principles of FLM-Audio

  • Native full-duplex architecture: Supports simultaneous speech input and output for real-time “listen while speaking” interaction.

  • Natural monologue training: Uses continuous sentence segments with pauses instead of word-by-word alignment, simulating realistic human speech patterns.

  • Dual-training strategy: Trains by alternating monologues at the start and end of audio segments, enhancing alignment between language and acoustic semantics.

  • Data-efficient training: Achieves high-parameter modeling with only ~1M hours of audio, optimizing architecture and training methods to deliver low latency and high robustness.
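The dual-training strategy above, which alternates monologue placement at the start and end of audio segments, can be sketched as a data-arrangement function. The function name, token strings, and mode labels are hypothetical; only the idea of alternating both orderings comes from the description above.

```python
def build_dual_training_sample(audio_tokens, monologue_tokens, mode):
    """Arrange one training sample for the dual-training strategy (sketch).

    'pre'  -> monologue before the audio: text conditions speech,
              strengthening the language-to-acoustics direction.
    'post' -> monologue after the audio: speech conditions text,
              strengthening the acoustics-to-language direction.
    Alternating the two directions during training is the idea
    described above; the token names here are invented.
    """
    if mode == "pre":
        return monologue_tokens + audio_tokens
    if mode == "post":
        return audio_tokens + monologue_tokens
    raise ValueError("mode must be 'pre' or 'post'")


audio = ["<a1>", "<a2>"]
monologue = ["<t1>"]
pre_sample = build_dual_training_sample(audio, monologue, "pre")
post_sample = build_dual_training_sample(audio, monologue, "post")
```

Training on both arrangements gives the model supervision in each alignment direction without requiring word-level timestamps between text and audio.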


Application Scenarios

  • Online education: AI tutors provide real-time answers to students with natural and efficient interaction.

  • Gaming & VR: NPCs engage in continuous, interruptible conversations for greater immersion.

  • Intelligent customer service: Low-latency dialogue reduces waiting times, improving efficiency and user satisfaction.

  • AI companionship: Provides more human-like conversational experiences for stronger emotional connection.

  • Voice assistants: Enables natural voice interactions in smart home and smart office environments.

  • Meeting assistance: Supports real-time translation, transcription, and interaction in multi-party meetings, boosting efficiency.
