Step-Audio 2 mini – StepFun’s Open-Source End-to-End Speech Model

What is Step-Audio 2 mini?

Step-Audio 2 mini is an open-source, end-to-end large speech model released by StepFun. Instead of chaining separate speech components, it uses a single end-to-end multimodal framework that turns raw audio input directly into a spoken response. This design reduces latency and lets the model understand paralinguistic information and non-speech signals. By jointly optimizing chain-of-thought reasoning with reinforcement learning, the model understands and responds to emotion, intonation, and other nuances at a fine-grained level. It also supports external tools such as web retrieval, which helps reduce hallucinations and makes the model easier to extend to new scenarios.

In terms of performance, Step-Audio 2 mini has achieved state-of-the-art (SOTA) results on multiple international benchmarks. On the MMAU general multimodal audio understanding benchmark, it ranked first among open-source end-to-end speech models with a score of 73.2. On URO Bench, which measures spoken dialogue capabilities, it achieved the highest scores among open-source end-to-end speech models in both the basic and professional tracks. In Chinese-English translation tasks, it significantly outperformed GPT-4o Audio and other open-source speech models. In speech recognition tasks, it ranked first across multiple languages and dialects, exceeding other open-source models by more than 15%.

Main Features of Step-Audio 2 mini

Audio Understanding: Accurately comprehends diverse audio content, including natural sounds, music, and speech, while capturing paralinguistic cues such as emotions and intonation, enabling perception of subtle context.
Speech Recognition: Excels in multilingual and multi-dialect speech recognition with high accuracy, quickly converting speech into text for diverse language environments.
Speech Translation: Supports speech-to-speech translation, enabling multilingual communication (e.g., Chinese-English) and bridging language barriers.
Emotion and Paralinguistic Analysis: Analyzes emotions (anger, joy, sadness, etc.) and non-verbal signals (laughter, sighs, etc.), enabling more natural human-computer interaction.
Voice Dialogue: Demonstrates strong conversational ability, engaging in fluent voice interactions, understanding complex queries, and providing appropriate responses. Applicable to intelligent customer service and voice assistants.
Tool Integration: Supports operations like real-time web search, retrieving up-to-date information to provide comprehensive and accurate answers.
Content Creation: Assists in generating audio content such as podcasts and audiobooks, giving creators both ideas and raw material.
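
The Tool Integration feature above can be pictured as a simple loop: the model either answers directly or asks for a search, and the search result is fed back before the final answer. The Python sketch below is a toy illustration of that loop only. `ModelTurn`, `answer_with_tools`, `fake_model`, and `fake_web_search` are hypothetical names invented for this example and do not reflect Step-Audio 2 mini's actual interface.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ModelTurn:
    """One model output: a final answer or a search request (hypothetical format)."""
    answer: Optional[str] = None
    search_query: Optional[str] = None


def answer_with_tools(
    question: str,
    model: Callable[[str, Optional[str]], ModelTurn],
    search: Callable[[str], str],
    max_tool_calls: int = 2,
) -> str:
    """Run the model, executing search requests until it returns an answer."""
    evidence: Optional[str] = None
    for _ in range(max_tool_calls + 1):
        turn = model(question, evidence)
        if turn.answer is not None:
            return turn.answer
        # The model asked for retrieval: run the tool and feed the result back.
        evidence = search(turn.search_query or question)
    return "Sorry, I could not find an answer."


# Hypothetical stand-ins so the sketch runs without any model or network.
def fake_web_search(query: str) -> str:
    return f"[search result for: {query}]"


def fake_model(question: str, evidence: Optional[str]) -> ModelTurn:
    if evidence is None and "today" in question:
        return ModelTurn(search_query=question)  # needs fresh information
    suffix = f" (grounded in {evidence})" if evidence else ""
    return ModelTurn(answer=f"Answer to '{question}'{suffix}")


if __name__ == "__main__":
    print(answer_with_tools("What is the weather today?", fake_model, fake_web_search))
```

Capping the number of tool calls keeps the loop from running forever if the model keeps requesting searches; the limit of 2 is an arbitrary choice for the sketch.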

Technical Principles of Step-Audio 2 mini

True End-to-End Multimodal Architecture: Replaces the traditional three-stage pipeline (speech recognition, a text language model, then speech synthesis) with a single model that converts raw audio input directly into a speech response. This simplifies the pipeline, reduces latency, and lets the model process paralinguistic and non-speech information that a text transcript would discard.
Chain-of-Thought Reasoning + Reinforcement Learning: Combines chain-of-thought reasoning with reinforcement learning in joint optimization, described as a first for an end-to-end speech model. This enables fine-grained understanding of, reasoning about, and natural responses to paralinguistic and non-speech signals such as emotion, intonation, and music.
Audio Knowledge Augmentation: Incorporates external tools such as web retrieval to mitigate hallucination, enhance multi-scenario adaptability, and provide accurate, up-to-date responses.
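
To make the architectural point concrete, the toy Python sketch below contrasts a three-stage cascade with a single end-to-end function. It is a simplified illustration under stated assumptions, not the model's implementation: an "utterance" is reduced to its words plus a tone label, and every class and function name here is hypothetical. Real systems operate on waveforms, not strings.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """Toy stand-in for an audio clip: words plus a paralinguistic cue."""
    words: str
    tone: str  # e.g. "cheerful" or "sarcastic"


def asr(clip: Utterance) -> str:
    # Stage 1 of the cascade: transcription keeps the words and drops the tone.
    return clip.words


def text_llm(transcript: str) -> str:
    # Stage 2: a text-only model can react to the words alone.
    return f"Glad to hear: {transcript}"


def tts(reply_text: str) -> str:
    # Stage 3: synthesis, represented here by a tagged string.
    return f"<speech>{reply_text}</speech>"


def cascaded_reply(clip: Utterance) -> str:
    return tts(text_llm(asr(clip)))


def end_to_end_reply(clip: Utterance) -> str:
    # A single model sees the whole clip, so tone can change the response.
    if clip.tone == "sarcastic":
        return f"<speech>Sounds like that went badly: {clip.words}</speech>"
    return f"<speech>Glad to hear: {clip.words}</speech>"


if __name__ == "__main__":
    clip = Utterance(words="great, another meeting", tone="sarcastic")
    print(cascaded_reply(clip))    # tone is lost at the ASR step
    print(end_to_end_reply(clip))  # tone shapes the reply
```

In the cascade, the tone never reaches the language model because the transcript is the only thing passed between stages, which is the information loss an end-to-end design is meant to avoid.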

Project Resources

GitHub Repository: https://github.com/stepfun-ai/Step-Audio2
Hugging Face Model Hub: https://huggingface.co/stepfun-ai/Step-Audio-2-mini
Demo Access: https://realtime-console.stepfun.com
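
For readers who want the weights locally, the following Python sketch shows one possible way to fetch them from the Hugging Face link above. It assumes the `huggingface_hub` package is installed and that the repository is publicly downloadable. The helper names and the local directory name are illustrative choices; the GitHub repository remains the place to look for the project's own setup and inference instructions.

```python
from urllib.parse import urlparse

MODEL_URL = "https://huggingface.co/stepfun-ai/Step-Audio-2-mini"


def repo_id_from_url(url: str) -> str:
    """Extract the 'owner/name' repo id from a Hugging Face model URL."""
    return urlparse(url).path.strip("/")


def download_model(local_dir: str = "Step-Audio-2-mini") -> str:
    """Download the model files and return the local path. Requires network access."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id_from_url(MODEL_URL), local_dir=local_dir)


if __name__ == "__main__":
    # Only prints the repo id; call download_model() explicitly to start the download.
    print(repo_id_from_url(MODEL_URL))
```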

Application Scenarios of Step-Audio 2 mini

Intelligent Voice Assistants: Provides convenient voice interaction services, such as smart home control and office assistance, by executing operations through voice commands.
Customer Service: Enhances efficiency and user experience by quickly and accurately understanding customer inquiries and offering solutions.
Speech Translation: Enables real-time speech-to-speech translation, bridging language gaps for international communication, business meetings, and beyond.
Audio Content Creation: Helps creators produce audio content such as podcasts and audiobooks, from brainstorming ideas to generating material.
Education: Facilitates personalized learning experiences in language learning and online education through interactive voice communication.
Healthcare: Assists in medical consultation and rehabilitation therapy by providing health advice and emotional support through voice dialogue.