Step-Audio-AQAA – An end-to-end large audio language model developed by StepFun

What is Step-Audio-AQAA?

Step-Audio-AQAA is an end-to-end large audio language model (LALM) developed by the StepFun team for Audio Query-Audio Answer (AQAA) tasks. It takes audio input and generates natural, accurate speech responses directly, without separate Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) modules. This simplifies the system architecture and avoids the error propagation typical of cascaded pipelines.

Training proceeds through multimodal pre-training, supervised fine-tuning (SFT), direct preference optimization (DPO), and model merging. With this recipe, the model performs well on complex tasks such as speech emotion control, role-playing, and logical reasoning. On the StepEval-Audio-360 benchmark, Step-Audio-AQAA outperformed existing large audio language models across multiple key dimensions, indicating strong potential for end-to-end speech interaction.

Step-Audio-AQAA Key Features

Direct Audio Processing: Generates speech responses directly from raw audio input, bypassing traditional ASR and TTS pipelines.
Seamless Voice Interaction: Supports voice-to-voice interaction, allowing users to ask questions via speech and receive spoken answers, enhancing naturalness and fluency.
Emotion & Tone Control: Adjusts emotion (e.g., happiness, sadness, seriousness) and intonation at the sentence level.
Speech Rate Control: Allows users to modify the speed of responses for different scenarios.
Timbre & Pitch Adjustment: Adapts voice characteristics (timbre, pitch) based on user instructions, suitable for role-playing or contextual needs.
Multilingual Support: Works with Chinese, English, Japanese, and other languages.
Dialect Support: Covers Sichuan dialect, Cantonese, and other Chinese dialects for regional adaptability.
Emotional Speech Generation: Produces context-aware emotional responses.
Role-Playing: Simulates roles like customer service agents, teachers, or friends with role-appropriate speech.
Logical Reasoning & Q&A: Handles complex reasoning and knowledge-based questions with accurate spoken answers.
High-Fidelity Speech Output: Uses a neural vocoder to generate natural, high-quality speech waveforms.
Speech Coherence: Maintains fluency and consistency in long sentences or paragraphs.
Mixed Text & Audio Output: Supports interleaved text and speech responses based on user preference.
Multimodal Input Understanding: Processes hybrid inputs (speech + text) and generates corresponding speech replies.

Technical Principles

Dual-Codebook Audio Tokenizer
- Converts audio signals into structured token sequences.
- Two tokenizers:
  - Linguistic Tokenizer: Extracts phonemes and linguistic attributes; runs at 16.7 Hz with a codebook size of 1024.
  - Semantic Tokenizer: Captures acoustic features such as emotion and tone; runs at 25 Hz with a codebook size of 4096.
- Together, the two token streams capture both what is said and how it is said.
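
The rates and codebook sizes quoted above allow a quick back-of-the-envelope estimate of how compact the token representation is. The following is a minimal sketch in Python, assuming the two streams are simply counted side by side; the function names are illustrative and not part of any Step-Audio API.

```python
import math

# Figures quoted in the text: (token rate in Hz, codebook size).
TOKENIZERS = {
    "linguistic": (16.7, 1024),
    "semantic": (25.0, 4096),
}


def tokens_per_second() -> float:
    """Total number of audio tokens produced per second by both tokenizers."""
    return sum(rate for rate, _ in TOKENIZERS.values())


def bits_per_second() -> float:
    """Raw information rate: each token carries log2(codebook size) bits."""
    return sum(rate * math.log2(size) for rate, size in TOKENIZERS.values())


def token_counts(duration_s: float) -> dict[str, int]:
    """Approximate number of tokens per stream for a clip of the given length."""
    return {
        name: round(rate * duration_s)
        for name, (rate, _) in TOKENIZERS.items()
    }


if __name__ == "__main__":
    print(f"{tokens_per_second():.1f} tokens/s")
    print(f"{bits_per_second():.0f} bits/s")
    print(token_counts(3.0))
```

With the quoted figures this works out to about 41.7 tokens and roughly 467 bits per second of audio. A three-second clip yields about 50 linguistic and 75 semantic tokens, so the two streams stand in a ratio of roughly 2:3.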

Backbone LLM
- Uses Step-Omni, a 130B-parameter multimodal LLM pre-trained on text, speech, and image data.
- Embeds dual-codebook tokens into a unified vector space for deep semantic understanding via Transformer blocks.
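
One common way to place tokens from two codebooks into a single vector space is to give each codebook its own id range in a shared vocabulary and look both up in one embedding table. The sketch below illustrates that idea only; the offset scheme, the toy embedding width, and all names are assumptions for illustration, and the source does not state how Step-Omni actually lays out its vocabulary.

```python
import random

LINGUISTIC_SIZE = 1024  # codebook size quoted in the text
SEMANTIC_SIZE = 4096    # codebook size quoted in the text
EMBED_DIM = 8           # toy width, hypothetical (a real model is far wider)


def to_shared_id(kind: str, code: int) -> int:
    """Map a codebook-local index to an id in one shared audio vocabulary.

    Assumed layout: linguistic codes occupy ids 0..1023 and semantic codes
    are shifted by LINGUISTIC_SIZE to occupy ids 1024..5119.
    """
    if kind == "linguistic":
        if not 0 <= code < LINGUISTIC_SIZE:
            raise ValueError(f"linguistic code out of range: {code}")
        return code
    if kind == "semantic":
        if not 0 <= code < SEMANTIC_SIZE:
            raise ValueError(f"semantic code out of range: {code}")
        return LINGUISTIC_SIZE + code
    raise ValueError(f"unknown tokenizer kind: {kind}")


# One embedding table covering both codebooks, randomly initialised here.
_rng = random.Random(0)
EMBEDDINGS = [
    [_rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(LINGUISTIC_SIZE + SEMANTIC_SIZE)
]


def embed(tokens: list[tuple[str, int]]) -> list[list[float]]:
    """Look up one vector per (kind, code) pair in the shared table."""
    return [EMBEDDINGS[to_shared_id(kind, code)] for kind, code in tokens]


if __name__ == "__main__":
    vectors = embed([("linguistic", 7), ("semantic", 7)])
    print(len(vectors), len(vectors[0]))
```

Because the two codebooks never share an id, the same local index (7 in the usage example) maps to two different rows, so every token entering the Transformer blocks has the same width regardless of which tokenizer produced it.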

Neural Vocoder
- Synthesizes high-quality speech waveforms from generated audio tokens.
- Based on a U-Net architecture that combines ResNet-1D layers with Transformer blocks to convert discrete tokens into a continuous speech waveform.

Project Links

Hugging Face Model Hub: https://huggingface.co/stepfun-ai/Step-Audio-AQAA
arXiv Paper: https://arxiv.org/pdf/2506.08967

Applications

Emotional Companion Robots: Adapts responses based on user mood, providing emotional support.
Multilingual Customer Service: Handles dialect-based queries and supports multiple languages.
Game NPC Interaction: Generates real-time emotional speech for dynamic in-game dialogues.
Smart Voice Assistants: Enables voice-based queries and reminders (e.g., schedules, information retrieval).
Education & Entertainment: Supports voice-based teaching, storytelling, and poetry recitation, with flexible switching between text and speech output.