AudioStory – an audio generation model released by Tencent ARC
What is AudioStory?
AudioStory is an audio generation technology released by Tencent ARC Lab. It can generate high-quality long-form narrative audio based on natural language descriptions. By adopting a divide-and-conquer strategy, it breaks down complex narrative requests into sequential subtasks and coordinates semantics with audio details through a decoupling-bridging mechanism. Its end-to-end training approach enhances model synergy, resulting in audio with clear temporal logic and emotional layers.
Main Features of AudioStory
- Automatic Video Dubbing: Users can upload a silent video and describe the desired audio style. AudioStory automatically analyzes the video content and generates synchronized background tracks with a unified style.
- Intelligent Audio Continuation: Given an audio clip, AudioStory can infer the following scene and generate reasonable continuations. For example, from a coach’s voice during basketball training, it can add player footsteps and ball-dribbling sounds.
- Audiobook Creation: It provides high-quality audio content for audiobooks, generating narrative audio with temporal logic and emotional depth based on text descriptions, helping listeners immerse themselves in the story.
- Game Sound Effect Production: It generates immersive sound effects for games, creating matching audio based on scene descriptions to enhance the player’s experience.
- Intelligent Podcasting: It helps podcast creators quickly generate audio content. Based on topic descriptions, it produces relevant audio clips to improve creative efficiency.
Technical Principles of AudioStory
- Divide-and-Conquer Strategy: Breaks complex narrative requests into ordered subtasks, generates the corresponding audio segments, and arranges them along a timeline to ensure coherence and logical flow.
- Decoupling-Bridging Mechanism: Splits collaboration between the large language model and the audio generator into two components, bridge queries and residual queries, for intra-event semantic alignment and cross-event consistency, improving generation quality.
- End-to-End Training: Uses a unified training approach to jointly optimize instruction comprehension and audio generation, strengthening model collaboration and overall performance.
- Dual-Channel Mechanism with Semantic and Residual Tokens: Processes macro-level narratives and micro-level sound details separately, aligning the two so that generated audio is both logically consistent and rich in detail.
- Three-Stage Progressive Training: From single-sound generation, to audio coordination, to long-form narrative generation, this stepwise training improves performance and adaptability, enabling the model to handle complex long-form audio tasks.
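To make the pipeline above concrete, here is a minimal sketch of the divide-and-conquer flow: a narrative request is decomposed into ordered sub-events, each segment is generated with both an intra-event semantic signal (standing in for bridge queries) and cross-event context (standing in for residual queries), and the results are laid out on a timeline. All names and the sentence-splitting planner are illustrative assumptions, not AudioStory's actual API; the real system uses an LLM for decomposition and a neural audio generator for synthesis.

```python
from dataclasses import dataclass

@dataclass
class SubEvent:
    description: str  # micro-level prompt for one audio segment
    start: float      # timeline position in seconds
    duration: float

def decompose_narrative(instruction: str, segment_len: float = 5.0) -> list[SubEvent]:
    """Divide-and-conquer: split a narrative request into ordered sub-events.
    (Hypothetical stand-in for the LLM planner; here we just split on sentences.)"""
    parts = [p.strip() for p in instruction.split(".") if p.strip()]
    return [SubEvent(p, i * segment_len, segment_len) for i, p in enumerate(parts)]

def generate_segment(event: SubEvent, context: list[str]) -> dict:
    """Stand-in for the audio generator conditioned on two query types:
    'bridge' carries intra-event semantics, 'residual' carries
    cross-event context for consistency (both mocked as strings here)."""
    return {
        "bridge": event.description,      # semantic alignment within the event
        "residual": " | ".join(context),  # consistency with prior events
        "start": event.start,
        "duration": event.duration,
    }

def generate_long_audio(instruction: str) -> list[dict]:
    """Generate segments in order, feeding each one the history of prior events."""
    events = decompose_narrative(instruction)
    segments, context = [], []
    for ev in events:
        segments.append(generate_segment(ev, context))
        context.append(ev.description)
    return segments

story = "Rain begins to fall. Thunder rolls in the distance. The storm fades"
timeline = generate_long_audio(story)
for seg in timeline:
    print(f"{seg['start']:>4.1f}s  {seg['bridge']}")
```

The key design point this illustrates is that later segments see the descriptions of earlier ones, which is how cross-event consistency is maintained along the timeline.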
Project Links for AudioStory
- GitHub Repository: https://github.com/TencentARC/AudioStory
Application Scenarios of AudioStory
- Video Dubbing: Automatically analyzes silent videos and generates background soundtracks matching user-described audio styles.
- Audio Continuation: Infers follow-up scenes and supplements audio clips plausibly, e.g., adding player footsteps to basketball training sounds.
- Audiobook Creation: Generates audio with logical sequence and emotional depth based on text, enhancing the listening experience.
- Game Sound Effect Generation: Produces immersive sound effects aligned with game scene descriptions, improving the gaming experience.