AudioStory – an audio generation model released by Tencent ARC
What is AudioStory?
AudioStory is an audio generation model released by Tencent ARC Lab. It generates high-quality, long-form narrative audio from natural language descriptions. Using a divide-and-conquer strategy, it breaks a complex narrative request into sequential subtasks, and a decoupling-bridging mechanism coordinates high-level semantics with low-level audio details. The model is trained end to end so that instruction comprehension and audio generation are optimized together, which yields audio with clear temporal logic and emotional layering.

Main Features of AudioStory
- Automatic Video Dubbing: Users can upload a silent video and describe the desired audio style. AudioStory automatically analyzes the video content and generates synchronized background tracks with a unified style.
- Intelligent Audio Continuation: Given an audio clip, AudioStory can infer the following scene and generate reasonable continuations. For example, from a coach’s voice during basketball training, it can add player footsteps and ball dribbling sounds.
- Audiobook Creation: It provides high-quality audio content for audiobooks, generating narrative audio with temporal logic and emotional depth based on text descriptions, helping listeners immerse themselves in the story.
- Game Sound Effect Production: It generates immersive sound effects for games, creating matching audio based on scene descriptions to enhance the player’s experience.
- Intelligent Podcasting: It helps podcast creators quickly generate audio content. Based on topic descriptions, it produces relevant audio clips to improve creative efficiency.
Technical Principles of AudioStory
- Divide-and-Conquer Strategy: Breaks complex narrative requests into ordered subtasks, generates the corresponding audio segments, and arranges them along a timeline to ensure coherence and logical flow.
- Decoupling-Bridging Mechanism: Splits the collaboration between the large language model and the audio generator into two components: bridge queries, which handle intra-event semantic alignment, and residual queries, which preserve cross-event consistency. This improves generation quality.
- End-to-End Training: Uses a unified training approach to jointly optimize instruction comprehension and audio generation, strengthening the collaboration between the two components and overall performance.
- Dual-Channel Mechanism with Semantic and Residual Tokens: Processes macro-level narratives and micro-level sound details separately, then aligns the two so that the generated audio is both logically consistent and rich in detail.
- Three-Stage Progressive Training: Training proceeds from single-sound generation, to audio coordination, to long-form narrative generation. This stepwise schedule improves performance and adaptability, enabling the model to handle complex long-form audio tasks.
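The divide-and-conquer idea above can be illustrated with a minimal Python sketch. It is not taken from the AudioStory codebase: the `Event` and `ScheduledEvent` classes, the `schedule`, `generate_segment`, and `render` functions, the hand-written plan, and the 16 kHz sample rate are all hypothetical, and the "generator" simply returns silence. The sketch only shows how an ordered list of sub-events might be laid out on a timeline and concatenated into one waveform.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    """One sub-task of the narrative (hypothetical structure, not the AudioStory API)."""
    caption: str       # what should be heard in this segment
    duration_s: float  # planned length of the segment in seconds


@dataclass
class ScheduledEvent:
    caption: str
    start_s: float
    end_s: float


def schedule(events: List[Event]) -> List[ScheduledEvent]:
    """Lay the ordered events end to end on a single timeline."""
    timeline: List[ScheduledEvent] = []
    cursor = 0.0
    for event in events:
        timeline.append(ScheduledEvent(event.caption, cursor, cursor + event.duration_s))
        cursor += event.duration_s
    return timeline


def generate_segment(caption: str, duration_s: float, sample_rate: int) -> List[float]:
    """Placeholder for an audio generator: returns silence of the right length."""
    return [0.0] * int(round(duration_s * sample_rate))


def render(events: List[Event], sample_rate: int = 16000) -> List[float]:
    """Generate each segment in order and concatenate them into one waveform."""
    audio: List[float] = []
    for event in events:
        audio.extend(generate_segment(event.caption, event.duration_s, sample_rate))
    return audio


# Assumption: in a real system the language model would produce this plan
# from the user's instruction; here it is written by hand for illustration.
plan = [
    Event("coach shouting instructions in a gym", 4.0),
    Event("sneakers squeaking as players run", 3.0),
    Event("basketball dribbling on a hardwood floor", 5.0),
]
timeline = schedule(plan)
waveform = render(plan)

for item in timeline:
    print(f"{item.start_s:5.1f}s - {item.end_s:5.1f}s  {item.caption}")
```

Keeping the plan (captions and durations) separate from segment generation mirrors the split the article describes between understanding the narrative and producing the sound. A real pipeline would also need to keep timbre and ambience consistent across segments, which this sketch does not attempt.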
Project Links for AudioStory
- GitHub Repository: https://github.com/TencentARC/AudioStory
Application Scenarios of AudioStory
- Video Dubbing: Automatically analyzes silent videos and generates background soundtracks matching user-described audio styles.
- Audio Continuation: Infers follow-up scenes and supplements audio clips reasonably, e.g., adding player footsteps to basketball training sounds.
- Audiobook Creation: Generates audio with logical sequence and emotional depth based on text, enhancing the listening experience.
- Game Sound Effect Generation: Produces immersive sound effects aligned with game scene descriptions, improving the gaming experience.