AudioStory – an audio generation model released by Tencent ARC

AI Tools updated 4d ago dongdong

What is AudioStory?

AudioStory is an audio generation model released by Tencent ARC Lab. It generates long-form narrative audio from natural language descriptions. Using a divide-and-conquer strategy, it breaks a complex narrative request into sequential subtasks, and a decoupling-bridging mechanism coordinates high-level semantics with fine-grained audio detail. The language model and the audio generator are trained end to end, which improves how well they work together and yields audio with clear temporal logic and emotional layering.


Main Features of AudioStory

  • Automatic Video Dubbing: Users can upload a silent video and describe the desired audio style. AudioStory automatically analyzes the video content and generates synchronized background tracks with a unified style.

  • Intelligent Audio Continuation: Given an audio clip, AudioStory infers what happens next in the scene and generates a plausible continuation. For example, given a coach’s voice during basketball training, it can add player footsteps and ball-dribbling sounds.

  • Audiobook Creation: It provides high-quality audio content for audiobooks, generating narrative audio with temporal logic and emotional depth based on text descriptions, helping listeners immerse themselves in the story.

  • Game Sound Effect Production: It generates immersive sound effects for games, creating matching audio based on scene descriptions to enhance the player’s experience.

  • Intelligent Podcasting: It helps podcast creators quickly generate audio content. Based on topic descriptions, it produces relevant audio clips to improve creative efficiency.


Technical Principles of AudioStory

  • Divide-and-Conquer Strategy: Breaks complex narrative requests into ordered subtasks, generates corresponding audio segments, and arranges them along a timeline to ensure coherence and logical flow.

  • Decoupling-Bridging Mechanism: Splits the collaboration between the large language model and the audio generator into two components: bridge queries, which handle intra-event semantic alignment, and residual queries, which maintain cross-event consistency. Separating the two roles improves generation quality.

  • End-to-End Training: Uses a unified training approach to jointly optimize instruction comprehension and audio generation, strengthening model collaboration and overall performance.

  • Dual-Channel Mechanism with Semantic and Residual Tokens: Processes macro-level narratives and micro-level sound details separately, aligning the two to ensure that generated audio is both logically consistent and rich in detail.

  • Three-Stage Progressive Training: Training proceeds in three stages, from single-sound generation, to audio coordination, to long-form narrative generation. Each stage builds on the previous one, so the model is prepared for complex long-form audio tasks.
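
The divide-and-conquer step described above can be illustrated with a small sketch. This is not AudioStory's code: in the real system a large language model plans the events, whereas the toy planner below simply treats each sentence as one event and assigns durations in proportion to word count. The names `AudioEvent`, `ScheduledEvent`, `plan_events`, and `arrange_on_timeline` are hypothetical and exist only for this illustration. It shows the general idea of turning one narrative request into ordered subtasks and placing them back to back on a timeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AudioEvent:
    caption: str       # text description of one sub-event
    duration_s: float  # planned length of the segment, in seconds


@dataclass
class ScheduledEvent:
    caption: str
    start_s: float
    end_s: float


def plan_events(story: str, total_s: float) -> List[AudioEvent]:
    """Toy stand-in for the planner (assumption: one event per sentence).

    Durations are split in proportion to each sentence's word count.
    """
    sentences = [s.strip() for s in story.split(".") if s.strip()]
    word_counts = [len(s.split()) for s in sentences]
    total_words = sum(word_counts)
    if total_words == 0:
        return []
    return [
        AudioEvent(caption=s, duration_s=total_s * n / total_words)
        for s, n in zip(sentences, word_counts)
    ]


def arrange_on_timeline(events: List[AudioEvent]) -> List[ScheduledEvent]:
    """Place events back to back, preserving their narrative order."""
    timeline = []
    cursor = 0.0
    for ev in events:
        timeline.append(ScheduledEvent(ev.caption, cursor, cursor + ev.duration_s))
        cursor += ev.duration_s
    return timeline


if __name__ == "__main__":
    story = "A coach shouts. Players run across the court. A ball bounces."
    for item in arrange_on_timeline(plan_events(story, total_s=22.0)):
        print(f"{item.start_s:5.1f}s - {item.end_s:5.1f}s  {item.caption}")
```

In a real pipeline each scheduled caption would be passed to an audio generator and the resulting segments joined in timeline order; that generation step is deliberately left out here.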


Project Links for AudioStory


Application Scenarios of AudioStory

  • Video Dubbing: Automatically analyzes silent videos and generates background soundtracks matching user-described audio styles.

  • Audio Continuation: Infers what comes next in a scene and extends the audio clip plausibly, e.g., adding player footsteps to basketball training sounds.

  • Audiobook Creation: Generates audio with logical sequence and emotional depth based on text, enhancing the listening experience.

  • Game Sound Effect Generation: Produces immersive sound effects aligned with game scene descriptions, improving the gaming experience.
