ThinkSound – Alibaba Tongyi's first Chain-of-Thought (CoT) audio generation model
What is ThinkSound?
ThinkSound is the first Chain-of-Thought (CoT) audio generation model, developed by Alibaba's Tongyi Speech Team. It is designed for video dubbing and generates customized, frame-level sound effects that precisely match the visual content. By introducing CoT reasoning, ThinkSound addresses the difficulty traditional models have in capturing fine-grained visual dynamics and spatial relationships: the model reasons step by step, like a professional sound designer, to produce high-fidelity audio synchronized with the video. Generation is driven by a three-stage reasoning chain covering basic sound effect generation, object-level interaction, and instruction-based editing. ThinkSound is trained on the AudioCoT dataset, whose audio data is annotated with reasoning chains. On the VGGSound benchmark, ThinkSound outperforms six mainstream methods, including Seeing&Hearing, V-AURA, FoleyCrafter, Frieren, V2A-Mapper, and MMAudio.
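For a concrete sense of what such a three-stage reasoning chain might look like in code, here is a minimal Python sketch. Every class, field, and function name is hypothetical and chosen only to mirror the stages described above; none of it is taken from the ThinkSound implementation.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class CoTStage:
    """One stage of a hypothetical reasoning chain that conditions audio generation."""
    name: str        # "foundational foley", "object-level interaction", or "instruction-based editing"
    reasoning: str   # step-by-step analysis, as a multimodal LLM might phrase it
    audio_plan: str  # condensed description handed to the audio generator


def build_chain(scene_summary: str,
                focus_object: str | None = None,
                edit_request: str | None = None) -> list[CoTStage]:
    """Assemble the chain: every clip gets a base stage; the later stages are
    added only when the user selects an object or issues an editing instruction."""
    chain = [CoTStage("foundational foley",
                      f"Analyse the overall motion and acoustics of: {scene_summary}",
                      f"ambient and action sounds matching {scene_summary}")]
    if focus_object:
        chain.append(CoTStage("object-level interaction",
                              f"Isolate the sound made by the {focus_object} and when it occurs",
                              f"time-aligned sound of the {focus_object}"))
    if edit_request:
        chain.append(CoTStage("instruction-based editing",
                              f"Apply the user's request: {edit_request}",
                              edit_request))
    return chain
```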
Key Features of ThinkSound
- Basic Sound Effect Generation: Automatically generates background sound effects that semantically and temporally align with the video content, providing a foundational audio layer for video scenes.
- Interactive Object-Level Refinement: Allows users to click on specific objects in the video to fine-tune and optimize their corresponding sound effects, so the audio aligns more precisely with the visual elements.
- Instruction-Driven Audio Editing: Supports natural-language commands to add, remove, or modify specific sound elements in the audio, catering to a wide range of creative needs. An illustrative interface covering these three modes is sketched after this list.
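To make the three interaction modes concrete, here is a minimal Python sketch of the kind of interface they imply. The `VideoDubber` protocol and its method names are assumptions made for illustration, not ThinkSound's published API; consult the GitHub repository linked below for the real entry points.

```python
from typing import Protocol


class VideoDubber(Protocol):
    """Illustrative interface for the three interaction modes described above.

    Every name here is hypothetical; it is not taken from the ThinkSound codebase.
    """

    def generate(self, video_path: str) -> bytes:
        """Basic generation: return a foundational sound-effect track for the clip."""
        ...

    def refine_object(self, video_path: str, audio: bytes,
                      click_xy: tuple[int, int], object_label: str) -> bytes:
        """Object-level refinement: re-synthesize the sound tied to a clicked object."""
        ...

    def edit(self, audio: bytes, instruction: str) -> bytes:
        """Instruction-driven editing: apply a natural-language edit such as
        'remove the background chatter and add light rain'."""
        ...
```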
Technical Principles of ThinkSound
- Chain-of-Thought Reasoning: Decomposes the audio generation process into sequential reasoning steps, including visual motion analysis, acoustic attribute inference, and temporally structured sound composition, mimicking the workflow of professional Foley artists.
- Multimodal Large Language Model (MLLM): Uses models such as VideoLLaMA2 to extract spatiotemporal and semantic information from the video, forming structured CoT reasoning chains that guide audio generation.
- Unified Audio Foundation Model: Built on conditional flow matching, the model fuses video, text, and audio context to produce high-fidelity sound and supports flexible input combinations for various generation and editing tasks (a minimal sampling sketch follows this list).
- Dataset Support: Trained on the AudioCoT dataset, whose structurally annotated reasoning chains help the model learn fine-grained audio-visual alignment.
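As a rough illustration of the conditional flow-matching step mentioned above, the sketch below integrates a learned velocity field from Gaussian noise to an audio latent with a simple Euler solver. The network, the conditioning tensor, the latent shape, and the step count are all assumptions for exposition; ThinkSound's actual sampler, architecture, and decoding stage may differ.

```python
import torch


@torch.no_grad()
def sample_audio_latent(velocity_net,                 # v(x, t, cond) -> dx/dt, a trained network (assumed)
                        cond: torch.Tensor,           # fused video/text/CoT conditioning (assumed shape)
                        latent_shape=(1, 64, 1024),   # illustrative audio-latent shape
                        num_steps: int = 50) -> torch.Tensor:
    """Euler integration of a conditional flow from noise (t=0) to an audio latent (t=1)."""
    x = torch.randn(latent_shape)                     # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((latent_shape[0],), i * dt)    # current flow time, one value per batch item
        x = x + dt * velocity_net(x, t, cond)         # follow the predicted velocity
    return x                                          # a separate decoder/vocoder would turn this into a waveform


# Quick check with a stand-in network (a real model would be trained on AudioCoT):
if __name__ == "__main__":
    dummy_net = lambda x, t, cond: torch.zeros_like(x)
    latent = sample_audio_latent(dummy_net, cond=torch.zeros(1, 512))
    print(latent.shape)  # torch.Size([1, 64, 1024])
```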
Project Links for ThinkSound
- Official Website: https://thinksound-project.github.io/
- GitHub Repository: https://github.com/liuhuadai/ThinkSound
- HuggingFace Model Hub: https://huggingface.co/liuhuadai/ThinkSound
- arXiv Technical Paper: https://arxiv.org/pdf/2506.21448
Application Scenarios for ThinkSound
- Film & Video Production: Automatically generates realistic background and scene-specific sound effects for movies, TV shows, and short videos, enhancing immersion and keeping the audio in sync with the visuals.
- Game Development: Produces dynamic environmental and interactive sound effects for games, enhancing immersion and interactivity to elevate the gaming experience.
- Advertising & Marketing: Generates engaging sound effects and background music for promotional videos and social media content, boosting engagement, brand impact, and shareability.
- Education & Training: Adds relevant sound effects to online educational videos or simulation environments, aiding comprehension and retention and improving learning outcomes.
- Virtual & Augmented Reality (VR/AR): Creates sound effects that seamlessly match virtual environments, enhancing immersion, interactivity, and personalization in VR/AR experiences.