What is Kling-Foley?
Kling-Foley is a multimodal video-to-audio generation model developed by Kling AI. Taking a video and an optional text prompt as conditional inputs, the model generates high-quality, temporally aligned stereo audio that semantically matches the video content, covering sound effects, background music, and other audio types, and it supports generation of arbitrary length. Built on a flow-matching architecture under multimodal control, Kling-Foley uses feature fusion and specialized modules to achieve precise audio-video alignment. Trained on a large-scale proprietary multimodal dataset, it delivers state-of-the-art audio generation performance and offers an efficient, high-quality solution for video content creation.
Key Features of Kling-Foley
- High-Quality Audio Generation: Generates stereo audio that is semantically aligned and temporally synchronized with the video content, including various types of sound effects and background music, fulfilling diverse audio needs across scenarios.
- Arbitrary-Length Audio Support: Capable of generating audio of any length, dynamically adapting to the duration of the input video.
- Stereo Rendering: Supports spatially aware stereo sound rendering, modeling the directionality of audio sources for enhanced spatial realism and an immersive experience.
Technical Architecture of Kling-Foley
- Multimodal Controlled Flow Matching Model: Kling-Foley adopts a flow-matching model under multimodal control. Video frames, text prompts, and timestamps are fed into a multimodal joint conditioning module for fusion and then processed by the MMDit module, enabling the model to better understand the video content and generate audio aligned with it (a training sketch follows this list).
- Modular Processing Pipeline: After multimodal fusion, the MMDit module predicts latent representations in the space of a VAE (Variational Autoencoder). A pretrained Mel decoder reconstructs single-channel Mel spectrograms from these latents, the Mono2Stereo module converts them into stereo spectrograms, and a vocoder finally generates the output waveform (sketched below).
- Visual-Semantic and Audio-Video Alignment Modules: Dedicated modules extract visual semantic features and synchronize audio with video at the frame level, ensuring strong temporal and semantic alignment between the two streams (a toy alignment example follows).
- Discrete Duration Embedding: A discrete duration embedding is introduced as part of the global conditioning mechanism, enabling the model to handle inputs of varying lengths and generate audio that matches the video duration (sketched below).
- Universal Latent Audio Codec: Kling-Foley uses a universal latent audio codec to model diverse audio types such as sound effects, speech, singing, and music. Its core is a Mel-VAE that jointly trains the Mel encoder, decoder, and discriminator to learn a continuous, complete latent space, significantly improving audio representation quality (sketched below).
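The flow-matching objective referenced above is typically trained by regressing a velocity field along a straight path from noise to data, conditioned on the fused multimodal features. The following minimal PyTorch sketch illustrates that general recipe only; the model class, feature shapes, and fusion here are invented stand-ins, not Kling-Foley's actual code:

```python
# Minimal conditional flow-matching training step (illustrative sketch only).
# All module names and dimensions are assumptions, not Kling-Foley's real code.
import torch
import torch.nn as nn

class TinyVelocityNet(nn.Module):
    """Stand-in for the MMDit backbone: predicts a velocity field v(x_t, t, cond)."""
    def __init__(self, latent_dim=64, cond_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, 256),
            nn.SiLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, x_t, t, cond):
        # Broadcast the scalar time t to every sequence position.
        t = t[:, None, None].expand(x_t.size(0), x_t.size(1), 1)
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def flow_matching_loss(model, x1, cond):
    """Rectified-flow style objective: regress the straight-line velocity x1 - x0."""
    x0 = torch.randn_like(x1)                     # noise sample
    t = torch.rand(x1.size(0), device=x1.device)  # random time in [0, 1]
    x_t = (1 - t[:, None, None]) * x0 + t[:, None, None] * x1  # linear interpolant
    v_target = x1 - x0                            # constant velocity along the path
    v_pred = model(x_t, t, cond)
    return nn.functional.mse_loss(v_pred, v_target)

model = TinyVelocityNet()
x1 = torch.randn(8, 100, 64)    # batch of audio VAE latents (assumed shape)
cond = torch.randn(8, 100, 64)  # fused video/text/timestamp conditioning (assumed)
loss = flow_matching_loss(model, x1, cond)
loss.backward()
```

At sampling time, the learned velocity field is integrated from noise to a latent, which then enters the decoding pipeline described next.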
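The modular pipeline stages are named in order: sampled latents pass through the pretrained Mel decoder, then Mono2Stereo, then the vocoder. A hedged sketch of how those stages might compose; every module below is a trivial toy stand-in with assumed shapes, not the released implementation:

```python
# Illustrative composition of the stages named in the text; each stage is a
# toy stand-in, not Kling-Foley's real modules.
import torch
import torch.nn as nn

class MelDecoder(nn.Module):
    """Pretrained Mel decoder stand-in: VAE latents -> mono Mel spectrogram."""
    def __init__(self, latent_dim=64, n_mels=80):
        super().__init__()
        self.proj = nn.Linear(latent_dim, n_mels)
    def forward(self, latents):                    # (B, T, latent_dim)
        return self.proj(latents).transpose(1, 2)  # (B, n_mels, T)

class Mono2Stereo(nn.Module):
    """Mono2Stereo stand-in: predicts left/right Mel channels from mono."""
    def __init__(self, n_mels=80):
        super().__init__()
        self.to_stereo = nn.Conv1d(n_mels, 2 * n_mels, kernel_size=3, padding=1)
    def forward(self, mel_mono):                   # (B, n_mels, T)
        out = self.to_stereo(mel_mono)
        return out.reshape(out.size(0), 2, -1, out.size(-1))  # (B, 2, n_mels, T)

class Vocoder(nn.Module):
    """Vocoder stand-in: stereo Mel -> stereo waveform (hop size 256 assumed)."""
    def __init__(self, n_mels=80, hop=256):
        super().__init__()
        self.up = nn.ConvTranspose1d(n_mels, 1, kernel_size=hop, stride=hop)
    def forward(self, mel_stereo):                 # (B, 2, n_mels, T)
        b, c, m, t = mel_stereo.shape
        wav = self.up(mel_stereo.reshape(b * c, m, t))  # (B*2, 1, T*hop)
        return wav.reshape(b, c, -1)               # (B, 2, samples)

def generate_audio(latents, mel_decoder, mono2stereo, vocoder):
    """latents: (B, T, latent_dim) sampled by the flow-matching model."""
    mel_mono = mel_decoder(latents)      # stage 1: latent -> mono Mel
    mel_stereo = mono2stereo(mel_mono)   # stage 2: mono -> stereo Mel
    return vocoder(mel_stereo)           # stage 3: Mel -> waveform

wav = generate_audio(torch.randn(1, 200, 64), MelDecoder(), Mono2Stereo(), Vocoder())
print(wav.shape)  # torch.Size([1, 2, 51200])
```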
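Frame-level audio-video synchronization generally requires bringing video features, produced at the video frame rate, onto the audio feature timeline before fusion. The toy example below shows one common way to do that with simple interpolation; it is purely an assumed illustration of the idea, as the actual alignment modules are learned and more involved:

```python
# Assumed illustration: resample per-frame visual features to the audio
# feature rate so the two streams can be fused position-by-position.
import torch
import torch.nn.functional as F

def align_video_to_audio(video_feats, n_audio_frames):
    """video_feats: (B, T_video, D) at the video frame rate.
    Returns (B, n_audio_frames, D) on the audio timeline."""
    x = video_feats.transpose(1, 2)  # (B, D, T_video)
    x = F.interpolate(x, size=n_audio_frames, mode="linear", align_corners=False)
    return x.transpose(1, 2)         # (B, n_audio_frames, D)

video_feats = torch.randn(2, 120, 64)  # e.g. 5 s of video at 24 fps
aligned = align_video_to_audio(video_feats, n_audio_frames=215)  # ~43 audio frames/s
fused = aligned + torch.randn(2, 215, 64)  # fuse with audio-rate features (toy)
```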
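A discrete duration embedding is straightforward to picture: quantize the clip duration into buckets and look up a learned vector that joins the global condition. A minimal sketch under assumed bucket sizes; nothing here comes from the released code:

```python
# Toy discrete duration embedding; bucket count and width are assumptions.
import torch
import torch.nn as nn

class DurationEmbedding(nn.Module):
    """Maps a continuous duration (seconds) to a learned embedding via discrete buckets."""
    def __init__(self, max_seconds=60, bucket_width=1.0, dim=64):
        super().__init__()
        self.bucket_width = bucket_width
        self.n_buckets = int(max_seconds / bucket_width) + 1
        self.table = nn.Embedding(self.n_buckets, dim)

    def forward(self, duration_sec):
        # Quantize to a bucket index, clamped to the table size.
        idx = (duration_sec / self.bucket_width).long().clamp(0, self.n_buckets - 1)
        return self.table(idx)

embed = DurationEmbedding()
global_cond = embed(torch.tensor([3.2, 12.7, 45.0]))  # (3, 64) global condition vectors
```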
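Finally, for the universal latent audio codec, the text names a Mel-VAE whose encoder, decoder, and discriminator are trained jointly. Below is a compact sketch of the VAE half with the standard reparameterization trick and KL term; the architecture and sizes are invented for illustration, and the adversarial discriminator is omitted:

```python
# Minimal Mel-VAE sketch: encoder/decoder with reparameterization and KL loss.
# Layer sizes are invented; the jointly trained discriminator is omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MelVAE(nn.Module):
    def __init__(self, n_mels=80, latent_dim=64):
        super().__init__()
        self.encoder = nn.Conv1d(n_mels, 2 * latent_dim, kernel_size=3, padding=1)
        self.decoder = nn.Conv1d(latent_dim, n_mels, kernel_size=3, padding=1)

    def forward(self, mel):  # (B, n_mels, T)
        mu, logvar = self.encoder(mel).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        recon = self.decoder(z)
        # Reconstruction term plus KL divergence to a unit Gaussian prior.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon, F.l1_loss(recon, mel) + 1e-2 * kl

vae = MelVAE()
mel = torch.randn(4, 80, 200)
recon, loss = vae(mel)
loss.backward()
```

The continuous latent space learned this way is what the flow-matching model generates in, which is why the same codec can serve sound effects, speech, singing, and music alike.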
Project Links
- Project Website: https://klingfoley.github.io/Kling-Foley/
- GitHub Repository: https://github.com/klingfoley/Kling-Foley
- arXiv Paper: https://www.arxiv.org/pdf/2506.19774
Application Scenarios
- Video Content Creation: Provides precisely matched sound effects and background music for animation, short videos, advertisements, and more, enhancing content appeal and production efficiency.
- Game Development: Generates realistic environmental and action sound effects, such as gunshots, character movements, and ambient sounds, boosting game immersion and player engagement.
- Education and Training: Adds appropriate sound effects and background music to educational videos and virtual training simulations, increasing realism and learning effectiveness.
- Film and Television Production: Produces high-quality sound effects and soundtracks for movies, dramas, and TV shows, enhancing overall audio quality and the emotional impact of the narrative.
- Social Media: Enables users to quickly add matching sound effects and background music to the videos they share, improving content attractiveness and engagement.