AudioFly – iFLYTEK’s Open-Source Text-to-Sound Effects Model

What is AudioFly?

AudioFly is an open-source text-to-sound-effects model from iFLYTEK. It is built on a latent diffusion model (LDM) architecture with 1 billion parameters and is trained on large open datasets such as AudioSet, AudioCaps, and TUT, together with proprietary internal data. Given a text description, AudioFly generates audio at a sampling rate of up to 44.1 kHz, producing sound effects that closely match the prompt. The model reportedly performs well on both single-event and multi-event prompts and achieves state-of-the-art results on the AudioCaps dataset. Typical applications include short video dubbing and audio story generation.

Key Features of AudioFly

Text-to-sound generation: Generates corresponding sound effects based on user-provided text descriptions. For example, inputting “thunder roaring in the distance” will produce the matching thunder sound effect.
High-quality audio output: Generates clear audio at a 44.1 kHz sampling rate (the rate used for CD audio), suitable for a range of applications.
Multi-scenario support: Supports both single-event sounds (e.g., “dog barking”) and multi-event scenarios (e.g., “dog barking and wind blowing”), accurately reflecting the described content.
Efficient generation: The diffusion-based architecture is designed to generate audio efficiently and respond quickly to user requests.
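
To make the 44.1 kHz figure concrete, the sketch below shows how a generated waveform might be stored as a WAV file at that rate. It uses only the Python standard library. The sine tone is a placeholder for model output, the mono 16-bit format is an assumption, and the helper names (placeholder_waveform, to_wav_bytes) are illustrative and not part of AudioFly.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 44_100  # Hz, the output rate quoted for AudioFly


def placeholder_waveform(seconds: float, freq_hz: float = 440.0) -> list[float]:
    """Stand-in for model output: a sine tone with samples in [-1.0, 1.0]."""
    n_samples = int(seconds * SAMPLE_RATE)
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n_samples)]


def to_wav_bytes(samples: list[float], sample_rate: int = SAMPLE_RATE) -> bytes:
    """Encode mono float samples as 16-bit PCM WAV data (assumed format)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)          # mono
        wav.setsampwidth(2)          # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        # Clip to [-1, 1] before scaling to the signed 16-bit range.
        clipped = (max(-1.0, min(1.0, s)) for s in samples)
        frames = b"".join(struct.pack("<h", int(s * 32767)) for s in clipped)
        wav.writeframes(frames)
    return buf.getvalue()


samples = placeholder_waveform(2.0)
wav_bytes = to_wav_bytes(samples)
```

At 44.1 kHz, a 2-second clip contains 88,200 samples. The resulting wav_bytes can be written to disk with an ordinary binary file write.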

Technical Principles of AudioFly

Latent Diffusion Model (LDM) architecture: AudioFly uses a latent diffusion model, a deep generative framework that runs the diffusion process in a compressed latent space rather than directly on the waveform. Starting from random noise, the model progressively removes noise to produce the target audio, much like diffusion models for image generation.
Large-scale data training: Trained on extensive open datasets (AudioSet, AudioCaps, TUT) as well as proprietary internal datasets covering diverse sounds and scenarios, enabling the model to generate a wide variety of audio effects.
Feature alignment: The training objective encourages the generated audio to match the characteristics of real audio while staying faithful to the text description.
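
The "progressively removing noise" idea can be illustrated with a toy reverse-diffusion loop. This is a minimal sketch of standard DDPM-style sampling on a four-number "latent", not AudioFly's actual sampler. In particular, oracle_noise_predictor is a hypothetical stand-in for the trained, text-conditioned network: it is handed the clean target directly, which a real model never is. The schedule values and function names are assumptions chosen for illustration. In a real latent diffusion system, the final latent would typically be decoded back to audio by a separate decoder.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule: per-step betas, alphas, and cumulative alpha products."""
    betas = [beta_start + (beta_end - beta_start) * t / (num_steps - 1)
             for t in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)
    return betas, alphas, alpha_bars


def oracle_noise_predictor(x_t, alpha_bar_t, target_latent):
    """Hypothetical stand-in for the trained network.

    A real model predicts the noise from x_t, the step, and the text prompt.
    Here we cheat and compute the exact noise from the known clean latent.
    """
    scale = math.sqrt(alpha_bar_t)
    noise_std = math.sqrt(1.0 - alpha_bar_t)
    return [(x - scale * x0) / noise_std for x, x0 in zip(x_t, target_latent)]


def sample(target_latent, num_steps=1000, seed=0):
    """Start from Gaussian noise and denoise step by step (DDPM ancestral sampling)."""
    rng = random.Random(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in target_latent]  # pure noise
    for t in reversed(range(num_steps)):
        eps_hat = oracle_noise_predictor(x, alpha_bars[t], target_latent)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        # Posterior mean: remove the predicted noise, then rescale.
        mean = [(xi - coef * e) / math.sqrt(alphas[t]) for xi, e in zip(x, eps_hat)]
        if t > 0:
            sigma = math.sqrt(betas[t])
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]  # re-inject a little noise
        else:
            x = mean  # final step is deterministic
    return x


target = [0.5, -1.0, 0.25, 2.0]   # toy "clean latent"
recovered = sample(target)
```

Because the oracle predicts the noise perfectly, the loop recovers the target latent up to floating-point error. With a learned predictor, the text prompt is what steers the same loop toward a latent that matches the description.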

Project Link

ModelScope: https://modelscope.cn/models/iflytek/AudioFly

Use Cases of AudioFly

Short video dubbing: Quickly generate matching sound effects for short videos, enhancing viewer engagement and immersion.
Audio story creation: Generate sound effects from text to enrich the atmosphere and emotional expression of stories.
Film and TV sound production: Assist production teams in rapidly generating required sound effects, improving efficiency.
Game sound design: Produce real-time sound effects for game environments, enhancing player immersion and experience.
Advertising and marketing: Generate custom sound effects for ads or audio content, increasing their appeal and memorability.