SongBloom – A Full-Length Song Generation Model Developed by Tencent AI Lab
What is SongBloom?
SongBloom is a full-length song generation framework developed by Tencent AI Lab that combines autoregressive sketching with diffusion-based refinement. Through an Interleaved Generation paradigm, the model alternates between generating semantic and acoustic contexts to produce high-quality, complete songs. From just a 10-second audio sample and the corresponding lyrics, SongBloom can generate a 2-minute-30-second stereo track at 48 kHz. It achieves performance comparable to the state of the art (SOTA) in both audio quality and lyric alignment, and the project has been officially open-sourced.
Key Features of SongBloom
- Efficient Song Generation: Generates a full 2-minute-30-second song from only a 10-second audio sample and its lyrics, with high-quality 48 kHz stereo output (a usage sketch follows this list).
- Innovative Generation Paradigm: Employs an Interleaved Generation framework that alternates between semantic and acoustic generation, using autoregressive sketching for structure and diffusion-based refinement for sound quality.
- Superior Audio Quality and Accuracy: Achieves near-SOTA performance in both audio fidelity and lyric alignment, outperforming existing open-source models.
- Open Source and User-Friendly: The project is open-sourced with comprehensive documentation, multiple model versions, and low-VRAM support, making it easy to deploy and experiment with.
- Broad Application Potential: Provides powerful tools for music creation and audio production, greatly enhancing creative efficiency and inspiring new musical ideas.
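To make the input/output contract concrete, here is a minimal usage sketch in Python. The `songbloom` package name, the `SongBloom.from_pretrained` loader, and the `generate` signature are illustrative assumptions rather than the repository's actual API; consult the GitHub README for the real inference entry point.

```python
# Minimal usage sketch. NOTE: the package name, loader, and generate()
# signature below are assumptions for illustration, not SongBloom's real API.
import soundfile as sf

from songbloom import SongBloom  # hypothetical wrapper module

# Load a pretrained checkpoint (assumed Hugging Face-style loader).
model = SongBloom.from_pretrained("CypressYang/SongBloom")

lyrics = (
    "[verse] Neon rivers running through the night ...\n"
    "[chorus] We keep on singing till the morning light ..."
)

# A ~10-second reference clip supplies style and timbre; the model then
# generates roughly 2 minutes 30 seconds of new audio.
waveform = model.generate(
    lyrics=lyrics,
    prompt_wav="reference_10s.wav",
    duration=150,  # seconds
)

# Output is assumed to be a (num_frames, 2) float array: 48 kHz stereo.
sf.write("generated_song.wav", waveform, samplerate=48000)
```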
Technical Principles of SongBloom
- Interleaved Generation Paradigm: Alternates between generating semantic and acoustic contexts, so that high-level structural planning and fine-grained sound rendering inform each other throughout the song (see the sketch after this list).
- Autoregressive Sketching: Uses an autoregressive model to generate a coarse music “sketch,” ensuring coherent structure and accurate phoneme alignment with the lyrics.
- Diffusion-Based Refinement: Applies a diffusion model to refine the generated sketch into high-fidelity audio, improving detail and realism.
- Hybrid Discrete-Continuous Output: Combines discrete sketch tokens with continuous VAE latents, balancing control over structure with control over sound quality.
- Multimodal Input Fusion: Fuses the lyrics and the reference audio sample into a shared conditioning context, enabling precise, context-aware music generation.
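The principles above can be tied together in one simplified loop: autoregressively extend a discrete sketch, diffuse the matching audio segment into continuous latents, and feed the result back as context for the next step. Everything below (`sketch_lm.generate`, `diffusion.sample`, the segment sizes) is an assumed placeholder that illustrates the paradigm, not SongBloom's actual implementation.

```python
import torch

def interleaved_generate(sketch_lm, diffusion, lyrics_tokens, prompt_latent,
                         num_segments=30, tokens_per_segment=50):
    """Simplified interleaved generation loop. All module interfaces here
    (sketch_lm.generate, diffusion.sample) are assumed placeholders that
    illustrate the paradigm, not SongBloom's actual implementation."""
    sketch = torch.empty(0, dtype=torch.long)  # discrete semantic tokens
    latents = [prompt_latent]                  # continuous VAE latents (10 s prompt)

    for _ in range(num_segments):
        acoustic_context = torch.cat(latents, dim=-1)

        # 1) Autoregressive sketching: extend the coarse structural plan,
        #    conditioned on the lyrics and on all audio generated so far.
        new_tokens = sketch_lm.generate(
            lyrics=lyrics_tokens,
            prefix=sketch,
            acoustic_context=acoustic_context,
            max_new_tokens=tokens_per_segment,
        )
        sketch = torch.cat([sketch, new_tokens])

        # 2) Diffusion-based refinement: denoise the matching latent segment
        #    into fine acoustic detail, guided by the newly sketched tokens.
        segment = diffusion.sample(condition=new_tokens,
                                   context=acoustic_context)
        latents.append(segment)

    # The concatenated latents would then be decoded by a VAE decoder
    # (not shown) into the final 48 kHz stereo waveform.
    return torch.cat(latents, dim=-1)
```

Note the two parallel representations the loop maintains: discrete tokens carry structure and lyric alignment, while continuous latents carry audio fidelity, which is the hybrid discrete-continuous output described above.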
Project Resources
- GitHub Repository: https://github.com/tencent-ailab/SongBloom
- Hugging Face Model Hub: https://huggingface.co/CypressYang/SongBloom
- arXiv Paper: https://arxiv.org/pdf/2506.07634
- Online Demo: https://cypress-yang.github.io/SongBloom_demo/
Application Scenarios of SongBloom
- Music Creation: Assists musicians and creators in rapidly generating high-quality song foundations, inspiring exploration of new styles and creative directions.
- Audio Production: Supports film, gaming, and advertising industries by quickly generating background scores or theme songs, enhancing production efficiency.
- Education: Serves as a music education tool, helping students understand song structure and composition processes while stimulating creative learning.
- Entertainment: Enables users on social media and short-video platforms to generate personalized music content, boosting engagement and creativity.
- Commercial Use: Allows brands and enterprises to generate customized music for marketing, events, and promotions, strengthening brand identity and influence.