ACE-Step – A music generation foundation model open-sourced by ACE Studio in collaboration with StepFun
What is ACE-Step?
ACE-Step is an open-source foundation model for music generation, jointly developed by ACE Studio and StepFun. Through an architecture that combines a diffusion model, a Deep Compression AutoEncoder (DCAE), and a lightweight linear transformer, ACE-Step enables fast, coherent, and controllable music creation. It generates high-quality music up to 15 times faster than traditional LLM-based approaches, and it supports a wide range of musical styles, languages, and control features, making it a powerful tool for musicians, producers, and content creators. It serves as a foundation for a wide range of downstream music generation tasks.
Key Features of ACE-Step
- Fast Composition: Generates high-quality music quickly; for example, a 4-minute track can be synthesized in about 20 seconds on an NVIDIA A100 GPU.
- Diverse Styles: Supports a wide range of popular music genres such as pop, rock, electronic, and jazz, as well as lyrics in multiple languages.
- Variant Generation: By adjusting the noise ratio, users can generate diverse variations of a musical piece (see the sketch after this list).
- Inpainting (Repainting): Allows selective regeneration of specific segments, such as changing the style, lyrics, or vocals while preserving other elements.
- Lyric Editing: Enables partial lyric modifications without affecting the melody or instrumental backing.
- Multilingual Support: Supports 19 languages, with particularly strong performance in 10 of them, including English, Chinese, Russian, Spanish, and Japanese.
- Lyric2Vocal: Uses LoRA fine-tuning to generate human vocals directly from lyrics.
- Text2Samples: Generates music samples and loops, helping producers quickly create instrument loops, sound effects, and more.
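The exact Python API behind these controls is not documented in this article, so the sketch below is purely illustrative: the `GenerationRequest` dataclass and every field name in it are hypothetical stand-ins, showing only how the features above (style tags, lyrics, duration, noise ratio for variants, and a repaint range) could map onto a single generation request.

```python
# Illustrative only: `GenerationRequest` and all of its field names are
# hypothetical stand-ins, not ACE-Step's documented API. They show how the
# features above map onto the inputs of one generation call.

from dataclasses import dataclass

@dataclass
class GenerationRequest:
    prompt: str                 # comma-separated style tags, e.g. "pop, rock"
    lyrics: str                 # lyrics with [verse]/[chorus] structure markers
    duration_s: float           # target track length in seconds
    noise_ratio: float = 0.0    # > 0 requests a variant of a previous take
    repaint_range: tuple[float, float] | None = None  # segment to regenerate

request = GenerationRequest(
    prompt="pop, female vocal, piano, 120 bpm",
    lyrics="[verse]\nCity lights are calling me\n[chorus]\nRun away tonight",
    duration_s=240.0,            # a 4-minute track
    noise_ratio=0.3,             # moderate variation of an earlier result
    repaint_range=(60.0, 90.0),  # repaint only seconds 60-90 (e.g. new lyrics)
)
print(request)
```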
Technical Principles of ACE-Step
- Diffusion Model: Utilizes stepwise denoising for data generation. Traditional diffusion models often struggle with long-range structural coherence, which ACE-Step addresses through its architecture (a toy sketch of the denoising loop follows this list).
- Deep Compression AutoEncoder (DCAE): Efficiently compresses and decompresses audio data, preserving fine-grained audio detail while reducing computational cost.
- Lightweight Linear Transformer: Processes musical sequence information, ensuring coherence in melody, harmony, and rhythm.
- Semantic Alignment: Uses MERT and mHuBERT for representation alignment (REPA) during training, enabling faster convergence and higher generation quality.
- Training Optimization: Semantic alignment and optimized training strategies allow ACE-Step to balance generation speed and coherence, producing high-quality music efficiently.
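To make the pipeline concrete, here is a toy sketch of how the pieces fit together: audio lives in a compressed latent space, a denoiser runs over a series of steps starting from pure noise, and the result is decoded back to audio frames. Every module below (the linear encoder/decoder, the tiny MLP denoiser, the Euler-style update) is a deliberately simplified stand-in, not ACE-Step's actual DCAE, transformer, or sampler.

```python
# Conceptual sketch of latent diffusion for audio, NOT ACE-Step's real code.
import torch

torch.manual_seed(0)
FRAME, LATENT_DIM, SEQ_LEN, STEPS = 256, 8, 32, 50

# Stand-in compressor: the real DCAE is a learned deep autoencoder over audio.
encode = torch.nn.Linear(FRAME, LATENT_DIM)
decode = torch.nn.Linear(LATENT_DIM, FRAME)

# Stand-in denoiser: the real model is a lightweight linear transformer that
# predicts the noise present in the latent sequence at each timestep.
denoiser = torch.nn.Sequential(
    torch.nn.Linear(LATENT_DIM + 1, 64),  # +1 input for timestep conditioning
    torch.nn.SiLU(),
    torch.nn.Linear(64, LATENT_DIM),
)

@torch.no_grad()
def denoise(x: torch.Tensor, start_step: int) -> torch.Tensor:
    """Run the stepwise denoising loop from `start_step` down to 0."""
    for step in reversed(range(start_step)):
        t = torch.full((x.shape[0], 1), step / STEPS)   # timestep signal
        pred_noise = denoiser(torch.cat([x, t], dim=-1))
        x = x - pred_noise / STEPS                      # crude Euler update
    return x

# Generation: start from pure noise in the compressed latent space.
latents = denoise(torch.randn(SEQ_LEN, LATENT_DIM), start_step=STEPS)
audio = decode(latents)                                 # back to audio frames
print(audio.shape)  # torch.Size([32, 256])

# Variant generation: partially re-noise an existing track's latents and
# denoise again; the noise ratio controls how far the variant drifts.
noise_ratio = 0.3
noisy = encode(audio) * (1 - noise_ratio) \
    + torch.randn(SEQ_LEN, LATENT_DIM) * noise_ratio
variant = decode(denoise(noisy, start_step=int(STEPS * noise_ratio)))
```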
ACE-Step Project Links
- Official Website: https://ace-step.github.io/
- GitHub Repository: https://github.com/ace-step/ACE-Step
- Hugging Face Model Hub: https://huggingface.co/ACE-Step/ACE-Step-v1-3.5B
- Online Demo: https://huggingface.co/spaces/ACE-Step/ACE-Step (a programmatic client example follows this list)
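The hosted demo is a Gradio Space, so it can also be driven from Python with the official gradio_client package. The Space's endpoint names and argument lists are not documented in this article, so the snippet below only connects and prints the Space's actual API; the commented-out call shows the rough shape of a request with assumed argument names.

```python
# Driving the hosted demo from Python via the official gradio_client package.
# The Space's endpoints are not documented here, so connect and inspect first;
# the commented-out call is an assumed shape, not the confirmed signature.

from gradio_client import Client

client = Client("ACE-Step/ACE-Step")  # the Hugging Face Space linked above
client.view_api()                     # prints the real endpoints and arguments

# A generation request would then look roughly like this (the api_name and
# argument names below are placeholders; substitute what view_api() reports):
# result = client.predict(
#     "pop, synth, female vocal, 120 bpm",   # style/tag prompt (assumed arg)
#     "[verse]\nCity lights are calling me", # lyrics (assumed arg)
#     api_name="/generate",                  # placeholder endpoint name
# )
# print(result)
```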
Application Scenarios of ACE-Step
- Music Creation: Quickly generates melodies and lyrics to inspire new compositions.
- Vocal Generation: Creates human vocal audio directly from lyrics, ideal for vocal demos.
- Music Production: Produces instrumental loops and sound effects to enrich music production.
- Multilingual Composition: Supports cross-language music creation for global audiences.
- Music Education: Serves as a teaching tool to help learners understand music composition and production.