Microsoft open-sources TTS model VibeVoice, capable of generating up to 90 minutes of speech
Microsoft has open-sourced VibeVoice-1.5B, a text-to-speech (TTS) model that can generate up to 90 minutes of natural-sounding speech with up to four speakers and supports cross-lingual and singing synthesis. The model is built on the 1.5B-parameter Qwen2.5 language model and pairs it with acoustic and semantic tokenizers that operate at a low frame rate of 7.5 Hz.
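A quick back-of-the-envelope calculation shows what the 7.5 Hz frame rate means for long-form audio. The sketch below is only illustrative arithmetic: it assumes one frame per tokenizer step and ignores text tokens and any per-speaker overhead, and the helper name `frames_for_duration` is hypothetical, not part of VibeVoice.

```python
FRAME_RATE_HZ = 7.5  # tokenizer frame rate quoted in the article


def frames_for_duration(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Return the number of frames needed to cover `minutes` of audio.

    Assumption (not from the article): exactly one frame is produced per
    tokenizer step, so frames = seconds * frame rate.
    """
    return round(minutes * 60 * frame_rate_hz)


if __name__ == "__main__":
    for minutes in (1, 10, 90):
        print(f"{minutes:>3} min -> {frames_for_duration(minutes):,} frames")
```

Under these assumptions, the full 90 minutes corresponds to 90 × 60 × 7.5 = 40,500 frames, which suggests why a low frame rate helps keep very long sequences manageable.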