Vui – A lightweight open-source speech conversation model by Fluxions-AI
What is Vui?
Vui is a lightweight open-source speech dialogue model developed by the Fluxions-AI team, based on the LLaMA architecture. The model has been trained on 40,000 hours of conversational data and can simulate realistic speech elements such as filler words, laughter, and pauses, providing an immersive interactive experience. Vui offers three versions: a base model (general-purpose), a single-speaker model (context-aware), and a dual-speaker model (two-person interaction). It is suitable for voice assistants, podcast generation, education and training, and other scenarios. The model supports local deployment, runs on consumer-grade hardware with low resource consumption, and thus addresses common pain points of traditional speech models: heavy footprints, unnatural output, and difficult deployment.
Key Features of Vui
- Realistic Voice Interaction: Accurately simulates filler words like "um" and "hmm," as well as laughter, hesitation, and other non-verbal elements, making conversations more natural and immersive.
- Multiple Models for Different Scenarios: Provides Vui.BASE (base model), Vui.ABRAHAM (single-speaker, context-aware), and Vui.COHOST (dual-speaker interaction), suitable for general dialogue, single-person context-aware conversations, and two-person interactive dialogues respectively.
- Lightweight Design and Local Deployment: The model is lightweight and can run on consumer devices such as standard PCs and laptops, with low resource usage. It does not require cloud computing, making local deployment easy and cost-effective while reducing dependency on network connectivity.
Technical Principles of Vui
- Based on LLaMA Architecture: Vui is built on the LLaMA Transformer architecture, which is efficient and performs well even at small model scales, supporting Vui's lightweight design.
- Audio Token Prediction: The model generates speech by predicting sequences of audio tokens. It breaks audio signals down into tokens and, having been trained on vast conversational data, predicts the next audio token to produce smooth and natural speech.
- Extensive Conversational Training: With 40,000 hours of dialogue training, Vui has accumulated rich linguistic and speech features, enabling it to understand and generate complex semantics and emotional expression, achieving highly natural voice interactions.
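The audio-token prediction described above is an autoregressive loop: given the tokens generated so far, the model predicts the next audio token, appends it, and repeats. A minimal sketch of that loop, using a toy rule-based stand-in for the model (the vocabulary size and transition table here are illustrative placeholders, not Vui's actual predictor, which is a LLaMA-style transformer over a much larger audio-token vocabulary):

```python
import random

# Toy "model": maps the last context token to a few plausible next tokens.
# Placeholder for illustration only; not Vui's real predictor.
VOCAB_SIZE = 8
TRANSITIONS = {t: [(t + 1) % VOCAB_SIZE, (t + 2) % VOCAB_SIZE]
               for t in range(VOCAB_SIZE)}

def predict_next(context):
    """Pick the next audio token given the context so far (toy rule)."""
    candidates = TRANSITIONS[context[-1]]
    return random.choice(candidates)

def generate(prompt, n_tokens, seed=0):
    """Autoregressively extend a prompt of audio tokens, one token at a time."""
    random.seed(seed)
    tokens = list(prompt)
    for _ in range(n_tokens):
        tokens.append(predict_next(tokens))
    return tokens

seq = generate([0], n_tokens=10)
# In a real pipeline, `seq` would be passed to a neural audio codec
# decoder to reconstruct the output waveform.
```

The key structural point is that each new token is conditioned on everything generated before it, which is what lets a trained model carry prosody, pauses, and filler sounds coherently through an utterance.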
Project Links
- GitHub Repository: https://github.com/fluxions-ai/vui
- Online Demo: https://huggingface.co/spaces/fluxions/vui-space
Application Scenarios
- Voice Assistants: Used in personal assistants and intelligent customer service, providing natural and smooth voice interactions to help users query information, manage schedules, or answer customer questions.
- Podcast Generation: Quickly generates interview- or debate-style dual-speaker audio, enhancing authenticity and appeal and helping podcast creators produce efficiently.
- Content Creation: For video dubbing, audiobook production, or audio storytelling, adding natural voice elements to enhance realism and engagement.
- Education and Training: Simulates realistic dialogue scenarios and generates teaching audio, aiding language learning and interactive teaching to improve student interest and learning outcomes.
- Smart Home and IoT: Integrated into smart home and IoT devices to provide natural voice control, allowing users to operate devices and query information easily via speech.