MoshiVis – Kyutai: An Open-Source Multimodal Real-Time Speech Model


What is MoshiVis?

MoshiVis is an open-source multimodal speech model released by Kyutai. It is built on the Moshi real-time conversational speech model and adds visual input, enabling natural, real-time voice interaction about images: by combining speech and visual information, it lets users talk with the model about an image's content by voice. The model adds roughly 206M adapter parameters on top of Moshi's 7B base architecture and integrates a 400M PaliGemma2 vision encoder. Through cross-attention and gating mechanisms, MoshiVis weaves visual information into the speech stream while maintaining low latency and a natural conversational style. It supports three backends, PyTorch, Rust, and MLX, with the Web UI frontend recommended for interaction.
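To make that architecture concrete, below is a minimal PyTorch sketch of the fusion pattern described above: image features from a vision encoder are projected into the speech model's width, and the speech hidden states attend to them through a gated cross-attention adapter. All names, dimensions, and the scalar gate here are illustrative assumptions, not the actual MoshiVis implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    """Hypothetical adapter: lets speech hidden states attend to image features."""
    def __init__(self, d_model: int, d_image: int, n_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(d_image, d_model)  # map image features to model width
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # learned gate, starts "closed"

    def forward(self, speech_h: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        kv = self.proj(image_feats)                # (B, n_patches, d_model)
        attended, _ = self.attn(speech_h, kv, kv)  # queries: speech; keys/values: image
        # Gated residual: the visual contribution vanishes when the gate is near zero.
        return speech_h + torch.tanh(self.gate) * attended

# Illustrative sizes only: d_model stands in for Moshi's hidden width,
# d_image for the PaliGemma2 encoder's feature width.
B, T, n_patches, d_model, d_image = 1, 16, 256, 4096, 1152
speech_hidden = torch.randn(B, T, d_model)
image_features = torch.randn(B, n_patches, d_image)
adapter = CrossAttentionAdapter(d_model, d_image)
print(adapter(speech_hidden, image_features).shape)  # torch.Size([1, 16, 4096])
```

Because only the small adapter is new, the pretrained speech and vision components can stay untouched, which is what keeps the added parameter count around 206M.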

The main functions of MoshiVis

  • Visual Input: MoshiVis accepts image inputs alongside voice interaction. Users can talk with the model about an image's content by voice, for example asking about the scenes, objects, or people it shows.
  • Real-Time Interaction: The model supports real-time voice interaction, so users can converse with it naturally without long processing waits (see the streaming sketch after this list).
  • Multimodal Fusion: MoshiVis combines visual information with the speech stream through a cross-attention mechanism, letting the model process voice and visual inputs simultaneously.
  • Low Latency and Natural Conversation: MoshiVis keeps latency low while handling image and voice information, preserving real-time interaction. It inherits Moshi's natural conversational style and generates fluent voice responses.
  • Multi-backend Adaptability: MoshiVis supports three backends: PyTorch, Rust, and MLX. Users can choose the appropriate backend for deployment based on their needs. The Web UI frontend is recommended for interaction.
  • Accessibility Application: MoshiVis is suitable for accessible AI interfaces, helping visually impaired individuals understand visual scenes through voice interaction.
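The real-time, low-latency behavior comes from processing audio in short fixed-size frames rather than waiting for a full utterance (Moshi's Mimi codec runs at 24 kHz with roughly 80 ms frames). The loop below is a hedged sketch of that pattern; `fake_model_step` and `microphone_frames` are stand-ins, not the real MoshiVis API.

```python
import time
import numpy as np

SAMPLE_RATE = 24_000                       # Mimi codec sample rate
FRAME_MS = 80                              # ~12.5 Hz frame rate => 80 ms frames
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000

def fake_model_step(frame: np.ndarray) -> np.ndarray:
    """Stand-in for one MoshiVis step: consume one audio frame, emit one."""
    return np.zeros_like(frame)

def microphone_frames(n: int):
    """Stand-in for a live microphone, yielding silent 80 ms frames."""
    for _ in range(n):
        yield np.zeros(FRAME_SAMPLES, dtype=np.float32)

# Frame-by-frame loop: every incoming frame is answered immediately,
# so the model can start responding while the user is still speaking.
for frame in microphone_frames(5):
    t0 = time.perf_counter()
    out_frame = fake_model_step(frame)
    print(f"step latency: {(time.perf_counter() - t0) * 1e3:.3f} ms")
```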

The technical principles of MoshiVis

  • Multimodal Fusion Mechanism: MoshiVis integrates a lightweight cross-attention module that injects visual information from the vision encoder into Moshi's speech token stream, letting the model process speech and visual inputs simultaneously. Concretely, the vision encoder extracts image features, which are fused with the speech stream through cross-attention, so the model can understand the image content and generate speech responses relevant to it (the adapter sketch earlier in this article illustrates the pattern).
  • Dynamic Gating Mechanism: To handle transitions between visual inputs and non-visual conversational topics smoothly, MoshiVis introduces a dynamic gating mechanism that adjusts the influence of visual information based on the conversational context: the model draws fully on the visual input when discussing image-related topics and suppresses visual interference elsewhere, which improves the naturalness and fluency of the conversation (a minimal sketch of such a gate follows after this list).
  • Parameter-Efficient Fine-Tuning: MoshiVis adopts a single-stage, parameter-efficient fine-tuning process in which the model is trained on a mixture of image-text and image-speech samples. This lowers training costs, reduces the need for large-scale paired image-speech data, and preserves the prosodic qualities of the underlying speech model, such as the speaker's tone (see the training sketch after this list).
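The dynamic gate can be pictured as a small learned function of the current speech hidden state that scales the visual contribution per token: near zero on off-image topics, near one when the conversation is about the image. The sketch below is an assumed formulation (`DynamicGate`, the sigmoid scoring layer, and all sizes are illustrative), not the published design.

```python
import torch
import torch.nn as nn

class DynamicGate(nn.Module):
    """Hypothetical per-token gate: decides how much visual signal to mix in,
    based on the current speech hidden state (i.e., the conversational context)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.score = nn.Linear(d_model, 1)

    def forward(self, speech_h: torch.Tensor, visual_h: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.score(speech_h))  # (B, T, 1); ~0 ignores the image
        return speech_h + g * visual_h           # context-dependent fusion

B, T, d_model = 1, 16, 4096
gate = DynamicGate(d_model)
fused = gate(torch.randn(B, T, d_model), torch.randn(B, T, d_model))
print(fused.shape)  # torch.Size([1, 16, 4096])
```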
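Parameter-efficient fine-tuning then amounts to freezing both pretrained components and letting gradients flow only through the adapters. Here is a minimal sketch with toy stand-in modules; the `nn.Linear` layers below are placeholders for the real 7B backbone, 400M encoder, and ~206M adapters, not their actual shapes.

```python
import torch
import torch.nn as nn

# Toy stand-ins; the real components are the 7B Moshi backbone,
# the 400M PaliGemma2 encoder, and the ~206M adapter stack.
speech_lm = nn.Linear(4096, 4096)
vision_encoder = nn.Linear(1152, 4096)
adapters = nn.Linear(4096, 4096)

# Freeze the pretrained parts: their weights (and Moshi's prosody) stay intact.
for module in (speech_lm, vision_encoder):
    for p in module.parameters():
        p.requires_grad = False

# Only the adapters are optimized; training batches would mix
# image-text and image-speech samples, per the description above.
optimizer = torch.optim.AdamW(adapters.parameters(), lr=1e-4)

n_trainable = sum(p.numel() for p in adapters.parameters())
n_frozen = sum(p.numel() for m in (speech_lm, vision_encoder) for p in m.parameters())
print(f"trainable params: {n_trainable:,}  frozen params: {n_frozen:,}")
```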

The project address of MoshiVis

  • GitHub Repository: https://github.com/kyutai-labs/moshivis

Application scenarios of MoshiVis

  • Elderly Assistance: For elderly people with impaired vision or limited mobility, MoshiVis can serve as a smart assistant that helps them identify objects, read text, or get information about their surroundings.
  • Smart Home Control: In a smart home, users can issue voice commands so MoshiVis recognizes devices or scenes in the room and triggers the corresponding controls.
  • Visual-Assisted Learning: In education, MoshiVis can help students explore image content through voice interaction, for example identifying plants and animals or historical artifacts.
  • Social Media Interaction: Users can upload pictures and have MoshiVis generate engaging spoken descriptions or comments, making social media more interactive.
  • Industrial Inspection: In industrial settings, MoshiVis can help workers check equipment status and locate faults through voice interaction.