VeOmni – ByteDance’s Open-Source Omni-Modal PyTorch-Native Training Framework

What is VeOmni?

VeOmni is an open-source, omni-modal distributed training framework from ByteDance’s Seed team, built natively on PyTorch. It takes a model-centric approach that decouples distributed parallelism logic from model computation. It supports flexible combinations of parallel strategies, such as fully sharded data parallelism (FSDP), sequence parallelism (SP), and expert parallelism (EP), and scales to ultra-long sequences and large Mixture-of-Experts (MoE) models. Lightweight omni-modal interfaces simplify the integration of multi-modal encoders and decoders, while optimizations such as dynamic batching and efficient operators improve training efficiency and stability. VeOmni has been applied in multiple projects that research and develop omni-modal large models.

Key Features of VeOmni

Support for Omni-Modal Model Training: VeOmni can train models on any modality (text, image, audio, video, etc.), covering tasks from single-modal to omni-modal scenarios.
Efficient Distributed Training: Supports flexible combinations of parallel strategies (FSDP, SP, EP) and scales efficiently across large GPU clusters.
Ultra-Long Sequence Support: Handles sequences of up to 192K tokens, suitable for high-resolution images, long videos, and other complex multi-modal data.
Lightweight Interfaces & Usability: Quick integration of multi-modal encoder-decoders, simplifying model development workflows.
System-Level Optimizations: Integrates dynamic batching, efficient operators, recomputation and memory optimizations, and ByteCheckpoint, enhancing training efficiency and stability.
Training Stability: Demonstrates stable convergence in complex multi-modal tasks, suitable for practical applications.
Flexible Model Extension: Supports various architectures (MoE, Transformer, etc.), allowing customization of model components to meet diverse research and development needs.
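
As a rough illustration of the dynamic batching idea listed under System-Level Optimizations, the sketch below groups variable-length samples so that each batch stays within a token budget. It is a standard-library Python sketch of the general technique, not VeOmni's implementation; the function name pack_by_token_budget and the 8192-token budget are assumptions made for this example.

```python
from typing import List


def pack_by_token_budget(lengths: List[int], max_tokens: int) -> List[List[int]]:
    """Greedily group sample indices so each batch holds at most max_tokens tokens.

    Hypothetical helper for illustration only; not part of VeOmni's API.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    current_tokens = 0
    for idx, length in enumerate(lengths):
        if length > max_tokens:
            raise ValueError(
                f"sample {idx} has {length} tokens, over the budget of {max_tokens}"
            )
        # Close the current batch when the next sample would exceed the budget.
        if current and current_tokens + length > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += length
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    lengths = [3000, 500, 4500, 1200, 7000, 800]  # token counts per sample
    for batch in pack_by_token_budget(lengths, max_tokens=8192):
        print(batch, sum(lengths[i] for i in batch))
```

With these lengths the samples fall into three batches, [0, 1, 2], [3], and [4, 5], holding 8000, 1200, and 7800 tokens. A fixed batch size of three would instead pad every sample to the longest one in its batch, which is the waste that token-budget batching is meant to avoid.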

Technical Principles of VeOmni

Model-System Decoupling: Model definitions are kept separate from distributed training logic, so model code does not depend on the parallel strategy in use. Users configure parallel strategies through high-level APIs without modifying model code.
Distributed Parallel Strategies: FSDP shards model parameters, gradients, and optimizer states across devices, reducing the memory load on each GPU. Sequence parallelism (SP) splits activation tensors along the sequence dimension and optimizes the associated communication to support ultra-long sequences. Expert parallelism (EP) distributes MoE experts across devices to improve training efficiency. A parallel_state module built on PyTorch's DeviceMesh manages n-D parallel strategies, so these strategies can be combined flexibly.
Lightweight Omni-Modal Interface: Uses Hugging Face-style interfaces, so users can quickly integrate multi-modal encoders and decoders by implementing a small set of unified functions (e.g., lm_encode, lm_generate).
System-Level Optimizations: Combines dynamic batching (grouping variable-length samples to reduce padding waste), efficient operators, activation recomputation and other memory optimizations, and ByteCheckpoint for checkpointing to improve training efficiency and stability.
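
To make the n-D parallel layout more concrete, the following sketch shows how ranks can be arranged in a multi-dimensional mesh and which ranks end up communicating along each dimension. It is a pure-Python illustration of the idea behind a device mesh and does not call PyTorch's DeviceMesh or VeOmni's parallel_state. The helper names mesh_coords and groups_along, the row-major rank layout, and the ("fsdp", "sp") dimension names are assumptions chosen for the example.

```python
from typing import Dict, List, Tuple


def mesh_coords(rank: int, shape: Tuple[int, ...]) -> Tuple[int, ...]:
    """Row-major coordinates of a rank inside an n-D mesh of the given shape."""
    coords = []
    for size in reversed(shape):
        coords.append(rank % size)
        rank //= size
    return tuple(reversed(coords))


def groups_along(dim: int, shape: Tuple[int, ...]) -> List[List[int]]:
    """Ranks that form one communication group along mesh dimension `dim`.

    Two ranks share a group when all of their coordinates match except `dim`.
    """
    world_size = 1
    for size in shape:
        world_size *= size
    groups: Dict[Tuple[int, ...], List[int]] = {}
    for rank in range(world_size):
        coords = mesh_coords(rank, shape)
        key = coords[:dim] + coords[dim + 1:]  # drop the varying dimension
        groups.setdefault(key, []).append(rank)
    return list(groups.values())


if __name__ == "__main__":
    # Hypothetical 8-GPU job: 2-way FSDP combined with 4-way sequence parallelism.
    shape, names = (2, 4), ("fsdp", "sp")
    for dim, name in enumerate(names):
        print(name, groups_along(dim, shape))
```

For this 2 x 4 mesh, the "sp" groups are [0, 1, 2, 3] and [4, 5, 6, 7], while the "fsdp" groups are [0, 4], [1, 5], [2, 6], and [3, 7]. Adding a third dimension, for example for expert parallelism, only changes the shape tuple, which is why a mesh abstraction makes it straightforward to combine strategies.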

Project Links

GitHub Repository: https://github.com/ByteDance-Seed/VeOmni
arXiv Paper: https://arxiv.org/pdf/2508.02317

Application Scenarios

Multi-Modal Content Generation: Generate images or videos from text descriptions, or produce textual descriptions for images or videos, widely used in creative design and content creation.
Multi-Modal Understanding & Q&A: Answer visual questions by combining image and text inputs, or handle complex multi-modal question-answering tasks, enhancing intelligent interaction experiences.
Multi-Modal Agents: Develop virtual assistants and multi-modal robots that interact with users and perform tasks using voice, text, and visual information.
Content Creation & Editing: Generate creative design elements from text descriptions, assist content review, and improve efficiency in content creation and editing.
Education & Training: Provide virtual training platforms, enhancing interactivity and effectiveness in educational and training scenarios.