Cosmos-Reason1 – A series of multimodal large language models launched by NVIDIA
What is Cosmos-Reason1?
Cosmos-Reason1 is a series of multimodal large language models introduced by NVIDIA, designed to understand the physical world through physical common sense and embodied reasoning. The series includes two models: Cosmos-Reason1-8B and Cosmos-Reason1-56B. These models perceive the world through visual input and, after long chain-of-thought reasoning, generate natural language responses that include both explanations and embodied decisions (such as the next action to take). Training proceeds in four stages: vision pre-training, general supervised fine-tuning, Physical AI supervised fine-tuning, and Physical AI reinforcement learning. With carefully curated data and reinforcement learning, Cosmos-Reason1 achieves strong performance on physical common sense and embodied reasoning benchmarks.
The main functions of Cosmos-Reason1
- Physical Common Sense Understanding: Understand fundamentals of the physical world, such as space, time, and basic physical laws, and judge whether observed events are physically plausible.
- Embodied Reasoning: Generate reasonable decisions and action plans for embodied agents (e.g., robots, autonomous vehicles) based on physical common sense.
- Chain-of-Thought Reasoning: Produce explicit, step-by-step reasoning traces before answering, improving the transparency and interpretability of decisions.
- Multimodal Input Processing: Support video input, combine visual information with language instructions for reasoning, and generate natural language responses.
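The chain-of-thought behavior above can be illustrated with a small parser. This is a sketch assuming the common `<think>...</think>` / `<answer>...</answer>` tag convention for separating the reasoning trace from the final answer; the exact delimiters Cosmos-Reason1 emits may differ.

```python
import re

def parse_response(text: str) -> dict:
    """Split a model response into its reasoning trace and final answer.

    Assumes <think>...</think> and <answer>...</answer> delimiters; this
    tagging scheme is an assumption, not a confirmed Cosmos-Reason1 API.
    """
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return {
        "reasoning": think.group(1).strip() if think else "",
        # Fall back to the whole text when no answer tag is present.
        "answer": answer.group(1).strip() if answer else text.strip(),
    }

resp = "<think>The ball rolls off the table, so it must fall.</think><answer>B</answer>"
parsed = parse_response(resp)
```

Separating the trace from the answer like this is also what makes rule-based reward checks straightforward during reinforcement learning.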
The Technical Principles of Cosmos-Reason1
- Hierarchical Ontology: Define a hierarchical ontology of physical common sense, covering three main categories: space, time, and fundamental physics, further subdivided into 16 subcategories.
- 2D Ontology: Design a 2D ontology for embodied reasoning, covering four key reasoning abilities for five types of embodied agents.
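The two ontologies above can be sketched as plain data structures. The subcategory, capability, and agent names below are illustrative paraphrases and guesses, not the paper's exact taxonomy; only the counts (16 subcategories, four capabilities, five agent types) come from the text.

```python
# Hierarchical physical common-sense ontology: three main categories
# subdivided into 16 subcategories. Names here are illustrative.
PHYSICAL_COMMON_SENSE = {
    "Space": ["Relationship", "Plausibility", "Affordance", "Environment"],
    "Time": ["Actions", "Order", "Causality", "Camera", "Planning"],
    "Fundamental Physics": [
        "Attributes", "States", "Object Permanence", "Mechanics",
        "Electromagnetism", "Thermodynamics", "Anti-Physics",
    ],
}

# 2D embodied-reasoning ontology: reasoning capabilities crossed with
# embodied agent types. Entries are assumptions for illustration.
CAPABILITIES = [
    "Processing complex sensory inputs",
    "Predicting action effects",
    "Following physical constraints",
    "Planning the next plausible action",
]
AGENTS = ["Human", "Robot arm", "Humanoid robot",
          "Autonomous vehicle", "Quadruped robot"]

total_subcategories = sum(len(v) for v in PHYSICAL_COMMON_SENSE.values())
```

Organizing the taxonomy this way makes it easy to tag training samples and benchmark questions by (category, subcategory) or (capability, agent) cell.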
- Multimodal Architecture: Built on a decoder-only multimodal architecture: the input video is processed by a vision encoder, its features are projected to align with the text token embeddings, and the combined token sequence is fed into the LLM.
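A minimal numeric sketch of that pipeline, using toy dimensions and a single linear projection (the real model uses a trained vision encoder and projector of different sizes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; the real model's sizes differ.
num_vision_tokens, d_vision = 8, 32   # vision encoder output
num_text_tokens, d_model = 5, 64      # LLM embedding space

# Stand-ins for the vision encoder output and the learned projector.
vision_features = rng.normal(size=(num_vision_tokens, d_vision))
W_proj = rng.normal(size=(d_vision, d_model))

# Project vision features into the LLM's token embedding space.
vision_tokens = vision_features @ W_proj

# Text token embeddings (stand-in for the LLM's embedding lookup).
text_tokens = rng.normal(size=(num_text_tokens, d_model))

# The decoder-only LLM consumes the concatenated token sequence.
llm_input = np.concatenate([vision_tokens, text_tokens], axis=0)
```

The key point is that after projection, vision tokens and text tokens share one embedding space, so the decoder attends over them uniformly.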
- The four training stages of the model:
◦ Visual pre-training: Align the visual and text modalities.
◦ General Supervised Fine-Tuning (SFT): Improve the model’s performance on general visual-language tasks.
◦ Physical AI SFT: Enhance the model’s physical common sense and embodied reasoning abilities using specialized data.
◦ Physical AI Reinforcement Learning (RL): Further optimize the model’s reasoning abilities using rule-based rewards.
- Reinforcement Learning: Design a rule-based reward mechanism for multiple-choice questions, improving the model’s performance on physical common sense and embodied reasoning tasks through reinforcement learning.
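The rule-based reward for multiple-choice questions can be sketched as a simple exact-match check. This is a simplified illustration, assuming an `<answer>` tag convention for the final choice; the actual reward rules (e.g., additional format checks) may be richer.

```python
import re

def mcq_reward(response: str, correct_choice: str) -> float:
    """Rule-based reward: 1.0 if the answer extracted from the response
    matches the ground-truth choice letter, else 0.0.

    Simplified sketch; the <answer> tag format is an assumption.
    """
    m = re.search(r"<answer>\s*([A-D])\s*</answer>", response)
    predicted = m.group(1) if m else None
    return 1.0 if predicted == correct_choice else 0.0
```

Because the reward is computed from a deterministic rule rather than a learned reward model, it is cheap to evaluate and cannot be reward-hacked through stylistic tricks, which suits large-scale RL on verifiable questions.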
The project address of Cosmos-Reason1
- Project official website: https://research.nvidia.com/labs/dir/cosmos-reason1/
- GitHub repository: https://github.com/nvidia-cosmos/cosmos-reason1
- arXiv technical paper: https://arxiv.org/pdf/2503.15558
Application scenarios of Cosmos-Reason1
- Robot Operation: Assist robots in understanding task objectives, generating operation plans, and completing complex actions such as grasping and assembly.
- Autonomous Driving: Process road videos, predict traffic dynamics, and generate safe driving decisions, such as evasive maneuvers and lane changes.
- Intelligent Surveillance: Analyze surveillance video in real time to detect abnormal events, such as people falling or equipment malfunctions, and issue timely alerts.
- Virtual Reality (VR)/Augmented Reality (AR): Generate interactive responses based on virtual environment inputs to enhance user immersion.
- Education and Training: Assist teaching and vocational skill training by explaining physical phenomena or operational processes through video.