Cosmos-Reason1 – A series of multimodal large language models launched by NVIDIA
What is Cosmos-Reason1?
Cosmos-Reason1 is a series of multimodal large language models introduced by NVIDIA, designed to understand the physical world through physical common sense and embodied reasoning. The series includes two models: Cosmos-Reason1-8B and Cosmos-Reason1-56B. These models perceive the world through visual input and, after long chain-of-thought reasoning, generate natural language responses that include both explanations and embodied decisions (such as the next action to take). Training proceeds in four stages: vision pre-training, general supervised fine-tuning, Physical AI supervised fine-tuning, and Physical AI reinforcement learning. With carefully curated data and reinforcement learning, Cosmos-Reason1 achieves strong performance on physical common sense and embodied reasoning benchmarks.
The main functions of Cosmos-Reason1
- Physical Common Sense Understanding: Understand fundamentals of the physical world, such as space, time, and basic physical laws, and judge whether observed events are physically plausible.
- Embodied Reasoning: Generate reasonable decisions and action plans for embodied agents (e.g., robots, autonomous vehicles) based on physical common sense.
- Chain-of-Thought Reasoning: Produce explicit, step-by-step reasoning traces before answering, improving the transparency and interpretability of decisions.
- Multimodal Input Processing: Support video input, combine visual information with language instructions for reasoning, and generate natural language responses.
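The chain-of-thought behavior above can be illustrated with a small parser. This is a sketch assuming the common `<think>...</think>` / `<answer>...</answer>` tag convention for separating the reasoning trace from the final answer; the exact delimiters Cosmos-Reason1 emits may differ.

```python
import re

def parse_response(text: str) -> dict:
    """Split a model response into its reasoning trace and final answer.

    Assumes <think>...</think> and <answer>...</answer> delimiters; this
    tagging scheme is an assumption, not a confirmed Cosmos-Reason1 API.
    """
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return {
        "reasoning": think.group(1).strip() if think else "",
        # Fall back to the whole text when no answer tag is present.
        "answer": answer.group(1).strip() if answer else text.strip(),
    }

resp = "<think>The ball rolls off the table, so it must fall.</think><answer>B</answer>"
parsed = parse_response(resp)
```

Separating the trace from the answer like this is also what makes rule-based reward checks straightforward during reinforcement learning.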
The Technical Principles of Cosmos-Reason1
- Hierarchical Ontology: Define a hierarchical ontology of physical common sense, covering three main categories: space, time, and fundamental physics, further subdivided into 16 subcategories.
- 2D Ontology: Design a 2D ontology for embodied reasoning, covering four key reasoning abilities for five types of embodied agents.
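The two ontologies above can be sketched as plain data structures. The subcategory, capability, and agent names below are illustrative paraphrases and guesses, not the paper's exact taxonomy; only the counts (16 subcategories, four capabilities, five agent types) come from the text.

```python
# Hierarchical physical common-sense ontology: three main categories
# subdivided into 16 subcategories. Names here are illustrative.
PHYSICAL_COMMON_SENSE = {
    "Space": ["Relationship", "Plausibility", "Affordance", "Environment"],
    "Time": ["Actions", "Order", "Causality", "Camera", "Planning"],
    "Fundamental Physics": [
        "Attributes", "States", "Object Permanence", "Mechanics",
        "Electromagnetism", "Thermodynamics", "Anti-Physics",
    ],
}

# 2D embodied-reasoning ontology: reasoning capabilities crossed with
# embodied agent types. Entries are assumptions for illustration.
CAPABILITIES = [
    "Processing complex sensory inputs",
    "Predicting action effects",
    "Following physical constraints",
    "Planning the next plausible action",
]
AGENTS = ["Human", "Robot arm", "Humanoid robot",
          "Autonomous vehicle", "Quadruped robot"]

total_subcategories = sum(len(v) for v in PHYSICAL_COMMON_SENSE.values())
```

Organizing the taxonomy this way makes it easy to tag training samples and benchmark questions by (category, subcategory) or (capability, agent) cell.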
- Multimodal Architecture: Built on a decoder-only multimodal architecture: the input video is processed by a vision encoder, its features are projected to align with the text token embeddings, and the combined token sequence is fed into the LLM.
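A minimal numeric sketch of that pipeline, using toy dimensions and a single linear projection (the real model uses a trained vision encoder and projector of different sizes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; the real model's sizes differ.
num_vision_tokens, d_vision = 8, 32   # vision encoder output
num_text_tokens, d_model = 5, 64      # LLM embedding space

# Stand-ins for the vision encoder output and the learned projector.
vision_features = rng.normal(size=(num_vision_tokens, d_vision))
W_proj = rng.normal(size=(d_vision, d_model))

# Project vision features into the LLM's token embedding space.
vision_tokens = vision_features @ W_proj

# Text token embeddings (stand-in for the LLM's embedding lookup).
text_tokens = rng.normal(size=(num_text_tokens, d_model))

# The decoder-only LLM consumes the concatenated token sequence.
llm_input = np.concatenate([vision_tokens, text_tokens], axis=0)
```

The key point is that after projection, vision tokens and text tokens share one embedding space, so the decoder attends over them uniformly.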
- The four training stages of the model:
◦ Visual pre-training: Align the visual and text modalities.
◦ General Supervised Fine-Tuning (SFT): Improve the model’s performance on general visual-language tasks.
◦ Physical AI SFT: Enhance the model’s physical common sense and embodied reasoning abilities using specialized data.
◦ Physical AI Reinforcement Learning (RL): Further optimize the model’s reasoning abilities using rule-based rewards.
- Reinforcement Learning: Design a rule-based reward mechanism for multiple-choice questions, improving the model’s performance on physical common sense and embodied reasoning tasks through reinforcement learning.
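The rule-based reward for multiple-choice questions can be sketched as a simple exact-match check. This is a simplified illustration, assuming an `<answer>` tag convention for the final choice; the actual reward rules (e.g., additional format checks) may be richer.

```python
import re

def mcq_reward(response: str, correct_choice: str) -> float:
    """Rule-based reward: 1.0 if the answer extracted from the response
    matches the ground-truth choice letter, else 0.0.

    Simplified sketch; the <answer> tag format is an assumption.
    """
    m = re.search(r"<answer>\s*([A-D])\s*</answer>", response)
    predicted = m.group(1) if m else None
    return 1.0 if predicted == correct_choice else 0.0
```

Because the reward is computed from a deterministic rule rather than a learned reward model, it is cheap to evaluate and cannot be reward-hacked through stylistic tricks, which suits large-scale RL on verifiable questions.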
The project address of Cosmos-Reason1
- Project official website: https://research.nvidia.com/labs/dir/cosmos-reason1/
- GitHub repository: https://github.com/nvidia-cosmos/cosmos-reason1
- arXiv technical paper: https://arxiv.org/pdf/2503.15558
Application scenarios of Cosmos-Reason1
- Robot Operation: Assist robots in understanding task objectives, generating operation plans, and completing complex actions such as grasping and assembly.
- Autonomous Driving: Process road videos, predict traffic dynamics, and generate safe driving decisions, such as evasive maneuvers and lane changes.
- Intelligent Surveillance: Analyze surveillance video in real time to detect abnormal events, such as people falling or equipment malfunctions, and issue timely alerts.
- Virtual Reality (VR)/Augmented Reality (AR): Generate interactive responses based on virtual environment inputs to enhance user immersion.
- Education and Training: Assist teaching and vocational skill training by explaining physical phenomena or operational processes through video.