Cosmos-Reason1 – A series of multimodal large language models launched by NVIDIA

AI Tools posted 4w ago dongdong

What is Cosmos-Reason1?

Cosmos-Reason1 is a series of multimodal large language models from NVIDIA, designed to understand the physical world through physical common sense and embodied reasoning. The series includes two models: Cosmos-Reason1-8B and Cosmos-Reason1-56B. The models perceive the world through visual input, reason in a long chain of thought, and then respond in natural language. Their responses include both explanatory insights and embodied decisions (such as the next action to take). Training proceeds in four stages: visual pre-training, general supervised fine-tuning, physical AI fine-tuning, and reinforcement learning. With carefully curated data and reinforcement learning, Cosmos-Reason1 performs strongly on physical common sense and embodied reasoning benchmarks.

The main functions of Cosmos-Reason1

  • Physical Common Sense Understanding: Understand basic knowledge of the physical world, such as space, time, and fundamental physical laws, and use it to judge whether an event is physically plausible.
  • Embodied Reasoning: Generate reasonable decisions and action plans for embodied agents (e.g., robots, autonomous vehicles) based on physical common sense.
  • Chain-of-Thought Reasoning: Produce a detailed, step-by-step reasoning trace before the final answer, which makes decisions more transparent and interpretable.
  • Multimodal Input Processing: Accept video input, combine the visual information with language instructions for reasoning, and generate natural language responses.

The technical principles of Cosmos-Reason1

  • Hierarchical Ontology: Define a hierarchical ontology of physical common sense with three main categories (space, time, and fundamental physics), further subdivided into 16 subcategories.
  • 2D Ontology: Define a two-dimensional ontology for embodied reasoning that crosses four key reasoning abilities with five types of embodied agents.
  • Multimodal Architecture: Use a decoder-only multimodal architecture in which a visual encoder processes the input video, the visual features are aligned with the text token embeddings, and the result is fed into the LLM.
  • The four training stages of the model:
    ◦ Visual pre-training: Align the visual and text modalities.
    ◦ General Supervised Fine-Tuning (SFT): Improve the model’s performance on general visual-language tasks.
    ◦ Physical AI SFT: Enhance the model’s physical common sense and embodied reasoning abilities using specialized data.
    ◦ Physical AI Reinforcement Learning (RL): Further optimize the model’s reasoning abilities based on rule-based rewards.
  • Reinforcement Learning: Use a rule-based reward computed on multiple-choice questions, where an answer can be checked automatically against the correct option, to improve the model’s performance on physical common sense and embodied reasoning tasks.
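
The rule-based reward for multiple-choice questions can be sketched roughly as follows. This is a minimal illustration, not NVIDIA's implementation: the <think>/<answer> response layout, the function names, the choice labels A to D, and the split into a format score and an accuracy score are all assumptions made for the example.

```python
import re
from typing import Optional

# Assumed response layout (hypothetical): the reasoning trace sits in
# <think>...</think> and the final choice in <answer>...</answer>.
RESPONSE_PATTERN = re.compile(
    r"\s*<think>(?P<reasoning>.+?)</think>\s*<answer>(?P<answer>.+?)</answer>\s*",
    re.DOTALL,
)

# A choice label such as "B", "(b)" or "B." at the start of the answer text.
# The lookahead rejects ordinary words like "Because".
CHOICE_PATTERN = re.compile(r"\(?([A-Da-d])\)?(?![A-Za-z])")


def extract_choice(response: str) -> Optional[str]:
    """Return the selected option letter, or None if it cannot be parsed."""
    match = RESPONSE_PATTERN.fullmatch(response)
    if match is None:
        return None
    choice = CHOICE_PATTERN.match(match.group("answer").strip())
    return choice.group(1).upper() if choice else None


def format_reward(response: str) -> float:
    """1.0 if the response follows the assumed layout, else 0.0."""
    return 1.0 if RESPONSE_PATTERN.fullmatch(response) else 0.0


def accuracy_reward(response: str, correct_choice: str) -> float:
    """1.0 if the parsed option equals the correct option, else 0.0."""
    return 1.0 if extract_choice(response) == correct_choice.upper() else 0.0


if __name__ == "__main__":
    response = "<think>The cup is unsupported, so it falls.</think><answer>B</answer>"
    print(format_reward(response), accuracy_reward(response, "B"))
```

Because both scores come from string rules rather than a learned reward model, every multiple-choice sample can be graded automatically. How the two scores are weighted or combined during training is not specified here.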

Application scenarios of Cosmos-Reason1

  • Robot Operation: Assist robots in understanding task objectives, generating operation plans, and completing complex actions such as grasping and assembly.
  • Autonomous Driving: Process road video, predict traffic dynamics, and generate safe driving decisions, such as evasive maneuvers and lane changes.
  • Intelligent Surveillance: Monitor video in real time for abnormal events, such as people falling or equipment malfunctions, and issue timely alerts.
  • Virtual Reality (VR)/Augmented Reality (AR): Generate interactive responses based on virtual environment inputs to enhance user immersion.
  • Education and Training: Assist teaching and vocational skill training by explaining physical phenomena or operational processes through video.