Embodied Reasoner – An embodied interactive reasoning model jointly launched by Zhejiang University and institutions including Alibaba
What is Embodied Reasoner?
Embodied Reasoner is a novel embodied interactive reasoning model developed jointly by Zhejiang University, the Institute of Software at the Chinese Academy of Sciences, and Alibaba Group. It completes complex tasks by coordinating visual search, reasoning, and action. The model is trained with a three-stage method (imitation learning, self-exploration, and self-correction) and generates diverse thought processes, such as situational analysis, spatial reasoning, and self-reflection, allowing it to plan and reason efficiently over its interaction history and the spatial layout of the scene.
Across a range of tasks in the AI2-THOR simulator, Embodied Reasoner significantly outperforms existing visual reasoning models; it is especially strong on long-horizon tasks, with less redundant searching and fewer logical inconsistencies.
Key Capabilities of Embodied Reasoner
- Visual Search and Object Localization: Searches for hidden or visible objects in complex environments and locates targets according to task requirements.
- Reasoning and Planning: Generates diverse thought processes (e.g., situational analysis, spatial reasoning, and self-reflection) to formulate efficient action strategies.
- Action Execution: Executes actions based on reasoning results, such as navigation, grasping, and placement, to complete tasks (see the loop sketch after this list).
- Self-Correction and Learning: Uses reflection and correction mechanisms to avoid redundant search and logical inconsistency, improving task success rates.
- Complex Task Handling: Specializes in long-horizon, multi-step composite tasks.
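Concretely, these capabilities play out as an observe-think-act loop: the model looks at the current camera frame, produces a thought, and emits the next action. The sketch below illustrates that loop under stated assumptions: the `EmbodiedReasonerModel` class and its `think_and_act` method are hypothetical stand-ins for the released model, while `Controller`, `step`, and `last_event` are AI2-THOR's actual Python interface.

```python
# Minimal observe-think-act loop in AI2-THOR. The model wrapper below is a
# hypothetical placeholder; only the ai2thor calls are real API.
from ai2thor.controller import Controller

class EmbodiedReasonerModel:
    """Hypothetical interface: (instruction, history, image) -> (thought, action)."""
    def think_and_act(self, instruction, history, frame):
        # The real model emits a long thought (situation analysis, spatial
        # reasoning, reflection) plus one simulator action per step.
        raise NotImplementedError

def run_episode(model, instruction, max_steps=30):
    controller = Controller(scene="FloorPlan1")  # an AI2-THOR kitchen scene
    event = controller.last_event                # initial observation
    history = []
    for _ in range(max_steps):
        thought, action = model.think_and_act(instruction, history, event.frame)
        event = controller.step(**action)        # e.g. {"action": "PickupObject", "objectId": "..."}
        history.append((thought, action, event.metadata["lastActionSuccess"]))
        if action.get("action") == "Done":       # model signals task completion
            break
    controller.stop()
    return history
```

Keeping the full (thought, action, outcome) history in the loop is what lets the model reason over past interactions and reflect on failed actions instead of repeating them.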
Technical Foundations of Embodied Reasoner
- Data Engine: Automatically generates task instructions and matching "observation–thinking–action" trajectories from task templates and scene metadata, incorporating rich thought processes and interaction images.
- Three-Stage Training:
  - Imitation Learning: Fine-tunes on the synthesized trajectories to learn basic interaction skills.
  - Self-Exploration (Rejection Sampling): Strengthens exploration by sampling candidate trajectories and keeping only those evaluated as successful (see the sketch after this list).
  - Self-Correction (Reflective Adjustment): Introduces abnormal states and reflective corrections to improve adaptability.
- Multimodal Interaction: Combines visual inputs (images) with language outputs (thoughts and actions) to interact with the environment and execute tasks efficiently.
- Reasoning Mechanism: Simulates human reasoning by generating long sequences of thought, improving performance on complex tasks.
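To make the data engine and the self-exploration stage more concrete, here is a minimal sketch of rejection sampling over synthesized trajectories. The `Step`/`Trajectory` schema and the `sample_trajectory` helper are assumptions made for illustration; the project's actual pipeline and evaluator may differ.

```python
# Illustrative sketch of self-exploration via rejection sampling; the
# trajectory schema and helpers are hypothetical, not the project's code.
import random
from dataclasses import dataclass, field

@dataclass
class Step:
    observation: str  # interaction image (path or ID)
    thought: str      # situational analysis, spatial reasoning, reflection, ...
    action: str       # e.g. "navigate to fridge", "open fridge"

@dataclass
class Trajectory:
    instruction: str
    steps: list[Step] = field(default_factory=list)
    success: bool = False

def sample_trajectory(model, instruction: str) -> Trajectory:
    """Hypothetical stand-in for one simulator rollout; faked here so the
    sketch runs end to end."""
    steps = [Step("obs_0.png", "I should scan the room first.", "rotate right")]
    return Trajectory(instruction, steps, success=random.random() < 0.3)

def rejection_sample(model, instructions, k: int = 8) -> list[Trajectory]:
    """Roll out k candidates per instruction and keep only successful ones,
    which then serve as additional fine-tuning data."""
    kept = []
    for instr in instructions:
        candidates = [sample_trajectory(model, instr) for _ in range(k)]
        kept.extend(t for t in candidates if t.success)
    return kept

if __name__ == "__main__":
    data = rejection_sample(model=None, instructions=["put the apple in the fridge"])
    print(f"kept {len(data)} successful trajectories")
```

In this scheme, later fine-tuning only sees rollouts whose reasoning actually led to task success, which is what distinguishes the self-exploration stage from plain imitation learning.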
Project Links
- Official Website: https://embodied-reasoner.github.io/
- GitHub Repository: https://github.com/zwq2018/embodied_reasoner
- HuggingFace Dataset: https://huggingface.co/datasets/zwq2018/embodied_reasoner
- arXiv Paper: https://arxiv.org/pdf/2503.21696
Application Scenarios
- Smart Homes: Helps users locate items and operate appliances at home.
- Warehousing and Logistics: Automatically locates and transports goods in warehouses, optimizing inventory management.
- Medical Assistance: Helps medical staff find and organize supplies in hospitals or nursing homes.
- Industrial Automation: Performs complex operational tasks in factories, such as parts handling and equipment maintenance.
- Education and Research: Serves as a teaching tool for task planning and as a research platform for human-computer interaction and robotic intelligence.