InternVLA·N1 – An open-source, end-to-end, dual-system large navigation model from Shanghai AI Lab
What is InternVLA·N1?
InternVLA·N1 is an open-source, end-to-end, dual-system large navigation model developed by Shanghai Artificial Intelligence Laboratory. It splits navigation between two systems: System 2 understands language instructions and plans long-range paths, while System 1 handles high-frequency responses and agile obstacle avoidance. The model is trained entirely on synthetic data, drawing on large-scale digital scene assets and massive multimodal corpora to keep training low-cost and efficient. On multiple mainstream benchmarks, InternVLA·N1 reports leading scores and strong zero-shot generalization. In the real world, it can follow instructions over long distances across buildings and avoid obstacles agilely in dense environments.
Key Features
- Language Understanding & Path Planning: System 2 interprets natural-language instructions and, based on visual observations, predicts the next target pixel in the image, which supports long-range spatial reasoning and planning.
- Agile Obstacle Avoidance & Execution: System 1 responds to the environment at high frequency to avoid obstacles agilely, ensuring accurate arrival at target locations.
- Synthetic Data-Driven Training: The model is trained entirely on synthetic data, combining large-scale digital scene assets with massive multimodal corpora for low-cost, efficient training.
- Zero-shot Generalization: Despite being trained only on synthetic data, the model achieves 60 Hz instruction-following navigation across buildings and agile obstacle avoidance in dense real-world environments.
- Multi-Scenario Adaptability: The model achieves top scores on multiple mainstream benchmarks, which suggests it applies to diverse and complex scenarios.
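To make the pixel-goal idea concrete, here is a minimal Python sketch of one way a predicted target pixel could be turned into a 3D point that a local planner can steer toward. The source does not describe this step. The pinhole back-projection, the `Intrinsics` class, the function name, and every numeric value below are assumptions chosen for illustration, not InternVLA·N1's actual interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera intrinsics in pixels (hypothetical values are used below)."""
    fx: float
    fy: float
    cx: float
    cy: float


def pixel_goal_to_camera_point(u, v, depth_m, k):
    """Back-project pixel (u, v) with a depth in metres into the camera frame.

    Returns (x, y, z) with x to the right, y down, and z forward,
    following the standard pinhole convention.
    """
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    x = (u - k.cx) * depth_m / k.fx
    y = (v - k.cy) * depth_m / k.fy
    return (x, y, depth_m)


# Assumed 640x480 camera and an assumed goal pixel 100 px right of centre, 2 m away.
K = Intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
goal = pixel_goal_to_camera_point(420.0, 240.0, 2.0, K)
print(goal)  # (0.4, 0.0, 2.0)
```

In this sketch, a goal pixel to the right of the image centre maps to a point 0.4 m to the right and 2 m ahead of the camera, which is the kind of short-horizon target a fast controller could track.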
Technical Principles of InternVLA·N1
- Dual-System Architecture: System 2 focuses on language understanding and long-range spatial reasoning and planning, while System 1 specializes in high-frequency responses and agile obstacle avoidance. Together, they enable efficient navigation.
- Asynchronous Inference Mechanism: System 1 and System 2 run asynchronously. System 1 reacts at high frequency to environmental changes for obstacle avoidance, while System 2 concentrates on long-range reasoning and planning. Decoupling the two reduces latency and complexity.
- Fully Synthetic Data Training: Training uses only synthetic data, drawing on large-scale digital scenes and multimodal corpora together with efficient data synthesis techniques to keep training cost-effective.
- Two-Stage Curriculum Training: Training has two phases. In the pre-training phase, System 2 is trained with supervised fine-tuning for accurate path planning. In the joint-tuning phase, System 1 and System 2 are tuned together to optimize overall navigation performance.
- Multimodal Fusion: The model fuses visual and language information through large multimodal models, improving its ability to understand complex environments and execute navigation tasks accurately in real-world scenarios.
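The asynchronous split described above can be illustrated with a small Python sketch: a slow planner thread publishes goals at a low rate, while a fast control loop acts on every tick using whichever goal is newest. This is only a toy under stated assumptions. The class and function names, the placeholder pixel goals, and the timing values are all hypothetical, and nothing here reflects the actual InternVLA·N1 code.

```python
import threading
import time


class LatestGoal:
    """Thread-safe slot that keeps only the most recent goal from the planner."""

    def __init__(self):
        self._lock = threading.Lock()
        self._goal = None
        self.updates = 0

    def publish(self, goal):
        with self._lock:
            self._goal = goal
            self.updates += 1

    def read(self):
        with self._lock:
            return self._goal


def slow_planner(slot, stop, period_s):
    """Stand-in for System 2: emits a new placeholder pixel goal at a low rate."""
    step = 0
    while not stop.is_set():
        slot.publish((320 + step, 240))  # placeholder for a model prediction
        step += 1
        stop.wait(period_s)  # sleeps, but wakes early when asked to stop


def fast_controller(slot, steps, period_s):
    """Stand-in for System 1: acts on every tick using the newest goal."""
    actions = []
    for _ in range(steps):
        goal = slot.read()
        actions.append(("hold", None) if goal is None else ("track", goal))
        time.sleep(period_s)
    return actions


def run_demo(steps=40, plan_period_s=0.05, control_period_s=0.005):
    slot, stop = LatestGoal(), threading.Event()
    planner = threading.Thread(target=slow_planner, args=(slot, stop, plan_period_s))
    planner.start()
    try:
        actions = fast_controller(slot, steps, control_period_s)
    finally:
        stop.set()
        planner.join()
    return actions, slot.updates


actions, plan_updates = run_demo()
print(len(actions), "control steps,", plan_updates, "plan updates")
```

The design point is that the control loop never waits for the planner: it reads the latest published goal and keeps acting, so a slow planning step does not stall obstacle avoidance.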
Project Resources
- Official Website: https://internrobotics.github.io/internvla-n1.github.io/
- GitHub Repository: https://github.com/InternRobotics/InternNav
- Hugging Face Models: https://huggingface.co/InternRobotics/InternVLA-N1
- Technical Paper: PDF Link
Application Scenarios of InternVLA·N1
- Intelligent Robot Navigation: Provides service and logistics robots with efficient navigation, enabling them to follow voice commands, move autonomously, avoid obstacles, and complete tasks in complex environments.
- Autonomous Driving Assistance: Assists vehicles with path planning and obstacle avoidance to enhance safety and reliability in autonomous driving systems.
- Virtual & Augmented Reality: Enhances VR and AR applications by enabling natural, immersive interactions such as navigating virtual environments via voice commands.
- Smart Security Patrols: Supports intelligent surveillance with vision-language fusion, enabling patrols and rapid response to anomalies.
- Industrial Automation: Provides navigation and operational guidance for automated equipment, improving production efficiency and workplace safety.
- Smart Tour Guide Services: Delivers personalized navigation and guided explanations in museums, exhibitions, and similar venues, enriching visitor experiences.