What is NEO?
NEO is a new multimodal model architecture jointly developed by SenseTime and Nanyang Technological University. Presented as the first Native Vision-Language Model (Native VLM), NEO replaces the modular design of traditional multimodal models with a single architecture built for images and text from the ground up. Its core innovations include:
- Native Patch Embedding, which captures image details more precisely;
- Native 3D Rotary Position Embedding (Native-RoPE), designed to naturally align with the inherent structures of images and text;
- Native Multi-Head Attention, which enhances the model’s ability to understand complex cross-modal relationships.
NEO delivers outstanding data efficiency, performance, and inference cost-effectiveness. It achieves top-tier visual perception with significantly less training data and attains excellent results across multiple authoritative benchmarks. SenseTime has open-sourced 2B and 9B versions of NEO, accelerating the industrial adoption of native multimodal technology and helping define the next-generation multimodal standard.

NEO — Main Capabilities
Native multimodal integration:
NEO deeply fuses image and text modalities at the architectural level, breaking free from modular constraints found in traditional multimodal systems and enabling more natural handling of mixed image-text content.
Efficient data utilization:
NEO achieves top-tier visual perception with only a modest amount of data (e.g., 390 million image-text pairs), significantly boosting data efficiency and reducing training costs.
Outstanding performance:
Across numerous authoritative benchmarks, NEO demonstrates excellent performance in image understanding, text generation, and image-text reasoning, consistently producing high-quality outputs.
Highly cost-effective inference:
Especially in small- and medium-scale configurations (0.6B–8B parameters), NEO delivers strong edge-side deployment capabilities and efficient inference, making it suitable for a wide range of real-world applications.
Open-source collaboration and extensibility:
SenseTime has open-sourced 2B and 9B versions of NEO, encouraging developers and researchers to extend work based on this architecture and accelerating the industrialization of multimodal technology.
NEO — Technical Principles
Native Patch Embedding:
Maps image pixels into the model through a bottom-up continuous embedding process, avoiding discrete tokenization and capturing image details more accurately for improved visual modeling.
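To make the idea of a continuous (non-discrete) patch embedding concrete, here is a minimal NumPy sketch. It cuts an image into non-overlapping patches and maps each one to a vector with a linear projection, so no codebook lookup is involved. The patch size, embedding width, random weights, and the function name `patch_embed` are illustrative assumptions; the text above does not specify NEO's actual embedding layers.

```python
import numpy as np

def patch_embed(image, patch_size, weight, bias):
    """Project non-overlapping patches of an (H, W, C) image to continuous vectors."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, -1)
    # A linear map keeps the embedding continuous: no quantization step.
    return patches @ weight + bias

rng = np.random.default_rng(0)
patch_size, embed_dim = 16, 64           # hypothetical sizes
image = rng.random((32, 48, 3), dtype=np.float32)
weight = (rng.standard_normal((patch_size * patch_size * 3, embed_dim)) * 0.02).astype(np.float32)
bias = np.zeros(embed_dim, dtype=np.float32)

tokens = patch_embed(image, patch_size, weight, bias)
print(tokens.shape)  # (6, 64): a 2 x 3 grid of patches, one vector each
```

Each row of `tokens` is a real-valued vector that a transformer can consume directly alongside text embeddings.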
Native 3D Rotary Position Embedding (Native-RoPE):
Decouples the 3D spatial-temporal frequency allocation of images and text, assigning high-frequency encodings to images and low-frequency encodings to text, so that the position encoding better matches the natural structure of each modality and supports complex spatial reasoning.
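The following NumPy sketch shows one way a rotary embedding can be split across three position axes, each with its own frequency range. It is not NEO's implementation. The axis names (`t`, `h`, `w`), the channel split in `pairs`, the `bases` values, and the indexing scheme (text tokens advance along `t`, while tokens of one image share a `t` value and differ in `h` and `w`) are all assumptions made for illustration.

```python
import numpy as np

def axis_angles(pos, n_pairs, base):
    """Rotation angles for one axis: pos * base**(-k / n_pairs), k = 0..n_pairs-1."""
    inv_freq = base ** (-np.arange(n_pairs) / n_pairs)
    return pos[:, None] * inv_freq[None, :]

def rope_3d(x, t, h, w, pairs=(4, 2, 2), bases=(1_000_000.0, 10_000.0, 10_000.0)):
    """Rotate channel pairs of x, shape (n_tokens, 2 * sum(pairs)).

    Each axis owns its own group of channel pairs and its own base
    (placeholder values), so the axes' frequencies are decoupled.
    """
    angles = np.concatenate(
        [axis_angles(p, n, b) for p, n, b in zip((t, h, w), pairs, bases)], axis=1
    )
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Two text tokens, a 2 x 2 image, then one more text token (hypothetical indexing).
t = np.array([0, 1, 2, 2, 2, 2, 3], dtype=float)
h = np.array([0, 0, 0, 0, 1, 1, 0], dtype=float)
w = np.array([0, 0, 0, 1, 0, 1, 0], dtype=float)

rng = np.random.default_rng(0)
x = rng.standard_normal((7, 16))
q = rope_3d(x, t, h, w)
print(q.shape)
```

Because every channel pair is only rotated, vector norms are preserved, and the dot product between two rotated vectors depends only on the difference of their positions along each axis. That is the property attention relies on when it uses rotary embeddings.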
Native Multi-Head Attention:
Implements both autoregressive text attention and bidirectional visual attention within a unified framework, enhancing the model’s understanding of intricate image-text relationships and supporting complex multimodal reasoning tasks.
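One way to picture causal text attention and bidirectional visual attention in a single framework is as one attention mask. In the sketch below, text tokens see only earlier tokens, while tokens that belong to the same image see each other in both directions. The helper names, the `image_id` convention (`-1` for text), and the single-head simplification are assumptions for illustration, not details taken from NEO.

```python
import numpy as np

def mixed_attention_mask(image_id):
    """image_id[i] is -1 for a text token, or k >= 0 for a token of image k.

    Returns a boolean (n, n) array; mask[i, j] is True if token i may attend to j.
    """
    image_id = np.asarray(image_id)
    n = len(image_id)
    causal = np.tril(np.ones((n, n), dtype=bool))            # autoregressive part
    same_image = (image_id[:, None] == image_id[None, :]) & (image_id[:, None] >= 0)
    return causal | same_image                                # bidirectional inside an image

def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention restricted by a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Two text tokens, three tokens of image 0, then one text token.
image_id = [-1, -1, 0, 0, 0, -1]
mask = mixed_attention_mask(image_id)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((6, 8)) for _ in range(3))
out = masked_attention(q, k, v, mask)
print(mask.astype(int))
```

In the printed mask, rows 2 to 4 (the image tokens) are identical: each image token attends to the preceding text and to every token of its own image, but not to the text that follows.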
Bottom-up architectural innovation:
NEO achieves deep multimodal fusion at the architectural level rather than through modular stacking, fundamentally overcoming the performance bottlenecks of traditional multimodal models and improving overall model efficiency.
Efficient training and inference:
With optimized architecture design, NEO delivers higher efficiency during both training and inference. In small- and medium-parameter settings, it achieves lower computation costs and faster inference speeds, making it suitable for broad deployment.
NEO — Application Scenarios
Image and text generation:
NEO can generate high-quality images from text prompts or produce accurate textual descriptions from image content, supporting creative design, content creation, and more.
Intelligent search and recommendation:
By understanding deep semantic relationships across modalities, NEO enables more precise search results and personalized recommendations.
Multimodal question answering:
NEO is capable of answering questions involving both images and text, making it suitable for education, customer service, and other fields.
Autonomous driving and robotic vision:
Its advanced image understanding can support scene perception, object detection, and environment understanding for intelligent vehicles and robots.
Industrial inspection and monitoring:
NEO can quickly and accurately detect anomalies and defects in images, making it useful for quality control and industrial monitoring systems.
Medical image analysis:
NEO can assist doctors by analyzing medical images and integrating textual patient records to provide more comprehensive diagnostic recommendations.