FG-CLIP 2 – 360’s Open-Source Bilingual Fine-Grained Vision-Language Alignment Model
What is FG-CLIP 2?
FG-CLIP 2 is an open-source bilingual fine-grained vision-language alignment model developed by 360, designed for precise alignment between visual and linguistic information in both Chinese and English. The model adopts a hierarchical alignment architecture that progressively refines its understanding of image details through global semantic alignment followed by fine-grained vision-language learning. It also introduces a dynamic attention mechanism that focuses on key regions within images, enabling better handling of complex vision-language tasks. Across a broad set of public benchmarks, FG-CLIP 2 outperforms strong baselines such as Google’s SigLIP 2 and Meta’s MetaCLIP 2, placing it among the leading vision-language models.
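Since FG-CLIP 2 is a dual-tower, CLIP-style model, the typical usage pattern is to encode images and texts separately and compare the resulting embeddings. The sketch below illustrates that pattern with Hugging Face transformers; the checkpoint name qihoo360/fg-clip2-base and the trust_remote_code loading path are assumptions for illustration, so check the GitHub repository below for the released interface.

```python
# Minimal sketch: bilingual image-text matching with a CLIP-style dual tower.
# The checkpoint name and loading path are assumptions, not confirmed API;
# see https://github.com/360CVGroup/FG-CLIP for the official usage.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "qihoo360/fg-clip2-base"  # hypothetical checkpoint name
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("cat.jpg")
texts = ["一只在沙发上睡觉的橘猫",          # Chinese: an orange cat sleeping on a sofa
         "a black dog running on grass"]

inputs = processor(images=image, text=texts, return_tensors="pt", padding=True)
with torch.no_grad():
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])

# Cosine similarity between the image and each caption (Chinese and English).
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
print((img_emb @ txt_emb.T).squeeze(0))
```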

Main Features of FG-CLIP 2
- Fine-Grained Vision-Language Understanding: Accurately understands detailed visual information, including object attributes and spatial relationships, addressing the limitations of traditional models in fine-grained recognition.
- Bilingual Support: Excels in both Chinese and English tasks, achieving true native bilingual support.
- Hierarchical Alignment Architecture: Captures both macro-level scenes and micro-level details simultaneously, enhancing the model’s ability to interpret image nuances.
- Dynamic Attention Mechanism: Intelligently focuses on key regions in images for improved performance on complex vision-language tasks.
- Optimized Bilingual Collaboration Strategy: Addresses the imbalance between Chinese and English understanding, improving overall performance across bilingual tasks.
- Powerful Performance: Outperforms Google’s SigLIP 2 and Meta’s MetaCLIP 2 across 29 authoritative public benchmark datasets, ranking among the top global models.
- High-Concurrency Response Speed: Uses an explicit dual-tower architecture in which image and text features can be pre-computed and cached, achieving millisecond-level responses in high-concurrency scenarios (see the caching sketch after this list).
- Adaptive Input Resolution: A dynamic resolution mechanism lets the model flexibly process inputs of varying sizes, improving adaptability and robustness.
- Rich Open-Source Resources: Offers code, model weights, and detailed training datasets, greatly benefiting researchers and developers.
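The high-concurrency claim follows directly from the dual-tower design: gallery embeddings can be computed once offline, so each query costs only one forward pass plus a matrix multiply. Below is a minimal sketch of this precompute-and-cache pattern; encode_image and encode_text are placeholders for whatever encoding calls the released model exposes.

```python
# Sketch of the precompute-and-cache retrieval pattern of a dual-tower model.
# `encode_image` / `encode_text` are placeholder callables returning 1-D
# numpy embeddings; substitute the model's actual encoding functions.
import numpy as np

def build_image_index(images, encode_image):
    """Encode the image gallery once, offline; store L2-normalized embeddings."""
    embs = np.stack([encode_image(img) for img in images])
    return embs / np.linalg.norm(embs, axis=1, keepdims=True)

def search(query_text, index, encode_text, top_k=5):
    """At query time only the text tower runs: one forward pass + one matmul."""
    q = encode_text(query_text)
    q = q / np.linalg.norm(q)
    scores = index @ q                   # cosine similarity against the cache
    return np.argsort(-scores)[:top_k]   # indices of the best-matching images
```

The same pattern works in the other direction (caching text embeddings for a fixed label set); in either case only one tower runs per request, which is what makes millisecond-level serving feasible.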
Technical Principles of FG-CLIP 2
- Hierarchical Alignment Architecture: Improves image-detail understanding through global semantic alignment and fine-grained vision-language learning.
- Dynamic Attention Mechanism: Focuses intelligently on key visual regions, enhancing performance on complex tasks.
- Bilingual Collaboration Strategy: Optimizes the balance between Chinese and English comprehension, boosting overall bilingual performance.
- Multimodal Data Training: Trains on large-scale Chinese-English image-text pairs to enhance cross-lingual generalization.
- Fine-Grained Supervised Learning: Introduces supervision signals such as region-text matching and long-description modeling to strengthen fine-grained visual understanding.
- Intra-Text Contrastive Learning: Applies an intra-modal text contrastive loss to better distinguish semantically similar descriptions.
- Hard Negative Sample Training: Incorporates “hard negatives” generated by large models to further enhance robustness and accuracy (see the loss sketch after this list).
- Dynamic Resolution Mechanism: Enables adaptive processing of inputs with varying resolutions, improving flexibility and versatility (see the preprocessing sketch after this list).
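To make the hard-negative idea concrete, the sketch below shows a CLIP-style symmetric contrastive loss in which each image additionally carries model-generated near-miss captions as extra negative columns. This is a generic illustration of the technique named above, not FG-CLIP 2’s exact objective.

```python
# Generic sketch: image-text contrastive loss with extra hard text negatives.
# Not FG-CLIP 2's exact loss; it illustrates appending LLM-generated
# near-miss captions as additional negative columns in the softmax.
import torch
import torch.nn.functional as F

def contrastive_with_hard_negatives(img_emb, txt_emb, hard_txt_emb, temp=0.07):
    """
    img_emb:      (B, D) image embeddings
    txt_emb:      (B, D) matching caption embeddings
    hard_txt_emb: (B, K, D) hard negatives per image (e.g. the same caption
                  with one attribute or spatial relation changed)
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    hard = F.normalize(hard_txt_emb, dim=-1)

    # Standard in-batch logits: each image scored against every caption.
    logits_i2t = img @ txt.T / temp                             # (B, B)
    # Hard negatives: each image scored against its own K near-miss captions.
    hard_logits = torch.einsum("bd,bkd->bk", img, hard) / temp  # (B, K)

    # Append hard-negative columns; the correct caption stays at index i.
    logits = torch.cat([logits_i2t, hard_logits], dim=1)        # (B, B + K)
    labels = torch.arange(img.size(0), device=img.device)

    loss_i2t = F.cross_entropy(logits, labels)
    loss_t2i = F.cross_entropy(logits_i2t.T, labels)            # text-to-image side
    return (loss_i2t + loss_t2i) / 2
```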
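The dynamic resolution mechanism can likewise be illustrated with a small preprocessing helper: preserve the aspect ratio, snap both sides to multiples of the vision transformer’s patch size, and cap the total patch count. This is a generic sketch of the technique; PATCH and MAX_PATCHES are assumed values, not FG-CLIP 2’s confirmed pipeline.

```python
# Hypothetical sketch of dynamic-resolution preprocessing: preserve aspect
# ratio, snap each side down to a patch multiple, cap the patch (token) count.
# PATCH and MAX_PATCHES are assumed values, not FG-CLIP 2's actual settings.
from PIL import Image

PATCH = 16          # assumed ViT patch size
MAX_PATCHES = 1024  # assumed token budget per image

def dynamic_resize(img: Image.Image) -> Image.Image:
    w, h = img.size
    # Shrink only if the image would exceed the patch budget.
    scale = min(1.0, (MAX_PATCHES * PATCH * PATCH / (w * h)) ** 0.5)
    # Round each side down to the nearest patch multiple (at least one patch).
    new_w = max(PATCH, int(w * scale) // PATCH * PATCH)
    new_h = max(PATCH, int(h * scale) // PATCH * PATCH)
    return img.resize((new_w, new_h))

# Example: a 1920x1080 photo becomes 672x384, i.e. 42 x 24 = 1008 patches,
# while a small 200x120 image passes through as 192x112 with no upscaling.
```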
Project Links
- Official Website: https://360cvgroup.github.io/FG-CLIP/
- GitHub Repository: https://github.com/360CVGroup/FG-CLIP
- arXiv Paper: https://arxiv.org/pdf/2510.10921
Application Scenarios of FG-CLIP 2
- Home Robotics: Accurately understands and executes complex household instructions such as “pick up the phone with a cracked screen on the coffee table,” improving the practicality of home robots.
- Security Monitoring: Quickly locates and identifies targets, e.g., “find the suspicious person wearing a black cap,” increasing the efficiency and accuracy of surveillance systems.
- E-commerce: Enhances text-to-image retrieval accuracy and reduces multilingual labeling costs, optimizing the user experience through precise understanding of product descriptions.
- Autonomous Driving: Accurately recognizes road objects and scenes, such as “detect obstacles in the front lane,” improving driving safety.
- Medical Imaging: Assists doctors in diagnostic imaging tasks, such as “identify abnormal areas in X-ray images,” improving diagnostic accuracy and efficiency.
- Education: Powers intelligent educational tools that can “recognize objects in pictures and provide related knowledge,” enriching teaching content and interactivity.