HunyuanOCR – Tencent Hunyuan’s End-to-End OCR Vision-Language Model
What is HunyuanOCR?
HunyuanOCR is an open-source, end-to-end OCR vision-language model developed by Tencent’s Hunyuan team. Built on Hunyuan’s native multimodal architecture, it achieves state-of-the-art performance on multiple OCR tasks with only 1B parameters. Because the model is lightweight and end-to-end, a single instruction and a single inference pass produce the final result, which is far more streamlined than a traditional cascaded OCR pipeline. It supports 100+ languages and handles both single-language and mixed-language documents.
HunyuanOCR covers the classic OCR tasks: text detection and recognition, complex document parsing, open-field information extraction, and video subtitle extraction. It also supports end-to-end photo translation and document Q&A.

Key Features of HunyuanOCR
1. Text Detection and Recognition
Detects and recognizes text within images, outputting both textual content and bounding-box coordinates. Works across diverse scenarios including documents, artistic text, street scenes, and handwriting.
2. Complex Document Parsing
Processes multilingual documents and converts them into digital formats. Text is arranged in reading order; formulas are expressed in LaTeX; tables are formatted as HTML.
3. Open-Field Information Extraction
Extracts key fields from common cards, certificates, and receipts (e.g., name, address, organization) and outputs them in structured JSON, enabling easy downstream processing.
4. Video Subtitle Extraction
Automatically extracts subtitles from video frames, supporting both single-language and bilingual subtitles—useful for content production and translation workflows.
5. Image Text Translation
Translates text in images from 14 less widely used languages (including German, Spanish, and Japanese) into Chinese or English, and also translates between Chinese and English, for cross-language document processing.
Technical Principles of HunyuanOCR
End-to-End Architecture
Uses a fully end-to-end training and inference paradigm, producing results directly from images without complex cascaded steps—boosting both efficiency and accuracy.
Multimodal Fusion
Built on Hunyuan’s native multimodal architecture, deeply integrating visual and linguistic features for stronger understanding and extraction capabilities.
High-Quality Data Training
Trained on large-scale, high-quality, application-oriented data and further refined with online reinforcement learning, which gives the model strong performance and robust generalization.
Lightweight Design
With only 1B parameters, the model is highly efficient, reducing computation and deployment cost while maintaining SOTA performance—ideal for diverse hardware setups.
Multi-Language Support
Supports 100+ languages, including mixed-language documents, which makes it suitable for OCR applications that serve a global audience.
Project Links
- Official website: https://hunyuan.tencent.com/vision/zh?tabIndex=0
- GitHub repository: https://github.com/Tencent-Hunyuan/HunyuanOCR
- HuggingFace model: https://huggingface.co/tencent/HunyuanOCR
- Technical report: https://github.com/Tencent-Hunyuan/HunyuanOCR/blob/main/HunyuanOCR_Technical_Report.pdf
- Online demo: https://huggingface.co/spaces/tencent/HunyuanOCR
Application Scenarios
Document Processing
Digitizing scanned or photographed multilingual documents, including extraction of text, formulas (LaTeX), and tables (HTML).
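Because parsed tables are returned as HTML, downstream code usually has to turn that markup into rows before it can be stored or analyzed. The following is a minimal sketch using only the Python standard library. The sample HTML string is a made-up stand-in for model output, and `TableRows` and `html_table_to_rows` are hypothetical helper names, not part of HunyuanOCR.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row being built, or None when outside a <tr>
        self._cell = None   # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            if self._cell is not None and self._row is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr":
            if self._row is not None:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_rows(html):
    """Return the table as a list of rows (lists of cell strings)."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical model output for a small two-column table.
sample = (
    "<table><tr><th>Item</th><th>Price</th></tr>"
    "<tr><td>Tea</td><td>3.50</td></tr></table>"
)
rows = html_table_to_rows(sample)
```

This sketch ignores `rowspan`/`colspan` attributes and nested tables; a document with merged cells would need extra handling.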
Receipt & Invoice Field Extraction
Accurately extracts key fields (amount, date, serial number, etc.) from receipts or invoices for accounting or automated workflows.
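If the extracted fields are requested as JSON, an automated workflow will usually want to validate the result before using it. The sketch below assumes the response contains a single JSON object, possibly surrounded by other text. The field names (`amount`, `date`, `serial_number`) and the helper `parse_fields` are illustrative assumptions, not a documented HunyuanOCR schema.

```python
import json


def parse_fields(response, required):
    """Parse the JSON object in a model response and list missing fields.

    Returns (fields, missing), where `missing` holds every required key
    that is absent, empty, or null.
    """
    start = response.find("{")
    end = response.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in response")
    fields = json.loads(response[start:end + 1])
    missing = [
        key for key in required
        if key not in fields or fields[key] in ("", None)
    ]
    return fields, missing


# Hypothetical response text; the real output format may differ.
response = 'Result:\n{"amount": "128.00", "date": "2025-01-15", "serial_number": ""}'
fields, missing = parse_fields(response, ["amount", "date", "serial_number"])
```

Reporting missing fields instead of raising lets the caller decide whether to retry, fall back to manual review, or accept a partial record.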
Video Subtitle Extraction
Extracts subtitles from videos automatically—both single-language and bilingual—supporting video creation, localization, and editing.
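When subtitles are read frame by frame, the same line typically appears in many consecutive frames. One possible post-processing step, sketched here under the assumption that each sampled frame yields a `(timestamp_seconds, text)` pair, is to merge consecutive identical readings into timed segments. The function name and the input format are hypothetical and not taken from the HunyuanOCR project.

```python
def merge_subtitle_frames(frames):
    """Collapse consecutive frames with identical text into segments.

    `frames` is an iterable of (timestamp, text) pairs in time order.
    Returns a list of (start, end, text) tuples. Frames with empty text
    end the current segment and produce no segment of their own.
    """
    segments = []
    current = None  # [start, end, text] of the segment being built
    for timestamp, text in frames:
        text = text.strip()
        if current is not None and text == current[2]:
            current[1] = timestamp  # same subtitle still on screen
            continue
        if current is not None:
            segments.append(tuple(current))
            current = None
        if text:
            current = [timestamp, timestamp, text]
    if current is not None:
        segments.append(tuple(current))
    return segments


# Made-up per-frame readings sampled every 0.5 seconds.
frames = [
    (0.0, "Hello"),
    (0.5, "Hello"),
    (1.0, ""),
    (1.5, "Hello"),
    (2.0, "Goodbye"),
]
segments = merge_subtitle_frames(frames)
```

Exact string matching is the simplest choice; if recognition results vary slightly between frames, a similarity threshold could replace the equality check.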
Photo Translation
Translates text in photos from a range of less widely used languages into Chinese or English, suitable for travel, study, or cross-cultural communication.
Information Extraction
Extracts structured fields from IDs, cards, and business cards (e.g., name, address), supporting various output formats.
Video Content Production
Helps creators extract on-screen text for subtitle generation, content indexing, or further analysis.
Education and Learning
Supports students and researchers by extracting key information from textbooks, papers, or notes—useful for multilingual study and research.