MonkeyOCR – A document parsing model jointly developed by Huazhong University of Science and Technology and Kingsoft Office
What is MonkeyOCR?
MonkeyOCR is a document parsing model jointly developed by Huazhong University of Science and Technology and Kingsoft Office. It converts unstructured document content into structured information by combining layout analysis, content recognition, and reading-order prediction. Its authors report an average improvement of 5.1% over the pipeline-based tool MinerU, with gains of 15.0% on formula parsing and 8.6% on table parsing. On multi-page documents it processes 0.84 pages per second, compared with 0.65 for MinerU and 0.12 for Qwen2.5-VL-7B. It handles document types such as academic papers, textbooks, and newspapers, and supports multiple languages, including Chinese and English, which makes it suitable for document digitization and automation.
Key Features of MonkeyOCR
- Document Parsing and Structuring: Converts unstructured content (text, tables, formulas, and images) from PDFs, images, and other document formats into structured, machine-readable information.
- Multilingual Support: Supports multiple languages, including Chinese and English.
- Efficient Complex Document Handling: Handles complex documents with formulas, tables, and multi-column layouts.
- Fast Multi-Page Document Processing: Processes 0.84 pages per second, ahead of MinerU (0.65 pages/s) and Qwen2.5-VL-7B (0.12 pages/s).
- Flexible Deployment and Scalability: Runs efficiently on a single NVIDIA 3090 GPU, which suits deployments of varying scale.
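To put the throughput figures above in practical terms, the short sketch below converts pages per second into wall-clock time for a batch. It assumes throughput stays constant across the whole batch, which is a simplification; the helper names `minutes_for` and `speedup` are illustrative and not part of MonkeyOCR.

```python
# Throughput figures quoted in the feature list, in pages per second.
PAGES_PER_SECOND = {
    "MonkeyOCR": 0.84,
    "MinerU": 0.65,
    "Qwen2.5-VL-7B": 0.12,
}


def minutes_for(pages: int, pages_per_second: float) -> float:
    """Minutes needed to parse `pages`, assuming constant throughput."""
    return pages / pages_per_second / 60


def speedup(tool: str, baseline: str) -> float:
    """Ratio of the two tools' throughputs (values above 1 mean `tool` is faster)."""
    return PAGES_PER_SECOND[tool] / PAGES_PER_SECOND[baseline]


if __name__ == "__main__":
    for name, rate in PAGES_PER_SECOND.items():
        print(f"{name}: {minutes_for(1000, rate):.1f} min per 1,000 pages")
    print(f"MonkeyOCR vs MinerU: {speedup('MonkeyOCR', 'MinerU'):.2f}x")
```

Under that assumption, a 1,000-page batch takes roughly 19.8 minutes with MonkeyOCR, 25.6 minutes with MinerU, and 138.9 minutes with Qwen2.5-VL-7B.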
Technical Principles Behind MonkeyOCR
- Structure-Recognition-Relation (SRR) Triplet Paradigm: A YOLO-based document layout detector first identifies the position and category of key elements (text blocks, tables, formulas, images, etc.). Each detected region is then passed to a large multimodal model (LMM) for end-to-end content recognition. Finally, a block-level reading order prediction mechanism determines the logical relationships between elements and reconstructs the document's semantic structure.
- MonkeyDoc Dataset: MonkeyDoc is described by its authors as the most comprehensive document parsing dataset to date, containing 3.9 million instances that cover over a dozen document types in both Chinese and English. It is built through a multi-stage pipeline that combines manual annotation, programmatic synthesis, and model-assisted auto-labeling, and it is used to train and evaluate MonkeyOCR so that the model generalizes across diverse and complex document scenarios.
- Model Optimization and Deployment: The model is trained on large-scale data with the AdamW optimizer and a cosine learning rate schedule, aiming to balance accuracy and efficiency. With LMDeploy, MonkeyOCR runs efficiently on a single NVIDIA 3090 GPU, supporting fast inference and scalable deployment.
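The three SRR stages described above can be sketched as a small pipeline. This is a minimal illustration of the data flow only, not MonkeyOCR's code: the detector and recognizer are hypothetical stubs standing in for the YOLO-based detector and the LMM, and the relation stage uses a plain top-to-bottom, left-to-right sort instead of the learned reading order predictor. All names (`Block`, `detect_structure`, `recognize`, `order_blocks`, `parse_page`, `fake_recognizer`) are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page pixels


@dataclass
class Block:
    bbox: BBox
    category: str      # e.g. "text", "table", "formula", "figure"
    content: str = ""  # filled in by the recognition stage


def detect_structure(page: dict) -> List[Block]:
    """Structure stage. Hypothetical stub: returns precomputed regions
    where a real system would run a layout detector on the page image."""
    return [Block(bbox=b, category=c) for b, c in page["regions"]]


def recognize(blocks: List[Block], recognizer: Callable[[Block], str]) -> List[Block]:
    """Recognition stage: each detected region is recognized independently."""
    for block in blocks:
        block.content = recognizer(block)
    return blocks


def order_blocks(blocks: List[Block]) -> List[Block]:
    """Relation stage. Stand-in heuristic (sort by top edge, then left edge),
    NOT the learned block-level reading order predictor."""
    return sorted(blocks, key=lambda b: (b.bbox[1], b.bbox[0]))


def parse_page(page: dict, recognizer: Callable[[Block], str]) -> List[Block]:
    """Run the three stages in sequence: structure, recognition, relation."""
    blocks = detect_structure(page)
    blocks = recognize(blocks, recognizer)
    return order_blocks(blocks)


def fake_recognizer(block: Block) -> str:
    """Hypothetical recognizer standing in for the multimodal model."""
    return f"<{block.category} at y={block.bbox[1]:.0f}>"


# A toy single-column page whose regions arrive in detection order.
page = {
    "regions": [
        ((50, 400, 550, 600), "table"),
        ((50, 50, 550, 100), "text"),
        ((50, 150, 550, 350), "formula"),
    ]
}

result = parse_page(page, fake_recognizer)
for block in result:
    print(block.category, block.content)
```

On this toy page the blocks come out in the order text, formula, table. A geometric sort like this one breaks down on multi-column layouts, which is the kind of case a learned reading order predictor is meant to handle.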
Project Resources for MonkeyOCR
- GitHub Repository: https://github.com/Yuliang-Liu/MonkeyOCR
- Hugging Face Model Hub: https://huggingface.co/echo840/MonkeyOCR
- arXiv Technical Paper: https://arxiv.org/pdf/2506.05218
- Online Demo: http://vlrlabmonkey.xyz:7685/
Application Scenarios for MonkeyOCR
- Automated Business Processes: Automates data extraction and structuring for internal business documents such as contracts, reports, and invoices, reducing manual effort.
- Digital Archiving: Supports digitization and archiving of paper documents for libraries and archives, enabling long-term preservation and easy retrieval.
- Smart Education: Parses textbooks, exam papers, and academic articles, extracting content for online learning platforms and educational resource repositories.
- Medical Record Management: Helps hospitals parse medical records and test reports and extract key information for electronic health record (EHR) systems.
- Academic Research: Helps researchers extract key information from large volumes of academic literature for reviews and data analysis.