Nanonets-OCR-s – An OCR model developed by Nanonets

What is Nanonets-OCR-s?

Nanonets-OCR-s (Nanonets OCR Small) is an OCR model developed by Nanonets that converts images into structured Markdown format. It is designed to extract and intelligently process complex document elements such as LaTeX equations, image descriptions, signatures, watermarks, checkboxes, and complex tables. Powered by deep learning, the model is trained on large datasets and supports various document types including research papers, financial documents, and medical forms. The Markdown output is optimized for consumption by large language models (LLMs), making it widely useful across academic, legal, financial, and enterprise domains—significantly enhancing document processing efficiency and accuracy.

Key Features of Nanonets-OCR-s

LaTeX Equation Recognition: Automatically converts mathematical formulas and equations into proper LaTeX syntax, supporting both inline and display math.
Intelligent Image Description: Uses structured tags to describe images in documents, making them readable by LLMs. It can describe single or multiple images (e.g., logos, charts, graphs, QR codes) and predicts descriptions within <img> tags and page numbers in <page_number> tags.
Signature Detection and Isolation: Identifies and isolates signatures in documents—critical for legal and business contexts. Signature content is predicted within <signature> tags.
Watermark Extraction: Similar to signature detection, the model can detect and extract watermark text, outputting it within <watermark> tags.
Smart Checkbox Handling: Translates checkboxes and radio buttons in forms into standardized Unicode symbols. Checkbox states are predicted within <checkbox> tags.
Complex Table Extraction: Extracts complex tables and converts them into Markdown and HTML table formats.

Technical Principles Behind Nanonets-OCR-s

Vision-Language Model (VLM): The model is based on a VLM that jointly understands visual content (e.g., images, charts, tables) and language content (text), enabling accurate recognition of document structure and semantics.
Dataset Curation & Training: Trained on a curated dataset of over 250,000 document pages spanning various formats—research papers, financial records, legal files, medical forms, tax documents, receipts, and invoices. Both synthetic and manually labeled datasets were used: initial training on synthetic data for scale, followed by fine-tuning on real, manually annotated documents to enhance real-world performance.
Base Model Selection: The foundational model is Qwen2.5-VL-3B, which was fine-tuned on the curated dataset for enhanced performance in document-specific OCR tasks.
Intelligent Content Recognition & Semantic Tagging: The model semantically identifies document components and converts unstructured content into context-rich, structured Markdown—providing high-quality inputs for downstream tasks.
Model Optimization & Customization: Continuous model refinement during training ensures robustness across various document types and scenarios. Functional adjustments are made to meet specific application needs with high accuracy and reliability.

Project Links

Official Site: https://nanonets.com/research/nanonets-ocr-s/
HuggingFace Model Page: https://huggingface.co/nanonets/Nanonets-OCR-s

Application Scenarios of Nanonets-OCR-s

Academic Paper Digitization: Converts academic papers with LaTeX equations and tables into structured Markdown, facilitating organization, citation, and analysis for researchers.
Research Material Management: Quickly extracts key information from papers—such as experimental data, charts, and conclusions—for easy reference and comparison.
Scholarly Publishing: Assists publishers in converting paper or PDF documents into web-ready formats, improving accessibility and searchability.
Legal Document Analysis: Identifies and extracts key clauses, legal citations, and statutes from legal documents, enhancing efficiency in legal research and case preparation.
Financial Report Processing: Extracts data from financial statements—such as income, expenses, and balance sheets—for financial analysis and automated report generation.