Dolphin: An Open-Source Document Parsing Model from ByteDance

What is Dolphin?

Dolphin is a lightweight document parsing model open-sourced by ByteDance. It follows a two-stage, analyze-then-parse approach. In the first stage, it analyzes the page layout and generates a sequence of layout elements in reading order; in the second stage, it uses those elements as anchors and extracts the content of each one in parallel. Its authors report state-of-the-art results on a range of document parsing tasks, ahead of models such as GPT-4.1 and Mistral-OCR. With only 322M parameters, Dolphin is compact and fast, and it parses a wide range of document elements, including text, tables, and formulas. The model’s code and pretrained weights are publicly available to developers and researchers.

Key Features of Dolphin

Layout Analysis:
Identifies various document elements (such as headings, charts, tables, and footnotes) and generates a sequence of elements in natural reading order.
Content Extraction:
Converts entire document pages into structured formats like JSON or Markdown for further processing and presentation.
Text Paragraph Parsing:
Accurately detects and extracts text content from documents, supporting multilingual text including Chinese and English.
Formula Recognition:
Recognizes complex mathematical formulas, including inline and block-level formulas, and outputs them in LaTeX format.
Table Parsing:
Parses complex table structures, extracts cell content, and generates corresponding HTML tables.
Lightweight Architecture:
With only 322 million parameters, Dolphin is small and runs quickly, making it ideal for resource-constrained environments.
Multi-format Input Support:
Handles many kinds of document images, such as pages from academic papers, business reports, and technical manuals.
Diverse Output Formats:
Supports output in formats like JSON, Markdown, and HTML, ensuring easy integration with different systems.
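
To make the output formats above more concrete, here is a minimal Python sketch of how a list of parsed elements might be serialized to JSON and to Markdown. The element schema (the "type" and "content" keys) and the helper names to_json and to_markdown are assumptions made for this illustration; they are not Dolphin's actual output schema or API.

```python
import json

# Hypothetical parsed elements in reading order. The keys "type" and
# "content" are assumed for illustration, not Dolphin's real schema.
elements = [
    {"type": "heading", "content": "Results"},
    {"type": "paragraph", "content": "Accuracy improves with scale."},
    {"type": "formula", "content": r"E = mc^2"},
    {"type": "table", "content": "<table><tr><td>a</td><td>b</td></tr></table>"},
]


def to_json(elements):
    """Serialize the element list as pretty-printed JSON."""
    return json.dumps(elements, ensure_ascii=False, indent=2)


def to_markdown(elements):
    """Render the element list as Markdown, one block per element."""
    parts = []
    for element in elements:
        kind, content = element["type"], element["content"]
        if kind == "heading":
            parts.append(f"# {content}")
        elif kind == "formula":
            # Wrap LaTeX in a display-math block.
            parts.append(f"$$\n{content}\n$$")
        else:
            # Paragraphs pass through unchanged; HTML tables can be
            # embedded in Markdown as raw HTML.
            parts.append(content)
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(to_json(elements))
    print(to_markdown(elements))
```

Keeping the elements in one intermediate list and rendering each target format from it means the reading order only has to be established once.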

Technical Principles Behind Dolphin

Page-Level Layout Analysis:
Dolphin uses a Swin Transformer to encode document images and extract visual features. A decoder generates a sequence of document elements, each tagged with its type (e.g., heading, table, figure) and coordinate position. The goal is to produce structured layout information in natural reading order.
Element-Level Content Parsing:
Using the layout information from the first stage, Dolphin crops a local view of each element from the original image and then extracts the content of all elements in parallel, guided by type-specific prompts. For example, tables get a dedicated prompt that produces HTML, while formulas and text paragraphs share a single prompt, with formulas emitted as LaTeX. The decoder takes the cropped element image together with its prompt and generates the final parsed content.
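
The two stages described above can be sketched in Python roughly as follows. This is an illustration of the control flow only, not Dolphin's implementation: analyze_layout and parse_element are stubs standing in for the model calls, the prompt strings are placeholders, the page is a plain list of pixel rows rather than a real image, and a thread pool stands in for whatever parallel decoding the model actually performs. All names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class LayoutElement:
    kind: str   # e.g. "heading", "paragraph", "table", "formula"
    box: tuple  # (left, top, right, bottom) in pixels


# Placeholder prompts, assumed for this sketch.
TABLE_PROMPT = "Parse the table in the image."
TEXT_PROMPT = "Read text in the image."


def prompt_for(element):
    """Tables get a dedicated prompt; formulas and paragraphs share one."""
    return TABLE_PROMPT if element.kind == "table" else TEXT_PROMPT


def crop(page, box):
    """Cut the element's region out of a page stored as rows of pixels."""
    left, top, right, bottom = box
    return [row[left:right] for row in page[top:bottom]]


def analyze_layout(page):
    """Stage 1 stub: a real model would predict these from the image."""
    return [
        LayoutElement("heading", (0, 0, 100, 10)),
        LayoutElement("paragraph", (0, 10, 100, 40)),
        LayoutElement("table", (0, 40, 100, 60)),
    ]


def parse_element(crop_img, prompt):
    """Stage 2 stub: a real model would decode text, LaTeX, or HTML."""
    height = len(crop_img)
    width = len(crop_img[0]) if crop_img else 0
    return f"[{prompt}] {width}x{height} crop"


def parse_page(page):
    """Run layout analysis, then parse every element concurrently."""
    elements = analyze_layout(page)

    def parse_one(element):
        return parse_element(crop(page, element.box), prompt_for(element))

    # pool.map returns results in input order, so reading order is kept.
    with ThreadPoolExecutor() as pool:
        contents = list(pool.map(parse_one, elements))
    return list(zip(elements, contents))


if __name__ == "__main__":
    blank_page = [[0] * 100 for _ in range(60)]
    for element, content in parse_page(blank_page):
        print(element.kind, "->", content)
```

Because each element is cropped and parsed independently of the others, the second stage parallelizes naturally, and the reading order fixed in the first stage is enough to reassemble the page afterwards.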

Project Resources

GitHub Repository: https://github.com/bytedance/Dolphin
Hugging Face Model Hub: https://huggingface.co/ByteDance/Dolphin
arXiv Technical Paper: https://arxiv.org/pdf/2505.14059
Online Demo: http://115.190.42.15:8888/dolphin/

Application Scenarios

Academic Research:
Parses text, formulas, and figures from academic papers, aiding literature management and data analysis.
Business & Office Work:
Extracts key information from business documents to assist with contract review and report generation.
Education:
Digitizes textbooks and exam papers, supporting online learning and multilingual instruction.
Technical Development:
Parses technical documentation for easier code management and technical communication.
Everyday Use:
Efficiently processes routine documents to boost productivity in daily office tasks.