Chunkr – An open-source document processing API developed by Lumina AI

What is Chunkr?

Chunkr is an open-source document processing API developed by Lumina AI for Retrieval-Augmented Generation (RAG) and knowledge base applications. It converts complex documents (PDFs, PowerPoint slides, Word files, images, and others) into structured data that downstream systems can index and query.

Its core features are high-accuracy OCR, semantic chunking, multi-format output (HTML, Markdown, JSON, plain text), and integration with LLM backends such as OpenAI, Claude, and Ollama. Users can start with the hosted cloud service or deploy locally using Docker. Typical applications are document Q&A, enterprise knowledge bases, OCR workflows, and RAG systems.

Key Features of Chunkr

Multi-format Document Parsing: Supports PDF, PPT, Word, image files, and more. Converts complex documents into structured, machine-readable formats.
High-Accuracy OCR: Extracts text while preserving spatial layout, and returns positional metadata such as bounding boxes for each extracted element.
Semantic Chunking: Automatically segments documents into context-aware chunks suitable for RAG and LLM processing.
Multi-format Output: Exports structured content in formats such as HTML, Markdown, JSON, and plain text.
Python SDK: Offers a Python SDK for easy integration into Python applications and backend services.
LLM Integration: Compatible with various local or remote large language models (OpenAI, Claude, Ollama, etc.) with flexible configuration.
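
To make the OCR metadata and multi-format output features above more concrete, here is a minimal sketch of how a parsed segment with a bounding box might be represented and rendered as Markdown, plain text, or JSON. The `BBox` and `Segment` classes, their field names, and the `to_*` helpers are hypothetical illustrations, not Chunkr's actual schema or SDK.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class BBox:
    # Hypothetical bounding box in page units (not Chunkr's real schema).
    left: float
    top: float
    width: float
    height: float


@dataclass
class Segment:
    # Hypothetical layout element produced by OCR plus layout analysis.
    kind: str   # assumed labels, e.g. "Title", "Text", "Table"
    text: str
    page: int
    bbox: BBox


def to_markdown(segments):
    """Render segments as Markdown, promoting titles to headings."""
    lines = []
    for seg in segments:
        lines.append(f"# {seg.text}" if seg.kind == "Title" else seg.text)
    return "\n\n".join(lines)


def to_plain_text(segments):
    """Render segments as plain text, one segment per line."""
    return "\n".join(seg.text for seg in segments)


def to_json(segments):
    """Serialize segments, including nested bounding boxes, as JSON."""
    return json.dumps([asdict(seg) for seg in segments], indent=2)


segments = [
    Segment("Title", "Quarterly Report", 1, BBox(72, 60, 400, 32)),
    Segment("Text", "Revenue grew in Q3.", 1, BBox(72, 110, 450, 18)),
]
print(to_markdown(segments))
print(to_json(segments))
```

Keeping the bounding box alongside the text is what lets a downstream application link an answer back to its location on the page.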

How Chunkr Works

Vision-Language Models (VLMs): Chunkr uses vision-language models to understand both the layout and content of documents. VLMs combine computer vision and natural language processing to recognize text, images, tables, and spatial relationships. This enables precise OCR and semantic chunking.
Document Layout Analysis: The system analyzes the layout of documents to identify headings, paragraphs, tables, charts, and other elements. Based on this structure, Chunkr splits content logically to create context-aware segments for downstream processing.
OCR Technology: Chunkr leverages advanced OCR to extract text from documents while retaining layout and positional information. This data is used for further chunking and structuring.
Semantic Chunking: Using NLP techniques, Chunkr analyzes extracted text semantically and breaks it into logically coherent blocks. Each chunk maintains relevant context, optimized for input into LLMs or RAG pipelines.
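
The pipeline described above (layout elements first, then grouping into context-preserving chunks) can be sketched with a simple heuristic: start a new chunk at every heading, or when a word budget would be exceeded. This is a deliberately simplified stand-in and not Chunkr's actual algorithm; `Segment`, `Chunk`, `chunk_segments`, and the `target_words` parameter are assumed names for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    kind: str   # hypothetical labels: "Title", "Text", "Table"
    text: str


@dataclass
class Chunk:
    segments: list = field(default_factory=list)

    @property
    def word_count(self):
        return sum(len(s.text.split()) for s in self.segments)

    @property
    def text(self):
        return "\n".join(s.text for s in self.segments)


def chunk_segments(segments, target_words=20):
    """Group segments into chunks, closing the current chunk at each
    title or when adding a segment would exceed the word budget."""
    chunks, current = [], Chunk()
    for seg in segments:
        seg_words = len(seg.text.split())
        starts_section = seg.kind == "Title"
        over_budget = current.word_count + seg_words > target_words
        if current.segments and (starts_section or over_budget):
            chunks.append(current)
            current = Chunk()
        current.segments.append(seg)
    if current.segments:
        chunks.append(current)
    return chunks


doc = [
    Segment("Title", "Introduction"),
    Segment("Text", "Chunking keeps related text together."),
    Segment("Title", "Methods"),
    Segment("Text", "Layout analysis finds headings and tables."),
    Segment("Table", "Model | Accuracy"),
]
for i, chunk in enumerate(chunk_segments(doc)):
    print(i, repr(chunk.text))
```

With the default budget, the sample document yields two chunks, one per heading, so each heading stays attached to the text and table it introduces.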

Project Links

Official Website: https://chunkr.ai/
GitHub Repository: https://github.com/lumina-ai-inc/chunkr

Use Cases for Chunkr

Document Q&A Systems: Converts complex documents into structured datasets to build high-quality corpora for question-answering applications.
Enterprise Knowledge Base Construction: Quickly transforms internal corporate documents into structured knowledge bases, improving information retrieval and management.
OCR Applications: Delivers high-accuracy text extraction with spatial metadata, supporting recognition of tables, mixed media layouts, and more.
RAG Systems: Provides structured data outputs (e.g., JSON, Markdown) tailored for efficient document retrieval and language generation pipelines.
Intelligent Document Processing: Enables summarization, classification, and automatic annotation through semantic chunking and LLM integration.
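
As a rough illustration of the Document Q&A and RAG use cases above, the sketch below ranks already-extracted chunks against a question using bag-of-words cosine similarity. This is an assumption-laden toy: production systems typically use embedding models and a vector store, and the `tokenize`, `cosine`, and `retrieve` helpers are hypothetical names, not part of Chunkr.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=1):
    """Return up to k chunks with non-zero similarity, best match first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(c))), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:k] if score > 0]


# Sample chunk texts, standing in for the output of a document parser.
chunks = [
    "Refund policy: customers may return items within 30 days.",
    "Shipping: orders are dispatched within two business days.",
    "Warranty: hardware is covered for one year.",
]
top = retrieve("How many days do customers have to return items?", chunks)
print(top)
```

The retrieved chunk would then be passed to an LLM as context for generating the answer.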