nanochat – Karpathy’s Open-Source, Low-Cost, Full-Stack ChatGPT Project
What is nanochat?
nanochat is an open-source project released by AI expert Andrej Karpathy for training small language models cheaply and efficiently while reaching ChatGPT-like conversational ability. For about $100 (8×H100 GPUs for roughly 4 hours), users can train a small model capable of basic dialogue, storytelling, poetry generation, and simple Q&A. With a budget of about $1,000 (around 41.6 hours of training), performance improves significantly: the model can solve simple math and coding problems and answer multiple-choice test questions.
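The two budget figures can be sanity-checked with a little arithmetic. The sketch below derives the implied hourly price from the numbers quoted above; it assumes the $1,000 run uses the same 8×H100 node as the $100 run, which the text does not state explicitly, and the function name `implied_rates` is purely illustrative.

```python
# Figures quoted in the text; everything else is derived from them.
GPUS_PER_NODE = 8  # assumption: both budget tiers use one 8xH100 node

TIERS = {
    "$100 tier": {"budget_usd": 100.0, "hours": 4.0},
    "$1,000 tier": {"budget_usd": 1000.0, "hours": 41.6},
}


def implied_rates(budget_usd, hours, gpus=GPUS_PER_NODE):
    """Return (node $/hour, per-GPU $/hour, total GPU-hours)."""
    node_rate = budget_usd / hours
    return node_rate, node_rate / gpus, hours * gpus


if __name__ == "__main__":
    for name, tier in TIERS.items():
        node_rate, gpu_rate, gpu_hours = implied_rates(**tier)
        print(
            f"{name}: {node_rate:.2f} $/node-hour, "
            f"{gpu_rate:.2f} $/GPU-hour, {gpu_hours:.1f} GPU-hours"
        )
```

Both tiers work out to roughly $24 to $25 per node-hour, or about $3 per GPU-hour, so the two quoted budgets are consistent with a single rental price.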
The project includes a complete end-to-end pipeline covering data preparation, pretraining, mid-training, supervised fine-tuning (SFT), reinforcement learning (RL), and inference deployment, all implemented in about 8,000 lines of code. Its simple, readable structure makes it ideal for both learning and practical experimentation.
Key Features of nanochat
- Tokenizer Training: Implemented in Rust; the resulting tokenizer converts text into token ID sequences efficiently.
- Pretraining: Trains a Transformer-based large language model on the FineWeb dataset, with performance evaluated using the CORE benchmark.
- Mid-Training: Continues training on SmolTalk (user–assistant dialogues), multiple-choice datasets, and tool-use datasets, adapting the model to conversational and reasoning contexts.
- Supervised Fine-Tuning (SFT): Fine-tunes the chat model, which is then evaluated on ARC-E/C and MMLU (world-knowledge multiple-choice questions), GSM8K (math), and HumanEval (code).
- Reinforcement Learning (RL) Fine-Tuning: Uses the GRPO algorithm on the GSM8K dataset to further improve math reasoning.
- Inference and Deployment: Implements an efficient inference engine with KV caching, simple prefill/decoding logic, and tool usage (via a lightweight Python sandbox). Users can interact through a CLI or a ChatGPT-like WebUI.
- Report Card Generation: Produces a single Markdown report summarizing the training and inference process, presented in a gamified format for easy interpretation.
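To make the tokenizer-training step more concrete, here is a minimal sketch of byte-pair encoding (BPE), a common way to learn a text-to-token-ID mapping. It is written in Python for readability and is not nanochat's Rust implementation; the function names are illustrative, and simplifications such as skipping regex pre-splitting of the text are assumptions of this sketch.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges; returns {(left_id, right_id): new_id}."""
    ids = list(text.encode("utf-8"))  # start from raw bytes: IDs 0..255
    merges = {}  # dict insertion order doubles as merge order
    for _ in range(num_merges):
        pair_counts = Counter(zip(ids, ids[1:]))
        if not pair_counts:
            break
        pair = max(pair_counts, key=pair_counts.get)
        if pair_counts[pair] < 2:
            break  # no pair repeats, so further merges would not compress
        new_id = 256 + len(merges)
        merges[pair] = new_id
        ids = merge_pair(ids, pair, new_id)
    return merges


def encode(text, merges):
    """Convert text to token IDs by replaying the merges in learned order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge_pair(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Convert token IDs back to text by expanding each ID to its bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


if __name__ == "__main__":
    corpus = "low lower lowest low low"
    merges = train_bpe(corpus, num_merges=5)
    tokens = encode(corpus, merges)
    print(len(corpus.encode("utf-8")), "bytes ->", len(tokens), "tokens")
    print(decode(tokens, merges))
```

Starting from raw bytes means any input string can be encoded, even if it contains characters never seen during training; the learned merges only make frequent sequences shorter.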
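The core idea behind GRPO-style RL on a math dataset can be sketched briefly: sample several completions for the same question, score each one, and credit each completion relative to its group's average. The code below shows only that group-relative advantage step under stated assumptions. The reward rule (compare the last number in the completion with the reference answer) and all function names are hypothetical, and the policy-gradient update that would consume these advantages is omitted.

```python
import re
from statistics import mean, pstdev


def final_number(text):
    """Return the last number appearing in `text` as a string, or None."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None


def correctness_reward(completion, reference_answer):
    """Hypothetical reward: 1.0 if the final numbers match, else 0.0."""
    predicted = final_number(completion)
    if predicted is None:
        return 0.0
    return 1.0 if predicted == final_number(reference_answer) else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt.

    Completions that beat the group mean get a positive advantage, the rest
    a negative one. `eps` avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    reference = "#### 18"  # made-up reference answer for illustration
    completions = [  # made-up samples for a single question
        "16 - 3 - 4 = 9 eggs, 9 * 2 = 18",
        "She sells 16 * 2 = 32",
        "The answer is 18.",
        "I am not sure.",
    ]
    rewards = [correctness_reward(c, reference) for c in completions]
    for c, a in zip(completions, group_relative_advantages(rewards)):
        print(f"{a:+.2f}  {c}")
```

Because advantages are computed within the group, no separate value network is needed to provide a baseline. When every sample in a group earns the same reward, all advantages are zero and that question contributes no learning signal.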
Technical Principles of nanochat
- Minimalist Codebase: About 8,000 lines of clean, single-repo code with minimal dependencies; simple, transparent, and easy to modify.
- Rust-Based Tokenizer: Implements tokenizer training in Rust for fast and efficient tokenization.
- Transformer Architecture: Uses a Transformer-based LLM trained to learn language patterns and knowledge from large datasets.
- Data-Driven Training: Pretrained on large-scale datasets such as FineWeb to acquire general language understanding.
- Mid-Training Adaptation: Fine-tuned on dialogue-oriented datasets (e.g., SmolTalk) to better handle conversational and contextual tasks.
- Reinforcement Learning Optimization: Applies the GRPO algorithm to improve reasoning and response quality through targeted RL fine-tuning.
- Efficient Inference Engine: Incorporates KV caching and optimized prefill/decoding for fast inference performance.
- WebUI Interaction: Offers a ChatGPT-style web interface, allowing users to chat directly with their trained model.
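KV caching, listed above, avoids recomputing attention keys and values for tokens that have already been processed. The following is a minimal single-head sketch in pure Python, not nanochat's engine: the class and function names are illustrative, the weights are plain nested lists, and prefill is written as a simple loop even though real engines process the prompt in parallel.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project(x, weight):
    """Matrix-vector product; `weight` is a list of rows."""
    return [dot(row, x) for row in weight]


def attend(query, keys, values):
    """Softmax attention of one query over all cached keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, k) * scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


class KVCache:
    """Stores the key and value vectors of every token seen so far."""

    def __init__(self):
        self.keys, self.values = [], []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)


def decode_step(x, wq, wk, wv, cache):
    """Process one new token: project it once, then attend over the cache."""
    cache.append(project(x, wk), project(x, wv))
    return attend(project(x, wq), cache.keys, cache.values)


def prefill(prompt, wq, wk, wv, cache):
    """Fill the cache from the prompt; returns the last position's output."""
    out = None
    for x in prompt:
        out = decode_step(x, wq, wk, wv, cache)
    return out


def full_recompute(tokens, wq, wk, wv):
    """Reference without a cache: re-project every token on each call."""
    keys = [project(x, wk) for x in tokens]
    values = [project(x, wv) for x in tokens]
    return attend(project(tokens[-1], wq), keys, values)


if __name__ == "__main__":
    import random

    random.seed(0)
    d = 4
    rand = lambda: [[random.uniform(-1, 1) for _ in range(d)] for _ in range(d)]
    wq, wk, wv = rand(), rand(), rand()
    tokens = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(6)]

    cache = KVCache()
    prefill(tokens[:4], wq, wk, wv, cache)
    cached = decode_step(tokens[4], wq, wk, wv, cache)
    reference = full_recompute(tokens[:5], wq, wk, wv)
    print(all(abs(a - b) < 1e-12 for a, b in zip(cached, reference)))
```

With the cache, each decode step performs one key and one value projection for the new token only, whereas `full_recompute` repeats those projections for the whole sequence at every step.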
Project Repository
Application Scenarios of nanochat
- Individuals and Teams: Suits privacy-conscious users or teams that want a secure, self-hosted conversational AI system running inside their own network.
- Developers and Tech Enthusiasts: Serves as a hands-on platform for learning and experimenting with LLM training, fine-tuning, inference, and CLI or web chat interfaces.
- Ad-Hoc Workgroups: Lets small groups, such as emergency response teams, quickly stand up a self-hosted AI assistant without depending on a central hosted service.
- Education and Research: Provides researchers, educators, and students with a low-cost, transparent, and modifiable LLM development framework for learning and experimentation.