What is REFRAG?
REFRAG is a high-efficiency decoding framework from Meta Superintelligence Labs, designed for retrieval-augmented generation (RAG) tasks. It optimizes how large language models (LLMs) process external knowledge through a "Compress → Sense → Expand" workflow: REFRAG splits retrieved long texts into chunks, generates a compact vector representation for each chunk, and thereby shortens the input sequence and reduces computational load. A reinforcement learning (RL) policy network then identifies the chunks that carry key information and keeps their original text. The framework accelerates time-to-first-token (TTFT) by up to 30× while maintaining performance comparable to full-context models, directly addressing the efficiency challenges LLMs face when handling long contexts.
Key Features of REFRAG
- Significantly reduces time-to-first-token (TTFT): By optimizing the decoding process, REFRAG accelerates TTFT by up to 30×, greatly improving real-time interactivity.
- Maintains or improves generation quality: Despite the acceleration, REFRAG keeps perplexity and downstream-task accuracy comparable to full-context baseline models, and even outperforms them on certain tasks.
- Expands the context window: Through compression, REFRAG lets models handle more contextual information under the same computational budget; the effective context window grows by up to 16×, improving performance on tasks that require long-context understanding (see the back-of-envelope calculation after this list).
- Adapts to diverse applications: REFRAG is suitable for RAG tasks and extends to multi-turn dialogue, long-document summarization, and other workloads that require long-context processing.
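To make the 16× figure concrete, here is a back-of-envelope calculation. The position budget below is an illustrative assumption, not a number from the paper:

```python
# Illustrative effect of a 16x compression rate on effective context.
compression_rate = 16        # tokens summarized by one chunk embedding
position_budget = 4096       # assumed decoder positions available for context
effective_context = compression_rate * position_budget
print(effective_context)     # 65536 retrieved tokens fit in the same budget
```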
Technical Principles of REFRAG
- Compress: Split retrieved long documents into chunks and generate a compact embedding for each, reducing input sequence length and computational cost while avoiding redundant encoding.
- Sense: An RL-based policy network analyzes all chunk embeddings together with the user query to identify which chunks contain core information that must be shown to the LLM as original text, ensuring key details are not lost.
- Expand: The final input to the main LLM is a hybrid sequence consisting mostly of chunk embeddings plus a few critical chunks in their original text form. The LLM generates answers from this optimized input, retaining key information while minimizing computational load (a minimal end-to-end sketch follows this list).
- Leverages sparsity in attention mechanisms: REFRAG observes that attention patterns in RAG tasks are largely block-diagonal, concentrated within individual documents and between each document and the user query. Selectively compressing and expanding the context removes computation these patterns never use (the second sketch below illustrates the mask structure).
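The three steps compose into a simple pipeline. The sketch below is a minimal illustration under stated assumptions, not Meta's implementation: the encoder, the scoring policy, and all names (`chunk`, `encode_chunk`, `policy_scores`, `build_hybrid_input`) and settings (`CHUNK_SIZE`, `EXPAND_BUDGET`) are hypothetical stand-ins. A real system would use a trained chunk encoder, the RL-trained policy, and a decoder whose embedding layer accepts precomputed chunk vectors.

```python
"""Minimal sketch of a Compress -> Sense -> Expand flow (illustrative only)."""
import numpy as np

CHUNK_SIZE = 128     # tokens per chunk (hypothetical setting)
EXPAND_BUDGET = 2    # how many chunks keep their original text

def chunk(tokens: list[str], size: int = CHUNK_SIZE) -> list[list[str]]:
    """Compress, part 1: split retrieved text into fixed-size chunks."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def encode_chunk(chunk_tokens: list[str], dim: int = 64) -> np.ndarray:
    """Compress, part 2: one dense vector per chunk.
    Stand-in encoder; REFRAG trains a lightweight encoder for this."""
    rng = np.random.default_rng(abs(hash(" ".join(chunk_tokens))) % 2**32)
    return rng.standard_normal(dim)

def policy_scores(query_vec: np.ndarray,
                  chunk_vecs: list[np.ndarray]) -> np.ndarray:
    """Sense: score each chunk's importance given the query.
    Stand-in for the RL policy network (here: cosine similarity)."""
    q_norm = np.linalg.norm(query_vec)
    return np.array([float(query_vec @ v / (q_norm * np.linalg.norm(v) + 1e-8))
                     for v in chunk_vecs])

def build_hybrid_input(chunks, chunk_vecs, scores, budget=EXPAND_BUDGET):
    """Expand: keep raw text for the top-scoring chunks, embeddings for the
    rest. The result is the short hybrid sequence fed to the LLM."""
    keep = set(np.argsort(scores)[-budget:])
    return [("text", c) if i in keep else ("embedding", v)
            for i, (c, v) in enumerate(zip(chunks, chunk_vecs))]

# Usage: ~600 retrieved tokens collapse to a handful of positions,
# of which only EXPAND_BUDGET chunks remain as original text.
tokens = ("retrieved passage " * 300).split()
chunks = chunk(tokens)
vecs = [encode_chunk(c) for c in chunks]
query_vec = encode_chunk("user question about the passage".split())
hybrid = build_hybrid_input(chunks, vecs, policy_scores(query_vec, vecs))
print([kind for kind, _ in hybrid])  # e.g. ['embedding', 'text', ...]
```

To visualize the sparsity observation, the second sketch builds the block-diagonal attention mask described above. The segment lengths are made-up illustrative values; this demonstrates the pattern REFRAG exploits, not how it computes attention.

```python
"""Sketch of block-diagonal attention in RAG: tokens attend within their own
retrieved passage, and every passage token attends to the query tokens."""
import numpy as np

def block_diagonal_mask(passage_lens: list[int], query_len: int) -> np.ndarray:
    """True = attention allowed. Layout: [passage_1 | ... | passage_k | query]."""
    total = sum(passage_lens) + query_len
    mask = np.zeros((total, total), dtype=bool)
    start = 0
    for n in passage_lens:                               # each passage: a diagonal block
        mask[start:start + n, start:start + n] = True
        mask[start:start + n, total - query_len:] = True  # passage -> query
        start += n
    mask[total - query_len:, :] = True                   # query attends everywhere
    return mask

mask = block_diagonal_mask([3, 2], query_len=2)
print(mask.astype(int))
# Off-block entries are 0: cross-passage attention is never used,
# which is the computation REFRAG's compression avoids.
```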
Project Resources
- arXiv Paper: https://arxiv.org/pdf/2509.01092
Applications of REFRAG
- Retrieval-Augmented Generation (RAG) Tasks: Optimized decoding significantly improves TTFT, ideal for scenarios requiring fast and accurate answer generation, such as intelligent customer support and online QA systems.
- Multi-turn Dialogue Systems: Efficiently handles long conversation histories while maintaining coherence and accuracy, enhancing user experience.
- Long-Document Summarization: Processes lengthy documents to generate high-quality summaries, suitable for news articles, academic papers, and other long texts.
- Knowledge Graph QA: Combines knowledge-graph retrieval with generation to quickly produce accurate answers, supporting knowledge-graph-driven intelligent QA systems.
- Content Creation Assistance: Rapidly generates creative text to help authors draft articles, stories, or other content, improving productivity in creative workflows.