FutureX – A Dynamic Real-Time Evaluation Benchmark Launched by ByteDance in Collaboration with Fudan University and Other Institutions

What is FutureX?

FutureX is a dynamic real-time evaluation benchmark jointly released by research teams from ByteDance, Fudan University, Stanford University, and Princeton University. It is designed specifically for future-prediction tasks performed by LLM agents. Using a semi-automated pipeline, FutureX collects future-event questions in real time from 195 high-quality websites, automatically retrieves the true outcomes once events are resolved, and scores the predictions. Because the answers do not exist when a prediction is made, this avoids data contamination. FutureX covers multiple domains, including politics, economics, finance, sports, and entertainment, and includes several question types: single-choice, multiple-choice, open-ended ranking, and numerical prediction. Questions are divided into four difficulty levels, providing a comprehensive assessment of LLM agents’ reasoning and predictive abilities.

Key Features of FutureX

Dynamic Real-Time Updates: Collects future-event questions in real time and automatically scores predictions once outcomes are known, keeping the evaluation timely and dynamic.
Avoids Data Contamination: Focuses on future-event prediction, so outcomes are not yet known at prediction time, ensuring evaluation fairness.
Simulates Real-World Challenges: Places LLM agents in real-world information flows, requiring advanced cognitive skills such as information gathering, data synthesis, probabilistic reasoning, and causal inference.
Large-Scale, Cross-Domain Coverage: Collects questions from 195 high-quality websites across politics, economics, finance, sports, and entertainment, providing a comprehensive evaluation environment.
Automated Evaluation Process: Question updates, answer collection, and scoring run automatically every day, improving efficiency and scalability.
Multiple Question Types & Difficulty Levels: Includes single-choice, multiple-choice, open-ended ranking, and numerical prediction questions across four difficulty levels to thoroughly assess LLM agent capabilities.
Advances LLM Agent Development: Provides a dynamic, contamination-free benchmark intended to help push LLM agents toward the performance level of professional human analysts on complex reasoning and prediction tasks.
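
To make the four question types more concrete, here is a minimal Python sketch of how a prediction of each type might be scored once the outcome is known. The scoring rules (exact match, set-overlap F1, position-wise agreement, and a relative-error tolerance) and all function names are illustrative assumptions; they are not the metrics FutureX actually uses.

```python
from __future__ import annotations


def score_single_choice(pred: str, truth: str) -> float:
    """Return 1.0 when the chosen option matches the outcome, else 0.0."""
    return 1.0 if pred.strip().upper() == truth.strip().upper() else 0.0


def score_multiple_choice(pred: set[str], truth: set[str]) -> float:
    """F1 overlap between the predicted and the true option sets (assumed rule)."""
    hits = len(pred & truth)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(truth)
    return 2 * precision * recall / (precision + recall)


def score_ranking(pred: list[str], truth: list[str]) -> float:
    """Fraction of rank positions where the predicted item is correct (assumed rule)."""
    if not truth:
        return 0.0
    matches = sum(1 for p, t in zip(pred, truth) if p == t)
    return matches / len(truth)


def score_numerical(pred: float, truth: float, tolerance: float = 0.05) -> float:
    """Return 1.0 when the relative error is within `tolerance` (assumed rule)."""
    scale = abs(truth) if truth != 0 else 1.0
    return 1.0 if abs(pred - truth) / scale <= tolerance else 0.0


if __name__ == "__main__":
    print(score_single_choice("b", "B"))
    print(score_multiple_choice({"A", "B"}, {"A", "C"}))
    print(score_ranking(["x", "y", "z"], ["x", "z", "y"]))
    print(score_numerical(102.0, 100.0))
```

Keeping one scoring function per question type means every prediction reduces to a number between 0 and 1, which is what allows results to be aggregated across domains and difficulty levels.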

Core Advantages of FutureX

Design Principle: Offers a dynamic, comprehensive, and uncontaminated evaluation that simulates real-world challenges to assess LLM agents’ core intelligence.
Contamination-Free: Ensures answers are unknown at prediction time, avoiding data leakage.
Real-World Challenge Simulation: Evaluates agents in realistic information flows, testing advanced cognitive skills.
Large-Scale, Cross-Domain: Questions are collected from 195 high-quality websites spanning multiple domains.
Dynamic, Automated Evaluation: Questions and scoring are updated daily, ensuring timeliness, objectivity, and scalability.

Construction Process of FutureX

Website Collection & Filtering: Uses AIME agents to gather candidate URLs, then filters them with LLMs and human review to select 195 high-quality sites as the event database.
Event Template Generation: Creates a question template for each website, with variables that can be filled in for different dates.
Daily Event Planning: Generates prediction questions daily, applying manipulations (e.g., random options) and filtering out harmful, subjective, or trivial events.
Agent Prediction & Evaluation: Triggers agents to predict new events daily and automatically scores predictions after event outcomes are known.
Continuous Updates & Maintenance: Updates the event database daily, removing unavailable events and adding new ones to maintain dynamic and timely evaluation.
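
The construction steps above can be sketched as a small daily loop. This is a simplified Python illustration, not the FutureX implementation: the `EventTemplate` and `Question` classes, the example URL and question text, and the reading of "random options" as shuffling the option order are all hypothetical.

```python
from __future__ import annotations

import random
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class EventTemplate:
    """A per-website question template with a date placeholder (hypothetical)."""
    source_url: str
    text: str            # expected to contain "{resolve_date}"
    options: list[str]
    horizon_days: int    # days between asking and resolution


@dataclass
class Question:
    text: str
    options: list[str]
    resolve_date: date
    source_url: str


def plan_daily_questions(templates, today, rng):
    """Instantiate every template for `today` and shuffle its options."""
    questions = []
    for t in templates:
        resolve_date = today + timedelta(days=t.horizon_days)
        options = list(t.options)
        rng.shuffle(options)  # assumption: "random options" means shuffled order
        questions.append(Question(
            text=t.text.format(resolve_date=resolve_date.isoformat()),
            options=options,
            resolve_date=resolve_date,
            source_url=t.source_url,
        ))
    return questions


def resolved(questions, today):
    """Return the questions whose outcome should be retrievable by `today`."""
    return [q for q in questions if q.resolve_date <= today]


if __name__ == "__main__":
    templates = [EventTemplate(
        source_url="https://example.com/markets",  # placeholder site
        text="Will the index close higher on {resolve_date} than today?",
        options=["Yes", "No"],
        horizon_days=7,
    )]
    pending = plan_daily_questions(templates, date(2025, 8, 1), random.Random(0))
    print(pending[0].text)
    print(len(resolved(pending, date(2025, 8, 8))))
```

Separating `plan_daily_questions` from `resolved` mirrors the two phases in the list: agents answer on the day a question is generated, and scoring is deferred until the resolution date has passed.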

Data Characteristics of FutureX

Real-Time: Data is updated daily from 195 websites, ensuring synchronization with current information.
Diverse: Covers politics, economics, finance, sports, and entertainment, including multiple question types.
Contamination-Free: Focused on future events to avoid data leakage.
Dynamic: Events and outcomes are updated regularly, adding new events and removing unavailable ones.
Challenging: Includes difficulty-level categorization, from simple multiple-choice to complex open-ended questions.
Large-Scale: Generates approximately 500 events per week, providing rich evaluation samples.
Reliable: Rigorous data filtering and human review ensure trustworthy sources and high-quality evaluation.

Project Resources

arXiv Paper: https://arxiv.org/pdf/2508.11987

Experimental Results

Overall Results: Grok-4 and Gemini-2.5-flash Deep Research perform best on the hardest tasks, while base LLMs perform well on simpler tasks.
By Difficulty Level: Performance drops significantly as task difficulty increases, particularly at Level 4 (super-agent level).
By Domain: Different models show strengths in different domains; for example, GPT models perform well in cryptocurrency and technology, while DouBao-Seed1.6-Thinking excels in finance and economics.
Factor Analysis: Linear regression shows that difficulty level, domain, and model type significantly impact performance.
Case Studies: Cover comparisons with Wall Street financial analysts, the impact of fake websites on agent predictions, and an evaluation of real-time search capabilities.
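
The factor analysis mentioned above can be illustrated with a small ordinary-least-squares fit on dummy-coded factors. The source says only that a linear regression was used; the encoding, the pure-Python solver, and the toy scores below are assumptions made for illustration and are not the paper's data or results.

```python
from __future__ import annotations


def design_row(level, domain, levels, domains):
    """Intercept plus 0/1 dummies; the first level and domain act as baselines."""
    return (
        [1.0]
        + [1.0 if level == lv else 0.0 for lv in levels[1:]]
        + [1.0 if domain == d else 0.0 for d in domains[1:]]
    )


def ols(X, y):
    """Solve the normal equations (X^T X) b = X^T y by Gauss-Jordan elimination."""
    n, k = len(X), len(X[0])
    # Augmented matrix [X^T X | X^T y]
    A = [
        [sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
        + [sum(X[r][i] * y[r] for r in range(n))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting for numerical stability
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


# Toy, made-up scores (NOT FutureX results): a baseline of 0.8 plus additive effects.
levels = [1, 2, 3, 4]
domains = ["finance", "sports"]
level_effect = {1: 0.0, 2: -0.2, 3: -0.4, 4: -0.6}
domain_effect = {"finance": 0.0, "sports": 0.1}

rows = [
    (lv, d, 0.8 + level_effect[lv] + domain_effect[d])
    for lv in levels
    for d in domains
]
X = [design_row(lv, d, levels, domains) for lv, d, _ in rows]
y = [score for _, _, score in rows]
coef = ols(X, y)

names = ["intercept", "level_2", "level_3", "level_4", "domain_sports"]
for name, c in zip(names, coef):
    print(f"{name:>14}: {c:+.3f}")
```

Each fitted coefficient is the estimated change in score relative to the baseline (Level 1, finance), which is how a regression of this kind separates the contribution of difficulty from that of domain. A model-type factor could be added as further dummy columns in the same way.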

Application Scenarios of FutureX

Finance: Evaluates LLM agents’ predictive ability for stock prices, economic indicators, and other future events, helping financial institutions select high-performing analytic agents.
Policy Making: Provides reliable evaluation tools for policy makers to assess potential impacts of different policies.
Business Decision-Making: Helps companies evaluate market trends and consumer behavior to support business decisions.
Technology Trend Analysis: Evaluates how well agents predict technological development and innovation trends, informing tech companies and investors.
Sports Prediction: Evaluates how well agents predict sports outcomes and athlete performance, supporting betting and event organization.
Entertainment Industry: Evaluates how well agents predict the popularity and box-office performance of films, music, and other entertainment products, to guide decision-making.