XBai o4 — an open-source parallel reasoning model with high-quality reasoning traces

What is XBai o4?

XBai o4 is an open-source large language model trained with a reflective generative form, which combines long chain-of-thought (CoT) reinforcement learning with process reward learning. According to its developers, it shows strong complex-reasoning ability and, in medium mode, outperforms OpenAI-o3-mini. The process reward model (PRM) and the policy model share a single backbone network, which significantly reduces inference cost. The model reports strong results on benchmarks such as AIME24 and LiveCodeBench v5. The project supports single-node and multi-node training and documents its installation and evaluation workflows in detail.

XBai o4 — Key features

Complex reasoning: Handles multi-step logical reasoning and mathematical problems, producing high-quality reasoning traces.
Efficient inference: The process reward model (PRM) and the policy model share one backbone network, which significantly lowers inference cost.
Multilingual support: Handles and generates high-quality text in multiple languages, suitable for a range of NLP tasks.
Flexible training & deployment: Ships with detailed training and deployment guides and supports both single-node and multi-node training, so developers can adapt to the hardware they have.
Multi-task learning: Trained on a mixture of tasks (language modeling, mathematical reasoning, logical reasoning), improving generalization and adaptability.
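
The shared-backbone idea behind the "Efficient inference" feature can be illustrated with a toy sketch. Everything here is hypothetical: `ToyBackbone`, `next_token_head` and `step_reward_head` are stand-ins invented for illustration and are not part of the XBai o4 codebase. The sketch only counts how many expensive backbone passes each layout needs, assuming the two heads are cheap compared with the backbone.

```python
class ToyBackbone:
    """Hypothetical stand-in for an expensive transformer trunk; counts forward passes."""

    def __init__(self) -> None:
        self.forward_passes = 0

    def encode(self, tokens: list[str]) -> list[float]:
        self.forward_passes += 1
        return [len(t) / 10.0 for t in tokens]  # toy per-token hidden state


def next_token_head(hidden: list[float]) -> float:
    """Toy policy head: a cheap function of the last hidden state."""
    return hidden[-1]


def step_reward_head(hidden: list[float]) -> float:
    """Toy reward head: a cheap score in (0, 1] from the mean hidden state."""
    mean = sum(hidden) / len(hidden)
    return 1.0 / (1.0 + mean)


def run_separate(tokens: list[str]) -> int:
    """Policy and reward model each own a backbone: two expensive passes."""
    policy_trunk, reward_trunk = ToyBackbone(), ToyBackbone()
    next_token_head(policy_trunk.encode(tokens))
    step_reward_head(reward_trunk.encode(tokens))
    return policy_trunk.forward_passes + reward_trunk.forward_passes


def run_shared(tokens: list[str]) -> int:
    """Both heads reuse one hidden state: a single expensive pass."""
    trunk = ToyBackbone()
    hidden = trunk.encode(tokens)
    next_token_head(hidden)
    step_reward_head(hidden)
    return trunk.forward_passes


tokens = ["solve", "x^2", "=", "4"]
print(run_separate(tokens), run_shared(tokens))
```

In this toy setting the separate layout runs the backbone twice per input and the shared layout runs it once; the real saving depends on how small the reward head is relative to the backbone.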

XBai o4 — Technical principles

Reflective generative form: XBai o4 is trained with a reflective generative form that unifies long CoT reinforcement learning and process reward learning, so a single model can both reason in depth and select high-quality reasoning traces.
Process reward learning: A process reward model (PRM) scores the intermediate steps of a reasoning trace rather than only the final answer, which helps the model learn sound intermediate steps and improves overall reasoning ability. In XBai o4 the PRM and the policy model share a backbone, which further reduces the computational cost of inference.
Multi-task training: The model is trained on multiple tasks (language modeling, math reasoning, logical reasoning, etc.), which enhances its applicability across scenarios and benchmarks.
Efficient inference architecture: Architectural and computational optimizations allow faster inference; multiple inference modes let users balance speed and accuracy. Detailed inference workflows and evaluation methods are provided for practical optimization.
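
To make trace selection and the speed/accuracy trade-off of the inference modes concrete, here is a minimal best-of-N sketch. It rests on several assumptions not stated above: that each mode samples a fixed number of candidate traces, that per-step scores lie in (0, 1], and that a trace is ranked by the geometric mean of its step scores. The names `CANDIDATES_PER_MODE`, `trajectory_score`, `select_best` and `toy_scorer`, and the candidate counts, are hypothetical and purely illustrative.

```python
import math
from typing import Callable, Sequence

# Hypothetical mapping from inference mode to number of sampled traces.
# More candidates cost more compute but give the scorer more to choose from.
CANDIDATES_PER_MODE = {"low": 2, "medium": 8, "high": 32}

Trace = Sequence[str]


def trajectory_score(step_scores: Sequence[float]) -> float:
    """Aggregate per-step scores in (0, 1] into one value via the geometric mean."""
    if not step_scores:
        raise ValueError("a trace needs at least one scored step")
    return math.exp(sum(math.log(s) for s in step_scores) / len(step_scores))


def select_best(
    traces: Sequence[Trace],
    score_steps: Callable[[Trace], Sequence[float]],
) -> Trace:
    """Return the candidate trace with the highest aggregated step score."""
    return max(traces, key=lambda steps: trajectory_score(score_steps(steps)))


def toy_scorer(steps: Trace) -> list[float]:
    """Stand-in for a process reward model: penalise steps that contain 'guess'."""
    return [0.2 if "guess" in step else 0.9 for step in steps]


candidates = [
    ["expand the square", "guess the root"],
    ["expand the square", "apply the quadratic formula"],
]
print(select_best(candidates, toy_scorer))
```

With the toy scorer, the second candidate wins because a single low-scoring step drags down the geometric mean of the first. A real system would replace `toy_scorer` with model-produced step scores and sample `CANDIDATES_PER_MODE[mode]` traces from the policy.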

Project links

GitHub repository: https://github.com/MetaStone-AI/XBai-o4/
Hugging Face model hub mirror: https://hf-mirror.com/MetaStoneTec/XBai-o4

Application scenarios

Education: Assist teaching by solving complex math and logic problems and explaining solution processes.
Research support: Help with literature reviews, experimental design ideas, and reasoning about complex scientific problems.
Programming assistance: Provide code generation, logical debugging, and troubleshooting suggestions to boost developer productivity and code quality.
Content creation: Rapidly generate high-quality copy, creative writing, and ideation to inspire creators.
Intelligent customer support: Deliver accurate answers and solutions to improve support efficiency and user experience.