MetaStone-S1 – A Reflective Generative Large Model Developed by RawStone Technology
What is MetaStone-S1?
MetaStone-S1 is a reflective generative large model developed by RawStone Technology that combines deep reasoning with self-filtering of reasoning chains. It follows a self-supervised reflective paradigm and is built on a shared-backbone architecture with a Self-supervised Process Reward Model (SPRM). With only about 53M additional parameters, the model scores the quality of each reasoning step in real time, without human annotations. MetaStone-S1 uses Long-CoT reinforcement learning to generate very long reasoning chains and is reported to outperform peer models on mathematics (AIME), coding (LiveCodeBench), and Chinese reasoning (C-Eval). It is open-sourced in 1.5B, 7B, and 32B versions and aims to deliver high performance at low inference cost through “self-correcting” reasoning.
Key Features of MetaStone-S1
- Deep Reasoning Generation: Produces very long and complex chains of thought (Long-CoT), which suits high-difficulty reasoning tasks such as mathematical proofs and algorithmic programming.
- Intelligent Reasoning Chain Optimization: A built-in Self-supervised Process Reward Model (SPRM) scores reasoning steps so that faulty steps can be detected and filtered out automatically, improving answer accuracy.
- Multi-Level Reasoning Modes: Offers three operational modes, Low (fast response), Medium (balanced precision and speed), and High (deep thinking), to match the reasoning demands of different scenarios.
- Open and Scalable: Fully open-sourced in 1.5B, 7B, and 32B parameter versions, along with supporting tools that let developers further improve reasoning in specific domains.
Technical Principles of MetaStone-S1
- Dual-Head Shared Architecture: SPRM shares a single Transformer backbone between the Policy Model and the Process Reward Model. Two heads run in parallel on top of it: a Generation Head that produces reasoning chains and a Scoring Head that scores each step in real time and is trained with self-supervision.
- Self-Supervised Process Rewards: The SPR Loss (Self-supervised Process Reward Loss) uses final-answer correctness as a weak supervision signal. A noise-filtering mechanism turns this signal into step-level pseudo-labels, so no manual annotation is needed.
- Dynamic Reasoning Selection: Applies Test-Time Scaling during inference. In High mode, for example, 32 candidate reasoning chains are generated and scored by SPRM, and the highest-scoring chain is selected to continue, forming a closed loop of “generation–evaluation–selection.”
- Joint Optimization Mechanism: Uses the GRPO reinforcement learning algorithm to jointly optimize the Policy Model and SPRM. The Policy Model aims to maximize answer correctness, while SPRM uses contrastive learning to separate high-quality from low-quality reasoning steps. The two co-evolve by sharing gradients.
- Emergent Capability Control: Introduces a scaling law relating thinking length to model performance. Adjusting the number of rollout iterations controls the compute budget (parameter count × thinking tokens), which enables a smooth transition from rapid response (Low) to deep thinking (High).
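The selection loop and the step-level pseudo-labels described above can be illustrated with a small, self-contained Python sketch. It is not the project's implementation: the function names, the geometric-mean aggregation of step scores, the 0.5 threshold, and the Low and Medium candidate counts are all assumptions made for illustration. Only the High-mode value of 32 candidates comes from the text.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Assumed mode-to-candidate mapping: only "high": 32 comes from the text;
# the "low" and "medium" values are illustrative placeholders.
CANDIDATES_PER_MODE = {"low": 2, "medium": 8, "high": 32}


def chain_score(step_scores: Sequence[float]) -> float:
    """Aggregate per-step scores in (0, 1] into one chain score.

    The geometric mean used here is an assumption of this sketch.
    """
    if not step_scores:
        return 0.0
    # Clamp each score to avoid log(0) when a step is scored exactly zero.
    log_sum = sum(math.log(max(s, 1e-12)) for s in step_scores)
    return math.exp(log_sum / len(step_scores))


def select_best(candidates: Sequence[str],
                score_steps: Callable[[str], Sequence[float]]) -> str:
    """Return the candidate chain with the highest aggregated step score."""
    if not candidates:
        raise ValueError("need at least one candidate chain")
    return max(candidates, key=lambda c: chain_score(score_steps(c)))


def pseudo_labels(step_scores: Sequence[float],
                  answer_correct: bool,
                  threshold: float = 0.5) -> List[Tuple[int, bool]]:
    """Build (label, keep) pairs for every step of one reasoning chain.

    Each step inherits the final-answer correctness as its label. A step is
    kept for training only when the scorer's current judgment agrees with
    that label, which is one simple way to filter noisy pseudo-labels.
    """
    label = 1 if answer_correct else 0
    return [(label, (s > threshold) == answer_correct) for s in step_scores]


if __name__ == "__main__":
    # Toy step scores standing in for the output of a scoring head.
    toy_scores = {"chain A": [0.9, 0.8, 0.9], "chain B": [0.9, 0.2, 0.9]}
    print(select_best(list(toy_scores), toy_scores.__getitem__))
    print(pseudo_labels(toy_scores["chain B"], answer_correct=True))
```

In the toy run, chain B is penalized by its single weak step, so chain A is selected. In a real system, `score_steps` would be backed by the Scoring Head, and the number of candidates would be chosen according to the reasoning mode.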
Project Links for MetaStone-S1
- GitHub Repository: https://github.com/MetaStone-AI/MetaStone-S1
- HuggingFace Model Hub: https://huggingface.co/MetaStoneTec
- arXiv Technical Paper: https://arxiv.org/pdf/2507.00195
Application Scenarios for MetaStone-S1
- Smart Education: Acts as an “AI tutor” that solves math and physics competition problems and generates step-by-step, interactive solution paths.
- Legal Intelligence: Analyzes contract clauses in depth, identifies potential legal risks, and offers logically consistent revision suggestions.
- Intelligent Manufacturing: Uses multi-level causal reasoning to locate the root causes of industrial equipment failures and to propose repair plans, helping to improve productivity.
- Academic Writing: Assists with formula derivation and theoretical validation in scientific papers, supporting logical rigor in academic content.