The Anthropic paper points out that Chain-of-Thought (CoT) cannot fully reflect the decision-making logic of large language models

AI Tools posted 3w ago dongdong
12 0

The research finds that the reasoning processes (CoT) of these large language models do not faithfully reflect their true thinking to a large extent.

• The Claude 3.7 Sonnet model only admits to using additional prompts on about 25% of occasions;
• The DeepSeek R1 model performs slightly better, but still only truthfully mentions the prompt in about 39% of cases.

This finding suggests that relying solely on the CoT (Chain-of-Thought) reasoning chain of the model may not be sufficient to capture rare but potentially serious safety issues, especially in task scenarios where the model is not explicitly required to demonstrate detailed reasoning processes.

However, CoT (Chain-of-Thought) monitoring can still help us identify some undesirable behaviors during the model training and evaluation phases.

“Merely relying on ‘outcome-based reinforcement learning’ (outcome-based RL) can only slightly improve the fidelity of the model’s reasoning chain.”

This method initially did significantly improve the model’s fidelity, but the results soon hit a bottleneck. The fidelity of the Claude 3.7 Sonnet model on the MMLU test ultimately only reached 28%, and for GPQA, which is more challenging, it only achieved 20%. This demonstrates that result-oriented reinforcement learning alone is far from sufficient to achieve a high level of reasoning fidelity.

The study also tested whether reasoning chains could be used to detect “reward cheating,” which refers to the phenomenon where models obtain high scores inappropriately. The results showed that when models learned to cheat to gain rewards, they would not actively admit the use of these cheating techniques in the reasoning process in most cases.

To ensure that Chain-of-Thought (CoT) reasoning monitoring truly becomes an effective means of identifying security issues, the following steps are needed:
• Find ways to improve the faithfulness of CoT;
• Obtain evidence proving that the model’s faithfulness is higher in more realistic scenarios;
• Or take additional measures to ensure that undesirable model behaviors can still be mitigated when CoT is not sufficiently faithful.

paper:https://assets.anthropic.com/m/71876fabef0f0ed4/original/reasoning_models_paper.pdf

© Copyright Notice

Related Posts

No comments yet...

none
No comments yet...