Check out the full paper with Omar Montasser & John Lafferty!
* Paper: arxiv.org/pdf/2505.15927
* Blog: awni.xyz/cot-info/
And come by our poster at NeurIPS in San Diego: neurips.cc/virtual/2025...
#NeurIPS2025 #MachineLearningTheory #LLM #ChainOfThought
[10/10]
Implications for LLM research
* When designing annotation pipelines, investing in rich reasoning traces boosts data efficiency.
* A trace's value depends on how much of the internal computation it reveals.
* Our framework makes "trace quality" measurable through an information-theoretic lens.
[9/10]
Theory meets practice
We empirically validate the theory's predictions in simple settings where the CoT information can be computed exactly.
We find that the theory closely predicts the observed sample-efficiency gains.
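As a minimal illustration in that spirit (a contrived toy of our own, not the paper's experimental setup), consider a finite class where every wrong hypothesis is nearly indistinguishable from the target by its answers alone but gives itself away in its trace:

```python
# Toy construction (ours, not the paper's): wrong hypothesis a = 1..N-1
# disagrees with the target a = 0 on exactly one input, so its end-to-end
# error is eps = 1/N, yet its trace differs from the target's on every input.
import random

random.seed(0)
N = 100  # number of hypotheses; each wrong one has error eps = 1/N

def label(a, x):
    # End-to-end prediction of hypothesis a on input x.
    return 1 if (a != 0 and x == a) else 0

def trace(a, x):
    # Reasoning trace; in this toy it simply exposes the hypothesis index.
    return a

def samples_until_clean(use_trace):
    # Draw examples from the target until no wrong hypothesis survives.
    alive, n = set(range(N)), 0
    while alive != {0}:
        n += 1
        x = random.randrange(N)
        if use_trace:  # CoT supervision: must match the trace and the label
            alive = {a for a in alive
                     if trace(a, x) == trace(0, x) and label(a, x) == label(0, x)}
        else:          # end-to-end supervision: must match the label only
            alive = {a for a in alive if label(a, x) == label(0, x)}
    return n

T = 50
print("end-to-end:", sum(samples_until_clean(False) for _ in range(T)) / T)
print("CoT       :", sum(samples_until_clean(True) for _ in range(T)) / T)
```

End-to-end learning has to coupon-collect all N-1 disagreement points (several hundred samples on average for N = 100), while a single CoT example eliminates every wrong hypothesis at once.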
[8/10]
CoT information is fundamental
Our theory provides both upper and lower bounds, showing that the CoT information is a fundamental measure of the power of CoT supervision.
[7/10]
Interpretation
The CoT information CoT-Info(ε) captures the statistical advantage of CoT data.
The ratio CoT-Info(ε) / ε can be interpreted as the relative value of a CoT sample compared to an end-to-end sample.
For many reasoning tasks, CoT-Info(ε) ≫ ε, yielding much faster learning.
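To make the ratio concrete, an illustrative calculation with made-up numbers (ours, not from the paper): take ε = 0.01 and suppose CoT-Info(ε) = 0.5. Then

```latex
\[
m_{\text{E2E}} \approx \frac{1}{\varepsilon} = 100,
\qquad
m_{\text{CoT}} \approx \frac{1}{\text{CoT-Info}(\varepsilon)} = 2,
\qquad
\frac{m_{\text{E2E}}}{m_{\text{CoT}}}
  = \frac{\text{CoT-Info}(\varepsilon)}{\varepsilon} = 50,
\]
```

i.e., in this hypothetical, one CoT sample is worth about fifty end-to-end samples.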
[6/10]
[Image: mathematical result on the sample complexity of learning with CoT supervision.]
The Theory
To rule out hypotheses with error ε, classical theory tells us we need roughly O(1/ε) samples: each end-to-end example exposes an ε-bad hypothesis with probability only about ε.
We prove that under CoT supervision, the sample complexity improves to O(1/CoT-Info(ε)).
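Schematically, for a finite hypothesis class in the realizable setting (our shorthand for the comparison; the paper's exact statements and constants may differ):

```latex
\[
m_{\text{E2E}}(\varepsilon) = O\!\left(\frac{\log |\mathcal{H}|}{\varepsilon}\right),
\qquad
m_{\text{CoT}}(\varepsilon) = O\!\left(\frac{\log |\mathcal{H}|}{\text{CoT-Info}(\varepsilon)}\right).
\]
```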
[5/10]
[Image: mathematical definition of the CoT information.]
The Insight: CoT supervision doesn't just tell the model what to predict; it constrains how it thinks.
We formalize this by introducing the "CoT information": a measure of the extra discriminative power gained by observing the reasoning trace, not just the label.
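Since the definition itself lives in the screenshot above, here is the shape such a definition can take (our paraphrase of "extra discriminative power," not the paper's exact formula): among hypotheses with end-to-end error at least ε, find the one whose traces are hardest to tell apart from the target's, and measure the information a single CoT example carries against it:

```latex
\[
\text{CoT-Info}(\varepsilon)
  = \min_{h \in \mathcal{H}:\, \mathrm{err}(h) \ge \varepsilon}
    \; -\log \Pr_{x \sim D}\big[\, h \text{ reproduces the full trace of } h^{\star} \text{ on } x \,\big].
\]
```

If the trace includes the final answer, matching the trace implies matching the label, so this quantity is at least −log(1−ε) ≈ ε and collapses to the classical rate exactly when traces reveal nothing beyond the label.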
[4/10]
Core Problem: Training a model only on Input→Output pairs (end-to-end) is like teaching a student math by showing them only the final answers.
To learn complex reasoning this way, you need a massive amount of data to rule out all the "wrong ways" to get the "right answer."
[3/10]
Large language models have been transformed by the shift from "learn to predict the final answer" to "learn to predict the reasoning process" via chain-of-thought supervision.
Can we understand why this works through a statistical lens, and quantify the advantage?
[2/10]
[Image: a screenshot of the NeurIPS conference page for this paper.]
Spotlight #NeurIPS2025 Paper on the Foundations of Chain-of-Thought Learning
Excited to share our work developing a learning-theoretic account of the statistical advantage of chain-of-thought supervision in reasoning systems!
Blog: awni.xyz/cot-info
[1/10]