Check out the full paper with Omar Montasser & John Lafferty!
* Paper: arxiv.org/pdf/2505.15927
* Blog: awni.xyz/cot-info/
And come by our poster at NeurIPS in San Diego: neurips.cc/virtual/2025...
#NeurIPS2025 #MachineLearningTheory #LLM #ChainOfThought
[10/10]
Implications for LLM research
* When designing annotation pipelines, investing in rich reasoning traces boosts data efficiency.
* A trace's value depends on how much of the internal computation it reveals.
* Our framework makes "trace quality" measurable through an information-theoretic lens.
[9/10]
Theory meets practice
We empirically validate the theory's predictions in simple settings where the CoT information can be computed exactly.
We find that the theory closely predicts the observed sample-efficiency gains.
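As a minimal illustration in that spirit (a contrived toy of our own, not the paper's experimental setup), consider a finite class where every wrong hypothesis is nearly indistinguishable from the target by its answers alone but gives itself away in its trace:

```python
# Toy construction (ours, not the paper's): wrong hypothesis a = 1..N-1
# disagrees with the target a = 0 on exactly one input, so its end-to-end
# error is eps = 1/N, yet its trace differs from the target's on every input.
import random

random.seed(0)
N = 100  # number of hypotheses; each wrong one has error eps = 1/N

def label(a, x):
    # End-to-end prediction of hypothesis a on input x.
    return 1 if (a != 0 and x == a) else 0

def trace(a, x):
    # Reasoning trace; in this toy it simply exposes the hypothesis index.
    return a

def samples_until_clean(use_trace):
    # Draw examples from the target until no wrong hypothesis survives.
    alive, n = set(range(N)), 0
    while alive != {0}:
        n += 1
        x = random.randrange(N)
        if use_trace:  # CoT supervision: must match the trace and the label
            alive = {a for a in alive
                     if trace(a, x) == trace(0, x) and label(a, x) == label(0, x)}
        else:          # end-to-end supervision: must match the label only
            alive = {a for a in alive if label(a, x) == label(0, x)}
    return n

T = 50
print("end-to-end:", sum(samples_until_clean(False) for _ in range(T)) / T)
print("CoT       :", sum(samples_until_clean(True) for _ in range(T)) / T)
```

End-to-end learning has to coupon-collect all N-1 disagreement points (several hundred samples on average for N = 100), while a single CoT example eliminates every wrong hypothesis at once.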
[8/10]
CoT information is fundamental
Our theory provides both upper and lower bounds, showing that the CoT information is a fundamental measure of the power of CoT supervision.
[7/10]
Interpretation
The CoT information CoT-Info(ε) captures the statistical advantage of CoT data.
The ratio CoT-Info(ε) / ε can be interpreted as the relative value of a CoT sample compared to an end-to-end sample.
For many reasoning tasks, CoT-Info(ε) ≫ ε, yielding much faster learning.
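To make the ratio concrete, an illustrative calculation with made-up numbers (ours, not from the paper): take ε = 0.01 and suppose CoT-Info(ε) = 0.5. Then

```latex
\[
m_{\text{E2E}} \approx \frac{1}{\varepsilon} = 100,
\qquad
m_{\text{CoT}} \approx \frac{1}{\text{CoT-Info}(\varepsilon)} = 2,
\qquad
\frac{m_{\text{E2E}}}{m_{\text{CoT}}}
  = \frac{\text{CoT-Info}(\varepsilon)}{\varepsilon} = 50,
\]
```

i.e., in this hypothetical, one CoT sample is worth about fifty end-to-end samples.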
[6/10]
[Image: mathematical result on the sample complexity of learning with CoT supervision.]
The Theory
To rule out hypotheses with error ε, classical theory tells us we need roughly O(1/ε) samples: each end-to-end example exposes an ε-bad hypothesis with probability only about ε.
We prove that under CoT supervision, the sample complexity improves to O(1/CoT-Info(ε)).
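Schematically, for a finite hypothesis class in the realizable setting (our shorthand for the comparison; the paper's exact statements and constants may differ):

```latex
\[
m_{\text{E2E}}(\varepsilon) = O\!\left(\frac{\log |\mathcal{H}|}{\varepsilon}\right),
\qquad
m_{\text{CoT}}(\varepsilon) = O\!\left(\frac{\log |\mathcal{H}|}{\text{CoT-Info}(\varepsilon)}\right).
\]
```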
[5/10]
[Image: mathematical definition of the CoT information.]
The Insight: CoT supervision doesn't just tell the model what to predict; it constrains how it thinks.
We formalize this by introducing the "CoT information": a measure of the extra discriminative power gained by observing the reasoning trace, not just the label.
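Since the definition itself lives in the screenshot above, here is the shape such a definition can take (our paraphrase of "extra discriminative power," not the paper's exact formula): among hypotheses with end-to-end error at least ε, find the one whose traces are hardest to tell apart from the target's, and measure the information a single CoT example carries against it:

```latex
\[
\text{CoT-Info}(\varepsilon)
  = \min_{h \in \mathcal{H}:\, \mathrm{err}(h) \ge \varepsilon}
    \; -\log \Pr_{x \sim D}\big[\, h \text{ reproduces the full trace of } h^{\star} \text{ on } x \,\big].
\]
```

If the trace includes the final answer, matching the trace implies matching the label, so this quantity is at least −log(1−ε) ≈ ε and collapses to the classical rate exactly when traces reveal nothing beyond the label.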
[4/10]
Core Problem: Training a model only on Input→Output pairs (end-to-end) is like teaching a student math by showing them only the final answers.
To learn complex reasoning this way, you need a massive amount of data to rule out all the "wrong ways" to get the "right answer."
[3/10]
Large language models have been transformed by the shift from "learn to predict the final answer" to "learn to predict the reasoning process" via chain-of-thought supervision.
Can we understand why this works through a statistical lens, and quantify the advantage?
[2/10]
[Image: a screenshot of the NeurIPS conference page for this paper.]
Spotlight #NeurIPS2025 Paper on the Foundations of Chain-of-Thought Learning
Excited to share our work developing a learning-theoretic account of the statistical advantage of chain-of-thought supervision in reasoning systems!
Blog: awni.xyz/cot-info
[1/10]