
Sham Kakade

@shamkakade

Harvard Professor. ML and AI. Co-director of the Kempner Institute. https://shamulent.github.io

912
Followers
89
Following
5
Posts
21.11.2024
Joined

Latest posts by Sham Kakade @shamkakade

Alignment reduces conceptual diversity of language models - Kempner Institute As large language models (LLMs) have become more sophisticated, there’s been growing interest in using LLM-generated responses in place of human data for tasks such as polling, user studies, and […]

NEW blog post: Do modern #LLMs capture the conceptual diversity of human populations? #KempnerInstitute researchers find #alignment reduces conceptual diversity of language models. bit.ly/4hNjtiI

10.02.2025 15:19 👍 12 🔁 3 💬 0 📌 0

NEW in the #KempnerInstitute blog: learn about ProCyon, a multimodal foundation model for modeling, generating & predicting protein phenotypes. Read it here: bit.ly/4fA8xUk

19.12.2024 19:22 👍 6 🔁 1 💬 0 📌 0
https://bit.ly/4iohnqE

Calling college grads interested in intelligence research: the application for the #KempnerInstitute's post-bac program w/ the Harvard Kenneth C. Griffin Graduate School of Arts and Sciences Office for Equity, Diversity, Inclusion & Belonging is now open! Apply by Feb. 1, 2025.

t.co/jdJrzRegL0

09.12.2024 19:43 👍 14 🔁 4 💬 0 📌 0
Loss-to-Loss Prediction - Kempner Institute Scaling laws – which reliably predict the performance of large language models (LLMs) as a function of their size and the amount of data they have been trained on – […]

NEW in the #KempnerInstitute blog: A method to predict how #LLMs scale w/ compute across different datasets. Read it here:

09.12.2024 20:44 👍 7 🔁 2 💬 0 📌 0

LLM self-improvement has critical implications for synthetic data generation, post-training, and test-time inference. To understand LLMs' true capacity for self-improvement, we performed large-scale experiments across multiple families of LLMs, tasks, and mechanisms. Here is what we found: (1/9)

06.12.2024 18:02 👍 12 🔁 4 💬 1 📌 1

NEW: we have an exciting opportunity for a tenure-track professor at the #KempnerInstitute and the John A. Paulson School of Engineering and Applied Sciences (SEAS). Read the full description & apply today: academicpositions.harvard.edu/postings/14362
#ML #AI

03.12.2024 01:24 👍 20 🔁 19 💬 0 📌 1

(5/n) 🤝 Shoutout to some great collaborators:
@hanlin_zhang, @depen_morwani, @vyasnikhil96, @uuujingfeng, @difanzou, @udayaghai
#AI #ML #ScalingLaws

22.11.2024 20:19 👍 1 🔁 0 💬 0 📌 0
How Does Critical Batch Size Scale in Pre-training? Training large-scale models under given resources requires careful design of parallelism strategies. In particular, the efficiency notion of critical batch size (CBS), concerning the compromise betwee...

(4/n) 🧠 Want theory? We provide rigorous justifications, identify critical hyperparameters, and characterize learning-rate decay into the overtraining regime.
Check out the details here:
📄 arxiv.org/abs/2410.21676
📝 Blog: tinyurl.com/ysufbwsr

22.11.2024 20:19 👍 0 🔁 0 💬 1 📌 0

(3/n) 📊 From our controlled experiments on language models:
📈CBS increases as dataset size grows
🤏CBS remains weakly dependent on model size
Data size, not model size, drives parallel efficiency for large-scale pre-training.

22.11.2024 20:19 👍 1 🔁 0 💬 1 📌 0
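The "CBS grows with data size" finding is a power-law relationship, which can be recovered from measurements with a simple log-log fit. A minimal sketch, using hypothetical (dataset size, CBS) pairs chosen for illustration (the actual values and exponent come from the paper's controlled experiments):

```python
import math

# Hypothetical (tokens, CBS) pairs for illustration only; real
# measurements come from the paper's controlled experiments.
# These numbers follow CBS ∝ tokens**0.5 exactly.
tokens = [1e9, 4e9, 16e9, 64e9]
cbs = [128, 256, 512, 1024]

# Fit log(CBS) = a * log(tokens) + b by ordinary least squares;
# the slope a is the scaling exponent.
xs = [math.log(t) for t in tokens]
ys = [math.log(c) for c in cbs]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
a = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum(
    (x - xbar) ** 2 for x in xs
)
b = ybar - a * xbar
print(f"fitted exponent: {a:.2f}")  # 0.50 for these synthetic numbers
```

The recovered exponent is 0.50 only because the synthetic points were generated that way; the point is the method, not the number.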

(2/n) 🤔 How does CBS scale with model size and data size in pre-training? We find that CBS scales with data size and is largely invariant to model size. Prior beliefs that CBS scales with model size may have stemmed from Chinchilla’s coupled N-D scaling.

22.11.2024 20:19 👍 0 🔁 0 💬 1 📌 0

(1/n) 💡How can we speed up the serial runtime of long pre-training runs? Enter Critical Batch Size (CBS): the tipping point beyond which the gains of data parallelism hit diminishing returns. Below CBS, doubling the batch size roughly halves the number of optimization steps; past it, further gains shrink rapidly.

22.11.2024 20:19 👍 16 🔁 4 💬 2 📌 0
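The trade-off described in (1/n) is often summarized with a hyperbolic steps-vs-batch-size relation from the large-batch-training literature: steps to reach a target loss shrink as batch size grows, but saturate once the batch passes the critical size. A small sketch with illustrative constants (s_min and b_crit are made up, not values from the paper):

```python
def steps_to_target(batch_size, s_min=1000, b_crit=256):
    # Hyperbolic trade-off: with small batches, doubling batch_size
    # roughly halves the step count; once batch_size >> b_crit, the
    # step count saturates at s_min and extra parallelism is wasted.
    return s_min * (1 + b_crit / batch_size)

for b in [32, 64, 128, 256, 512, 1024]:
    print(f"batch {b:5d} -> {steps_to_target(b):7.0f} steps")
```

Here steps_to_target(32) / steps_to_target(64) ≈ 1.8 (near the ideal 2x), while steps_to_target(512) / steps_to_target(1024) ≈ 1.2, illustrating the diminishing returns past CBS.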

How does test loss change as we change the training data? And how does this interact with scaling laws?

We propose a methodology to approach these questions by showing that we can predict the performance across datasets and losses with simple shifted power law fits.

21.11.2024 15:11 👍 19 🔁 7 💬 1 📌 2
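A shifted power law of the kind the post describes maps the loss on one dataset to the loss on another. A minimal sketch with hypothetical fitted constants (k, kappa, e_a, e_b are placeholders; the offsets play the role of per-dataset irreducible-loss terms):

```python
def shifted_power_law(loss_a, k=1.1, kappa=0.9, e_a=1.8, e_b=2.0):
    # Predict loss on dataset B from loss on dataset A via
    #   L_B = k * (L_A - e_a)**kappa + e_b
    # All four constants would normally be fit to paired loss
    # measurements across model scales; these values are illustrative.
    return k * (loss_a - e_a) ** kappa + e_b

# Predicted dataset-B loss for a model reaching loss 3.0 on dataset A.
print(f"{shifted_power_law(3.0):.3f}")
```

Fitting the four constants to a handful of (L_A, L_B) pairs at small scale then lets you extrapolate the curve to larger training runs.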