
Giles Thomas

@gilesthomas.com

On sabbatical / created @PythonAnywhere.com, which found a home at @anacondainc.bsky.social / XP / Python / PSF Fellow / opinions my own / blog at https://www.gilesthomas.com

204 Followers · 66 Following · 102 Posts · Joined 25.08.2023

Latest posts by Giles Thomas @gilesthomas.com

Writing an LLM from scratch, part 32d -- Interventions: adding attention bias Having bias terms for the query, key, and value attention weights is apparently no longer used because it doesn't help. Let's check that it really doesn't for our model!

Now this was a surprise. QKV bias is not meant to be useful -- but with my GPT-2 small model, it looks like it is! www.gilesthomas.com/2026/02/llm-...

07.02.2026 00:13 πŸ‘ 1 πŸ” 1 πŸ’¬ 0 πŸ“Œ 0
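For anyone following along, the knob being tested is just the `bias` argument on the query/key/value projections. A minimal single-head sketch (the module and sizes here are illustrative, not taken from the post):

```python
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    """Single-head causal self-attention with an optional QKV bias."""
    def __init__(self, d_model, qkv_bias=False):
        super().__init__()
        # GPT-2 shipped with these biases enabled; most newer models drop them.
        self.w_q = nn.Linear(d_model, d_model, bias=qkv_bias)
        self.w_k = nn.Linear(d_model, d_model, bias=qkv_bias)
        self.w_v = nn.Linear(d_model, d_model, bias=qkv_bias)

    def forward(self, x):
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        scores = q @ k.transpose(-2, -1) / (x.shape[-1] ** 0.5)
        # mask out future positions so attention stays causal
        mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        weights = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
        return weights @ v

x = torch.randn(1, 4, 8)
out = CausalSelfAttention(8, qkv_bias=True)(x)
```

GPT-2 itself was trained with these biases present, which may be part of why re-adding them helps a GPT-2-small-scale reproduction.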
Writing an LLM from scratch, part 32c -- Interventions: removing dropout Does removing dropout improve our baseline model's test loss? Yes, absolutely, and even more than gradient clipping did.

Does removing dropout improve our baseline model's test loss? Yes, absolutely, and much more than gradient clipping did.

www.gilesthomas.com/2026/02/llm-...

05.02.2026 23:40 πŸ‘ 0 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
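The intervention itself is a one-line change: set the dropout probability to zero (or drop the `nn.Dropout` layers entirely). A toy sketch with illustrative sizes and rates:

```python
import torch
import torch.nn as nn

def make_ffn(d_model, drop_rate):
    """Feed-forward block; drop_rate=0.0 makes the Dropout a no-op."""
    return nn.Sequential(
        nn.Linear(d_model, 4 * d_model),
        nn.GELU(),
        nn.Linear(4 * d_model, d_model),
        nn.Dropout(drop_rate),  # identity when drop_rate == 0.0
    )

baseline = make_ffn(8, drop_rate=0.1)      # baseline keeps dropout
intervention = make_ffn(8, drop_rate=0.0)  # the "remove dropout" run
y = intervention(torch.randn(2, 8))
```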
Writing an LLM from scratch, part 32b -- Interventions: gradient clipping Does adding gradient clipping improve our baseline model by lessening the loss spikes during training? It does, but it turned out to be more of a rabbit hole than I expected.

First "intervention" test: does adding gradient clipping improve our baseline model by lessening the loss spikes during training? It does, but it turned out to be more of a rabbit hole than I expected.

www.gilesthomas.com/2026/02/llm-...

05.02.2026 01:23 πŸ‘ 1 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
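Gradient clipping slots into the training loop between `backward()` and `step()`. A minimal sketch; the `max_norm` of 1.0 is a common default, not necessarily the value the post settles on:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

x, target = torch.randn(8, 4), torch.randn(8, 1)
loss = nn.functional.mse_loss(model(x), target)

opt.zero_grad()
loss.backward()
# Rescale gradients so their global L2 norm is at most 1.0,
# which tames the occasional huge update behind loss spikes.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```

`clip_grad_norm_` returns the pre-clip norm, which is handy to log so you can see how often clipping actually fires.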
Writing an LLM from scratch, part 32a -- Interventions: training a baseline model I want to try a bunch of interventions like gradient clipping and removing dropout to see if my models get better. I need a baseline train without them so that I can be sure of the results.

Back to my LLM from scratch series. I want to train the *best* GPT-2-style model that I can locally in two days, and there are various levers to pull. Working out which ones work means I need a baseline for comparison.

www.gilesthomas.com/2026/02/llm-...

04.02.2026 02:10 πŸ‘ 0 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
Getting a custom PyTorch LLM onto the Hugging Face Hub (Transformers: AutoModel, pipeline, and Trainer) A worked example of packaging a from-scratch GPT-2-style model for the Hugging Face Hub so it loads via from_pretrained, runs with pipeline, and trains with Trainer -- with notes on tokeniser gotchas.

I wanted to get a custom LLM up onto the @hf.co Hub, and couldn't find an in-depth tutorial. Here's the one I wish I'd found before I got started: www.gilesthomas.com/2026/01/cust...

28.01.2026 23:03 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
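The core of the packaging recipe is a `PretrainedConfig`/`PreTrainedModel` pair registered for auto-class loading. A heavily trimmed sketch; the class names and the `model_type` string are hypothetical, and the actual transformer blocks are elided:

```python
import torch
import torch.nn as nn
from transformers import PretrainedConfig, PreTrainedModel

class MyGPTConfig(PretrainedConfig):
    model_type = "my-gpt"  # hypothetical identifier

    def __init__(self, vocab_size=50257, d_model=768, **kwargs):
        self.vocab_size = vocab_size
        self.d_model = d_model
        super().__init__(**kwargs)

class MyGPTModel(PreTrainedModel):
    config_class = MyGPTConfig

    def __init__(self, config):
        super().__init__(config)
        self.embed = nn.Embedding(config.vocab_size, config.d_model)
        # ...the from-scratch transformer blocks would go here...
        self.out = nn.Linear(config.d_model, config.vocab_size, bias=False)

    def forward(self, input_ids, **kwargs):
        return self.out(self.embed(input_ids))

# Register so the Hub can serve the custom code to AutoModel users;
# push_to_hub then uploads both the weights and these class definitions.
MyGPTConfig.register_for_auto_class()
MyGPTModel.register_for_auto_class("AutoModel")

model = MyGPTModel(MyGPTConfig(vocab_size=100, d_model=16))
```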
Writing an LLM from scratch, part 31 -- the models are now on Hugging Face I've trained seven models using the GPT-2 architecture: let's share them on Hugging Face!

I thought it would be good to share the base models I've been training on Hugging Face, and now they are :-)

www.gilesthomas.com/2026/01/llm-...

17.01.2026 19:57 πŸ‘ 1 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
Writing an LLM from scratch, part 30 -- digging into the LLM-as-a-judge results I was unhappy with the LLM-as-a-judge instruction fine-tuning results I got when comparing my various base models. Could I make them any better?

I wanted to dig into why the results I got on instruction fine-tuning for each of my models didn't seem to match up well with the loss on the test set. Got some interesting results: www.gilesthomas.com/2026/01/2026...

09.01.2026 01:16 πŸ‘ 0 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
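One recurring gotcha with LLM-as-a-judge setups is parsing a score out of a chatty reply. A minimal sketch of the scoring plumbing; the prompt wording and the 0-100 scale are illustrative, not necessarily what the post uses:

```python
import re

def build_judge_prompt(instruction, response):
    """Ask a judge model to score a response from 0 to 100."""
    return (
        f"Given the instruction `{instruction}` "
        f"and the model response `{response}`, "
        "score the response on a scale from 0 to 100. "
        "Respond with the integer number only."
    )

def parse_score(judge_reply):
    """Pull the first integer out of the judge's reply; judges often
    wrap the number in extra chatter, which breaks naive parsing."""
    match = re.search(r"\d+", judge_reply)
    return int(match.group()) if match else None

score = parse_score("Sure! I'd rate this 85 out of 100.")
```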
Writing an LLM from scratch, part 29 -- using DistributedDataParallel to train a base model from scratch in the cloud Having trained a base model from scratch on my own machine over 48 hours, I wanted to make it faster by training with multiple GPUs in the cloud.

Having trained a GPT-2 scale base model from scratch in 48 hours locally, I wanted to see if I could do the same faster and at a reasonable cost in the cloud. I could! www.gilesthomas.com/2026/01/llm-...

07.01.2026 20:48 πŸ‘ 0 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
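The DistributedDataParallel wiring is mostly boilerplate. A minimal sketch of the setup, assuming a `torchrun` launch (the backend and environment-variable names are the standard ones; the post's training loop is elided):

```python
import os
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_and_wrap(model):
    """Wrap a model for multi-GPU training. Run under torchrun,
    which sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment."""
    torch.distributed.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    # DDP averages gradients across processes after each backward()
    return DDP(model.to(local_rank), device_ids=[local_rank])

# launched as, e.g.: torchrun --nproc_per_node=4 train.py
```

Each process also needs a `DistributedSampler` on its DataLoader so the GPUs see disjoint slices of the training data.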
Writing an LLM from scratch, part 28 -- training a base model from scratch on an RTX 3090 I felt like it should be possible to train a GPT-2 small level model on my own hardware using modern tools and open datasets from scratch. It was!

I managed to train my own base model from scratch on an RTX 3090! Very detailed notes here: www.gilesthomas.com/2025/12/llm-...

02.12.2025 18:19 πŸ‘ 0 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
Writing an LLM from scratch, part 27 -- what's left, and what's next? Having finished the main body of 'Build an LLM (from scratch)', it's time to think about what I need to do to treat this project as fully done.

So, what's left to do in my series on building an LLM from scratch? And what follow-up series should I work on? Some musings: www.gilesthomas.com/2025/11/llm-...

04.11.2025 00:52 πŸ‘ 0 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
Writing an LLM from scratch, part 26 -- evaluating the fine-tuned model Coming to the end of 'Build an LLM (from scratch)'! We evaluate the quality of the responses our model produces.

The end of the beginning... running evals on our model using Llama 3 is the last part of the main body of @sebastianraschka.com's "Build an LLM (from scratch)". Here's my writeup:

www.gilesthomas.com/2025/11/llm-...

03.11.2025 19:43 πŸ‘ 0 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
Writing an LLM from scratch, part 25 -- instruction fine-tuning Some notes on the first part of chapter 7 of 'Build an LLM (from scratch)': instruction fine-tuning

Back on track with chapter 7 of "Build an LLM (from scratch)": notes on instruction fine-tuning of our GPT-2 model:

www.gilesthomas.com/2025/10/llm-...

29.10.2025 21:07 πŸ‘ 0 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
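Before any training happens, the instruction data gets formatted into prompts. A sketch of an Alpaca-style template of the kind used in the book's chapter 7 (the field names instruction/input/output are assumed):

```python
def format_example(entry):
    """Format one instruction-tuning example as an Alpaca-style prompt."""
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    if entry.get("input"):  # the input field is optional
        prompt += f"\n\n### Input:\n{entry['input']}"
    return prompt + f"\n\n### Response:\n{entry['output']}"

text = format_example({
    "instruction": "Rewrite in passive voice",
    "input": "The cat ate the food",
    "output": "The food was eaten by the cat",
})
```

At inference time the same template is used but truncated after `### Response:`, so the model's completion is the answer.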
Writing an LLM from scratch, part 24 -- the transcript hack Back when I started playing with LLMs, I found that you could build a (very basic) chatbot with a base model -- no instruction fine-tuning at all! Does that work with GPT-2?

Back when I started messing with LLMs, it looked to me like you could get reasonably OK results for chat applications without instruction fine-tuning. So before getting into Chapter 7 of "Build an LLM (from scratch)", I decided to see if that was really true:

www.gilesthomas.com/2025/10/llm-...

28.10.2025 20:20 πŸ‘ 0 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
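The "transcript hack" amounts to prompt framing: present the conversation as a transcript and let the base model, as a pure next-token predictor, continue it in the assistant's role. A sketch (the preamble wording is illustrative):

```python
def transcript_prompt(history, user_message):
    """Frame a chat as a transcript so a plain base model
    continues it as the assistant -- no fine-tuning involved."""
    lines = [
        "The following is a conversation between a user "
        "and a helpful assistant.",
        "",
    ]
    for speaker, text in history:
        lines.append(f"{speaker}: {text}")
    lines.append(f"User: {user_message}")
    lines.append("Assistant:")  # the model completes from here
    return "\n".join(lines)

prompt = transcript_prompt(
    [("User", "Hi!"), ("Assistant", "Hello!")],
    "What is 2+2?",
)
```

The main practical wrinkle is stopping generation when the model starts writing the next `User:` line itself.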
Retro Language Models: Rebuilding Karpathy’s RNN in PyTorch Revisiting Karpathy’s text-generating RNNs with PyTorch’s built-in LSTM class β€” a practical look at why training sequence models is so different from Transformers.

And the next step -- a code walkthrough of my PyTorch version of Karpathy's 2015-vintage RNNs.

www.gilesthomas.com/2025/10/retr...

24.10.2025 18:57 πŸ‘ 0 πŸ” 1 πŸ’¬ 0 πŸ“Œ 0
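A minimal version of that kind of model using PyTorch's built-in LSTM class, with illustrative sizes. The explicit recurrent `state` threaded between calls is the big practical difference from a Transformer:

```python
import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    """Char-level language model on top of nn.LSTM."""
    def __init__(self, vocab_size, hidden_size=128, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, num_layers,
                            batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, state=None):
        # the (hidden, cell) state carries context from one chunk of
        # text to the next, instead of attending over a whole window
        out, state = self.lstm(self.embed(x), state)
        return self.head(out), state

model = CharLSTM(vocab_size=65)
logits, state = model(torch.randint(0, 65, (1, 10)))
```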
Writing an LLM from scratch, part 23 -- fine-tuning for classification After all the hard work, chapter 6 in 'Build an LLM (from scratch)' is a nice easy one -- how do we take a next-token predictor and turn it into a classifier?

Chapter 6 was easy and fun! Fine-tuning an LLM for classification tasks, with some initially disappointing results -- but it all came out in the wash: www.gilesthomas.com/2025/10/llm-...

22.10.2025 23:06 πŸ‘ 0 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
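The turn-a-predictor-into-a-classifier move is small: swap the vocabulary-sized output head for an n-class one and read the logits off the final token, which has attended to the whole sequence. An illustrative sketch (sizes are mine):

```python
import torch
import torch.nn as nn

d_model, vocab_size, num_classes = 16, 100, 2

# stand-in for the pretrained LLM's next-token output head
lm_head = nn.Linear(d_model, vocab_size, bias=False)

# the classification trick: a tiny head over the same hidden states
clf_head = nn.Linear(d_model, num_classes)

hidden = torch.randn(4, 10, d_model)  # (batch, seq, d_model)
logits = clf_head(hidden)[:, -1, :]   # classify from the LAST token
```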
Writing an LLM from scratch, part 22 -- finally training our LLM! Finally, we train an LLM! The final part of Chapter 5 of Build an LLM (from Scratch) runs the model on real text, then loads OpenAI’s GPT-2 weights for comparison.

Part 22 is live: we finally train the LLM :-) Following @sebastianraschka.com's book, we train on Edith Wharton, then swap in GPT-2 (124M) weights for comparison. Notes on seeding, AdamW, temperature and top-k.

www.gilesthomas.com/2025/10/llm-...

15.10.2025 23:45 πŸ‘ 1 πŸ” 1 πŸ’¬ 2 πŸ“Œ 0
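Temperature and top-k both operate on the next-token logits at sampling time. A minimal sketch of the two combined; the particular values here are common defaults, not the post's settings:

```python
import torch

torch.manual_seed(42)  # for a reproducible sample

def sample_next_token(logits, temperature=1.0, top_k=50):
    """Sample a token id from next-token logits with
    top-k filtering and temperature scaling."""
    if top_k is not None:
        # zero out (via -inf) everything below the k-th largest logit
        kth = torch.topk(logits, top_k).values[..., -1]
        logits = torch.where(logits < kth,
                             torch.tensor(float("-inf")), logits)
    # temperature < 1 sharpens the distribution, > 1 flattens it
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)

logits = torch.randn(50257)  # GPT-2's vocab size
token = sample_next_token(logits, temperature=0.8, top_k=5)
```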