Now this was a surprise. QKV bias is not meant to be useful -- but with my GPT-2 small model, it looks like it is! www.gilesthomas.com/2026/02/llm-...
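For anyone wondering what the flag actually controls: in most GPT-2-style implementations, `qkv_bias` just toggles the bias term on the query, key and value projection layers. A minimal sketch (class and parameter names here are illustrative, not taken from my code):

```python
import torch
import torch.nn as nn

class QKVProjection(nn.Module):
    """With qkv_bias=True, each projection learns an additive bias
    vector as well as a weight matrix: q = x @ W_q.T + b_q, and
    likewise for k and v. With qkv_bias=False the bias is omitted."""

    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        return self.W_query(x), self.W_key(x), self.W_value(x)
```

The original GPT-2 checkpoints include these biases, which is why loaders need the flag at all; the conventional wisdom is that they add little at larger scales.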
Does removing dropout improve our baseline model's test loss? Yes, absolutely, and much more than gradient clipping did.
www.gilesthomas.com/2026/02/llm-...
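In config terms, "removing dropout" usually just means setting the drop rate to zero, since PyTorch's dropout with p=0 is the identity. A quick sketch of the behaviour being switched off:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 4)

# With p=0.0, dropout is the identity even in training mode, so
# "removing dropout" and setting drop_rate=0.0 in a model config
# are equivalent.
no_drop = nn.Dropout(p=0.0).train()
y = no_drop(x)  # identical to x

# With p=0.5, training mode zeroes each activation with probability
# 0.5 and rescales the survivors by 1/(1-p) = 2.
half_drop = nn.Dropout(p=0.5).train()
z = half_drop(x)
```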
First "intervention" test: does adding gradient clipping improve our baseline model by lessening the loss spikes during training? It does, but it turned out to be more of a rabbit hole than I expected.
www.gilesthomas.com/2026/02/llm-...
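The mechanics of the intervention itself are one line of PyTorch: `clip_grad_norm_` rescales all gradients so their combined L2 norm is at most `max_norm`. A sketch of where it goes in a training step, using a hypothetical tiny model (1.0 is a common `max_norm` choice, not necessarily what my experiments settled on):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

def training_step(x, y, max_norm=1.0):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    # Clip *after* backward, *before* the optimizer step: rescales
    # every gradient in place so the total L2 norm is <= max_norm.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()
```

The point is to cap the size of any single update, which is what tames the loss spikes: a batch that produces an enormous gradient can no longer throw the weights far off course.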
Back to my LLM from scratch series. I want to train the *best* GPT-2-style model that I can locally in two days, and there are various levers to pull. Working out which ones work means I need a baseline for comparison.
www.gilesthomas.com/2026/02/llm-...
I wanted to get a custom LLM up onto the @hf.co Hub, and couldn't find an in-depth tutorial. Here's the one I wish I'd found before I got started: www.gilesthomas.com/2026/01/cust...
I thought it would be good to share the base models I've been training on Hugging Face, and now they are :-)
www.gilesthomas.com/2026/01/llm-...
I wanted to dig into why the results I got on instruction fine-tuning for each of my models didn't seem to match up well with the loss on the test set. Got some interesting results: www.gilesthomas.com/2026/01/2026...
Having trained a GPT-2 scale base model from scratch in 48 hours locally, I wanted to see if I could do the same faster and at a reasonable cost in the cloud. I could! www.gilesthomas.com/2026/01/llm-...
I managed to train my own base model from scratch on an RTX 3090! Very detailed notes here: www.gilesthomas.com/2025/12/llm-...
So, what's left to do in my series on building an LLM from scratch? And what follow-up series should I work on? Some musings: www.gilesthomas.com/2025/11/llm-...
The end of the beginning... running evals on our model using Llama 3 is the last part of the main body of @sebastianraschka.com's "Build an LLM (from scratch)". Here's my writeup:
www.gilesthomas.com/2025/11/llm-...
Back on track with chapter 7 of "Build an LLM (from scratch)": notes on instruction fine-tuning of our GPT-2 model:
www.gilesthomas.com/2025/10/llm-...
Back when I started messing with LLMs, it looked to me like you could get reasonably OK results for chat applications without instruction fine-tuning. So before getting into Chapter 7 of "Build an LLM (from scratch)", I decided to see if that was really true:
www.gilesthomas.com/2025/10/llm-...
And the next step -- a code walkthrough of my PyTorch version of Karpathy's 2015-vintage RNNs.
www.gilesthomas.com/2025/10/retr...
Chapter 6 was easy and fun! Fine-tuning an LLM for classification tasks, with some initially disappointing results -- but it all came out in the wash: www.gilesthomas.com/2025/10/llm-...
Part 22 is live: we finally train the LLM :-) Following @sebastianraschka.com's book, we train on Edith Wharton, then swap in GPT-2 (124M) weights for comparison. Notes on seeding, AdamW, temperature and top-k.
www.gilesthomas.com/2025/10/llm-...
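For anyone who hasn't met temperature and top-k before: temperature rescales the logits before softmax (below 1 sharpens the distribution, above 1 flattens it), and top-k discards everything outside the k highest-scoring tokens before sampling. A minimal sketch of the idea (not a copy of the book's code):

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=None):
    """Sample one token id from a 1-D logits vector.

    temperature < 1 sharpens the distribution, > 1 flattens it;
    top_k keeps only the k highest-scoring tokens before sampling."""
    if top_k is not None:
        top_logits, _ = torch.topk(logits, top_k)
        # Mask everything below the k-th largest logit.
        logits = torch.where(logits < top_logits[-1],
                             torch.tensor(float("-inf")), logits)
    if temperature > 0:
        probs = torch.softmax(logits / temperature, dim=-1)
        return torch.multinomial(probs, num_samples=1).item()
    # temperature == 0 is treated as greedy decoding.
    return torch.argmax(logits).item()
```

With temperature 0 you always get the argmax; with top_k=1 likewise; loosening either introduces the controlled randomness that makes generated text less repetitive.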