
Tim Mastny

@timmastny

Software engineer bringing digital finance to Africa. Writes about low-level programming, data science, and algorithms at https://timmastny.com

20 Followers Β· 53 Following Β· 33 Posts Β· Joined 08.01.2025

Latest posts by Tim Mastny @timmastny


#statstab #317 Tests for Pairwise Mean Differences in R
by @timmastny.bsky.social

Thoughts: A nice illustration of what {emmeans} is doing when computing contrasts using various functions.

#emmeans #r #pairwise #ttest #tutorial #dataviz

timmastny.com/blog/tests-p...

08.04.2025 13:43 πŸ‘ 6 πŸ” 2 πŸ’¬ 0 πŸ“Œ 0
Manus turns out to be just Claude Sonnet + 29 other tools, Reflection 70B vibes ngl

Word is that it's just a wrapper around Claude 3.7! www.reddit.com/r/LocalLLaMA...

10.03.2025 17:03 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Dead Rise (Original Soundtrack) Ian Wright Β· EP Β· 2025 Β· 4 songs

My friend @timmastny.bsky.social made a game for the NES and asked me to write music. We made both the game and music under the constraints of the NES.

Check out the game on tmastny.itch.io/dead-rise

And the soundtrack on all streaming platforms!

Spotify link: open.spotify.com/album/2NDjba...

20.02.2025 19:42 πŸ‘ 3 πŸ” 1 πŸ’¬ 0 πŸ“Œ 1
False Sharing | Tim Mastny

This is called false sharing - when CPUs coordinate over variables they don't actually share.

In this case the penalty was small, but in other cases it can cause 5%, 25%, or even 8000% performance degradation. If you want to learn more, check out my post!
timmastny.com/blog/false-s...

18.02.2025 18:28 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

The cache line is the smallest unit of data unique to each CPU cache, commonly 64 or 128 bytes. If a CPU wants to modify a variable on a cache line, it must coordinate with all other CPUs to invalidate their copies, even if some of the variables on that line are not shared.

18.02.2025 18:28 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Variables `a` and `b` are not shared between the two threads: thread 0 uses `a` and thread 1 uses `b`. But since they are initialized next to each other in the program on the left, they share the same cache line.

18.02.2025 18:28 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
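The two layouts can be sketched in C++. This is a minimal sketch, not the code from the post: it assumes a 64-byte cache line, and the names (`Shared`, `Padded`, `run`) are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>

// In Shared, a and b are adjacent and normally land on one cache line,
// so each thread's writes keep invalidating the other CPU's copy.
struct Shared {
    long a;  // written only by thread 0
    long b;  // written only by thread 1
};

// alignas(64) pushes each counter onto its own cache line,
// so the threads no longer contend over a line they don't share.
struct Padded {
    alignas(64) long a;
    alignas(64) long b;
};

// Each thread hammers its own counter, mirroring the benchmark setup.
template <typename Counters>
void run(Counters& c, long iters) {
    std::thread t0([&c, iters] { for (long i = 0; i < iters; ++i) ++c.a; });
    std::thread t1([&c, iters] { for (long i = 0; i < iters; ++i) ++c.b; });
    t0.join();
    t1.join();
}
```

Timing `run` on both structs over millions of iterations is the shape of the experiment; the right padding (64 vs 128 bytes) depends on the target CPU.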

The program on the right is 0.7% faster measured over millions of iterations. Why?

18.02.2025 18:28 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Which program is faster? 1/

18.02.2025 18:28 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Expected Database Query Latency

Check out problem 2: sirupsen.com/napkin/probl...
And my napkin math: timmastny.com/projects/nap...

12.02.2025 17:48 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

It’s also random for main memory, but we only need to load one new cache line from main memory per page: 50 * 50 ns = 2.5 us. The prefetcher should ensure the sequential read of the rest of the page is fast.

Even if we have to fetch more cache lines, it’s much faster than 1ms!

12.02.2025 17:48 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Solution: I didn’t consider that each page access is random! A random 8KB SSD read is 100 us, so a 16KB page is 101 us β‰ˆ 100 us, and 10 pages take 1 ms.

When doing random reads, SSDs are 100x slower, not 10x!

12.02.2025 17:48 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
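The corrected estimate is easy to sanity-check in a few lines. The latency numbers are the thread's assumptions (100 us per random 16KB page read, 50 ns per main-memory cache line); the constant names are mine.

```cpp
#include <cassert>
#include <cmath>

constexpr double kRandomSsdPageUs = 100.0;  // random read of one 16KB page
constexpr double kMemLineUs = 0.05;         // 50 ns per new cache line

constexpr int kPages = 50;                  // distinct pages per query
constexpr int kOnDisk = kPages / 5;         // 80% cache hit rate -> 10 on disk

constexpr double kDiskUs = kOnDisk * kRandomSsdPageUs;  // 10 * 100 us = 1 ms
constexpr double kMemUs  = kPages * kMemLineUs;         // 50 * 50 ns = 2.5 us
constexpr double kTotalUs = kDiskUs + kMemUs;           // ~1 ms: SSD dominates
```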

Total:
50 pages * .25 us / page = 12.5 us
10 pages * 2 us / page = 20 us
12.5 us + 20 us = 32.5 us

Seems reasonable: SSD is 10x slower than main memory.

12.02.2025 17:48 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

50 pages / query * 80% of pages in cache = 40 pages in cache, 10 on disk.

Sequential SSD read: 1 us / 8KB * 16KB / page = 2 us / page.

Page scan time: 1 ns / 64 bytes * 16KB / page = 250 ns / page = .25 us / page

12.02.2025 17:48 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Problem 2 from the @sirupsen.bsky.social's napkin math newsletter:

Your SSD-backed database has a usage-pattern that rewards you with an 80% page-cache hit-rate. The median is 50 distinct disk pages for a query to gather its query results. What is the expected average query time from your database?

12.02.2025 17:48 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Tuning B+ Trees for Memory Performance | Tim Mastny

The blue line in the first plot is our theoretical model and the black points are average search times on my implementation of B+ tree search. Our theoretical model agrees very closely with our benchmark!

Check out my blog post for more details: timmastny.com/blog/tuning-...

11.02.2025 18:54 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

Pointer chasing is slow, but the number is limited to the height of the tree. Scanning is fast, but quickly adds up. Our model combines these costs: for any number of total keys, we can calculate the optimal number of keys per node 'b' that minimizes total search time.

11.02.2025 18:54 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
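A sketch of that cost model, under the thread's assumed costs (100 ns per pointer jump, 1 ns per key scanned). The names `search_ns` and `best_b` are illustrative, not code from the post.

```cpp
#include <cassert>
#include <cmath>

// Estimated lookup time for n total keys and b keys per node:
// the tree has log_b(n) levels, and each level costs one 100 ns
// pointer jump plus a scan of up to b keys at 1 ns per key.
double search_ns(double n, double b) {
    double height = std::log(n) / std::log(b);  // levels visited
    return height * (100.0 + b);                // jump + scan per level
}

// Brute-force scan for the node size that minimizes search time.
double best_b(double n) {
    double best = 2, best_cost = search_ns(n, 2);
    for (double b = 3; b <= 4096; ++b) {
        double c = search_ns(n, b);
        if (c < best_cost) { best_cost = c; best = b; }
    }
    return best;
}
```

At `b = 2` and a billion keys this reproduces the ~3 us binary-search-tree figure from the thread, and the minimum lands at an interior `b`, between the two extremes.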

On the other hand, a B+ tree with infinite keys per node is an array. A sequential scan of 1 trillion keys would take about 16 minutes at 1ns per key.

11.02.2025 18:54 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Let’s start with the extremes. With one key per node, our B+ tree is a binary search tree. Even with 1,000,000,000 keys, the maximum height is <30. We can find any key in less than 3 us: 30 pointer jumps * 100 ns per jump = 3 us

11.02.2025 18:54 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Search in B+ trees is a balance of two operations: scanning within nodes (1ns each) and jumping between nodes (100ns each). The size of each node determines how many of each operation we do - and that's why finding the optimal node size is crucial for performance.

11.02.2025 18:54 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Problem 1 | Tim Mastny

Here's problem one from the newsletter:
sirupsen.com/napkin/probl...

And my napkin math:
timmastny.com/projects/nap...

04.02.2025 23:52 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

The solution estimated the storage cost to be $0.01 GB/month, but didn't mention write costs. Simon mentions disk storage rather than blob, which I did not consider πŸ€”

04.02.2025 23:52 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Tips from solution: calculate and estimate with powers of 10 to simplify the math, e.g. 1 day = 9 * 10^4 seconds. I also like to use powers of 10 that are multiples of three to match SI prefixes: 5 * 10^5 bytes = 0.5 * 10^6 = 0.5 MB

04.02.2025 23:52 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

At $5 / 1 million writes, we have to batch requests together before writing to the blob, or we would 250x our storage cost. Batching 1000 requests costs $15,000 / year. Batching 1,000,000 is only $15.

04.02.2025 23:52 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Data size: 1 byte per character, 5 characters per word, 200 words per request = 1KB per request. That works out to 9TB / day. Blob storage at $0.02 GB / month is $65,000 per year. But we also have to pay for writes to blob storage!

04.02.2025 23:52 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
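The thread's estimates can be checked with its own power-of-10 numbers (100,000 RPS, 1KB per request, 9 * 10^4 seconds per day, $5 per million writes); constant and function names are mine.

```cpp
#include <cassert>
#include <cmath>

constexpr double kRps = 1e5;          // requests per second
constexpr double kBytesPerReq = 1e3;  // 1KB per request
constexpr double kSecsPerDay = 9e4;   // power-of-10 day
constexpr double kDaysPerYear = 365;

// 1e5 req/s * 1KB * 9e4 s/day = 9e12 bytes/day = 9 TB/day.
constexpr double kTbPerDay = kRps * kBytesPerReq * kSecsPerDay / 1e12;

// Write costs at $5 per million writes, batching `batch` requests per write.
constexpr double writes_per_year(double batch) {
    return kRps / batch * kSecsPerDay * kDaysPerYear;
}
constexpr double write_cost_usd(double batch) {
    return writes_per_year(batch) / 1e6 * 5.0;
}
```

Batching 1000 requests gives ~$16k/year (the thread rounds to $15,000), batching a million gives ~$16/year, and writing unbatched would be ~$16M, which is where the 250x-over-storage figure comes from.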

Starting to work through the Napkin Math newsletter by x.com/Sirupsen

Problem 1: How much will the storage of logs cost for a standard, monolithic 100,000 RPS web application?

04.02.2025 23:52 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
RGA Editor

The demo includes several interactive examples, showing how the data structure solves some typical synchronization problems.

Check out the demo here: timmastny.com/rga/

27.01.2025 19:19 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

Latest demo: collaborative text editing synchronized with a conflict-free data structure.

The data structure ensures that insertions, edits, and deletions are properly merged during synchronization, even when the changes were made offline. 1/

27.01.2025 19:19 πŸ‘ 2 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
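This is not the RGA from the demo, but a much-simplified sequence CRDT that sketches the same convergence idea: every character carries a globally unique, totally ordered ID, deletes leave tombstones, and merge is a union where tombstones win, so replicas end up identical regardless of sync order. All names here are illustrative.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// ID = (fractional position, site id): globally unique and totally ordered.
using Id = std::pair<double, int>;

struct Replica {
    // id -> (character, tombstoned?)
    std::map<Id, std::pair<char, bool>> chars;

    void insert(double pos, int site, char c) { chars[{pos, site}] = {c, false}; }
    void erase(Id id) { chars[id].second = true; }  // tombstone, never removed

    // Union of operations; a tombstone on either side wins.
    void merge(const Replica& other) {
        for (const auto& [id, v] : other.chars) {
            auto it = chars.find(id);
            if (it == chars.end()) chars[id] = v;
            else it->second.second = it->second.second || v.second;
        }
    }

    // The visible document: live characters in ID order.
    std::string text() const {
        std::string s;
        for (const auto& [id, v] : chars)
            if (!v.second) s += v.first;
        return s;
    }
};
```

Because `merge` is commutative, associative, and idempotent, two replicas that edited offline converge to the same text once they exchange state, which is the property the demo shows interactively.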
Why Branch Prediction Needs Real Data | Tim Mastny

Loops, hot paths, and repetitive data mean programs are often predictable: even a simple "branch taken" prediction can achieve 70% accuracy.

Unlike sorting algos, we don't care how predictors perform on random data: we need to compare them against real data.
timmastny.com/blog/branch-...

15.01.2025 20:19 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

Think about hashing like putting balls randomly into bins. gshare could put all the balls into one bin.

But with concatenation, each bin has a maximum of 4 balls, leading to fewer collisions on average.

The problem is that branch history is not random! 2/

15.01.2025 20:19 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
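The two schemes can be sketched as table-index functions. The table size and history width are assumptions for illustration, not numbers from the post.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned K = 10;                 // 2^10 = 1024-entry predictor table
constexpr uint32_t MASK = (1u << K) - 1;

// gshare: XOR the branch address with the global history, then mask.
// Any (pc, history) pair can land anywhere, so in the balls-in-bins
// picture all the balls could pile into one bin.
uint32_t gshare_index(uint32_t pc, uint32_t history) {
    return (pc ^ history) & MASK;
}

// Concatenation: low bits of pc pick the row, H history bits pick the
// slot, so each pc maps to at most 2^H entries (4 balls per bin here).
constexpr unsigned H = 2;
uint32_t concat_index(uint32_t pc, uint32_t history) {
    return ((pc << H) | (history & ((1u << H) - 1))) & MASK;
}
```

On random inputs concatenation's bounded collisions look better, but gshare's mixing pays off precisely because real branch history isn't random.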

When analyzing branch prediction algorithms, the "better" hashing approach (gshare) is theoretically worse than the simple one (concatenation).

What's going on?

15.01.2025 20:19 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0