I want bare metal instances that can launch within 2-3 seconds, for a better (local dev <-> remote execution) REPL workflow
Compared to fast-launching containers (e.g., Cloud Run), bare metal gives me more reliable benchmark measurements and the ability to bring tools like Nsight
17.01.2026 18:28
i've found some success sticking to SIMD-friendly scalar patterns
i get the loop order right (polyhedral analysis), add hints like `#pragma omp simd`, and compile at -O3: that's _usually_ enough. you can check the output with -S (emits readable assembly)
or use SIMDe, that works too
17.01.2026 02:34
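A minimal sketch of the pattern described above (my illustration; assumes GCC/Clang with `-O3 -fopenmp-simd`, and the function name is made up):

```c
#include <stddef.h>

/* SIMD-friendly scalar loop: unit-stride accesses and no loop-carried
 * dependency, so the compiler can vectorise it at -O3. The pragma is a
 * hint that iterations are independent. */
void saxpy(float a, const float *x, const float *y, float *out, size_t n)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        out[i] = a * x[i] + y[i];
}
```

Compiling with `-O3 -fopenmp-simd -S` and scanning the `.s` output for vector instructions is a quick way to confirm the loop actually vectorised.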
1. Introduction — PTX ISA 9.1 documentation
This feels like a continuation of the reduction operators introduced with Blackwell's TMA (cp.reduce.async.bulk). Fun fact! Data movement often dominates power usage vs compute because of physics: longer, thicker wires need more energy to transmit each bit. Makes a lot of sense to optimise here.
17.01.2026 02:02
Mmm! Nice corollary: software optimisations for prefix sums (re-parenthesizing) generalise across associative ops: +, ^, prefix-of-prefix.
I made a thread about it: bsky.app/profile/asht...
17.01.2026 01:31
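To illustrate the corollary (my sketch, not from the linked thread): the same blocked local-prefix + late-carry structure works for any associative op, e.g. XOR:

```c
#include <stddef.h>
#include <stdint.h>

/* Blocked prefix-XOR: compute local prefixes per block, then fold in
 * each block's carry afterwards. Valid because ^ is associative. */
void prefix_xor(const uint32_t *in, uint32_t *out, size_t n)
{
    const size_t B = 4;                 /* block width (illustrative) */
    for (size_t b = 0; b < n; b += B) { /* local prefix per block */
        size_t end = b + B < n ? b + B : n;
        uint32_t acc = 0;
        for (size_t i = b; i < end; i++)
            out[i] = (acc ^= in[i]);
    }
    for (size_t b = B; b < n; b += B) { /* late carry broadcast */
        uint32_t carry = out[b - 1];    /* already globally correct */
        size_t end = b + B < n ? b + B : n;
        for (size_t i = b; i < end; i++)
            out[i] ^= carry;
    }
}
```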
perf-portfolio/delta at main Β· ashtonsix/perf-portfolio
HPC research and demonstrations. Contribute to ashtonsix/perf-portfolio development by creating an account on GitHub.
Full write-up, implementation (NEON) and benchmark results (Graviton4) here: github.com/ashtonsix/pe...
I love solving these kinds of performance puzzles, and I'm currently available for hire! Reach out if interested. 3/3
17.01.2026 00:55
The ILP trick:
# Local prefix sums
out[0..3] = prefix(in[0..3])
out[4..7] = prefix(in[4..7])
...
# Late carry broadcast (redundant compute)
out[4..7] += out[3];
out[8..11] += out[7];
...
By delaying the carry we allow the CPU to compute all the local prefix sums in parallel, more than doubling throughput. 2/
17.01.2026 00:55
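A scalar C sketch of the trick above (my illustration; the actual SOTA implementation uses NEON intrinsics, per the linked repo):

```c
#include <stddef.h>
#include <stdint.h>

/* Blocked prefix sum. The four carry additions are mutually independent,
 * and the next block's local prefix chain doesn't wait on them, so the
 * CPU can overlap work across blocks instead of serialising every add. */
void prefix_sum_ilp(const uint32_t *in, uint32_t *out, size_t n)
{
    uint32_t carry = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        /* local prefix sums: a short chain, independent of the carry */
        uint32_t p0 = in[i];
        uint32_t p1 = p0 + in[i + 1];
        uint32_t p2 = p1 + in[i + 2];
        uint32_t p3 = p2 + in[i + 3];
        /* late carry broadcast: four independent adds */
        out[i]     = p0 + carry;
        out[i + 1] = p1 + carry;
        out[i + 2] = p2 + carry;
        out[i + 3] = p3 + carry;
        carry += p3;
    }
    for (; i < n; i++)          /* scalar tail */
        out[i] = carry += in[i];
}
```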
I got SOTA (L1-hot, SIMD) on prefix sum by ADDING instructions (7.7 GB/s → 19.8 GB/s). Consider:
for i = 0..n: out[i] = out[i-1] + in[i]
This SUCKS, because out[i] must wait on out[i-1]. There's an unbroken dependency chain, which disrupts Instruction-Level Parallelism (ILP). 1/
17.01.2026 00:55
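The naive loop above, as runnable C (illustrative; every iteration waits on the previous one, so throughput is bounded by add latency rather than add throughput):

```c
#include <stddef.h>
#include <stdint.h>

/* Naive prefix sum: out[i] depends on out[i-1], one long serial chain. */
void prefix_sum_naive(const uint32_t *in, uint32_t *out, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        out[i] = acc += in[i];
}
```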