Ashton Six

@ashtonsix.com

Research Engineer (software), with interests in superoptimisation, fast integer compression, and indexing for OLAP

16.01.2026
Joined

Latest posts by Ashton Six @ashtonsix.com

I want bare metal instances that can launch within 2-3 seconds, for a better (local dev <-> remote execution) REPL workflow

vs. fast launching containers (eg, Cloud Run), bare metal gives me more reliable benchmark measurements and the ability to bring tools like Nsight

17.01.2026 18:28 👍 2 🔁 0 💬 0 📌 0

i've found some success sticking to SIMD-friendly scalar patterns

i get loop order right (polyhedral analysis), add hints like `#pragma omp simd`, and run at -O3: that's _usually_ enough. you can check output with -S (gives readable ASM)

or use SIMDe, that works too

17.01.2026 02:34 👍 0 🔁 0 💬 0 📌 0
1. Introduction - PTX ISA 9.1 documentation
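
A minimal sketch of what a SIMD-friendly scalar loop with an `#pragma omp simd` hint could look like, assuming C99 and GCC or Clang. The function name `saxpy` and the choice of kernel are illustrative and not taken from the post.

```c
#include <stddef.h>

/* y[i] += a * x[i]
   `restrict` promises the compiler that x and y never overlap,
   which removes one common obstacle to auto-vectorisation. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y) {
    /* A hint only: honoured with -fopenmp or -fopenmp-simd,
       silently ignored (or warned about) otherwise. */
    #pragma omp simd
    for (size_t i = 0; i < n; i++) {
        y[i] += a * x[i];
    }
}
```

Compiling with something like `cc -O3 -fopenmp-simd -S saxpy.c` (file name hypothetical) writes assembly you can inspect for vector instructions. Whether the loop actually vectorises depends on the compiler and target.
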

This feels like a continuation of the reduction operators introduced in Blackwell's TMA (cp.reduce.async.bulk). Fun fact! Data movement often dominates power usage vs compute because of physics: thicker, longer wires = more power needed to transmit each bit. Makes a lot of sense to optimise here.

17.01.2026 02:02 👍 3 🔁 0 💬 0 📌 0

Mmm! Nice corollary: software optimisations for prefix sums (re-parenthesizing) generalise across associative ops: +, ^, prefix-of-prefix.

I made a thread about it: bsky.app/profile/asht...

17.01.2026 01:31 👍 0 🔁 0 💬 0 📌 0
perf-portfolio/delta at main · ashtonsix/perf-portfolio HPC research and demonstrations. Contribute to ashtonsix/perf-portfolio development by creating an account on GitHub.
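
One way to see the generalisation, as a rough sketch: the same blocked scan works for any associative operator, here `+` and `^`. The names `scan_blocked`, `op_add`, and `op_xor` are made up for illustration, and the function-pointer dispatch is for clarity rather than speed.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*binop)(uint32_t, uint32_t);

uint32_t op_add(uint32_t a, uint32_t b) { return a + b; }
uint32_t op_xor(uint32_t a, uint32_t b) { return a ^ b; }

/* Inclusive scan in blocks of 4. Each block is scanned locally first,
   then the previous block's total is folded in afterwards.
   This re-parenthesisation is only valid when op is associative. */
void scan_blocked(binop op, const uint32_t *in, uint32_t *out, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        /* Local scan: depends only on this block's inputs */
        uint32_t a = in[i];
        uint32_t b = op(a, in[i + 1]);
        uint32_t c = op(b, in[i + 2]);
        uint32_t d = op(c, in[i + 3]);
        if (i > 0) {
            /* Late carry: carry op (x0 op x1) == (carry op x0) op x1 */
            uint32_t carry = out[i - 1];
            a = op(carry, a);
            b = op(carry, b);
            c = op(carry, c);
            d = op(carry, d);
        }
        out[i] = a;
        out[i + 1] = b;
        out[i + 2] = c;
        out[i + 3] = d;
    }
    /* Tail: plain serial scan for the last n % 4 elements */
    for (; i < n; i++) {
        out[i] = (i > 0) ? op(out[i - 1], in[i]) : in[i];
    }
}
```
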

Full write-up, implementation (NEON) and benchmark results (Graviton4) here: github.com/ashtonsix/pe...

I love solving these kinds of performance puzzles, and I'm currently available for hire! Reach out if interested 😊. 3/3

17.01.2026 00:55 👍 0 🔁 0 💬 0 📌 0

The ILP trick:

# Local prefix sums
out[0..3] = prefix(in[0..3])
out[4..7] = prefix(in[4..7])
...

# Late carry broadcast (redundant compute)
out[4..7] += out[3];
out[8..11] += out[7];
...

By delaying the carry we allow the CPU to compute all local prefix sums in parallel, more than doubling throughput. 2/

17.01.2026 00:55 👍 0 🔁 0 💬 1 📌 0
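
The pseudocode above can be sketched in plain scalar C roughly as follows. This is an assumption-laden illustration, not the NEON implementation from the write-up: the function name `prefix_sum_late_carry` and the block size of 4 are chosen here for readability, and a real kernel would likely fuse the two passes.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK 4  /* illustrative block size, matching the pseudocode */

void prefix_sum_late_carry(const uint32_t *in, uint32_t *out, size_t n) {
    /* Pass 1: local prefix sums. Blocks do not depend on each other,
       so the CPU is free to overlap their work. */
    for (size_t b = 0; b < n; b += BLOCK) {
        uint32_t acc = 0;
        for (size_t i = b; i < b + BLOCK && i < n; i++) {
            acc += in[i];
            out[i] = acc;
        }
    }
    /* Pass 2: late carry broadcast. out[b - 1] already holds the full
       running total, because earlier blocks were fixed up first. */
    for (size_t b = BLOCK; b < n; b += BLOCK) {
        uint32_t carry = out[b - 1];
        for (size_t i = b; i < b + BLOCK && i < n; i++) {
            out[i] += carry;
        }
    }
}
```

The extra adds in pass 2 are the "redundant compute" the post refers to: more instructions overall, but a much shorter serial chain (one carry per block instead of one per element).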

I got SOTA (L1-hot, SIMD) on prefix sum by ADDING instructions (7.7 GB/s → 19.8 GB/s). Consider:

out[0] = in[0]
for i = 1..n: out[i] = out[i-1] + in[i]

This SUCKS, because out[i] must wait on out[i-1]. There's an unbroken dependency chain which disrupts Instruction Level Parallelism (ILP). 1/

17.01.2026 00:55 👍 1 🔁 0 💬 1 📌 1
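
For reference, a minimal C version of the serial baseline described in this post might look like the following. The function name `prefix_sum_serial` and the `uint32_t` element type are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Inclusive prefix sum: out[i] = in[0] + ... + in[i] */
void prefix_sum_serial(const uint32_t *in, uint32_t *out, size_t n) {
    uint32_t acc = 0;  /* running total: the loop-carried dependency */
    for (size_t i = 0; i < n; i++) {
        acc += in[i];  /* needs the result of the previous iteration's add */
        out[i] = acc;
    }
}
```

Every add has to wait for the one before it, so the loop's speed is bounded by add latency rather than by how many adds the core can issue per cycle.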