I want bare metal instances that can launch within 2-3 seconds, for a better (local dev <-> remote execution) REPL workflow
Compared to fast-launching containers (e.g., Cloud Run), bare metal gives me more reliable benchmark measurements and the ability to bring tools like Nsight
17.01.2026 18:28
i've found some success sticking to SIMD-friendly scalar patterns
i get the loop order right (polyhedral analysis), add hints like `#pragma omp simd`, and compile at -O3: that's _usually_ enough. you can check the output with -S (emits readable assembly)
or use SIMDe, that works too
17.01.2026 02:34
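A minimal sketch of the pattern described above (my illustration; assumes GCC/Clang with `-O3 -fopenmp-simd`, and the function name is made up):

```c
#include <stddef.h>

/* SIMD-friendly scalar loop: unit-stride accesses and no loop-carried
 * dependency, so the compiler can vectorise it at -O3. The pragma is a
 * hint that iterations are independent. */
void saxpy(float a, const float *x, const float *y, float *out, size_t n)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        out[i] = a * x[i] + y[i];
}
```

Compiling with `-O3 -fopenmp-simd -S` and scanning the `.s` output for vector instructions is a quick way to confirm the loop actually vectorised.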
1. Introduction — PTX ISA 9.1 documentation
This feels like a continuation of the reduction operators introduced with Blackwell's TMA (cp.reduce.async.bulk). Fun fact! Data movement often dominates power usage vs compute because of physics: longer, thicker wires need more energy to transmit each bit. Makes a lot of sense to optimise here.
17.01.2026 02:02
Mmm! Nice corollary: software optimisations for prefix sums (re-parenthesizing) generalise across associative ops: +, ^, prefix-of-prefix.
I made a thread about it: bsky.app/profile/asht...
17.01.2026 01:31
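To illustrate the corollary (my sketch, not from the linked thread): the same blocked local-prefix + late-carry structure works for any associative op, e.g. XOR:

```c
#include <stddef.h>
#include <stdint.h>

/* Blocked prefix-XOR: compute local prefixes per block, then fold in
 * each block's carry afterwards. Valid because ^ is associative. */
void prefix_xor(const uint32_t *in, uint32_t *out, size_t n)
{
    const size_t B = 4;                 /* block width (illustrative) */
    for (size_t b = 0; b < n; b += B) { /* local prefix per block */
        size_t end = b + B < n ? b + B : n;
        uint32_t acc = 0;
        for (size_t i = b; i < end; i++)
            out[i] = (acc ^= in[i]);
    }
    for (size_t b = B; b < n; b += B) { /* late carry broadcast */
        uint32_t carry = out[b - 1];    /* already globally correct */
        size_t end = b + B < n ? b + B : n;
        for (size_t i = b; i < end; i++)
            out[i] ^= carry;
    }
}
```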
perf-portfolio/delta at main Β· ashtonsix/perf-portfolio
HPC research and demonstrations. Contribute to ashtonsix/perf-portfolio development by creating an account on GitHub.
Full write-up, implementation (NEON) and benchmark results (Graviton4) here: github.com/ashtonsix/pe...
I love solving these kinds of performance puzzles, and I'm currently available for hire! Reach out if interested. 3/3
17.01.2026 00:55
The ILP trick:
# Local prefix sums
out[0..3] = prefix(in[0..3])
out[4..7] = prefix(in[4..7])
...
# Late carry broadcast (redundant compute)
out[4..7] += out[3];
out[8..11] += out[7];
...
By delaying the carry we allow the CPU to compute all the local prefix sums in parallel, more than doubling throughput. 2/
17.01.2026 00:55
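A scalar C sketch of the trick above (my illustration; the actual SOTA implementation uses NEON intrinsics, per the linked repo):

```c
#include <stddef.h>
#include <stdint.h>

/* Blocked prefix sum. The four carry additions are mutually independent,
 * and the next block's local prefix chain doesn't wait on them, so the
 * CPU can overlap work across blocks instead of serialising every add. */
void prefix_sum_ilp(const uint32_t *in, uint32_t *out, size_t n)
{
    uint32_t carry = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        /* local prefix sums: a short chain, independent of the carry */
        uint32_t p0 = in[i];
        uint32_t p1 = p0 + in[i + 1];
        uint32_t p2 = p1 + in[i + 2];
        uint32_t p3 = p2 + in[i + 3];
        /* late carry broadcast: four independent adds */
        out[i]     = p0 + carry;
        out[i + 1] = p1 + carry;
        out[i + 2] = p2 + carry;
        out[i + 3] = p3 + carry;
        carry += p3;
    }
    for (; i < n; i++)          /* scalar tail */
        out[i] = carry += in[i];
}
```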
I got SOTA (L1-hot, SIMD) on prefix sum by ADDING instructions (7.7 GB/s → 19.8 GB/s). Consider:
for i = 0..n: out[i] = out[i-1] + in[i]
This SUCKS, because out[i] must wait on out[i-1]. There's an unbroken dependency chain, which disrupts Instruction-Level Parallelism (ILP). 1/
17.01.2026 00:55
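The naive loop above, as runnable C (illustrative; every iteration waits on the previous one, so throughput is bounded by add latency rather than add throughput):

```c
#include <stddef.h>
#include <stdint.h>

/* Naive prefix sum: out[i] depends on out[i-1], one long serial chain. */
void prefix_sum_naive(const uint32_t *in, uint32_t *out, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        out[i] = acc += in[i];
}
```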