Luke Lau (@lukel97) — bluesky.baby

Closing the gap, part 2: Probability and profitability Welcome back to the second post in this series looking at how we can improve the performance of RISC-V code from LLVM.

One of the nice parts of #llvm is that often times you'll find yourself needing to do some sort of non-trivial analysis, but usually there's already a pass for it.

Here's how you can reuse a block frequency analysis to make a chess engine 7% faster on #riscv: lukelau.me/2026/01/26/c...

27.01.2026 13:49 👍 12 🔁 1 💬 0 📌 0

Closing the LLVM RISC-V gap to GCC, part 1 At the time of writing, GCC beats Clang on several SPEC CPU 2017 benchmarks on RISC-V [1]. [1] Compiled with -march=rva22u64_v -O3 -flto, running the train

Does LLVM produce slower RISC-V code than GCC? Currently, yes.

Can we make LLVM produce faster code? Also, yes!

lukelau.me/2025/12/10/c...

#llvm #riscv

10.12.2025 14:42 👍 14 🔁 7 💬 0 📌 0

"How NOT To Program an Out-of-order Vector Processor" slides are public.

static.sched.com/hosted_files...

23.10.2025 10:51 👍 1 🔁 1 💬 1 📌 0

A title card with a photo of Mikhail and the same information, but adding 11:50am (in Santa Clara)

We're looking forward to the RISC-V Summit North America next week where Mikhail Gadelha (one of our compiler engineers) will be presenting "Unlocking 15% More Performance: A Case Study in LLVM Optimization for RISC-V". Be sure to catch his talk next Thurs

riscvsummit2025.sched.com/event/28OTp/...

17.10.2025 14:09 👍 10 🔁 5 💬 0 📌 0

In Pictures: HK police deploy armoured vehicle on Tiananmen anniversary

Police have deployed an armoured vehicle in Hong Kong's commercial heart, amidst an ongoing heavy security presence on the 36th anniversary of the Tiananmen Square crackdown. In full: buff.ly/f4hVB50

04.06.2025 10:30 👍 14 🔁 9 💬 0 📌 3

Picture of a presenter showing a slide that details outcomes of RISE funded RISC-V software ecosystem projects.

I'm delighted to see two of @igalia.com's projects for RISE highlighted at the RISC-V Summit Europe.

Find out more about our work on both LLVM optimisation and testing/CI on the RISE blog (with more to come in the future!):
riseproject.dev/2025/05/08/p...
riseproject.dev/2024/10/15/w...

14.05.2025 10:50 👍 6 🔁 3 💬 0 📌 0

@camel-cdr.bsky.social rvv-bench is used here!

18.04.2025 10:33 👍 5 🔁 0 💬 1 📌 0

We're looking forward to EuroLLVM next week in Berlin. Be sure to check out talks from my colleague @lukel97.bsky.social and myself on:
* Work to further improve RISC-V vector codegen (extending the VL Optimizer), and
* Work done with the support of RISE to improve RISC-V LLVM testing.

12.04.2025 07:30 👍 9 🔁 4 💬 0 📌 0

FEX 2503 Tagged Here we are again, another month and some more cool changes with FEX. Let’s dive in and see what has changed!

What if I told you 3DNow! square root reciprocals are defined for negative numbers?... Also the amazing FEX 2503 is out. Read about some of my work and the work of other FEX maintainers in the release notes: fex-emu.com/FEX-2503/ #fex #igalia #gaming #linux #arm64

06.03.2025 15:50 👍 4 🔁 2 💬 1 📌 0

ccache for LLVM builds across multiple directories TL;DR: ccache base_dir saves the day

Some notes on ccache+LLVM. Summary: if you do a lot of builds across different checkouts/worktrees/builddirs, be sure to set the base_dir option and -DLLVM_USE_RELATIVE_PATHS_IN_DEBUG_INFO=ON muxup.com/2025q1/ccach...

27.02.2025 18:39 👍 9 🔁 4 💬 0 📌 0

Inside SiFive’s P550 Microarchitecture RISC-V is a relatively young and open source instruction set. So far, it has gained traction in microcontrollers and academic applications. For example, Nvidia replaced the Falcon microcontrollers …

Hello you fine Internet folks,
Today's article is on SiFive's P550 microarchitecture. The P550 core is one of the fastest RISC-V cores available currently and is claimed to be comparable to ARM's Cortex A75.
Hope y'all enjoy!

old.chipsandcheese.com/2025/01/26/i...

open.substack.com/pub/chipsand...

26.01.2025 22:14 👍 12 🔁 5 💬 0 📌 0

Executable loading and startup performance on macOS Recently, I fixed a startup performance regression in Node.js on macOS after an extensive investigation. Along the way, I learned a lot about tools on macOS and Node.js compilation workflows that don’

New blog post covering the mysterious 10ms startup regression of Node.js on macOS, the journey of investigating the issue with various performance tools, and figuring out the fix (which also helped make the binary smaller).

joyeecheung.github.io/blog/2025/01...

11.01.2025 22:25 👍 127 🔁 18 💬 3 📌 2

A Simple ELF - The Ivory Tower The Ivory Tower is a blog about software engineering and development philosophy by Anders Sundman.

A Simple ELF 4zm.org/2024/12/25/a...

27.12.2024 11:18 👍 0 🔁 0 💬 0 📌 0

build: build v8 with -fvisibility=hidden on macOS by joyeecheung · Pull Request #56275 · nodejs/node V8 should be built with -fvisibility=hidden, otherwise the resulting binary would contain unnecessary symbols. In particular, on macOS, this leads to 5000+ weak symbols resolved at runtime, leading...

After two months of chasing, finally found out what's happening behind this mysterious startup time regression on macOS from Node.js v20.x - it's missing -fvisibility=hidden 😅 (I guess that's what happens when the build configs become dusty enough) github.com/nodejs/node/...

16.12.2024 21:55 👍 59 🔁 8 💬 3 📌 2

Abnormally slow loop (25x) under OCaml 5 / macOS / arm64 · Issue #13262 · ocaml/ocaml Hello, I am using macOS Ventura 13.6.7 with an Apple M2 Max processor. A loop that writes values into an integer array is about 20x slower with OCaml 5 than with OCaml 4. Using Array.set versus Arr...

Recently I came across this treatise by Stephen Dolan

github.com/ocaml/ocaml/...

12.12.2024 00:03 👍 23 🔁 5 💬 2 📌 0

256 loads, since it’s an LMUL 8 load with VLEN=256! I’m not sure how it compares to the scalar equivalent, but my guess is that the vlse8.v is loading one element at a time under the hood

11.12.2024 11:17 👍 0 🔁 0 💬 0 📌 0

A screenshot of a terminal:

luke@bananapif3:~/slowest-instr$ cat main.S
.section .rodata
str:
    .asciz "Cycles: %d\n"
foo:
    .zero 256 * STRIDE

.section .text
.global main
main:
    addi sp, sp, -8
    sd ra, 0(sp)

    rdcycle s1
    rdcycle s2
    sub s3, s2, s1 # rdcycle overhead

    la a0, foo
    li a1, STRIDE
    vsetvli t0, zero, e8, m8, tu, mu

    rdcycle s1
    vlse8.v v8, (a0), a1
    rdcycle s2

    sub s1, s2, s1
    sub s1, s1, s3

    la a0, str
    mv a1, s1
    call printf

    ld ra, 0(sp)
    addi sp, sp, 8
    ret
luke@bananapif3:~/slowest-instr$ clang main.S -DSTRIDE=65536 -march=rv64gv
luke@bananapif3:~/slowest-instr$ perf stat -e cycles:u ./a.out
Cycles: 66640979

 Performance counter stats for './a.out':

    78,064,581 cycles:u

    0.049648957 seconds time elapsed

    0.000000000 seconds user
    0.049907000 seconds sys

Trying to find the slowest possible RISC-V instruction. This single vlse8.v with a stride of 65536 bytes takes 66 million cycles on a Banana Pi F3. That's 0.04 seconds @1.6GHz
#risc-v

11.12.2024 09:40 👍 23 🔁 5 💬 4 📌 0
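
The figures in the two posts above can be checked with a little arithmetic. This is only a rough sketch in Python, assuming the usual RVV relation VLMAX = LMUL × VLEN / SEW and taking VLEN=256, the 1.6GHz clock, and the cycle count straight from the posts. The function and variable names are made up for illustration.

```python
def vlmax(vlen_bits: int, sew_bits: int, lmul: int) -> int:
    """Elements in a vector register group: LMUL * VLEN / SEW."""
    return lmul * vlen_bits // sew_bits


# Values quoted in the posts: e8, m8, VLEN=256, 66640979 cycles, 1.6GHz.
elements = vlmax(vlen_bits=256, sew_bits=8, lmul=8)
cycles = 66_640_979
clock_hz = 1.6e9  # assumed constant clock, as stated in the post

seconds = cycles / clock_hz
cycles_per_element = cycles / elements

print(f"elements loaded: {elements}")
print(f"wall time: {seconds:.4f} s")
print(f"cycles per element: {cycles_per_element:.0f}")
```

Under those assumptions the single instruction touches 256 elements that sit 64 KiB apart, so the loads are spread across the whole 16 MiB `foo` buffer.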

The maximum possible vl is 2^16 I think, so that would fit in XLEN=32?

06.12.2024 16:28 👍 1 🔁 0 💬 1 📌 0
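
A quick check of that bound, as a sketch: the RVV 1.0 spec caps VLEN at 2^16 bits, so the largest VLMAX is reached with SEW=8 and LMUL=8. The helper name below is illustrative only.

```python
MAX_VLEN_BITS = 1 << 16  # upper bound on VLEN in the RVV 1.0 spec


def vlmax(vlen_bits: int, sew_bits: int, lmul: int) -> int:
    """Elements in a vector register group: LMUL * VLEN / SEW."""
    return lmul * vlen_bits // sew_bits


# Smallest element width and largest register grouping give the largest vl.
largest_vl = vlmax(MAX_VLEN_BITS, sew_bits=8, lmul=8)
bits_needed = largest_vl.bit_length()
fits_in_xlen32 = bits_needed <= 32

print(largest_vl, bits_needed, fits_in_xlen32)
```

The value 2^16 itself needs 17 bits to represent, which is still comfortably within a 32-bit register.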

With that said I forgot how confusing the V extension hierarchy can be. After thinking about EEW=64 on XLEN=32 I think I need to go lie down a bit 😵‍💫

06.12.2024 16:21 👍 2 🔁 0 💬 0 📌 0

Otherwise EEW=64 is supported as usual, since there’s also this bit at the bottom:

> The V extension requires the scalar processor implements the F and D extensions

06.12.2024 16:18 👍 2 🔁 0 💬 0 📌 0

Is it this bit here?

> The V extension supports all vector load and store instructions (Section Vector Loads and Stores), except the V extension does not support EEW=64 for index values when XLEN=32.

I’m interpreting that as index values, i.e. only indices passed to vluxei64.v and friends

06.12.2024 16:16 👍 2 🔁 0 💬 3 📌 0

Are you talking about zve32x? That doesn’t include any fp support, but zve32f should mandate f and zve64f should mandate d I think

06.12.2024 04:48 👍 1 🔁 0 💬 1 📌 0

'RVV mask tricks'

# broadcast nth bit
vmand.mm v8, in, mNth
vcpop.m t0, v8
sub t0, x0, t0
vmv.v.x v8, t0

# prefix xor
viota.m v8, in
vand.vi v8, v8, 1
vmsne.vi v8, v8, 0
vmor.mm v0, v8, in # can often be omitted

# move nth bit to first
vmand.mm v8, in, mNth
vcpop.m t0, v8
vmv.v.x v8, t0
vmsof.m v0, v8

# move mask to GPR
vmv.x.s t0, v0
# move GPR to mask
vmv.s.x v0, t0
# assuming vl<=64, set SEW=64 before
# these two should really be dedicated instructions

# shift mask up by 1
vslide1up.vx v8, in, x0
vsrl.vi v8, v8, 7
vmadd.vx v0, 2, v8

# shift mask up by 1
vslide1down.vx v8, in, x0
vadd.vv v0, in, in
vmacc.vx v0, 128, v8

Here are some slightly tricky RVV mask patterns.

03.12.2024 21:37 👍 7 🔁 3 💬 1 📌 0
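
To see why the "prefix xor" pattern works, here is a small Python model of its first three instructions. It is a sketch under stated assumptions: `viota` below is a hand-written stand-in for an unmasked `viota.m` (each element receives the number of set mask bits at lower indices), every element is treated as active, and the trailing `vmor.mm` is deliberately not modelled. All function names are hypothetical.

```python
from itertools import accumulate
from operator import xor


def viota(mask):
    """Model of unmasked viota.m: element i = count of set bits at indices < i."""
    out, running = [], 0
    for bit in mask:
        out.append(running)
        running += bit
    return out


def prefix_xor_exclusive(mask):
    """viota.m, then vand.vi with 1 (parity), then vmsne.vi with 0 (back to a mask)."""
    return [int((count & 1) != 0) for count in viota(mask)]


mask = [0, 1, 0, 0, 1, 0, 1, 1]
result = prefix_xor_exclusive(mask)

# Reference: XOR of all bits strictly before each position.
reference = [0] + list(accumulate(mask, xor))[:-1]

print(result)
print(result == reference)
```

The parity of a running population count is the same thing as a running XOR, which is why a count-producing instruction followed by a mask with 1 is enough here.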

Even better is being able to measure the numbers yourself without the need for vendor tables. RISC-V support for llvm-exegesis is landing soon IIUC, with RVV not too far behind either.

03.12.2024 03:02 👍 4 🔁 0 💬 0 📌 0

RVV benchmark

The RVV equivalent of Agner Fog's instruction tables is camel-cdr.github.io/rvv-bench-re...; it’s an incredibly useful resource. We use it all the time for LLVM!

03.12.2024 00:52 👍 3 🔁 0 💬 1 📌 0
