(@camel-cdr) — bluesky.baby

riscv-isa-manual/src/zvzip.adoc at zvzip · ved-rivos/riscv-isa-manual RISC-V Instruction Set Manual.

I only slightly disagree with using segmented load/stores to transpose. If you need to transpose from memory, fine, but if you need register-to-register, going through memory isn't the best. I'd use vslide1up/down or, in the future, vpaire/vpairo: github.com/ved-rivos/ri...

23.10.2025 10:51 👍 1 🔁 0 💬 0 📌 0

"How NOT To Program
an Out-of-order Vector
Processor" slides are public.

static.sched.com/hosted_files...

23.10.2025 10:51 👍 1 🔁 1 💬 1 📌 0

Fuzzing tip: use VLAs instead of fixed-size buffers or malloc

1. With fixed-size buffers, ASan won't catch everything.

2. VLAs are faster than malloc; in my case I get 15% faster fuzzing.

If VLAs aren't portable enough, just check __STDC_NO_VLA__ and select between the other options.
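
A minimal sketch of that tip as a libFuzzer-style harness. `sum_bytes`, `run_on_copy` and `MAX_INPUT` are hypothetical names made up for illustration; only `LLVMFuzzerTestOneInput` and `__STDC_NO_VLA__` are real. The input is copied into a buffer sized exactly to the input, so an out-of-bounds access lands outside the buffer instead of inside the slack of a larger fixed-size array.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_INPUT 4096 /* hypothetical cap so the VLA cannot exhaust the stack */

/* Hypothetical code under test: sums the first n bytes of buf. */
static unsigned sum_bytes(const uint8_t *buf, size_t n) {
    unsigned s = 0;
    for (size_t i = 0; i < n; i++)
        s += buf[i];
    return s;
}

/* Copies the input into an exactly-sized buffer and runs the code under test. */
static unsigned run_on_copy(const uint8_t *data, size_t size) {
    size_t n = size ? size : 1; /* a zero-length VLA is undefined behavior */
#ifndef __STDC_NO_VLA__
    uint8_t buf[n];             /* VLA sized to the input */
#else
    uint8_t *buf = malloc(n);   /* fallback when the compiler lacks VLAs */
    if (!buf)
        return 0;
#endif
    if (size)
        memcpy(buf, data, size);
    unsigned result = sum_bytes(buf, size);
#ifdef __STDC_NO_VLA__
    free(buf);
#endif
    return result;
}

/* libFuzzer-style entry point. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    if (size > MAX_INPUT)
        return 0;
    (void)run_on_copy(data, size);
    return 0;
}
```

The size cap is an added assumption, not part of the original tip: without one, a large input could overflow the stack.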

12.10.2025 09:45 👍 0 🔁 0 💬 0 📌 0

*correction: 0.5/0.5/2/4 for vector-scalar/immediate compares (0.5/2/4/8 for vector-vector)

25.09.2025 17:59 👍 0 🔁 0 💬 0 📌 0

For the scalar instructions:
* 6-issue: add/sub/lui/xor/sll/shNadd/zext/clz/cpop/min/rotl/rev8/bext/...
* 3-issue: load/store
* 2-issue: fadd/fmul/fmacc/fmin/fcvt
* 1-issue: mul/mulh/feq/flt
* pipelined: fsqrt/fdiv: ~8.5, div/rem: 12-16

25.09.2025 17:50 👍 0 🔁 0 💬 0 📌 0

My takeaway so far is to not be scared to use the segmented load/stores, and LMUL>1 permutes are good, but you probably want to avoid LMUL=8 ones when possible. I'll continue manually unrolling non-lane-crossing permutes. For LMUL>1 comparisons, it's better to use .vx/.vi over .vv when possible.

25.09.2025 17:50 👍 0 🔁 0 💬 1 📌 0

The vslide1up/vslide1down do scale perfectly, though, with 0.5/1/2/4. It's not in the benchmark, but I hope vslideup/vslidedown with immediate "1" also do.

We'll have to wait for the other microbenchmarks to get a more complete picture.

25.09.2025 17:50 👍 0 🔁 0 💬 1 📌 0

* Ovlt behavior isn't supported, but I don't really care much about it

The only bigger negative thing I've seen so far is that the vslideup/vslidedown instructions don't scale linearly or close to linearly with LMUL, even for a small immediate shift amount like "3".

25.09.2025 17:50 👍 0 🔁 0 💬 1 📌 0

* dual-issue vrgather, with good scaling: 0.5/1/8/30
* dual-issue vcompress, with OK scaling: 0.5/3/6/17 (I still think this could get close to linear)
* Fault-only-first loads seem to have no overhead
* Segmented load/stores look quite fast, even the more exotic ones like seg7

25.09.2025 17:50 👍 0 🔁 0 💬 1 📌 0

* Most instructions have an inverse throughput of 0.5/1/2/4 for LMUL=1/2/4/8, even vslide1up/down, 64-bit vmulh, viota, vpopc and integer reductions
* 0.5/0.5/1/2 for vector-scalar/immediate compares and 0.5/1/2/- for narrowing instructions (see "Microarchitecture speculations" section)

25.09.2025 17:49 👍 0 🔁 0 💬 2 📌 0

Tenstorrent decided to publish the first benchmark data for Ascalon's RVV implementation using the instruction throughput benchmark of my rvv-bench benchmark suite. <3

camel-cdr.github.io/rvv-bench-re...

Overall, the results look really good so far:

25.09.2025 17:49 👍 3 🔁 3 💬 1 📌 0

Third Way is an unfortunate name: en.wikipedia.org/wiki/Third_W...

24.08.2025 15:25 👍 0 🔁 0 💬 0 📌 0

So if you are currently involved with ISA-level decisions about inclusion of any pext/pdep-like instructions:

Please consider including SAG/inverse-SAG with bit-reversal of the goats.

No matter which of the two implementation methods you are using: All you need to do is not mask the goat bits.
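
A bit-at-a-time reference sketch of one reading of this, in C. The names `pext64` and `sag_reflect64` are hypothetical, and the orientation (sheep packed at the low end, goats filling from the top downward so they come out bit-reversed) is an assumption, not something the post specifies. Under that assumption, pext is just the SAG result with the goat bits masked off.

```c
#include <stdint.h>

/* Reference pext: gathers the bits of x selected by mask into the low bits. */
static uint64_t pext64(uint64_t x, uint64_t mask) {
    uint64_t out = 0;
    unsigned k = 0;
    for (unsigned i = 0; i < 64; i++) {
        if ((mask >> i) & 1) {
            out |= ((x >> i) & 1) << k;
            k++;
        }
    }
    return out;
}

/* Assumed SAG variant: sheep (mask bit set) are packed at the low end in
 * order; goats (mask bit clear) fill from bit 63 downward, so they appear
 * in reversed order at the high end. */
static uint64_t sag_reflect64(uint64_t x, uint64_t mask) {
    uint64_t out = 0;
    unsigned lo = 0, hi = 63;
    for (unsigned i = 0; i < 64; i++) {
        uint64_t bit = (x >> i) & 1;
        if ((mask >> i) & 1) {
            out |= bit << lo;
            lo++;
        } else {
            out |= bit << hi;
            hi--;
        }
    }
    return out;
}
```

For example, with `x = 0xB2` and `mask = 0xF0`, the low four bits of `sag_reflect64(x, mask)` equal `pext64(x, mask)`, and the goats occupy the top bits.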

25.07.2025 23:30 👍 4 🔁 3 💬 0 📌 0

It looks like the patent expires at the end of 2028. The earliest I could see an RVI extension ratified at this point is 2027, so it's definitely worth evaluating.

Also, the new diagrams are cool.

24.07.2025 18:03 👍 1 🔁 0 💬 1 📌 0

I've only watched the last hour so far, but I quite liked your take on null-terminated strings.
C really has to be understood with its history in mind.

20.07.2025 20:12 👍 0 🔁 0 💬 0 📌 0

Ventana’s Second Gen RISC V Processor for Data Center and Other High Performance | Greg Favor YouTube video by Ventana Micro

www.youtube.com/watch?v=OPgj...

11.07.2025 21:00 👍 1 🔁 0 💬 0 📌 0

Their V2 slides say that they have a macro-op cache equivalent in size to a regular 32 KiB icache.
It can store variable-length entries of up to 48 macro-ops, which can be fused from non-sequential instruction runs by collapsing taken branches.

11.07.2025 20:59 👍 0 🔁 0 💬 1 📌 0

RWT Forums - Real World Tech

TIL about Trace Cache: www.realworldtech.com/forum/?threa... (thread on Apple's Trace Cache)

Ventana's Veyron V2/V3 seem to also use something like a trace cache.

11.07.2025 20:59 👍 0 🔁 0 💬 1 📌 0

CBP2025 - Opening Remarks - Rami Sheikh YouTube video by Rami Sheikh

Ohh, the talk recordings are on YouTube: www.youtube.com/watch?v=1lwz...

28.06.2025 09:22 👍 1 🔁 0 💬 0 📌 0

The sixth Championship of Branch Prediction (CBP2025) happened a week ago: ericrotenberg.wordpress.ncsu.edu/cbp2025-work...

28.06.2025 06:36 👍 5 🔁 2 💬 1 📌 0

edu-sag/param.v at main · clairexen/edu-sag Educational 8-Bit Sheep-And-Goats (SAG) Verilog Reference IP - clairexen/edu-sag

I wrote a reference implementation for a SAG without bit reflection: github.com/clairexen/ed..., and I wrote a parametric SAG core for any bit width: github.com/clairexen/ed...

20.06.2025 16:04 👍 1 🔁 2 💬 0 📌 0

>>> import numpy as np
>>> lut=np.array([ord('a'),0,ord('e'),0,ord('i'),0,0,ord('o'),0,0,ord('u'),0,0,0,0,0], dtype=np.uint8)
>>> inp=np.frombuffer(b"test128aeiou72761xjs",dtype=np.uint8)
>>> lut[(inp&31)>>1]==inp
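
The same lookup as a scalar C sketch, for illustration; `vowel_lut`, `is_lower_vowel` and `count_lower_vowels` are hypothetical names. Each byte is mapped to a 16-entry table index via `(c & 31) >> 1`, and the byte is a lowercase vowel exactly when the table entry at that index equals the byte itself.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same table as the NumPy snippet: 'a','e','i','o','u' at indices 0,2,4,7,10. */
static const uint8_t vowel_lut[16] = {
    'a', 0, 'e', 0, 'i', 0, 0, 'o', 0, 0, 'u', 0, 0, 0, 0, 0
};

/* True if c is one of the lowercase ASCII vowels a, e, i, o, u. */
static bool is_lower_vowel(uint8_t c) {
    return vowel_lut[(c & 31) >> 1] == c;
}

/* Counts lowercase vowels in a NUL-terminated string. */
static size_t count_lower_vowels(const char *s) {
    size_t n = 0;
    for (; *s; s++)
        n += is_lower_vowel((uint8_t)*s);
    return n;
}
```

On the input from the snippet, `count_lower_vowels("test128aeiou72761xjs")` counts the `e` in `test` plus `aeiou`.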

16.06.2025 14:19 👍 0 🔁 0 💬 0 📌 0

4x 16-bit: 120 u^2 63% utilized, 5GHz met (49 slack)

2x 32-bit: 120 u^2 65% utilized, 5GHz met (52 slack)

1x 64-bit: 153 u^2 64% utilized, 5GHz met (14 slack)

So subsetting on SEW really doesn't make much sense compared to a .vx subset.

12.06.2025 11:46 👍 1 🔁 0 💬 1 📌 0

I got OpenROAD working and tested the bfly part of your implementation (so without decode) in a SIMD setup.

asap7, targeting 5GHz, 75% placement density and 50% utilization:

12.06.2025 11:46 👍 1 🔁 0 💬 1 📌 0

RVV benchmark SiFive X280

SiFive X280 RVV benchmarks: camel-cdr.github.io/rvv-bench-re...

Civil was so nice to run my RVV benchmark on the SiFive X280 cores on the Tenstorrent Blackhole.

06.06.2025 22:57 👍 0 🔁 0 💬 0 📌 0

I just had this problem on RISC-V where I didn't clobber the vector registers and some autovectorized surrounding code broke on a newer kernel version.

06.06.2025 18:31 👍 0 🔁 0 💬 0 📌 0

TIL you can't do forward compatible syscalls with inline assembly because the kernel can decide to clobber architectural state that was added after you wrote the code.

If you use svc with inline assembly, you have to explicitly clobber SVE registers.
Good luck doing this back in 2015 when you wrote

06.06.2025 18:31 👍 0 🔁 0 💬 1 📌 0

I suppose the instruction encoding space has to be considered as well.

06.06.2025 12:48 👍 0 🔁 0 💬 1 📌 0

Ah, I understand my mistake now.
My mental model had the element order between the stages as fixed, which is why I didn't see the equivalence of the graphs.

06.06.2025 12:44 👍 1 🔁 0 💬 0 📌 0

Guess I'll step up: github.com/camel-cdr/bf...

And, yes, I wasn't the first person to write an optimizing brainfuck interpreter in the C preprocessor; that honor goes to kotha.

06.06.2025 11:26 👍 1 🔁 0 💬 0 📌 0
