> The answer, if it’s not obvious from my tone already:), is 8%.
Not if the data is small and in cache.
> The performant route with AVX-512 would probably include the instruction vpconflictd, but I couldn’t really find any elegant way to use it.
I think the best way to do this is to duplicate sum_r and count 16 times, so each lane has a separate accumulation bucket and there can't be any conflicts. After the loop, you quickly do a sum reduction over the 16 buckets (rough sketch below).
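Something like this in plain C, as a rough sketch of the idea (the real thing would use AVX-512 gathers/scatters; LANES/NBUCKETS and the key/value layout are made-up stand-ins for the actual problem, only sum_r and count follow the names above):

```c
#include <stddef.h>
#include <string.h>

#define LANES    16
#define NBUCKETS 256

void accumulate(const unsigned char *keys, const float *vals, size_t n,
                float sum_r[NBUCKETS], unsigned count[NBUCKETS])
{
    /* One private copy per lane: two lanes can never hit the same
     * bucket in the same step, so no conflict detection is needed.
     * (static to keep 32KB off the stack; not reentrant.) */
    static float    lane_sum[LANES][NBUCKETS];
    static unsigned lane_cnt[LANES][NBUCKETS];
    memset(lane_sum, 0, sizeof lane_sum);
    memset(lane_cnt, 0, sizeof lane_cnt);

    size_t i = 0;
    for (; i + LANES <= n; i += LANES)
        for (int l = 0; l < LANES; l++) {   /* the vectorizable body */
            lane_sum[l][keys[i + l]] += vals[i + l];
            lane_cnt[l][keys[i + l]] += 1;
        }
    for (; i < n; i++) {                    /* scalar tail */
        lane_sum[0][keys[i]] += vals[i];
        lane_cnt[0][keys[i]] += 1;
    }

    /* The final sum reduction over the 16 buckets per key. */
    for (int b = 0; b < NBUCKETS; b++) {
        sum_r[b] = 0.0f;
        count[b] = 0;
        for (int l = 0; l < LANES; l++) {
            sum_r[b] += lane_sum[l][b];
            count[b] += lane_cnt[l][b];
        }
    }
}
```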
Yeah, N is big enough that the entire data set isn't in cache, but the memory access pattern here is the next best thing: totally linear, predictable access. I remember seeing an L1d cache hit rate of around 94%+.
Thanks for pointing that out! I tried the borrowing trick from the previous segment (it seemed pretty obvious), but for some reason it failed: I couldn't avoid at least one conditional... will try again.
For my problem described under the link above, the suggestions above do indeed eliminate the branches, but at the same time the extra instructions slow things down about as much as my initial branches did. Meaning, detecting newlines runs at almost 100% of memory throughput, but detecting the first non-space drops the speed to a bit above 50% of bandwidth.
For testing, I use a custom qemu plugin to calculate the dynamic instruction count, dynamic uop count, and dynamic instruction size.
Every instruction with multiple register writebacks was counted as one uop per writeback, and to make the results more comparable, SIMD was disabled.
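A stripped-down sketch of such a plugin against the stock QEMU TCG plugin API (qemu-plugin.h), counting only dynamic instructions and instruction bytes; the per-writeback uop accounting is omitted here, and a single-threaded user-mode run is assumed (the counters aren't atomic):

```c
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <qemu-plugin.h>

QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;

static uint64_t insn_count, insn_bytes;

/* Runs once per executed instruction; its size rides in udata. */
static void vcpu_insn_exec(unsigned int vcpu_index, void *udata)
{
    insn_count++;
    insn_bytes += (uintptr_t)udata;
}

/* Runs at translation time: hook every instruction in the block. */
static void vcpu_tb_trans(qemu_plugin_id_t id, struct qemu_plugin_tb *tb)
{
    size_t n = qemu_plugin_tb_n_insns(tb);
    for (size_t i = 0; i < n; i++) {
        struct qemu_plugin_insn *insn = qemu_plugin_tb_get_insn(tb, i);
        uintptr_t size = qemu_plugin_insn_size(insn);
        qemu_plugin_register_vcpu_insn_exec_cb(insn, vcpu_insn_exec,
                                               QEMU_PLUGIN_CB_NO_REGS,
                                               (void *)size);
    }
}

static void plugin_exit(qemu_plugin_id_t id, void *p)
{
    char buf[128];
    snprintf(buf, sizeof buf, "insns: %" PRIu64 ", bytes: %" PRIu64 "\n",
             insn_count, insn_bytes);
    qemu_plugin_outs(buf);
}

QEMU_PLUGIN_EXPORT int qemu_plugin_install(qemu_plugin_id_t id,
                                           const qemu_info_t *info,
                                           int argc, char **argv)
{
    qemu_plugin_register_vcpu_tb_trans_cb(id, vcpu_tb_trans);
    qemu_plugin_register_atexit_cb(id, plugin_exit, NULL);
    return 0;
}
```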
I used this setup to run self-compiling single-file versions of chibicc (assembling) and tinycc (generating an object file), which are small C compilers of 9K and 24K LOC respectively.
Both compilers were cross-compiled using clang-22 and were benchmarked cross-compiling themselves to x86.
Let's look at the impact of -ftrapv first.
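For context: -ftrapv makes signed integer overflow trap, so the compiler has to guard every signed add/sub/mul. Roughly what it inserts, written by hand with the GCC/Clang checked-arithmetic builtin:

```c
#include <stdlib.h>

/* Roughly what -ftrapv turns a plain `a + b` into: a checked add
 * that aborts on signed overflow. The extra compare-and-branch per
 * arithmetic op is where the uop increase below comes from. */
int add_trapv(int a, int b)
{
    int r;
    if (__builtin_add_overflow(a, b, &r))
        abort();
    return r;
}
```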
In chibicc at O3/O2/Os, -ftrapv increased the dynamic uops for RISC-V by 5.3%/5.1%/6.7%, and for ARM by 5.1%/5.0%/6.4%.
Interestingly, in tinycc it only increased for RISC-V by 1.6%/1.0%/1.0%, while ARM increased slightly more at 1.6%/2.0%/1.3%.
In terms of dynamic instruction count, ARM needed to execute 6%/15% fewer instructions than RISC-V for chibicc/tinycc.
Looking at the uops, RISC-V needs to execute 6% more uops in tinycc, but ARM needs to execute 0.5% more uops in chibicc.
The dynamic instruction size, which estimates the pressure on icache and fetch bandwidth, was 24%/10% lower in RISC-V for chibicc/tinycc.
Note that this did not model any instruction fusion for RISC-V, and it only treated incrementing (pre/post-indexed) loads and load pairs as multiple uops (to mirror Apple Silicon).
If the only fusion pair you implement is adjacent compressed sp-relative stores, then RISC-V ends up with a lower uop count for both programs.
They are trivial to implement because you can just interpret the two adjacent 16-bit instructions as a single 32-bit instruction, and compilers always generate them next to each other, in sorted order, in function prologue code.
You can do this directly in your RVC expander; it only adds minimal additional delay (zero with a trick), which is constant regardless of decode width.
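A sketch of the detection, assuming the RV64C encoding of C.SDSP (quadrant 2, funct3 = 111); illustrative C rather than actual decoder logic:

```c
#include <stdbool.h>
#include <stdint.h>

/* C.SDSP (RV64C): op = 10 (quadrant 2), funct3 = 111. */
static bool is_c_sdsp(uint16_t insn)
{
    return (insn & 0x3) == 0x2 && (insn >> 13) == 0x7;
}

/* True when a 32-bit fetch window holds two adjacent sp-relative
 * compressed stores, which the decoder may emit as a single
 * store-pair uop (the AArch64 stp equivalent). */
static bool fusible_store_pair(uint32_t window)
{
    return is_c_sdsp((uint16_t)window) &&
           is_c_sdsp((uint16_t)(window >> 16));
}
```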
> You wouldn't want an instruction with up to 13 destinations in high performance designs anyways.
Why not? Code density matters even in high-performance designs, although I guess the "millicode routines" can help with that somewhat. Still, the ordering of the stores/loads is undefined, and they are allowed to be re-done however many times, so... it shouldn't be onerous to implement? Expanding it into μops during the decode stages seems straightforward.
> Expanding it into μops during the decoding stages seems straightforward.
I wouldn't say so, because if you want to be able to crack an instruction into up to N uops, the second instruction could now be placed in any slot from the 2nd to the (N+1)th, and you have to build huge shuffle hardware to support this.
Apple, for example, can only crack instructions that generate up to 3 μops at decode (or before rename); anything beyond that needs to be microcoded and stalls decoding of other instructions.
> Can you explain to me why, exactly, would you ever make jal take a register operand, instead of using a fixed link register and putting the spare bits into the address immediate?
AFAIK, the reason RISC-V supports alternative link registers is that it allows for efficient -msave-restore (IIRC the millicode save/restore routines are called with t0 as the link register, e.g. jal t0, __riscv_save_0), keeps the encoding orthogonal to LUI/AUIPC, and using the smaller immediate didn't impact codegen much.