> The answer, if it’s not obvious from my tone already:), is 8%.
Not if the data is small and in cache.
> The performant route with AVX-512 would probably include the instruction vpconflictd, but I couldn’t really find any elegant way to use it.
I think the best way to do this is to duplicate sum_r and count 16 times, so each lane has a separate accumulation bucket and there can't be any conflicts. After the loop, you quickly do a sum reduction over the 16 buckets (rough sketch below).
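Something like this in plain C, as a rough sketch of the idea (the real thing would use AVX-512 gathers/scatters; LANES/NBUCKETS and the key/value layout are made-up stand-ins for the actual problem, only sum_r and count follow the names above):

```c
#include <stddef.h>
#include <string.h>

#define LANES    16
#define NBUCKETS 256

void accumulate(const unsigned char *keys, const float *vals, size_t n,
                float sum_r[NBUCKETS], unsigned count[NBUCKETS])
{
    /* One private copy per lane: two lanes can never hit the same
     * bucket in the same step, so no conflict detection is needed.
     * (static to keep 32KB off the stack; not reentrant.) */
    static float    lane_sum[LANES][NBUCKETS];
    static unsigned lane_cnt[LANES][NBUCKETS];
    memset(lane_sum, 0, sizeof lane_sum);
    memset(lane_cnt, 0, sizeof lane_cnt);

    size_t i = 0;
    for (; i + LANES <= n; i += LANES)
        for (int l = 0; l < LANES; l++) {   /* the vectorizable body */
            lane_sum[l][keys[i + l]] += vals[i + l];
            lane_cnt[l][keys[i + l]] += 1;
        }
    for (; i < n; i++) {                    /* scalar tail */
        lane_sum[0][keys[i]] += vals[i];
        lane_cnt[0][keys[i]] += 1;
    }

    /* The final sum reduction over the 16 buckets per key. */
    for (int b = 0; b < NBUCKETS; b++) {
        sum_r[b] = 0.0f;
        count[b] = 0;
        for (int l = 0; l < LANES; l++) {
            sum_r[b] += lane_sum[l][b];
            count[b] += lane_cnt[l][b];
        }
    }
}
```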
Yeah, N is big enough that the entire data set isn't in cache, but the memory access pattern here is the next best thing: totally linear, predictable access. I remember seeing an L1d cache hit rate of around 94%+.
Thanks for pointing that out! I tried the borrowing trick from the previous segment (it seemed pretty obvious), but for some reason it failed: I couldn't avoid at least one conditional... will try again.
For my problem described under the link above, the suggestions above do indeed eliminate the branches, but at the same time the extra instructions slow things down about as much as my initial branches did. Meaning, detecting newlines runs at almost 100% of memory throughput, but detecting the first non-space drops the speed to a bit above 50% of bandwidth.
For testing, I use a custom qemu plugin to calculate the dynamic instruction count, dynamic uop count, and dynamic instruction size.
Every instruction with multiple register writebacks was counted as one uop per writeback, and to make the results more comparable, SIMD was disabled.
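A stripped-down sketch of such a plugin against the stock QEMU TCG plugin API (qemu-plugin.h), counting only dynamic instructions and instruction bytes; the per-writeback uop accounting is omitted here, and a single-threaded user-mode run is assumed (the counters aren't atomic):

```c
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <qemu-plugin.h>

QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;

static uint64_t insn_count, insn_bytes;

/* Runs once per executed instruction; its size rides in udata. */
static void vcpu_insn_exec(unsigned int vcpu_index, void *udata)
{
    insn_count++;
    insn_bytes += (uintptr_t)udata;
}

/* Runs at translation time: hook every instruction in the block. */
static void vcpu_tb_trans(qemu_plugin_id_t id, struct qemu_plugin_tb *tb)
{
    size_t n = qemu_plugin_tb_n_insns(tb);
    for (size_t i = 0; i < n; i++) {
        struct qemu_plugin_insn *insn = qemu_plugin_tb_get_insn(tb, i);
        uintptr_t size = qemu_plugin_insn_size(insn);
        qemu_plugin_register_vcpu_insn_exec_cb(insn, vcpu_insn_exec,
                                               QEMU_PLUGIN_CB_NO_REGS,
                                               (void *)size);
    }
}

static void plugin_exit(qemu_plugin_id_t id, void *p)
{
    char buf[128];
    snprintf(buf, sizeof buf, "insns: %" PRIu64 ", bytes: %" PRIu64 "\n",
             insn_count, insn_bytes);
    qemu_plugin_outs(buf);
}

QEMU_PLUGIN_EXPORT int qemu_plugin_install(qemu_plugin_id_t id,
                                           const qemu_info_t *info,
                                           int argc, char **argv)
{
    qemu_plugin_register_vcpu_tb_trans_cb(id, vcpu_tb_trans);
    qemu_plugin_register_atexit_cb(id, plugin_exit, NULL);
    return 0;
}
```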
I used this setup to run self-compiling single-file versions of chibicc (assembling) and tinycc (generating an object file), which are small C compilers of 9K and 24K LOC respectively.
Both compilers were cross-compiled using clang-22 and were benchmarked cross-compiling themselves to x86.
Let's look at the impact of -ftrapv first.
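For context: -ftrapv makes signed integer overflow trap, so the compiler has to guard every signed add/sub/mul. Roughly what it inserts, written by hand with the GCC/Clang checked-arithmetic builtin:

```c
#include <stdlib.h>

/* Roughly what -ftrapv turns a plain `a + b` into: a checked add
 * that aborts on signed overflow. The extra compare-and-branch per
 * arithmetic op is where the uop increase below comes from. */
int add_trapv(int a, int b)
{
    int r;
    if (__builtin_add_overflow(a, b, &r))
        abort();
    return r;
}
```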
In chibicc at O3/O2/Os, -ftrapv increased the dynamic uops for RISC-V by 5.3%/5.1%/6.7%, and for ARM by 5.1%/5.0%/6.4%.
Interestingly, in tinycc it only increased for RISC-V by 1.6%/1.0%/1.0%, while ARM increased slightly more at 1.6%/2.0%/1.3%.
In terms of dynamic instruction count, ARM needed to execute 6%/15% fewer instructions than RISC-V for chibicc/tinycc.
Looking at the uops, RISC-V needs to execute 6% more uops in tinycc, but ARM needs to execute 0.5% more uops in chibicc.
The dynamic instruction size, which estimates the pressure on icache and fetch bandwidth, was 24%/10% lower in RISC-V for chibicc/tinycc.
Note that this did not model any instruction fusion for RISC-V, and it only treated incrementing (pre/post-indexed) loads and load pairs as multiple uops (to mirror Apple Silicon).
If the only fusion pair you implement is adjacent compressed sp-relative stores, then RISC-V ends up with a lower uop count for both programs.
They are trivial to implement because you can just interpret the two adjacent 16-bit instructions as a single 32-bit instruction, and compilers always generate them next to each other, in sorted order, in function prologue code.
You can do this directly in your RVC expander; it only adds minimal additional delay (zero with a trick), which is constant regardless of decode width.
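A sketch of the detection, assuming the RV64C encoding of C.SDSP (quadrant 2, funct3 = 111); illustrative C rather than actual decoder logic:

```c
#include <stdbool.h>
#include <stdint.h>

/* C.SDSP (RV64C): op = 10 (quadrant 2), funct3 = 111. */
static bool is_c_sdsp(uint16_t insn)
{
    return (insn & 0x3) == 0x2 && (insn >> 13) == 0x7;
}

/* True when a 32-bit fetch window holds two adjacent sp-relative
 * compressed stores, which the decoder may emit as a single
 * store-pair uop (the AArch64 stp equivalent). */
static bool fusible_store_pair(uint32_t window)
{
    return is_c_sdsp((uint16_t)window) &&
           is_c_sdsp((uint16_t)(window >> 16));
}
```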
> You wouldn't want an instruction with up to 13 destinations in high performance designs anyways.
Why not? Code density matters even in high-performance designs, although I guess the "millicode routines" can help with that somewhat. Still, the ordering of the stores/loads is undefined, and they are allowed to be re-done however many times, so... it shouldn't be onerous to implement? Expanding it into μops during the decode stages seems straightforward.
> Expanding it into μops during the decoding stages seems straightforward.
I wouldn't say so, because if you want to be able to crack an instruction into up to N uops, the second instruction could now be placed in any slot from the 2nd to the (N+1)th, and you have to build huge shuffle hardware to support this.
Apple, for example, can only crack instructions that generate up to 3 μops at decode (or before rename); anything beyond that needs to be microcoded and stalls decoding of other instructions.
> Can you explain to me why, exactly, would you ever make jal take a register operand, instead of using a fixed link register and putting the spare bits into the address immediate?
AFAIK, the reason RISC-V supports alternative link registers is that it allows for efficient -msave-restore (IIRC the millicode save/restore routines are called with t0 as the link register, e.g. jal t0, __riscv_save_0), keeps the encoding orthogonal to LUI/AUIPC, and using the smaller immediate didn't impact codegen much.