Hacker News

> An insane programmer can set LMUL to a value higher than 1, making vector instructions address contiguous blocks of registers.

I hope that most code won't use LMUL=1, but LMUL>1 when possible; otherwise we'd leave performance on the table.

The only case I can currently foresee where using LMUL=1 and manually unrolling instead will almost always be beneficial is vrgather operations that don't need to cross between registers in a register group (e.g. byte swapping).
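For illustration, a minimal sketch of such an in-register byte swap (reversing the bytes of 32-bit elements) at LMUL=1; the register assignments and the precomputed index vector are my assumptions, not a reference implementation:

```asm
# Byte-swap 32-bit words at LMUL=1. v8 is assumed to already hold the
# precomputed byte indices i^3, i.e. 3,2,1,0, 7,6,5,4, ... so every
# gather lane reads from within the same single register.
    vsetvli t0, a2, e8, m1, ta, ma   # SEW=8, LMUL=1
    vle8.v      v1, (a0)             # load input bytes
    vrgather.vv v2, v1, v8           # permute bytes within the register
    vse8.v      v2, (a1)             # store the swapped result
```

At LMUL>1 the same vrgather would have to support indices reaching across the whole register group, which is why it scales poorly there.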

> As with all new architectural features, we’ll have to wait and see how useful RISC-V’s LMUL will be.

RVV without LMUL would be a lot worse; the entire extension is built around the LMUL concept, and it's really nice to work with imo.



> The only case I can currently foresee where using LMUL=1 and manually unrolling instead will almost always be beneficial is vrgather operations that don't need to cross between registers in a register group (e.g. byte swapping).

What about algorithms where register pressure is an issue?

I think the problem with LMUL is that it assumes you always want to unroll the innermost dimension (where the vector loads are stride-1). That's usually the last dimension I try to unroll, if there are any registers left over. If there is any sharing of data across any other dimension in the algorithm, it's better to tile/unroll those first.

Of course, for a simple algorithm, there will be registers left over. But I think more interesting algorithms will struggle on RVV if you must use LMUL > 1 for performance.


My favorite example of big LMUL is matmul. You can do an entire GEMM microkernel in about 8 instructions with LMUL=4 by using a 7x4 kernel. The 32 vector registers turn into 8 register groups with LMUL=4: 7 of them end up storing your C values, 1 stores your A values, and you put the B values in scalar registers. Thus your entire kernel ends up being one 4x-wide vector load and seven 4x-wide FMA instructions.
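As a rough sketch of that kernel (register allocation and operand layout are my guesses, not a reference implementation):

```asm
# 7x4 GEMM microkernel sketch with LMUL=4, e32: the 32 vector registers
# form 8 groups of 4 (v0, v4, ..., v28). v0 holds a strip of A, the
# other seven groups accumulate rows of C, and B comes from scalar FP regs.
    vsetvli t0, a2, e32, m4, ta, ma
    vle32.v   v0, (a0)           # the single LMUL=4 load of A
    flw       f0, 0(a1)          # B values loaded into scalars f0..f6
    # ... flw f1..f6 similarly ...
    vfmacc.vf v4,  f0, v0        # C row 0 += B[0] * A
    vfmacc.vf v8,  f1, v0        # C row 1 += B[1] * A
    # ... five more vfmacc.vf for rows 2-6 (v12, v16, v20, v24, v28) ...
```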


> What about algorithms where register pressure is an issue

Then you'll probably saturate the processor without using a larger LMUL, but I think many algorithms can work with LMUL=2, without running out of registers.


LMUL (and especially fractional LMUL) isn't for performance, it's for kernels with mixed-element sizes, to maximise the number of variables (and elements) you can keep in registers without spilling.

Being able to use LMUL as a way to get the effect of unrolling and hide the pointer bumps and loop control on simple loops on narrow processors, without expanding the code, is just a bonus.
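A small sketch of the mixed-width case (register names and layout are assumptions): accumulating 8-bit data into 32-bit sums, where the narrow operand effectively occupies only a quarter of a register, so source and destination stay balanced without spilling:

```asm
# Widening accumulate: 8-bit input, 32-bit sums, no register spills.
    vsetvli t0, a2, e32, m1, ta, ma  # SEW=32, LMUL=1
    vle8.v     v1, (a0)              # EEW=8 load, so effective EMUL=1/4
    vsext.vf4  v2, v1                # widen 8-bit -> 32-bit (one register)
    vadd.vv    v4, v4, v2            # 32-bit accumulate
```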


> The only case I can currently foresee where using LMUL=1 and manually unrolling instead will almost always be beneficial is vrgather operations that don't need to cross between registers in a register group (e.g. byte swapping).

This is somewhat of a major problem with some kernels. RVV is quite spartan in the permute options it offers, often forcing you to use vrgather for many things. And as you suggest, vrgather doesn't scale well, so sticking to LMUL=1 seems sensible in a lot of cases. (this will also be a problem for RISC-V implementations that aim for longer vectors)

Honestly LMUL>1 could be more useful if more permute instructions were offered, particularly a restricted shuffle (like VPSHUFB on AVX or TBLQ on SVE2.1).

Stuff like vector constants can be more costly with LMUL>1, since they must consume more than one register.

On big cores, the only benefit LMUL>1 gives (other than design concepts like how widening operations work, or shuffling across vectors etc.) is a code size reduction. Which is quite a dubious benefit for a somewhat complex feature.

Maybe smaller cores can extract more out of it, but it's not something I'm too knowledgeable about.


One place where LMUL makes a ton of sense is big-little hybrid designs. Your big cores can get wider execution units that can run LMUL=2 instructions natively, while your little cores break them up into LMUL=0.5 pieces and process them 4x slower. This lets you avoid the issue Intel's 12th gen ran into, where the consumer chips lost AVX-512 capability because the little cores only supported 256-bit vectors.

Also code size reduction can be pretty nice. It doesn't show up well in microbenchmarks but lowering load on the front end of your CPU is always nice. It means you get to save a little power or use the area for extra execution units.


> Your big cores can get wider execution units that can run LMUL=2 instructions natively, while your little cores break them up into LMUL=0.5 pieces and process them 4x slower

RVV spec doesn't allow LMUL to vary across cores in a hybrid design, so no, that doesn't work. RVV doesn't do anything to help with hybrid cores with differing widths.

> It means you get to save a little power or use the area for extra execution units.

Power, maybe, but I don't see the rest being likely. For general purpose cores, you're going to need a wide enough front end to support scalar code, which tends to have more execution units than vector.


I'm not saying that the supported LMUL varies. I'm saying that the different cores will break up the same instructions into bigger or smaller chunks for their execution units.


There's nothing inherent with RVV or LMUL which makes this any easier/harder than other ISAs. Intel could design their little cores to break AVX-512 instructions into 4x128b parts if they felt like it, for example.


That's technically true, but it means that the best way of handling vector instructions on little cores is something you were going to do anyway.


Sorry, I don't quite understand you there - what is the "something" you speak of, and who does "you" refer to?


This LMUL thing sounds like a total antipattern. Does the OS have to handle it manually on context switches, or can it just save and restore the process's LMUL state with all the normal registers? Sounds like a feature that could get ugly in embedded applications.


"LMUL state" is three bits (7 possible values) of thread-wide config that just change how operations operate with the same vector registers; similar to say the FP dynamic rounding mode, which also needs to be saved & restored across context switches.


LMUL is encoded in the vtype CSR, alongside the element width (SEW) and the tail/mask agnosticism policy; it can be easily read and set.
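For instance (a sketch; register choices are arbitrary), the usual save/restore idiom reads the vl and vtype CSRs and restores both with a single vsetvl:

```asm
# Save and restore the vector configuration around a context switch.
    vsetvli t0, a0, e32, m2, ta, ma  # set SEW=32, LMUL=2, policies
    csrr    t1, vtype                # read the packed vtype CSR
    csrr    t2, vl                   # and the current vector length
    # ... context switch ...
    vsetvl  x0, t2, t1               # restore vl and vtype together
```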



