More

dsharlet · 2025-09-05T18:14:00 1757096040

The problem I've seen is this: in order to get good performance, no matter what language you use, you need to understand the hardware and how to use the instructions you want to use. It's not enough to know that you want to use tensor cores or whatever, you also need to understand the myriad low level requirements they have.

Most people that know this kind of thing don't get much value out of using a high level language to do it, and it's a huge risk because if the language fails to generate something that you want, you're stuck until a compiler team fixes and ships a patch which could take weeks or months. Even extremely fast bug fixes are still extremely slow on the timescales people want to work on.

I've spent a lot of my career trying to make high level languages for performance work well, and I've basically decided that the sweet spot for me is C++ templates: I can get the compiler to generate a lot of good code concisely, and when it fails the escape hatch of just writing some architecture specific intrinsics is right there whenever it is needed.

adgjlsfhk1 · 2025-09-05T18:42:07 1757097727

The counterpoint to this is that having a language that has a graceful slide between python like flexibility and hand optimized assembly is really useful. The thing I like most about Julia is it is very easy to both write fast somewhat sloppy code (e.g. for exploring new algorithms), but then you can go through and tune it easily for maximal performance and get as fast as anything out there.

wolvesechoes · 2025-09-05T20:19:56 1757103596

> easily for maximal performance and get as fast as anything out there.

Optimizing Julia is much harder than optimizing Fortran or C.

postflopclarity · 2025-09-06T14:20:07 1757168407

for equal LOC, sure. for equal semantics, less true

dsharlet · on April 28, 2024

I wrote a library in C++ (I know, probably a non-starter for most reading this) that I think does most of what you want, as well as some other requests in this thread (generalized to more than just multiply-add): https://github.com/dsharlet/array?tab=readme-ov-file#einstei....

A matrix multiply written with this looks like this:

    enum { i = 2, j = 0, k = 1 };
    auto C = make_ein_sum<float, i, j>(ein<i, k>(A) * ein<k, j>(B));

Where A and B are 2D arrays. This is strongly typed all the way through, so you get a lot of feedback at compile time, and C is 2D array object at compile time. It is possible to make C++ template errors reasonable with enable_if and the like, this works well-ish on clang, but not so well in GCC (YMMV).

This library is a lot less automated than most other einsum implementations. You have to explicitly control the loop ordering (in the example above, the `j` loop is innermost because it is loop 0). If you build a good loop order for your shapes, the compiler will probably autovectorize your inner loop, and you'll get pretty good performance. Control over the loop ordering is in general a useful tool, but it's probably a lot lower level than most users want.

dsharlet · on Jan 3, 2024

This is really overstating how hard it is to compete with matrix multiply libraries. The main reason those libraries are so big and have had so much work invested in them is their generality: they're reasonably fast for almost any kind of inputs.

If you have a specific problem with constraints you can exploit (e.g. known fixed dimensions, sparsity patterns, data layouts, type conversions, etc.), it's not hard at all to beat MKL, etc... if you are using a language like C++. If you are using python, you have no chance.

It isn't even necessarily that different from a few nested loops. Clang is pretty damn good at autovectorizing, you just have to be a little careful about how you write the code.

geysersam · on Jan 3, 2024

If you need a really fast compiled inner loop then you can often implement it in Python using Numba. Using that you can easily implement something like sparse matrix multiplication in Python.

p-e-w · on Jan 3, 2024

> If you are using python, you have no chance.

Of course you do. Every special-case multiplication algorithm you might need already has an optimized implementation that you can just `pip install`, and move on with what you're actually working on.

The whole scientific computing world runs on Python. Straightforward numerics code using NumPy tends to murder C/C++ code in regard to performance, unless that code is written by people who make a living hand-optimizing computational routines.

SideQuark · on Jan 3, 2024

> The whole scientific computing world runs on Python

If you ignore the majority of scientific code running on supercomputers doing most of science in C++ and Fortran.

Even in areas where python is used, the majority of the compute runs on C/C++/Fortran, with a little python as glue.

If you think numpy (written in c/c++) murders c/c++ code, you should learn about HPC, where really high performance happens. They don't use numpy.

bjourne · on Jan 3, 2024

> This is really overstating how hard it is to compete with matrix multiply libraries.

I'll file this under "talk is cheap". :) I tried it last year and got within 50% of BLAS. Getting above that is tons of work. Which you have to repeat for every processor model, NUMA, and every combination of matrix type (long thin, short wide, etc).

dsharlet · on Jan 3, 2024

This gets to 90% of BLAS: https://github.com/dsharlet/array/blob/38f8ce332fc4e26af0832...

The less involved versions still get ~70%.

But this is also quite general. I’m claiming you can beat BLAS if you have some unique knowledge of the problem that you can exploit. For example, some kinds of sparsity can be implemented within the above example code yet still far outperform the more general sparsity supported by MKL and similar.

bjourne · on Jan 4, 2024

I don't believe you. OpenBLAS is multithreaded but the code you posted is single-threaded. The inner kernel isn't very well optimized either. So, no way.

dsharlet · on Jan 4, 2024

I should have mentioned somewhere, I disabled threading for OpenBLAS, so it is comparing one thread to one thread. Parallelism would be easy to add, but I tend to want the thread parallelism outside code like this anyways.

As for the inner loop not being well optimized... the disassembly looks like the same basic thing as OpenBLAS. There's disassembly in the comments of that file to show what code it generates, I'd love to know what you think is lacking! The only difference between the one I linked and this is prefetching and outer loop ordering: https://github.com/dsharlet/array/blob/master/examples/linea...

bjourne · on Jan 4, 2024

So I actually tested your code: https://gist.github.com/bjourne/c2d0db48b2e50aaadf884e4450c6...

On my machine single-threaded OpenBLAS (called via NumPy) multiplies two single precision 4096x4096 matrices in 0.95 seconds. Your code takes over 30 seconds when compiled with clang++. And yes I used -O3, -march=native, and all that jazz. Btw, your code crashes g++ which doesn't necessarily mean that it is incorret, but it indicates that the code may be difficult for the compiler to optimize. For comparison, my own matrix multiplication code (https://github.com/bjourne/c-examples/blob/master/libraries/...) run in single-threaded mode takes 0.89 seconds. Which actually beats OpenBLAS, but OpenBLAS retakes the lead for larger arrays when multi-threading is added. You can look at my code for how to write a decent inner kernel. Writing it in pure C without intrinsics and hoping that the compiler will optimize it definitely will not work.

It also is not true that "Parallelism would be easy to add". Unless your algorithm is designed from the start to exploit multi-threading, attempting to bolt it on later will not yield good performance.

dsharlet · on Jan 4, 2024

The makefile asks for -O2 with clang. I find that -O3 almost never helps in clang. (In gcc it does.)

Here's what I see:

   $ clang++ --version
   clang version 18.0.0

   $ time make bin/matrix
   mkdir -p bin
   clang++ -I../../include -I../ -o bin/matrix matrix.cpp  -O2 -march=native -ffast-math -fstrict-aliasing -fno-exceptions -DNDEBUG -DBLAS  -std=c++14 -Wall -lstdc++ -lm -lblas
   1.25user 0.29system 0:02.74elapsed 56%CPU (0avgtext+0avgdata 126996maxresident)k
   159608inputs+120outputs (961major+25661minor)pagefaults 0swaps

   $ bin/matrix
   ...
   reduce_tiles_z_order time: 3.86099 ms, 117.323 GFLOP/s
   blas time: 0.533486 ms, 849.103 GFLOP/s

   $ OMP_NUM_THREADS=1 bin/matrix
   ...
   reduce_tiles_z_order time: 3.89488 ms, 116.303 GFLOP/s
   blas time: 3.49714 ms, 129.53 GFLOP/s

My inner loop in perf: https://gist.github.com/dsharlet/5f51a632d92869d144fc3d6ed6b... BLAS inner loop in perf (a chunk of it, it is unrolled massively): https://gist.github.com/dsharlet/5b2184a285a798e0f0c6274dc42...

Despite being on a current-ish version of clang, I've been getting similar results from clang for years now.

Anyways, I'm not going to debate any further. It works for me :) If you want to keep writing code the way you have, go for it.

bjourne · on Jan 4, 2024

-O2 did improve performance significantly, but it's still 0.7 s for NumPy and 5.1 seconds for your code on 4096x4096 matrices. Either you're using a slow version of BLAS or you are benchmarking with matrices that are comparatively tiny (384x1536 is nothing).

dsharlet · on Jan 4, 2024

BLAS is getting almost exactly 100% of the theoretical peak performance of my machine (CPU frequncy * 2 fmadd/cycle * 8 lanes * 2 ops/lane), it's not slow. I mean, just look at the profiler output...

You're probably now comparing parallel code to single threaded code.

bjourne · on Jan 4, 2024

No, multi-threaded OpenBLAS improves performance to 0.15s.

dsharlet · on Jan 4, 2024

I dunno man. My claim was that for specific cases with unique properties, it's not hard to beat BLAS, without getting too exotic with your code. BLAS doesn't have routines for multiplies with non-contiguous data, various patterns of sparsity, mixed precision inputs/outputs, etc. The example I gave is for a specific case close-ish to the case I cared about.

You're changing it to a very different case, presumably one that you cared about, although 4096x4096 is oddly square and a very clean power of 2... I said right at the beginning of this long digression that what is hard about reproducing BLAS is its generality.

bjourne · on Jan 5, 2024

When I run your benchmark with matrices larger than 1024x1024 it errors out in the verification step. Since your implementation isn't even correct I think my original point about OpenBLAS being extremely difficult to replicate still stands.

lifthrasiir · on Jan 3, 2024

You need to have enough experience to be able to be a little careful though. This is generally true for most languages, loop unrolling works even better in Python for example, but many Python programmers aren't even aware of this possibility.

dsharlet · on Oct 19, 2023

I tried it out and compared it to C++ at the last release. Here was what I found: https://github.com/dsharlet/mojo_comments

Some of the issues I pointed out there are pretty low hanging fruit for LLVM, so they may have improved a lot in more current releases.

snowAbstraction · on Oct 19, 2023

I skimmed your post and I wonder if mojo is focusing on such small 512x512 matrices? What is your thinking on generalizing your results for larger matrices?

dsharlet · on Oct 19, 2023

I think for a compiler it makes sense to focus on small matrix multiplies, which are a building block of larger matrix multiplies anyways. Small matrix multiplies emphasize the compiler/code generation quality. Even vanilla python overhead might be insignificant when gluing small-ish matrix multiplies together to do a big multiply.

dsharlet · on Oct 5, 2023

I don't think it's true at all, you can just write C/C++ with SIMD intrinsics, just like you can on ARM or x86, and the instruction set is mostly awesome. OpenCL would just be an extra layer to get in the way. If you want to just run some existing OpenCL code, I guess that would be nice, but I doubt any OpenCL code written for a GPU would actually run well on Hexagon anyways.

The article also complains about VLIW in the same paragraph, but I don't think VLIW makes things harder, it just makes problems more obvious. If you write ARM or x86 code that has dependencies between every instruction, that's going to suck too, you just won't know it until you run it, but VLIW will make it obvious if you just look at the generated code. For the kinds of programs that make sense to run on a processor like Hexagon, VLIW is fine.

The whole Hexagon environment is just so much better than any of the other similar DSPs I'm aware of: you can use open source LLVM to compile code for it (so you aren't stuck with an old version of GCC), and the OS is much closer to standard (e.g. thread synchronization is just pthreads).

I did a bunch of work on Hexagon and I like it a lot. It is my favorite in its class.

dsharlet · on Sept 4, 2023

> The only case I can currently fore see where using LMUL=1 and manually unrolling instead will likely be always beneficial is vrgather operations that don't need to cross between registers in a register group (e.g. byte swapping).

What about algorithms where register pressure is an issue?

I think the problem with LMUL is it assumes that you always want to unroll the innermost dimension (where the vector loads are stride 1). That's usually, the last dimension I try to unroll, if there are any registers left over. If there is any sharing of data across any other dimension in the algorithm, it's better to tile/unroll those first.

Of course, for a simple algorithm, there will be registers left over. But I think more interesting algorithms will struggle on RVV if you must use LMUL > 1 for performance.

adgjlsfhk1 · on Sept 5, 2023

My favorite example of big LMUL is matmul. You can do an entire gemm microkernel in like 8 instructions with LMUL=4 by using an 7x4 kernel. You have 32 that turn into 8 registers with LMUL=4, 7 of which end up storing your C values, 1 stores your A values and you put the B values in scalar registers. Thus your entire kernel ends up being 1 4 wide vector load load and 7 4x wide FMA instructions.

camel-cdr · on Sept 4, 2023

> What about algorithms where register pressure is an issue

Then you'll probably saturate the processor without using a larger LMUL, but I think many algorithms can work with LMUL=2, without running out of registers.

brucehoult · on Sept 5, 2023

LMUL (and especially fractional LMUL) isn't for performance, it's for kernels with mixed-element sizes, to maximise the number of variables (and elements) you can keep in registers without spilling.

Being able to use LMUL as a way to get the effect of unrolling and hide the pointer bumps and loop control on simple loops on narrow processors, without expanding the code, is just a bonus.

dsharlet · on April 10, 2022

If you are looking for something like this in C++, here's my attempt at implementing it: https://github.com/dsharlet/array#einstein-reductions

It doesn't do any automatic optimization of the loops like some of the projects linked in this thread, but, it provides all the tools needed for humans to express the code in a way that a good compiler can turn it into really good code.

andi999 · on April 10, 2022

How is "This is 40-50x faster than a naive C implementation of nested loops on my machine, " that possible/Do you know how this was achieved?

mynameismon · on April 10, 2022

Not OP, but from a cursory glance at the code, it seems to be achieved with the combination of splitting the matrices into chunks to fit them in the L1/L2 caches (line 2 in the code), using tricks like switching indexes to achieve better cache locality, and using SIMD + Fused Multiply Add to further speedup things

andi999 · on April 10, 2022

Thanks! Which file are you refering to?

mynameismon · on April 10, 2022

The code right above the assembly.

dsharlet · on April 10, 2022

Compilers can be pretty good if you help them out a bit. Here's my implementation of Einstein reductions (including summations) in C++, which generate pretty close to ideal code until you start getting into processor architecture specific optimizations: https://github.com/dsharlet/array#einstein-reductions

dsharlet · on May 12, 2020

Funny you mention SPICE to VST compilation... It was on my list for this (my) side project but I never got around to it: http://livespice.org/

edit: And a Tubescreamer is one of the examples!

dsharlet · on Feb 10, 2020

> which introduces some frequency/phase distortion

I agree with everything you said here, except this. Can you explain this? Why does oversampling introduce distortion?

unlinked_dll · on Feb 10, 2020

Because upsampling requires interpolation. I'm using distortion in the abstract/academic sense here.

Any signal through a non ideal (or nontrivial, eg a gain) system will introduce distortion, harmonic being new energy at different frequencies, frequency distortion in the relative magnitude at the output that is different than the input, and the same for phase.

If you use a polynomial interpolator there will be harmonic distortion. If you use an interpolation filter there will be frequency distortion, which is highly correlated to the phase distortion.

What that distortion is and if it is perceptible depends on your design and constraints. But the key element that is at play here is that oversampling for nonlinear system modeling, which introduces harmonic distortion after the fact, is that it can amplify frequency (and therefore phase) distortion and your constraints are tighter than they normally are.

If you want an example take a the ideal interpolator, which is a sinc function. Because a true sync is non causal you cannot implement it, therefore lossless interpolation is not achievable. There will always be some loss of information, the design problem is trading off that loss with the resources required. We can do pretty damn good upsampling today however.

analog31 · on Feb 10, 2020

One thing that might work in our favor here, is that for this particular application -- simulating a guitar amplifier -- we are only dealing with the subset of behaviors that are physically realizable in an electronic circuit.