A basic introduction to NumPy's einsum (ajcr.net)
141 points by s1291 on April 9, 2022 | 49 comments


Also see Einops: https://github.com/arogozhnikov/einops, which uses an einsum-like notation for various tensor operations used in deep learning.

https://einops.rocks/pytorch-examples.html shows how it can be used to implement various neural network architectures in a simpler manner.
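For anyone who hasn't tried it, a minimal sketch of the notation (assuming einops is installed; the shapes here are made up):

  import numpy as np
  from einops import rearrange, reduce

  x = np.random.rand(32, 3, 64, 64)             # batch, channels, height, width

  # transpose to channels-last without remembering axis numbers
  y = rearrange(x, 'b c h w -> b h w c')        # shape (32, 64, 64, 3)

  # global average pooling over the spatial axes
  pooled = reduce(x, 'b c h w -> b c', 'mean')  # shape (32, 3)

  # 2x2 patch flattening, as used in vision transformers
  patches = rearrange(x, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=2, p2=2)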


Einops looks nice! It reminds me of https://github.com/deepmind/einshape which is another attempt at unifying reshape, squeeze, expand_dims, transpose, tile, flatten, etc under an einsum-inspired DSL.


Somebody also realized that much of the time you can use one single function to describe all 3 of the einops operations. I present to you, einop: https://github.com/cgarciae/einop


In addition, einsum in jax uses einops.


https://github.com/mcabbott/Tullio.jl:

  - better syntax (because Julia has proper macro/metaprogramming)
  - faster
  - automatically works with GPU arrays.


If you are looking for something like this in C++, here's my attempt at implementing it: https://github.com/dsharlet/array#einstein-reductions

It doesn't do any automatic optimization of the loops like some of the projects linked in this thread, but it provides all the tools needed for humans to express the code in a way that a good compiler can turn into really good code.


How is "This is 40-50x faster than a naive C implementation of nested loops on my machine, " that possible/Do you know how this was achieved?


Not OP, but from a cursory glance at the code, it seems to be achieved with a combination of splitting the matrices into chunks that fit in the L1/L2 caches (line 2 in the code), tricks like switching indexes to get better cache locality, and SIMD + fused multiply-add to speed things up further.
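Not code from that repo, just a rough Python sketch of the cache-blocking part of the idea (the real implementation does this in C with SIMD/FMA; the block size here is an arbitrary guess):

  import numpy as np

  def blocked_matmul(a, b, block=64):
      # Multiply a (m, k) by b (k, n) one small tile at a time. Each tile of
      # a, b and c fits in cache, so the inner products reuse data that is
      # already resident instead of re-streaming the whole matrices from RAM.
      m, k = a.shape
      _, n = b.shape
      c = np.zeros((m, n))
      for i0 in range(0, m, block):
          for j0 in range(0, n, block):
              for k0 in range(0, k, block):
                  c[i0:i0+block, j0:j0+block] += (
                      a[i0:i0+block, k0:k0+block] @ b[k0:k0+block, j0:j0+block]
                  )
      return c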


Thanks! Which file are you referring to?


The code right above the assembly.


I'd really like to use einsum more often, because it allows me to code my expressions the same way I derive them with pen and paper. Unfortunately, as mentioned in the article, it's slow, because it converts your formula into a for loop.

So usually, I rewrite my formulas into messy combinations of broadcasts, transposes and array multiplications. Is there a package or an algorithm that does this conversion automatically? It seems to be a pretty straightforward problem, at least for most expressions I use.
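To make the kind of rewrite I mean concrete, here's a simple contraction done both ways (purely illustrative):

  import numpy as np

  a = np.random.rand(60, 50)
  b = np.random.rand(60, 70)

  # the formula as derived on paper: contract over the shared first index
  c1 = np.einsum('ji,jk->ik', a, b)

  # the hand-converted transpose + matmul version that actually hits BLAS
  c2 = a.T @ b

  assert np.allclose(c1, c2)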


The Tullio library in Julia is a pretty fantastic option for Einstein summation. Its performance is great, it generates CUDA kernels, and it does some clever tricks for automatic differentiation. It's also a bit more readable than numpy's einsum function, since you just write:

  @tullio C[i,j] := A[i,k] * B[k,j]


Are there any benchmarks comparing Tullio generated kernels from einsum with an equivalent NVIDIA kernel?


It's highly unlikely that the performance will be great on GPU if it generates its own kernels. cuTENSOR has been hand-optimized to do exactly these types of operations in a way that automatically emitting a kernel can't match.


On the GPU, operations which look like matrix multiplication are indeed quite slow. As you say there are many tricks, and Tullio doesn't (right now) know them. For operations which are more like broadcasting it does much better.

On the CPU, the situation is much better. With some help from LoopVectorization.jl (which optimises micro-kernels) it will often beat OpenBLAS at matrix multiplication. The best-case scenario is an operation which would otherwise be permutedims plus matrix multiplication, for which it will often be several times faster, by fusing these.

The notation above is shared by some other packages. TensorOperations.jl always decomposes to known kernels (including on the GPU) and OMEinsum.jl usually does so (with a fallback to loops); both are more like einsum. TensorCast.jl is more like einops, just notation for writing reshape/permute/slice operations.


You might be surprised. I don't know a lot about how it generates GPU kernels, but for CPU it uses LoopVectorization, which is able to (among other things) derive optimal GEMM kernels automatically. For GPU, I believe Tullio uses KernelAbstractions.jl to generate the kernels, but I'm less familiar with the details.


Damn, that notation is super awesome. Basically self-explanatory.


Just use C or C++ or Fortran or Julia or even Lua.

The issue of "slow loops" is entirely self-inflicted by some languages. It's getting ridiculous to still worry about this shit in 2022.

Simple loops can and should be just as fast as vectorized programs. When they are slower, it is 100% due to deliberate decisions by the language designers.


But BLAS is much more than 3 loops for matrix-matrix multiplication.

https://stackoverflow.com/questions/1303182/how-does-blas-ge...

Why turn your back on all the carefully crafted optimizations?


BLAS is great and must be used whenever possible. That was not my point.

I don't want to turn my back on the beautiful loop notation. For some algorithms, the loop notation is clearer than any "vectorized" version. It is absurd that the language penalizes you for that. Loops are alright.


I agree. At least some faster-cpython work is going on; does anyone know if there are aggregate updates on that, like performance progress reports?


This is 100% not the case with SIMD.


Compilers can be pretty good if you help them out a bit. Here's my implementation of Einstein reductions (including summations) in C++, which generates pretty close to ideal code until you start getting into processor-architecture-specific optimizations: https://github.com/dsharlet/array#einstein-reductions


or Rust


It's not true that einsum converts to a for loop, and it is sometimes faster than other numpy built-in functions.

https://stackoverflow.com/questions/18365073/why-is-numpys-e...

Did I misunderstand your comment?


If I understand https://github.com/numpy/numpy/blob/v1.22.0/numpy/core/einsu... and https://github.com/numpy/numpy/blob/v1.22.0/numpy/core/src/m... correctly, using einsum without the optimize flag seems to use a for loop in C to do the multiplication.

The optimizer clearly tries to improve the performance, but in many cases, it doesn't seem to change anything. Let's simply multiply some matrices:

  x, y = np.random.rand(200, 200, 200), np.random.rand(200, 200, 200)
I can do

  %timeit x@y
  40.3 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
or a naive

  %timeit np.einsum('bik,bkj->bij',x,y)
  1.53 s ± 21.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
But even with optimization, I see

  %timeit np.einsum('bik,bkj->bij',x,y, optimize=True)
  1.54 s ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I'm not sure if I'm doing something wrong.


The looped matrix multiply that you show is very hard to optimize for in the general case of einsum. Often the looped GEMM is found permuted such as `kbi,kjb->bij`. In this case, heuristics are needed to determine if GEMM is worth it due to unaligned memory copies.

`optimize=True` is generally best when there are more than two tensors in the expression.
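A minimal sketch of that more-than-two-tensor case (shapes picked arbitrarily):

  import numpy as np

  a = np.random.rand(100, 200)
  b = np.random.rand(200, 300)
  c = np.random.rand(300, 50)

  # optimize=False contracts all three operands in a single pass
  slow = np.einsum('ij,jk,kl->il', a, b, c)

  # optimize=True picks a cheap pairwise contraction order and can
  # dispatch the intermediate products to BLAS
  fast = np.einsum('ij,jk,kl->il', a, b, c, optimize=True)

  assert np.allclose(slow, fast)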


I just tried replicating the same experiment using Jax's numpy API, and einsum is still slower, but at least the same order of magnitude:

  %timeit (x_jax @ y_jax).block_until_ready()
  579 µs ± 4.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

  %timeit jnp.einsum('bik,bkj->bij',x_jax,y_jax, optimize=True).block_until_ready()
  658 µs ± 1.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

  %timeit jnp.einsum('bik,bkj->bij',x_jax,y_jax).block_until_ready()
  660 µs ± 2.82 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


IIRC PyTorch's einsum used to be very slow.


I would say https://github.com/arogozhnikov/einops should be pretty close to your requirements.


I would like to second that! I converted several of my lab mates to Einops just by having them browse through the tutorial at https://einops.rocks/ :)


Looking at the documentation - einops seems to only implement operations on a single tensor, so it's quite far from a general replacement for einsum, which can perform tensor-products on an arbitrary number of tensors.


There are many evaluation strategies for tensor contractions; naively translating into nested for loops and then doing direct random-access arithmetic is just one approach. If you like numpy's einsum, a drop-in replacement is https://optimized-einsum.readthedocs.io/en/stable/. In general, it's an open question as to how to best "compile" a tensor contraction expression into something executable, for several reasons (there are a lot of possible orderings for the contractions, if one wishes to rely on BLAS routines for efficient primitives one will find that they don't exactly fit the problem of tensor contraction like a glove, and so on).
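The drop-in usage looks like this (assuming opt_einsum is installed; shapes are arbitrary):

  import numpy as np
  import opt_einsum as oe

  a = np.random.rand(50, 60)
  b = np.random.rand(60, 70)
  c = np.random.rand(70, 80)

  # same contraction string as np.einsum, but opt_einsum chooses the
  # contraction order and dispatches to BLAS-backed routines where it can
  result = oe.contract('ij,jk,kl->il', a, b, c)

  assert np.allclose(result, np.einsum('ij,jk,kl->il', a, b, c))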


The linked documentation states

    For example, the optimization of np.einsum has been passed upstream and most of the same features that can be found in this repository can be enabled with numpy.einsum(..., optimize=True). 
So numpy should provide the same perf for most cases.
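And if you want to see what the optimizer decides to do, np.einsum_path reports the chosen contraction order (a small sketch, shapes arbitrary):

  import numpy as np

  a = np.random.rand(50, 60)
  b = np.random.rand(60, 70)
  c = np.random.rand(70, 80)

  # returns the pairwise contraction order plus a human-readable report
  # with FLOP estimates for the naive and optimized orderings
  path, report = np.einsum_path('ij,jk,kl->il', a, b, c, optimize='optimal')
  print(report)

  # the precomputed path can be reused in the actual call
  result = np.einsum('ij,jk,kl->il', a, b, c, optimize=path)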


For more efficient einsum, see projects like https://optimized-einsum.readthedocs.io/en/stable/path_findi....


I've found einsum to be amazing at consolidating my code into something more readable, particularly for implementing architectures from scratch.

Here's a good video that explains why it's so good: https://www.youtube.com/watch?v=pkVwUVEHmfI

Also check out Lucid Rains' GitHub, which uses it extensively to build transformer architectures from scratch: https://github.com/lucidrains

* Example: https://github.com/lucidrains/alphafold2/blob/d59cb1ea536bc5...


amazing at consolidating my code into something more readable

Readable if you're already familiar with einsum notation. Otherwise there's a learning curve. An alternative to einsum is using multiple dot-product and reshape ops, hopefully with each one having a comment; this would be a lot more readable imo.
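For comparison, roughly the same attention-style contraction written both ways (shapes made up, just to illustrate the trade-off):

  import numpy as np

  q = np.random.rand(8, 128, 64)   # batch, query positions, head dim
  k = np.random.rand(8, 128, 64)   # batch, key positions, head dim

  # einsum version: one line, indices name the axes
  scores1 = np.einsum('bqd,bkd->bqk', q, k)

  # transpose + matmul version: each step can be commented individually
  k_t = k.transpose(0, 2, 1)       # (batch, head dim, key positions)
  scores2 = q @ k_t                # (batch, query positions, key positions)

  assert np.allclose(scores1, scores2)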


But then to find some mistake you would have to corroborate n * 2 pieces of information, which is quite a bit more, before getting into the issue of whether or not the sequence of operations actually makes sense on top of all that.


So let’s say there’s a bug, and an einsum statement is a suspect. Assume I’m not familiar with einsum notation, but I know dot products and reshapes. How do I debug it?


Guess you can’t, then. Personally I’m a bigger fan of einops since the variable/index names can be a lot more self-documenting.


I really don't find that einsum is less intuitive than reshape. Maybe they both take about the same time to learn the first time. And that's less time than it would take me to struggle through even a single function using reshape ops.


How many DL people are familiar with dot products and reshapes, and how many are familiar with einsum?


I really couldn't say.

It does occur to me, though, that you are probably talking about the elementwise product, not the dot product. The dot product includes a reduce_sum step and outputs a scalar.
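In numpy terms:

  import numpy as np

  a = np.array([1.0, 2.0, 3.0])
  b = np.array([4.0, 5.0, 6.0])

  a * b          # elementwise product: array([ 4., 10., 18.])
  np.dot(a, b)   # dot product: elementwise product, then sum -> 32.0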



If you're into tensor algebra, I can only recommend Cadabra, a beautiful piece of software:

https://cadabra.science/

We once wrote an article with it: 40th order in the Lagrangian, perhaps 50k pages of calculations if it were all printed out. Amazing tool! Thanks Kasper!


Oh god why did I not know about this when I did my theoretical physics PhD


I have to imagine Python has revolutionised physics education at university.

I taught myself C before arriving (especially via Stanford's Programming Paradigms course); if I hadn't, I would have been resigned, along with my classmates, to C code projected on OHP slides.

If I had my way today, the whole of any applied science would be taught with both mathematical and programming (python for convenience) notation.


I used all of C + Python + NumPy a lot, just didn't know about this feature. Would have helped prototyping various bits and pieces at the Python level.


There's also an implementation in R: https://github.com/const-ae/einsum



