A basic introduction to NumPy's einsum (ajcr.net)
141 points by s1291 on April 9, 2022 | 49 comments


Also see Einops: https://github.com/arogozhnikov/einops, which uses an einsum-like notation for various tensor operations used in deep learning.

https://einops.rocks/pytorch-examples.html shows how it can be used to implement various neural network architectures in a simpler manner.
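For anyone who hasn't tried it, a minimal sketch of the notation (assuming einops is installed; the shapes here are made up):

  import numpy as np
  from einops import rearrange, reduce

  x = np.random.rand(32, 3, 64, 64)             # batch, channels, height, width

  # transpose to channels-last without remembering axis numbers
  y = rearrange(x, 'b c h w -> b h w c')        # shape (32, 64, 64, 3)

  # global average pooling over the spatial axes
  pooled = reduce(x, 'b c h w -> b c', 'mean')  # shape (32, 3)

  # 2x2 patch flattening, as used in vision transformers
  patches = rearrange(x, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=2, p2=2)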


Einops looks nice! It reminds me of https://github.com/deepmind/einshape which is another attempt at unifying reshape, squeeze, expand_dims, transpose, tile, flatten, etc under an einsum-inspired DSL.


Somebody also realized that much of the time you can use one single function to describe all 3 of the einops operations. I present to you, einop: https://github.com/cgarciae/einop


In addition, einsum in jax uses einops.


https://github.com/mcabbott/Tullio.jl:

  - better syntax (because Julia has proper macro/metaprogramming)
  - faster
  - automatically works with GPU arrays.


If you are looking for something like this in C++, here's my attempt at implementing it: https://github.com/dsharlet/array#einstein-reductions

It doesn't do any automatic optimization of the loops like some of the projects linked in this thread, but it provides all the tools needed for humans to express the code in a way that a good compiler can turn into really good code.


How is "This is 40-50x faster than a naive C implementation of nested loops on my machine, " that possible/Do you know how this was achieved?


Not OP, but from a cursory glance at the code, it seems to be achieved with a combination of splitting the matrices into chunks that fit in the L1/L2 caches (line 2 in the code), tricks like switching indexes to get better cache locality, and SIMD + fused multiply-add to speed things up further.
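Not code from that repo, just a rough Python sketch of the cache-blocking part of the idea (the real implementation does this in C with SIMD/FMA; the block size here is an arbitrary guess):

  import numpy as np

  def blocked_matmul(a, b, block=64):
      # Multiply a (m, k) by b (k, n) one small tile at a time. Each tile of
      # a, b and c fits in cache, so the inner products reuse data that is
      # already resident instead of re-streaming the whole matrices from RAM.
      m, k = a.shape
      _, n = b.shape
      c = np.zeros((m, n))
      for i0 in range(0, m, block):
          for j0 in range(0, n, block):
              for k0 in range(0, k, block):
                  c[i0:i0+block, j0:j0+block] += (
                      a[i0:i0+block, k0:k0+block] @ b[k0:k0+block, j0:j0+block]
                  )
      return c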


Thanks! Which file are you referring to?


The code right above the assembly.


I'd really like to use einsum more often, because it allows me to code my expressions the same way I derive them with pen and paper. Unfortunately, as mentioned in the article, it's slow, because it converts your formula into a for loop.

So usually, I rewrite my formulas into messy combinations of broadcasts, transposes and array multiplications. Is there a package or an algorithm that does this conversion automatically? It seems to be a pretty straightforward problem, at least for most expressions I use.
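To make the kind of rewrite I mean concrete, here's a simple contraction done both ways (purely illustrative):

  import numpy as np

  a = np.random.rand(60, 50)
  b = np.random.rand(60, 70)

  # the formula as derived on paper: contract over the shared first index
  c1 = np.einsum('ji,jk->ik', a, b)

  # the hand-converted transpose + matmul version that actually hits BLAS
  c2 = a.T @ b

  assert np.allclose(c1, c2)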


The Tullio library in Julia is a pretty fantastic option for Einstein summation. Its performance is great, it generates CUDA kernels, and it does some clever tricks for automatic differentiation. It's also a bit more readable than numpy's einsum function, since you just write:

  @tullio C[i,j] := A[i,k] * B[k,j]


Are there any benchmarks comparing Tullio generated kernels from einsum with an equivalent NVIDIA kernel?


It's highly unlikely that the performance will be great on GPU if it generates its own kernels. cuTENSOR has been hand-optimized to do exactly these types of operations in a way that automatically emitting a kernel can't match.


On the GPU, operations which look like matrix multiplication are indeed quite slow. As you say there are many tricks, and Tullio doesn't (right now) know them. For operations which are more like broadcasting it does much better.

On the CPU, the situation is much better. With some help from LoopVectorization.jl (which optimises micro-kernels) it will often beat OpenBLAS at matrix multiplication. The best-case scenario is an operation which would otherwise be permutedims plus matrix multiplication, for which it will often be several times faster, by fusing these.

The notation above is shared by some other packages. TensorOperations.jl always decomposes to known kernels (including on the GPU) and OMEinsum.jl usually does so (with a fallback to loops); both are more like einsum. TensorCast.jl is more like einops, just notation for writing reshape/permute/slice operations.


You might be surprised. I don't know a lot about how it generates GPU kernels, but for CPU it uses LoopVectorization, which is able to (among other things) derive optimal GEMM kernels automatically. For GPU, I believe Tullio uses KernelAbstractions.jl to generate the kernels, but I'm less familiar with the details.


Damn, that notation is super awesome. Basically self-explanatory.


Just use C or C++ or Fortran or Julia or even Lua.

The issue of "slow loops" is entirely self-inflicted by some languages. It's getting ridiculous to still worry about this shit in 2022.

Simple loops can and should be just as fast as vectorized programs. When they are slower, it is 100% due to deliberate decisions by the language designers.


But BLAS is much more than 3 loops for matrix-matrix multiplication.

https://stackoverflow.com/questions/1303182/how-does-blas-ge...

Why turn your back on all the carefully crafted optimizations?


BLAS is great and must be used whenever possible. That was not my point.

I don't want to turn my back on the beautiful loop notation. For some algorithms, the loop notation is clearer than any "vectorized" version. It is absurd that the language penalizes you for that. Loops are alright.


I agree. At least some faster-cpython work is going on; does anyone know if there are aggregate updates on that, like performance progress reports?


This is 100% not the case with SIMD.


Compilers can be pretty good if you help them out a bit. Here's my implementation of Einstein reductions (including summations) in C++, which generates pretty close to ideal code until you start getting into processor-architecture-specific optimizations: https://github.com/dsharlet/array#einstein-reductions


or Rust


It's not true that einsum converts to a for loop, and it is sometimes faster than other numpy built-in functions.

https://stackoverflow.com/questions/18365073/why-is-numpys-e...

Did I misunderstand your comment?


If I understand https://github.com/numpy/numpy/blob/v1.22.0/numpy/core/einsu... and https://github.com/numpy/numpy/blob/v1.22.0/numpy/core/src/m... correctly, using einsum without the optimize flag seems to use a for loop in C to do the multiplication.

The optimizer clearly tries to improve the performance, but in many cases, it doesn't seem to change anything. Let's simply multiply some matrices:

  x, y = np.random.rand(200, 200, 200), np.random.rand(200, 200, 200)
I can do

  %timeit x@y
  40.3 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
or a naive

  %timeit np.einsum('bik,bkj->bij',x,y)
  1.53 s ± 21.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
But even with optimization, I see

  %timeit np.einsum('bik,bkj->bij',x,y, optimize=True)
  1.54 s ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I'm not sure if I'm doing something wrong.


The looped matrix multiply that you show is very hard to optimize for in the general case of einsum. Often the looped GEMM is found permuted such as `kbi,kjb->bij`. In this case, heuristics are needed to determine if GEMM is worth it due to unaligned memory copies.

`optimize=True` is generally best when there are more than two tensors in the expression.
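A minimal sketch of that more-than-two-tensor case (shapes picked arbitrarily):

  import numpy as np

  a = np.random.rand(100, 200)
  b = np.random.rand(200, 300)
  c = np.random.rand(300, 50)

  # optimize=False contracts all three operands in a single pass
  slow = np.einsum('ij,jk,kl->il', a, b, c)

  # optimize=True picks a cheap pairwise contraction order and can
  # dispatch the intermediate products to BLAS
  fast = np.einsum('ij,jk,kl->il', a, b, c, optimize=True)

  assert np.allclose(slow, fast)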


I just tried replicating the same experiment using Jax's numpy API, and einsum is still slower, but at least the same order of magnitude:

  %timeit (x_jax @ y_jax).block_until_ready()
  579 µs ± 4.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

  %timeit jnp.einsum('bik,bkj->bij',x_jax,y_jax, optimize=True).block_until_ready()
  658 µs ± 1.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

  %timeit jnp.einsum('bik,bkj->bij',x_jax,y_jax).block_until_ready()
  660 µs ± 2.82 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


IIRC PyTorch's einsum used to be very slow.


I would say https://github.com/arogozhnikov/einops should be pretty close to your requirements.


I would like to second that! I converted several of my lab mates to Einops just by having them browse through the tutorial at https://einops.rocks/ :)


Looking at the documentation - einops seems to only implement operations on a single tensor, so it's quite far from a general replacement for einsum, which can perform tensor-products on an arbitrary number of tensors.


There are many evaluation strategies for tensor contractions; naively translating into nested for loops and then doing direct random-access arithmetic is just one approach. If you like numpy's einsum, a drop-in replacement is https://optimized-einsum.readthedocs.io/en/stable/. In general, it's an open question as to how to best "compile" a tensor contraction expression into something executable, for several reasons (there are a lot of possible orderings for the contractions, if one wishes to rely on BLAS routines for efficient primitives one will find that they don't exactly fit the problem of tensor contraction like a glove, and so on).
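The drop-in usage looks like this (assuming opt_einsum is installed; shapes are arbitrary):

  import numpy as np
  import opt_einsum as oe

  a = np.random.rand(50, 60)
  b = np.random.rand(60, 70)
  c = np.random.rand(70, 80)

  # same contraction string as np.einsum, but opt_einsum chooses the
  # contraction order and dispatches to BLAS-backed routines where it can
  result = oe.contract('ij,jk,kl->il', a, b, c)

  assert np.allclose(result, np.einsum('ij,jk,kl->il', a, b, c))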


The linked documentation states

    For example, the optimization of np.einsum has been passed upstream and most of the same features that can be found in this repository can be enabled with numpy.einsum(..., optimize=True). 
So numpy should provide the same perf for most cases.
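And if you want to see what the optimizer decides to do, np.einsum_path reports the chosen contraction order (a small sketch, shapes arbitrary):

  import numpy as np

  a = np.random.rand(50, 60)
  b = np.random.rand(60, 70)
  c = np.random.rand(70, 80)

  # returns the pairwise contraction order plus a human-readable report
  # with FLOP estimates for the naive and optimized orderings
  path, report = np.einsum_path('ij,jk,kl->il', a, b, c, optimize='optimal')
  print(report)

  # the precomputed path can be reused in the actual call
  result = np.einsum('ij,jk,kl->il', a, b, c, optimize=path)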


For more efficient einsum, see projects like https://optimized-einsum.readthedocs.io/en/stable/path_findi....


I've found einsum to be amazing at consolidating my code into something more readable, particularly for implementing architectures from scratch.

Here's a good video that explains why it's so good: https://www.youtube.com/watch?v=pkVwUVEHmfI

Also check out Lucid Rains' GitHub, which uses it extensively to build transformer architectures from scratch: https://github.com/lucidrains

* Example: https://github.com/lucidrains/alphafold2/blob/d59cb1ea536bc5...


amazing at consolidating my code into something more readable

Readable if you're already familiar with einsum notation. Otherwise there's a learning curve. An alternative to einsum is using multiple dot-product and reshape ops, hopefully with each one having a comment; this would be a lot more readable imo.
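For comparison, roughly the same attention-style contraction written both ways (shapes made up, just to illustrate the trade-off):

  import numpy as np

  q = np.random.rand(8, 128, 64)   # batch, query positions, head dim
  k = np.random.rand(8, 128, 64)   # batch, key positions, head dim

  # einsum version: one line, indices name the axes
  scores1 = np.einsum('bqd,bkd->bqk', q, k)

  # transpose + matmul version: each step can be commented individually
  k_t = k.transpose(0, 2, 1)       # (batch, head dim, key positions)
  scores2 = q @ k_t                # (batch, query positions, key positions)

  assert np.allclose(scores1, scores2)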


But then to find some mistake you would have to corroborate n * 2 pieces of information, which is quite a bit more, before getting into the issue of whether or not the sequence of operations actually makes sense on top of all that.


So let’s say there’s a bug, and an einsum statement is a suspect. Assume I’m not familiar with einsum notation, but I know dot products and reshapes. How do I debug it?


Guess you can’t, then. Personally I’m a bigger fan of einops since the variable/index names can be a lot more self-documenting.


I really don't find that einsum is less intuitive than reshape. Maybe they both take about the same time to learn the first time. And that's less time than it would take me to struggle through even a single function using reshape ops.


How many DL people are familiar with dot products and reshapes, and how many are familiar with einsum?


I really couldn't say.

It does occur to me, though, that you are probably talking about the elementwise product, not the dot product. The dot product includes a reduce_sum step and outputs a scalar.
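In numpy terms:

  import numpy as np

  a = np.array([1.0, 2.0, 3.0])
  b = np.array([4.0, 5.0, 6.0])

  a * b          # elementwise product: array([ 4., 10., 18.])
  np.dot(a, b)   # dot product: elementwise product, then sum -> 32.0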



If you're into tensor algebra, I can only recommend Cadabra, a beautiful piece of software:

https://cadabra.science/

We once wrote an article with it: 40th order in the Lagrangian, perhaps 50k pages of calculations if it were all printed out. Amazing tool! Thanks Kasper!


Oh god why did I not know about this when I did my theoretical physics PhD


I have to imagine Python has revolutionised physics education at university.

I taught myself C before arriving (especially via Stanford's Programming Paradigms course); if I hadn't, I would have been resigned, along with my classmates, to C code projected on OHP slides.

If I had my way today, the whole of any applied science would be taught with both mathematical and programming (python for convenience) notation.


I used all of C + Python + NumPy a lot, just didn't know about this feature. Would have helped prototyping various bits and pieces at the Python level.


There's also an implementation in R: https://github.com/const-ae/einsum



