
"Rewrite the hot path in C/C++" is also a landmine because how inefficient the boundary crossing is. so you really need "dispatch as much as possible at once" instead of continuously calling the native code


And it's not just inefficiency. Even with fancy FFI generators like PyO3 or SWIG, FFI adds a ton of work and complexity, makes debugging harder, makes distribution harder, etc.

In my opinion, in most cases where you might want to write a project in two languages with FFI, it's better not to and to just use one language, even if that language isn't optimal. In this case, just write the whole thing in C++ (or Rust).

There are some exceptions but generally FFI is a huge cost and Python doesn't bring enough to the table to justify its use if you are already using C++.


One use of Python as a "glue language" I've seen that actually avoids the performance problems of those bindings is GNU Radio. Its architecture basically uses Python as a configuration language that sets up the computation flow-graph at startup; the rest of the runtime is entirely in compiled code (generally C++). Obviously that approach isn't applicable to all problems, but it really shaped my opinion of when/how a slow glue language is acceptable.
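A minimal sketch of that pattern, using GNU Radio's Python API from memory (block names may differ between versions): Python only wires the graph together, and run() hands execution to the C++ scheduler.

    from gnuradio import gr, blocks, analog

    samp_rate = 32_000

    # Python's job: build the flow-graph once at startup.
    tb = gr.top_block()
    src = analog.sig_source_f(samp_rate, analog.GR_COS_WAVE, 1_000, 1.0)
    head = blocks.head(gr.sizeof_float, samp_rate)  # stop after 1 s of samples
    snk = blocks.null_sink(gr.sizeof_float)
    tb.connect(src, head, snk)

    # From here on, all per-sample work happens in compiled C++ blocks.
    tb.run()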


This. Use Python only for control flow, and offload data flow to a library that is better suited for it: written in C, using packed structs, cache friendly, etc.

If you want parallelism, use the multiprocessing library, scatter-and-gather-style computation, and so on.
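A minimal scatter/gather sketch with the standard library (heavy() is a stand-in for real per-chunk work, ideally itself backed by C code):

    from multiprocessing import Pool

    def heavy(chunk):
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        chunks = [range(i, i + 100_000) for i in range(0, 1_000_000, 100_000)]
        with Pool() as pool:
            partials = pool.map(heavy, chunks)  # scatter work to processes
        result = sum(partials)                  # gather the partial results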


These days it's "rewrite in Rust".

Typically Python is just the entry and exit point (with a little bit of massaging), right?

And then the overwhelming majority of the business logic is done in Rust/C++/Fortran, no?


With computer vision you end up wanting to read and write to huge buffers that aren't practical to serialize and are difficult to share. And even allocating and freeing multi-megabyte framebuffers at 60 FPS can put a little strain on the allocator, so you want to reuse them, which means you have to think about memory safety.
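A rough sketch of what that reuse tends to look like (names and pool size are illustrative, not from any particular library):

    import numpy as np
    from queue import Queue

    H, W = 1080, 1920
    pool = Queue()
    for _ in range(4):                        # allow four frames in flight
        pool.put(np.empty((H, W, 3), dtype=np.uint8))

    def next_frame(fill):
        buf = pool.get()                      # reuse a buffer, don't allocate
        fill(buf)                             # write pixels in place
        return buf                            # caller must pool.put() it back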

That is probably why his demo was Sobel edge detection with NumPy. Sobel can run fast enough at standard resolution on a CPU, but once that huge buffer needs to be read or written outside of your fast language, things get tricky.

This also comes up in Tauri, since you have to bridge between Rust and JS. I'm not sure if Electron apps have the same problem or not.


The "numpy" Sobel code is not that good, unfortunately - all the iteration is done in Python, so there is not much benefit from involving numpy. If one would use say scipy.convolve2d on a numpy.array, it would be much faster.


In the data science/engineering world, Apache Arrow is the bridge between languages, so you don't actually need to serialize into language-specific structures, which is really nice.
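A minimal PyArrow sketch: the columnar buffers behind this table can be handed to any Arrow-aware runtime (Rust, Java, DuckDB, ...) without re-serializing.

    import pyarrow as pa

    table = pa.table({
        "id": pa.array([1, 2, 3], type=pa.int64()),
        "value": pa.array([0.1, 0.2, 0.3], type=pa.float64()),
    })
    print(table.schema)  # shared in-memory format, no language-specific copy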


Isn't this just a specific case of the general rule of hoisting repeated use of the same operation out of a loop? I'm not sure calls out to C are specifically slow in CPython (given that many operations are really just calling C underneath).


The serialisation cost of translating data representations between Python and C (or whatever compiled language you're using) is notable. Instead of having the compiled code sit in the centre of a hot loop, it's significantly better to have the loop in the compiled code and call it once.

https://pythonspeed.com/articles/python-extension-performanc...
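A sketch of the two shapes with ctypes: libm's sqrt is real (the library name assumes Linux), while libfast.so/sqrt_all is a hypothetical compiled loop you would supply yourself.

    import array
    import ctypes

    data = array.array("d", range(1_000_000))

    # Shape 1 (slow): the compiled call sits inside the Python hot loop,
    # paying the boundary-crossing cost once per element.
    libm = ctypes.CDLL("libm.so.6")
    libm.sqrt.restype = ctypes.c_double
    libm.sqrt.argtypes = [ctypes.c_double]
    out = [libm.sqrt(x) for x in data]

    # Shape 2 (fast): one crossing, with the loop in compiled code.
    # Assumes a hypothetical C function: void sqrt_all(double *buf, size_t n);
    libfast = ctypes.CDLL("./libfast.so")
    libfast.sqrt_all.argtypes = [ctypes.POINTER(ctypes.c_double), ctypes.c_size_t]
    buf = (ctypes.c_double * len(data)).from_buffer(data)
    libfast.sqrt_all(buf, len(data))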


You don't have to serialize data or translate data representations between CPython and C. That article is wrong. What's slow in their example is storing data (such as integers) the way CPython likes to store it, not translating that form to a form easily manipulated in C, such as a native integer in a register. That's just a single MOV instruction, once you get past all the type checking and reference counting.

You can avoid that problem to some extent by implementing your own data container as part of your C extension (the article's solution #1); frobbing that from a Python loop can still be significantly faster than allocating and deallocating boxed integers all the time, with dynamic dispatch and reference counting. But, yes, to really get reasonable performance you want to not be running bytecodes in the Python interpreter loop at all (the article's solution #2).

But that's not because of serialization or other kinds of data format translation.
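You can see the storage difference from within Python; array.array here stands in for the article's "own data container" idea:

    import array
    import sys

    boxed = list(range(1_000_000))               # PyLong objects, refcounted
    native = array.array("q", range(1_000_000))  # raw C int64s, back to back

    print(sys.getsizeof(boxed[1]))  # ~28 bytes per boxed int (64-bit CPython)
    print(native.itemsize)          # 8 bytes: a native integer, one MOV away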


The overhead of copying and moving data around in Python is frustrating. When you are CPU bound on a task, you can't use threads (which do have shared memory) because of the GIL, so you end up using whole processes and then waste a bunch of cycles communicating stuff back and forth. And yes, you can create shared memory buffers between Python processes but that is nowhere near as smooth as say two Java threads working off a shared data structure that's got synchronized sprinkled on it.
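For completeness, the standard-library shared-memory route looks like this (a minimal sketch):

    import numpy as np
    from multiprocessing import Process, shared_memory

    def worker(name, shape):
        shm = shared_memory.SharedMemory(name=name)
        view = np.ndarray(shape, dtype=np.float64, buffer=shm.buf)
        view *= 2.0              # operate in place; nothing is copied back
        del view                 # release the buffer before closing
        shm.close()

    if __name__ == "__main__":
        shm = shared_memory.SharedMemory(create=True, size=8 * 1024)
        arr = np.ndarray((1024,), dtype=np.float64, buffer=shm.buf)
        arr[:] = 1.0
        p = Process(target=worker, args=(shm.name, arr.shape))
        p.start(); p.join()
        assert arr[0] == 2.0     # mutation made by the other process
        del arr
        shm.close()
        shm.unlink()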


The GIL is a thing of the past, no?


The key is to move the entire loop to a compiled language instead of just the inner operation.


They are specifically slow. There was a project that measured FFI cost across different languages, and Python came out awfully bad.


>how inefficient the boundary crossing is

For 99.99% of the programs that people write, modern M.2 NVMe SSDs are plenty fast, and that's the laziest way to load data into a C extension or process.

Then there are Unix pipes, which are sufficiently fast.

Then there is shared memory, which basically involves no loading.

As always with Python, it all depends on the setup.


The problem isn't loading the data, but marshalling it (i.e., transforming it into a data structure that makes sense for the faster language to operate on, and back again). Or, if you don't transform (or the data is special-cased enough that no transformation makes sense), then the available optimizations become much more limited.


There are several data structures for numeric data that need no marshalling and are suitable for very efficient interoperation between Python and C/C++/Rust, etc. Examples include array.array (in the standard library), numpy.array, and PyArrow.
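The buffer protocol is what makes that work; for example, a zero-copy view across containers:

    import array
    import numpy as np

    a = array.array("d", [1.0, 2.0, 3.0])
    v = np.frombuffer(a, dtype=np.float64)  # a view, not a copy
    v[0] = 42.0
    print(a[0])                             # 42.0 -- same underlying buffer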


That's all just design, nothing to do with any particular language.



