Btw, with the next release iceoryx2 will have Python bindings. They are already on main and we will make them available via pip. This should make it easier to use with PyTorch.
I'm not sure I fully understand what you mean. Do you assume we implemented the same approach to shared memory communication as described in the blog post?
If that’s the case, I want to reassure you that we don’t use locks. Quite the contrary, we use lock-free [1] algorithms to implement the queues. We cannot use locks for the reason you mentioned, and also for the case when an application dies while holding the lock: this would result in a deadlock, which cannot be tolerated in a safety-critical environment. Btw, there are already cars out there using a predecessor of iceoryx to distribute camera data within an ECU.
For hard real-time systems we have a wait-free queue, which gives even stronger guarantees. Lock-free algorithms often have a CAS (compare-and-swap) loop, which in theory can lead to starvation, but in practice this is unlikely as long as your system does not run at 100% CPU utilization all the time. As a young company, we cannot open source everything immediately, so the wait-free queue will be part of a commercial support package, together with more sophisticated tooling, as teased in the blog post.
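For context, such a CAS loop looks roughly like this (a minimal sketch, not iceoryx2 code):

```cpp
#include <atomic>
#include <cstdint>

// Minimal sketch of a CAS loop (not iceoryx2 code): the thread keeps retrying
// until its compare-and-swap wins. Under heavy contention a thread can in
// theory lose every retry, which is the theoretical starvation mentioned
// above; a wait-free algorithm bounds the number of steps instead.
std::uint64_t fetch_increment(std::atomic<std::uint64_t>& value) {
    std::uint64_t current = value.load(std::memory_order_relaxed);
    while (!value.compare_exchange_weak(current, current + 1,
                                        std::memory_order_acq_rel,
                                        std::memory_order_relaxed)) {
        // 'current' now holds the latest value; try again with it.
    }
    return current;  // value observed just before our successful increment
}
```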
Regarding memory guarantees: they are essentially the same as what you have when sharing an Arc&lt;T&gt; via a Rust channel. After publishing, the producer releases ownership to the subscribers, which have read-only access for as long as they hold the sample. When the sample has been dropped by all subscribers, it is released back to the shared memory allocator.
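In C++ terms, a rough sketch of that ownership model (an analogy only, not the iceoryx2 API; `SamplePool` is a hypothetical stand-in for the shared memory allocator):

```cpp
#include <memory>

// Rough analogy of the ownership model (not the iceoryx2 API): publishing
// hands out read-only shared ownership, and once the last holder drops its
// handle, the custom deleter returns the sample to the pool/allocator.
struct Sample { /* payload that lives in shared memory */ };

// Hypothetical stand-in for the shared memory allocator.
struct SamplePool {
    Sample* allocate() { return new Sample(); }   // stand-in for shm allocation
    void release(Sample* s) { delete s; }         // stand-in for returning the chunk
};

std::shared_ptr<const Sample> publish(SamplePool& pool, Sample* written) {
    // The producer gives up write access; subscribers only ever see 'const'.
    return std::shared_ptr<const Sample>(
        written, [&pool](const Sample* s) { pool.release(const_cast<Sample*>(s)); });
}
```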
Btw, we also have an event signalling mechanism so you do not have to poll the queue but can wait until the producer signals that new data is available. This requires a context switch, though, so it is up to the user to decide whether that is desired.
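To illustrate the trade-off in generic terms (a POSIX-semaphore sketch, not the iceoryx2 event API; the semaphore name is made up):

```cpp
#include <fcntl.h>      // O_CREAT
#include <semaphore.h>  // sem_open, sem_post, sem_wait

// Generic sketch of the trade-off (not the iceoryx2 event API): instead of
// spinning on the queue, the consumer blocks on a named semaphore and the
// producer posts it after publishing. Blocking saves CPU but costs a context
// switch per wake-up.

// Both processes open the same named semaphore, e.g.:
// sem_t* event = sem_open("/new_data_event", O_CREAT, 0600, 0);

void signal_new_data(sem_t* event) {   // producer, after pushing a sample
    sem_post(event);
}

void wait_for_new_data(sem_t* event) { // consumer, instead of busy polling
    sem_wait(event);                   // sleeps until the producer signals
    // ... now pop from the queue ...
}
```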
> In my experience shared memory is really hard to implement well and manage:
I second that. It took us quite some time to get the architecture right. After all, iceoryx2 is the third incarnation of this piece of software, with elfepiff and me working on the last two.
> 1. Unless you're using either fixed sized or specially allocated structures, you end up paying for serialization anyhow (zero copy is actually one copy).
Indeed, we are using fixed-size structures with a bucket allocator. We have ideas on how to enable the use of types which support custom allocators, and even of raw pointers, but that is just a crazy idea which might not pan out.
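For illustration, the core of a bucket allocator for fixed-size chunks can be sketched like this (a simplified illustration, not the iceoryx2 implementation):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified bucket allocator for fixed-size chunks (not the iceoryx2
// implementation): a memory region is pre-divided into equally sized buckets
// and a free list hands them out, so allocate/release are O(1) and the
// region never fragments.
class BucketAllocator {
public:
    BucketAllocator(void* region, std::size_t bucket_size, std::size_t bucket_count)
        : base_(static_cast<std::uint8_t*>(region)), bucket_size_(bucket_size) {
        for (std::size_t i = 0; i < bucket_count; ++i) {
            free_list_.push_back(base_ + i * bucket_size_);
        }
    }

    void* allocate() {
        if (free_list_.empty()) { return nullptr; }
        void* bucket = free_list_.back();
        free_list_.pop_back();
        return bucket;
    }

    void release(void* bucket) {
        free_list_.push_back(static_cast<std::uint8_t*>(bucket));
    }

private:
    std::uint8_t* base_;
    std::size_t bucket_size_;
    std::vector<void*> free_list_;  // in a real shm setup this bookkeeping
                                    // would itself live in shared memory
};
```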
> 2. There's no way to reference count the shared memory - if a reader crashes, it holds on to the memory it was reading. You can get around this with some form of watchdog process, or by other schemes with a side channel, but it's not "easy".
>
> 3. Similar to 2, if a writer crashes, it will leave behind junk in whatever filesystem you are using to hold the shared memory.
Indeed, this is a complicated topic and support from the OS would be appreciated. We found a few ways to make this feasible, though.
The origins of iceoryx are in automotive, where it is required to split functionality up into multiple processes. When one process goes down, the system can still operate in a degraded mode or just restart the faulty process. With this setup, one needs an efficient and low-latency solution; otherwise the CPU spends more time copying data than doing real work.
Of course there are issues like the producer mutating data after delivery, but there are also solutions for this. They will of course affect the latency, but should still be better than using e.g. Unix domain sockets.
Fun fact: iceoryx1 only supported memory chunks of up to 4 GB, and some time ago someone asked if we could lift this limitation since they wanted to transfer a 92 GB large language model via shared memory.
Thanks for the tips. We have a comparison with message queues and Unix domain sockets [1] in the repo on GitHub [2].
~~It's nice to see that independent benchmarks are in the same ballpark as the ones we perform.~~
Edit: sorry, I confused your link with another one which also has ping-pong in its title
We provide data types which are shared memory compatible, which means one does not have to serialize/deserialize. For image or lidar data, one also does not have to serialize, and this is where copying large data really hurts. But you are right: if your data structures are not shared-memory compatible, one has to serialize the data first, and this has its cost, depending on which serialization format one uses. iceoryx is agnostic to this, though, and one can select what is best for a given use case.
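For illustration, this is roughly what "shared memory compatible" means here (my own example types, not taken from iceoryx):

```cpp
#include <array>
#include <cstdint>
#include <string>
#include <type_traits>

// A type is "shared memory compatible" when it is self-contained, i.e. it
// holds no pointers into the sending process's heap and has a fixed size.
struct CameraFrame {                          // can be placed directly into shm
    std::uint64_t timestamp_ns;
    std::uint32_t width;
    std::uint32_t height;
    std::array<std::uint8_t, 640 * 480> pixels;
};
static_assert(std::is_trivially_copyable_v<CameraFrame>);

struct LogEntry {                             // NOT shm compatible:
    std::string message;                      // std::string owns heap memory,
};                                            // so it must be serialized first
```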
Besides being written in Rust, the big difference is the decentralized approach. With iceoryx1 a central daemon is required, but with iceoryx2 this is not the case anymore. Furthermore, there is more fine-grained control over resources like memory and endpoints like publishers. Overall the architecture is more modular, and it should be easier to port iceoryx2 to even more platforms and customize it with third-party extensions.
With this release we have initial support for C and C++. Not all features of the Rust version are supported yet, but the plan is to finish the bindings with the next release. Furthermore, with an upcoming release we will make it trivial to communicate between Rust, C and C++ applications and all the other language bindings we are going to provide, with Python probably being the next one.
I've been looking around for some kind of design documents that explain how you were able to ditch the central broker, but I haven't found much. Do you have breadcrumbs?
This is a longer story, but I'll try to provide the essence.
* All IPC resources are represented in the file system and have a global naming scheme. So if you would like to perform service discovery, you take a look at `/tmp/iceoryx2/services`, list all service toml files that you are allowed to access, and handle them (see the sketch after this list).
* Connecting to a service means, under the hood, opening a specific shared memory segment identified via a naming scheme, adding yourself to the participant list, and receiving/sending data.
* Crash/resource cleanup is done decentrally by every process that has the permissions to perform it.
* In a central/broker architecture you would have the central broker that checks this in a loop.
* In a decentralized architecture, we defined certain sync points where this is checked. These points are placed so that you check for the misbehavior before it would affect you. For instance, when a sender is supposed to send you a message every second but you do not receive it, you actively check whether it is still alive. Other sync points are when an iceoryx2 node is created or when you connect to or disconnect from a service.
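To make the first bullet concrete, here is a minimal sketch of discovery-by-directory-listing (my own illustration, not the actual iceoryx2 code; only the `/tmp/iceoryx2/services` path is taken from the description above):

```cpp
#include <filesystem>
#include <iostream>
#include <system_error>

// Illustration only: every service is represented as a toml file under a
// well-known path, so "service discovery" boils down to listing a directory
// and filtering the entries you are allowed to access.
int main() {
    const std::filesystem::path service_dir{"/tmp/iceoryx2/services"};

    std::error_code ec;
    for (const auto& entry : std::filesystem::directory_iterator{service_dir, ec}) {
        if (entry.path().extension() == ".toml") {
            std::cout << "found service: " << entry.path().filename() << '\n';
        }
    }
}
```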
The main point is that the API is decentralized, but you can always use it in a central daemon if you like; you just don't have to. It is optional.
Same here. Shared memory is one of those things where the kernel could really help some more with reliable cleanup (1). Until then you're mostly doomed to have a rock-solid cleanup daemon, or you are limited to eventual cleanup by restarting processes. I suspect it is still possible to get into a situation where segments are exhausted and you're forced to intervene.
(1) I'm referring to automatic refcounting of shm segments using POSIX shm (not SysV!) when the last process dies or unmaps them.
Now I'm curious. It seems you are not the father I'm still drinking beer with. This means there is only one person left who fits this description :) ... we should meet for some beer with the other father ;)
I'm one of the iceoryx maintainers. Great to see some new players in this field. Competition leads to innovation, and maybe we can even collaborate in some areas :)
I have not yet looked at the code, but you made me curious with the raw pointers. Did you find a way to make this work without serialization or without mapping the shm to the same address in all processes?
I will have a closer look at the jemalloc integration since we had something similar in mind for iceoryx2.
We are doing it with fancy-pointers (yes, that is the actual technical term in C++ land) and allocators. It’s open-source, so it’s not like there’s any hidden magic, of course: “Just” a matter of working through it.
Using manual mapping (same address values on both sides, as you mentioned) was one idea that a couple people preferred, but I was the one who was against it, and ultimately this was heeded. So that meant:
Raw pointer T* becomes Allocator<T>::pointer. So if the user happens to enjoy using raw pointers directly in their structures, they do need to make that change. But, beats rewriting the whole thing… by a lot.
container<T> becomes container<T, Allocator<T>>, where `container` was your standard or standard-compliant (uses allocator properly) container of choice. So if the user prefers sanity and thus uses containers (including custom ones they developed or third-party STL-compliant ones), they do need to use an allocator template argument in the declaration of the container-typed member.
But, that’s it - no other changes in data structure (which can be nested and combined and …) to make it SHM-sharable.
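To make those two changes concrete, a minimal sketch (`Allocator` here is a placeholder for whatever SHM-friendly allocator the library hands you, not Flow-IPC's actual type):

```cpp
#include <memory>
#include <vector>

// Sketch of the two changes described above; only declarations change, not
// the logic of the data structure.
template <template <typename> class Allocator>
struct Node {
    // Raw pointer Node* becomes Allocator<Node>::pointer (possibly fancy):
    typename std::allocator_traits<Allocator<Node>>::pointer next;

    // container<T> becomes container<T, Allocator<T>>:
    std::vector<int, Allocator<int>> values;
};

// With the ordinary heap allocator the same structure degenerates back to a
// raw pointer and a plain std::vector:
using HeapNode = Node<std::allocator>;
static_assert(sizeof(HeapNode) > 0);
```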
We in the library “just” have to provide the SHM-friendly Allocator<T> for the user to use. And, since stateful allocators are essentially unusable by mere humans in my subjective opinion (the boost.interprocess authors apparently disagree), we use a particular trick to work with an individual SHM arena: the “Activator” API.
So that leaves the mere topic of this SHM-friendly fancy-pointer type, which we provide.
For SHM-classic mode (if you’re cool with one SHM arena = one SHM segment, both sides being able to write to SHM, and the boost.interprocess alloc algorithm) - enabled with a template arg switch when setting up your session object - that’s just good ol’ offset_ptr.
For SHM-jemalloc (which leverages jemalloc, and hence is multi-segment and cool like that, plus with better segregation/safety between the sides) internally there are multiple SHM segments, so offset_ptr is insufficient. Hence we wrote a fancy-pointer for the allocator, which encodes the SHM segment ID and offset within the 64 bits. That sounds haxory and hardcore, but it’s not so bad really. BUT! It also needs to be able to point outside SHM (e.g., into the stack, which is often used when locally building up a structure), so it needs to be able to encode an actually-raw vaddr as well. And still use 64 bits, not more. Soooo I used pointer tagging, as not all 64 bits of a vaddr carry information.
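For anyone curious, such a tagging scheme looks roughly like this (my illustration, not the actual Flow-IPC type; reserving segment ID 0 for raw vaddrs is an assumption):

```cpp
#include <cstdint>

// Sketch of a tagging scheme (not the actual Flow-IPC type): on typical
// x86-64/Linux setups only the low 48 bits of a user-space address carry
// information, so the upper 16 bits can hold a SHM segment ID. Reserving
// segment ID 0 for "raw vaddr outside any segment" is an assumption here.
constexpr std::uint64_t kAddressBits = 48;
constexpr std::uint64_t kAddressMask = (std::uint64_t{1} << kAddressBits) - 1;

constexpr std::uint64_t encode(std::uint16_t segment_id, std::uint64_t offset_or_vaddr) {
    return (std::uint64_t{segment_id} << kAddressBits) | (offset_or_vaddr & kAddressMask);
}

constexpr std::uint16_t segment_of(std::uint64_t tagged) {
    return static_cast<std::uint16_t>(tagged >> kAddressBits);
}

constexpr std::uint64_t offset_of(std::uint64_t tagged) {
    return tagged & kAddressMask;
}
```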
So that’s how it all works internally. But hopefully the user doesn’t need to understand any of these details. Use our allocator when declaring container members. Use the allocator’s fancy-pointer type alias (or a similar alias; we give ya the aliases conveniently, hopefully) when declaring a direct pointer member. And specify which SHM-backing technique you want us to use internally, depending on your safety and allocation perf desires (currently available choices are SHM-classic and SHM-jemalloc).
Hehe, we are also using fancy-pointers in some places :)
We started with mapping the shm to the same address but soon noticed that it was not a good idea. It works until some application has already mapped something to that address. It's good that you did not go that route.
I was hoping you'd had an epiphany and found a nice solution to the raw-pointer problem without the need to change them, so we could borrow that idea :) Replacing raw pointers with fancy-pointers is indeed much simpler than replacing the whole logic.
Since the raw pointers need to be replaced by fancy-pointers, how do you handle STL containers? Is there a way to replace the pointer type, or is there some other magic?
Hehe, we have something called 'relative_ptr' which also tracks the segment ID + offset. It is a struct of two uint64_t, though. Later on, we needed to condense it to 64 bits to prevent torn writes in our lock-free queue exchange. We went the same route and encoded the segment ID in the upper 16 bits, since only 48 bits are used for addressing. It's kind of funny that other devs converge to similar solutions. We also have something called 'relocatable_ptr'. This one tracks only the offset to itself and is nice for building relocatable structures which can be memcpy'd as long as the offset points to a place within the copied memory. It's essentially 'boost::offset_ptr'.
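For completeness, the self-relative idea behind such a pointer can be sketched like this (an illustration, not the actual iceoryx2 'relocatable_ptr' or 'boost::offset_ptr' implementation):

```cpp
#include <cstdint>

// Sketch of a self-relative pointer (null handling omitted): it stores the
// distance from its own address to the target, so a structure containing it
// can be memcpy'd or mapped at a different base address and stays valid, as
// long as the target is copied along with it.
template <typename T>
class SelfRelativePtr {
public:
    void set(T* target) {
        offset_ = reinterpret_cast<std::intptr_t>(target)
                  - reinterpret_cast<std::intptr_t>(this);
    }

    T* get() const {
        return reinterpret_cast<T*>(reinterpret_cast<std::intptr_t>(this) + offset_);
    }

private:
    std::intptr_t offset_ = 0;
};
```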
Btw, when you use jemalloc, do you free the memory from a different process than the one that allocated it? We did the same for iceoryx1 but moved to a submission-queue/completion-queue architecture to reduce complexity in the allocator and free the memory in the same process that did the allocation. With iceoryx2 we also plan to be more dynamic and have ideas to implement multiple allocators with different characteristics. Funnily, jemalloc is also on the table for use cases where fragmentation is not a big problem. Maybe we can create a common library for shm allocation strategies which can be used by both projects.
> I was hoping you'd had an epiphany and found a nice solution to the raw-pointer problem without the need to change them, so we could borrow that idea :)
Well, almost. But alas, I am unable to perform magic in which a vaddr in process 1 means the same thing in process 2, without forcing it to happen by using that mmap() option. And indeed, I am glad we didn't go down that road -- it would have worked within Akamai due to our kernel team being able to do such custom things for us, avoiding any conflict and so on; but this would be brittle and not effectively open-sourceable.
> Since the raw pointers need to be replaced by fancy-pointers, how do you handle STL containers? Is there a way to replace the pointer type, or is there some other magic?
Yes, through the allocator. An allocator is, at its core, three things. 1, what to execute when asked to allocate? 2, what to execute when asked to deallocate? 3, and this is the relevant part here, what is the pointer type? This used to be an alias `pointer` directly in the allocator type, but nowadays it's done through traits. Point being: an allocator type can have the pointer type just be T*, or it can alias it to a fancy-pointer type. Furthermore, to be STL-compliant, a container type must religiously follow this convention and never rely on T* being the pointer type (a minimal sketch of such an allocator follows the list below). Now, in practice, some GNU stdc++ containers are bad boys and don't follow this; they will break; but happily:
- clang's libc++ containers are fine;
- boost.container's are fine (and, of course, implement exactly the required API semantics in general... so you can just use 'em);
- any custom-written containers should be written to be fine; for example see our flow::util::Basic_blob which we use as a nailed-down vector<uint8_t> (with various goodies like predictable allocation size behavior and such) for various purposes. That shows how to write such a container that properly follows STL-compliant allocator behavior. (But again, this is not usually something you have to do: the aforementioned containers are delightful and work. I haven't looked into abseil's.)
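Here is the promised sketch of that third ingredient (an illustration, not Flow-IPC's actual allocator; `FancyPtr` is a stand-in for the SHM-aware pointer type):

```cpp
#include <cstddef>
#include <memory>
#include <type_traits>

// A stand-in for the SHM-aware fancy pointer (segment ID + offset, etc.).
template <typename T> class FancyPtr { /* ... */ };

template <typename T>
struct ShmAllocator {
    using value_type = T;
    using pointer    = FancyPtr<T>;             // 3: the relevant part here

    pointer allocate(std::size_t n);            // 1: what to do on allocate
    void deallocate(pointer p, std::size_t n);  // 2: ...and on deallocate
};

// An STL-compliant container must obtain its pointer type from
// allocator_traits instead of assuming T*:
static_assert(std::is_same_v<std::allocator_traits<ShmAllocator<int>>::pointer,
                             FancyPtr<int>>);
```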
So that's how. Granted, subtleties don't stop there. After all, there isn't just "one" SHM arena, the way there is just one general heap. So how to specify which SHM arena to allocate in? One: use a stateful allocator. But that's a pain. Two: use the activator trick we used. It's quite convenient in the end.
> Btw, when you use jemalloc, do you free the memory from a different process than the one that allocated it?
No; this was counter to the safety requirements we wanted to keep to, with SHM-jemalloc. We by default don't even turn on writability into a SHM-arena by any process except the one that creates/manages the arena - can't deallocate without writing. Hence there is some internal, async IPC that occurs for borrower-processes: once a shared_ptr<T> group pointing into SHM reaches ref-count 0, behind the scenes (and asynchronously, since deallocating need not happen at any particular time and shouldn't block user threads), it will indicate to the lending-process this fact. Then once all such borrower-processes have done this, and the same has occurred with the original shared_ptr<T> in the lender-process (which allocated in the first place), the deallocation occurs back in the lender-process.
If one chooses to use SHM-classic (which -- I feel compelled to keep restating for some reason, not sure why -- is a compile-time switch for the session or structure, but not some sort of global decision), then it's all simplicity itself (and very quick -- atomic-int-quick). offset_ptr, internally-stored ref-count of owner-processes; once it reaches 0, whichever process/piece of code caused it will itself deallocate it.
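A sketch of that ref-counting part, just to illustrate the mechanics (my illustration, not Flow-IPC code):

```cpp
#include <atomic>

// Sketch of the SHM-classic idea: the ref-count lives in the shared segment
// next to the object, every owning process increments it, and whichever
// process drops it to zero deallocates the block right there.
struct ShmBlockHeader {
    std::atomic<int> owner_count{1};  // creator starts as the first owner
};

void add_owner(ShmBlockHeader& h) {
    h.owner_count.fetch_add(1, std::memory_order_relaxed);
}

// Returns true if the caller was the last owner and must free the block.
bool drop_owner(ShmBlockHeader& h) {
    return h.owner_count.fetch_sub(1, std::memory_order_acq_rel) == 1;
}
```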
The idea of its design is that one could plug-in still more SHM-providers instead of SHM-jemalloc or SHM-classic. It should all keep working through the magic of concepts (not formal C++20 ones... it's C++17).
---
Somewhere above you mentioned collaboration. I claim/hope that Flow-IPC is designed in a pragmatic/no-frills way (tried to vaguely imitate boost.interprocess that way) that exposes whichever layer you want to use, publicly. So, to give an example relating to what we are discussing here:
Suppose someone wants to use iceoryx's badass lock-free mega-fast one-microsecond transmission. But, they'd like to use our SHM-jemalloc dealio to transmit a map<string, vector<Crazy_ass_struct_with_more_pointers_why_not>>. I completely assure you I can do the following tomorrow if I wanted:
- Install iceoryx and get it to essentially work, in that I can transmit little constant-size blobs with it at least. Got my mega-fast transmission going.
- Install Flow-IPC and get it working. Got my SHM-magic going.
- In no more than 1 hour I will write a program that uses just the SHM-magic part of Flow-IPC -- none of its actual IPC-transmission itself per se (which I claim itself is pretty good -- but it ain't lock-free custom awesomeness suitable for real-time automobile parts or what-not) -- but uses iceoryx's blob-transmission.
It would just need to ->construct<T>() with Flow-IPC (this gets a shared_ptr<T>); then ->lend_object<T>() (this gets a tiny blob containing an opaque SHM-handle); then use iceoryx to transmit the tiny blob (I would imagine this is the easiest possible thing to do using iceoryx); on the receiver call Flow-IPC ->borrow_object<T>(). This gets the shared_ptr<T> -- just like the original. And that's it. It'll get deallocated once both shared_ptr<T> groups in both processes have reached ref-count 0. A cross-process shared_ptr<T>, if you will. (And it is by the way just a shared_ptr<T>: not some custom type monstrosity. It does have a custom deleter, naturally, but as we know that's not a compile-time decision.)
So yes, believe it or not, I was not trying to out-compete you all here. There is zero doubt you're very good at what you do. The most natural use cases for the two overlap but are hardly the same. Live and let live, I say.
Don't worry. It's great to have other projects in this field exploring different routes, and you created a great piece of software. The best thing: it's all open source after all :)
Reading your response, it's almost as if you've been at our coffee chats. Quite a few of your ideas are either already implemented in iceoryx2 or on our todo list. It seems we just put our focus on different things. Here and there you also added the cherry on top. This motivates us to improve in some areas we have neglected over the last years. We can learn from each other and improve our projects, thanks to the beauty of open source.