We’re stuck with big endian forever because of network byte order. There will probably always be a niche market for BE CPUs for things that do lots of packet processing in software.
Anything that processes a bitstream on a slow processor benefits from BE being simpler, since fields can be decoded in arrival order (see the sketch below). For anything else it doesn't matter, thanks to caches and the non-issue of adding a few more FETs here and there to convert between your preferred format and the arriving one.
(Though for debugging hex-encoded data I still prefer BE, but that is just a personal preference.)
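A minimal sketch of what "in order" buys you: with a big-endian wire format, a multi-byte field can be accumulated byte by byte as it arrives, with no buffering and no final swap. The function name here is just for illustration:

```c
#include <stdint.h>
#include <stddef.h>

/* Decode a big-endian 32-bit field in arrival order: each incoming
 * byte shifts the running value left, so the field is complete the
 * moment its last byte arrives -- no buffering, no byte swap. */
static uint32_t read_be32_stream(const uint8_t *stream)
{
    uint32_t value = 0;
    for (size_t i = 0; i < 4; i++)
        value = (value << 8) | stream[i];   /* accumulate as bytes arrive */
    return value;
}
```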
From first-hand experience, swapping the endianness is a non-issue in network processing performance-wise (it is an issue headache-wise, though). When processing packets in software, the cost is dominated by the following:
- memory bandwidth limits: for each packet, you do pkt NIC -> RAM, headers RAM -> cache, process, cache -> RAM, pkt RAM -> NIC. And that's assuming you're only looking at headers, e.g. for routing; DPI pulls the whole packet RAM -> cache.
- branch predictor limits: with enough mixed traffic, the branch predictor is basically useless. Even RPS (Receive Packet Steering) will not save you once you have enough streams.
So yeah, endianness is a non-issue processing-wise. All the more so since one of the most expensive operations (checksumming) can be done on an LE CPU without swapping the byte order, as sketched below.
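The reason is that the Internet checksum (RFC 1071) is a ones'-complement sum, which is byte-order independent: summing native little-endian loads yields the byte-swapped sum, so storing the 16-bit result back unswapped puts the bytes on the wire correctly. A rough sketch (the function name is mine, and the odd-byte handling assumes an LE host):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* RFC 1071 Internet checksum, computed with native (LE) 16-bit loads.
 * The ones'-complement sum is byte-order independent, so no per-word
 * byte swap is needed; the final result is stored back as-is. */
static uint16_t inet_checksum(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t sum = 0;

    /* Sum 16-bit words in whatever order the CPU loads them. */
    while (len >= 2) {
        uint16_t word;
        memcpy(&word, p, 2);    /* alignment-safe native load */
        sum += word;
        p += 2;
        len -= 2;
    }
    if (len)                    /* odd trailing byte; correct on LE hosts */
        sum += *p;

    /* Fold the carries back into the low 16 bits (end-around carry). */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;      /* write back without swapping */
}
```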
Even assuming this does have a measurable performance effect on the kind of processors you run Linux on (as opposed to something like a Cortex-M), all you need are load-big-endian and store-big-endian instructions.
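And in practice you often don't even need to write such instructions yourself: on GCC/Clang, a memcpy load plus __builtin_bswap32 is the idiomatic pattern, and the compiler fuses it into a byte-reversing load where the ISA has one (MOVBE on x86 with -mmovbe, lwbrx on POWER, LDR+REV on ARM). A sketch, assuming GCC or Clang:

```c
#include <stdint.h>
#include <string.h>

/* Portable big-endian 32-bit load. Compilers fold the swap into the
 * load on targets that support it, so it costs nothing on the hot path. */
static inline uint32_t load_be32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);       /* plain native load */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    v = __builtin_bswap32(v);      /* fused into the load when possible */
#endif
    return v;
}
```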