
Yeah, but that's kind of a given here: C with AVX intrinsics or C with NEON intrinsics or C with SVE intrinsics is also 100% non-portable.


> C with AVX intrinsics or C with NEON intrinsics or C with SVE intrinsics is also 100% non-portable

That’s not true because C with SIMD intrinsics is portable across operating systems. Due to differences in calling conventions and other low-level things, assembly is specific to a combination of target ISA and target OS.

Here’s a real-life example of what happens when assembly code fails to preserve SSE vector registers specified as non-volatile in the ABI convention of the target OS: https://issues.chromium.org/issues/40185629


Thanks, this is a good point!


But ispc is portable between AVX512 and ARM NEON.

The key is that writing SIMD needs a SIMD language. OpenCL, ISPC, CUDA, HIP, and C++ AMP (RIP) are examples of such SIMD languages.

I guess a really portable SIMD language would be WebGL / HLSL and such.


Scalar languages can go a long way in this exercise. There are a few hiccups in trying to vectorize scalar languages, but see how well clang's latest -O3 can do on C++.

I would love to see ISPC take off. It's my favorite way of abstracting code in this new computing paradigm. But it doesn't seem to be getting much traction.

I think the future is also going to be less "SIMD" and more "MIMD" (there's probably a better term for that). You can even see it in AVX512 with things like aggregations (which aren't strictly SIMD): there's no reason a single op has to be the exact same operation spread over 512/(data size) slots. You can just as easily put popular sets of operations into the tool set.
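
E.g. a horizontal reduction is already "combine the lanes" rather than "the same op in every lane". A rough sketch with AVX-512F intrinsics (note _mm512_reduce_add_ps is a compiler-provided helper sequence rather than a single instruction, and support varies by compiler):

    #include <immintrin.h>

    /* Horizontal sum of 16 floats: an aggregation across lanes rather
       than the same scalar op repeated per lane.
       Needs AVX-512F, e.g. compile with -mavx512f. */
    float sum16(const float *p) {
        __m512 v = _mm512_loadu_ps(p);      /* load 16 unaligned floats */
        return _mm512_reduce_add_ps(v);     /* reduce across all lanes  */
    }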


SIMD coders are doing DirectX / HLSL and/or Vulkan in practice. Or CUDA of course.

AVX512 is nice, but 4000+ shaders on a GPU are better. CPUs sit at an awkward point: you need a dataset small enough that it would suffer major penalties from CPU/GPU transfers.

Too large, and GPU RAM is better as a backing store. Too small, and no one notices the difference.


> CPUs sit at an awkward point: you need a dataset small enough that it would suffer major penalties from CPU/GPU transfers.

Or a dataset so large that it won't fit in the memory of any available GPU, which is the main reason why high-end production rendering (think Pixar, ILM, etc) is still nearly always done on CPU clusters. Last I heard their render nodes typically had 256GB RAM each, and that was a few years ago so they might be up to 512GB by now.


Yes, but GPU clusters with 256+ GB of VRAM are possible and superior in bandwidth thanks to NVSwitch.

I'd say CPUs still have the RAM advantage, but only at around 1TB+ of RAM, which NVSwitch no longer scales to. CPUs with 1TB of RAM are a fraction of the cost too, so price/performance deserves a mention.

------

Even then, PCIe is approaching the bandwidth of RAM (latency remains a problem of course).

For Raytracing in particular, certain objects (bigger background objects or skymaps) have a higher chance of being hit.

There are also octree schemes where you can have rays bounce inside an 8GB chunk (only that chunk is loaded in GPU RAM), and only reorganize the rays when they leave the chunk.

So even Pixar-esque scenes can be rendered quickly in 8GB chunks. In theory, of course; I read a paper on it but I'm not sure if this technique is used commercially yet.

But basically, raytrace until a ray leaves your chunk. If it does, collate it for another pass to the chunk it's going to. On the scale of millions of rays (like in Pixar movies), enough are grouped up that it improves rendering while effectively minimizing GPU VRAM usage.
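
In code, the batching loop is roughly something like this (purely a hypothetical sketch, not from the paper; trace_in_chunk, load_chunk_to_gpu and push_ray stand in for the real renderer):

    /* Hypothetical sketch of the chunked batching idea: keep one chunk
       resident on the GPU, trace its queued rays, and re-bucket any ray
       that exits into the queue of the chunk it is heading for. */
    #include <stddef.h>

    #define NUM_CHUNKS 64

    typedef struct { float org[3], dir[3]; } Ray;
    typedef struct { Ray *rays; size_t count; } RayQueue;

    /* Stand-ins for the real renderer (assumed, not real APIs): */
    extern void load_chunk_to_gpu(int chunk);
    extern int  trace_in_chunk(int chunk, Ray *r);   /* next chunk, or -1 if done */
    extern void push_ray(RayQueue *q, const Ray *r);

    void render(RayQueue queues[NUM_CHUNKS]) {
        for (int work_left = 1; work_left; ) {
            work_left = 0;
            for (int c = 0; c < NUM_CHUNKS; ++c) {
                if (queues[c].count == 0) continue;
                work_left = 1;
                load_chunk_to_gpu(c);               /* one ~8GB chunk in VRAM at a time */
                for (size_t i = 0; i < queues[c].count; ++i) {
                    int next = trace_in_chunk(c, &queues[c].rays[i]);
                    if (next >= 0)                  /* ray left the chunk: collate it */
                        push_ray(&queues[next], &queues[c].rays[i]);
                }
                queues[c].count = 0;                /* this chunk's batch is done */
            }
        }
    }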

Between caching common objects and this octree / blocks technique, I think raytracing can move to pure GPU. Whenever Pixar feels like spending a billion dollars on the programmers, of course.


Or perf/TCO, or availability? :)


Not a C coder, but isn't there a way to embed platform-specific optimizations into your C project and make them conditional, so that at build time you get the best implementation?

    #if defined(__ARM_NEON__)
    #include <arm_neon.h>
    ...


Yes, but then you have to write (and debug and maintain) each part 3 times.
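
Roughly like this, for example (just a rough sketch; add_f32 and the exact feature guards are illustrative, not from any particular project):

    /* Hypothetical example: the same function written once per target,
       selected at build time; the scalar fallback keeps it portable. */
    #include <stddef.h>

    #if defined(__ARM_NEON) || defined(__ARM_NEON__)
    #include <arm_neon.h>
    void add_f32(const float *a, const float *b, float *out, size_t n) {
        size_t i = 0;
        for (; i + 4 <= n; i += 4)
            vst1q_f32(out + i, vaddq_f32(vld1q_f32(a + i), vld1q_f32(b + i)));
        for (; i < n; ++i) out[i] = a[i] + b[i];          /* scalar tail */
    }
    #elif defined(__AVX__)
    #include <immintrin.h>
    void add_f32(const float *a, const float *b, float *out, size_t n) {
        size_t i = 0;
        for (; i + 8 <= n; i += 8)
            _mm256_storeu_ps(out + i,
                _mm256_add_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i)));
        for (; i < n; ++i) out[i] = a[i] + b[i];          /* scalar tail */
    }
    #else
    void add_f32(const float *a, const float *b, float *out, size_t n) {
        for (size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
    }
    #endif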

There are also various libraries that provide cross-platform abstractions over the underlying SIMD intrinsics. Highway, from Google, and xsimd are two popular such libraries for C++. SIMDe is a nice library that also works with C.
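
With SIMDe, for instance, you write the SSE-style code once and it gets mapped to NEON (or plain scalar) on other targets. A minimal sketch (the function names follow SIMDe's simde_ prefix convention; the kernel itself is made up):

    /* Scale an array: SSE-style code via SIMDe, portable to non-x86. */
    #include <stddef.h>
    #include <simde/x86/sse.h>

    void scale_f32(float *data, size_t n, float factor) {
        simde__m128 f = simde_mm_set1_ps(factor);
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            simde__m128 v = simde_mm_loadu_ps(data + i);
            simde_mm_storeu_ps(data + i, simde_mm_mul_ps(v, f));
        }
        for (; i < n; ++i) data[i] *= factor;   /* scalar tail */
    }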


> Yes, but then you have to write (and debug and maintain) each part 3 times.

could you not use a test suite structure (not saying it would be simple) that would run the suite across 3 different virtualized chip implementations? (The virtualization itself might introduce issues, of course)


> each part 3 times.

True.

So most projects just use SIMDe, xSIMD or something similar for such use cases?


FFmpeg is the most successful of such projects and it uses handwritten assembly. Ignore the seductive whispers of people trying to sell you unreliable abstractions. It's like this for good reasons.


Or you just say that your code is only fast on hardware that supports native AVX512 (or whatever). In many cases where speed really matters, that is a reasonable tradeoff to make.


You can always do that using build flags, but it doesn't make it portable; you as a programmer still have to manually port the optimized code to all the other platforms.


Yeah, you can do that, but that still means you write platform-specific code. What you typically do is write a cross-platform scalar implementation in standard C, and then, for each target you care about, a platform-specific vectorized implementation. Then, through some combination of compile-time feature detection, build flags, and runtime feature detection, you select which implementation to use.

(The runtime part comes in because you may want a single amd64 build of your program which uses AVX-512 if that's available, but falls back to 256-bit AVX and/or SSE if it's not.)
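
A minimal sketch of the runtime part on x86, assuming GCC/Clang's __builtin_cpu_supports and per-ISA variants compiled separately with the matching -m flags (the function names are just illustrative):

    #include <stddef.h>

    /* Variants assumed to be defined elsewhere, e.g. in separate
       translation units built with -mavx512f / -mavx2. */
    void sum_avx512(const float *a, size_t n, float *out);
    void sum_avx2(const float *a, size_t n, float *out);
    void sum_scalar(const float *a, size_t n, float *out);

    typedef void (*sum_fn)(const float *, size_t, float *);

    /* Pick the best implementation once, based on the CPU we run on. */
    sum_fn select_sum(void) {
        if (__builtin_cpu_supports("avx512f")) return sum_avx512;
        if (__builtin_cpu_supports("avx2"))    return sum_avx2;
        return sum_scalar;
    }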


> runtime feature detection

For any code that's meant to last a bit more than a year, I would say that should also include runtime benchmarking. CPUs change, compilers change. The hand-written assembly might be faster today, but might be sub-optimal in the future.
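
A hypothetical sketch of what that selection could look like (the kernel type and scratch data are made up, and real benchmarking would want warm-up and multiple runs; uses POSIX clock_gettime):

    #include <stddef.h>
    #include <time.h>

    typedef void (*kernel_fn)(float *data, size_t n);

    /* Time one candidate on representative data. */
    static double time_kernel(kernel_fn f, float *data, size_t n) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        f(data, n);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    }

    /* Benchmark all candidates once at startup and keep the fastest. */
    kernel_fn pick_fastest(kernel_fn *candidates, size_t count,
                           float *scratch, size_t n) {
        kernel_fn best = candidates[0];
        double best_t = time_kernel(best, scratch, n);
        for (size_t i = 1; i < count; ++i) {
            double t = time_kernel(candidates[i], scratch, n);
            if (t < best_t) { best_t = t; best = candidates[i]; }
        }
        return best;
    }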


That vectorized code is faster than scalar code is a pretty universally safe assumption (provided the algorithm lends itself to vectorization, of course). I'm not talking about runtime selection of hand-written asm compared to compiler-generated code, but rather runtime selection of vector vs scalar.



