
The arguments in this blog post are fundamentally flawed. The fact that they opened a bug report based on them and got shut down should have raised red flags.

When compiling and running a C program, the only thing that matters is "what the C abstract machine does". Programs that exhibit UB in the abstract machine are allowed to do "anything".

Trying to scope that down using arguments of the form "but what the hardware does is X" are fundamentally flawed, because anything means anything, and what the hardware does doesn't change that, and therefore it doesn't matter.

This blogpost "What The Hardware Does is not What Your Program Does" explains this in more detail and with more examples.

https://www.ralfj.de/blog/2019/07/14/uninit.html


The blog post is also kind of unhinged because in the incredibly rare cases where you would want to write code like this you can literally just use the asm keyword.
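For the curious, a minimal GNU C sketch of what that looks like on x86-64 (the helper name is mine):

    #include <stdint.h>

    /* Explicit unaligned 32-bit load via inline asm (AT&T syntax), so
     * nothing about the alignment of p is assumed at the C level. */
    static inline uint32_t load_u32(const void *p)
    {
        uint32_t v;
        __asm__ ("movl (%1), %0" : "=r"(v) : "r"(p) : "memory");
        return v;
    }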

I think it's also worth considering WHY compilers (and the C standard) make these kinds of assumptions. For starters, not all hardware platforms allow unaligned accesses at all. Even on x86 where it's supported, you want to avoid doing unaligned reads at all costs because they're up to 2x slower than aligned accesses. God forbid you try to use unaligned atomics, because while technically supported by x86 they're 200x slower than using the LOCK prefix with an aligned read.[^1] The fact that you need to go through escape hatches to get the compiler to generate code to do unaligned loads and stores is a good thing, because it helps prevent people from writing code with mysterious slowdowns.

Writing a function that takes two pointers of the same type already has to pessimize loads and stores on the assumption that the pointers could alias. That is to say, if your function takes int *p, int *q, then doing a store to p requires reloading q, because p and q could point to the same thing. Thankfully, in some situations the compiler can figure out that in a certain context p and q have different addresses and therefore can't alias, which helps it generate faster code (by avoiding redundant loads). If p and q were allowed to alias even when they have different addresses, this would all go out the window and you'd basically need to assume that all pointer types could alias in any situation. This would be TERRIBLE for performance.
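A tiny sketch of the pessimization described above (names are mine):

    /* Because p and q may point to the same int, the compiler must
     * reload *q after every store through p. */
    int sum_after_stores(int *p, int *q)
    {
        *p = 1;
        int a = *q;   /* load required: *q may have just become 1 */
        *p = 2;
        int b = *q;   /* reload required: *q may have just become 2 */
        return a + b;
    }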

[^1]: https://rigtorp.se/split-locks/


> Even on x86 where it's supported, you want to avoid doing unaligned reads at all costs because they're up to 2x slower than aligned accesses.

This is generally not true.


It is generally true, because "up to" includes the same speed. However, unaligned memory access can significantly impact caching.


> For starters, not all hardware platforms allow unaligned accesses at all.

Yeah, and always, everywhere, a mistake. It was a mistake back in the 1970s and it's an increasingly big mistake as time goes on. Just like big endian and 'network order'.


Not really. They’re quite convenient in many cases, and can actually be more efficient than carefully preparing aligned loads/stores.


While the sentiment about why compilers make alignment assumptions is correct, I think a lot of the details here are not quite right.

> For starters, not all hardware platforms allow unaligned accesses at all

If you're dealing with very simple CPUs like the ARM M0, sure. But even the M3/M4 allows unaligned access.

> Even on x86 where it's supported, you want to avoid doing unaligned reads at all costs because they're up to 2x slower than aligned accesses

I believe that information hasn't been true for a long time (since 1995). Unless you're talking about unaligned accesses that also cross a cache line boundary being slower [1]. But I imagine that aligned accesses crossing a cache line boundary are also similarly slower because the slowness is the cache line boundary.

> God forbid you try to use unaligned atomics, because while technically supported by x86 they're 200x slower than using the LOCK prefix with an aligned read

What you're referring to is atomic unaligned access that's also across cache line boundaries. I don't know what it is within a cache line, but I imagine it's not as bad as you make it out to be. Unaligned atomics across cache line boundaries also don't work on ARM and have much spottier support than unaligned access in general.

TLDR: People cargo-cult advice about unaligned access, but mostly because it's a simpler rule of thumb and there's typically very little benefit to packing things as tightly as possible, which is where unaligned accesses generally come up.

[1] https://news.ycombinator.com/item?id=10529947


Your message is more misleading than the GP's.

Many architectures sold today still claim unaligned accesses are optional (e.g. all ARM pre-v7, which includes the popular Raspberry Pi Zero). Not to mention that even if they are supported, not all instructions support it (which is the case today on all ARM cores and even on x86).

Among the architectures and instructions that do support it, there may be a performance penalty ranging from "somewhat slower" (e.g. Intel still recommends stack alignment, because otherwise many internal store optimizations start giving up) to "ridiculously slower" (e.g. I once had to write a trap handler that software-emulated unaligned accesses on ARM -- on all 32-bit ARMs Linux still does this for all instructions except plain undecorated LDR/STR when the special unaligned ABI is enabled).

And finally, even if the architecture supports it with decent enough performance, it may do it with relaxed atomicity. E.g. even as of today aarch64 makes zero guarantees regarding atomicity of even atomic instructions on unaligned addresses (yes, really). To put it simply, it is a _pain in the ass_ to implement correctly (say the programmer does an atomic load/store on overlapping addresses with different alignments). This is true whether they cross cache lines or not.

i.e. it's as bad as the GP is saying. You can't just point to one example of one processor handling each case correctly to dismiss this claim, because the point is that most processors don't bother, and those that do bother still have severe, crippling limitations that make it unfeasible to use in a general-purpose compiler.

And there is still a lot of benefit to packing things up... but it does require way too much care and programmer effort.


> If you're dealing with very simple CPUs like the ARM M0, sure. But even the M3/M4 allows unaligned access.

On ARM M3/M4 you have the same issue with the LDRD and STRD instructions, which do not allow unaligned access. Even the normal loads/stores don't allow unaligned access in all cases. Try this in the peripheral memory region for starters. And things get even more complicated when the memory protection unit shakes things up.


Yeah, even Microsoft's compiler aligns values on appropriate boundaries for performance reasons. DWORDs on DWORD boundaries, etc. And if you want to pack the data structure to avoid the gaps in structures, there are methods to do so via #pragma pack. I think their complaining about what was done for performance reasons shows a great lack of overall understanding. More time researching and less time griping would have served them better.


This line in particular really bugs me:

> The present blog post brings bad, and as far as I know, previously undocumented news. Even if you really are targeting an instruction set without any memory access instruction that requires alignment, GCC still applies some sophisticated optimizations that assume aligned pointers.

I could have told you this was true ~20 years ago, and the main reason I'm so conservative in how far back gcc has been doing this is that it's only around that time I started programming--I strongly suspect this dates back to the 90's.


It dates to the first standardization of C in 1989. The "C as portable assembly" view ended when ANSI C got standardized, and K&R's 2nd edition was published.


I would argue it's the modern understanding of the C standard that is flawed.

Back in 89, many of those unspecified behaviors were understood as implementation/hardware dependent, not undefined. Aliasing was the norm, `restrict` was actually a keyword.

Modern C is neither safe nor low-level.


Ascertaining the state of the mind of the C committee in 1989 is difficult, since only the documents from ~late 1996 are consistently available online (the earlier documents are probably sitting somewhere in a warehouse in Geneva, but they may as well not exist anymore).

But definitely by the time C99 came out, it is clear that optimize-assuming-UB-doesn't-happen was an endorsed viewpoint of the committee [1]. C99 also added restrict to the language (not C89 as you suggest), and restrict was the first standardized feature that was a pure UB-optimization hint [2].

It is important to remember that there isn't just one catch-all category of implementation-varying behavior. There is a difference between unspecified behavior, implementation-defined behavior, and undefined behavior. Undefined behavior has been understood, from its inception, as behavior that doesn't constrain the compiler, and often describes behavior that can't be meaningfully constrained (especially with regards to potentially-trapping operations).

[1] The C99 rationale gives an example of an optimization that compilers can perform that relies on assuming UB can't happen--reassociation of integer addition, on one's complement machines.

[2] The register keyword is I believe even in K&R C and would also be qualified as a compiler hint feature, but I note that it prohibits taking the address of the variable entirely, so it doesn't rely on UB. Whereas restrict has to rely on "if these two variables alias, it's UB" to allow the compiler to optimize assuming nonaliasing.
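A rough sketch of how restrict acts as a pure UB-based optimization hint (example is mine, not from the standard):

    /* The programmer promises dst and src never alias; if they do, the
     * behavior is undefined. The compiler may then vectorize and reorder
     * the loads and stores as if the two arrays were independent. */
    void scale_add(float *restrict dst, const float *restrict src, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] += 2.0f * src[i];
    }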


> Back in 89 […] `restrict` was actually a keyword.

Was it? I thought it’s more recent. https://en.wikipedia.org/wiki/Restrict seems to agree (“In the C programming language, restrict is a keyword, introduced by the C99 standard,[1] that can be used in pointer declarations”), as does https://en.cppreference.com/w/c/language/restrict (“restrict type qualifier (since C99)”)

Was there an older usage?


I've used several pre-C99 embedded compilers that supported restrict. IIRC, probably of mid 90s vintage.


I haven't gotten to use C in industry, but I was taught that undefined behavior just means that it is defined by the running system and not the compiler. Is that not the general understanding? Maybe I was just taught that way because it was old timers teaching it.


If the language standard leaves some behavior undefined, other sources (e.g., POSIX, your ABI, your standard library docs, or your compiler docs) are free to define it. If they do, and you are willing to limit your program’s portability, you can use that behavior with confidence. But they also leave many behaviors undefined, and you can’t rely on those.

For implementation-defined behavior, the language standard lays out a menu of options and your implementation is required to pick one and document it. IMHO, many things in the C standard are undefined that ought to be implementation-defined. But unaligned pointer accesses would be hard to handle that way; at best you could make the compiler explicitly document whether or not it supports them on a given architecture.


What you are talking about is implementation-defined behavior. It exists in the C standard separately from the undefined behavior.


Implementation Defined behavior means the standards authors provided a list of possible behaviors, and compiler authors must pick one and document which they picked.

Unspecified behavior is more what you're thinking of, though in that case the standard still provides a list of possibilities that compiler authors have to pick from, they just don't have to document it or always make the same choice for every program.

There's no category of behavior where compiler authors are free to pick whatever they want, as long as they document it. IMO there should be; most "Undefined Behavior" could be specified and documented, even where that choice would be "the compiler assumes such situations are unreachable and optimizes based on that assumption", like much of current UB. At least it'd be explicit!
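For what it's worth, GCC and Clang already let you state that assumption explicitly; a small sketch of what "assume unreachable and optimize" looks like when spelled out rather than implied:

    /* If b could ever be 0, this function has UB; the builtin states the
     * assumption explicitly instead of leaving it implicit in the division. */
    int divide(int a, int b)
    {
        if (b == 0)
            __builtin_unreachable();   /* promise: b is never 0 here */
        return a / b;
    }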


> Implementation Defined behavior means the standards authors provided a list of possible behaviors

The standard definitely does not require implementations to pick from a list of possible behaviors. All the standard requires is that the implementation document the behavior.

For example, the behavior on integer demotion is implementation-defined and there's no list of possible behaviors:

> When an integer is demoted to a signed integer with smaller size, or an unsigned integer is converted to its corresponding signed integer, if the value cannot be represented the result is implementation-defined.
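Concretely (a sketch; the exact result is whatever the implementation documents):

    #include <stdio.h>

    int main(void)
    {
        /* 130 does not fit in signed char. The result is implementation-
         * defined; on common two's-complement ABIs it comes out as -126,
         * but the standard does not enumerate the allowed outcomes. */
        signed char c = (signed char)130;
        printf("%d\n", (int)c);
        return 0;
    }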

> Unspecified behavior is more what you're thinking of, though in that case the standard still provides a list of possibilities that compiler authors have to pick from

That contradicts the standard's definition of unspecified behavior. For example, from the C89 draft (emphasis added) [0]:

> Unspecified behavior --- behavior, for a correct program construct and correct data, for which the Standard imposes no requirements.

[0]: https://port70.net/%7Ensz/c/c89/c89-draft.html#1.6sd


That’s indeed incorrect. Undefined behavior anywhere means that the entirety of your program is undefined and may do anything.


No. See this for details on how UB is handled by compilers:

http://blog.llvm.org/2011/05/what-every-c-programmer-should-...

The TL;DR is that compilers compile code based on assumptions that UB won't be invoked. This sometimes produces extremely surprising results which have nothing to do with the hardware/OS.
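A sketch of the kind of surprise involved, along the lines of the examples in that post:

    /* Dereferencing p is UB if p is NULL, so after the first line the
     * compiler may assume p != NULL and silently delete the check below. */
    int read_flag(int *p)
    {
        int v = *p;
        if (p == NULL)      /* "dead" under the no-UB assumption */
            return -1;
        return v;
    }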


For C to be portable this needs to be undefined behaviour because there are CPUs that don’t support unaligned access


That’s why, while much of the linked blog is kind of off the mark (signs of someone knowing less than they think they know), the general conclusion, to stick to aligned pointers, is one that I typically give to developers new to C or C++ anyway.

I’m alright with folks sticking to aligned pointer operations, largely for performance reasons. On some platforms, unaligned operations are really expensive.


I know the author. Your guess here is wrong.


There are some other reasons, but that's one of them.

Another is that you want to guarantee objects are stored aligned in memory because that gives you some free bits in pointers you can hide stuff in. (This has less hardware support than it should.)
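A small sketch of that trick (helper names are mine; assumes allocations are at least 8-byte aligned):

    #include <stdint.h>

    /* With 8-byte alignment the low 3 bits of a pointer are always zero,
     * so they can carry a small tag; round-trip through uintptr_t. */
    static inline void    *tag_ptr(void *p, unsigned tag) { return (void *)((uintptr_t)p | (tag & 0x7u)); }
    static inline void    *strip_tag(void *p)             { return (void *)((uintptr_t)p & ~(uintptr_t)0x7); }
    static inline unsigned get_tag(void *p)               { return (unsigned)((uintptr_t)p & 0x7); }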


It doesn't need to be undefined to say it crashes on some architectures.


I guess it could also be implementation defined

My point here is that you can’t have “everything works as it does in the native assembly language” and “portable assembly” at the same time because if you rely on implementation defined or undefined behaviour then it’s not portable any more


That depends on what you mean by "portable". I think being able to use the same code across many platforms is enough to qualify. Being able to access raw machine behavior is part of the premise of portable assembly, not a disqualifier.


It’s definitely much more portable than assembly, but code that relies on unaligned pointer accesses won’t work on every platform


Indeed, I still have ~20-year-old code that picks up and rectifies unaligned memory so gcc does the right thing. To claim a compiler bugs out on unaligned memory sounds very weird; I assumed that was common knowledge.


My first 10 minutes of trying to talk to hardware, and then googling the error message, taught me.


27 years ago I was helping someone rearrange structs because word-sized fields were being word aligned, and you would waste a good deal of memory if you arranged by concept instead of for byte packing. I believe that was under Microsoft’s compiler.


What you’re saying and what the blog post is implying are different things. This is an admission that GCC optimizes on this behavior in practice. Your claim is that GCC could optimize on this, which is a much less interesting claim.


That's what the author meant when he said "The shift of the C language from “portable assembly” to “high-level programming language without the safety of high-level programming languages”"

Back in the 1980s, C was expected to do what hardware does. There was no "the C abstract machine".

The abstract machine idea was introduced much later.

> The arguments in this blogpost are fundamentally flawed.

The "fundamentally flawed" comment is revisionist idea.


This turns out to be contentious. There are two histories of the C language and which one you get told is true depends on who you ask.

1/ a way to emit specific assembly with a compiler dealing with register allocation and instruction selection

2/ an abstract machine specification that permits optimisations and also happens to lower well defined code to some architectures

My working theory is that the language standardisation effort invented the latter. So when people say C was always like this, they mean since ansi c89, and there was no language before that. And when people say C used to be typed/convenient assembly language, they're referring to the language that was called C that existed in reality prior to that standards document.

The WG14 mailing list was insistent (in correspondence with me) that C was always like this, and some of them were presumably around at the time. A partial counterargument is the semi-infamous message from Dennis Ritchie copied in various places, e.g. https://www.lysator.liu.se/c/dmr-on-noalias.html

An out-of-context quote from that email, to encourage people to read said context and ideally reply here with more information on this historical assessment:

"The fundamental problem is that it is not possible to write real programs using the X3J11 definition of C. The committee has created an unreal language that no one can or will actually use."

Regards


> My working theory is that the language standardisation effort invented the latter. So when people say C was always like this, they mean since ansi c89, and there was no language before that. And when people say C used to be typed/convenient assembly language, they're referring to the language that was called C that existed in reality prior to that standards document.

But the committee has always had a lot of C compiler developers in it. The people who wrote the C89 standard were the same people who developed many of the C compilers in use before C89. The people who created the reality prior to C89 created the reality after C89. Any perception of "portable assembly" probably stemmed simply from the fact that optimizers were much less sophisticated.


> Back in the 1980s, C was expected to do what hardware does. There was no "the C abstract machine".

There was also a huge variety of compilers that were buggy and incomplete each in their own ways, often with mutually-incompatible extensions, not to mention prone to generating pretty awful code.


Indeed, those two statements are the same thing.

If you want a correct compiler it has to be correct according to a model, which means it can't handle things outside that model, and now you have "undefined behavior".

People want compilers to limit how much they transform UB, but that's not possible unless it gets defined. Which you can do, of course, but it's more limiting than it looks.


How does C do what hardware does and store things in registers when it can?


It doesn't, it is up to the compiler and optimizer to decide how to go at it.

Vector instructions, replacing library functions with compiler intrinsics, splitting structs across registers and the stack, and unrolling loops are all examples absent from the language standard.


Two ways. One is the platform ABI sometimes says specific arguments are passed in specific registers. The second is (essentially) assigning local variables offsets on a machine stack where some offsets are stored in registers.


To the best of my recollection the “abstract machine” is a C++ism that unfortunately crept into C.


From C89 document:

> 2.1.2.3 Program execution

> The semantic descriptions in this Standard describe the behavior of an abstract machine in which issues of optimization are irrelevant

[...]

> Alternatively, an implementation might perform various optimizations within each translation unit, such that the actual semantics would agree with the abstract semantics only when making function calls across translation unit boundaries. In such an implementation, at the time of each function entry and function return where the calling function and the called function are in different translation units, the values of all externally linked objects and of all objects accessible via pointers therein would agree with the abstract semantics. Furthermore, at the time of each such function entry the values of the parameters of the called function and of all objects accessible via pointers therein would agree with the abstract semantics.


The "abstract machine" is present in the first C standard, published in 1989.


If you really are targeting the x86_64 instruction set, you should be writing x86_64 instructions. Then you get exactly what the hardware does and don’t get any of those pesky compiler assumptions.

Of course you don’t get any of those pleasant optimizations either. But those optimizations are only possible because of the assumptions.


There's some optimizing x86 assemblers actually. Of course, now you have to follow their rules.


I think it is a good blog post, because it highlights an issue that I was not aware of and that I think many programmers are not. I do think I am a decent C programmer, and I spotted the strict aliasing issue immediately, but I didn't know that unaligned pointer access is UB. Because let's face it, the majority of programmers didn't read the standard, and those who did don't remember all facets.

I first learned many years ago that you should pick apart binary data by casting structs, using pointers to the middle of fields and so on. It was ubiquitous for both speed and convenience. I don't know if it was legal even in the 90s, but it was general practice - MS Office file formats from that time were just dumped structs. Then at some point I learned about pointer alignment - but it was always framed due to performance, and due to the capabilities of exotic platforms, never as a correctness issue. But it's not just important to learn what to do, but also why to do it, which is why we need more articles highlighting these issues.
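For what it's worth, the UB-free way to do the same thing is a memcpy per field, which compilers typically turn into a single (possibly unaligned) load anyway; a sketch with a made-up field offset:

    #include <stdint.h>
    #include <string.h>

    /* Read a 32-bit field at an arbitrary byte offset without casting the
     * buffer to a struct: no alignment or strict-aliasing UB involved. */
    static uint32_t read_u32(const unsigned char *buf, size_t off)
    {
        uint32_t v;
        memcpy(&v, buf + off, sizeof v);   /* usually compiles to one load */
        return v;                          /* note: still host byte order  */
    }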

(And I have to admit, I am one of these misguided people who would love a flag to turn C into "portable assembler" again. Even if it is 10x slower, and even if I had to add annotations to every damn for loop to tell the compiler that I'm not overflowing. There are just cases where understanding what you are actually doing to the hardware trumps performance.)


I think you (and most of the other commenters in this thread) misunderstand the perspective of the author. This is a tool meant to do static analysis of a C codebase. Their job is not to actually follow the standard, but identify what “common C” actually looks like. This is not the same as standard C.

There are a lot of things compilers do not optimize on even though they are technically illegal. As a result, people write code that relies on these kinds of manipulations. No, this is not your standard complaint about undefined behavior being the work of the devil, this is code that in certain places pushes the boundaries of what the compiler silently guarantees. The author’s job is to identify this, not what the standard says, because a tool that rejects any code that’s not entirely standards compliant is generally useless for any nontrivial codebase.


> When compiling and running a C program, the only thing that matters is "what the C abstract machine does". Programs that exhibit UB in the abstract machine are allowed to do "anything".

This view is alienating systems programmers. You're right that that's what the standard says, but nobody actually wants that except compiler writers trying to juice unrealistic benchmarks. In practice programmers want to alias things, they want to access unaligned memory, they want to cast objects right out of memory without constructing them, etc. And they have real reasons to do so! More narrowly defining how off the rails the compiler is allowed to go, rather than anything is a desirable objective for changing the standard.


My UB links list is getting on in years, but somehow remains vaguely relevant.

"These people simply don't understand what C programmers want": https://groups.google.com/forum/#!msg/boring-crypto/48qa1kWi...

"please don't do this, you're not producing value": http://blog.metaobject.com/2014/04/cc-osmartass.html

"Everyone is fired": http://web.archive.org/web/20160309163927/http://robertoconc...

"No sane compiler writer would ever assume it allowed the compiler to 'do anything' with your code": http://web.archive.org/web/20180525172644/http://article.gma...


Great, except no implementation of the C abstract machine actually exists. So you can't test against it. All you have are compilers that use it to justify miscompiling your code.

We need a C interpreter that intentionally implements C machine features that don't correspond to any architectural feature - i.e. pointers are (allocation provenance, offset) pairs, integer overflow panics, every pointer construction is checked, etc. If only to point out how hilariously absurd the ISO C UB rules are and how nobody actually follows them.

My personal opinion is that "undefined behavior" was a spec-writing mistake that has been rules-lawyered into absurdity. For example, signed integer overflow being UB was intended to allow compiling C to non-two's-complement machines. This was interpreted to allow inventing new misbehaviors for integer overflow instead of "do whatever the target architecture does."


> For example, signed integer overflow being UB was intended to allow compiling C to non-twos-compliment machines.

This is indeed a design mistake, but in another sense. Ordinary arithmetic ops like + or - should throw an exception on overflow (with both signed and unsigned operands), because most of the time you need ordinary math, not math modulo 2^32. For the rare cases where wrap-around is desired, there should be a function like add_and_wrap() or a special operator.
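Something in that spirit already exists as GCC/Clang builtins; a sketch, with hypothetical wrapper names:

    #include <stdint.h>
    #include <stdlib.h>

    /* Trap on overflow by default... */
    int32_t add_checked(int32_t a, int32_t b)
    {
        int32_t r;
        if (__builtin_add_overflow(a, b, &r))
            abort();                         /* stand-in for "throw" */
        return r;
    }

    /* ...and wrap explicitly only where wrap-around is actually wanted
     * (the final conversion is implementation-defined but wraps on
     * common two's-complement ABIs). */
    int32_t add_and_wrap(int32_t a, int32_t b)
    {
        return (int32_t)((uint32_t)a + (uint32_t)b);
    }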


UBSan covers each of those except provenance checking, and ASan mostly catches provenance problems even though that's not directly the goal. There are some dumb forms of UB not caught by any of the sanitizers, but most of them are.

Making your program UBSan-clean is the bare minimum you should do if you're writing C or C++ in 2023, not an absurd goal. I know it'll never happen, but I'm increasingly of the opinion that UBSan should be enabled by default.


> Great, except no implementation of the C abstract machine actually exists. So you can't test against it. All you have are compilers that use it to justify miscompiling your code.

All C compilers implement the C abstract machine. It is not used to justify miscompiling code, it is used to specify behavior of compiled code.

> We need a C interpreter

Interpreter or not is not relevant, there must be some misconception. Any behavior you can implement with an interpreter can be implemented with compiled code. E.g., add a test and branch after each integer operation if you want to crash on overflow.

> that intentionally implements C machine features that don't correspond to any architectural feature - i.e. pointers are (allocation provenance, offset) pairs, integer overflow panics, every pointer construction is checked, etc.

As others have mentioned there are static and dynamic checkers (sanitizers) that test for such things nowadays. In compiled, not interpreted code, mind you.

> If only to point out how hilariously absurd the ISO C UB rules are and how nobody actually follows them.

It's not that bad.

> My personal opinion is that "undefined behavior" was a spec writing mistake that has been rules-lawyered into absurdity. For example, signed integer overflow being UB was intended to allow compiling C to non-twos-compliment machines. This was interpreted to allow inventing new misbehaviors for integer overflow instead of "do whatever the target architecture does."

The spec uses implementation-defined behavior for that. Although you can argue that they went the wrong way on some choices -- signed integer overflow "depends on the machine at hand" in the first K&R, which you could argue makes it reasonable to call it implementation-specific and enumerate the behaviors of supported machines.

C had a long history with hardware manufacturers, compiler writers, and software developers though, so the standard can never universally please everybody. The purpose of standardization was never to make something that was easiest for software development, ignoring the other considerations. So a decision is not an example of design-by-committee gone wrong just because it happened to be worse for software writers (e.g., choosing to make overflow undefined instead of implementation dependent). You would have to know why such a decision was made.


The company behind the blog post does sell a C interpreter that checks for all undefined behaviors (with provenance and offsets).


The general problem with this argument is that “do what the hardware does” is actually not easy to reason about. The end results of this typically are impossible to grok.


Not if possible implementations are specified, and especially if you can target machines of particular behavior. Which is of course how you can write endian and bit size portable code.


And one of the anythings permitted would be to behave in a documented manner characteristic of the target environment. The program is after all almost certainly being built to run on an actual machine; if you know what that actual machine does, it would sometimes be useful to be able to take advantage of that. We might not be able to demand this on the basis that the standard requires it, but as a quality of implementation issue I think it a reasonable request.

This is such an obvious thing to do that I'm surprised the C standard doesn't include wording along those lines to accommodate it. But I suppose even if it did, people would just ignore it.


The problem is that what the machine does isn't necessarily consistent. If you're using old-as-the-green-hills integer instructions then yes, the CPU supports unaligned access. If you want to benefit from the speedup afforded by the latest vector instructions, now suddenly it doesn't.

Also, to be fair, GCC does appear to back off the optimisations when dealing with, for example, a struct with the packed attribute.
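For example (GCC/Clang attribute syntax; the struct layout is made up):

    #include <stdint.h>

    /* With the packed attribute the compiler knows `length` sits at byte
     * offset 1 and emits an access that is safe for unaligned addresses,
     * instead of assuming natural 4-byte alignment. */
    struct __attribute__((packed)) wire_header {
        uint8_t  type;
        uint32_t length;
    };

    uint32_t get_length(const struct wire_header *h)
    {
        return h->length;
    }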


That depends on which vector instructions you use.


Honestly, I think you are both incorrect.

C has always had a concept of implementation defined behavior, and unaligned memory accesses used to be defined to work correctly on x86.

Intel added instructions that can’t handle unaligned access, so they broke that contract. I’d argue that it is an instruction set architecture bug.

Alternatively, Intel could argue that compilers shouldn’t emit vector instructions unless they can statically prove the pointer is aligned. That’s not feasible in general for languages like C/C++, so that’s a pretty weak defense of having the processor pay the overhead of supporting unaligned access on some, but not all, paths.


> C has always had a concept of implementation defined behavior, and unaligned memory accesses used to be defined to work correctly on x86.

There are a bunch of misconceptions here:

- unaligned loads were never implementation defined, they are undefined;

- even if they were implementation defined, this would give the compiler the choice of how to define them, not the instruction set;

- unaligned memory accesses on x86 for non-vector registers still work fine, so old instructions were not impacted and there's no bug. It's just that the expectations were not fulfilled for the new extension of those instructions.


Note: SIMD on x86 has unaligned instructions that used to be much slower (decoded differently) than their aligned counterparts.

For example, on the Pentium III and Core 2, the unaligned instructions took twice as many cycles to execute. On modern x86 family processors, it’s the same cycle count either way. The only perf penalty one should account for is crossing of cache lines, generally a much smaller problem.
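In intrinsic form, the two flavours being discussed (a sketch; x86 with SSE2):

    #include <immintrin.h>

    /* movdqa: requires a 16-byte-aligned address, faults otherwise. */
    __m128i load_aligned(const void *p)   { return _mm_load_si128((const __m128i *)p); }

    /* movdqu: accepts any address; on modern cores it costs the same as
     * the aligned form unless the access straddles a cache line. */
    __m128i load_unaligned(const void *p) { return _mm_loadu_si128((const __m128i *)p); }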


Here's a link to the final C89 draft spec (the ratified spec is paywalled):

https://www.open-std.org/JTC1/sc22/wg14/www/docs/n1256.pdf

From section 6.7.2.1, semantics #10:

> The alignment of the addressable storage unit is unspecified.

This is for struct field access, but it clearly implies the compiler can choose to use unaligned struct fields. Also, the sizes of the integer types are all implementation-defined.

then #12:

> Each non-bit-field member of a structure or union object is aligned in an implementation- defined manner appropriate to its type.

Alignment is defined as:

> requirement that objects of a particular type be located on storage boundaries with addresses that are particular multiples of a byte address

It doesn't say which multiple. 1 is a multiple. (So is 0.5, just in case the compiler wants to go nuts with arcane code gen.) The spec even allows chars to be 7 bits. I didn't bother looking up the definition of byte in the spec for those architectures. (7 bits? 8 bits?)

In section 6.2.5, they talk about implementation-defined restrictions on integer types + alignment requirements:

> For each of the signed integer types, there is a corresponding (but different) unsigned integer type (designated with the keyword unsigned) that uses the same amount of storage (including sign information) and has the same alignment requirements.

So, the alignment of integers has to be the same for signed + unsigned types. That still doesn't say byte-aligned integers are disallowed.

Later:

> An integer may be converted to any pointer type. Except as previously specified, the result is implementation-defined, might not be correctly aligned,

Again, the alignment behavior is clearly implementation-defined.

I can't find a definition of implementation in the spec, but it clearly includes the compiler, standard library, and operating system. There is this quote:

> For implementations with a signed zero (including all IEC60559 implementations)

which, according to the IEC60559 abstract "An implementation of a floating-point system conforming to this standard may be realized entirely in software, entirely in hardware, or in any combination of software and hardware." I doubt they were trying to constrain floating point to be done in software by compilers, so it's pretty clear they intended to incorporate the physical hardware in the definition of the "implementation".

Later, they say:

> ...is defined if and only if the implementation supports the floating-point exception

which was definitely in the realm of hardware support back in 1989. Some later sections says that some macros (such as for FMA) are defined iff the implementation implements the primitive in hardware, and not just software.


Undefined and implementation defined are different in C. The number of bits in an int is implementation defined. Unaligned access is undefined.


See sibling comment. The alignment requirements are implementation defined, and any multiple is legal. 1 byte is definitely a legal multiple.


Unaligned memory access is UB period, no ifs no buts.


Loads of architectures can't do misaligned memory access. Even x86 has problems when variables span cache lines. The compiler usually deals with this for the programmer, e.g. by rounding the address down then doing multiple operations and splicing the result together.


Most modern architectures that target high performance implementations can do unaligned accesses, even ones crossing page boundaries.

Less common is support for atomic RMW access to unaligned location. x86 does support it but crossing a cache line causes the operation to be very slow.


Unaligned memory accesses are undefined behavior in C. If you're writing C, you should be abiding by C rules. "Used to work correctly" is more guesswork and ignorance than "abiding by C rules". In C, playing fast&loose with definitions hurts, BAD.

Frankly, I'd be ashamed to write this blog post since the only thing it accomplishes is exposing its writers as not understanding the very thing they're signaling expertise on.


What makes you think they don't understand it? They acknowledge that it is UB. I read them as realistic, since they know that people rely on C compilers working a certain way. They even wrote an interpreter that detects UB: https://github.com/TrustInSoft/tis-interpreter

I understand why people like the compiler being able to leverage UB. I suspect this philosophy actually makes Trust-In-Soft more money: You could argue that if there was no UB, there would be no need for the tis-interpreter.

So isn't it in fact quite selfless that they encourage the world to optimize a bit less (spending more money on 'compute'), while standing to profit from the unintended behaviour they'd otherwise be contracted to help debug?


I made a comment a few levels up to a sibling where I point out the parts of the C89 spec that are relevant.

Alignment requirements for integers are implementation defined, not undefined behavior. On x86, the implementation used to define the alignment requirement to be one byte.

In fact, if you've done enough hardware register and bus-level (e.g., PCIe) programming, you'll quickly realize that there are all sorts of other exotic implementation-defined alignment constraints on modern systems.


Pretty much everything you wrote in that comment is wrong since you're interpreting the spec in a way that's clearly not what the spec describes (e.g. the spec is talking about alignment requirements for conversions, but you generalize it to "alignment requirements" which is dead wrong).


> C has always had a concept of implementation defined behavior,

Surely only after standardization tho?


From the POV of an HPC cluster user, when using SLURM's `srun` or similar to schedule a job, this now allows you to use `srun --container=<your container>`, and it will start your app on each node inside the container and make sure MPI, GPUs, etc., all work.

If you don't know anything about containers, it probably will be a bit hard to imagine what this buys you, but don't worry, as more clusters start moving towards this model, you'll have to learn about containers at some point.

From the POV of the HPC cluster, it means that the `module` system can be replaced with containers, which can significantly lower the maintenance overhead of the cluster. In a sense, it turns HPC cluster users into HPC cluster maintainers (who have to build their own images, prepare their own environment, etc.).


In which category is Intel not being outperformed by competitors?


Money. Intel's operating income is about 66 times bigger than AMD's for example.

Intel will probably never get a run like the last decade, but they have enough market share and money to get themselves back on track.


But AMD isn't the right comparator:

Q1 2021 Revenue Guidance

Intel $17.5bn - 12% YOY

TSMC $13bn + 20% YOY


> can you compile for ARM and move the binary around as easily as you can for x86?

Yes.


And then every time you change a view option at run-time, your `.htoprc` gets modified.....


Yeah it’s annoying, but I find I rarely change view options.


The article touches on this, since it's an important realization. In C, however, if the integers are not unsigned, the add/sub version exhibits UB on overflow, because C unnecessarily ties signedness to overflow behavior (UB vs. two's-complement wrapping, etc.).

Not that this is important, but it could hint at why this trick is more often shown with XOR than with add/sub.
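For reference, the two variants being compared, written with unsigned integers so the add/sub form has no overflow UB:

    #include <stdint.h>

    /* Swap without a temporary. Both versions lose the value if a and b
     * point to the same object, so callers must ensure a != b. */
    void swap_addsub(uint32_t *a, uint32_t *b) { *a += *b; *b = *a - *b; *a -= *b; }
    void swap_xor(uint32_t *a, uint32_t *b)    { *a ^= *b; *b ^= *a; *a ^= *b; }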


If this happens I might replace my macbook air mid 2012. Magsafe, no touchbar, hopefully 14'' with 32GB of RAM and M1X.

Also, I hope for a decent webcam, and the ability to support at least 2 external displays.


The new machines are great. I used a 2011 Air forever, having a high-DPI air was on my list, so when the new ones dropped, I ordered an i7 in March... which I recently sold on Swappa since I had to upgrade to an M1 :)

But of course, no magsafe, and only 16GB of RAM, so if it's your daily workhorse I can see waiting for the 14" MBP with more RAM and beefier chips.


The main deal breaker for me with the M1 is that it only supports 1 external display, and that the webcam isn't great.

I hope they fix those.


When developing ML models, you rarely train "just one".

The article mentions that they explored a not-so-large hyper-parameter space (i.e. they trained multiple models with different parameters each).

It would be interesting to know how long does the whole process takes on the M1 vs the V100.

For the small models covered in the article, I'd guess that the V100 can train them all concurrently using MPS (multi-process service: multiple processes can concurrently use the GPU).

In particular it would be interesting to know, whether the V100 trains all models in the same time that it trains one, and whether the M1 does the same, or whether the M1 takes N times more time to train N models.

This could paint a completely different picture, particularly for the user perspective. When I go for lunch, coffee, or home, I usually spawn jobs training a large number of models, such that when I get back, all these models are trained.

I only start training a small number of models at the latter phases of development, when I have already explored a large part of the model space.

---

To make the analogy, what this article is doing is something similar to benchmarking a 64 core CPU against a 1 core CPU using a single threaded benchmark. The 64 core CPU happens to be slightly beefier and faster than the 1 core CPU, but it is more expensive and consumes more power because... it has 64x more cores. So to put things in perspective, it would make sense to also show a benchmark that can use 64x cores, which is the reason somebody would buy a 64-core CPU, and see how the single-core one compares (typically 64x slower).

---

To me, the only news here is that Apple GPU cores are not very far behind NVIDIA's cores for ML training, but there is much more to a GPGPU than just the perf that you get for small models in a small number of cores. Apple would still need to (1) catch up, and (2) extremely scale up their design. They probably can do both if they set their eyes on it. Exciting times.


The low gpu utilization rate in the first graph is kind of a tell... Seems like the M1 is a little bit worse than 40% of a v100?


If that's the case that would be very good. One can buy a lot of M1 Mac minis for the price of a V100.


Well, you can also get many RTX 3080's (~$700) for the price of a V100 (~$6000), and the RTX 3080's are faster: https://browser.geekbench.com/cuda-benchmarks

As I understand it, the V100 price is mostly artificial datacenter markup, enabled by lack of competition...


> When developing ML models, you rarely train "just one".

Depends on your field. In Reinforcement Learning you often really do train just one, at least on the same data set (since the data set often is dynamically generated based on the behavior of the previous iteration of the model).


Even in reinforcement learning you can train multiple models with different data sets concurrently and combine them for the next iteration.


Do you really train more than one model at the same time on a single GPU? In my experience that's pretty unusual.

I completely agree with your conclusion here.


Depends on the model size, but if the model is small enough that I actually do training on a PCIe board, I do. I partition an A100 in 8 and train 8 models at a time, or just use MPS on a V100 board. The bigger A100 boards can fit multiple copies of the models that fit in a single V100.

Also I tend to do this initially, when I am exploring the hyperparameter space, for which I tend to use smaller but more models.

I find that using big models initially is just a waste of time. You want to try many things as quickly as possible.


I found that training multiple models on the same GPU hits other bottlenecks (mainly memory capacity/bandwidth) fast. I tend to train one model per GPU and just scale the number of computers. Also, if nothing else, we tend to push the models to fill the GPU memory.


Memory became less of an issue for me with V100, and isn't really an issue with A100, at least when quickly iterating for newer models, when the sizes are still relatively small.


> We don’t have apples-to-apples benchmarks

We do: https://mlperf.org/

Just run their benchmarks. Submitting your results there is a bit more complicated, because all results there are "verified" by independent entities.

If you feel like your AI use case is not well represented by any of the MLPerf benchmarks, open a discussion thread about it, propose a new benchmark, etc.

The set of benchmarks there increases all the time to cover new applications. For example, on top of the MLPerf Training and MLPerf Inference benchmark suites, we now have a new MLPerf HPC suite to capture ML of very large models.


Those benchmarks are absurdly tuned to the hardware. Just look at the result Google gets with BERT on V100s vs the result NVIDIA gets with V100s. It's an interesting measurement of what experts can achieve when they modify their code to run on the hardware they understand well, but it isn't useful beyond that.


> Just look at the result Google gets with BERT on V100s vs the result NVIDIA gets with V100s.

These benchmarks measure the combination of hardware+software to solve a problem.

Google and NVIDIA are using the same hardware, but their software implementation is different.

---

The reason mlperf.org exists is to have a meaningful set of relevant practical ML problems that can be used to compare and improve hardware and software for ML.

For any piece of hardware, you can create an ML benchmark that's irrelevant in practice, but perform much better on that hardware than the competition. That's what we used to have before mlperf.org was a thing.

We shouldn't go back there.


> on top of the MLPerf Training and MLPerf Inference benchmark suites, we now have a new MLPerf HPC suite to capture ML of very large models.

I think the challenge is selecting the tests that best represent the typical ML/DL use cases for the M1 and comparing it to an alternative such as the V100 using a common toolchain like Tensorflow. One of the problems that I see is that the optimizer/codegen of the toolchain is a key component; the M1 has both GPU and Neural Engine and we don’t know which accelerator is targeted or even possibly both. Should we benchmark ML Create on M1 vs A14 or A12X? Perhaps it is my ignorance but I don’t think we are at a point where our existing benchmarks can be applied meaningfully with the M1 but I’m sure we will get there soon.


> The challenge is selecting the tests that best represent the typical ML/DL use cases for the M1 and comparing it to an alternative such as the V100 using a common toolchain like Tensorflow.

The benchmarks there are actual applications of ML, that people use to solve real world problems. To get a benchmark accepted you need to argue and convince people that the problem the benchmark solves must be solved by a lot of people, and that doing so burns enough cycles worldwide to be helpful to design ML hardware and software.

The hardware and software then gets developed to make solving these problems fast, which then in turns make real-world applications of ML fast.

Suggesting that the M1 is a solution, and that now we just need to find a good problem that this solution solves well and add it there as a benchmark, is the opposite of how mlperf works; hardware vendors suggesting this kind of thing is the reason mlperf exists. We already have common ML problems that a lot of people need to solve. Either the M1 is good at those or it isn't. If it isn't, it should become better at those. Being better at problems people don't want / need to solve does not help anybody.


> The tariff will have a significant effect on production of the Airbus A320 in Mobile, Alabama, something which Airbus has claimed will only hurt US workers.

How will this hurt US workers?

Sounds like this is suggesting that either they will pay the workers less to compensate, or maybe even move the factory elsewhere to avoid the tariffs (moving the factory to Mexico might mean the tariffs disappear until new tariffs against shipping planes from Mexico get approved, which might take some time).


That’s assuming that all the planes manufactured in Alabama are not exported outside the USA. E.g. to airlines registered in Mexico, Canada, S America. I imagine that at least some are exported so moving manufacturing outside the USA would definitely hurt employment.


IIUC the goal of the Alabama factory is to work around the existing tariffs on shipping planes to the US. Instead of shipping the planes, Airbus ships the parts and assembles them in the US, thus avoiding the plane tariffs. So now there are tariffs on "plane parts".

Maybe they'll just stop doing this, close the factory, and ship whole planes from the EU again. Depends what the difference is between the plane and plane-parts tariffs, and what's more worth it.

