I think in the example the parent gave, `arr[3]` is one past the end of the 3-element array, which is where `i` might reside, so the write could change its value.
reply
It's clear in the AST that there is undefined behaviour and it is malformed code. It is not valid C code, so what the compiler chooses to do with it is not defined by the language.
reply
Note that if you change the code to this you have the same issue:

    int g(int n) {
        int arr[3], i = 0;
        arr[n] = 5;
        return i;
    }
Without "exploiting UB" it's incorrect to optimize this to "return 0", because of the possibility that i was allocated right after arr and n == 3.
reply
My thinking disagrees with yours, but I can't say I have fully made up my mind yet. To me a compiler that is not "exploiting UB" has several valid choices about how to compile this. i may be stored in memory, in a register or may be deleted entirely as it's known at compile time that it will have value 0. The store to arr[n] may go through or it may be deleted as it's a dead store.

You may say I'm "exploiting ub" when making these deductions but I would disagree. My argument is not of the form "x is ub so I can do whatever I want".

To elaborate on the example, if arr was volatile then I would expect the write to always go through. And if i was volatile then I would expect i to always be read. However it's still not guaranteed that i is stored immediately after arr, as the compiler has some discretion about where to place variables afaik. But if i is indeed put immediately after, then the function should indeed return 5 for n=3. For n>3 it should either return 0 (if writing to unused stack memory), page fault (for small n outside of the valid stack space), or stomp on random memory (for unlucky and large n). For negative n, many bad things are likely to happen.
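
For what it's worth, the "i sits immediately after arr" layout can be written in a form that is defined C, by putting both in one struct and writing through the struct's bytes. A minimal sketch: `struct frame` and `g_flat` are made-up names, and the no-padding layout is an assumption that is checked at compile time (C11 or later):

```c
#include <assert.h>  /* static_assert */
#include <stddef.h>  /* offsetof, size_t */
#include <string.h>  /* memcpy */

/* Hypothetical "stack frame" with an explicit layout, instead of
   leaving the placement of arr and i to the compiler. */
struct frame {
    int arr[3];
    int i;
};

/* Assumption: no padding between arr and i (true on common ABIs). */
static_assert(offsetof(struct frame, i) == 3 * sizeof(int),
              "i is expected to sit directly after arr");

/* Defined for 0 <= n <= 3; returns -1 for any other n. */
int g_flat(int n) {
    struct frame f = { {0, 0, 0}, 0 };
    const int five = 5;

    if (n < 0 || n > 3)
        return -1;

    /* Byte-level write into the struct object. For n == 3 the
       destination is exactly the storage of f.i. */
    memcpy((unsigned char *)&f + (size_t)n * sizeof(int), &five, sizeof five);
    return f.i;
}
```

Here the compiler has to honor the overlap, because the write goes through a pointer into the whole struct: `g_flat(3)` gives 5 and `g_flat(0)` gives 0. The original `g` makes no such promise about where `i` lives.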

Edit: I think I mixed up which way the stack grows but yeah.

reply
> But if i is indeed put immediately after, then the function should indeed return 5 for n=3.

That's not how compilers work. The optimization changing `return i;` into `return 0;` happens long before the compiler determines the stack layout.

In this case, because `return i;` was the only use of `i`, the optimization allows deleting the variable `i` altogether, so it doesn't end up anywhere on the stack. This creates a situation where the optimization only looks valid in the simple "flat memory model" because it was performed; if the variable `i` hadn't been optimized out, it would have been placed directly after `arr` (at least in this case: https://godbolt.org/z/df4dhzT5a), so the optimization would have been invalid.

There's no infrastructure in any compiler that I know of that would track "an optimization assumed arr[3] does not alias i, so a later stage must take care not to place i at that specific point on the stack". Indeed, if the array index were a runtime value, the compiler would be prevented from ever spilling to the stack any variable that was involved in any optimizations.

So I think your general idea "the allowable behaviors of an out-of-bounds write are specified by the possible actual behaviors in a simple flat memory model for various different stack layouts" could work as a mathematical model as an alternative to UB-based specifications, but it would end up not being workable for actual optimizing compiler implementations -- unless the compiler could guarantee that a variable can always stay in a register and will never be spilled (how would the compiler do that for function calls?), it'd have to essentially treat all variables as potentially modified by basically any store-via-pointer, which would essentially disable all optimizations.
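
To make the "potentially modified by any store-via-pointer" point concrete, here is a small sketch in fully defined C (the function names are made up for illustration). In the first function the local's address is never taken, so a compiler may assume the store cannot touch it. In the second, the two pointers may alias, so the value has to be reloaded:

```c
/* i's address is never taken, so no valid pointer can refer to it. */
int store_then_read_local(int *p) {
    int i = 0;
    *p = 5;     /* cannot legally modify i */
    return i;   /* a compiler may fold this to "return 0" */
}

/* q may point to the same int as p, so the store through p
   can change what *q reads back. */
int store_then_read_pointee(int *p, int *q) {
    *q = 0;
    *p = 5;
    return *q;  /* 5 if p == q, otherwise 0 */
}
```

Under the flat-memory proposal, every local would have to be treated like `*q` in the second function, since any store through a pointer might land on it.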

reply
If we consider writing out of bounds to be legal, we make it impossible to reason about the behavior of programs.
reply
Which is why I, like many other compiler developers, am inherently skeptical whenever anyone says "just stop exploiting undefined behavior".
reply
I'm not a compiler developer, but I'm at least as skeptical as you, because there is no sign that the "just stop exploiting UB" people actually want any specific semantics. IMO they want Do What I Mean, which isn't a realizable language feature.

If you could somehow "stop exploiting UB", they'd just be angry: either that you're still exploiting an actual language requirement they don't like (and so have decided ought to be excluded), or that you followed the rules too literally when obviously the thing they meant ought to happen, even though that's not what they actually wrote. It's lose-lose for compiler vendors.

reply
I am one of the "stop exploiting UB" camp. [1]

I agree that some of us are unreasonable, but I do recognize that DWIM is not feasible.

I just want compilers to treat UB the same as unspecified behavior, which cannot be assumed away.

[1]: https://gavinhoward.com/2023/08/the-scourge-of-00ub/

reply
> I just want compilers to treat UB the same as unspecified behavior, which cannot be assumed away.

Unspecified behavior is defined as the "use of an unspecified value, or other behavior where this International Standard provides two or more possibilities and imposes no further requirements on which is chosen in any instance".
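
For contrast, a sketch of what ordinary unspecified behavior looks like: the order in which function arguments are evaluated. The standard allows either order, but each one is a well-defined outcome. The names `mark`, `add`, and `sum_marked` are made up for illustration:

```c
int order[2];  /* records which argument was evaluated first */
int calls;

static int mark(int id) {
    order[calls++] = id;
    return id;
}

static int add(int a, int b) {
    return a + b;
}

/* The two calls to mark() may run in either order, but they cannot
   overlap, so the result is always 3 and order[] is either {1, 2}
   or {2, 1}. */
int sum_marked(void) {
    calls = 0;
    return add(mark(1), mark(2));
}
```

The compiler picks one of a small, enumerable set of outcomes, and the program can still be reasoned about under each of them.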

Which (two or more) possibilities should the standard provide for out-of-bounds writes? Note that "do what the hardware does" wouldn't be a good specification because it would either (a) disable all optimizations or (b) be indistinguishable from undefined behavior.

reply
You mention that "Note that those surprised programmers are actually Rust compiler authors" but I can't figure out which of the many links is to some "surprised programmers" who are actually rustc authors, and so I don't even know if you're right.

Rust's safe subset doesn't have any UB, but the unsafe Rust can of course cause UB very easily, because the rules in Rust are extremely strict and only the safe Rust gets to have the compiler ensure it doesn't break the rules. So it seems weird for people who work on the compiler guts to be "surprised".

reply
I'm a Rust compiler author, and I'm fully in favor of "UB exploitation". In fact, LLVM should be doing more of it. LLVM shouldn't be holding back optimizations in memory-safe languages for edge cases that don't really matter in practice.
reply
I don't see any surprised compiler authors in that thread. The reporter immediately suggests the correct underlying reason for the bug and another compiler author even says that they wondered how long it would take for someone to notice this.

Even if you read any surprise into their messages they wouldn't be surprised that C does something completely unreasonable, they would be surprised that LLVM does something unreasonable (by default).

reply
Wait, that's not even linked in your post AFAICT. It's also about an LLVM bug and not in fact exploiting UB.

"LLVM shouldn't miscompile programs" is uncontroversial, but claiming that these miscompilations are somehow "Exploiting Undefined Behaviour" is either incompetent or an attempt to sell your position as something it isn't.

reply
There is also a completely different scenario where out-of-bounds writes aren't undefined behavior anymore: when you've manually defined the arrays in an assembly source file and exported their symbols. In that situation, you know what's before and after the array, so doing pointer math into an adjacent area has a well-known effect.
reply