Like, I get that leaf functions with truly huge computational cores are a thing that would benefit from more ISA-visible registers, but... don't we have GPUs for that now? And TPUs? NPUs? Whatever those things are called?
It's up to the compiler to decide how many registers it needs to preserve at a call. It's also up to the compiler to decide which registers shall be the call-clobbered ones. "None" is a valid choice here, if you wish.
An easy way to see that is that the system with more registers can always use the same register allocation as the one with fewer, ignoring the extra registers, if that's profitable (i.e. it's not forced into using extra caller-saved registers if it doesn't want to).
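For what it's worth, GCC already exposes this knob. A hedged example (flag names are from GCC's code-generation options; register naming is target-specific, and code built this way can't safely cross a call boundary into code assuming the standard ABI's clobber set):

    # Treat r11 (normally call-clobbered on x86-64 SysV) as call-saved,
    # and rbx (normally call-saved) as call-clobbered, for this one
    # translation unit:
    gcc -O2 -fcall-saved-r11 -fcall-used-rbx -c hot_code.c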
On a 16-register machine with 9 call-clobbered registers and 7 call-invariant ones (one of which is the stack pointer), we put 6 temporaries into call-invariant registers (so there are 6 spills in the prologue of this big function) and another 9 into the call-clobbered registers; 2 of those 9 are the helper function's arguments, but the 7 other temporaries have to be spilled to survive the call. And the remaining 25 temporaries live on the stack in the first place.
If we instead take a machine with 31 registers, 19 call-clobbered and 12 call-invariant (one of which is the stack pointer), we can put 11 temporaries into call-invariant registers (so there are 11 spills in the prologue of this big function) and another 19 into the call-clobbered registers; 2 of those 19 are the helper function's arguments, so the 17 other temporaries have to be spilled to survive the call. And the remaining 10 temporaries live on the stack in the first place.
So, at least to me, there seems to be more spilling/reloading with the larger register file, whether you count the pre-emptive prologue spills or the on-demand spills at the call site.
The actual counterargument here would be that in either case the temporaries have to end up on the stack at some point anyway, so you'd need to look at the total number of loads/stores in the vicinity of the call site in general.
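To make the arithmetic above concrete, here's roughly the shape of function being allocated: a C sketch, with the 40-temporaries count taken straight from the example (the helper and the loop bodies are made up):

    /* ~40 values are computed before the call and all stay live after
     * it; the two register layouts above differ only in how many of
     * these can sit in registers vs. on the stack across helper(). */
    long helper(long x, long y) { return x + y; }  /* stand-in body */

    long big_function(const long *a) {
        long t[40], sum;
        for (int i = 0; i < 40; i++)
            t[i] = a[i] * a[i] + i;        /* 40 live temporaries */
        sum = helper(t[0], t[1]);          /* 2 of them become arguments */
        for (int i = 0; i < 40; i++)
            sum += t[i] * t[(i + 7) % 40]; /* all 40 still needed here */
        return sum;
    }

The compiler is of course free to keep t[] entirely in memory; the point is just that every t[i] is live across the call, so each must either occupy a call-invariant register, be spilled around the call, or live on the stack outright.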
- XSAVE / XRSTOR
- XSAVEOPT / XRSTOR
- XSAVEC / XRSTOR
- XSAVES / XRSTORS
That would be a major headache — even if current instruction encodings were somehow preserved.
It’s not just about compilers and assemblers. Every single system implementing virtualization has a software emulation of the instruction set - easily 10k lines of very dense code/tables.
Presumably this is gated behind cpuid and/or model-specific registers, so it would tend not to be exposed by virtualization software that doesn't support it. But yeah, if you decode and process instructions, it's more things to understand. That's a cost, but presumably the benefit outweighs the cost, at least in some applications.
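A minimal detection sketch in C, assuming GCC/Clang's <cpuid.h> wrapper; the APX_F bit position (leaf 7, subleaf 1, EDX bit 21) is taken from Intel's APX documentation and worth double-checking against the current SDM:

    #include <stdio.h>
    #include <cpuid.h>

    int main(void) {
        unsigned eax, ebx, ecx, edx;
        /* CPUID leaf 7, subleaf 1 holds the extended feature flags;
         * a hypervisor that doesn't support APX simply leaves the bit
         * clear in the CPUID view it presents to the guest. */
        if (__get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx))
            printf("APX_F: %s\n", (edx & (1u << 21)) ? "yes" : "no");
        else
            printf("CPUID leaf 7 not available\n");
        return 0;
    }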
It's the same path as any x86 extension. In the beginning only specialty software uses it; at some point libraries that have specialized code paths based on processor features will support it; if it works well it becomes standard on new processors, and eventually most software requires it. Or it doesn't work out and it gets dropped from future processors.
The longer prefix adds extra functionality such as a third operand (e.g. add r8, r15, r16), suppressing flag updates, and access to a few new instructions (push2, pop2, ccmp, ctest, cfcmov).
Data registers could be bigger. There's no reason `sizeof(int)` has to equal `sizeof(intptr_t)`; many older architectures had separate address and data register sizes. SIMD registers are already a case of that on x86_64.
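In fact, typical 64-bit targets already split the two; a quick check (the comment assumes a common LP64 target):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        /* On LP64 targets (Linux/macOS x86_64) int stays 32-bit while
         * pointers are 64-bit, so these already differ: 4 vs 8. */
        printf("sizeof(int)      = %zu\n", sizeof(int));
        printf("sizeof(intptr_t) = %zu\n", sizeof(intptr_t));
        return 0;
    }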
* Four-bit processors can only count to 15, or from -8 to 7, so their use has been pretty limited. It is very difficult for them to do any math, and they've mostly been used for state machines.
* Eight-bit processors can count to 255, or from -128 to 127, so much more useful math can run in a single instruction (wider math has to be pieced together from several narrow operations; see the sketch after this list), and they can directly address hundreds of bytes of RAM, which is low enough that an entire program still often requires paging, but at least a routine can reasonably fit in that range. Very small embedded systems still use 8-bit processors.
* Sixteen-bit processors can count to 65,535, or from -32,768 to 32,767, allowing far more math to work in a single instruction, and a computer can have tens of kilobytes of RAM or ROM without any paging, which was a small but not uncommon amount of memory when sixteen-bit processors initially gained popularity.
* Thirty-two-bit processors can count to 4,294,967,295, or from -2,147,483,648 to 2,147,483,647, so it's rare to ever need multiple instructions for a single math operation, and a computer can address four gigabytes of RAM, which was far more than enough when thirty-two-bit processors initially gained popularity. The need for more bits in general-purpose computing plateaus at this point.
* Sixty-four-bit processors can count to 18,446,744,073,709,551,615, or from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807, so only special-case calculations need multiple instructions for a single math operation, and a computer can address up to sixteen exabytes of RAM, which is thousands of times more than current supercomputers use. There are so many bits that programs only rarely perform 64-bit operations, and 64-bit instructions are often performing single-instruction-multiple-data operations that use multiple 8-, 16-, or 32-bit numbers stored in a single register.
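To see what "multiple instructions for a single math operation" means in practice, here's a C sketch of how a narrow machine synthesizes wider arithmetic, doing a 16-bit add the way an 8-bit CPU would (real 8-bit ISAs use an add-with-carry instruction for the second step):

    #include <stdint.h>
    #include <stdio.h>

    /* Add two 16-bit numbers using only 8-bit operations:
     * low bytes first, then high bytes plus the carry-out. */
    uint16_t add16_via_8bit(uint16_t a, uint16_t b) {
        uint8_t lo    = (uint8_t)a + (uint8_t)b;
        uint8_t carry = lo < (uint8_t)a;           /* low add wrapped? */
        uint8_t hi    = (uint8_t)(a >> 8) + (uint8_t)(b >> 8) + carry;
        return ((uint16_t)hi << 8) | lo;
    }

    int main(void) {
        printf("%u\n", add16_via_8bit(40000, 20000)); /* prints 60000 */
        return 0;
    }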
We're already at the point where we don't gain a lot from true 64-bit operations, with the widest registers more often used by vector instructions that store multiple numbers in a single register, so a 128-bit processor is kind of pointless. Sure, we'll keep growing the registers specific to vector instructions, but those are already 512 bits wide on the latest processors, and we don't call them 512-bit processors.
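For example, with AVX-512 intrinsics one 512-bit register holds sixteen 32-bit lanes, and a single instruction operates on all of them at once; a minimal sketch (needs a CPU and compiler flag supporting AVX-512F, e.g. -mavx512f):

    #include <immintrin.h>

    /* Sixteen 32-bit additions in one instruction (vpaddd on zmm
     * registers); "512-bit" here doesn't mean 512-bit math. */
    __m512i add_sixteen_lanes(__m512i a, __m512i b) {
        return _mm512_add_epi32(a, b);
    }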
Granted, before 64-bit consumer processors existed, no one would have conceived that simultaneously running a few chat interfaces, like Slack and Discord, while browsing a news web page, could fill up more RAM than a 32-bit processor can address, so software using exabytes of RAM will likely happen as soon as we can manufacture it, thanks to Wirth's Law (https://en.wikipedia.org/wiki/Wirth%27s_law), but until then there's no likely path to 128-bit consumer processors.