RISC-V will get there, eventually.
I remember that ARM started as a speed demon with conscious power consumption, then was surpassed by x86s and PPCs on desktops and moved to embedded, where it shone by being very frugal with power, only to now be leaving the embedded space with implementations optimised for speed more than power.
1) https://github.com/llvm/llvm-project/issues/150263
2) https://github.com/llvm/llvm-project/issues/141488
Another example is hard-coded 4 KiB page size which effectively kneecaps ISA when compared against ARM.
All of these extensions are mandatory in the RVA22 and RVA23 profiles and so will be implemented on any up to date RISC-V core. It's definitely worth setting your compiler target appropriately before making comparisons.
The RISC-V ecosystem being handicapped by backwards compatibility does not make sense at this point.
Every new RISC-V board is going to be RVA23 capable. Now is the time to draw a line in the sand.
Nope. See https://github.com/llvm/llvm-project/issues/110454 which was linked in the first issue. The spec authors have managed to made a mess even here.
Now they want to introduce yet another (sic!) extension Oilsm... It maaaaaay become part of RVA30, so in the best case scenario it will be decades before we will be able to rely on it widely (especially considering that RVA23 is likely to become heavily entrenched as "the default").
IMO the spec authors should've mandated that the base load/store instructions work only with aligned pointers and introduced misaligned instructions in a separate early extension. (After all, passing a misaligned pointer where your code does not expect it is a correctness issue.) But I would've been fine as well if they mandated that misaligned pointers should be always accepted. Instead we have to deal the terrible middle ground.
>atomic memory operations are made mandatory in Ziccamoa
In other words, forget about potential performance advantages of load-link/store-conditional instructions. `compare_exchange` and `compare_exchange_weak` will always compile into the same instructions.
And I guess you are fine with the page size part. I know there are huge-page-like proposals, but they do not resolve the fundamental issue.
I have other minor performance-related nits such `seed` CSR being allowed to produce poor quality entropy which means that we have bring a whole CSPRNG if we want to generate a cryptographic key or nonce on a low-powered micro-controller.
By no means I consider myself a RISC-V expert, if anything my familiarity with the ISA as a systems language programmer is quite shallow, but the number of accumulated disappointments even from such shallow familiarity has cooled my enthusiasm for RISC-V quite significantly.
This is primarily because core is primarily a teaching ISA. One of the best parts about RiscV is that you can teach a freshman level architecture class or a senior level chip building project with an ISA that is actually used. Anything powerful to run (a non built from source manually) linux will support a profile that bundles all the commonly needed instructions to be fast.
https://five-embeddev.com/riscv-bitmanip/1.0.0/bitmanip.html
I can see quite a few items on that list that imnsho should have been included in the core and for the life of me I can't see the rationale behind leaving them out. Even the most basic 8 bit CPU had various shifts and rolls baked in.
Same could be said of MIPS.
My understanding is the RISC-V raison d'etre is rather avoidance of patented/copywritten designs.
Why did it fall to them to do it? Impressive that he did, but it shouldn't have been necessary.
https://wren.wtf/hazard3/doc/#extension-xh3bextm-section
There are also four other custom extensions implemented.
Not trolling: I legitimately don't see why this is assumed to be true. It is one of those things that is true only once it has been achieved. Otherwise we would be able to create super high performance Sparc or SuperH processors, and we don't.
As you note, Arm once was fast, then slow, then fast. RISC-V has never actually been fast. It has enabled surprisingly good implementations by small numbers of people, but competing at the high end (mobile, desktop or server) it is not.
I'm a chip designer and I see people using RISC-V as small processor cores for things like PCIE link training or various bookkeeping tasks. These don't need to be fast, they need to be small and low power which means they will be relatively slow.
Most people on tech review sites only care about desktop / laptop / server performance. They may know about some of the ARM Cortex A series CPUs that have MMUs and can run desktop or smartphone Linux versions.
They generally don't care about the ARM Cortex M or R versions for embedded and real time use. Those are the areas where you don't need high performance and where RISC-V is already replacing ARM.
EDIT:
I'll add that there are companies that COULD make a fast RISC-V implementation.
Intel, AMD, Apple, Qualcomm, or Nvidia could redirect their existing teams to design a high performance RISC-V CPU. But why should they? They are heavily invested in their existing x86 and ARM CPU lines. Amazon and Google are using licensed ARM cores in their server CPUs.
What is the incentive for any of them to make a high performance RISC-V CPU? The only reason I can think of is that Softbank keeps raising ARM licensing costs and it gets high enough that it is more profitable to hire a team and design your own RISC-V CPU.
Especially the lack of integer overflow detection is a choice of great stupidity, for which there exists no excuse.
Detecting integer overflow in hardware is extremely cheap, its cost is absolutely negligible. On the other hand, detecting integer overflow in software is extremely expensive, increasing both the program size and the execution time considerably, because each arithmetic operation must be replaced by multiple operations.
Because of the unacceptable cost, normal RISC-V programs choose to ignore the risk of overflows, which makes them unreliable.
The highest performance implementations of RISC-V from previous years were forced to introduce custom extensions for indexed addressing, but those used inefficient encodings, because something like indexed addressing must be in the base ISA, not in an extension.
this just isn't true. both addition and multiplication can check for overflow in <2 instructions.
Read the RISC-V documentation to see that you need at least 3 instructions.
For addition, overflow happens when you add 2 positive numbers and the result is negative, or when you add 2 negative numbers and the result is positive.
After using an instruction to do the addition, how can you detect this complex condition with a single instruction in the impoverished RISC-V ISA?
EDIT: Someone has downvoted this, presumably because I have not been polite enough.
That may be true, but I consider much more impolite the kind of misleading false information that has been written by the poster to whom I have replied. It is difficult to read that kind of b*s*t and be cool about it.
Most languages don't care about integer overflow. Your typical C program will happily wrap around.
If I really want to detect overflow, I can do this:
add t0, a0, a1
blt t0, a0, overflow
Which is one more instruction, which is not great, not terrible.And what did I find? Yep that code is right from the manual for unsigned integer overflow.
For signed addition if you know one of the signs (eg it’s a compile time constant) the manual says
addi t0, t1, +imm
blt t0, t1, overflow
But the general case for signed addition if you need to check for overflow and don’t have knowledge of the signs add t0, t1, t2
slti t3, t2, 0
slt t4, t0, t1
bne t3, t4, overflow
From what I’ve read most native compiled code doesn’t really check for overflows in optimised builds, but this is more of an issue for JavaScript et al where they may detect the overflow and switch the underlying type? I’m definitely no expert on this.The correct sequence of instructions is given in the RISC-V documentation and it needs more instructions.
"Integer overflow" means "overflow in operations with signed integers". It does not mean "overflow in operations with non-negative integers". The latter is normally referred as "carry".
The 2 instructions given above detect carry, not overflow.
Carry is needed for multi-word operations, and these are also painful on RISC-V, but overflow detection is required much more frequently, i.e. it is needed at any arithmetic operation, unless it can be proven by static program analysis that overflow is impossible at that operation.
It's easy to believe you're replying to something that has an element of hyperbole.
It's hard to believe "just do 2x as many instructions" and "ehhh who cares [i.e. your typical C program doesn't check for overflow]", coupled to a seemingly self-conscious repetition of a quip from the television series Chernobyl that is meant to reference sticking your head in the sand, retire the issue from discussion.
The sequence of instructions given above is incorrect, it does not detect integer overflow (i.e. signed integer overflow). It detects carry, which is something else.
The correct sequence, which can be found in the official RISC-V documentation, requires more instructions.
Not checking for overflow in C programs is a serious mistake. All decent C compilers have compilation options for enabling checking for overflow. Such options should always be used, with the exception of the functions that have been analyzed carefully by the programmer and the conclusion has been that integer overflow cannot happen.
For example with operations involving counters or indices, overflow cannot normally happen, so in such places overflow checking may be disabled.
I know this is a very negative take. I don't try to hide my pro-Power ISA bias, but that doesn't mean I wouldn't like another choice. So far, however, I've been repeatedly disappointed by RISC-V. It's always "five or six years" from getting there.
SPARC (formerly called Berkeley RISC) and MIPS were pioneers that experimented with various features or lack of features, but they were inferior from many points of view to the earlier IBM 801.
The RISC ISAs developed later, including ARM, HP PA-RISC and IBM POWER, have avoided some of the mistakes of SPARC and MIPS, while also taking some features from IBM 801 (e.g. its addressing modes), so they were better.
You're saying ISA design does have implementation performance implications then? ;)
> There's no one that expects it'll be hard to optimize for
[Raises hand]
> There are at least 2 designs that have taped out in small runs and have high end performance.
Are these public?
Edit: I should add, I'm well aware of the cultural mismatch between HN and the semi industry, and have been caught in it more than a few times, but I also know the semi industry well enough to not trust anything they say. (Everything from well meaning but optimistic through to outright malicious depending on the company).
> There's no one that expects it'll be hard to optimize for
No one who is an expert in the field, and we (at Red Hat) talk to them routinely.
The most promising RISC-V companies today have not set out to compete directly with Intel, AMD, Apple or Samsung, but are targeting a niche such as AI, HPC and/or high-end embedded such as automotive.
And you can bet that Qualcomm has RISC-V designs in-house, but only making ARM chips right now because ARM is where the market for smartphone and desktop SoCs is. Once Google starts allowing RVA23 on Android / ChromeOS, the flood gates will open.
Is this because Fedora 44 is going to beta?
The optimizations that'd be applied to ARM and MIPS would be equally applicable to RISC-V. I do not believe this is a lack of software optimization issue.
We are well past the days where hand written assembly gives much benefit, and modern compilers like gcc and llvm do nearly identical work right up until it comes to instruction emissions (including determining where SIMD instructions could be placed).
Unless these chips have very very weird performance characteristics (like the weirdness around x86's lea instruction being used for arithmetic) there's just not going to be a lot of missed heuristics.
There's a reason, for example, why the linux distros all target a generic x86 architecture rather than a specific architecture.
However, other applications which must do cryptographic operations, audio/video processing, scientific/technical/engineering computing, etc. may have wildly different performances when compiled for different x86-64 ISA versions, for which dedicated assembly-language functions exist.
Java, interestingly enough, is somewhat leading the way here with their Vector API. I think they actually have one of the better setups for allowing someone to write fast code that is platform independent.
C++ is also diving into this realm. 26 just merged in now SIMD instructions.
That is the bulk of the benefit of diving down into assembly.
Most of the applications whose performance matters for me, because I must wait a non-negligible time for them to do their job, are dependent on assembly implementation for certain functions invoked inside critical loops. I do not see any sign of replacements for them. On the contrary, Intel, AMD and Arm continue to introduce special instructions that are useful in certain niche applications and taking advantage of them will require additional assembly language functions, not less.
For me, there is only one application that I use and which consumes non-negligible computer time and which does not depend on SIMD optimizations, which is the compilation of software projects.
There's no carry bit, and no widening multiply(or MAC)
That's true, but tautological.
The issue is that the RISC-V core is the easy part of the problem, and nobody seems to even be able to generate a chip that gets that right without weirdness and quirks.
The more fundamental technical problem is that things like the cache organization and DDR interface and PCI interface and ... cannot just be synthesized. They require analog/RF VLSI designers doing things like clock forwarding and signal integrity analysis. If you get them wrong, your performance tanks, and, so far, everybody has gotten them wrong in various ways.
The business problem is the fact that everybody wants to be the "performance" RISC-V vendor, but nobody wants to be the "embedded" RISC-V vendor. This is a problem because practically anybody who is willing to cough up for a "performance" processor is almost completely insensitive to any cost premium that ARM demands. The embedded space is hugely sensitive to cost, but nobody is willing to step into it because that requires that you do icky ecosystem things like marketing, software, debugging tools, inventory distribution, etc.
This leads to the US business problem which is the fact that everybody wants to be an IP vendor and nobody wants to ship a damn chip. Consequently, if I want actual RISC-V hardware, I'm stuck dealing with Chinese vendors of various levels of dodginess.
A lot of times the path to the highest performing CPU seems to be to optimize for power first, then speed, then repeat. That's because power and heat are a major design constraint that limits speed.
I first noticed this way back with the Pentium 4 "Netburst" architecture vs. the smaller x86 cores that became the ancestor of the Core architecture. Intel eventually ran into a wall with P4 and then branched high performance cores off those lower-power ones and that's what gave us the venerable Core architecture that made Intel the dominant CPU maker for over a decade.
ARM's history is another example.
In comparison, I think Arm is actually a very strong cautionary tale that focusing on power will not get you to performance. Arm processors remained pretty poor performance until designers from other CPU families entirely (PowerPC and Intel) took it on at Apple and basically dragged Arm to the performance level they are today.
Banias was hyper optimized for power, the mantra was to get done quickly and go to sleep to save power. Somewhere along the line someone said "hey what happens if we don't go to sleep?" and Core was born.
In theory you can spend a lot of effort to make a flawed ISA perform, but it will be neither easy nor pretty e.g. real world Linux distros can't distribute optimised packages for every uarch from dual-issue in-order RV64GC to 8-wide OoO RV64 with all the bells and whistles. Only in (deeply) embedded systems can you retarget the toolchain and optimise for each damn architecture subset you encounter.
At this point the most likely place for truly competitive RISC-V to appear is China.
Or we just adopt Loongson.
> In the current version of this architecture specification, TLB refill and consistent maintenance between TLB and page tables are still [sic] all led by software.
https://loongson.github.io/LoongArch-Documentation/LoongArch...
BTW, it's quite impressive how the s390x is so fast per core compared to the others. I mean, of course it's fast - we all knew that.
And don't let IBM legal see this can be considered a published benchmark, because they are very shy about s390x performance numbers.
What is the current fastest platform that isn’t exorbitantly expensive? Not upcoming releases, but something I can actually buy.
I check in every 3-6 months but the situation hasn’t changed significantly yet.
The cores are in my experience moderately fast at most. Note that there are a lot of licencing options and I think some are speed-capped - but I don't think that applies to IFL - a standard CPU licence-restricted to only run linux.
Except the K3 kills it on AI (60 TOPS).
SpacemiT K3 is 2010 Macbook performance single-core, 2019 Macbook Air multi-core, and better than M4 Apple Silicon for AI.
So I guess it depends on what you are going to do with it.
I think the ban of SOPHGO is part to blame for the slow development.[2] They had the most performant and interesting SOCs. I had a bunch of pre-orders for the Milk-V Oasis before it was cancelled. It was supposed to come out a while ago, using the SG2380, supposedly much more performant than the Milk-V Titan mentioned in the article (which still isn't out).
It was also SOPHGO's SOCs that powered the crazy cheap/performant/versatile Milk-V DUO boards. They have the ability to switch ARM/RISC-V architecture.
[1]: https://archriscv.felixc.at/
[2]: https://www.tomshardware.com/tech-industry/artificial-intell...
Obviously a solvable problem to split build and test but perhaps the time savings aren't worth the complexity.
https://src.fedoraproject.org/rpms/rpy/pull-request/4#reques...
Anyway, it's hardly surprising that a young ISA with not a 1/1000th of the investment of x86 or ARM has slower chips than them x)
I guess that may be the true use case for 'Open-Source' cores.
That being said, the advertised SPEC2007 scores are close to a M1 in IPC.
Cortex A57 is 14 years old and is significantly faster than the 9 year old Cortex A55 these RISC-V cores are being compared against.
So yes it's many years behind. Many, many years.
Tenstorrent Atlantis (first Ascalon silicon) should ship in Q2/Q3 and be twice as fast. About as fast as Ryzen5. So, about 5 years behind AMD.
But even the K3 has faster AI than Apple Silicon or Qualcomm X Elite.
Current trend-lines suggest ARM64 and RISC-V performance parity before 2030.
That'd be ~7 years behind, not 4. Cortex A76 came out in late 2018. Also what benchmarks are you looking at?
> Tenstorrent Atlantis (first Ascalon silicon) should ship in Q2/Q3 and be twice as fast. About as fast as Ryzen5. So, about 5 years behind AMD.
Which Ryzen 5? The first Ryzen 5 came out in 2017, which was a lot more than 5 years ago.
> But even the K3 has faster AI than Apple Silicon or Qualcomm X Elite.
Which isn't RISC-V. Might as well brag about a RISC-V CPU with an RTX 5090 being faster at CUDA than a Nintendo Switch. That's a coprocessor that has nothing to do with the ISA or CPU core.
> Current trend-lines suggest ARM64 and RISC-V performance parity before 2030.
L. O. fucking. L. That's not how this works. That's not how any of this works.
It just takes time, people who believe in it and tons of money. Will see where the journey goes, but I am a big risc-v believer
The fact that i686 is 14% faster than x86_64 is a little suspicious, because usually the same software runs _faster_ on x86_64 (despite the increased memory use) thanks to a larger register set, an optimized ABI, and more vector instructions.
Of course, if you are compiling an i686 binary on i686, and an x86_64 binary on x86_64, then the compilers aren't really doing the same work, since their output is different. I'm not a compiler expert, but I could imagine that compiling x86_64 binaries is intrinsically slower than for i686 for a variety of reasons. For example, x86_64 is mostly a superset of i686, so a compiler has way more instructions to consider, including potential optimizations using e.g. SIMD instructions that don't exist on i686 at all. Or a compiler might assume a larger instruction cache size, by default, and do more unrolling or inlining when compiling for x86_64. And so on.
In that case, compiling on x86_64 is slower not because the hardware is bad but because the compiler does more work. Perhaps something similar is happening on RISC-V.
But yeah, it may mean the benchmark is not representative.
i. llvm presentation can thrash caches if setup wrong (given the plethora of RISC-V fragmented versions, most compilers won't cover every vanity silicon.)
ii. gcc is also "slow" in general, but is predictable/reliable
iii. emulation is always slower than kvm in qemu
It may seem silly, but I'd try a gcc build with -O0 flag, and a toy unit test with -S to see if the ASM is actually foobar. One may have to force the -mtune=boom flag to narrow your search. Best regards =3