To be fair, Excel would erase the places it wanted to write to white up to 9 times before it drew any black pixels. We made that very fast! We didn't tell them :-)
At the time, 24-bit framebuffers were so slow that, before we built graphics acceleration hardware, people would switch back to 8-bit to get stuff done. Making 24-bit/true colour your daily driver was a big step forward.
PS – I am looking through the NuBus cards that I have... did you work for SuperMac or RasterOps?
I did the architectural design for the SuperMac cards. I figured out what needed to be accelerated, dropping code into people's machines to see where the cycles were going. Others did the physical design for the first 2 cards, I did the design of the chip in the Thunder and later cards (designed the data paths and state machines and a full simulation, someone else actually laid the gates)
If your card has a SQD01 on it, it's my work. It peaks at 1.5Gb/s on solid fills.
One of the other bugs (the Quark/ATM one) also happened because the programmers were worried about writing over stuff that hadn't been completely erased. The Quark guys wrote a string with 2 spaces at the end through a box that masked the end of the string; the ATM font renderer saw it couldn't fit the text, so it split it in half and tried again, so it drew N/2, N/4, N/8 ... strings. It spent all its time in the 68k's multiply instructions figuring out how wide the strings (and substrings) were; our fancy 24-bit character rendering hardware was an afterthought.
I feel like I'm having a stroke trying to read this, what does it mean??
I was capturing QuickDraw library calls (the low-level graphics primitives) to figure out where the graphics time in apps was going, and found that Excel sometimes erased the same area 9 times.
Of course users didn't see it more than once, but our hardware made all that wasted time run faster
Another dev who's fixing a bug realizes that if they call a certain function, either directly or indirectly, their particular bug gets fixed.
Oh, and as a side effect, the cell gets erased (again).
A few more fixes/new features added like this and the code is inadvertently erasing the same cell multiple times.
It takes a certain type of dev to step through in a debugger and notice the app is doing way too much work, and then to untangle the mess of code without causing regressions.
8 bit pseudo color, so the color palette switched with every focus-follows-mouse window boundary crossing. 16 bit direct color with banding but no more palette psychedelia.
This was equal parts to make it faster and to allow for higher framebuffer resolutions with limited VRAM.
Back then you did what you could with graphics and it wasn't a lot. After I got a PC I had indexed color for a long time and working with indexed color was pretty rough because anything physics-based like rendering or raytracing was going to be difficult. You could render a photo pretty well with 256 carefully chosen colors and dithering but if you wanted to, say, composite two photos and do general sorts of things you'd need to convert to "true color", do the math there, then re-quantize for display.
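To make the "convert to true color, do the math there, then re-quantize" step concrete, here is a rough sketch in C. The `rgb` type and the `nearest_index` and `blend_indexed` helpers are names made up for this example, the blend is a plain 50/50 average, and it does no dithering, so treat it as an illustration of the idea rather than how any real paint program did it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical true-colour pixel type for this sketch. */
typedef struct { uint8_t r, g, b; } rgb;

/* Re-quantize: find the palette entry closest to c (squared RGB distance). */
static int nearest_index(const rgb *pal, int n, rgb c) {
    assert(n > 0);
    int best = 0;
    long best_d = -1;
    for (int i = 0; i < n; i++) {
        long dr = (long)pal[i].r - c.r;
        long dg = (long)pal[i].g - c.g;
        long db = (long)pal[i].b - c.b;
        long d = dr * dr + dg * dg + db * db;
        if (best_d < 0 || d < best_d) {
            best_d = d;
            best = i;
        }
    }
    return best;
}

/* Composite two indexed pixels 50/50: look both up in the palette
   (indexed -> true colour), average the channels, then map the
   result back to the nearest palette index. */
static int blend_indexed(const rgb *pal, int n, int ia, int ib) {
    rgb a = pal[ia], b = pal[ib];
    rgb mix = {
        (uint8_t)((a.r + b.r) / 2),
        (uint8_t)((a.g + b.g) / 2),
        (uint8_t)((a.b + b.b) / 2)
    };
    return nearest_index(pal, n, mix);
}
```

With a four-entry palette of black, white, mid-gray and red, blending the black and white entries lands on the gray entry, because (127, 127, 127) is closest to (128, 128, 128). With a less convenient palette the nearest entry can be visibly off, which is where dithering would come in.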
Was it a workaround for things that didn’t fully complete on one iteration, so the devs kept hammering away at it until it worked?
Not every bug results in the program doing the wrong thing, they often just make the program do the right thing very slowly.
And nobody notices, since it still produces the right result.
Now the bugs that get ignored for new features cause bad results AND bad performance.
If the stream is buffered, then all operations, including fread, are supposed to go through the buffer.
All three of these should issue buffer-sized reads to the operating system:
1. A loop which calls getc(stream) 65536 times.
2. fread(buf, 1, 65536, stream)
3. fread(buf, 65536, 1, stream)
The more direct behavior of fread should only kick in if the stream is configured as unbuffered.
I would say that the way low-level reads are issued to the host operating system is a "visible effect" of the program, so I suspect this may actually be a matter of conformance. I.e. it's not okay to issue those reads however the stream library wants as long as the data is read.
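For reference, the three patterns listed above look like this in C. This sketch only checks that they deliver the same bytes from a buffered stream; it does not measure how many reads reach the operating system (for that you would watch the process with something like strace on Linux). `CHUNK` and the helper names are made up for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum { CHUNK = 65536 };

/* Create a temporary file holding CHUNK bytes; returns NULL on failure. */
static FILE *make_test_stream(void) {
    FILE *f = tmpfile();
    if (!f) return NULL;
    for (int i = 0; i < CHUNK; i++) {
        if (putc(i & 0xFF, f) == EOF) { fclose(f); return NULL; }
    }
    rewind(f);
    return f;
}

/* 1. A loop which calls getc(stream) CHUNK times. */
static size_t read_with_getc(FILE *f, unsigned char *buf) {
    size_t n = 0;
    int c;
    while (n < CHUNK && (c = getc(f)) != EOF) buf[n++] = (unsigned char)c;
    return n;
}

/* 2. CHUNK items of one byte each; returns the number of bytes read. */
static size_t read_bytes(FILE *f, unsigned char *buf) {
    return fread(buf, 1, CHUNK, f);
}

/* 3. One item of CHUNK bytes; fread returns 0 or 1 here, so scale it. */
static size_t read_one_block(FILE *f, unsigned char *buf) {
    return fread(buf, CHUNK, 1, f) * CHUNK;
}

/* Returns 1 if all three strategies read the same CHUNK bytes. */
static int strategies_agree(void) {
    static unsigned char a[CHUNK], b[CHUNK], c[CHUNK];
    FILE *f = make_test_stream();
    if (!f) return 0;
    size_t na = read_with_getc(f, a);
    rewind(f);
    size_t nb = read_bytes(f, b);
    rewind(f);
    size_t nc = read_one_block(f, c);
    fclose(f);
    return na == CHUNK && nb == CHUNK && nc == CHUNK &&
           memcmp(a, b, CHUNK) == 0 && memcmp(a, c, CHUNK) == 0;
}
```

Calling `strategies_agree()` from a `main` should return 1 on a conforming implementation. Whether each variant also results in buffer-sized reads underneath is exactly the implementation-quality question being argued here.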
Edit: removed incorrect information.
See the original post and discussion for the whole story:
https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times... https://news.ycombinator.com/item?id=26296339
What software did that that badly? If the code asks for (up to) 65,536 single byte items, why would you split that into 65,536 calls?
Also, that change changes behavior. The old call could read anything from zero to 65,536 bytes; the new one can only report zero or 65,536 bytes.
(Reading the source of a few implementations, I think most implementations will fill the output buffer with partial objects if the input doesn’t supply an integral number of them, but the return value of fread cannot signal that to the caller)
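A small experiment makes the difference in return values visible. This is a sketch, with `fread_result` being a helper name invented for the example: it writes a given number of bytes to a temporary file and reports what fread returns for a given size/nmemb pair.

```c
#include <assert.h>
#include <stdio.h>

/* Write `available` bytes to a temp file, then call
   fread(buf, size, nmemb, f) and return its result.
   Returns (size_t)-1 if the temp file could not be set up. */
static size_t fread_result(size_t available, size_t size, size_t nmemb) {
    static unsigned char buf[65536];
    /* The request must fit in buf. */
    assert(size != 0 && nmemb <= sizeof buf / size);

    FILE *f = tmpfile();
    if (!f) return (size_t)-1;
    for (size_t i = 0; i < available; i++) {
        if (putc('x', f) == EOF) { fclose(f); return (size_t)-1; }
    }
    rewind(f);

    /* fread returns the number of complete items read, not bytes. */
    size_t got = fread(buf, size, nmemb, f);
    fclose(f);
    return got;
}
```

With only 100 bytes in the file, `fread_result(100, 1, 65536)` returns 100, while `fread_result(100, 65536, 1)` returns 0, because no complete 65,536-byte item was available. The caller of the second form cannot tell from the return value whether 0 or 100 bytes ended up in the buffer.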
> For each object, size calls are made to the fgetc function and the results stored, in the order read, in an array of unsigned char exactly overlaying the object
(wording unchanged since C99)
If the file is unbuffered, depending on how the implementation handles buffering, and how it interprets the standard, then perhaps it does end up hitting a path where there's 1 ReadFile call per byte...
I don't know how most implementations get around this. Presumably it's valid to interpret "calls are made" as "behaving as if calls are made", meaning fread can copy data out of the FILE's buffer directly, or make calls directly to whatever routine fgetc defers to, rather than calling fgetc N times literally. Looks like glibc's fread does this.
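Read completely literally, the quoted wording would give something like the sketch below. `literal_fread` is a name made up for illustration and is not taken from any real C library; real implementations are free to get the same observable result by copying out of the stream buffer in bulk.

```c
#include <assert.h>
#include <stdio.h>

/* A literal reading of "for each object, size calls are made to fgetc".
   Returns the number of complete objects read, like fread does. */
static size_t literal_fread(void *ptr, size_t size, size_t nmemb, FILE *stream) {
    assert(ptr != NULL && stream != NULL);
    unsigned char *out = ptr;
    size_t done = 0;

    if (size == 0 || nmemb == 0) return 0;

    for (; done < nmemb; done++) {
        for (size_t i = 0; i < size; i++) {
            int c = fgetc(stream);
            /* On EOF or error the partially filled object is not counted,
               although its first i bytes have already been stored. */
            if (c == EOF) return done;
            out[done * size + i] = (unsigned char)c;
        }
    }
    return done;
}
```

On a buffered FILE each `fgetc` mostly just takes a byte from the buffer, so even this version need not mean one OS read per byte. It only gets that bad if the stream is unbuffered or the implementation does something odd underneath.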
As to why you'd do that? - well, who knows the exact circumstances in this case. Perhaps this was faster in some meaningful case that was relevant to some other project (and then maybe the fread doesn't call fgetc after all!). I'm just speculating. Well-reused code often ends up with stuff that needs rethinking, that, even if noticed, nobody has the time or inclination to attempt to fix.
I had to convince people with benchmarks regularly that, yes, you could write the handful of lines to do proper user-space buffering and trivially run rings around any code that did extra context switches, because a lot of people didn't realise the cost difference between system calls and calling their own functions.
This included, by the way, the MySQL client library, which at one point would do small reads for length fields instead of larger non-blocking reads into a buffer all the time.
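The "handful of lines" of user-space buffering could look roughly like this sketch. It assumes a POSIX system (it uses `read()`, `lseek()` and `fileno()`), the `buf_reader` type and the `br_*` names are invented for the example, and it leaves out EINTR handling and non-blocking I/O for brevity.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical minimal buffered reader on top of a file descriptor. */
typedef struct {
    int fd;
    unsigned char buf[4096];
    size_t pos, len;         /* next unread byte and bytes held in buf */
    unsigned long syscalls;  /* number of read() calls issued so far */
} buf_reader;

static void br_init(buf_reader *r, int fd) {
    assert(r != NULL);
    r->fd = fd;
    r->pos = r->len = 0;
    r->syscalls = 0;
}

/* Return the next byte, or -1 on EOF or error. Only calls read()
   when the buffer is empty; otherwise it is a plain memory access. */
static int br_getc(buf_reader *r) {
    if (r->pos == r->len) {
        ssize_t n = read(r->fd, r->buf, sizeof r->buf);
        r->syscalls++;
        if (n <= 0) return -1;
        r->len = (size_t)n;
        r->pos = 0;
    }
    return r->buf[r->pos++];
}

/* Demo: write nbytes to a temp file, read them back one byte at a time
   through the buffer, and report bytes seen and read() calls made.
   Returns 0 on success, -1 if the temp file could not be set up. */
static int br_demo(size_t nbytes, size_t *bytes_seen, unsigned long *syscalls) {
    FILE *f = tmpfile();
    if (!f) return -1;
    for (size_t i = 0; i < nbytes; i++) {
        if (putc('a', f) == EOF) { fclose(f); return -1; }
    }
    if (fflush(f) != 0 || lseek(fileno(f), 0, SEEK_SET) != 0) {
        fclose(f);
        return -1;
    }
    buf_reader r;
    br_init(&r, fileno(f));
    *bytes_seen = 0;
    while (br_getc(&r) != -1) (*bytes_seen)++;
    *syscalls = r.syscalls;
    fclose(f);
    return 0;
}
```

Reading 10,000 bytes one at a time this way costs a handful of `read()` calls instead of 10,000, which is the whole cost difference being argued about.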
But I think the parent comment's point is that the issue is in the implementation of fread itself in the standard library. It's perfectly reasonable for an application to pass it 1, 65536 (i.e. one byte, up to 65536 times) and expect it not to issue 65536 separate OS calls.
No, I'm not saying that's why. I'm simply saying there is a difference between asking for 1 byte or 65k bytes of something. Even dd runs the same under Linux.
dd bs=10k count=1 is faster than bs=1 count=10k
I remember trying to recover some data from a spinning disk, and trying to slowly creep up on the data. So I wanted 1 byte per, I wanted it to nibble, until it hit whatever the errored part was. If I just grabbed the lot, it'd error out from the whole read.
The latter (as usual when comparing OpenBSD and Linux) is more complex, but both multiply count by size and then go their way.
Also, the API contract allows fread to read fewer bytes than requested. I would expect any implementation to do that.
But maybe somebody interpreted the contract differently than the major OSes, in the sense that a call isn’t allowed to write partial size-sized chunks to user memory and/or advance the file position further than its return value indicates (that, I think, is something that the implementations above can do, and might be considered a bug).
Yes it's different. As others have noted, the difference is what is returned if less than 65536 are available to read in the file: total failure vs partial read.
There is, unsurprisingly, no requirement that it has an unnecessarily inefficient implementation to meet this behavioral requirement. (The C standard doesn't talk about such things as syscalls but, even if it did, it surely wouldn't require such a thing.)
The irony is that that partial read is actually the default on both Windows and Posix (i.e. both ReadFile and read() will read up to the number of bytes specified). So a one-syscall implementation for fread would have been easier than multiple calls, and certainly would be standard compliant.
The dd example isn't comparable because dd is much lower level, and you really are specifying how the syscalls should be made.
I've not looked at the code (or even the man pages) and it is a long time since I touched anything that low level, so this might be completely wrong, but if there is an error before the next 64KiB (including just hitting EOF) then the semantics could be different. Asking for 1x64KiB I would expect to just error as there aren't the requested number of bytes. Asking for 64Ki lots of 1 byte might simply error just the same, or it might at least populate the buffer with what it can read, or if the meaning of 1,65536 is actually “up to 64Ki lots of 1B” then it would populate the buffer as far as possible and return the amount read rather than an error condition.
If the per-byte option is slow but still fast enough, and dealing with the semantics is less faff, then people will go for that because the tiny time loss is worth the larger effort reduction. Of course this assumes the underlying system doesn't change, as with the “making local code to run as on-demand networked code” example higher in the thread, which changes the relative performance characteristics of the two calling methods significantly.
fread(data, 1, sizeof(buffer), f);
with the rationale that I'm interested in reading sizeof(buffer) individual bytes. The buffer size is incidental, not the size of the items I'm trying to read from the file; "read one item whose size is sizeof(buffer)" seems semantically wrong. Is this just a case of Windows having a bad stdlib fread implementation 15 years ago, or is my thinking here actually wrong?
The C runtime authors did (presumably Microsoft, if it's MSVCRT).
He's hooking into ReadFile, a layer below the stdlib. By the time it reaches the hook, it's already split.
For example, I run TortoiseGit, which has a caching feature that is supposed to make it faster at showing what to commit. Disabling it increases the number of items I can delete per second in my Windows Explorer from about 1000 to about 3000, while not making TortoiseGit operations meaningfully slower (that I can tell).
This is a Dev Drive [0] on my machine, it would probably be slower on my C: drive which has full Windows Defender real time file scanning.
This is a great article on why it's so unreasonably slow to modify these archives: https://textslashplain.com/2021/06/02/leaky-abstractions/
But it doesn't seem to explain why it's so much slower at regular extraction.
That's because the OS does the same thing too. It's the right fix; when I implemented something similar, we implemented caching right away.
But in this case if the code was calling fread 65536 times in a loop and getting 64KiB each time it wouldn't be good either!
Sounds like the parent comment had to fix this with the internal cache thing to speed up the small freads. I think they meant the easy fix would have been swapping the args in the original / caller code.
Edit: mort96: So did you check the return value or not?
I really hope that was not the case, and would rather think it was incompetence or a way to deal with obscure legacy problems, but the gamer in me gets enraged at the thought that someone would artificially increase loading times.
Which is the obvious reason you'd pass an element size of 1: you want to know how many bytes were read.