The latter (as usual when comparing OpenBSD and Linux) is more complex, but both multiply count by size and then go their way.
Also, the API contract allows fread to read fewer bytes than requested. I would except any implementation to do that.
But maybe, somebody interpreted the contract differently than major OSes, in the sense that a call isn’t allowed to write partial size-sized chunks to user memory and/or advance the file position further than its return value advocates (that, I think, is something that the implementations above can do, and might be considered a bug)
Yes it's different. As others have noted, the difference is what is returned if less than 65536 are available to read in the file: total failure vs partial read.
There is, unsurprisingly, no requirement that it has an unnecessarily inefficient implementation to meet this behavioral requirement. (The C standard doesn't talk about such things as syscalls but, even if it did, it surely wouldn't require such a thing.)
The irony is that that partial read is actually the default on both Windows and Posix (i.e. both ReadFile and read() will read up to the number of bytes specified). So a one-syscall implementation for fread would have been easier than multiple calls, and certainly would be standard compliant.
The dd example isn't comparable because dd is much lower level, and you really are specifying how the syscalls should be made.
I've not looked at the code (or even the man pages) and it is a long time since I touched anything that low level, so this might be completely wrong, but if there is an error before the next 64KiB (including just hitting EOF) then the semantics could be different. Asking for 1x64KiB I would expect to just error as there aren't the requested number of bytes. Asking for 64Ki lots of 1 byte might simple error just the same, or it might at least populate the buffer with what it can read, or if the meaning of 1,65536 is actually “up to 64Ki lots of 1B” then it would populate the buffer as far as possible and return the amount read rather than an error condition.
If the per-byte option is slow but still fast enough, and dealing with the semantics is less faf, then people will go for that because the tiny time loss is worth the larger effort reduction. Of course this assumes the underlying system doesn't change, as with the “making local code to run as on-demand networked code” example higher in the thread which changes the relative performance characteristics of the two calling methods significantly.