> Note there is no intrinsic reason running multiple streams should be faster than one.

The issue is the serialization of operations. There is overhead for each operation which translates into dead time between transfers.

However, there are issues that can cause a single stream to underperform multiple streams in the real world once you reach a certain scale or hit problems like packet loss.

reply
Is it certain that this is the reason?

rsync's man page says "pipelining of file transfers to minimize latency costs" and https://rsync.samba.org/how-rsync-works.html says "Rsync is heavily pipelined".

If pipelining is really in rsync, there should be no "dead time between transfers".

reply
The simple model for scp and rsync (it's likely more complex in rsync): a for loop over all files. For each file, determine its metadata with fstat, then fopen it and copy bytes in chunks until done, then proceed to the next iteration.
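
A rough sketch of that serial shape (hypothetical Python, not rsync's actual code, just to make the per-file bookkeeping visible):

    import os, shutil

    def copy_tree_serial(src_root, dst_root, chunk=128 * 1024):
        # One file at a time: stat, open, copy in chunks, close, repeat.
        # Every file pays its metadata and open/close overhead up front,
        # before a single payload byte moves.
        for dirpath, _dirs, files in os.walk(src_root):
            rel = os.path.relpath(dirpath, src_root)
            os.makedirs(os.path.join(dst_root, rel), exist_ok=True)
            for name in files:
                src = os.path.join(dirpath, name)
                dst = os.path.join(dst_root, rel, name)
                st = os.stat(src)                          # per-file metadata lookup
                with open(src, "rb") as fin, open(dst, "wb") as fout:
                    shutil.copyfileobj(fin, fout, chunk)
                os.utime(dst, (st.st_atime, st.st_mtime))  # preserve timestamps

With millions of tiny files, the stat/open/close bookkeeping dominates; with one huge file it rounds to zero.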

I don't know what rsync does on top of that (pipelining could mean many different things), but my empirical experience is that copying one 1 TB file is far faster than copying a billion 1 KB files (both sum to ~1 TB), and that load balancing/partitioning/parallelizing the tool when copying large numbers of small files leads to significant speedups, likely because the per-file overhead is hidden by the parallelism (and because individual copies that stall due to TCP or whatever else no longer hold everything up).

I guess the question is whether rsync is using multiple threads or otherwise accessing the filesystem in parallel, which I do not think it does, while tools like rclone, kopia, and aws sync all take advantage of parallelism (multiple ongoing file lookups and copies).
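
A hedged sketch of what that parallel approach looks like (assumed thread pool of 16 workers, nothing tool-specific):

    import os, shutil
    from concurrent.futures import ThreadPoolExecutor

    def copy_one(src, dst, chunk=128 * 1024):
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        with open(src, "rb") as fin, open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout, chunk)

    def copy_tree_parallel(src_root, dst_root, workers=16):
        # Many copies in flight at once, so per-file latencies (stat, open,
        # close, a stalled connection) overlap instead of adding up.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = []
            for dirpath, _dirs, files in os.walk(src_root):
                rel = os.path.relpath(dirpath, src_root)
                for name in files:
                    futures.append(pool.submit(
                        copy_one,
                        os.path.join(dirpath, name),
                        os.path.join(dst_root, rel, name)))
            for f in futures:
                f.result()   # surface any per-file errors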

reply
> I guess the question is whether rsync is using multiple threads or otherwise accessing the filesystem in parallel

No, that is not the question. Even Wikipedia explains that rsync is single-threaded. And even if it were multithreaded "or otherwise" used concurrent file IO:

The question is whether rsync _transmission_ is pipelined or not, meaning: Does it wait for 1 file to be transferred and acknowledged before sending the data of the next?

Somebody has to go check that.

If yes: then parallel filesystem access won't matter, because a network roundtrip has brutally higher latency than reading data sequentially off an SSD.
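
Back-of-the-envelope, assuming a 10 ms RTT and a strict stop-and-wait per file (numbers purely illustrative):

    rtt_s = 0.010        # assumed WAN round trip
    n_files = 1_000_000
    # One full round trip of waiting per file, regardless of disk or link speed:
    print(n_files * rtt_s / 3600, "hours of pure waiting")   # ~2.8 hours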

reply
Note that rsync on many small files is slow even within the same machine (across two physical devices), suggesting that the network roundtrip latency is not the major contributor.
reply
I’m not sure why, but just like with scp, I’ve achieved significant speed-ups by tarring the directory first (optionally compressing it), transferring it, and then unpacking/decompressing it. Maybe because it lets the tar-and-send side and the receive-and-untar/decompress side run on different threads?
reply
It's typically a disk-latency thing: just stat-ing the many files in a directory can have significant latency implications (especially on spinning HDDs) versus opening a single file (the tar) and read()-ing that one file into memory before writing to the network.

If copying a folder with many files is slower than tarring that folder and then moving the tar (not counting the untar), then disk latency is your bottleneck.
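
A quick way to check that on your own data, independent of any transfer tool (hypothetical helpers; assumes you've already made a tarball of the same tree):

    import os, time

    def read_tree(root, chunk=1 << 20):
        # Pays stat/open/close for every file.
        t0, total = time.monotonic(), 0
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                with open(os.path.join(dirpath, name), "rb") as f:
                    while data := f.read(chunk):
                        total += len(data)
        return total, time.monotonic() - t0

    def read_file(path, chunk=1 << 20):
        # Pays that overhead exactly once.
        t0, total = time.monotonic(), 0
        with open(path, "rb") as f:
            while data := f.read(chunk):
                total += len(data)
        return total, time.monotonic() - t0

If the tree is much slower than the tarball for the same number of bytes, per-file latency (not raw bandwidth) is the bottleneck. Drop the page cache between runs for an honest comparison.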

reply
The ideal solution to that is pipelining but it can be complex to implement.
reply
In general TCP just isn't great for high performance. In the film industry we used to use a commercial product, Aspera (now owned by IBM), which emulated ftp or scp but used UDP with forward error correction (instead of TCP retransmission). You could configure it to use a specific amount of bandwidth and it would just push everything else off the network to achieve it.
reply
What does "high performance" mean here?

I get 40 Gbit/s over a single localhost TCP stream on my 10-year-old laptop with iperf3.

So TCP itself does not seem to be the bottleneck if 40 Gbit/s is "high" enough, which it probably is for most people currently.

I have also seen plenty of situations in which TCP is faster than UDP in datacenters.

For example, on Hetzner Cloud VMs, iperf3 gets me 7 Gbit/s over TCP but only 1.5 Gbit/s over UDP. On Hetzner dedicated servers with 10 Gbit links, I get 10 Gbit/s over TCP but only 4.5 Gbit/s over UDP. But this could also be due to my use of iperf3 or its implementation.

I also suspect that TCP being a protocol whose state is inspectable by the network equipment between the endpoints allows that equipment to implement higher-performance handling, but I have not validated whether that is actually done.

reply
There's an open source implementation that does something similar but for a more specific use case: https://github.com/apernet/tcp-brutal

There's gotta be a less antisocial way though. I'd say using BBR and increasing the buffer sizes to 64 MiB does the trick in most cases.
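
For reference, the per-connection side of that looks roughly like this on Linux (buffer sizes are still clamped by net.core.rmem_max / wmem_max, and BBR needs a kernel that ships tcp_bbr):

    import socket

    BUF = 64 * 1024 * 1024   # 64 MiB target buffer

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask for large send/receive buffers; the kernel may clamp these
    # to the system-wide maximums.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF)
    # Request BBR for this socket (Linux-only constant, Python 3.6+);
    # falls back silently if the module isn't available or allowed.
    try:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, b"bbr")
    except OSError:
        pass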

reply
Have you tried searching for "tcp-kind"?
reply
Was the torrent protocol considered at some point? I'm always surprised how little presence it has in the industry considering how good the technology is.
reply
If you strip out the swarm logic (i.e. downloading from multiple peers), you're just left with a protocol that transfers big files in chunks, so there's no reason it'd be faster than any other sort of download manager that supports multi-threaded downloads.

https://en.wikipedia.org/wiki/Download_manager

reply
Torrent is great for many-to-one downloads, but I assume GP is talking about single-machine-to-single-machine transfers.
reply
Aspera's FASP [0] is very neat. One drawback is that, since the work TCP normally does is not done the traditional way, it must be done on the CPU: if a packet is missing or packets arrive out of order, the Aspera client fixes that itself instead of it all being handled by TCP.

As I understand it, this is also the approach of WEKA.io [1]. Another approach is RDMA [2], used by storage systems like Vast, which pushes the ordering and retransmission work down to NICs that support RDMA, so applications can read and write directly to the network instead of to system buffers.

0. https://en.wikipedia.org/wiki/Fast_and_Secure_Protocol

1. https://docs.weka.io/weka-system-overview/weka-client-and-mo...

2. https://en.wikipedia.org/wiki/Remote_direct_memory_access

reply
> has a hardcoded maximum receive buffer of 2MiB

For completeness, I want to add:

The 2MiB limit is per SSH "channel" -- the SSH protocol multiplexes multiple independent transmission channels over TCP [1], and each one has its own window size.

rsync and `cat | ssh | cat` only use a single channel, so if their counterparty is an OpenSSH sshd server, their throughput is limited by the 2MiB window limit.
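
That window translates directly into a throughput ceiling of roughly window / RTT, no matter how fast the link is. Illustrative numbers (RTT assumed):

    window = 2 * 1024 * 1024   # 2 MiB per-channel window
    rtt = 0.05                 # assumed 50 ms round trip
    print(window / rtt / 1e6)  # ~42 MB/s ceiling, however fat the pipe is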

rclone seems to be able to use multiple ssh channels over a single connection; I believe this is what the `--sftp-concurrency` setting [2] controls.

Some more discussion about the 2MiB limit and links to work for upstreaming a removal of these limits can be found in my post [3].

Looking into it just now, I found that the SSH protocol itself already supports dynamically growing per-channel window sizes with `CHANNEL_WINDOW_ADJUST`, and OpenSSH seems to generally implement that. I don't fully grasp why it doesn't just use that to extend as needed.

I also found that there's an official `no-flow-control` extension with the description

> channel behaves as if all window sizes are infinite.
>
> This extension is intended for, but not limited to, use by file transfer applications that are only going to use one channel and for which the flow control provided by SSH is an impediment, rather than a feature.

So this looks like it was designed exactly for cases like rsync. But no software implements this extension!

I wrote those things down in [4].

It is frustrating to me that we're only a ~200 line patch away from "unlimited" instead of shitty SSH transfer speeds -- for >20 years!

[1]: https://datatracker.ietf.org/doc/html/rfc4254#section-5

[2]: https://rclone.org/sftp/#sftp-concurrency

[3]: https://news.ycombinator.com/item?id=40856136

[4]: https://github.com/djmdjm/openssh-portable-wip/pull/4#issuec...

reply
> It almost always indicates some bottleneck in the application or TCP tuning.

Yeah, this has been my experience with low-overhead streams as well.

Interestingly, I see this "open more streams to send more data" pattern all over the place in file transfer tooling.

Recent ones that come to mind are BackBlaze's CLI (B2) and, from taking a peek with Wireshark, Amazon's SDK for S3 uploads. (What do they know that we don't seem to think we know?)

It seems like they're all doing this, which is maybe odd, because when I analyse what Plex or Netflix is doing, it's not the same: they do what you're suggesting and tune the application + TCP/UDP stack. Though that could be due to their 1-to-1 streaming use case.

There is overhead somewhere and they're trying to get past it via semi-brute-force methods (in my opinion).

I wonder if there is a serialization or loss handling problem that we could be glossing over here?

reply
That is a different problem. For S3-esque transfers you might very well be limited by the target's ability to receive X MB/s and not more, so starting parallel streams will make it faster.

I used B2 as the third leg for our backups and pretty much had to give rclone more connections at once, because the defaults were nowhere close to saturating the bandwidth.

reply
Tuning on Linux requires root and is systemwide. I don't think BBR is even available on other systems. And you need to tune the buffer sizes of both ends too. Using multiple streams is just less of a hassle for client users. It can also fool some traffic shaping tools. Internal use is a different story.
reply
Not sure about B2, but the AWS S3 SDK not assuming that people will do any tuning makes total sense

cuz in my experience no one is doing that tbh

reply
Uhh.. I work with this stuff daily and there are a LOT of intrinsic reasons a single stream would be slower than running multiple: MPLS ECMP hashing you onto a single path, a single loss event with a high BDP causing congestion control to kick in for a single flow, CPU IRQ affinity, and probably many more I’m not thinking of, like the inner workings of NIC offload queues.

Source: Been in big tech for roughly ten years now trying to get servers to move packets faster

reply
Ha, it sounds like the best way to learn something is to make a confident and incorrect claim :)

> MPLS ECMP hashing you over a single path

This is kinda like the traffic shaping I was talking about, but fair enough. It's not an inherent limitation of a single stream, just a consequence of how your network is designed.

> a single loss event with a high BDP

I thought BBR mitigates this. Even if it doesn't, I'd still count that as a TCP stack issue.

At a large enough scale I'd say you are correct that multiple streams are inherently easier to optimize throughput for. But probably not on a single 1-10 Gbit link.

reply
> This is kinda like the traffic shaping I was talking about though, but fair enough. It's not an inherent limitation of a single stream, just a consequence of how your network is designed.

It is. One stream gets your traffic onto one path into the infrastructure. Multiple streams get you multiple paths, and possibly also hit different servers to accelerate things even more. It's just that the limitation isn't the hardware itself but "our networking device has 4x 10 Gbit ports instead of a single 40 Gbit port".

Especially if the link is saturated, you'd essentially be taking n times your "fair share" of bandwidth on that link.

reply
> Note there is no intrinsic reason running multiple streams should be faster than one

If the server side scales (as cloud services do), it might end up using different endpoints for the parallel connections and saturate the bandwidth better. One server instance might be serving other clients as well and can't fill one particular client's pipe entirely.

reply
Wouldn't lots of streams speed up transfers of thousands of small files?
reply
If the application handles them serially, then yeah. But one can imagine the application opening files in threads, buffering them, and then finally sending them at full speed, so in that sense it is an application issue. If you truly have millions of small files, you're more likely to be bottlenecked by disk IO performance than by the application or the network, though. My primary use case for ssh streams is zfs send, which is mostly bottlenecked by ssh itself.
reply
It's an application issue, but implementation-wise it's probably way more straightforward to just open a separate network connection per thread.
reply
The author tried running the rsyncd daemon, so it's not _just_ the ssh protocol.
reply
Single-file overhead (opening millions of tiny files whose metadata is not in the OS cache and reading them) appears to be an intrinsic reason (intrinsic to the OS, at least).
reply
The majority of that will be big files. And on NVMe it is VERY fast even if you run single-threaded; 10 Gbit should be easy.
reply
IOPS and disk read queue depth are common limits.

Depending on what you're doing, it can be faster to leave your files in a solid archive, which is less likely to be fragmented and gets contiguous reads.

reply
I mean, isn't a single TCP connection's throughput limited by the latency? Which is why on high(er)-latency WAN links you generally want to open multiple connections for large file transfers.

https://wintelguy.com/wanperf.pl
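
(Sustained throughput tops out around window size / RTT, so the amount of in-flight data needed grows with latency; illustrative numbers for a 10 Gbit/s path:)

    link_Bps = 10e9 / 8    # 10 Gbit/s in bytes per second
    rtt = 0.05             # assumed 50 ms round trip
    bdp = link_Bps * rtt   # bandwidth-delay product
    print(bdp / 2**20)     # ~60 MiB must be in flight to fill the link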

reply