SSH was never really meant to be a high-performance data transfer tool, and it shows. For example, it has a hardcoded maximum receive buffer of 2 MiB (separate from the TCP one), which drastically limits transfer speed over high-BDP links (even a fast local link, like the author's 10 Gbit/s one). The encryption can also be a bottleneck. hpn-ssh [1] aims to solve this, but I'm not so sure about running an SSH fork on important systems.
The issue is the serialization of operations. There is overhead for each operation which translates into dead time between transfers.
However, there are issues that can cause single streams to underperform multiple streams in the real world once you reach a certain scale or face problems like packet loss.
rsync's man page says "pipelining of file transfers to minimize latency costs" and https://rsync.samba.org/how-rsync-works.html says "Rsync is heavily pipelined".
If pipelining is really in rsync, there should be no "dead time between transfers".
I get 40 Gbit/s over a single localhost TCP stream on my 10-year-old laptop with iperf3.
So TCP itself does not seem to be a bottleneck if 40 Gbit/s is "high" enough, which it probably is currently for most people.
I have also seen plenty of situations in which TCP is faster than UDP in datacenters.
For example, on Hetzner Cloud VMs, iperf3 gets me 7 Gbit/s over TCP but only 1.5 Gbit/s over UDP. On Hetzner dedicated servers with 10 Gbit links, I get 10 Gbit/s over TCP but only 4.5 Gbit/s over UDP. But this could also be due to my use of iperf3 or its implementation.
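For reference, the commands I'm comparing are roughly these (iperf3 defaults UDP to a tiny target bitrate, so -b has to be set explicitly or the UDP number means nothing; <server> is a placeholder):

    iperf3 -c <server>              # TCP test, fills the link by default
    iperf3 -c <server> -u -b 10G    # UDP test, only sends as fast as -b allows
    iperf3 -c <server> -P 4         # 4 parallel TCP streams, for comparison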
I also suspect that TCP, being a protocol whose state is inspectable by the network equipment between endpoints, allows that equipment to optimize for higher performance, but I have not validated whether that is actually done.
There's gotta be a less antisocial way though. I'd say using BBR and increasing the buffer sizes to 64 MiB does the trick in most cases.
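On Linux that's roughly the following sysctls (values are examples, not gospel - size the buffers for your actual BDP):

    # switch congestion control to BBR; fq is the qdisc commonly paired with it
    sysctl -w net.core.default_qdisc=fq
    sysctl -w net.ipv4.tcp_congestion_control=bbr
    # allow socket buffers up to 64 MiB so autotuning has headroom on high-BDP paths
    sysctl -w net.core.rmem_max=67108864
    sysctl -w net.core.wmem_max=67108864
    sysctl -w net.ipv4.tcp_rmem="4096 131072 67108864"
    sysctl -w net.ipv4.tcp_wmem="4096 16384 67108864"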
As I understand it, this is also the approach of WEKA.io [1]. Another approach is RDMA [2], used by storage systems like Vast, which offloads the ordering and retransmission work to RDMA-capable NICs so that applications can read and write directly to the network instead of going through system buffers.
0. https://en.wikipedia.org/wiki/Fast_and_Secure_Protocol
1. https://docs.weka.io/weka-system-overview/weka-client-and-mo...
2. https://en.wikipedia.org/wiki/Remote_direct_memory_access
Yeah, this has been my experience with low-overhead streams as well.
Interestingly, I see this "open more streams to send more data" pattern all over the place in file transfer tooling.
Recent examples that come to mind are Backblaze's B2 CLI and, from taking a peek with Wireshark, Amazon's SDK for S3 uploads. (What do they know that we don't seem to think we know?)
It seems like they're all doing this? Which is maybe odd, because when I analyse what Plex or Netflix is doing, it's not the same? They do what you're suggesting, tune the application + TCP/UDP stack. Though that could be due to their 1-to-1 streaming use case.
There is overhead somewhere and they're trying to get past it via semi-brute-force methods (in my opinion).
I wonder if there is a serialization or loss handling problem that we could be glossing over here?
cuz in my experience no one is doing that tbh
If the server side scales out (as cloud services do), the parallel connections may end up hitting different endpoints and saturate the bandwidth better. A single server instance might be serving other clients as well and can't fill one particular client's pipe on its own.
Source: Been in big tech for roughly ten years now trying to get servers to move packets faster
> MPLS ECMP hashing you over a single path
This is kinda like the traffic shaping I was talking about, but fair enough. It's not an inherent limitation of a single stream, just a consequence of how your network is designed.
> a single loss event with a high BDP
I thought BBR mitigates this. Even if it doesn't, I'd still count that as a TCP stack issue.
At a large enough scale I'd say you are correct that multiple streams are inherently easier to optimize throughput for. But probably not on a single 1-10 Gbit/s link.
Depending on what you're doing, it can be faster to leave your files in a solid archive, which is less likely to be fragmented, so you get contiguous reads.
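E.g. something like this if the source is a pile of small files (paths are made up):

    # pack once so the transfer becomes one big sequential read instead of thousands of seeks
    tar -cf /tank/photos.tar /data/photos
    rclone copy /tank/photos.tar remote:backups/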
I'm currently working on the GUI if you're interested: https://github.com/rclone-ui/rclone-ui
Related to this is the very useful:
rclone serve restic ...
... workflow that allows you to create append-only (immutable) backups. This howto is not rsync.net-specific - you can follow this recipe at any standard SSH endpoint:
https://www.rsync.net/resources/notes/2025-q4-rsync.net_tech...
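Roughly the shape of it, for anyone who hasn't tried it (remote name and paths are placeholders; check the rclone/restic docs for your versions):

    # expose an SFTP-backed rclone remote over restic's REST protocol, refusing deletes
    rclone serve restic --append-only --addr localhost:8080 mysftp:restic-repo &
    # point restic at it like any other REST server
    restic -r rest:http://localhost:8080/ init
    restic -r rest:http://localhost:8080/ backup /home/me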
My goal is to smooth out some of the operational rough edges I've seen companies deal with when using the tool:
- Team workspaces with role-based access control
- Event notifications & webhooks – Alerts on transfer failure or resource changes via Slack, Teams, Discord, etc.
- Centralized log storage
- Vault integrations – Connect 1Password, Doppler, or Infisical for zero-knowledge credential handling (no more plain text files with credentials)
- 10 Gbps connected infrastructure (Pro tier) – High-throughput Linux systems for large transfers

This idea that one must “give back” after receiving a gift freely given is simply silly.
I've adjusted threads and the various other controls rclone offers, but I still feel like I'm not seeing its true potential, because the second it hits a rate limit I can all but guarantee that job will have to be restarted with new settings.
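For reference, these are the kinds of controls I mean (flag names are from rclone's docs, the values are just what I've been experimenting with):

    # --transfers: parallel file transfers, --checkers: parallel listing/hash checks
    # --tpslimit: cap API calls per second to stay under provider rate limits, --bwlimit: bandwidth cap
    rclone copy src: dst: --transfers 16 --checkers 32 --tpslimit 10 --bwlimit 200M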
That hasn't been true for more than 8 years now.
Source: https://github.com/rclone/rclone/blob/9abf9d38c0b80094302281...
And the PR adding it: https://github.com/rclone/rclone/pull/2622
Edit: oh I see, delta transfer only sends the changed parts of files?
You can also run multiple instances of rsync; the problem is how to efficiently divide the set of files.
It turns out, fpart does just that! Fpart is a Filesystem partitioner. It helps you sort file trees and pack them into bags (called "partitions"). It is developed in C and available under the BSD license.
It comes with an rsync wrapper, fpsync. Now I'd like to see a benchmark of that vs rclone!
via https://unix.stackexchange.com/q/189878/#688469 and https://stackoverflow.com/q/24058544/#comment93435424_255320...
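Usage is roughly this, if I'm reading the fpsync(1) man page right (the numbers are arbitrary):

    # run 8 rsync workers in parallel, each sync job handling at most 2000 files
    fpsync -n 8 -f 2000 /data/src/ user@dest:/data/dst/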
find a-bunch-of-files | xargs -P 10 do-something-with-a-file
-P max-procs
--max-procs=max-procs
Run up to max-procs processes at a time; the default is 1.
If max-procs is 0, xargs will run as many processes as
possible at a time.

> In fact, some compression modes would actually slow things down as my energy-efficient NAS is running on some slower Arm cores
Depending on the number/type of devices in the setup and the usage patterns, it can sometimes be effective to have a single more powerful router and use it directly as a hop for security or compression (or both) in front of a set of lower-power devices. Like, I know it's not E2EE in the same way to send unencrypted data to one OPNsense router, tunnel it with WireGuard (or Nebula or whatever you prefer) to another over the internet, and then go from there to a NAS. But if the NAS is in the same physically secure rack, directly attached to the router by hardline (or via an isolated switch), I don't think in practice it's enough less secure at the private-service level to matter. If the router is a pretty important lynchpin anyway, it can be favorable to lean more heavily on it so you can go cheaper and lower power elsewhere. Not that more efficiency, hardware acceleration, etc. are at all bad, and conversely it sometimes makes sense to have a powerful NAS/other servers and a low-power router, but there are good degrees of freedom there. Handier than ever in the current crazy times, when hardware that was formerly easily and cheaply available is now a king's ransom or gone and one has to improvise.
rsync -e "ssh -o Compression=no" ...

> Specifies whether to use compression. The argument must be yes or no (the default).
So I'm surprised you see speedups with your invocation.
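One way to check what's actually in effect for a given host (ssh -G prints the resolved client config):

    ssh -G yourhost | grep -i compression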
With rsync, you upload hashes of what you have, then the source has to do all the hashing work to figure out what to send you. It's slightly more efficient, but if you are supporting even tens of downloads it's a lot of work for the source.
The other option is to send just a diff, which I believe e.g. Google Chrome does. Google invented Courgette and Zucchini which partially decompile binaries then recompile them on the other end to reduce the size of diffs. These only work for exact known previous versions, though.
I wonder if the ideas of Courgette and Zucchini can be incorporated into zsync's hashes so that you get the minimal diff, but the flexibility of not having a perfect previous version to work from.
So the question "does rclone have that" doesn't make much sense, because it usually wouldn't be rclone implementing it.
For example, zsh does it here for rsync, which actually invokes `ssh` itself:
https://github.com/zsh-users/zsh/blob/3e72a52e27d8ce8d8be0ee...
https://github.com/zsh-users/zsh/blob/3e72a52e27d8ce8d8be0ee...
That said, some CLI tools ship helpers for shells to implement such things, e.g. `mytool completion-helper ...`
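rclone ships one of those too, at least for the static part - the subcommand name has changed over the years (`rclone completion` in recent releases, `rclone genautocomplete` in older ones), so check `rclone completion --help` on your version:

    # generate a zsh completion function and drop it somewhere on $fpath (path is just an example)
    rclone completion zsh ~/.zsh/completions/_rclone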
But I don't get rclone SSH completions in zsh, as it doesn't call `_remote_files` for rclone:
https://github.com/zsh-users/zsh/blob/3e72a52e27d8ce8d8be0ee...