I've actually spent some time debugging why git causes so many issues with the backup software I use (restic).

Ironically, I believe you have it backwards: pack files, git's solution to the "too many tiny files" problem, are the issue here; not the tiny files themselves.

In my experience, incremental backup software works best with many small files that never change. Scanning is usually just a matter of checking modification times and moving on. This isn't fast, but it's fast enough for backups and can be optimized by monitoring for file changes in a long-running daemon.
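The mtime-check loop described above can be sketched in a few lines. This is a toy model of my own devising, not restic's actual code; real tools also track file size, inode, and content hashes:

```python
import os
import tempfile

def changed_files(root, index):
    """Scan a tree and return files whose mtime differs from the stored index.

    `index` maps path -> last-seen mtime (nanoseconds) and is updated in
    place, roughly what an incremental scanner's local state looks like.
    """
    changed = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            mtime = os.stat(path).st_mtime_ns
            if index.get(path) != mtime:
                changed.append(path)
                index[path] = mtime
    return changed

# Demo on a throwaway directory.
root = tempfile.mkdtemp()
with open(os.path.join(root, "a.txt"), "w") as f:
    f.write("one")

index = {}
first = changed_files(root, index)   # everything is "new" on the first pass
second = changed_files(root, index)  # nothing has changed since
```

Files that never change fall straight through the `index.get` check, which is why a tree of many small static files is cheap to rescan.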

However, lots of mostly identical files ARE an issue for filesystems as they tend to waste a lot of space. Git solves this issue by packing these small objects into larger pack files, then compressing them.

Unfortunately, it's those pack files that cause issues for backup software: any time git "garbage collects" and creates new pack files, it ends up deleting and creating a bunch of large files filled with what looks like random data (due to compression). Constantly creating/deleting large files filled with random data wreaks havoc on incremental/deduplicating backup systems.
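A quick way to see the effect: flip one byte in a chunk of raw data and compare how much a block-level deduplicator could still reuse, before and after compression. This uses fixed offsets as a simplification (restic and borg actually use content-defined chunking), but the point stands: the compressed artifact diverges almost entirely after a tiny change.

```python
import zlib

def shared_blocks(a, b, size=64):
    """Count fixed-offset blocks that are byte-identical in both buffers
    (a crude stand-in for what a deduplicating backup could reuse)."""
    return sum(
        a[i:i + size] == b[i:i + size]
        for i in range(0, min(len(a), len(b)), size)
    )

base = b"The quick brown fox jumps over the lazy dog. " * 400
edited = base[:5] + b"X" + base[6:]  # flip a single byte near the front

raw_shared = shared_blocks(base, edited)
packed_shared = shared_blocks(zlib.compress(base), zlib.compress(edited))
# raw_shared: all blocks except the one containing the edit still match.
# packed_shared: the compressed streams diverge almost immediately.
```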

reply
I don’t think this is the right way to see this.

Why should a file backup solution adapt to work with git? Or any application? It should not try to understand what a git object is.

I’m paying them to copy files from a folder to their servers; they should just do that, no matter what the file is. Stay at the filesystem level, not the application level.

reply
I'm not saying Backblaze should adapt to git; the issue isn't application related (besides git being badly configured by default; there's a solution with git gc, it's just that git gc basically never runs).

It's that to back up a folder on a filesystem, you need to traverse that folder and check every file in that folder to see if it's changed. Most filesystem tools usually assume a fairly low file count for these operations.

Git, rather unusually, tends to produce a lot of files in regular use; before packing, every commit/object/branch is simply stored as a file on the filesystem (branches only as pointers). Packing fixes that by compressing commit and object files together, but it's not done by default (only after an initial clone or when the garbage collector runs). Iterating over a .git folder can take a lot of time in a place that's typically not very well optimized (since most "normal" people don't have thousands of tiny files in their folders that contain sprawled out application state.)
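For the curious, loose objects live under `.git/objects/` in a two-hex-character fan-out layout, one file per object (the real way to check is `git count-objects -v`; this is just a sketch of the on-disk shape, demoed on a fabricated layout):

```python
import os
import tempfile

def loose_object_count(git_dir):
    """Count loose (unpacked) objects: files under .git/objects/<2 hex>/.
    Packed objects live under .git/objects/pack/ and are skipped here."""
    objects = os.path.join(git_dir, "objects")
    hexdigits = set("0123456789abcdef")
    count = 0
    for entry in os.listdir(objects):
        if len(entry) == 2 and set(entry) <= hexdigits:
            count += len(os.listdir(os.path.join(objects, entry)))
    return count

# Fabricated layout standing in for a real repository.
git_dir = os.path.join(tempfile.mkdtemp(), ".git")
os.makedirs(os.path.join(git_dir, "objects", "ab"))
os.makedirs(os.path.join(git_dir, "objects", "pack"))
for name in ("ab/" + "0" * 38, "ab/" + "1" * 38, "pack/pack-x.pack"):
    open(os.path.join(git_dir, "objects", name), "w").close()

n = loose_object_count(git_dir)  # counts the two loose objects only
```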

The correct solution here is either for git to change, or for Backblaze to implement better iteration logic (which will probably require special handling for git..., so it'd be more "correct" to fix up git, since Backblaze's tools aren't the only ones with this problem.)

reply
7za (the compression app) does blazingly fast iteration over any kind of folder. This doesn't require special code for git. Backblaze's backup app could do the same but rather than fix their code they excluded .git folders.

When I backup my computer the .git folders are among the most important things on there. Most of my personal projects aren't pushed to github or anywhere else.

Fortunately I don't use Backblaze. I guess the moral is don't use a backup solution where the vendor has an incentive to exclude things.

reply
IMHO, you can't do blazingly fast iteration over folders with small files in Windows, because every open is hooked by the anti-virus, and there goes your performance.
reply
Not just antivirus, there's also file locking.

Windows takes a much harsher approach to file locking than Linux, and backup software like Backblaze absolutely should make use of it (lest it back up files mid-modification). But that also means the software effectively has to ask the OS to lock each file, then release the lock when it's done. With a large number of files, that stacks up.

Linux file locking is, to put it mildly, deficient. Most software doesn't even bother acquiring locks in the first place. Piling onto that, basically nobody uses POSIX locks because the API has some very heavy footguns (most notably, every lock on a file is released whenever any close() on that file is called, even if another part of the same process still holds its own lock on it). Most Linux file locks instead work on the honor system: you create a file called filename.lock in the same directory as the file you're working on, and any software that sees filename.lock exists should leave the file alone.
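The honor-system scheme is simple to sketch (git itself does essentially this with files like index.lock): creating the lock file with O_EXCL is atomic, so only one process can win, and everyone else is trusted to back off.

```python
import os
import tempfile

def try_lock(lockpath):
    """Advisory ".lock file" convention: O_CREAT | O_EXCL makes creation
    atomic, so at most one process can hold the lock at a time. Anyone who
    ignores the file can still touch the data; it's purely cooperative."""
    try:
        fd = os.open(lockpath, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True
    except FileExistsError:
        return False

def unlock(lockpath):
    os.unlink(lockpath)

lock = os.path.join(tempfile.mkdtemp(), "data.txt.lock")
first = try_lock(lock)    # acquired
second = try_lock(lock)   # refused: someone already holds it
unlock(lock)
third = try_lock(lock)    # free again
```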

Nobody using file locks is probably the bigger reason why Linux chokes less on fast iteration than Windows, given that Windows is slow with loads of files even when you aren't running a virus scanner.

reply
Actually, once the initial backup is done, there's no reason to scan for changes. They can just use a Windows service that tells them when any file is modified or created and add that file to their backup list.
reply
Backblaze offers 'unlimited' backup space, so they have to do this kind of thing as a result of that poor marketing choice.
reply
No they don’t. They just have to price the product to reflect changing user patterns. When Backblaze started, it was simply “we back up all the files on your drive”; they didn’t even have a restore feature, that was your job when you needed it. Over time they realized user behavior had changed: cloud drives were a huge data source they hadn’t priced in, git gave them problems they hadn’t factored in, etc. The issue is that their solution is to exclude the problem, which leaves them a half-baked product for many of their users. They should have just changed the pricing and supported the backup solution people need today.
reply
If they must scam, shouldn’t they be deduplicating on the server rather than the client?
reply
FWIW some other people in this thread are saying the article is wrong about .git folders not being backed up: https://news.ycombinator.com/item?id=47765788

That's a really important fact that's getting buried so I'd like to highlight it here.

reply
I think it's understandable for both Backblaze and most users, but surely the solution is to add `.git` to their default exclusion list which the user can manage.
reply
It's probably primarily because Linus is a kernel and filesystem nerd, not a database nerd, so he preferred to just use the filesystem, whose performance characteristics he understood well (at least on Linux).
reply
I think they shouldn't back up git objects individually because git handles the versioning information. Just compress the .git folder itself and back it up as a single unit.
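Packing a directory into one archive, as suggested above, can be sketched like this. Note the caveat from elsewhere in the thread: a real tool should snapshot or pause writers first so the copy is consistent; `snapshot_dir` is a hypothetical name, not any vendor's API.

```python
import os
import tempfile
import zipfile

def snapshot_dir(src, dest_zip):
    """Pack a directory into a single archive so backup tools see one big
    file instead of thousands of tiny ones."""
    with zipfile.ZipFile(dest_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for dirpath, _dirs, files in os.walk(src):
            for name in files:
                path = os.path.join(dirpath, name)
                # Store paths relative to src so the archive is relocatable.
                zf.write(path, os.path.relpath(path, src))

# Demo on a fabricated ".git"-like directory.
src = tempfile.mkdtemp()
os.makedirs(os.path.join(src, "objects", "ab"))
with open(os.path.join(src, "objects", "ab", "cafe"), "w") as f:
    f.write("loose object")

dest = os.path.join(tempfile.mkdtemp(), "git-backup.zip")
snapshot_dir(src, dest)
names = zipfile.ZipFile(dest).namelist()
```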
reply
Better yet, include deduplication, incremental versioning, verification, and encryption. Wait, that's borg / restic.

This is a joke, but honestly anyone here shouldn't be directly backing up their filesystems and should instead be using the right tool for the job. You'll make the world a more efficient place, have more robust and quicker to recover backups, and save some money along the way.

reply
This is a good point, but you might expect them to back up untracked and modified files, along with everything else on your filesystem.
reply
Eh, you really shouldn't do that for any kind of file that acts like an (impromptu) database. That's how you get corruption, especially when change information can be split across more than one file.
reply
Sorry, what are you saying shouldn't be done? Backing up untracked/modified files in a git repo? Or compressing the .git folder and backing it up as a unit?
reply
> Backing up untracked/modified files in a git repo?

This. It's best done as an atomic operation, such as a VSS-style snapshot, so the copy is consistent, with operations on the files paused or finished. Something like a zip is generally better because it occupies the filesystem for less time than the upload typically takes.

reply
> If it'd been something like an SQLite database instead (just an example really)

See Fossil (https://fossil-scm.org/)

P.S. There's also (https://www.sourcegear.com/vault/)

> SourceGear Vault Pro is a version control and bug tracking solution for professional development teams. Vault Standard is for those who only want version control. Vault is based on a client / server architecture using technologies such as Microsoft SQL Server and IIS Web Services for increased performance, scalability, and security.

reply
Git packs objects into pack-files on a regular basis. If it doesn't, check your configuration, or do it manually with 'git repack'.
reply
I decided to look into this (git gc should also be doing this), and I think I figured out why it's such a consistent issue with git in particular. Running git gc does properly pack objects together and reduce inode count to something much more manageable.

It's the same reason why the postgres autovacuum daemon tends to be borderline useless unless you retune it[0]: the defaults are barmy. git gc only runs once there are around 6700 loose (unpacked) objects[1]. Most typical filesystem tools tend to start balking at traversing ~1000 files in a structure (depends a bit on the filesystem/OS as well; Windows tends to get slower a good bit earlier than Linux).

To fix it, running

> git config --global gc.auto 1000

should retune it, and any subsequent commit to your repos will trigger garbage collection properly once there are around 1000 loose files. Pack file management seems to be properly tuned by default: at more than 50 packs, gc will repack into a larger pack.

[0]: For anyone curious, the default postgres autovacuum setting runs only when 10% of the table consists of dead tuples (roughly: deleted+every revision of an updated row). If you're working with a beefy table, you're never hitting 10%. Either tune it down or create an external cronjob to run vacuum analyze more frequently on the tables you need to keep speedy. I'm pretty sure the defaults are tuned solely to ensure that Postgres' internal tables are fast, since those seem to only have active rows to a point where it'd warrant autovacuum.

[1]: https://git-scm.com/docs/git-gc

reply
I needed to use

> git config --global gc.auto 1000

with the long option name, and no `=`.

reply
A few thousand files shouldn't be a problem to a program designed to scan entire drives of files. Even in a single folder and considering sloppy programs I wouldn't worry just yet, and git's not putting them in a single folder.
reply
I love nothing more than running strange git commands found in HN comments.

Let's ride the lightning and see if it does anything.

reply
You don’t see ZFS/BTRFS block-based snapshot replication choking on git or any other sort of dataset. Use the right tool for the job, or something.
reply
deleted
reply