
How much data do you have? I'm using git-annex for my photos; that's around 100k-1M files, several TB of data, on ZFS. In the beginning everything was fine, but it has become increasingly slow, such that every operation takes several minutes (5-30 mins or so).

I wonder a bit whether that is ZFS, or git-annex, or maybe my disk, or something else.



My experience is the same: git-annex just doesn't work well with lots of small files. With annexes on slow USB disks, connected to a Raspberry Pi 3 or 4, I'm already annoyed when working with my largest annex (in file count) of 25000 files.

However, I mostly use annex as a way to archive stuff and make sure I have enough copies in distinct physical locations. So for photos I now just tar them up with one .tar file per family member per year. This works fine for me for any data I want to keep safe but don't need to access directly very often.
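
Roughly what that looks like (names and paths are made up):

    # one tarball per person per year; annex then tracks a handful of
    # large files instead of tens of thousands of small ones
    tar -cf alice-2023.tar photos/alice/2023/
    git annex add alice-2023.tar
    git commit -m "archive alice 2023"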


It would be great to have comprehensive benchmarks for git lfs, git annex, DVC, and the like. I also always get annoyed with one or the other, e.g. due to the hashing overhead. However, in many cases the annoyances come from bad filesystem integration, on Windows in my case.


My guess is the Windows virus scanner.


Why? WHY?! Why the heck are you using a (D)VFS on your immutable data? What is the reasoning? That stuff is immutable and usually incremental.. just throw a proper syncing algorithm at it and sync with backups.. that's all. I wonder about the logic behind this...

Docs and other files you often change are a completely different story. This is where a DVFS shines. I wrote my own very simple DVFS exactly for that case. You just create a directory, init the repo manager.. and voilà.. A disk-wide VFS is kinda useless, as most of your data there just sits..


I also used to use git-annex for my photos, but ended up getting frustrated with how slow it was and wrote aegis[1] to solve my use case.

I wrote a bit about why in the readme (see archiving vs backup). In my opinion, syncing, snapshots, and backup tools like restic are great but fundamentally solve a different problem from what I want out of an archive tool like aegis, git-annex, or boar[2].

I want my backups to be automatic and transparent, and for that restic is a great tool. But for my photos, my important documents, and other immutable data, I want to manually accept or reject any change that happens to them, since I might not always notice when something changes. For example, if I fat-finger an rm, or a bug in a program overwrites something without my noticing.
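
A low-tech version of the same idea, just to illustrate it (the manifest name is made up, and this won't notice brand-new files):

    # record checksums once, then re-verify later and review any mismatch
    # by hand before accepting it into the archive
    find photos/ -type f -print0 | xargs -0 sha256sum > manifest.sha256
    sha256sum -c --quiet manifest.sha256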

[1]: https://git.sr.ht/~alexdavid/aegis

[2]: https://github.com/mekberg/boar


While I understand why git-annex wouldn't work for you, what gaps did you find in boar?


I might have to give aegis a try.


I don't really need the versioning aspect that much, but sometimes I modify the photos a bit (e.g. rotating them). All the other things are relevant for me, though, like having it distributed, syncing, only partially having the data on a particular node, etc.

So, what solution would be better for that? In the end it seems that other solutions provide a similar set of features. E.g. Syncthing.

But what's the downside with Git-annex over Syncthing or other solutions?


If you want two-way distributed syncing, that is a bit more complicated and error prone, but most tools support it, even rsync. A simpler approach is to have a central primary node (whether it's a desktop or a storage box) where you collect the data, and then sync it out to backups.
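
For example, something like this from the primary node (hosts and paths are made up; I would review a dry run first):

    # push the primary copy out to each backup location; --delete mirrors
    # removals too, so check with -n (dry run) before trusting it
    rsync -a --delete /data/photos/ backup-box:/backup/photos/
    rsync -a --delete /data/photos/ /mnt/usb-backup/photos/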

As I said, handling immutable (incremental) data is easy. You just copy and sync. Kinda trivial. The problem I personally had was with all the important docs (and similar files) I work on. First, I wanted snapshots and history, in case of some mistake or failure. Data checksumming, because they are important. Also, full peer-to-peer syncing, because I have a desktop, servers, VMs, and a laptop, so I want to sync data around. And because I really like Git, a great tool for VCS, I wanted something similar but for generic binary data. Hence my interest in a DVFS. At first I wanted a full-blown mountable DVFS, but that is complicated and much harder to make portable.. The repository approach is easy to implement and is portable (Cygwin, Linux, UNIX, POSIX). Works like a charm.

As for downsides, if you think git-annex will work for you, just use it :) For me, it was far too complicated (too many moving parts) even for my DVFS use case. For immutable data it is absolute overkill to keep 100s of GBs of data there. I just sync :)


> Why the heck are you using (D)VFS on your immutable data?

Git-annex does not put your data in Git. What it tracks using Git is what’s available where, updating that data on an eventually consistent basis whenever two storage sites come into contact. It also borrows Git functionality for tracking moves, renames, etc. The object-storage parts, on the other hand, are essentially a separate content-addressable store from the normal one Git uses for its objects.

(The concrete form of a git-annex worktree is a Git-tracked tree of symlinks pointing to .git/annex/objects under the repo root, where the actual data is stored as read-only files, plus location-tracking data indexed by object hash in a separate branch called “git-annex”, which the git-annex commands manipulate using special merge strategies.)
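
Concretely, an annexed file looks roughly like this (the key and hash directories here are illustrative):

    # a worktree file is just a symlink into the annex object store
    $ readlink img_0001.jpg
    .git/annex/objects/Wx/2m/SHA256E-s1048576--3fe2...91.jpg/SHA256E-s1048576--3fe2...91.jpg

    # the per-key location log lives in the separate "git-annex" branch
    $ git annex whereis img_0001.jpg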


I am looking into using Git for my photos/videos backup external HDDs, and the reasoning is simple. It's not about keeping track of changes within the files themselves since, like you said, they (almost) never change. Rather, it's about keeping track of changes in _folders_: when I last copied images from my phones, cameras, etc. to my HDDs, which folders I touched, what changed if I reorganized existing files into a different folder structure, and so on. It also acts as a rollback mechanism if I ever fat-finger and delete something accidentally. I wonder if there's a better tool for this, though.
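
Roughly the workflow I have in mind (just a sketch, with made-up paths and commit messages; note that plain git keeps a full copy of every binary under .git, which may be a dealbreaker for multi-TB drives):

    cd /mnt/photo-hdd
    git init
    git add -A && git commit -m "import from phone, May batch"
    git log --stat                      # when did I last import, what moved where
    git checkout HEAD~1 -- 2023/trip/   # roll back an accidental delete or reorg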


Then I think some syncing software like rsync will probably be better. Not sure how often you keep changing archived folders. I split that work between TRASH-like dirs and archives. When I'm done with files, I move them out of TRASH to their proper place and that's it. I prefer the KISS approach, but whatever works for you :)


Why... not? Git just works for syncing data and version control and we're all familiar with it. It is also secure, reliable, available everywhere, decentralized, with built-in access control, deduplication, e2ee with gitcrypt... In short, it is great.

The problem is performance in some use cases, but I don't see anything fundamentally wrong with using git for sync.
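
For the gitcrypt part, the setup is roughly this (the path pattern and key ID are placeholders):

    git-crypt init
    # mark what should be encrypted before committing it
    echo 'private/** filter=git-crypt diff=git-crypt' >> .gitattributes
    git-crypt add-gpg-user YOUR_GPG_KEY_ID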


Git wasn't designed for generic binary blob handling. Sure, if your repo is small and you set proper .gitattributes, it will work fine. But I would advise using a generic DVFS for such a task.
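
For example, .gitattributes entries along these lines (the extensions are just examples):

    # treat these as opaque binaries: no text conversion, no diffs or merges
    *.jpg binary
    *.mp4 binary
    # don't attempt delta compression on big blobs when packing
    *.mp4 -delta
    *.iso -delta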


One thing to check is whether any security/monitoring software might be causing issues. Since there are so many files in git repos, it can put a lot of load on that type of software.


I tested a git-annex repository with about 1.5M files and it got pretty slow as well. The plain git repo grew to multiple GiB and plain git operations were super slow, so I think this was mostly a git limitation. DataLad's approach of nested subdatasets (in practice, git submodules where each submodule is a git-annex repository) can help, if it fits the data and workflows.
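
A rough sketch of that layout (dataset names are made up):

    # each year becomes its own git-annex repo, registered as a subdataset
    # (a git submodule) of the parent, so no single repo indexes millions of files
    datalad create photos
    datalad create -d photos photos/2022
    datalad create -d photos photos/2023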




