One of the maintainers here. To be honest, I published this link specifically to emphasize the experiment management aspect of DVC. Historically, because of its name (Data Version Control), users have perceived it as a pure replacement for LFS scenarios, while in reality it has always had pipelines, metrics, etc.

I 100% agree that managing large datasets by moving them around is not practical, and definitely not in the LFS/DVC style. There should be a level of indirection if reproducibility is needed: pointers to the files are versioned, not the data itself, and the data should stay in the cloud.
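To make the indirection concrete, here is a rough sketch of what the pointer looks like (the hash, sizes, and paths below are made up):

    $ dvc add data/images          # content goes into the local cache
    $ cat data/images.dvc          # this small pointer is what Git versions
    outs:
    - md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d.dir
      size: 104857600
      nfiles: 1000
      path: images
    $ dvc push                     # the actual data goes to remote/cloud storage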

Here, I would love to mention one more time some of the other cool features DVC has, e.g. the `dvc exp` set of commands, which creates custom Git refs to snapshot experiments, or the DVCLive logger, which helps capture metrics, plots, etc. There is also a VS Code extension [1] that provides quite a nice experience for the experiments workflow.
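As a quick sketch of that workflow (the parameter name and experiment id here are made up):

    $ dvc exp run -S train.lr=0.001   # run the pipeline with an overridden param;
                                      # the result is snapshotted under a custom Git ref
    $ dvc exp show                    # compare params and metrics across experiments
    $ dvc exp apply exp-1a2b3         # bring a chosen experiment back into the workspace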

The point here is that for DVC, the ability to capture large files and directories (ones that do not fit into Git) was always a low-level mechanism to support higher-level scenarios (e.g. saving a model somewhere as the output of an experiment).
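For example, a minimal pipeline stage that declares a model as a tracked output might look like this (file and stage names are illustrative):

    $ cat dvc.yaml
    stages:
      train:
        cmd: python train.py
        deps:
          - train.py
          - data/images
        outs:
          - model.pkl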

[1] https://marketplace.visualstudio.com/items?itemName=Iterativ...



> I 100% agree that managing large datasets by moving them around is not practical, and definitely not in the LFS/DVC style. There should be a level of indirection if reproducibility is needed: pointers to the files are versioned, not the data itself, and the data should stay in the cloud.

I am not sure I understand that correctly. Are you saying that LFS/DVC manage the data suboptimally because they do not use some kind of pointer?

I only have some experience with DataLad [0], not with DVC or LFS. DataLad is built on git-annex, which does a pointer indirection through symlinks or pointer files in Git. You basically manage the directory structure in Git and can "get" and "drop" specific files as you need them. git-annex keeps track of where the data lives (i.e. on which remote system, which could be anything from an HTTP server to S3 to a Nextcloud instance via WebDAV, and more) and how it can be fetched. I always thought DVC did something similar.
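For illustration, the day-to-day workflow looks roughly like this (the file name is made up):

    $ datalad get data/scan-001.nii        # fetch the content from whichever remote has it
    $ git annex whereis data/scan-001.nii  # list which remotes hold a copy
    $ datalad drop data/scan-001.nii       # free local space; the pointer stays in git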

[0] https://www.datalad.org/



