One of the maintainers here. To be honest, I published this link specifically to emphasize the experiment management aspect of DVC. Historically, because of its name (Data Version Control), users have perceived it as a pure replacement for LFS scenarios, while in reality it has always had pipelines, metrics, etc.

I 100% agree that managing large datasets by moving them around is not practical, and definitely not in the LFS/DVC style. There should be a level of indirection if reproducibility is needed: pointers to the files are versioned, not the data itself, and the data should stay in the cloud.
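To make the indirection concrete, here is a rough sketch of what the pointer looks like (the hash, sizes, and paths below are made up):

    $ dvc add data/images          # content goes into the local cache
    $ cat data/images.dvc          # this small pointer is what Git versions
    outs:
    - md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d.dir
      size: 104857600
      nfiles: 1000
      path: images
    $ dvc push                     # the actual data goes to remote/cloud storage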

Here, I would love to mention one more time some of the other cool features DVC has, e.g. the `dvc exp` set of commands, which creates custom Git refs to snapshot experiments, or the DVCLive logger, which helps capture metrics, plots, etc. There is also a VS Code extension [1] that provides quite a nice experience for the experiments workflow.
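As a quick sketch of that workflow (the parameter name and experiment id here are made up):

    $ dvc exp run -S train.lr=0.001   # run the pipeline with an overridden param;
                                      # the result is snapshotted under a custom Git ref
    $ dvc exp show                    # compare params and metrics across experiments
    $ dvc exp apply exp-1a2b3         # bring a chosen experiment back into the workspace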

The point here is that for DVC, the ability to capture large files and directories (ones that do not fit into Git) was always a low-level mechanism to support higher-level scenarios (e.g. saving a model somewhere as the output of an experiment).
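For example, a minimal pipeline stage that declares a model as a tracked output might look like this (file and stage names are illustrative):

    $ cat dvc.yaml
    stages:
      train:
        cmd: python train.py
        deps:
          - train.py
          - data/images
        outs:
          - model.pkl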

[1] https://marketplace.visualstudio.com/items?itemName=Iterativ...



> I 100% agree that managing large datasets by moving them around is not practical, and definitely not in the LFS/DVC style. There should be a level of indirection if reproducibility is needed: pointers to the files are versioned, not the data itself, and the data should stay in the cloud.

I am not sure I understand that correctly. Are you saying that LFS/DVC manage the data suboptimally because they do not use some kind of pointer?

I only have some experience with DataLad [0], not with DVC or LFS. DataLad is built on git-annex, which does a pointer indirection through symlinks or pointer files in Git. You basically manage the directory structure in Git and can "get" and "drop" specific files as you need them. git-annex keeps track of where the data lives (i.e. on which remote system, which could be anything from an HTTP server to S3 to a Nextcloud instance via WebDAV, and more) and how it can be fetched. I always thought DVC did something similar.
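For illustration, the day-to-day workflow looks roughly like this (the file name is made up):

    $ datalad get data/scan-001.nii        # fetch the content from whichever remote has it
    $ git annex whereis data/scan-001.nii  # list which remotes hold a copy
    $ datalad drop data/scan-001.nii       # free local space; the pointer stays in git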

[0] https://www.datalad.org/



