In software engineering, version control refers to a broad class of tools (in practice, mostly Git) for tracking and managing edits and contributions to a project. It's motivated by two needs: being able to return to an earlier version of the program, and coordinating work on team-based projects.
Lock-based version control systems don't allow two people to work on the same file at the same time: a file must be checked out and locked before editing, and nobody else can change it until the lock is released. Older systems like RCS work this way. This model scales terribly, and we shouldn't use these systems.
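A rough sketch of the locking workflow with RCS (the file name here is just illustrative):

```sh
# Check out the file and take the exclusive lock
co -l parser.c

# Edit the working copy while holding the lock...
# ...then check in a new revision and release the lock
ci -u parser.c

# Anyone else running `co -l parser.c` in the meantime is refused:
# RCS reports that the revision is already locked by the first user.
```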
Merge-based systems let many people modify the same file independently in their own working copies; the system reconciles those changes, merging them or flagging conflicts, only when each person commits and pushes their work back. Git and SVN work this way.
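A sketch of how this plays out in Git (the repository URL and file name below are made up):

```sh
# Both developers start from their own clone
git clone https://example.com/team/project.git
cd project

# Each edits model.py independently and commits locally
git add model.py
git commit -m "Adjust preprocessing step"

# The first push succeeds; the second is rejected as non-fast-forward.
# That person pulls (fetch + merge, with conflicts flagged if the edits
# overlap) and then pushes the merged result.
git push
git pull
git push
```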
Data engineering
In general, the datasets we work on should not be added to the same version control system that our code uses (i.e., the Git repository for our code shouldn't contain data; the sketch after this list shows one way to keep it out). There are a few reasons for this:
- Large storage overhead: datasets take up an enormous amount of space, and since Git keeps every version in history, the repository grows and pushes slow down each time the data changes.
- Hosting limits: GitHub, for example, rejects individual files over 100 MB and imposes limits on push and overall repository size.
- Complexity: data is often different in nature from source code (e.g., large binary files rather than small text files that diff and merge cleanly), and a huge number of files makes it hard for a code-oriented version control system to stay robust and fast.
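In practice, keeping data out of the code repo can be as simple as ignoring the data directories; a minimal sketch (the directory names and file patterns are just an example layout):

```sh
# Ignore data directories and large derived artifacts in the code repo
cat >> .gitignore <<'EOF'
data/
datasets/
*.parquet
EOF

git add .gitignore
git commit -m "Keep datasets out of the code repository"
```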
That said, small and critical datasets can be checked into version control. This aids reproducibility and collaboration, and is especially helpful when the data pipeline is complicated, but it should be the exception rather than the rule.
What we should do instead is use a separate version control system designed for data. That way the data isn't left out in the wild west (untracked, with different versioning conventions across teams), and we get reproducibility and auditability, especially when the data changes frequently.
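DVC (Data Version Control) is one commonly used option: Git tracks small pointer files while the data itself lives in separate storage. A minimal sketch, where the `data/raw` directory and the S3 bucket are purely illustrative:

```sh
# Set up DVC inside an existing Git repository
dvc init
git commit -m "Initialize DVC"

# Track the dataset; DVC writes a small data/raw.dvc pointer file
# and tells Git to ignore the data itself
dvc add data/raw
git add data/raw.dvc data/.gitignore
git commit -m "Track raw dataset with DVC"

# Point DVC at remote storage for the actual data, then upload it
dvc remote add -d storage s3://example-bucket/dvc-store
git add .dvc/config
git commit -m "Configure DVC remote"
dvc push

# Teammates reproduce the exact same data version with:
#   git pull && dvc pull
```

Git LFS is another option when the data is modest in size and tightly coupled to the code.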