Data Versioning with DVC
This tutorial provides a detailed demonstration of approaches to data and model versioning.
- Instructions for setting up the environment are in the repository's README.
- Overview of DVC's data versioning features and use cases.
Tutorial: Data Versioning with DVC
To conveniently view the structure of files and directories in the terminal, the tree utility can be useful. Here's how to install it on Linux:
- Open the terminal.
- Ensure that your system has internet access.
- Install tree with your distribution's package manager (the exact command depends on the distribution).
- When prompted, enter the administrator (root) password.
- Wait for the installation to complete.
Once the installation is finished, you can use the tree command to view the structure of files and directories in the terminal. For example, run tree <directory_path>, replacing <directory_path> with the path to the directory you want to view.
Next, clone the example repository "dvc-2-data-versioning" with git clone and change into its directory.
How does data versioning work?
To understand how data versioning works in DVC, let's perform the following steps:
1. Check the status of the DVC project with the dvc status command.
2. Download a file and add it under DVC control using the dvc add command. The -v flag will display more detailed logs, allowing you to peek under the hood of what's happening.
3. Commit the changes to Git. dvc add creates a small .dvc metadata file next to the data, and this file, not the data itself, is what Git needs to track.
4. Examine the DVC cache directory and observe the cached file.
5. Examine the
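To make the "under the hood" part more concrete, here is a minimal Python sketch of content-addressed caching, the idea behind dvc add. It is an illustration under stated assumptions, not DVC's actual implementation: the helper names file_md5 and add_to_cache are hypothetical, and the layout shown (the first two hex characters of the MD5 as a subdirectory) follows the DVC 2.x cache, while newer releases nest the same structure under an extra files/md5/ prefix.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_to_cache(path: Path, cache_dir: Path) -> Path:
    """Copy a file into a content-addressed cache and return the cached path."""
    md5 = file_md5(path)
    # Assumed layout: <cache>/<first 2 hex chars>/<remaining 30 hex chars>
    target = cache_dir / md5[:2] / md5[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # identical content is stored only once
        shutil.copy2(path, target)
    return target


# Demo in a throwaway directory; the file content is made up.
workdir = Path(tempfile.mkdtemp())
data_file = workdir / "data.xml"
data_file.write_bytes(b"<rows></rows>\n")

cached = add_to_cache(data_file, workdir / ".dvc" / "cache")
print(cached.relative_to(workdir))
```

Because the cache path is derived from the content alone, adding the same file twice stores it once, and any change to the file produces a different cache entry.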
Add a directory under version control
Adding a directory under version control in DVC is similar to adding a file. The main difference lies in how the metadata is stored, but DVC takes care of those details for us.
Let's go through an example:
For this example, let's test the concept of a Data Registry. DVC allows us to use Git repositories for centralized data management. We can fetch the required version of the data by specifying a Git revision (--rev), such as a branch name, commit hash, or tag.
You can explore the downloaded dataset, for example with the tree command.
Now let's add the datadir directory under DVC control with dvc add.
You can examine the .dvc file associated with the datadir directory.
Check that the changes are reflected in Git, for example with git status.
To preserve the metadata in the Git history, it's crucial to make a commit.
By following these steps, you can add a directory to DVC, track its metadata, and associate it with a specific version of the data in Git.
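The difference in how directory metadata is stored can be sketched as follows. As far as can be stated with confidence, DVC records a directory as a manifest listing each file's hash and relative path, and identifies the directory by a hash of that manifest with a .dir suffix. The Python below imitates that idea only; the function names are hypothetical, and the exact serialization DVC uses may differ.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def dir_manifest(root: Path) -> list[dict]:
    """List every file under root with its MD5, sorted by relative path."""
    entries = [
        {
            "md5": hashlib.md5(path.read_bytes()).hexdigest(),
            "relpath": path.relative_to(root).as_posix(),
        }
        for path in root.rglob("*")
        if path.is_file()
    ]
    entries.sort(key=lambda entry: entry["relpath"])
    return entries


def dir_hash(manifest: list[dict]) -> str:
    """Hash the serialized manifest; the ".dir" suffix marks a directory."""
    payload = json.dumps(manifest, sort_keys=True).encode()
    return hashlib.md5(payload).hexdigest() + ".dir"


# Demo with made-up files in a throwaway "datadir".
root = Path(tempfile.mkdtemp()) / "datadir"
(root / "cats").mkdir(parents=True)
(root / "dogs").mkdir()
(root / "cats" / "1.jpg").write_bytes(b"cat")
(root / "dogs" / "1.jpg").write_bytes(b"dog")

manifest = dir_manifest(root)
before = dir_hash(manifest)

(root / "dogs" / "2.jpg").write_bytes(b"another dog")  # change the directory
after = dir_hash(dir_manifest(root))
print(before != after)
```

A single small .dvc file can therefore stand in for a directory of any size: adding, removing, or editing one file changes the manifest and with it the directory hash.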
Track data status changes
To track the status of the data, use the dvc status command.
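Conceptually, a status check compares the hash recorded when the data was added with the hash of what is in the workspace now. The sketch below illustrates that comparison in plain Python; data_status is a hypothetical helper, and the returned labels are simplified stand-ins rather than DVC's exact output.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the MD5 hex digest of a (small) file."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def data_status(path: Path, recorded_md5: str) -> str:
    """Compare a workspace file with the hash recorded for it."""
    if not path.exists():
        return "deleted"
    return "up to date" if md5_of(path) == recorded_md5 else "modified"


workdir = Path(tempfile.mkdtemp())
data_file = workdir / "data.xml"
data_file.write_text("v1")
recorded = md5_of(data_file)      # what a .dvc file would store

status_before = data_status(data_file, recorded)
data_file.write_text("v2")        # change the data after "adding" it
status_after = data_status(data_file, recorded)
data_file.unlink()                # remove the data entirely
status_removed = data_status(data_file, recorded)
print(status_before, status_after, status_removed, sep=" | ")
```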
Updating Tracking Files
Let's switch to another branch and update the data by fetching the cats-dogs-v2 version from the Data Registry.
For the sake of the experiment, first remove the current version of the data.
Then fetch the new data from the Data Registry, this time specifying cats-dogs-v2 as the revision (--rev).
Check the status of the tracked data with dvc status.
Add the updated data directory under DVC control with dvc add.
Check the Git status; the .dvc file should now show up as modified.
Stage the updated .dvc file and make a commit.
Now, let's explore how to switch between different versions of the data:
- git checkout switches to the desired Git branch, and with it to the .dvc files committed on that branch.
- dvc checkout updates the data in the workspace to match those .dvc files.
Check out the initial branch.
Switch to the first version of the Cats&Dogs data (branch cats-dogs-v1).
By following these steps, you can track changes to your data, update the tracked files, and switch between different versions of the data using Git and DVC.
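The division of labour between Git and DVC when switching versions can be sketched in a few lines of Python. This is a toy model, not DVC's code: the pointers dictionary stands in for the .dvc files that Git switches between, the flat cache directory and the labels.csv file are made up for the demo, and real DVC may link files from the cache instead of copying them.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())
cache = workdir / "cache"
cache.mkdir()
data_file = workdir / "labels.csv"   # placeholder data file


def commit_version(content: str) -> str:
    """Write content to the workspace, cache it, and return its hash."""
    data_file.write_text(content)
    md5 = hashlib.md5(data_file.read_bytes()).hexdigest()
    shutil.copy2(data_file, cache / md5)
    return md5


def checkout(md5: str) -> None:
    """Restore the workspace file from the cache entry named by md5."""
    shutil.copy2(cache / md5, data_file)


# Each branch carries only a small pointer (here just the hash), not the data.
pointers = {
    "cats-dogs-v1": commit_version("cat,dog\n"),
    "cats-dogs-v2": commit_version("cat,dog,more cats\n"),
}

checkout(pointers["cats-dogs-v1"])   # pointer from Git, data from the cache
restored_v1 = data_file.read_text()
checkout(pointers["cats-dogs-v2"])
restored_v2 = data_file.read_text()
print(repr(restored_v1), repr(restored_v2))
```

Both versions stay in the cache, so switching back and forth needs no download as long as the cache still holds the requested hash.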
Store and share data with Remote Storage
Set up your remote storage (local)
- We are using the /tmp directory in the examples of this tutorial for simplicity only.
- DO NOT use the /tmp directory for long-term storage of files. Many systems clear this directory automatically, for example on reboot.
Create a new remote with the dvc remote add command. The -d flag makes the local remote the default choice.
As you can see, .dvc/config has changed.
Commit the changes to Git.
Now, you have set up a local remote storage in DVC. This allows you to push and pull data and models to and from the specified directory.
This is a local setup, and you should consider using remote storage solutions for real-world scenarios to ensure data durability and accessibility.
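For orientation, the snippet below shows roughly what .dvc/config contains after a default remote has been added, and reads it back with Python's standard configparser. The remote name local and the path /tmp/dvc are assumptions made for this sketch; check your own .dvc/config for the actual values.

```python
import configparser

# Assumed contents of .dvc/config after adding a default remote named
# "local" that points at /tmp/dvc (both the name and the path are examples).
CONFIG_TEXT = """\
[core]
    remote = local
['remote "local"']
    url = /tmp/dvc
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# The [core] section names the default remote (this is what -d sets).
default_remote = parser["core"]["remote"]

# Each remote has its own section, quoted as 'remote "<name>"'.
section = "'remote \"{}\"'".format(default_remote)
remote_url = parser[section]["url"]
print(default_remote, remote_url)
```

Since the configuration is a plain text file inside the repository, committing it to Git shares the remote definition with everyone who clones the project.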
Push data to remote storage
To push your data to the remote storage, use the dvc push command.
This command pushes the data to the remote storage specified in the DVC configuration. You can verify the result by examining the .dvc file associated with the data, which records the hash of the pushed content.
Additionally, you can check the contents of the remote storage directory, /tmp/dvc/42 in this example.
Retrieve data from remote storage
To retrieve the data from the remote storage, follow these steps:
- Remove the locally cached file and data directory.
- Run dvc pull to download them again from the remote storage.
By performing these steps, you can push your data to a remote storage location and retrieve it whenever needed using DVC commands.
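Pushing and pulling can be pictured as copying whichever cache objects are missing on the other side. The Python sketch below models only that transfer step with a hypothetical sync_objects helper and a made-up object name; the real dvc pull additionally places the retrieved files back into the workspace.

```python
import shutil
import tempfile
from pathlib import Path


def sync_objects(src: Path, dst: Path) -> list[str]:
    """Copy every object file missing from dst; return the copied names."""
    copied = []
    for obj in sorted(p for p in src.rglob("*") if p.is_file()):
        rel = obj.relative_to(src)
        target = dst / rel
        if not target.exists():
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(obj, target)
            copied.append(rel.as_posix())
    return copied


base = Path(tempfile.mkdtemp())
cache, remote = base / "cache", base / "remote"
(cache / "42").mkdir(parents=True)
(cache / "42" / "abc").write_text("payload")   # made-up object name
remote.mkdir()

pushed = sync_objects(cache, remote)        # roughly what a push transfers
shutil.rmtree(cache)                        # simulate losing the local cache
cache.mkdir()
pulled = sync_objects(remote, cache)        # roughly what a pull transfers
pushed_again = sync_objects(cache, remote)  # nothing is missing any more
print(pushed, pulled, pushed_again)
```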
Find a dataset
You can use the dvc list command to explore a DVC registry hosted on any Git server.
For example, let's see what is available in the use-cases/ directory of the registry.
Download a dataset with dvc get
The dvc get command allows you to download a dataset to your working area without DVC control. It fetches the dataset from the specified location.
After running this command, you will see the downloaded cats-dogs/ folder, but it is not under DVC control. There won't be a corresponding .dvc file.
Download and track a dataset
The dvc import command downloads a dataset and automatically starts tracking its version with DVC. It allows you to easily update the dataset in your project when changes are made in the Data Registry. Note that dvc import first clones the source repository; if you reference it by an SSH URL, make sure SSH keys are set up as described in your Git hosting provider's instructions.
After running this command, you will see the newly downloaded data.xml file, and a corresponding data.xml.dvc file is created, indicating that DVC is tracking the data file.
By using these commands, you can find datasets available in the DVC registry, download datasets to your working area, and add them under DVC control for versioning and management.
In this tutorial, we explored the concept of data versioning using DVC (Data Version Control). We learned how to initialize DVC in a project, add files and directories under version control, track changes to the data, and switch between different versions of the data. We also saw how to set up a local remote storage and push data to it, as well as retrieve data from the remote storage. Additionally, we discovered how to access datasets from the DVC registry using commands like dvc get and dvc import.
Data versioning with DVC provides several benefits for data scientists and machine learning engineers. It allows for easy tracking of data changes, collaboration among team members, reproducibility of experiments, and efficient management of large datasets. By combining Git and DVC, you can have a comprehensive version control system for both your code and data.
Start using DVC in your data science projects to keep track of your data, manage different versions, and collaborate effectively with your team.
🎓 Additional Resources
To further enhance your understanding of data versioning with DVC, consider exploring the following resources:
- DVC Documentation: The official documentation provides comprehensive information on using DVC and its various features.
- DVC YouTube Channel: The official DVC YouTube channel hosts a collection of video tutorials and demos to deepen your knowledge of DVC.
- DVC Community Forum: Join the DVC community forum to engage with other users, ask questions, and share your experiences with DVC.
By leveraging these additional resources, you can continue to build your expertise in data versioning with DVC and unlock its full potential for your data science projects.