Importance of Data Versioning
- While AI/ML practices focus on data and models to develop business applications, the ecosystem cannot scale if data, parameters, and model versions are not adequately managed. As the practice grows, the data versioning system acts as the spine of the platform.
- Data versioning refers to saving new copies of your files when you make changes so that you can go back later and retrieve specific versions of your datasets. Such a system supports data management and orchestration while supplying input to the AI/ML models in various forms.
- Versioning is linked to a change due to reprocessing, correcting, or appending additional data in the structure, contents, or condition of the entity in question. This entity can be documents, software, pieces of code, data science models, or any other collection of information.
- Specifically, data versioning refers to the process of uniquely identifying data, much as code is versioned, to enable pulling and using any version of the data as required. In simpler words, it is a method to track changes in fluctuating data.
- When data is identified uniquely, data scientists can determine whether and how data has changed, which version of a dataset they are working with, and understand if a newer version is available.
- With proper versioning, data scientists can also understand data provenance: which processing step caused the data to change and how that change propagates across the data processing pipeline. Explicit versioning allows for repeatability in experiments, enables comparisons, and prevents confusion.
Let us understand in greater detail why data versioning is essential:
- Ensure better training data: ML involves rapid experimentation, iteration, and training models on data. Training on incorrect data can therefore have disastrous consequences for the outcomes of an ML project.
- Track data schema: Enterprise data is usually obtained in batches, and minor changes to the data schema are often applied over the course of a project. With proper versioning, you can easily track and evolve the data schema over time. You can also determine whether these changes are backward and forward compatible.
- Continual model training: In production environments, data is refreshed periodically and may trigger a fresh run of the model training pipeline. When such automated retraining occurs, it is important to have data versioned for tracking a model’s efficacy.
- Enhance traceability and reproducibility: Data scientists must be able to track data, identify its provenance, and point out which version of a dataset supports the outcomes of their experiments. They should be able to re-run the entire ML pipeline and reproduce the exact results each time, which means the original training data must always be available. From a reproducibility and traceability perspective, therefore, proper versioning is critical.
- Auditing: Proper versioning ensures that the integrity of data-based activities is upheld by identifying when modifications are made. By monitoring and analyzing the actions of both users and models, auditors can identify intentional and accidental lapses in user behavior. Data science auditors can thus examine the effect of data changes on model accuracy and determine best ML practices for the enterprise.

Pachyderm is one of the most popular data versioning frameworks, offering multiple features and SDK APIs. These can be integrated with an MLOps platform to use the features programmatically. It acts as a repository for data management and orchestration at all stages of MLOps. The repository may contain input/output datasets, hyperparameter configurations, trained models, and experiment metadata.
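As a brief illustration, the sketch below uses the python-pachyderm client to create a repository, commit a new version of a dataset, and list its version history. Method names vary across client versions, and the repository and file names here are examples only.

```python
import python_pachyderm

# Connect to the Pachyderm cluster (defaults to localhost:30650).
client = python_pachyderm.Client()

# Create a versioned repository for raw training data.
client.create_repo("training-data")

# Each commit creates a new, immutable version of the dataset.
with client.commit("training-data", "master") as commit:
    client.put_file_bytes(commit, "/labels.csv", b"id,label\n1,cat\n2,dog\n")

# Walk the version history of the repository.
for info in client.list_commit("training-data"):
    print(info.commit.id, info.description)
```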
We further discuss the key features of data versioning systems used in MLOps, and the advantages for users of those features.
Job Execution Schemas
- While training, models consume data in different ways, and the same dataset can yield different results for different sets of hyperparameters.
- Different workflow models for data ingestion and hyperparameter tuning are recommended depending on the nature of the data. Data versioning systems have built-in functionality to set up such hybrid workflows.
- Jobs can run based on various criteria (see the sketch following this list).
- For incremental data, the job runs on the set schedule with a fixed set of hyperparameters.
- For static data, jobs run based on variations in the hyperparameters.
- The results indicate which combination of hyperparameters yields the best outcome for a given learning rate and for performance on the test data. Refer to Annexure A for details.
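To make these criteria concrete, here is a minimal Python sketch of the two trigger modes. The function names, cron expression, and parameter grid are illustrative, not part of any specific versioning system's API.

```python
from itertools import product

SCHEDULE_CRON = "0 2 * * *"  # hypothetical nightly schedule for incremental data

def run_for_incremental_data(run_training, fixed_params):
    # Incremental data: the job fires on the set schedule with fixed hyperparameters.
    run_training(fixed_params)

def run_for_static_data(run_training, param_grid):
    # Static data: the job fires once per hyperparameter combination.
    keys = sorted(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        run_training(dict(zip(keys, values)))

# Example usage with a stand-in training function.
run_for_incremental_data(print, {"learning_rate": 0.05, "batch_size": 32})
run_for_static_data(print, {"learning_rate": [0.01, 0.1], "batch_size": [32, 64]})
```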
Data Flow Designs
- Data resides in repositories and is made available to the model for training, while the output is written to separate repositories. This is the simplest configuration for a pipeline.
- In complex situations, data comes from a combination of multiple repositories, with different ways of feeding it to the training models.
- Data is fed to the models in various forms: the unit of data can be all data at once, each directory independently, individual files one by one, or data combined from multiple inputs.
- Inputs from multiple repositories can be combined as unions, crosses, joins, and groups, as illustrated in the sketch below. Refer to Annexure B for details.
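As an illustration, the following Pachyderm-style pipeline specification, expressed as a Python dict, combines two input repositories with a cross. The repository names, glob patterns, and transform image are examples only.

```python
# Every datum from "images" is paired with every datum from "params".
pipeline_spec = {
    "pipeline": {"name": "train-model"},
    "input": {
        "cross": [
            # glob "/*" treats each top-level entry as an independent datum
            {"pfs": {"repo": "images", "glob": "/*"}},
            # glob "/" feeds the whole repository as a single datum
            {"pfs": {"repo": "params", "glob": "/"}},
        ]
    },
    "transform": {
        "image": "example/trainer:latest",
        "cmd": ["python3", "/train.py"],
    },
}
```

Swapping "cross" for "union", "join", or "group" changes how datums from the two repositories are combined, without any change to the training code itself.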
Data Provenance
- Another vital aspect of data versioning is data provenance: the ability to track any result back to its raw input, including all analysis, code, and intermediate products.
- xpresso implements provenance for both commits and repositories. You can track revisions of the data and understand the connection between the data stored in one repository and the results in another repository.
- Data scientists use the provenance feature to build on each other’s work, share, transform, and update datasets, and automatically maintain a complete audit trail so that all results are reproducible.
- For example, an individual’s mortgage application goes through multiple approval steps that examine various data points and analyses. Let’s assume the decision is unfavorable and the applicant challenges it, wanting to know the reason for the decision. The historical trail of data is essential to confirm that there was no bias in making the decision.
- Reproducibility of the results is a critical factor in establishing this. Results can only be reproduced when both the exact model trained on a particular dataset and the production dataset used to derive the results are available.
- Reproducibility of results is also part of many regulatory requirements.
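The sketch below shows, in simplified form, how such a provenance chain can be represented and walked back from a result to its raw inputs. The record structure, repository names, and commit identifiers are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class ProvenanceRecord:
    """A result commit with pointers to the inputs and code that produced it."""
    result_commit: str
    input_commits: dict = field(default_factory=dict)  # repo name -> commit id
    code_version: str = ""

def trace(record, ledger):
    """Walk the provenance chain from a result back to its raw inputs."""
    print(f"{record.result_commit} <- code {record.code_version}")
    for repo, commit in record.input_commits.items():
        print(f"  input: {repo}@{commit}")
        if commit in ledger:  # intermediate product: keep walking backwards
            trace(ledger[commit], ledger)

# Example: a loan decision traced back through features to raw applications.
ledger = {
    "feat-7f2": ProvenanceRecord("feat-7f2", {"raw-applications": "a1b9"}, "etl-v1.4"),
}
decision = ProvenanceRecord("decision-3c1", {"features": "feat-7f2"}, "model-v2.0")
trace(decision, ledger)
```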
Data Versioning Features in xpresso.ai
xpresso integrates data versioning with code and hyperparameter versioning, along with security features. Artifacts and models are stored in a customized data repository, which can be accessed in three modes, as follows:
1) Data versioning libraries, available with xpresso.ai, enable you to version control your datasets.
2) The xpresso.ai Control Center GUI enables you to view the stored datasets and models via the ‘Data Repository’ explorer, providing a rich yet simplified user experience.
3) You can also use pre-built components in your pipeline with no code. At run time, you provide the parameters and perform various activities such as pulling a dataset, pushing a dataset, creating a branch, viewing commits, etc.
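For illustration, here is a hypothetical sketch of mode 1; the actual class and method names in the xpresso.ai data versioning libraries may differ from those shown.

```python
class DatasetVersioner:
    """Stand-in for an xpresso.ai-style data versioning client (hypothetical)."""

    def __init__(self, project, repo):
        self.project, self.repo = project, repo

    def push_dataset(self, branch, local_path, message):
        # In a real client this would upload and commit a new dataset version.
        print(f"pushed {local_path} to {self.repo}@{branch}: {message}")

    def pull_dataset(self, branch, commit_id, target_path):
        # In a real client this would materialize the requested version locally.
        print(f"pulled {self.repo}@{branch}#{commit_id} into {target_path}")

versioner = DatasetVersioner(project="churn-model", repo="training-data")
versioner.push_dataset("master", "./data/train.csv", "added Q3 records")
versioner.pull_dataset("master", "HEAD", "./workspace/data")
```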
The process of version repository creation is integrated with project creation. Repository security is based on the access granted to specific sections of the project, and access to the data versioning system is controlled accordingly.

- The defined set of steps ensures that versioning is handled in an integrated manner through APIs. The Version Controller Factory class initiates use of the API; the Version Controller and the Data Versioning connector then check in the artifacts after authentication through the Versioning Authenticator module, as sketched after this list.
- The xpresso Control Center GUI is used to access and manage the data versioning system of a project. One can view repositories and datasets, create branches, push datasets, and even download them to a local system.
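The check-in flow described above can be sketched as follows. The class names mirror the components mentioned, but the real xpresso.ai interfaces may differ.

```python
class VersioningAuthenticator:
    def authenticate(self, token):
        # In practice this would validate the token against project access rights.
        return token == "valid-token"

class DataVersioningConnector:
    def check_in(self, repo, artifact):
        print(f"checked in {artifact} to {repo}")

class VersionController:
    def __init__(self, authenticator, connector):
        self.authenticator, self.connector = authenticator, connector

    def check_in_artifact(self, token, repo, artifact):
        if not self.authenticator.authenticate(token):
            raise PermissionError("versioning authentication failed")
        self.connector.check_in(repo, artifact)

class VersionControllerFactory:
    @staticmethod
    def create():
        # The factory wires the controller to its connector and authenticator.
        return VersionController(VersioningAuthenticator(), DataVersioningConnector())

controller = VersionControllerFactory.create()
controller.check_in_artifact("valid-token", "training-data", "model-v1.pkl")
```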

xpresso’s pre-built components, packaged as Docker images, use the version control repository details from the property files as parameters.
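For example, a component might read the repository details from a property file like the one below and pass them on as run-time parameters; the section and key names are hypothetical.

```python
from configparser import ConfigParser

# Illustrative property file consumed by a pre-built component.
PROPERTIES = """
[data_versioning]
repo = training-data
branch = master
commit = HEAD
"""

config = ConfigParser()
config.read_string(PROPERTIES)
repo_details = dict(config["data_versioning"])
print(repo_details)  # handed to the component as run-time parameters
```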
Other Advantages of the Data Versioning System in MLOps
Other tangible benefits of the data versioning system on an MLOps platform are as follows.
- The various strategies described above are essential for designing the data flow while training models. The data versioning system should give data scientists the flexibility to keep the data flow strategy within configuration files rather than hard-coding it in the model code, allowing the same model code to be reused for multiple scenarios (see the sketch after this list).
- Comparing multiple repositories on the basis of their input data and output helps data scientists learn which data transformations and cleansing policies generated better results. This helps avoid underfitting or overfitting the models.
- The data versioning system acts as a single source of truth for various initiatives without maintaining multiple copies of large datasets. This saves disk space and improves the manageability of datasets for bigger teams.
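A minimal sketch of such a configuration-driven data flow is shown below; the configuration keys and strategy names are hypothetical.

```python
# The data flow strategy lives in configuration, not in the model code,
# so the same training code serves multiple scenarios.
flow_config = {"strategy": "cross", "repos": ["images", "params"]}

def feed_data(config, load_repo):
    if config["strategy"] == "union":
        # Concatenate datums from all repositories into one stream.
        for repo in config["repos"]:
            yield from load_repo(repo)
    elif config["strategy"] == "cross":
        # Pair every datum of the first repo with every datum of the second.
        left, right = (list(load_repo(r)) for r in config["repos"])
        for a in left:
            for b in right:
                yield a, b

# Stand-in loader returning two datums per repository.
demo_loader = lambda repo: [f"{repo}-0", f"{repo}-1"]
print(list(feed_data(flow_config, demo_loader)))
```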
Summary
- Proper integration of a data versioning system with the xpresso MLOps platform helps users roll out models to production with a mature data context.
- The data versioning system builds flexibility and traceability into the design while helping practice managers keep regulatory compliance in place, which is critical for any MLOps practice.