Data Versioning Library
Data Versioning features in xpresso.ai enable developers to manage versions of their data, just as a code versioning system (e.g., GitHub) allows them to manage versions of their code.
Data Versioning features can be accessed either through the Control Center GUI or using a Python library. The latter is described here.
Data versioning on the xpresso platform is done through a version controller object, which is generated by the VersionControllerFactory module. This factory class checks the prerequisites for using data versioning on the xpresso platform and then returns a data version controller instance. Since version 1.3.7, separate authentication for data versioning has been removed; access is checked through the xpresso login instead. In other words, anyone who wants to use the data versioning libraries must first be logged in to the xpresso controller, either through a Jupyter notebook or through the xprctl terminal client.
Since xpresso version 1.2.5, repos are directly linked to xpresso solutions. Whenever a new solution is created, a data versioning repo with the same name as the solution is created as well. As a result, a developer or PM can only access a repo if they have developer or owner permission on the corresponding solution.
Every method available in data versioning is linked to a repo, and a user must have access to that repo in order to run any data versioning operation on it.
The main features provided by this class are:
1. Listing repositories (list_repo)
This is implemented by the list_repo method of the version_controller instance. list_repo takes no input parameters and returns all the repos the user has access to.
Sample code to list repositories
from xpresso.ai.core.data.versioning.controller_factory import VersionControllerFactory
# Create instance of VersionControllerFactory
version_controller_factory = VersionControllerFactory()
# Get version_controller object
version_controller = version_controller_factory.get_version_controller()
version_controller.list_repo()
2. Instantiating HDFS version controller (get_version_controller)
In the case of a DistributedStructuredDataset, an HDFS version controller can be used to version the dataset. The HDFS version controller object is instantiated as follows.
Sample code to instantiate HDFS version controller
from xpresso.ai.core.data.versioning.controller_factory import VersionControllerFactory
# Create instance of VersionControllerFactory
version_controller_factory = VersionControllerFactory()
# Get hdfs_version_controller object
version_controller = version_controller_factory.get_version_controller("hdfs")
3. Creating a branch within a repository (create_branch)
This is implemented by the create_branch method of this class.
Input Parameters:
JSON object containing the following fields:

Field | Type | Description | Mandatory? | Comments
---|---|---|---|---
repo_name | String | Repo name | Yes | Must be the name of an existing repository within the workspace. The user creating the branch must have access to this repo.
branch_name | String | Branch name | Yes | 
type | String | Branch type | Conditional | Value must be either 'data' or 'model'. Must be provided if the branch type is 'data'. Default is 'model'.
Sample code to create a branch
from xpresso.ai.core.data.versioning.controller_factory import VersionControllerFactory
# Create instance of VersionControllerFactory
version_controller_factory = VersionControllerFactory()
# Get version_controller object
version_controller = version_controller_factory.get_version_controller()
# create the branch
version_controller.create_branch(repo_name="my_new_repo", branch_name="my_new_branch", type="data")
4. Pushing a dataset into a repository (push_dataset)
This is implemented by the push_dataset method of this class. It functions in the same way as the push_dataset command of the xprctl CLI, with one key difference: the CLI expects a list of files to push (since no dataset object is defined when using the CLI), whereas the PachydermRequestManager class expects a dataset object to push into the repository. A dataset can only be pushed to a repo to which the user has access.
Input Parameters:
Field | Type | Description | Mandatory? | Comments
---|---|---|---|---
repo_name | String | Repo name | Yes | Must be the name of an existing repository within the workspace.
branch_name | String | Branch name | Yes | Must be the name of an existing branch within the repository.
dataset | Dataset | Dataset object | Yes | The dataset to be pushed.
description | String | Description of the push | Yes | 
type | String | Branch type | Conditional | Value must be either 'data' or 'model'. Must be provided if the branch type is 'data'. Default is 'model'.
Sample code to push a dataset
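The original page omits the code sample here; the following sketch illustrates a typical call, assuming a StructuredDataset object (imported as in the pull_dataset sample later on this page) has already been populated with data. The repo and branch names are placeholders.

```python
from xpresso.ai.core.data.automl.structured_dataset import StructuredDataset
from xpresso.ai.core.data.versioning.controller_factory import VersionControllerFactory

# Create instance of VersionControllerFactory
version_controller_factory = VersionControllerFactory()
# Get version_controller object
version_controller = version_controller_factory.get_version_controller()

# Prepare the dataset to be pushed (population of the dataset
# depends on your data source and is not shown here)
dataset = StructuredDataset()

# Push the dataset into an existing repo and branch
commit_info = version_controller.push_dataset(
    repo_name="my_new_repo",
    branch_name="my_new_branch",
    dataset=dataset,
    description="first push of sample data",
    type="data")
```

The returned value contains the commit ID and remote path described below.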
Return: Commit ID of the newly pushed dataset along with the remote path where the dataset is saved
Troubleshooting note: If a push_dataset operation fails or stops before completion, any subsequent push_dataset operation into the same branch also fails, with the error message "parent commit {{commit_id}} has not been finished". This is a known limitation for now. If you encounter this error, contact the xpresso.ai platform team to resolve the issue.
IMPORTANT NOTE: xpresso.ai Data Versioning libraries use underlying open source versioning systems (such as Pachyderm), which have their own interfaces for pushing/pulling data. The xpresso.ai libraries perform certain transformations to files before using underlying open-source interfaces to push data. Hence, files pushed using underlying versioning system interfaces will not be extracted properly if pulled using xpresso.ai libraries. Similarly, files pushed using xpresso.ai versioning libraries will not be extracted properly if pulled using the underlying versioning system interfaces.
5. Listing commits made in a repository (list_commit)
Every push_dataset operation generates a commit, just like in git. The commit_id is returned as part of the output of the push_dataset operation, and can be passed to the list_dataset and pull_dataset operations to filter out the required dataset. The list_commit method lists all the commits made on a branch of a repo.
Input Parameters:
Name | Type | Description | Mandatory? | Comments
---|---|---|---|---
repo_name | String | Repo name | Yes | Must be the name of an existing repository within the workspace, to which the user has access.
branch_name | String | Branch name | Yes | Must be the name of an existing branch within the repository.
type | String | Branch type | Conditional | Value must be either 'data' or 'model'. Must be provided if the branch type is 'data'. Default is 'model'.
show_table | Boolean | Flag to show the output in tabular format | No | Useful in a Jupyter notebook to view the commits as a table.
Sample code to list commits in a repository
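The original page omits the sample here; the following sketch shows a list_commit call using the parameters from the table above. The repo and branch names are placeholders.

```python
from xpresso.ai.core.data.versioning.controller_factory import VersionControllerFactory

# Create instance of VersionControllerFactory
version_controller_factory = VersionControllerFactory()
# Get version_controller object
version_controller = version_controller_factory.get_version_controller()

# List all commits made on a branch of a repo;
# show_table=True renders the output as a table in a Jupyter notebook
commits = version_controller.list_commit(
    repo_name="my_new_repo",
    branch_name="my_new_branch",
    type="data",
    show_table=True)
```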
The output of the list_commit operation contains information about each commit, including the description provided during the corresponding push_dataset operation.
6. Getting details of datasets in a Repository (list_dataset)
This is implemented by the list_dataset method of this class. The user must have access to the respective repo.
Input Parameters:
Name | Type | Description | Mandatory? | Comments
---|---|---|---|---
repo_name | String | Repo name | Yes | Must be the name of an existing repository within the workspace.
branch_name | String | Branch name | No | Must be the name of an existing branch within the repository. Either branch name or commit ID is mandatory.
type | String | Branch type | No | Value must be either 'data' or 'model'. Default is 'model'.
path | String | Path of the dataset within the repository | No | Default is '/'.
commit_id | String | Commit ID | No | Either branch name or commit ID is mandatory.
Sample code to list datasets
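The original page omits the sample here; the following sketch shows the two ways of selecting datasets described in the table above, either by branch name or by commit ID. The repo name, branch name, and commit ID are placeholders.

```python
from xpresso.ai.core.data.versioning.controller_factory import VersionControllerFactory

# Create instance of VersionControllerFactory
version_controller_factory = VersionControllerFactory()
# Get version_controller object
version_controller = version_controller_factory.get_version_controller()

# List datasets on a branch of a repo (path defaults to '/')
datasets = version_controller.list_dataset(
    repo_name="my_new_repo",
    branch_name="my_new_branch",
    path="/")

# Alternatively, filter by a specific commit instead of a branch,
# using a commit_id returned by an earlier push_dataset call
datasets = version_controller.list_dataset(
    repo_name="my_new_repo",
    commit_id="my_commit_id")
```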
7. Pulling a Dataset from a Repository (pull_dataset)
This is implemented by the pull_dataset method of this class.
Input Parameters:
Name | Type | Description | Mandatory? | Comments
---|---|---|---|---
repo_name | String | Repo name | Yes | Must be the name of an existing repository within the workspace.
branch_name | String | Branch name | No | Must be the name of an existing branch within the repository. Either branch name or commit ID is mandatory.
type | String | Branch type | Conditional | Value must be either 'data' or 'model'. Must be provided if the branch type is 'data'. Default is 'model'.
path | String | Path of the dataset within the repository | No | Default is '/'.
commit_id | String | Commit ID | No | Either branch name or commit ID is mandatory.
output_type | String | Type of output | No | Value must be either 'files' or 'dataset'. If 'files', the raw files are fetched and saved instead of constructing a dataset object.
Sample code to pull a dataset
from xpresso.ai.core.data.automl.structured_dataset import StructuredDataset
from xpresso.ai.core.data.versioning.controller_factory import VersionControllerFactory
# Create instance of VersionControllerFactory
version_controller_factory = VersionControllerFactory()
# Get version_controller object
version_controller = version_controller_factory.get_version_controller()
# pull the dataset
pull_from_repo_name = "sample_repo"
pull_from_repo_branch = "sample_branch"
dataset_object = version_controller.pull_dataset(repo_name=pull_from_repo_name, branch_name=pull_from_repo_branch, type="data")
Return: A dataset object