Data Versioning Library

Data versioning features in xpresso.ai enable developers to manage versions of their data in the same way that a code versioning system (e.g., Git) allows them to manage versions of their code.

Data Versioning features can be accessed either through the Control Center GUI or using a Python library. The latter is described here.

Data versioning on the xpresso platform is done through a version_controller object, which is obtained from the VersionControllerFactory class. This factory class checks the prerequisites for using data versioning on the xpresso platform and then provides a data version controller object. Since version 1.3.7, data versioning no longer has its own authentication; the access-level check is done through the xpresso login. That is, anyone who wants to use the data versioning libraries must first be logged in to the xpresso controller, either through a Jupyter notebook or through the xprctl terminal client.

Since xpresso version 1.2.5, repos are directly linked with xpresso solutions. Whenever a new solution is created, a data versioning repo with the same name as the solution is also created. This means that a developer or PM can access a repo only if they have either developer or owner permission on the corresponding solution.

Every method available in data versioning is linked to a repo. A user must have access to a repo in order to run any data versioning operation on it.

The main features provided by this class are:

1. Listing repositories (list_repo)

This is implemented by the list_repo method of the version_controller instance. list_repo does not take any input parameters.

It returns all the repos the user has access to.

Sample code to list repositories

from xpresso.ai.core.data.versioning.controller_factory import VersionControllerFactory
# Create instance of VersionControllerFactory
version_controller_factory = VersionControllerFactory()
# Get version_controller object
version_controller = version_controller_factory.get_version_controller()
version_controller.list_repo()

2. Instantiating HDFS version controller (get_version_controller)

For a DistributedStructuredDataset, the HDFS version controller can be used to version the dataset. The HDFS version controller object is obtained by passing "hdfs" to get_version_controller, as shown below.

Sample code to instantiate HDFS version controller

from xpresso.ai.core.data.versioning.controller_factory import VersionControllerFactory
# Create instance of VersionControllerFactory
version_controller_factory = VersionControllerFactory()
# Get hdfs_version_controller object
version_controller = version_controller_factory.get_version_controller("hdfs")

3. Creating a branch within a repository (create_branch)

This is implemented by the create_branch method of this class.

Input Parameters:

JSON object containing the following fields

| Field | Type | Description | Mandatory? | Comments |
|---|---|---|---|---|
| repo_name | String | Repo Name | Yes | Must be the name of an existing repository within the workspace. The user creating the branch must have access to this repo. |
| branch_name | String | Branch Name | Yes | |
| type | String | Branch type | Conditional | Must be either ‘data’ or ‘model’. Must be provided if the branch type is ‘data’. Default value is ‘model’. |


Sample code to create a branch

from xpresso.ai.core.data.versioning.controller_factory import VersionControllerFactory
# Create instance of VersionControllerFactory
version_controller_factory = VersionControllerFactory()
# Get version_controller object
version_controller = version_controller_factory.get_version_controller()
# create the branch
version_controller.create_branch(repo_name="my_new_repo", branch_name="my_new_branch", type="data")

4. Pushing a dataset into a repository (push_dataset)

This is implemented by the push_dataset method of this class. This method works in the same way as the push_dataset command of the xprctl CLI, with one key difference: the CLI expects a list of files to push (since no dataset is defined when using the CLI), whereas the PachydermRequestManager class expects a dataset object to push into the repository. A dataset can only be pushed to a repo to which the user has access.

Input Parameters:

| Field | Type | Description | Mandatory? | Comments |
|---|---|---|---|---|
| repo_name | String | Repo Name | Yes | Must be the name of an existing repository within the workspace |
| branch_name | String | Branch Name | Yes | Must be the name of an existing branch within the repository |
| dataset | Dataset | Dataset object | Yes | Dataset to be pushed |
| description | String | Description of the push | Yes | |
| type | String | Branch Type | Conditional | Must be either ‘data’ or ‘model’. Must be provided if the branch type is ‘data’. Default value is ‘model’. |

Sample code to push a dataset
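The following is a minimal sketch of a push, assuming that the field names in the table above are passed to push_dataset as keyword arguments and that version_controller was obtained from VersionControllerFactory as in the earlier samples. The helper name push_dataset_to_branch is hypothetical and is not part of the xpresso.ai library; only the push_dataset call and its parameter names come from this page.

```python
def push_dataset_to_branch(version_controller, dataset, repo_name, branch_name,
                           description, branch_type="model"):
    """Hypothetical wrapper: push `dataset` using the fields from the table above."""
    if not description:
        # The table marks `description` as mandatory.
        raise ValueError("description is mandatory for push_dataset")
    if branch_type not in ("data", "model"):
        # The table allows only these two branch types.
        raise ValueError("type must be either 'data' or 'model'")
    # The result is passed through unchanged; its exact shape is not assumed here.
    return version_controller.push_dataset(
        repo_name=repo_name,
        branch_name=branch_name,
        dataset=dataset,
        description=description,
        type=branch_type,
    )
```

A call might then look like push_dataset_to_branch(version_controller, dataset, "sample_repo", "sample_branch", "initial upload", branch_type="data"), where dataset is a dataset object that has already been created.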

Return: Commit ID of the newly pushed dataset along with the remote path where the dataset is saved

Troubleshooting note: If a push_dataset operation fails or stops before completion, any further push_dataset operation into the same branch fails with the error message “parent commit {{commit_id}} has not been finished”. This is currently a known limitation. If you see this error, contact the xpresso.ai platform team to resolve the issue.
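When reporting this problem, it can help to include the ID of the blocking commit. The sketch below extracts it from the error text. It assumes only the message wording quoted above; how the error surfaces (exception type, log line) is not specified on this page, so the function takes the message as a plain string. The function name is hypothetical.

```python
import re

# Matches the message quoted in the note above; the commit ID format is not assumed.
_UNFINISHED_PARENT = re.compile(r"parent commit (\S+) has not been finished")


def unfinished_parent_commit(error_message):
    """Return the blocking commit ID if the message matches, otherwise None."""
    match = _UNFINISHED_PARENT.search(error_message)
    return match.group(1) if match else None
```

For example, the text of a caught exception, str(exc), could be passed to unfinished_parent_commit to decide whether to escalate to the platform team.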

IMPORTANT NOTE: xpresso.ai Data Versioning libraries use underlying open source versioning systems (such as Pachyderm), which have their own interfaces for pushing/pulling data. The xpresso.ai libraries perform certain transformations to files before using underlying open-source interfaces to push data. Hence, files pushed using underlying versioning system interfaces will not be extracted properly if pulled using xpresso.ai libraries. Similarly, files pushed using xpresso.ai versioning libraries will not be extracted properly if pulled using the underlying versioning system interfaces.

5. Listing commits made in a repository (list_commit)

Every push_dataset operation generates a commit, just as in Git. Currently, the commit_id is returned as part of the output of the push_dataset operation. This commit_id can be used in list_dataset and pull_dataset operations to select the required dataset. The list_commit method lists all the commits made on a branch of a repo.

Input Parameters:

| Name | Type | Description | Mandatory? | Comments |
|---|---|---|---|---|
| repo_name | String | Repo Name | Yes | Must be the name of an existing repository within the workspace to which the user has access |
| branch_name | String | Branch Name | Yes | Must be the name of an existing branch within the repository |
| type | String | Branch Type | Conditional | Must be either ‘data’ or ‘model’. Must be provided if the branch type is ‘data’. Default value is ‘model’. |
| show_table | Boolean | Flag to show the output in tabular format | No | Can be provided in a Jupyter notebook to view the commits in tabular format |

Sample code to list commits in a repository
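The following is a minimal sketch, assuming that the field names in the table above are passed to list_commit as keyword arguments and that version_controller was obtained from VersionControllerFactory as in the earlier samples. The helper name list_branch_commits is hypothetical. Because the default of show_table is not stated on this page, the flag is only sent when it is explicitly enabled.

```python
def list_branch_commits(version_controller, repo_name, branch_name,
                        branch_type="model", show_table=False):
    """Hypothetical wrapper: list the commits on one branch of a repo."""
    kwargs = {
        "repo_name": repo_name,
        "branch_name": branch_name,
        "type": branch_type,  # pass "data" for data branches
    }
    if show_table:
        # Only send the flag when requested, e.g. in a Jupyter notebook.
        kwargs["show_table"] = True
    return version_controller.list_commit(**kwargs)
```

For a data branch, the call would be list_branch_commits(version_controller, "sample_repo", "sample_branch", branch_type="data").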

The output of the list_commit operation contains information about each commit, including the description provided during the push_dataset operation.

6. Getting details of datasets in a Repository (list_dataset)

This is implemented by the list_dataset method of this class. The user must have access to the respective repo.

Input Parameters:

| Name | Type | Description | Mandatory? | Comments |
|---|---|---|---|---|
| repo_name | String | Repo Name | Yes | Must be the name of an existing repository within the workspace |
| branch_name | String | Branch Name | No | Must be the name of an existing branch within the repository. Either branch name or commit ID is mandatory |
| type | String | Branch Type | No | Value must be either ‘data’ or ‘model’. Default = ‘model’ |
| path | String | Path of the dataset within the repository | No | Default = ‘/’ |
| commit_id | String | ID of commit | No | Either branch name or commit ID is mandatory |

Sample code to list datasets
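The table above makes branch_name and commit_id individually optional but requires at least one of them. The sketch below checks that rule before calling list_dataset. It assumes that the field names in the table are passed as keyword arguments; the helper names list_dataset_kwargs and list_datasets are hypothetical and are not part of the xpresso.ai library.

```python
def list_dataset_kwargs(repo_name, branch_name=None, commit_id=None,
                        path="/", branch_type="model"):
    """Build list_dataset arguments, enforcing the branch-or-commit rule."""
    if branch_name is None and commit_id is None:
        raise ValueError("either branch_name or commit_id is mandatory")
    kwargs = {"repo_name": repo_name, "path": path, "type": branch_type}
    # Only send the selectors that were actually provided.
    if branch_name is not None:
        kwargs["branch_name"] = branch_name
    if commit_id is not None:
        kwargs["commit_id"] = commit_id
    return kwargs


def list_datasets(version_controller, **selection):
    """Call list_dataset with the validated arguments."""
    return version_controller.list_dataset(**list_dataset_kwargs(**selection))
```

For example, list_datasets(version_controller, repo_name="sample_repo", branch_name="sample_branch", branch_type="data") lists the datasets at the root path of a data branch.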

7. Pulling a Dataset from a Repository (pull_dataset)

This is implemented by the pull_dataset method of this class.

Input Parameters:

| Name | Type | Description | Mandatory? | Comments |
|---|---|---|---|---|
| repo_name | String | Repo Name | Yes | Must be the name of an existing repository within the workspace |
| branch_name | String | Branch Name | No | Must be the name of an existing branch within the repository. Either branch name or commit ID is mandatory |
| type | String | Branch type | Conditional | Must be either ‘data’ or ‘model’. Must be provided if the branch type is ‘data’. Default value is ‘model’. |
| path | String | Path of the dataset within the repository | No | Default “/” |
| commit_id | String | ID of commit | No | Either branch name or commit ID is mandatory |
| output_type | String | Type of output | No | Must be either ‘files’ or ‘dataset’. If output_type is ‘files’ rather than ‘dataset’, only the files are fetched and saved. |


Sample code to pull a dataset

from xpresso.ai.core.data.automl.structured_dataset import StructuredDataset
from xpresso.ai.core.data.versioning.controller_factory import VersionControllerFactory
# Create instance of VersionControllerFactory
version_controller_factory = VersionControllerFactory()
# Get version_controller object
version_controller = version_controller_factory.get_version_controller()
# pull the dataset
pull_from_repo_name = "sample_repo"
pull_from_repo_branch = "sample_branch"
dataset_object = version_controller.pull_dataset(
    repo_name=pull_from_repo_name,
    branch_name=pull_from_repo_branch,
    type="data",
)

Return: A dataset object
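The sample above pulls by branch name. The sketch below shows the other documented option, pulling by commit_id, together with an explicit output_type. It assumes that the field names in the parameter table are passed to pull_dataset as keyword arguments; the helper name pull_by_commit is hypothetical. Because the default of output_type is not stated on this page, the helper requires the caller to choose one.

```python
VALID_OUTPUT_TYPES = ("files", "dataset")


def pull_by_commit(version_controller, repo_name, commit_id, output_type,
                   path="/", branch_type="model"):
    """Hypothetical wrapper: pull the contents of one commit from a repo."""
    if output_type not in VALID_OUTPUT_TYPES:
        raise ValueError("output_type must be either 'files' or 'dataset'")
    # The result is passed through unchanged; its shape is not assumed here.
    return version_controller.pull_dataset(
        repo_name=repo_name,
        commit_id=commit_id,
        path=path,
        type=branch_type,
        output_type=output_type,
    )
```

The commit_id can be taken from the output of an earlier push_dataset or list_commit call, for example pull_by_commit(version_controller, "sample_repo", commit_id, "files", branch_type="data").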

IMPORTANT NOTE: xpresso.ai Data Versioning libraries use underlying open source versioning systems (such as Pachyderm), which have their own interfaces for pushing/pulling data. The xpresso.ai libraries perform certain transformations to files before using underlying open-source interfaces to push data. Hence, files pushed using underlying versioning system interfaces will not be extracted properly if pulled using xpresso.ai libraries. Similarly, files pushed using xpresso.ai versioning libraries will not be extracted properly if pulled using the underlying versioning system interfaces.