Data Versioning Library
Data Versioning features in xpresso.ai enable developers to manage versions of their data, just as a code versioning system (e.g., GitHub) allows them to manage versions of their code.
Data Versioning features can be accessed either through the Control Center GUI or using a Python library. The latter is described here.
Data versioning on the xpresso platform is done through a version controller object, which is generated by the VersionControllerFactory module. This factory class checks the prerequisites for using data versioning on the xpresso platform and then returns a data version controller instance. Since version 1.3.7, separate authentication for data versioning has been removed; access is checked through the xpresso login instead. In other words, anyone who wants to use the data versioning libraries must first be logged in to the xpresso controller, either through a Jupyter notebook or through the xprctl terminal client.
Since xpresso version 1.2.5, repos are directly linked to xpresso solutions. Whenever a new solution is created, a data versioning repo with the same name as the solution is created as well. As a result, a developer or PM can only access a repo if they have developer or owner permission on the corresponding solution.
Every method available in data versioning is linked to a repo, and a user must have access to that repo in order to run any data versioning operation on it.
The main features provided by this class are:
1. Listing repositories (list_repo)
This is implemented by the list_repo method of the version_controller instance. list_repo takes no input parameters and returns all the repos the user has access to.
Sample code to list repositories
from xpresso.ai.core.data.versioning.controller_factory import VersionControllerFactory
# Create instance of VersionControllerFactory
version_controller_factory = VersionControllerFactory()
# Get version_controller object
version_controller = version_controller_factory.get_version_controller()
version_controller.list_repo()
2. Instantiating HDFS version controller (get_version_controller)
In the case of a DistributedStructuredDataset, an HDFS version controller can be used to version the dataset. The HDFS version controller object is instantiated as follows.
Sample code to instantiate HDFS version controller
from xpresso.ai.core.data.versioning.controller_factory import VersionControllerFactory
# Create instance of VersionControllerFactory
version_controller_factory = VersionControllerFactory()
# Get hdfs_version_controller object
version_controller = version_controller_factory.get_version_controller("hdfs")
3. Creating a branch within a repository (create_branch)
This is implemented by the create_branch method of this class.
Input Parameters:
JSON object containing the following fields:

Field | Type | Description | Mandatory? | Comments
---|---|---|---|---
repo_name | String | Repo name | Yes | Must be the name of an existing repository within the workspace. The user creating the branch must have access to this repo.
branch_name | String | Branch name | Yes | 
type | String | Branch type | Conditional | Value must be either 'data' or 'model'. Must be provided if the branch type is 'data'. Default is 'model'.
Sample code to create a branch
from xpresso.ai.core.data.versioning.controller_factory import VersionControllerFactory
# Create instance of VersionControllerFactory
version_controller_factory = VersionControllerFactory()
# Get version_controller object
version_controller = version_controller_factory.get_version_controller()
# create the branch
version_controller.create_branch(repo_name="my_new_repo", branch_name="my_new_branch", type="data")
4. Pushing a dataset into a repository (push_dataset)
This is implemented by the push_dataset method of this class. It functions in the same way as the push_dataset command of the xprctl CLI, with one key difference: the CLI expects a list of files to push (since no dataset object is defined when using the CLI), whereas the PachydermRequestManager class expects a dataset object to push into the repository. A dataset can only be pushed to a repo to which the user has access.
Input Parameters:
Field | Type | Description | Mandatory? | Comments
---|---|---|---|---
repo_name | String | Repo name | Yes | Must be the name of an existing repository within the workspace.
branch_name | String | Branch name | Yes | Must be the name of an existing branch within the repository.
dataset | Dataset | Dataset object | Yes | The dataset to be pushed.
description | String | Description of the push | Yes | 
type | String | Branch type | Conditional | Value must be either 'data' or 'model'. Must be provided if the branch type is 'data'. Default is 'model'.
Sample code to push a dataset
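The original page omits the code sample here; the following sketch illustrates a typical call, assuming a StructuredDataset object (imported as in the pull_dataset sample later on this page) has already been populated with data. The repo and branch names are placeholders.

```python
from xpresso.ai.core.data.automl.structured_dataset import StructuredDataset
from xpresso.ai.core.data.versioning.controller_factory import VersionControllerFactory

# Create instance of VersionControllerFactory
version_controller_factory = VersionControllerFactory()
# Get version_controller object
version_controller = version_controller_factory.get_version_controller()

# Prepare the dataset to be pushed (population of the dataset
# depends on your data source and is not shown here)
dataset = StructuredDataset()

# Push the dataset into an existing repo and branch
commit_info = version_controller.push_dataset(
    repo_name="my_new_repo",
    branch_name="my_new_branch",
    dataset=dataset,
    description="first push of sample data",
    type="data")
```

The returned value contains the commit ID and remote path described below.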
Return: Commit ID of the newly pushed dataset along with the remote path where the dataset is saved
Troubleshooting note: If a push_dataset operation fails or stops before completion, any subsequent push_dataset operation into the same branch also fails, with the error message "parent commit {{commit_id}} has not been finished". This is a known limitation for now. If you encounter this error, contact the xpresso.ai platform team to resolve the issue.
IMPORTANT NOTE: xpresso.ai Data Versioning libraries use underlying open source versioning systems (such as Pachyderm), which have their own interfaces for pushing/pulling data. The xpresso.ai libraries perform certain transformations to files before using underlying open-source interfaces to push data. Hence, files pushed using underlying versioning system interfaces will not be extracted properly if pulled using xpresso.ai libraries. Similarly, files pushed using xpresso.ai versioning libraries will not be extracted properly if pulled using the underlying versioning system interfaces.
5. Listing commits made in a repository (list_commit)
Every push_dataset operation generates a commit, just like in git. The commit_id is returned as part of the output of the push_dataset operation, and can be passed to the list_dataset and pull_dataset operations to filter out the required dataset. The list_commit method lists all the commits made on a branch of a repo.
Input Parameters:
Name | Type | Description | Mandatory? | Comments
---|---|---|---|---
repo_name | String | Repo name | Yes | Must be the name of an existing repository within the workspace, to which the user has access.
branch_name | String | Branch name | Yes | Must be the name of an existing branch within the repository.
type | String | Branch type | Conditional | Value must be either 'data' or 'model'. Must be provided if the branch type is 'data'. Default is 'model'.
show_table | Boolean | Flag to show the output in tabular format | No | Useful in a Jupyter notebook to view the commits as a table.
Sample code to list commits in a repository
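The original page omits the sample here; the following sketch shows a list_commit call using the parameters from the table above. The repo and branch names are placeholders.

```python
from xpresso.ai.core.data.versioning.controller_factory import VersionControllerFactory

# Create instance of VersionControllerFactory
version_controller_factory = VersionControllerFactory()
# Get version_controller object
version_controller = version_controller_factory.get_version_controller()

# List all commits made on a branch of a repo;
# show_table=True renders the output as a table in a Jupyter notebook
commits = version_controller.list_commit(
    repo_name="my_new_repo",
    branch_name="my_new_branch",
    type="data",
    show_table=True)
```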
The output of the list_commit operation contains information about each commit, including the description provided during the corresponding push_dataset operation.
6. Getting details of datasets in a Repository (list_dataset)
This is implemented by the list_dataset method of this class. The user must have access to the respective repo.
Input Parameters:
Name | Type | Description | Mandatory? | Comments
---|---|---|---|---
repo_name | String | Repo name | Yes | Must be the name of an existing repository within the workspace.
branch_name | String | Branch name | No | Must be the name of an existing branch within the repository. Either branch name or commit ID is mandatory.
type | String | Branch type | No | Value must be either 'data' or 'model'. Default is 'model'.
path | String | Path of the dataset within the repository | No | Default is '/'.
commit_id | String | Commit ID | No | Either branch name or commit ID is mandatory.
Sample code to list datasets
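The original page omits the sample here; the following sketch shows the two ways of selecting datasets described in the table above, either by branch name or by commit ID. The repo name, branch name, and commit ID are placeholders.

```python
from xpresso.ai.core.data.versioning.controller_factory import VersionControllerFactory

# Create instance of VersionControllerFactory
version_controller_factory = VersionControllerFactory()
# Get version_controller object
version_controller = version_controller_factory.get_version_controller()

# List datasets on a branch of a repo (path defaults to '/')
datasets = version_controller.list_dataset(
    repo_name="my_new_repo",
    branch_name="my_new_branch",
    path="/")

# Alternatively, filter by a specific commit instead of a branch,
# using a commit_id returned by an earlier push_dataset call
datasets = version_controller.list_dataset(
    repo_name="my_new_repo",
    commit_id="my_commit_id")
```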
7. Pulling a Dataset from a Repository (pull_dataset)
This is implemented by the pull_dataset method of this class.
Input Parameters:
Name | Type | Description | Mandatory? | Comments
---|---|---|---|---
repo_name | String | Repo name | Yes | Must be the name of an existing repository within the workspace.
branch_name | String | Branch name | No | Must be the name of an existing branch within the repository. Either branch name or commit ID is mandatory.
type | String | Branch type | Conditional | Value must be either 'data' or 'model'. Must be provided if the branch type is 'data'. Default is 'model'.
path | String | Path of the dataset within the repository | No | Default is '/'.
commit_id | String | Commit ID | No | Either branch name or commit ID is mandatory.
output_type | String | Type of output | No | Value must be either 'files' or 'dataset'. If 'files', the raw files are fetched and saved instead of constructing a dataset object.
Sample code to pull a dataset
from xpresso.ai.core.data.automl.structured_dataset import StructuredDataset
from xpresso.ai.core.data.versioning.controller_factory import VersionControllerFactory
# Create instance of VersionControllerFactory
version_controller_factory = VersionControllerFactory()
# Get version_controller object
version_controller = version_controller_factory.get_version_controller()
# pull the dataset
pull_from_repo_name = "sample_repo"
pull_from_repo_branch = "sample_branch"
dataset_object = version_controller.pull_dataset(repo_name=pull_from_repo_name, branch_name=pull_from_repo_branch, type="data")
Return: A dataset object