DataVersioning


The data_versioning component provides commands to push data to and pull data from the xpresso.ai Data Repository.

Name

data_versioning

Purpose

To push and pull data using the xpresso Data Versioning Libraries.

Usage Scenarios

  • It can be used at the beginning of a Machine Learning Pipeline to fetch training data from the repository.

  • It can be used to push the dataset after exploration and visualization.

Created By

xpresso.ai Team

Support e-mail

support@xpresso.ai

Binary / Source / Both Versions

Binary

Docker Image Reference

  • For non-Abzooba instances: dockerregistry.xpresso.ai/library/data_versioning:2.2

  • For Abzooba sandbox instance: dockerregistrysb.xpresso.ai/library/data_versioning:2.2

  • For Abzooba QA instance: xpresso.ai/library/data_versioning:2.2

  • For Abzooba PROD instance: dockerregistryprod.xpresso.ai/library/data_versioning:2.2

Type of component

pipeline_job

Usage Instructions

When Deploying Components:
  1. Specify Mount Path (Mount Path is a shared directory between components in a pipeline which is used for reading/writing data)

  2. Specify the Docker image referred above in the ‘Custom Docker Image’ textbox

  3. Specify arguments as specified in the ‘Deploy Solution Arguments’ section below

Result

  • pull_dataset: pulls the data from the repository into out-path.

  • push_dataset: pushes the data from in-path into the repository.

Example

Assume that a dataset has to be fetched from the data repository. Create a component of type ‘pipeline_job’ and deploy it using the custom Docker image specified above. Create a pipeline using the component. To pull the dataset, specify the following parameters when an experiment is run on the pipeline:

  • Run Name: <provide a unique run name>

  • Pipeline Version: <select the version of the pipeline you want to run>

  • repo-name: <name of data repository> (usually, the same as the solution name)

  • branch-name: <name of branch from which to fetch data>

  • commit-id: <commit ID of data to be fetched>

  • pull-input-path: <specify the path within the commit, if any>

  • pull-output-path: <mount path on the shared drive where results are to be stored, e.g., ‘/data’>. The pulled dataset will be stored here.

Deploy Solution Arguments:

  • Using pull_dataset

| Field | Parameter key (see Run-parameters below) | Description | Mandatory? | Default Value | Dynamic arg required? | Comments |
| --- | --- | --- | --- | --- | --- | --- |
| -component-name | component_name | The component name in the solution | Yes | data_fetch | Yes | |
| -command | command | Data versioning operation | Yes | None | Yes | Specify the value as pull_dataset |
| -repo-name | repo_name | Name of the data versioning repository (usually, the same as the solution name) | Yes | None | Yes | |
| -branch-name | branch_name | Name of branch within repository | Yes | None | Yes | |
| -branch-type | branch_type | Value of the branch type | Yes | model | Yes | Can be ‘data’ for data repository operations or ‘model’ for model repository operations |
| -commit-id | commit_id | Value of commit_id returned after pushing the data | Yes | Latest commit ID | Yes | |
| -dv-commit-id | dv_commit_id | Value of commit_id returned by Data Versioning system | No | None | Yes | This is the commit_id returned after push_dataset before xpresso version 2.1.1. Will be deprecated in the next marketplace component release |
| -pull-input-path | pull_input_path | Path of the file on data versioning system | No | /dataset | Yes | This is returned as output of push_dataset. Helpful for fetching only the required files rather than the whole dataset |
| -pull-output-path | pull_output_path | Path on the container to save the fetched data | No | /data/pull_data | Yes | This parameter can be used to save the files at a required location and use them in other components |
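The pull_dataset arguments above can be assembled programmatically before they are entered in the deployment form. The following is a minimal sketch, not part of the xpresso.ai libraries: the helper `build_pull_args`, the flat `[flag, value, ...]` layout, and the sample repository and branch names are assumptions made for illustration. The defaults are copied from the table above, and -commit-id is left out so that the documented default (latest commit ID) applies.

```python
# Defaults copied from the pull_dataset table above.
PULL_DEFAULTS = {
    "-component-name": "data_fetch",
    "-branch-type": "model",
    "-pull-input-path": "/dataset",
    "-pull-output-path": "/data/pull_data",
}

# Arguments the table marks as mandatory with a default of None.
PULL_REQUIRED = ("-repo-name", "-branch-name")


def build_pull_args(values):
    """Return a flat [flag, value, ...] list for pull_dataset (hypothetical helper)."""
    missing = [flag for flag in PULL_REQUIRED if not values.get(flag)]
    if missing:
        raise ValueError("missing mandatory arguments: " + ", ".join(missing))
    # User-supplied values override the defaults; -command is always pull_dataset.
    merged = {**PULL_DEFAULTS, **values, "-command": "pull_dataset"}
    args = []
    for flag, value in merged.items():
        args.extend([flag, str(value)])
    return args


pull_args = build_pull_args({
    "-repo-name": "my_solution",   # hypothetical repository name
    "-branch-name": "master",      # hypothetical branch name
    "-branch-type": "data",        # pull from the data repository
})
print(pull_args)
```

Checking the mandatory arguments up front surfaces a missing repository or branch name before the component is deployed, rather than at experiment run time.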

  • Using push_dataset

| Field | Parameter key (see Run-parameters below) | Description | Mandatory? | Default Value | Dynamic arg required? | Comments |
| --- | --- | --- | --- | --- | --- | --- |
| -component-name | component_name | The component name in the solution | Yes | data_fetch | Yes | |
| -command | command | Data versioning operation | Yes | None | Yes | Specify the value as push_dataset |
| -push-input-path | push_input_path | Path of the file to be pushed | Yes | /data | Yes | |
| -repo-name | repo_name | Name of the data versioning repository (usually, the same as the solution name) | Yes | None | Yes | |
| -branch-name | branch_name | Name of branch within repository | Yes | None | Yes | |
| -branch-type | branch_type | Value of the branch type | Yes | model | Yes | Can be ‘data’ for data repository operations or ‘model’ for model repository operations |
| -dataset-name | dataset_name | Name of the dataset on data versioning system | Yes | None | Yes | |
| -description | description | Description of the dataset | Yes | None | Yes | Whitespace is prohibited |
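The constraints on the push_dataset arguments (mandatory fields, the two allowed branch types, and no whitespace in the description) can be checked before deployment. The following is a minimal sketch under stated assumptions: `validate_push_values` is a hypothetical helper, not an xpresso.ai function, and the sample values are invented for illustration.

```python
import re

# Arguments the push_dataset table marks as mandatory with a default of None.
PUSH_REQUIRED = ("-repo-name", "-branch-name", "-dataset-name", "-description")


def validate_push_values(values):
    """Return a list of problems found in push_dataset values (hypothetical helper)."""
    problems = [flag + " is mandatory" for flag in PUSH_REQUIRED if not values.get(flag)]
    # The table states that whitespace is prohibited in the description.
    if re.search(r"\s", values.get("-description", "")):
        problems.append("-description must not contain whitespace")
    # The table lists 'data' and 'model' as the branch types; 'model' is the default.
    if values.get("-branch-type", "model") not in ("data", "model"):
        problems.append("-branch-type must be 'data' or 'model'")
    return problems


good = {
    "-repo-name": "my_solution",         # hypothetical repository name
    "-branch-name": "master",            # hypothetical branch name
    "-dataset-name": "cleaned_sales",    # hypothetical dataset name
    "-description": "cleaned_sales_v1",  # no whitespace
    "-branch-type": "data",
}
bad = dict(good, **{"-description": "cleaned sales v1"})

print(validate_push_values(good))
print(validate_push_values(bad))
```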

For a detailed reference on the usage of the data versioning parameters, refer to the Data Versioning Library documentation.

Dynamic-args:

Specify each dynamic argument right after its static argument and check the “Dynamic” checkbox. The value of the dynamic argument should be a placeholder string. This string appears on the Run Experiment form, where the actual run-time value for the static argument is filled in.

E.g., if the static argument is -out-path, then its dynamic arg could be out_path. This out_path appears as an input field in the Run Experiment form. The value of this input field is a string-valued path, which is the expected value for the -out-path argument.
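The naming pattern in this example (-out-path becomes out_path) is also the pattern relating the Field and Parameter key columns in the tables above. It can be sketched as a small function; `dynamic_placeholder` is a hypothetical helper, and since the placeholder is only a label, any other string would presumably work as well.

```python
def dynamic_placeholder(static_flag):
    """Derive a placeholder such as 'out_path' from a flag such as '-out-path'."""
    # Drop the leading dash, then turn the remaining dashes into underscores.
    return static_flag.lstrip("-").replace("-", "_")


for flag in ("-out-path", "-repo-name", "-pull-output-path"):
    print(flag, "->", dynamic_placeholder(flag))
```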

Run-parameters (file or commit ID):

When loading parameters from a file or a data versioning repository, use the parameter keys listed in the Deploy Solution Arguments section above. For more details, refer to the Guide For Dynamically Loading Run Parameters From File Or Data Versioning Repository.
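As an illustration only, a set of run parameters keyed by the parameter keys could look like the mapping below. The JSON serialization is an assumption, since this page does not state the file format that xpresso.ai expects; the referenced guide is the authority on that. The repository and branch names are hypothetical.

```python
import json

# Keys come from the "Parameter key" column of the pull_dataset arguments.
run_parameters = {
    "command": "pull_dataset",
    "repo_name": "my_solution",   # hypothetical repository name
    "branch_name": "master",      # hypothetical branch name
    "branch_type": "data",
    "pull_output_path": "/data/pull_data",
}

# Assumed format: JSON. Check the referenced guide for the actual format.
serialized = json.dumps(run_parameters, indent=2)
print(serialized)
```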