DataExplorer


The data_explorer component performs exploratory analysis on a dataset.

Prerequisite: a Dataset object saved to a path (in-path) or pushed to a data versioning repository.

Name

data_explorer

Purpose

To perform univariate and bivariate analysis of data

Usage Scenarios

It can be used to perform Exploratory Data Analysis (refer to the Data Exploration Library).

Created By

xpresso.ai Team

Support e-mail

support@xpresso.ai

Binary / Source / Both Versions

Binary

Docker Image Reference

  • For non-Abzooba instances: dockerregistry.xpresso.ai/library/data_explorer:2.2

  • For Abzooba sandbox instance: dockerregistrysb.xpresso.ai/library/data_explorer:2.2

  • For Abzooba QA instance: xpresso.ai/library/data_explorer:2.2

  • For Abzooba PROD instance: dockerregistryprod.xpresso.ai/library/data_explorer:2.2

Type of component

pipeline_job

Usage Instructions

When deploying the component:
  1. Specify the Mount Path (a shared directory between components in a pipeline, used for reading/writing data)

  2. Specify the Docker image referenced above in the ‘Custom Docker Image’ textbox

  3. Specify arguments as described in the ‘Deploy Solution Arguments’ section below

Result

Stores the explored dataset files and exploration Excel files in the location specified by out-path. Note: to save data to NFS, specify an out-path within the mount path.
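The NFS note above can be checked programmatically. The sketch below is illustrative only; the mount path and out-path values are hypothetical examples, and the helper function is not part of the component.

```python
import os

# Hypothetical values for illustration; the actual mount path is whatever
# was specified when the component was deployed.
mount_path = "/data"
out_path = "/data/exploration_results"

def is_within_mount(path, mount):
    # Normalise both paths, then check the prefix relationship.
    path = os.path.normpath(path)
    mount = os.path.normpath(mount)
    return path == mount or path.startswith(mount + os.sep)

print(is_within_mount(out_path, mount_path))  # True: results will reach NFS
```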

Example

Assume that a dataset has to be fetched from the data repository and explored. Create a component of type ‘pipeline_job’ and deploy it using the Custom Docker image specified above. Create a pipeline using the component. To explore the dataset, the following parameters can be specified when an experiment is run on the pipeline:

Run Name: <provide a unique run name>
Pipeline Version: <select the version of the pipeline you want to run>
bins: 5
validity-threshold: 95
repo-name: <name of data repository> (usually, the same as the solution name)
branch-name: <name of branch from which to fetch data for exploration>
commit-id: <commit ID of data to be fetched for exploration>
out-path: <path in the NFS mount where you want to store the data, e.g., ‘/data’>

The explored dataset and exploration results will be stored at out-path.
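The parameters above can be gathered into a single structure before a run. This is only a sketch: the key names mirror the run parameters listed in the example, and all values (solution name, branch, and so on) are hypothetical placeholders.

```python
# Hypothetical run parameters for the example above; values are placeholders.
run_parameters = {
    "run_name": "explore-run-001",   # unique run name (example value)
    "pipeline_version": "1",         # version of the pipeline to run
    "bins": 5,
    "validity_threshold": 95,
    "repo_name": "my_solution",      # usually the same as the solution name
    "branch_name": "master",
    "commit_id": "<commit-id>",      # commit ID left as a placeholder
    "out_path": "/data",             # path within the NFS mount
}

print(sorted(run_parameters))
```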


Deploy Solution Arguments:

  • Using data from mount path

| Field | Parameter key (refer run-parameters below) | Description | Mandatory? | Dynamic arg required? | Comments |
|---|---|---|---|---|---|
| -component-name | component_name | The component name in the solution | Yes | Yes | |
| -explorer-output-path | explorer_output_path | The path where exploration results are saved | Yes | Yes | Explored dataset and exploration results are saved here |
| -explorer-input-path | explorer_input_path | Path of the file from which data is loaded for exploration | Yes | Yes | |
| -validity-threshold | validity_threshold | Minimum percentage of numeric values allowed in a column | No | Yes | Applicable only to structured datasets (refer to the Data Exploration Library) |
| -bins | bins | Number of bins for the numeric probability distribution | No | Yes | Applicable only to structured datasets (refer to the Data Exploration Library) |
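The static arguments in the table above are ultimately passed to the component as a flag/value sequence. The sketch below shows one way such a sequence might be assembled; the helper and all values are illustrative, not part of the component.

```python
# Build a flat ["-flag", "value", ...] list from a mapping of static
# arguments (names taken from the table above) to example values.
def build_args(params):
    args = []
    for flag, value in params.items():
        args.append(flag)
        args.append(str(value))
    return args

mount_path_params = {
    "-component-name": "data_explorer",
    "-explorer-output-path": "/data/exploration_results",
    "-explorer-input-path": "/data/input_dataset.csv",
    "-validity-threshold": 95,   # optional
    "-bins": 5,                  # optional
}

print(" ".join(build_args(mount_path_params)))
```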

  • Fetching data from the data versioning system

| Field | Parameter key (refer run-parameters below) | Description | Mandatory? | Dynamic arg required? | Comments |
|---|---|---|---|---|---|
| -component-name | component_name | The component name in the solution | Yes | Yes | |
| -explorer-output-path | explorer_output_path | The path where exploration results are saved | Yes | Yes | Explored dataset and exploration results are saved here |
| -repo-name | repo_name | Name of the data versioning repository (usually, the same as the solution name) | Yes | Yes | |
| -branch-name | branch_name | Name of branch from which to fetch data for exploration | Yes | Yes | |
| -commit-id | commit_id | Commit ID of data to be fetched for exploration | Yes | Yes | |
| -validity-threshold | validity_threshold | Minimum percentage of numeric values allowed in a column | No | Yes | Applicable only to structured datasets (refer to the Data Exploration Library) |
| -bins | bins | Number of bins for the numeric probability distribution | No | Yes | Applicable only to structured datasets (refer to the Data Exploration Library) |
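Five of the parameter keys in the table above are mandatory in data-versioning mode. A small pre-flight check like the following can catch a missing run parameter before an experiment is launched; the helper itself is illustrative and not part of the component.

```python
# Mandatory parameter keys for data-versioning mode, per the table above.
MANDATORY_KEYS = {
    "component_name",
    "explorer_output_path",
    "repo_name",
    "branch_name",
    "commit_id",
}

def missing_keys(run_params):
    # Return any mandatory keys absent from the supplied parameters.
    return sorted(MANDATORY_KEYS - set(run_params))

params = {
    "component_name": "data_explorer",
    "explorer_output_path": "/data/exploration_results",
    "repo_name": "my_solution",
    "branch_name": "master",
    # commit_id deliberately omitted to show the check firing
}
print(missing_keys(params))  # ['commit_id']
```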

Dynamic-args:

Specify a dynamic argument right after its static argument and check the “Dynamic” checkbox. The value of the dynamic arg should be a placeholder string; this string appears as an input field on the Run Experiment form, where the actual run-time value for the static argument must be filled in.

For example, if the static argument is -out-path, its dynamic arg could be out_path. out_path is then reflected as an input field on the Run Experiment form, and the value entered there (a string-valued path) becomes the value of the -out-path argument.
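The static-to-dynamic relationship described above can be sketched as a simple substitution. The placeholder names and form values below are hypothetical examples.

```python
# Mapping of static arguments to their dynamic-arg placeholder strings
# (the placeholders shown on the Run Experiment form).
static_to_dynamic = {
    "-out-path": "out_path",
    "-bins": "bins",
}

# Values entered on the Run Experiment form at run time (example values).
form_values = {"out_path": "/data", "bins": "5"}

# Resolve each static argument to the value supplied for its placeholder.
resolved = {static: form_values[dyn] for static, dyn in static_to_dynamic.items()}
print(resolved)  # {'-out-path': '/data', '-bins': '5'}
```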

Run-parameters (file or commit ID):

When loading parameters from a file or a data versioning repository, use the keys listed in the tables above. For more details, refer to the Guide For Dynamically Loading Run Parameters From File Or Data Versioning Repository.
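As a rough sketch of what such a parameter file might contain: the keys below come from the tables above, but the JSON format and all values are assumptions; consult the linked guide for the format actually expected.

```python
import io
import json

# A guessed run-parameters file using the parameter keys from the tables
# above. The JSON format is an assumption for illustration only.
raw = """
{
  "component_name": "data_explorer",
  "explorer_output_path": "/data/exploration_results",
  "repo_name": "my_solution",
  "branch_name": "master",
  "commit_id": "<commit-id>",
  "validity_threshold": 95,
  "bins": 5
}
"""

params = json.load(io.StringIO(raw))
print(params["bins"])  # 5
```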