DataConnector


The data_connector component fetches data from different sources into a dataset and saves it to a specified output path.

Name

data_connector

Purpose

To fetch data from different sources using xpresso.ai data source connectivity

Usage Scenarios

  • It can be used at the beginning of the Exploratory data analysis pipeline to fetch data from different sources.

  • It can be used at the beginning of a Machine Learning Pipeline

Created By

xpresso.ai Team

Support e-mail

support@xpresso.ai

Binary / Source / Both Versions

Binary

Docker Image Reference

  • For non-Abzooba instances: dockerregistry.xpresso.ai/library/data_connector:2.2

  • For Abzooba sandbox instance: dockerregistrysb.xpresso.ai/library/data_connector:2.2

  • For Abzooba QA instance: xpresso.ai/library/data_connector:2.2

  • For Abzooba PROD instance: dockerregistryprod.xpresso.ai/library/data_connector:2.2

Type of component

pipeline_job

Usage Instructions

When Deploying Components:
  1. Specify the Mount Path (the Mount Path is a shared directory between the components in a pipeline, used for reading/writing data)

  2. Specify the Docker image referred to above in the ‘Custom Docker Image’ textbox

  3. Specify arguments as described in the ‘Deploy Solution Arguments’ section below; a sketch of these deployment inputs follows this list
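A minimal sketch of the deployment inputs, assuming the non-Abzooba registry (the mount path is only a placeholder):

  Mount Path:          /data
  Custom Docker Image: dockerregistry.xpresso.ai/library/data_connector:2.2
  Arguments:           as listed in the ‘Deploy Solution Arguments’ section below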

Result

Loads data into a dataset

Example

Assume that a dataset named ‘test.csv’ has to be loaded from NFS. Create a component of type ‘pipeline_job’, deploy it using the custom Docker image specified above, and create a pipeline using the component. To fetch the dataset, the following parameters must be specified when an experiment is run on the pipeline:

  Run Name: <provide a unique run name>
  Pipeline Version: <select the version of the pipeline you want to run>
  dataset-type: ‘structured’
  data-config-type: ‘FS’
  data-config-path: <path of the file you want to connect to>
  out-path: <path in the NFS mount where you want to store the data, e.g., ‘/data’>

The fetched dataset will be stored at the specified out-path.
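The same NFS source could also be described with a single -data-config argument instead of the separate data-config-* parameters; the path below is only illustrative:

  -data-config '{"type":"FS","path":"/path/to/test.csv"}'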

Deploy Solution Arguments:

Each argument below lists its parameter key (refer to the run-parameters section below), description, whether it is mandatory, whether a dynamic arg is required, and any comments.

-component-name
  Parameter key: component_name
  Description: The component name in the solution
  Mandatory: Yes
  Dynamic arg required: Yes

-connector-output-path
  Parameter key: connector_output_path
  Description: The path where data is saved
  Mandatory: Yes
  Dynamic arg required: Yes
  Comments: Absolute path for the output directory, starting from the mount path

-data-config
  Parameter key: data_config
  Description: Data configuration to load data from
  Mandatory: Yes (refer to the ‘Alternative arguments for -data-config’ section below)
  Dynamic arg required: Yes
  Comments: Takes the dict, as a string, required by the xpresso.ai Data Connectivity Library for data connection configuration. Whitespace is prohibited in this string, e.g. '{"type":"FS","path":"/path/to/data"}'. When providing the input in a JSON file, escape the double quotes with a backslash, e.g. "{\"type\":\"FS\",\"path\":\"/path/to/data\"}"

-dataset-type
  Parameter key: dataset_type
  Description: Dataset type
  Mandatory: Yes
  Dynamic arg required: Yes
  Comments: Only structured and unstructured datasets are supported (case insensitive)

-dataset-name
  Parameter key: dataset_name
  Description: Name of the dataset to be stored
  Mandatory: No
  Dynamic arg required: Yes

-project-name
  Parameter key: project_name
  Description: Name of the solution
  Mandatory: Yes
  Dynamic arg required: Yes

-description
  Parameter key: description
  Description: Description of the dataset
  Mandatory: No
  Dynamic arg required: Yes
  Comments: Whitespace is prohibited in this string

-created-by
  Parameter key: created_by
  Description: Created by
  Mandatory: No
  Dynamic arg required: Yes

-file-name
  Parameter key: file_name
  Description: The filename for the CSV file to be stored
  Mandatory: No
  Dynamic arg required: Yes
  Comments: The dataset.data attribute can optionally be stored as a CSV file in the mount path for future use. Note: only CSV output is supported
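As an illustration, a minimal deployment might pass the mandatory arguments as follows (all values are placeholders, and each argument would normally be paired with its dynamic arg as described in the Dynamic-args section below):

  -component-name data_connector
  -connector-output-path /data/output
  -data-config '{"type":"FS","path":"/path/to/test.csv"}'
  -dataset-type structured
  -project-name my_solution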

Alternative arguments for -data-config:

Specific combinations of these arguments become mandatory when the -data-config argument from the table above is not used. Refer to ‘Define Connection Parameters’ for a better understanding of the argument combinations. Each argument below lists its parameter key (refer to the run-parameters section below), description, whether a dynamic arg is required (refer to the Dynamic-args section below if Yes), and any comments.

-data-config-type
  Parameter key: data_config_type
  Description: Type of data source. Specify FS or DB
  Dynamic arg required: Yes

-data-config-data-source
  Parameter key: data_config_data_source
  Description: Special argument for a local/BigQuery connection. Specify the value as ‘Local’; not supported for ‘BigQuery’
  Dynamic arg required: Yes
  Comments: Refers to data saved in the mount path if the value is ‘Local’

-data-config-path
  Parameter key: data_config_path
  Description: Path of the file to load data from
  Dynamic arg required: Yes

-data-config-dsn
  Parameter key: data_config_dsn
  Description: Data Source Name
  Dynamic arg required: Yes

-data-config-table
  Parameter key: data_config_table
  Description: Name of the table
  Dynamic arg required: Yes

-data-config-columns
  Parameter key: data_config_columns
  Description: List of column names in a table that need to be fetched. Use ‘*’ to specify all columns
  Dynamic arg required: Yes

-data-config-options
  Parameter key: data_config_options
  Description: Extra keyword arguments, specified as key-value pairs, for better importing through files
  Dynamic arg required: Yes
  Comments: Supported for structured dataset-type only
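As a sketch, the single -data-config argument and the alternative arguments below describe the same file-system source (the path is illustrative):

  -data-config '{"type":"FS","path":"/path/to/data"}'

  -data-config-type FS
  -data-config-path /path/to/data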

Dynamic-args:

Specify a dynamic argument right after its static argument and check the ‘Dynamic’ checkbox. The value of this dynamic arg should be a placeholder string. This string will appear on the Run Experiment form, where the actual run-time value for its static argument should be filled in.

For example, if the static argument is -data-config, then its dynamic arg could be data_config. This data_config will appear as an input field on the Run Experiment form. The value of this input field can be {"type":"FS","path":"/path/to/data"}, which is the expected value for the -data-config arg.
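A minimal sketch of how this pairing might appear in the deployment argument list (the placeholder name is only illustrative):

  -data-config          (static argument)
  data_config           (dynamic argument, with the ‘Dynamic’ checkbox checked)

On the Run Experiment form, the data_config field then takes the actual value, e.g. {"type":"FS","path":"/path/to/data"}.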

Run-parameters (file or commit ID):

When loading parameters from a file or a data versioning repository, use the parameter keys mentioned above. For more details, refer to the Guide For Dynamically Loading Run Parameters From File Or Data Versioning Repository.
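A minimal sketch of what such a parameters file could contain, assuming a flat JSON layout keyed by the parameter keys listed above (the exact file format is described in the referenced guide; all values are placeholders):

  {
    "component_name": "data_connector",
    "connector_output_path": "/data/output",
    "data_config": "{\"type\":\"FS\",\"path\":\"/path/to/test.csv\"}",
    "dataset_type": "structured",
    "project_name": "my_solution"
  }

Note the escaped double quotes in the data_config value, as required when the configuration is supplied through a JSON file.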