DataConnector


The data_connector component fetches data from different sources into a dataset and saves it to a specified output path.

Name

data_connector

Purpose

To fetch data from different sources using xpresso.ai data source connectivity

Usage Scenarios

  • It can be used at the beginning of the Exploratory data analysis pipeline to fetch data from different sources.

  • It can be used at the beginning of a Machine Learning Pipeline

Created By

xpresso.ai Team

Support e-mail

support@xpresso.ai

Binary / Source / Both Versions

Binary

Docker Image Reference

  • For non-Abzooba instances: dockerregistry.xpresso.ai/library/data_connector:2.2

  • For Abzooba sandbox instance: dockerregistrysb.xpresso.ai/library/data_connector:2.2

  • For Abzooba QA instance: xpresso.ai/library/data_connector:2.2

  • For Abzooba PROD instance: dockerregistryprod.xpresso.ai/library/data_connector:2.2

Type of component

pipeline_job

Usage Instructions

When Deploying Components:
  1. Specify the Mount Path (the Mount Path is a shared directory between the components in a pipeline, used for reading/writing data)

  2. Specify the Docker image referred to above in the ‘Custom Docker Image’ textbox

  3. Specify arguments as described in the ‘Deploy Solution Arguments’ section below; a sketch of these deployment inputs follows this list
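A minimal sketch of the deployment inputs, assuming the non-Abzooba registry (the mount path is only a placeholder):

  Mount Path:          /data
  Custom Docker Image: dockerregistry.xpresso.ai/library/data_connector:2.2
  Arguments:           as listed in the ‘Deploy Solution Arguments’ section below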

Result

Loads data into a dataset

Example

Assume that a dataset named ‘test.csv’ has to be loaded from NFS. Create a component of type ‘pipeline_job’, deploy it using the custom Docker image specified above, and create a pipeline using the component. To fetch the dataset, the following parameters must be specified when an experiment is run on the pipeline:

  Run Name: <provide a unique run name>
  Pipeline Version: <select the version of the pipeline you want to run>
  dataset-type: ‘structured’
  data-config-type: ‘FS’
  data-config-path: <path of the file you want to connect to>
  out-path: <path in the NFS mount where you want to store the data, e.g., ‘/data’>

The fetched dataset will be stored at the specified out-path.
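The same NFS source could also be described with a single -data-config argument instead of the separate data-config-* parameters; the path below is only illustrative:

  -data-config '{"type":"FS","path":"/path/to/test.csv"}'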

Deploy Solution Arguments:

Each argument below lists its parameter key (refer to the run-parameters section below), description, whether it is mandatory, whether a dynamic arg is required, and any comments.

-component-name
  Parameter key: component_name
  Description: The component name in the solution
  Mandatory: Yes
  Dynamic arg required: Yes

-connector-output-path
  Parameter key: connector_output_path
  Description: The path where data is saved
  Mandatory: Yes
  Dynamic arg required: Yes
  Comments: Absolute path for the output directory, starting from the mount path

-data-config
  Parameter key: data_config
  Description: Data configuration to load data from
  Mandatory: Yes (refer to the ‘Alternative arguments for -data-config’ section below)
  Dynamic arg required: Yes
  Comments: Takes the dict, as a string, required by the xpresso.ai Data Connectivity Library for data connection configuration. Whitespace is prohibited in this string, e.g. '{"type":"FS","path":"/path/to/data"}'. When providing the input in a JSON file, escape the double quotes with a backslash, e.g. "{\"type\":\"FS\",\"path\":\"/path/to/data\"}"

-dataset-type
  Parameter key: dataset_type
  Description: Dataset type
  Mandatory: Yes
  Dynamic arg required: Yes
  Comments: Only structured and unstructured datasets are supported (case insensitive)

-dataset-name
  Parameter key: dataset_name
  Description: Name of the dataset to be stored
  Mandatory: No
  Dynamic arg required: Yes

-project-name
  Parameter key: project_name
  Description: Name of the solution
  Mandatory: Yes
  Dynamic arg required: Yes

-description
  Parameter key: description
  Description: Description of the dataset
  Mandatory: No
  Dynamic arg required: Yes
  Comments: Whitespace is prohibited in this string

-created-by
  Parameter key: created_by
  Description: Created by
  Mandatory: No
  Dynamic arg required: Yes

-file-name
  Parameter key: file_name
  Description: The filename for the CSV file to be stored
  Mandatory: No
  Dynamic arg required: Yes
  Comments: The dataset.data attribute can optionally be stored as a CSV file in the mount path for future use. Note: only CSV output is supported
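As an illustration, a minimal deployment might pass the mandatory arguments as follows (all values are placeholders, and each argument would normally be paired with its dynamic arg as described in the Dynamic-args section below):

  -component-name data_connector
  -connector-output-path /data/output
  -data-config '{"type":"FS","path":"/path/to/test.csv"}'
  -dataset-type structured
  -project-name my_solution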

Alternative arguments for -data-config:

Specific combinations of these arguments become mandatory when the -data-config argument from the table above is not used. Refer to ‘Define Connection Parameters’ for a better understanding of the argument combinations. Each argument below lists its parameter key (refer to the run-parameters section below), description, whether a dynamic arg is required (refer to the Dynamic-args section below if Yes), and any comments.

-data-config-type
  Parameter key: data_config_type
  Description: Type of data source. Specify FS or DB
  Dynamic arg required: Yes

-data-config-data-source
  Parameter key: data_config_data_source
  Description: Special argument for a local/BigQuery connection. Specify the value as ‘Local’; not supported for ‘BigQuery’
  Dynamic arg required: Yes
  Comments: Refers to data saved in the mount path if the value is ‘Local’

-data-config-path
  Parameter key: data_config_path
  Description: Path of the file to load data from
  Dynamic arg required: Yes

-data-config-dsn
  Parameter key: data_config_dsn
  Description: Data Source Name
  Dynamic arg required: Yes

-data-config-table
  Parameter key: data_config_table
  Description: Name of the table
  Dynamic arg required: Yes

-data-config-columns
  Parameter key: data_config_columns
  Description: List of column names in a table that need to be fetched. Use ‘*’ to specify all columns
  Dynamic arg required: Yes

-data-config-options
  Parameter key: data_config_options
  Description: Extra keyword arguments, specified as key-value pairs, for better importing through files
  Dynamic arg required: Yes
  Comments: Supported for structured dataset-type only
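As a sketch, the single -data-config argument and the alternative arguments below describe the same file-system source (the path is illustrative):

  -data-config '{"type":"FS","path":"/path/to/data"}'

  -data-config-type FS
  -data-config-path /path/to/data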

Dynamic-args:

Specify a dynamic argument right after its static argument and check the ‘Dynamic’ checkbox. The value of this dynamic arg should be a placeholder string. This string will appear on the Run Experiment form, where the actual run-time value for its static argument should be filled in.

For example, if the static argument is -data-config, then its dynamic arg could be data_config. This data_config will appear as an input field on the Run Experiment form. The value of this input field can be {"type":"FS","path":"/path/to/data"}, which is the expected value for the -data-config arg.
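A minimal sketch of how this pairing might appear in the deployment argument list (the placeholder name is only illustrative):

  -data-config          (static argument)
  data_config           (dynamic argument, with the ‘Dynamic’ checkbox checked)

On the Run Experiment form, the data_config field then takes the actual value, e.g. {"type":"FS","path":"/path/to/data"}.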

Run-parameters (file or commit ID):

When loading parameters from a file or a data versioning repository, use the parameter keys mentioned above. For more details, refer to the Guide For Dynamically Loading Run Parameters From File Or Data Versioning Repository.
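A minimal sketch of what such a parameters file could contain, assuming a flat JSON layout keyed by the parameter keys listed above (the exact file format is described in the referenced guide; all values are placeholders):

  {
    "component_name": "data_connector",
    "connector_output_path": "/data/output",
    "data_config": "{\"type\":\"FS\",\"path\":\"/path/to/test.csv\"}",
    "dataset_type": "structured",
    "project_name": "my_solution"
  }

Note the escaped double quotes in the data_config value, as required when the configuration is supplied through a JSON file.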