Define connection parameters


Earlier, we described connection parameters as key-value pairs defined in a dictionary object. This page explains how different combinations of those key-value pairs connect a developer to different data sources.

Each sub-class of the Dataset object is named after the kind of data it carries; for example, a StructuredDataset object holds structured data. In the same way, the set of connection parameters needed to reach a data source differs for each Dataset object. For ease of understanding, we will refer to the connection parameters as config from here on; config is a dictionary object.

The tables below list the keys and values that make up a config definition for each data source, grouped by Dataset object. For the list of data sources, refer to Supported data sources.

  • StructuredDataset object

Connection parameters for supported file systems:

Populates the class variable data with a Pandas DataFrame object built from the file specified in the value of the “path” key.

| Key | Type | Mandatory? | Value | Description |
|---|---|---|---|---|
| type | str | Yes | Should be set to “FS” | Type of data source |
| data_source | str | No | Should be set to “local” | Special key to connect to the local file system only. |
| path | str | Yes | Any existing path to a file on the Alluxio/local file system. For supported extensions, refer to the Pandas section in the options table at the end of this page. | Path of the file to be imported. Make sure the target data source is mounted. Contact the xpresso.ai administrator if there is an issue. |
| options | dict | No | Extra keyword arguments, specified as key-value pairs, for finer control when importing through the file system. | Refer to the Pandas section in the options table at the end of this page. |
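As a minimal sketch of how the table above translates into a config dictionary, the helper below assembles a file-system config and checks the mandatory “path” key. The function name build_fs_config and the example path are hypothetical and not part of xpresso.ai; only the key names and the values “FS” and “local” come from the table.

```python
def build_fs_config(path, local=False, options=None):
    """Assemble a file-system config dict.

    Hypothetical helper for illustration; it is not an xpresso.ai API.
    """
    # "path" is mandatory and must be a string.
    if not isinstance(path, str) or not path:
        raise ValueError('"path" is mandatory and must be a non-empty string')

    # "type" is mandatory and is always "FS" for file systems.
    config = {"type": "FS", "path": path}

    # "data_source" is optional and only used for the local file system.
    if local:
        config["data_source"] = "local"

    # "options" is optional; copy it so the caller's dict is not shared.
    if options:
        config["options"] = dict(options)
    return config


# Placeholder path; replace it with a file on a mounted data source.
config = build_fs_config("/path/to/your/data/file.txt", local=True, options={"sep": "|"})
print(config)
```

The resulting dictionary is what would be handed to import_dataset on a StructuredDataset object.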

Connection parameters for supported database systems:

Populates the class variable data with a Pandas DataFrame object built from the table specified in the value of the “table” key.

| Key | Type | Mandatory? | Value | Description |
|---|---|---|---|---|
| type | str | Yes | Should be set to “DB” | Type of data source |
| DSN | str | Yes | Data source name | Should refer to a data source registered with the Presto server. Contact the xpresso.ai system administrators to register a data source. |
| table | str | Yes | Name of the table to import data from. | Should refer to an existing table. |
| columns | str/list of str | Yes | Comma-separated list of columns to be imported (“*” will import all columns) | |
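The following sketch shows a database config with the four mandatory keys from the table above. The DSN, table, and column names are placeholders. The helper normalize_columns is hypothetical and only illustrates the two forms the “columns” key may take (a comma-separated string or a list of strings), assuming both are meant to describe the same selection; it does not reflect how xpresso.ai parses the value internally.

```python
def normalize_columns(columns):
    """Return the requested columns as a list of names.

    Hypothetical helper for illustration; it is not an xpresso.ai API.
    """
    if isinstance(columns, str):
        # Split a comma-separated string and drop surrounding whitespace.
        return [name.strip() for name in columns.split(",") if name.strip()]
    return list(columns)


config = {
    "type": "DB",                  # mandatory, always "DB" for databases
    "DSN": "my_registered_dsn",    # placeholder: a data source registered with Presto
    "table": "my_table",           # placeholder: an existing table
    "columns": "id, name, price",  # placeholder: "*" would import all columns
}

print(normalize_columns(config["columns"]))
print(normalize_columns(["id", "name", "price"]))
```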

  • UnstructuredDataset object

Connection parameters for supported file systems:

Populates the class variable data with a Pandas DataFrame object containing information about the files/directories specified in the value of the “path” key.

| Key | Type | Mandatory? | Value | Description |
|---|---|---|---|---|
| type | str | Yes | Should be set to “FS” | Type of data source |
| data_source | str | No | Should be set to “local” | Special key to connect to the local file system only. |
| path | str | Yes | Any existing path to a directory on the Alluxio/local file system. | The path of a folder can be given to import multiple files. Make sure the target data source is mounted. Contact the xpresso.ai administrator if there is an issue. |
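For an UnstructuredDataset object, the config points at a directory rather than a single file. The sketch below builds such a config and checks it against the mandatory keys listed in the table above; the directory path is a placeholder, and the check is only an illustration, not something xpresso.ai requires you to run.

```python
config = {
    "type": "FS",                         # mandatory
    "data_source": "local",               # optional; local file system only
    "path": "/path/to/your/data/folder",  # placeholder directory
}

# Keys marked "Yes" in the Mandatory? column of the table.
mandatory_keys = {"type", "path"}
missing_keys = sorted(mandatory_keys - config.keys())

if missing_keys:
    raise KeyError(f"config is missing mandatory keys: {missing_keys}")
print("config has all mandatory keys")
```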

  • DistributedStructuredDataset object

Connection parameters for HDFS:

Populates the class variable data with a Koalas DataFrame object built from the file specified in the value of the “path” key.

Limitation: For all Big Data solutions, the xpresso.ai Data Connectivity module supports importing data only through a registered HDFS. Please contact the xpresso.ai administrator to get your HDFS mounted.

| Key | Type | Mandatory? | Value | Description |
|---|---|---|---|---|
| type | String | Yes | Should be set to “FS” | Type of data source |
| dataset_type | String | Yes | Should be set to “distributed” | Special key to state the type of dataset as distributed |
| path | String | Yes | Path of the file to be imported. For supported extensions, refer to the Koalas section in the options table at the end of this page. | Path of the file to be imported. Make sure the file is present on the HDFS. Contact the xpresso.ai administrator if there is an issue. |
| options | dict | No | Extra keyword arguments, specified as key-value pairs, for finer control when importing through the file system. | Refer to the Koalas section in the options table at the end of this page. |
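A distributed config differs from the plain file-system config mainly in the extra mandatory key “dataset_type”. The sketch below shows one such config and a small check for the three mandatory keys. The function is_distributed_config and the HDFS path are hypothetical and used only for illustration.

```python
def is_distributed_config(config):
    """Check the three mandatory keys of a distributed config.

    Hypothetical helper for illustration; it is not an xpresso.ai API.
    """
    return (
        config.get("type") == "FS"
        and config.get("dataset_type") == "distributed"
        and isinstance(config.get("path"), str)
        and bool(config.get("path"))
    )


config = {
    "type": "FS",                      # mandatory
    "dataset_type": "distributed",     # mandatory; marks the dataset as distributed
    "path": "/path/on/hdfs/file.csv",  # placeholder: file on a registered HDFS
}

print(is_distributed_config(config))
```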

The following is a sample config. Define the key-value pairs described above in the same way:

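from xpresso.ai.core.data.automl.structured_dataset import StructuredDataset

# Create an empty dataset, then describe where its data lives
dataset = StructuredDataset()
config = {
    "type": "FS",                           # file system data source
    "path": "/path/to/your/data/file.txt",  # file to import
    "options": {"sep": "|"},                # extra keyword arguments for the import
}
dataset.import_dataset(config)
# Print the first rows of the Pandas DataFrame
dataset.data.head()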

[Figure: sample output, DataFrame representation]

Options table for supported file extensions:

The Data Connectivity library can also be used to create custom connectors for other data sources, e.g., Google BigQuery.
