Define connection parameters

Earlier, we described connection parameters as key-value pairs defined in a dictionary object. This page shows how different combinations of these key-value pairs connect a developer to different data sources.

Each sub-class of the Dataset object is named after the kind of data it carries; for example, a StructuredDataset object holds structured data. In the same way, the set of connection parameters for a data source differs for each Dataset object. For ease of understanding, we refer to the connection parameters as config from here on; config is a dictionary object.

The tables below list the key-value pairs of the config definition for each data source, grouped by Dataset object. Refer to Supported data sources for the full list of sources.

StructuredDataset object

Connection parameters for supported file systems:

Populates the class variable data as a Pandas DataFrame object built from the file specified in the value of the “path” key.

Key	Type	Mandatory?	Value	Description
type	str	Yes	Should be set to “FS”	Type of data source
data_source	str	No	Should be set to “local”	Special key to connect to the local file system only.
path	str	Yes	Any existing path to a file on the Alluxio/local file system. For supported extensions, refer to the Pandas section in the options table at the end of this page.	Path of the file to be imported. Make sure the target data source is mounted. Contact the xpresso.ai administrator if there is an issue.
options	dict	No	Extra keyword arguments, specified as key-value pairs, to fine-tune the import through the file system.	Refer to the Pandas section in the options table at the end of this page.
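As a minimal sketch of the table above, a file-system config could be assembled as follows. The helper name make_fs_config is hypothetical and not part of the xpresso.ai library, and the file path is a placeholder; only the keys and their values come from the table.

```python
from typing import Any, Dict, Optional


def make_fs_config(path: str, local: bool = False,
                   options: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build a file-system config dict (hypothetical helper, not an xpresso.ai API)."""
    # Mandatory keys: "type" and "path".
    config: Dict[str, Any] = {"type": "FS", "path": path}
    if local:
        # Optional key: connect to the local file system only.
        config["data_source"] = "local"
    if options:
        # Optional key: extra keyword arguments for the import.
        config["options"] = dict(options)
    return config


config = make_fs_config("/path/to/your/data/file.txt", local=True,
                        options={"sep": "|"})
print(config)
```

The resulting dictionary can be passed to import_dataset on a StructuredDataset object.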

Connection parameters for supported database systems:

Populates the class variable data as a Pandas DataFrame object built from the table specified in the value of the “table” key.

Key	Type	Mandatory?	Value	Description
type	str	Yes	Should be set to “DB”	Type of data source
DSN	str	Yes	Data source name	Should refer to a data source registered with the Presto server. Contact the xpresso.ai system administrators to register a data source.
table	str	Yes	Name of the table to import data from.	Should refer to an existing table.
columns	str/list of str	Yes	Comma-separated list of columns to be imported (“*” imports all columns)
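To illustrate the database table above, here is a sketch of a DB config together with a small check for the mandatory keys. The function missing_db_keys and the DSN, table, and column names are hypothetical placeholders, not values or functions provided by xpresso.ai.

```python
from typing import Any, Dict, List

# Mandatory keys taken from the table above.
MANDATORY_DB_KEYS = ("type", "DSN", "table", "columns")


def missing_db_keys(config: Dict[str, Any]) -> List[str]:
    """Return the mandatory keys absent from a DB config (hypothetical check)."""
    return [key for key in MANDATORY_DB_KEYS if key not in config]


config = {
    "type": "DB",
    "DSN": "sales_dsn",                  # placeholder: a registered data source name
    "table": "orders",                   # placeholder: an existing table
    "columns": ["order_id", "amount"],   # or "*" to import all columns
}

print(missing_db_keys(config))          # []
print(missing_db_keys({"type": "DB"}))  # ['DSN', 'table', 'columns']
```

Running such a check before calling import_dataset surfaces a missing key early, without waiting for the data source to respond.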

UnstructuredDataset object

Connection parameters for supported file systems:

Populates the class variable data as a Pandas DataFrame object holding information about the files/directories specified in the value of the “path” key.

Key	Type	Mandatory?	Value	Description
type	str	Yes	Should be set to “FS”	Type of data source
data_source	str	No	Should be set to “local”	Special key to connect to the local file system only.
path	str	Yes	Any existing path to a directory on the Alluxio/local file system.	A folder path can be given to import multiple files. Make sure the target data source is mounted. Contact the xpresso.ai administrator if there is an issue.
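Because the “path” key must point to an existing directory, it can help to verify the path before building the config. The sketch below assumes the directory is visible from the machine running the code (for example, through a mount); make_unstructured_config is a hypothetical helper, not an xpresso.ai function.

```python
import os
from typing import Dict


def make_unstructured_config(directory: str, local: bool = False) -> Dict[str, str]:
    """Build an unstructured FS config (hypothetical helper, not an xpresso.ai API)."""
    # Assumption: the directory can be checked from this machine.
    if not os.path.isdir(directory):
        raise NotADirectoryError(f"not an existing directory: {directory}")
    config = {"type": "FS", "path": directory}
    if local:
        # Optional key: connect to the local file system only.
        config["data_source"] = "local"
    return config


# The current working directory stands in for a real data folder.
config = make_unstructured_config(os.getcwd(), local=True)
print(config)
```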

DistributedStructuredDataset object

Connection parameters for HDFS:

Populates the class variable data as a Koalas DataFrame object built from the file specified in the value of the “path” key.

Limitation: For all Big Data solutions, the xpresso.ai Data Connectivity module supports importing data only through a registered HDFS. Please contact the xpresso.ai administrator to get your HDFS mounted.

Key	Type	Mandatory?	Value	Description
type	String	Yes	Should be set to “FS”	Type of data source
dataset_type	String	Yes	Should be set to “distributed”	Special key to state the type of dataset as distributed
path	String	Yes	Path of the file to be imported. For supported extensions, refer to the Koalas section in the options table at the end of this page.	Path of the file to be imported. Make sure the file is present on the HDFS. Contact the xpresso.ai administrator if there is an issue.
options	dict	No	Extra keyword arguments, specified as key-value pairs, to fine-tune the import through the file system.	Refer to the Koalas section in the options table at the end of this page.
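The distributed table above can also be written down as a small specification and used to check a config before importing. This is only a sketch: DISTRIBUTED_SPEC and check_distributed_config are hypothetical names, and the file path is a placeholder.

```python
from typing import Any, Dict, List

# Key -> (expected type, mandatory?, required value or None), per the table above.
DISTRIBUTED_SPEC = {
    "type": (str, True, "FS"),
    "dataset_type": (str, True, "distributed"),
    "path": (str, True, None),
    "options": (dict, False, None),
}


def check_distributed_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in config (hypothetical check)."""
    problems = []
    for key, (expected_type, mandatory, fixed_value) in DISTRIBUTED_SPEC.items():
        if key not in config:
            if mandatory:
                problems.append(f"missing mandatory key: {key}")
        elif not isinstance(config[key], expected_type):
            problems.append(f"{key} must be of type {expected_type.__name__}")
        elif fixed_value is not None and config[key] != fixed_value:
            problems.append(f"{key} must be set to {fixed_value!r}")
    return problems


config = {
    "type": "FS",
    "dataset_type": "distributed",
    "path": "/path/to/your/data/file.csv",  # placeholder path on the registered HDFS
}
print(check_distributed_config(config))  # []
print(check_distributed_config({"type": "DB"}))
```

An empty list means the config has every mandatory key with the expected type and value; it does not confirm that the file exists on the HDFS.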

The following is a sample config. Define the key-value pairs listed above along the lines of this sample:

from xpresso.ai.core.data.automl.structured_dataset import StructuredDataset

# Create an empty dataset, then import the file described by config
dataset = StructuredDataset()
config = {
    "type": "FS",
    "options": {"sep": "|"},
    "path": "/path/to/your/data/file.txt",
}
dataset.import_dataset(config)
# Show the first rows of the Pandas DataFrame
dataset.data.head()

Sample output - DataFrame representation

Options table for supported file extensions:

The Data Connectivity library can also be used to create custom connectors for other data sources, e.g., Google BigQuery.

What do you want to do next?

Connect to Google BigQuery