Define Dataset object to import DataFrame

A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure. xpresso.ai Data Management uses two open source libraries to load data as a DataFrame:

Pandas - the de facto standard single-node DataFrame implementation in Python (for small datasets)
Koalas - a multi-node DataFrame implementation for big data processing on top of Apache Spark (for distributed datasets)

As explained briefly in this documentation, xpresso.ai Data Management is built around a concept called a Dataset. The Dataset sub-class determines whether the underlying data is a Pandas DataFrame or a Koalas DataFrame. The table below maps each Dataset class to the DataFrame it supports.

Dataset	DataFrame
StructuredDataset	Pandas DataFrame of the file in given path
UnstructuredDataset	Pandas DataFrame of file properties in given path
DistributedStructuredDataset	Koalas DataFrame of the file in given path
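
The mapping in the table can also be written as a small lookup, which is handy when code needs to know in advance whether it will receive a Pandas or a Koalas DataFrame. This is only a sketch: the `DATASET_TO_DATAFRAME` dict and the `dataframe_kind()` helper are hypothetical names introduced here for illustration and are not part of xpresso.ai.

```python
# Hypothetical lookup, copied from the table above:
# Dataset class name -> kind of DataFrame it holds.
DATASET_TO_DATAFRAME = {
    "StructuredDataset": "Pandas DataFrame",
    "UnstructuredDataset": "Pandas DataFrame",
    "DistributedStructuredDataset": "Koalas DataFrame",
}


def dataframe_kind(dataset_class_name: str) -> str:
    """Return the DataFrame kind for a Dataset class name (illustrative helper)."""
    try:
        return DATASET_TO_DATAFRAME[dataset_class_name]
    except KeyError:
        # Only the three classes listed in the table are known here.
        raise ValueError(f"Unknown Dataset class: {dataset_class_name!r}") from None
```

For an existing object, the class name can be obtained with `type(dataset).__name__` and passed to the helper.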

Each of the Dataset classes listed above has a method called import_dataset().

import_dataset()

This method imports a DataFrame from the data source specified by the connection parameters. The table below lists the method parameters:

Name	Type	Description	Mandatory?	Comments
user_config	dict	data source connection parameters	Yes	Explained here in detail

The user_config parameter is a dict that holds the data source connection parameters.

The following code snippet defines a StructuredDataset object and imports data into it:

from xpresso.ai.core.data.automl.structured_dataset import StructuredDataset
dataset = StructuredDataset()
config = {
    # Some connection parameters as a dict
}
dataset.import_dataset(config)
# Print the first rows of the Pandas DataFrame
print(dataset.data.head())

The following code snippet defines an UnstructuredDataset object and imports data into it:

from xpresso.ai.core.data.automl.unstructured_dataset import UnstructuredDataset
dataset = UnstructuredDataset()
config = {
    # Some connection parameters as a dict
}
dataset.import_dataset(config)
# Print the first rows of the Pandas DataFrame
print(dataset.data.head())

The following code snippet defines a DistributedStructuredDataset object and imports data into it:

from xpresso.ai.core.data.distributed.automl.distributed_structured_dataset import DistributedStructuredDataset
dataset = DistributedStructuredDataset()
config = {
    # Some connection parameters as a dict
}
dataset.import_dataset(config)
# Print the first rows of the Koalas DataFrame
print(dataset.data.head())

Sample output - DataFrame representation
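
Since all three Dataset classes expose the same import_dataset() method and the same data attribute, the import-and-preview steps can be wrapped in one helper that works for any of them. The following is a minimal sketch under that assumption: `import_and_preview()` is a hypothetical helper, not an xpresso.ai function, and it relies only on the calls shown in the snippets above plus the `head(n)` method that both Pandas and Koalas DataFrames provide.

```python
from typing import Any, Dict


def import_and_preview(dataset: Any, user_config: Dict[str, Any], rows: int = 5) -> Any:
    """Import data into `dataset` and return its first `rows` rows.

    `dataset` is assumed to be any object with an import_dataset() method
    and a `data` attribute, as in the snippets above (hypothetical helper).
    """
    if not isinstance(user_config, dict):
        # import_dataset() expects the connection parameters as a dict.
        raise TypeError("user_config must be a dict of connection parameters")
    dataset.import_dataset(user_config)
    # head(n) returns the first n rows for both Pandas and Koalas DataFrames.
    return dataset.data.head(rows)
```

A call would look like `import_and_preview(StructuredDataset(), config, rows=10)`, where `config` holds the connection parameters for the data source.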

What do you want to do next?

Define connection parameters