Define Dataset object to import DataFrame


A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure. xpresso.ai Data Management uses two open-source libraries to load data as a DataFrame:

  • Pandas - the de facto standard (single-node) DataFrame implementation in Python, used for small datasets

  • Koalas - a multi-node implementation for big data processing on top of Apache Spark, used for distributed datasets

As explained briefly in this documentation, xpresso.ai Data Management is built around a concept called a Dataset. The Dataset subclass determines whether the underlying data is a Pandas DataFrame or a Koalas DataFrame. The table below maps each Dataset class to the DataFrame it supports.

Dataset                        DataFrame
-----------------------------  -------------------------------------------------
StructuredDataset              Pandas DataFrame of the file in given path
UnstructuredDataset            Pandas DataFrame of file properties in given path
DistributedStructuredDataset   Koalas DataFrame of the file in given path
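
The mapping above can also be written as a small lookup, which may be useful when code needs to know which DataFrame library a given Dataset class produces. This is only an illustrative sketch: the class names come from the table, while DATAFRAME_BACKEND and backend_for are hypothetical names and are not part of the xpresso.ai API.

```python
# Hypothetical lookup mirroring the table above; not part of xpresso.ai.
DATAFRAME_BACKEND = {
    "StructuredDataset": "pandas",
    "UnstructuredDataset": "pandas",
    "DistributedStructuredDataset": "koalas",
}


def backend_for(dataset_class_name: str) -> str:
    """Return the DataFrame library used by the named Dataset class."""
    try:
        return DATAFRAME_BACKEND[dataset_class_name]
    except KeyError:
        # Unknown names are reported explicitly instead of guessing a backend.
        raise ValueError(f"Unknown Dataset class: {dataset_class_name!r}") from None


print(backend_for("DistributedStructuredDataset"))  # koalas
```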

Each of the Dataset classes above has a method called import_dataset().

import_dataset()

This method imports a DataFrame from the data source specified by the connection parameters. The table below lists the method parameters:

Name         Type  Description                        Mandatory?  Comments
-----------  ----  ---------------------------------  ----------  ------------------------
user_config  dict  data source connection parameters  Yes         Explained here in detail

The user_config parameter indicates data source connection parameters.

The following code snippet defines a StructuredDataset object:

from xpresso.ai.core.data.automl.structured_dataset import StructuredDataset
dataset = StructuredDataset()
config = {
    # Some connection parameters as a dict
}
dataset.import_dataset(config)
# Print the Pandas DataFrame
dataset.data.head()

The following code snippet defines an UnstructuredDataset object:

from xpresso.ai.core.data.automl.unstructured_dataset import UnstructuredDataset
dataset = UnstructuredDataset()
config = {
    # Some connection parameters as a dict
}
dataset.import_dataset(config)
# Print the Pandas DataFrame
dataset.data.head()

The following code snippet defines a DistributedStructuredDataset object:

from xpresso.ai.core.data.distributed.automl.distributed_structured_dataset import DistributedStructuredDataset
dataset = DistributedStructuredDataset()
config = {
    # Some connection parameters as a dict
}
dataset.import_dataset(config)
# Print the Koalas DataFrame
dataset.data.head()

Sample output - DataFrame representation [image: image1]
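
The three snippets above share one pattern: create the Dataset object, call import_dataset() with the connection parameters, and read the DataFrame from the data attribute. A minimal sketch of that pattern as a reusable helper follows. The helper name load_dataframe and the stand-in class FakeDataset are hypothetical; FakeDataset exists only so the sketch runs without xpresso.ai installed, and it stores a plain dict rather than a real DataFrame.

```python
from typing import Any, Callable, Dict


def load_dataframe(dataset_factory: Callable[[], Any], user_config: Dict[str, Any]) -> Any:
    """Create a Dataset, import its data, and return the resulting DataFrame.

    Hypothetical helper: it relies only on import_dataset() and the
    data attribute used in the snippets above.
    """
    # user_config is documented as a dict, so fail early on anything else.
    if not isinstance(user_config, dict):
        raise TypeError("user_config must be a dict of connection parameters")
    dataset = dataset_factory()
    dataset.import_dataset(user_config)
    return dataset.data


class FakeDataset:
    """Stand-in for a real Dataset class, used only to make the sketch runnable."""

    def __init__(self) -> None:
        self.data = None

    def import_dataset(self, user_config: Dict[str, Any]) -> None:
        # A real Dataset would read from the data source here.
        self.data = {"loaded_with": dict(user_config)}


frame = load_dataframe(FakeDataset, {"example_key": "example_value"})
print(frame)
```

With xpresso.ai available, the same helper would presumably be called with one of the real classes, for example load_dataframe(StructuredDataset, config).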
