Define Dataset object to import DataFrame
A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure. xpresso.ai Data Management uses two open source libraries to load data as a DataFrame:
Pandas - the de facto standard single-node DataFrame implementation in Python (for small datasets)
Koalas - a multi-node implementation for big data processing on top of Apache Spark (for distributed datasets)
As explained briefly in this documentation, xpresso.ai Data Management is based on a concept called Dataset. The sub-class of Dataset that you use determines whether the underlying data is a Pandas DataFrame or a Koalas DataFrame. The table below maps each Dataset object to the DataFrame it supports.
Dataset | DataFrame |
---|---|
StructuredDataset | Pandas DataFrame of the file in given path |
UnstructuredDataset | Pandas DataFrame of file properties in given path |
DistributedStructuredDataset | Koalas DataFrame of the file in given path |
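The mapping in the table above can be restated as a small lookup, which may be handy when a script has to decide which DataFrame library it will be dealing with. This is only an illustrative sketch: `DATAFRAME_BACKEND` and `backend_for` are hypothetical names introduced here and are not part of the xpresso.ai API.

```python
# Hypothetical lookup mirroring the Dataset/DataFrame table; not an xpresso.ai API.
DATAFRAME_BACKEND = {
    "StructuredDataset": "pandas",
    "UnstructuredDataset": "pandas",
    "DistributedStructuredDataset": "koalas",
}


def backend_for(dataset_class_name: str) -> str:
    """Return the DataFrame library backing the named Dataset sub-class."""
    try:
        return DATAFRAME_BACKEND[dataset_class_name]
    except KeyError:
        raise ValueError(
            f"Unknown Dataset sub-class: {dataset_class_name!r}"
        ) from None


print(backend_for("DistributedStructuredDataset"))
```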
Each of the Dataset objects listed above has a method called import_dataset().
import_dataset()
This method imports a DataFrame from the data source specified by the connection parameters. The table below lists the method parameters:
Name | Type | Description | Mandatory? | Comments |
---|---|---|---|---|
user_config | dict | data source connection parameters | Yes | Explained here in detail |
The user_config parameter holds the data source connection parameters.
The following code snippet defines a StructuredDataset object:
from xpresso.ai.core.data.automl.structured_dataset import StructuredDataset
dataset = StructuredDataset()
config = {
    # Some connection parameters as a dict
}
dataset.import_dataset(config)
# Preview the first rows of the Pandas DataFrame
dataset.data.head()
The following code snippet defines an UnstructuredDataset object:
from xpresso.ai.core.data.automl.unstructured_dataset import UnstructuredDataset
dataset = UnstructuredDataset()
config = {
    # Some connection parameters as a dict
}
dataset.import_dataset(config)
# Preview the first rows of the Pandas DataFrame of file properties
dataset.data.head()
The following code snippet defines a DistributedStructuredDataset object:
from xpresso.ai.core.data.distributed.automl.distributed_structured_dataset import DistributedStructuredDataset
dataset = DistributedStructuredDataset()
config = {
    # Some connection parameters as a dict
}
dataset.import_dataset(config)
# Preview the first rows of the Koalas DataFrame
dataset.data.head()
Sample output - DataFrame representation