Connect to various data sources


There are two major kinds of data sources that store data:

  • Database systems (hereafter referred to as DB)

  • File systems (hereafter referred to as FS)

It is the responsibility of xpresso.ai Data Source Connectivity to gain access to the correct data source with minimal steps.

Because the two kinds of data sources are organized very differently, xpresso.ai connects to each kind in its own way. As a result, the connection parameters for a database system differ considerably from those for a file system. Within each kind, however, the parameters are nearly identical: defining a connection to one supported database system is much like defining a connection to any other, and the same holds for file systems.

Supported data sources:

  • Database systems - RDBMS / NoSQL databases such as MySQL, Microsoft SQL Server, MongoDB and Cassandra.

  • File systems - File systems such as Local File System, Network File System (NFS), Hadoop Distributed File System (HDFS), Amazon Simple Storage Service (S3), etc.

Connection parameters are a few key-value pairs defined in a dictionary object. Click here to know more.

FAQs

  1. How does the xpresso.ai Data Source Connectivity library connect to the database systems mentioned above?

  2. How does the xpresso.ai Data Source Connectivity library connect to the file systems mentioned above?

The answer to both questions is that xpresso.ai uses two open-source tools, PrestoDB and Alluxio, respectively. The sections below describe the role of each within xpresso.ai.

Connecting to database systems:

PrestoDB is a distributed SQL query engine designed to query large datasets distributed over one or more heterogeneous data sources. xpresso.ai has a Presto server hosted in each of its environments. These servers use catalogs to store information about the various databases, each under its supported connector. Catalogs must be defined before a Presto server is started.

Presto-specific terms:

  • Connector - A driver for a database that interacts with a resource using a standard API.

  • Catalog - A Presto catalog contains schemas and references a database via a connector.

  • Schema - Schemas are a way to organize tables. Together, a catalog and schema define a set of tables that can be queried. A schema corresponds to a database in a DBMS. For example, the default MySQL database called test is referred to as a schema in Presto.
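
The relationship between these terms can be sketched in a few lines. Presto addresses a table by a dotted name of the form catalog.schema.table. The helper below is illustrative only and is not part of xpresso.ai or the Presto client; the catalog name "mysql" and the table name "customers" are hypothetical.

```python
def qualified_table_name(catalog: str, schema: str, table: str) -> str:
    """Join catalog, schema and table into Presto's dotted table reference."""
    parts = (catalog, schema, table)
    if not all(parts):
        raise ValueError("catalog, schema and table must all be non-empty")
    return ".".join(parts)


# Hypothetical names: a catalog called "mysql" that exposes the MySQL
# database "test" as a schema, containing a table called "customers".
name = qualified_table_name("mysql", "test", "customers")
query = f"SELECT * FROM {name} LIMIT 10"
print(query)
```

A query written this way names the catalog and schema explicitly, so it does not depend on any default catalog or schema configured for the session.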

Client calls to Presto server:

xpresso.ai uses PrestoDB’s Python client library. Using it, xpresso.ai imports tables from a remote database in the form of a DataFrame. This requires defining connection parameters. Connection parameters are a few key-value pairs defined in a dictionary object. Click here to know more.

Because catalogs must be written before the service starts, it is not possible to connect to a remote database dynamically through API calls or client libraries. Hence, tables cannot be mounted, unmounted or listed under PrestoDB while the service is running. Contact your xpresso.ai administrator to get your remote database mounted on an environment.

Go to http://<presto server IP>:<Presto server port> to check whether the service is running in your environment.
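
The same check can be scripted. The sketch below uses only the Python standard library and assumes that a successful HTTP response from the address above means the service is up. The function name is_service_up is hypothetical and is not part of xpresso.ai.

```python
import http.client
import urllib.error
import urllib.request


def is_service_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP GET on `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, DNS failure, HTTP error status
        # or a malformed URL all count as "not reachable".
        return False


# Example call (replace host and port with your environment's values):
# is_service_up("http://presto.example.com:8080")
```

The host and port in the commented call are placeholders; use the values for your own environment.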

Sample output - UI

image1

Connecting to file systems:

Alluxio is an open source software solution that connects analytics applications to heterogeneous data sources through a data orchestration layer that sits between compute and storage. It runs on commodity hardware, creating a shared data layer that abstracts the files or objects in underlying persistent storage systems. Applications connect to Alluxio via a standard interface, accessing data from a single unified source. xpresso.ai has an Alluxio server hosted in each of its environments. These servers are configured to cache the file systems mounted on them, and each runs a Unified File System.

Alluxio-specific terms:

  • Unified File System (UFS) - The Alluxio file system, at whose root level directories hosting different understorages can be mounted and unmounted.

  • Caching - Alluxio supports caching of files held in the underlying persistent storage.

  • Understorage - A file system mounted on Alluxio.
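
To make the idea of a unified file system concrete, the sketch below models the mount points as a plain dictionary and maps a path in the unified namespace back to the understorage URI that backs it. This is a conceptual illustration only, not how Alluxio is implemented; the function name and the bucket and directory names are hypothetical.

```python
from typing import Dict, Optional


def resolve_understorage(alluxio_path: str, mounts: Dict[str, str]) -> Optional[str]:
    """Map a path in the unified namespace to the URI that backs it.

    `mounts` maps a mount point (e.g. "/mount_s3") to the understorage URI
    mounted there. The longest matching mount point wins.
    """
    best = None
    for mount_point in mounts:
        prefix = mount_point.rstrip("/")
        if alluxio_path == prefix or alluxio_path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best.rstrip("/")):
                best = mount_point
    if best is None:
        return None  # the path is not under any mount point
    remainder = alluxio_path[len(best.rstrip("/")):]
    return mounts[best].rstrip("/") + remainder


# Hypothetical mount table: two understorages under one namespace.
mounts = {
    "/mount_s3": "s3://example-bucket/data",
    "/mount_nfs": "file:///mnt/nfs/data",
}
print(resolve_understorage("/mount_s3/2020/sales.csv", mounts))
```

An application sees only paths such as /mount_s3/2020/sales.csv; which storage system actually holds the file is decided by the mount point the path falls under.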

Client calls to Alluxio server:

xpresso.ai uses Alluxio’s Python client library. Using it, xpresso.ai imports files from a remote file system in the form of a DataFrame. This requires defining connection parameters. Connection parameters are a few key-value pairs defined in a dictionary object. Click here to know more.

The following xpresso.ai methods can be used to mount, unmount and list directories in Alluxio’s UFS while the service is running.

mount_fs()

This method mounts a remote file system under Alluxio’s UFS. See the tables below for its parameters and keyword arguments.

| Parameter | Type | Mandatory? | Description |
|-----------|------|------------|-------------|
| path | str | Yes | Path on Alluxio FS under which you will find your remote file system |
| src | str | Yes | URI with path to file/directory on remote file system to be mounted on Alluxio FS. |

| Keyword argument | Type | Mandatory? | Default | Description |
|------------------|------|------------|---------|-------------|
| properties | dict | No | None | A dictionary mapping property key strings to value strings |
| read_only | bool | No | None | Whether the mount point is read-only. |
| shared | bool | No | None | Whether the mount point is shared with all Alluxio users. |

The following code snippet shows how to mount a remote FS:

from xpresso.ai.core.data.connections import Connector
# Mount S3 Bucket
properties = { "aws.accessKeyId": "<accessKeyId1>", "aws.secretKey": "<secretKey1>" }
Connector.getconnector("FS").mount_fs("/mount_s3", "s3://<S3_BUCKET>/<S3_DIRECTORY>", properties=properties)
# Check the result on Alluxio UI - http://<Alluxio server IP>:19999/browse?path=/mount_s3
# Mount NFS
Connector.getconnector("FS").mount_fs("/mount_nfs", "file://<IP>/mnt/nfs/data")
# Check the result on Alluxio UI - http://<Alluxio server IP>:19999/browse?path=/mount_nfs
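
The optional keyword arguments from the table above can be assembled before the call is made. The helper below is a minimal sketch that only prepares and checks the inputs. It is not part of xpresso.ai, the name build_mount_call is hypothetical, and the two validation rules (an absolute Alluxio path and a URI-style source) are assumptions rather than documented requirements.

```python
from typing import Any, Dict, Optional, Tuple


def build_mount_call(
    path: str,
    src: str,
    properties: Optional[Dict[str, str]] = None,
    read_only: Optional[bool] = None,
    shared: Optional[bool] = None,
) -> Tuple[Tuple[str, str], Dict[str, Any]]:
    """Check mount_fs() inputs and return them as (args, kwargs)."""
    # Assumption: the mount point is an absolute path on Alluxio FS.
    if not path.startswith("/"):
        raise ValueError("path must be an absolute path on Alluxio FS")
    # Assumption: the source is a URI such as s3://... or file://...
    if "://" not in src:
        raise ValueError("src must be a URI such as s3://... or file://...")
    kwargs: Dict[str, Any] = {}
    if properties is not None:
        # The table describes properties as string keys mapped to string values.
        kwargs["properties"] = {str(k): str(v) for k, v in properties.items()}
    if read_only is not None:
        kwargs["read_only"] = bool(read_only)
    if shared is not None:
        kwargs["shared"] = bool(shared)
    return (path, src), kwargs


args, kwargs = build_mount_call(
    "/mount_s3", "s3://<S3_BUCKET>/<S3_DIRECTORY>", read_only=True
)
# The result would then be passed on as:
# Connector.getconnector("FS").mount_fs(*args, **kwargs)
```

Keyword arguments left at None are omitted from kwargs, so the library's own defaults apply.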

unmount_fs()

This method unmounts a remote file system from Alluxio’s UFS. See the table below for its parameters.

| Parameter | Type | Mandatory? | Description |
|-----------|------|------------|-------------|
| path | str | Yes | Path on Alluxio FS under which you will find your remote file system |

The following code snippet shows how to unmount a remote FS:

from xpresso.ai.core.data.connections import Connector
# Unmount S3 Bucket
Connector.getconnector("FS").unmount_fs("/mount_s3")
# Check the result on Alluxio UI - http://<Alluxio server IP>:19999/browse?path=/
# Unmount NFS
Connector.getconnector("FS").unmount_fs("/mount_nfs")
# Check the result on Alluxio UI - http://<Alluxio server IP>:19999/browse?path=/

list_fs()

This method lists a directory on a remote file system mounted under Alluxio’s UFS. See the table below for its parameters.

| Parameter | Type | Mandatory? | Description |
|-----------|------|------------|-------------|
| path | str | Yes | Path on Alluxio FS under which you will find your remote file system |

The following code snippet shows how to list a directory on a remote FS:

from xpresso.ai.core.data.connections import Connector
# list directory under S3 Bucket
Connector.getconnector("FS").list_fs("/mount_s3")
# Check the result on Alluxio UI - http://<Alluxio server IP>:19999/browse?path=/mount_s3
# list directory under NFS
Connector.getconnector("FS").list_fs("/mount_nfs")
# Check the result on Alluxio UI - http://<Alluxio server IP>:19999/browse?path=/mount_nfs

Go to http://<Alluxio server IP>:19999 to check whether the service is running in your environment.
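
The UI addresses used in the snippets above all follow one pattern, which can be generated rather than typed by hand. The helper below is a small sketch using the standard library. The function name is hypothetical, and the default port 19999 is taken from the URLs in this document.

```python
from urllib.parse import urlencode


def alluxio_browse_url(host: str, path: str = "/", port: int = 19999) -> str:
    """Build the Alluxio web UI URL that browses `path`."""
    # Keep "/" unescaped so the result matches the URLs shown in this document.
    query = urlencode({"path": path}, safe="/")
    return f"http://{host}:{port}/browse?{query}"


# Hypothetical host name; replace it with your Alluxio server IP.
print(alluxio_browse_url("alluxio.example.com", "/mount_s3"))
```

Calling the function without a path gives the root of the unified file system, which is the page to check after an unmount.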

Sample output - UI

image2

What do you want to do next?

Define a Dataset object to import a file/table as a DataFrame.