Connecting to Google BigQuery¶
The xpresso.ai Data Connectivity library can be used to fetch tables from Google BigQuery.
BigQuery is Google’s serverless, highly scalable enterprise data warehouse. It lets you query and manage relational tables to produce meaningful insights using familiar SQL, without the need for a database administrator.
Prerequisites for using Google BigQuery through the xpresso.ai Data Connectivity module:

- Path to a private key file (JSON/P12) created using a service account on the GCP stack. Ask a person with the “Owner” role to grant you the appropriate access permissions on the service account before generating a key. Download this key and note its location in your local filesystem.
- The appropriate Project ID.
- The dataset containing the table to be fetched from BigQuery.
- The name of the required table.
The objective here is to make a table on Google BigQuery readily available on the xpresso.ai platform for further use. Maintaining the table and its schema is the sole responsibility of the person with the “Owner” IAM role on the GCP stack.
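For reference, a service-account private key of the kind described in the prerequisites can be generated from the command line with the gcloud CLI; a minimal sketch, assuming the service account already exists (the account name, project ID, and output path below are placeholders, not values from this document):

```shell
# Create a JSON private key for an existing service account and save it locally.
# SERVICE_ACCOUNT_NAME and PROJECT_ID are placeholders for your own values.
gcloud iam service-accounts keys create ~/keys/bq-key.json \
    --iam-account=SERVICE_ACCOUNT_NAME@PROJECT_ID.iam.gserviceaccount.com
```

The output path passed to `keys create` is the location you will later supply as `cred_path` in the DSN dictionary.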
Fetch table - import_dataset()
The import_dataset() method of the StructuredDataset class in the xpresso.ai EDA module is used to fetch data from GCP BigQuery. It takes a dictionary object as its parameter; this key-value object defines the data source and acts as a DSN for a specific table.
The following keys are required to import a specific table from BigQuery:
| Name | Type | Description | Mandatory | Comments |
|---|---|---|---|---|
| type | string | Type of data source needed: database or filesystem (DB/FS). | Yes | Specify “DB”, since GCP BigQuery is an RDBMS |
| data_source | string | States the specific data_source. | Yes | Specify “BigQuery” |
| cred_path | string | Location of the private key downloaded from the GCP console. | Yes | Provide an absolute path to the file |
| project_id | string | Project ID of the targeted table. | Yes | The project ID of the private key and that of the target table can differ |
| dataset | string | Dataset under the project ID that contains the targeted table. | Yes | |
| table | string | Target table name. | Yes | |
| columns | string / list of strings | List of column names required from the table. | Yes | Use “*” to fetch all columns |
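Since all of the keys above are mandatory, it can be useful to sanity-check the DSN dictionary before handing it to import_dataset(). A minimal sketch of such a check — the validate_bigquery_config helper below is illustrative and not part of the xpresso.ai library:

```python
# Minimal sketch: validate a BigQuery DSN dictionary before passing it to
# StructuredDataset.import_dataset(). This helper is illustrative only.
REQUIRED_KEYS = {"type", "data_source", "cred_path",
                 "project_id", "dataset", "table", "columns"}

def validate_bigquery_config(config: dict) -> list:
    """Return a list of problems found in the DSN dictionary (empty if OK)."""
    problems = [f"missing key: {key}"
                for key in sorted(REQUIRED_KEYS - config.keys())]
    if config.get("type") not in (None, "DB"):
        problems.append('type must be "DB" for BigQuery')
    if config.get("data_source") not in (None, "BigQuery"):
        problems.append('data_source must be "BigQuery"')
    return problems

config = {
    "type": "DB",
    "data_source": "BigQuery",
    "cred_path": "/path/to/private/key.json",  # placeholder values below
    "project_id": "my-gcp-project",
    "dataset": "my_dataset",
    "table": "my_table",
    "columns": "*",
}
print(validate_bigquery_config(config))  # → []
```

A check like this catches a missing key or a misspelled data_source locally, before any call is made to GCP.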
Sample code to fetch data from the stackoverflow public dataset’s posts_questions table:
from xpresso.ai.core.data.automl.structured_dataset import StructuredDataset

# Config object specifying the DSN
config = {
    "type": "DB",
    "data_source": "BigQuery",
    "cred_path": "/path/to/private/key/JSON/file",
    "project_id": "Project ID of the required table",
    "dataset": "Dataset containing the required table",
    "table": "required table",
    "columns": "*"
}

# Create a StructuredDataset instance
dataset = StructuredDataset()

# Fetch data from GCP BigQuery
dataset.import_dataset(config)

# View a sample of the fetched data
dataset.data.head()
Sample Output: