Connecting to Google BigQuery


The xpresso.ai Data Connectivity library can be used to fetch tables from Google BigQuery.

BigQuery is Google’s serverless, highly scalable enterprise data warehouse. It is data agnostic and is used to manage RDBMS tables to produce meaningful insights using familiar SQL without the need for a database administrator.

Prerequisites for using Google BigQuery through xpresso.ai Data Connectivity module;

  • Path to private key file (JSON/P12) created using a service account on GCP stack. Ask a person with role as “Owner“ to grant you appropriate access permissions in the service account prior to generating any key. Download this key and note it’s location in your local filesystem.

  • Appropriate Project ID.

  • Dataset consisting the table to be fetched from the BigQuery.

  • Required Table name.

The objective here is to make a table on Google BigQuery readily available on xpresso.ai platform for further use. Maintaining the table and it’s schema is the sole responsibility of the person with IAM role as “Owner“ on GCP stack.

Fetch table - import_dataset()

Above stated method from StructuredDataset class in xpresso.ai EDA module is used to fetch data from GCP BigQuery. This method takes a dictionary object as parameter. The key-value pair object defines data source and acts as DSN for a specific table.

These are the keys required to import a specific table from BigQuery;

Name

Type

De scription

** Mandatory**

Comments*

type

string

Decides the type of data_source needed. Either database or filesystem (DB/FS).

Yes

Here we need to specify as “DB“ since GCP BigQuery is an RDBMS

data_source

string

States a specific d ata_source.

Yes

Here we need to specify “BigQuery“

cred_path

string

Location of the private key downloaded from GCP console.

Yes

Provide an absolute path to the file

project_id

string

Project ID of the targeted table

Yes

Project ID from private key and target table can be different.

dataset

string

Dataset under project ID that consists of the targeted table

Yes

table

string

Target table

Yes

columns

string/list of string

List of column names required from the table.

Yes

Put “*“ if aim is to fetch all columns

Sample code to fetch data from stackoverflow public dataset’s posts_questions table :

from xpresso.ai.core.data.automl.structured_dataset import StructuredDataset
# config object to specify DSN
config = { "type": "DB", "data_source": "BigQuery", "cred_path": "/path/to/private/key/JSON/file", "project_id": "Project ID of the required table", "dataset": "Dataset consisting required table", "table": "required table", "columns": "*" }
# creating a StructuredDataset instance
dataset = StructuredDataset()
# Method call to fetch data from GCP BigQuery
dataset.import_dataset(config)
# print dataset.data.head() to view sample data fetched
dataset.data.head()

Sample Output :

image1