Connecting to Google BigQuery¶
The xpresso.ai Data Connectivity library can be used to fetch tables from Google BigQuery.
BigQuery is Google’s serverless, highly scalable enterprise data warehouse. It lets you query and manage relational tables to produce meaningful insights using familiar SQL, without the need for a database administrator.
Prerequisites for using Google BigQuery through the xpresso.ai Data Connectivity module:

- Path to a private key file (JSON/P12) created using a service account on the GCP stack. Ask a person with the “Owner” role to grant you the appropriate access permissions on the service account before generating a key. Download this key and note its location in your local filesystem.
- The appropriate Project ID.
- The dataset containing the table to be fetched from BigQuery.
- The name of the required table.
The objective here is to make a table on Google BigQuery readily available on the xpresso.ai platform for further use. Maintaining the table and its schema is the sole responsibility of the person with the “Owner” IAM role on the GCP stack.
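For reference, a service-account private key of the kind described in the prerequisites can be generated from the command line with the gcloud CLI; a minimal sketch, assuming the service account already exists (the account name, project ID, and output path below are placeholders, not values from this document):

```shell
# Create a JSON private key for an existing service account and save it locally.
# SERVICE_ACCOUNT_NAME and PROJECT_ID are placeholders for your own values.
gcloud iam service-accounts keys create ~/keys/bq-key.json \
    --iam-account=SERVICE_ACCOUNT_NAME@PROJECT_ID.iam.gserviceaccount.com
```

The output path passed to `keys create` is the location you will later supply as `cred_path` in the DSN dictionary.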
Fetch table - import_dataset()
The import_dataset() method of the StructuredDataset class in the xpresso.ai EDA module is used to fetch data from GCP BigQuery. It takes a dictionary object as its parameter; this key-value object defines the data source and acts as a DSN for a specific table.
The following keys are required to import a specific table from BigQuery:
| Name | Type | Description | Mandatory | Comments |
|---|---|---|---|---|
| type | string | Type of data source needed: database or filesystem (DB/FS). | Yes | Specify “DB”, since GCP BigQuery is an RDBMS |
| data_source | string | States the specific data_source. | Yes | Specify “BigQuery” |
| cred_path | string | Location of the private key downloaded from the GCP console. | Yes | Provide an absolute path to the file |
| project_id | string | Project ID of the targeted table. | Yes | The project ID of the private key and that of the target table can differ |
| dataset | string | Dataset under the project ID that contains the targeted table. | Yes | |
| table | string | Target table name. | Yes | |
| columns | string / list of strings | List of column names required from the table. | Yes | Use “*” to fetch all columns |
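Since all of the keys above are mandatory, it can be useful to sanity-check the DSN dictionary before handing it to import_dataset(). A minimal sketch of such a check — the validate_bigquery_config helper below is illustrative and not part of the xpresso.ai library:

```python
# Minimal sketch: validate a BigQuery DSN dictionary before passing it to
# StructuredDataset.import_dataset(). This helper is illustrative only.
REQUIRED_KEYS = {"type", "data_source", "cred_path",
                 "project_id", "dataset", "table", "columns"}

def validate_bigquery_config(config: dict) -> list:
    """Return a list of problems found in the DSN dictionary (empty if OK)."""
    problems = [f"missing key: {key}"
                for key in sorted(REQUIRED_KEYS - config.keys())]
    if config.get("type") not in (None, "DB"):
        problems.append('type must be "DB" for BigQuery')
    if config.get("data_source") not in (None, "BigQuery"):
        problems.append('data_source must be "BigQuery"')
    return problems

config = {
    "type": "DB",
    "data_source": "BigQuery",
    "cred_path": "/path/to/private/key.json",  # placeholder values below
    "project_id": "my-gcp-project",
    "dataset": "my_dataset",
    "table": "my_table",
    "columns": "*",
}
print(validate_bigquery_config(config))  # → []
```

A check like this catches a missing key or a misspelled data_source locally, before any call is made to GCP.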
Sample code to fetch data from the stackoverflow public dataset’s posts_questions table:
from xpresso.ai.core.data.automl.structured_dataset import StructuredDataset

# Config object specifying the DSN
config = {
    "type": "DB",
    "data_source": "BigQuery",
    "cred_path": "/path/to/private/key/JSON/file",
    "project_id": "Project ID of the required table",
    "dataset": "Dataset containing the required table",
    "table": "required table",
    "columns": "*"
}

# Create a StructuredDataset instance
dataset = StructuredDataset()

# Fetch data from GCP BigQuery
dataset.import_dataset(config)

# View a sample of the fetched data
dataset.data.head()
Sample Output: