ETL & BI

Solution Name: sample_project_etl_bi

This solution demonstrates an ETL pipeline that fetches data from a text file, cleans it, and stores it in a Data Warehouse, along with a query service for querying the data in the Warehouse. It consists of the following components:

  • data_warehouse - a component of type “database” representing a Data Warehouse (MySQL database)

  • fetch_data - a component of type “pipeline_job” to fetch data from a CSV file on the NFS shared drive and store it in the data repository

  • transform_data - a component of type “pipeline_job” that cleans up the fetched data and stores the cleaned data in the data repository

  • store_data - a component of type “pipeline_job” that fetches cleaned data from the data repository and stores it in the Data Warehouse

  • etl_pipeline - a pipeline that combines the fetch_data, transform_data and store_data components

  • query_data - a component of type “service” that queries the data warehouse according to the criteria specified by the request, and returns the number of records found (optionally grouped)

How to use this solution

You will work on a clone of this solution. The steps to be followed are:

  1. Clone the solution. Cloning a solution does not clone its code, so you need to copy the code manually (Steps 2-4 below)

  2. Clone the code repository of the sample solution

    1. Navigate to the code repository of the sample solution

    2. Click “Clone” and copy the git clone command

    3. Execute the command on your machine (ensure you have Git installed)

  3. Clone the code repository of the cloned solution

    1. Navigate to the code repository of the cloned solution

    2. Click “Clone” and copy the git clone command

    3. Execute the command on your machine (ensure you have Git installed)

  4. Copy code from the sample solution into the cloned solution

  5. Commit and push the code back into the code repository of the cloned solution

    1. Execute git add -A to add the changed code to the local repository

    2. Execute git commit -m "Cloned code" to commit the code to the local repository

    3. Execute git push to push the code into Bitbucket

  6. Build the cloned solution components

    1. Select the “master” branch for each component during the build

  7. Before deploying the components and pipelines, you need to upload the parameters file and the data file to the shared drive of the solution

    1. Download /pipelines/etl-pipeline/params.json from the NFS Drive of the original solution and upload it to the NFS Drive of the cloned solution, both to the root folder of the solution and to the /pipelines/etl-pipeline folder. Before uploading the file, change lines 4, 5, 8 and 12 of params.json as follows (a scripted version of these edits is sketched after this step):

      1. Line 4 - replace <your_user_id> with your xpresso.ai user id (e.g., the line should be changed to "xpresso_uid": "john.doe")

      2. Line 5 - replace <your_password> with your xpresso.ai password (e.g., the line should be changed to "xpresso_pwd": "my_strong_password")

      3. Line 8 - In the value for the parameter "db_url", replace "sample-project-etl-bi" with your solution name (you will have to make this change twice in the line, so it should read "<solution_name>-data-warehouse.<solution_name>"). Replace any underscores in the solution name with dashes. Example: if the name of the solution is "sample_solution_john", set the db_url parameter to "sample-solution-john-data-warehouse.sample-solution-john"

      4. Line 12 - replace <database_password_you_set> with a suitable database password (make sure you specify the same password in Step 8b)

    2. Download /pipelines/etl-pipeline/participant.csv from the NFS Drive of the original solution and upload it to the NFS Drive of the cloned solution, into the /pipelines/etl-pipeline folder. This file contains the records of participants in a clinical trial
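  The following is a minimal sketch (not part of the solution code) of the params.json edits in Step 7a, written in Python and assuming the parameters sit at the top level of the JSON object. The values shown (user id, password, solution name) are placeholders you must replace; the key name for the database password on line 12 is not shown in this guide, so it is only noted in a comment.

    import json

    solution_name = "sample_solution_john"        # your cloned solution's name
    dns_name = solution_name.replace("_", "-")    # underscores become dashes

    with open("params.json") as f:
        params = json.load(f)

    params["xpresso_uid"] = "john.doe"            # Step 7a(1): your xpresso.ai user id
    params["xpresso_pwd"] = "my_strong_password"  # Step 7a(2): your xpresso.ai password
    params["db_url"] = f"{dns_name}-data-warehouse.{dns_name}"   # Step 7a(3)
    # Step 7a(4): also set the database password parameter on line 12 to the same
    # value you will use for MYSQL_ROOT_PASSWORD in Step 8b.

    with open("params.json", "w") as f:
        json.dump(params, f, indent=2)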

  8. Deploy the components and pipelines of the cloned solution:

    1. For the etl_pipeline, specify the following deployment parameters for each component:

      1. Build Version = latest build version

    2. For data_warehouse, specify the following deployment parameters:

      1. Build Version = <Latest Build Version>

      2. Advanced Settings (Environment Variables) - name = MYSQL_ROOT_PASSWORD, value = <any password of your choice>

      3. Advanced Settings (Ports) - name = default, value = 3306

    3. For query_data (the query service), specify the following deployment parameters:

      1. Build Version = <Latest Build Version>

      2. Advanced Settings (Ports) - name = default, value = 5000

  9. Test the components (the database and service might take a few minutes to get deployed):

    1. Note down the URLs output for the query_service and data_warehouse components

    2. To test the data_warehouse, use a database tool (like Toad, or MySQL Manager) to connect to the data_warehouse URL you noted in Step 9a. Use the user ID root and the password you specified during deployment to connect to the database. You should see a database called “dwh” with a single table in it.

    3. To test the query_service, issue a POST request to the service URL you noted in Step 9a, appending “/get_results” (e.g., 172.16.2.1:31133/get_results), with an empty JSON object ({}) in the request body. Use a tool such as POSTMAN or curl. You should get a response which says num_participants = 0 (a scripted version of Steps 9b and 9c is sketched after this list)
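  The following is a rough sketch of the checks in Steps 9b and 9c, assuming Python with the pymysql and requests packages (neither is prescribed by this guide). The host, port and password values are placeholders; replace them with the URLs you noted in Step 9a and the password you set in Step 8b.

    import pymysql
    import requests

    # Step 9b: connect to the data warehouse as root and list its tables
    conn = pymysql.connect(host="<data_warehouse_host>",       # from the URL noted in Step 9a
                           port=3306,                          # or the exposed port, if different
                           user="root",
                           password="<MYSQL_ROOT_PASSWORD>",   # the password set in Step 8b
                           database="dwh")
    with conn.cursor() as cur:
        cur.execute("SHOW TABLES")
        print(cur.fetchall())       # should show the single table in the "dwh" database
    conn.close()

    # Step 9c: POST an empty JSON object to the query service
    resp = requests.post("http://172.16.2.1:31133/get_results", json={})
    print(resp.json())              # expected: num_participants = 0 before the pipeline has run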

  10. The pipeline has been deployed, but has not run. To run the pipeline, start an experiment using the deployed version of the pipeline. Specify the following parameters during the run:

    1. Name of the pipeline - etl_pipeline

    2. Version - latest deployed version

    3. Run Name - any run name of your choice (do not use a name which you have already used)

    4. Run Description - any description of your choice

    5. parameters_filename - params.json

  11. To ensure the pipeline has run properly, view the run details

  12. After the pipeline has run correctly, run the query in the query service again. You should get 10000 as the number of participants

  13. You can run further queries on the query service by using filters such as the following (a scripted version is sketched after this list):

    1. {"filter": {"gender": "M"}} - will return the number of male participants

    2. {"filter": {"diabetes_present": "No"}} - will return the number of participants without diabetes

    3. Use the database connection to the data warehouse to query the table directly, try other filters, and check the data in the database against the query service results

    4. You can even add a grouping clause, e.g., {"filter": {"gender": "F"}, "group": ["hypertension_present"]} will return the number of female participants, grouped by whether or not they exhibit symptoms of hypertension
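  The following is a hedged example (Python with the requests package, as an alternative to POSTMAN or curl) of the filter and group queries above. The service URL is a placeholder taken from the example in Step 9c; replace it with the query service URL you noted in Step 9a.

    import requests

    url = "http://172.16.2.1:31133/get_results"   # replace with your own query service URL

    queries = [
        {},                                                              # all participants
        {"filter": {"gender": "M"}},                                     # male participants
        {"filter": {"diabetes_present": "No"}},                          # participants without diabetes
        {"filter": {"gender": "F"}, "group": ["hypertension_present"]},  # female participants grouped by hypertension
    ]

    for body in queries:
        resp = requests.post(url, json=body)
        print(body, "->", resp.json())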