Sample Solutions

Sample solutions have been created to demonstrate various features of xpresso.ai. All users have access to them. The samples include:

  • sample_project_basic - demonstrates the use of jobs, services and databases

  • sample_project_etl_bi  - demonstrates an ETL pipeline, which fetches data from a text file, cleans it and stores it in a Data Warehouse, as well as a query service to query the data in the Warehouse

  • sample_project_ml  - demonstrates Machine Learning pipelines, Inference Services and A/B Testing

  • sample_project_data_management  - demonstrates a pipeline to fetch data from a file, explore it, and visualize the results, without writing a single line of code, by using components from the xpresso.ai Component Library

  • sample_project_spark  - demonstrates a Machine Learning pipeline run on a Spark cluster

How to use a sample solution?

You will work on a clone of the sample solution. The steps to be followed are:

  1. Clone the solution (see the section on how to clone a solution).

Note

Cloning a solution does not clone its code, so you need to do this manually.

  2. Copy code from the sample solution into the cloned solution.

  3. Commit and push the code back into the code repository of the cloned solution.

  4. Build the cloned solution components.

  5. Deploy the components and/or pipelines of the cloned solution.

  6. Test the components and/or pipelines of the cloned solution.

Basic Functionality

Solution Name: sample_project_basic

This solution demonstrates the use of jobs, services and databases, through the following components:

  • sample_job - a component of type “job” that implements a counter which counts down from the number provided as input.

  • sample_database - a component of type “database” (MySQL database)

  • sample_service - a component of type “service” that echoes the name of the component when a GET request is made to it (an illustrative sketch of such a service follows this list).
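For intuition only, a service of this kind could be written as a tiny Flask application. The sketch below is an assumption for illustration (route, port and message format are placeholders), not the actual sample_service code from the repository:

# Illustrative sketch only -- not the actual sample_service code.
from flask import Flask

app = Flask(__name__)
COMPONENT_NAME = "sample_service"  # the component simply echoes its own name

@app.route("/", methods=["GET"])
def echo():
    # Respond to a GET request with a greeting that names the component
    return f"Hello from {COMPONENT_NAME}"

if __name__ == "__main__":
    # Port 5000 matches the default port used in the deployment steps below
    app.run(host="0.0.0.0", port=5000)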

How to use this solution?

You will work on a clone of this solution. The steps to be followed are:

  1. Clone the solution.

    Note

    Cloning a solution does not clone its code, so you need to do this manually (Steps 2-4 below).

  2. Clone the code repository of the sample solution by performing the following steps:

    1. Navigate to the code repository of the sample solution.

    2. Click “Clone” and copy the git clone command.

    3. Execute the command on your machine.

    Note

    Ensure you have Git installed.

  3. Clone the code repository of the cloned solution by performing the following steps:

    1. Navigate to the code repository of the cloned solution

    2. Click “Clone” and copy the git clone command

    3. Execute the command on your machine

    Note

    Ensure you have Git installed.

  4. Copy the code from the sample solution into the cloned solution.

  5. Commit and push the code back into the code repository of the cloned solution by performing the following steps:

    1. Execute git add -A to add the changed code to the local repository.

    2. Execute git commit -m "Cloned code" to commit the code to the local repository.

    3. Execute git push to push the code into the code repository.

  6. Build the cloned solution components by performing the following step:

    1. Select the “master” branch for each component during the build.

  7. Deploy the components of the cloned solution by performing the following steps:

    1. For sample_job, specify the following deployment parameters:

      1. Build Version = latest build version

      2. Advanced Settings - Args = number of seconds you want the job to count (e.g., 100)

    2. For sample_database, specify the following deployment parameters:

      1. Build Version = <Latest Build Version>

      2. Advanced Settings (Environment Variables) - name = MYSQL_ROOT_PASSWORD, value = <any password of your choice>

      3. Advanced Settings (Ports) - name = default, value = 3306

    3. For sample_service, specify the following deployment parameters:

      1. Build Version = <Latest Build Version>

      2. Advanced Settings (Ports) - name = default, value = 5000

  8. Test the components by performing the following steps:

    Note

    The database and service might take a few minutes to get deployed.

    a. Note the URLs output for the sample_service and sample_database components.

    b. To test the sample_job, open the Kubernetes dashboard, navigate to the pod for the sample_job and view its logs. When the job completes, the logs should show the count running down from the number you specified to 0.

    c. To test the sample_database, use a database tool (such as Toad or MySQL Manager) to connect to the URL you got in Step 8a.

    d. Use the user ID root and the password you specified during deployment to connect to the database. You should see a database called “reporting” with some tables in it.

    e. To test the sample_service, click on the URL you got in Step 8a for the service. You should see a “Hello” message specifying the name of the component (“sample_service”). A Python sketch of these connection tests follows these steps.
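If you prefer to script the service and database checks, a rough Python sketch follows. It is illustrative only: the host and port values are placeholders for the URLs you noted in Step 8a, and it relies on the third-party requests and PyMySQL packages.

# Illustrative test script -- host/port values are placeholders, not real endpoints.
import requests
import pymysql

SERVICE_URL = "http://172.16.2.1:31500"   # replace with the sample_service URL from Step 8a
DB_HOST = "172.16.2.1"                    # replace with the sample_database host from Step 8a
DB_PORT = 31306                           # replace with the exposed database port

# Test sample_service: a GET request should return the "Hello" message
response = requests.get(SERVICE_URL)
print(response.status_code, response.text)

# Test sample_database: connect as root with the password set during deployment
connection = pymysql.connect(host=DB_HOST, port=DB_PORT, user="root",
                             password="<password_you_chose>")
with connection.cursor() as cursor:
    cursor.execute("SHOW DATABASES")      # the "reporting" database should be listed
    for (name,) in cursor.fetchall():
        print(name)
connection.close()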

ETL & BI

Solution Name: sample_project_etl_bi

This solution demonstrates an ETL pipeline, which fetches data from a text file, cleans it and stores it in a Data Warehouse, as well as a query service to query the data in the Warehouse, through the following components:

  • data_warehouse - a component of type “database” representing a Data Warehouse (MySQL database)

  • fetch_data - a component of type “pipeline_job” to fetch data from a CSV file on the NFS shared drive and store it in the data repository

  • transform_data - a component of type “pipeline_job” that cleans up the fetched data and stores the cleaned data in the data repository

  • store_data - a component of type “pipeline_job” that fetches cleaned data from the data repository and stores it in the Data Warehouse

  • etl_pipeline - a pipeline that combines the fetch_data, transform_data and store_data components

  • query_data - a component of type “service” that queries the data warehouse according to the criteria specified by the request, and returns the number of records found (optionally grouped)

How to use this solution?

You will work on a clone of this solution. The steps to be followed are:

  1. Clone the solution.

    Note

    Cloning a solution does not clone its code, so you need to do this manually (Steps 2-4 below)

  2. Clone the code repository of the sample solution.

    1. Navigate to the code repository of the sample solution.

    2. Click “Clone” and copy the git clone command.

    3. Execute the command on your machine.

    Note

    Ensure you have Git installed.

  3. Clone the code repository of the cloned solution.

    1. Navigate to the code repository of the cloned solution.

    2. Click “Clone” and copy the git clone command.

    3. Execute the command on your machine.

    Note

    Ensure you have Git installed.

  4. Copy code from the sample solution into the cloned solution.

  5. Commit and push the code back into the code repository of the cloned solution.

    1. Execute git add -A to add the changed code to the local repository.

    2. Execute git commit -m "Cloned code" to commit the code to the local repository.

    3. Execute git push to push the code into the code repository.

  6. Build the cloned solution components.

    1. Select the “master” branch for each component during the build.

  7. Before deploying the components and pipelines, you need to upload the parameters file and the data file to the shared drive of the solution.

    1. Download /pipelines/etl-pipeline/params.json from the NFS Drive of the original solution and upload it to the NFS Drive of the cloned solution, both to the root folder of the solution and to the /pipelines/etl-pipeline folder.

      Note

      Before uploading the file, change lines 4, 5, 8 and 12 of the params.json file as follows (a Python sketch of these edits follows this step):

      1. Line 4 - replace <your_user_id> with your xpresso.ai user id (e.g., the line should be changed to "xpresso_uid": "john.doe").

      2. Line 5 - replace <your_password> with your xpresso.ai password (e.g., the line should be changed to "xpresso_pwd": "my_strong_password").

      3. Line 8 - in the value for the parameter "db_url", replace "sample-project-etl-bi" with your solution name (you will have to make this change twice in the line, so it should read "<solution_name>--data-warehouse.<solution_name>"). Replace any underscores in the solution name with dashes. Example: if the name of the solution is "sample_solution_john", set the db_url parameter to "sample-solution-john--data-warehouse.sample-solution-john".

      4. Line 12 - replace <database_password_you_set> with a suitable database password (make sure you specify the same password in Step 8.2 below).

    2. Download /pipelines/etl-pipeline/participant.csv from the NFS Drive of the original solution and upload it to the NFS drive of the cloned solution, into the /pipelines/etl-pipeline folder. This file represents participants in a clinical trial.
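If you prefer to script the params.json edits described in Step 7, the sketch below shows one way to do it with Python's standard library. It works purely on the placeholder tokens named above; the user id, passwords and solution name used here are examples you must replace with your own values.

# Illustrative helper for the params.json edits in Step 7 -- a sketch, not part of the solution.
from pathlib import Path

params_path = Path("params.json")          # the file downloaded from the original solution
text = params_path.read_text()

solution_name = "sample-solution-john"     # your cloned solution name, underscores replaced by dashes

# Lines 4, 5 and 12: fill in the placeholder tokens
text = text.replace("<your_user_id>", "john.doe")
text = text.replace("<your_password>", "my_strong_password")
text = text.replace("<database_password_you_set>", "my_db_password")

# Line 8: point db_url at your own solution's data warehouse (two occurrences are replaced)
text = text.replace("sample-project-etl-bi", solution_name)

params_path.write_text(text)
print(text)                                # review the result before uploading it to the NFS Drive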

  8. Deploy the components and pipelines of the cloned solution:

    1. For the etl_pipeline, specify the following deployment parameters for each component:

      1. Build Version = latest build version

    2. For data_warehouse, specify the following deployment parameters:

      1. Build Version = <Latest Build Version>

      2. Advanced Settings (Environment Variables) - name = MYSQL_ROOT_PASSWORD, value = <any password of your choice>

      3. Advanced Settings (Ports) - name = default, value = 3306

    3. For query_service, specify the following deployment parameters:

      1. Build Version = <Latest Build Version>

      2. Advanced Settings (Ports) - name = default, value = 5000

  9. Test the components (the database and service might take a few minutes to get deployed):

    1. Note down the URLs output for the query_service and data_warehouse components.

    2. To test the data_warehouse, use a database tool (such as Toad or MySQL Manager) to connect to the URL you got in Step 9.1. Use the user ID root and the password you specified during deployment to connect to the database. You should see a database called “dwh” with a single table in it.

    3. To test the query_service, issue a POST request to the service URL you got in Step 9.1 by appending “/get_results” (e.g., 172.16.2.1:31133/get_results), with an empty JSON object ({}) in the request body. The pipeline is deployed (but has not run yet), so you should get a response which says num_participants = 0. A Python sketch of this request follows the note below.

    Note

    Use a tool such as POSTMAN or curl.
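Alternatively, the same request can be sent with Python's requests library. The sketch below is illustrative; the URL is a placeholder and should be replaced with the query_service URL you noted in Step 9.1.

# Illustrative only -- replace the URL with your own query_service endpoint from Step 9.1.
import requests

url = "http://172.16.2.1:31133/get_results"   # placeholder service URL
response = requests.post(url, json={})        # empty JSON object, as described in Step 9.3
print(response.status_code)
print(response.json())                        # before the pipeline has run, the participant count should be 0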

  10. To run the pipeline, start an experiment using the deployed version of the pipeline. Specify the following parameters during the run:

    1. Name of the pipeline - etl_pipeline

    2. Version - latest deployed version

    3. Run Name - any run name of your choice (do not use a name which you have already used)

    4. Run Description - any description of your choice

    5. parameters_filename - params.json

  11. To ensure the pipeline has run properly, view the run details.

  12. After the pipeline has run correctly, run the query in the query service again. You should get 10000 as the number of participants.

  13. You can run further queries on the query service by using filters such as the following (a Python sketch of these queries follows this list):

    1. {"filter": {"gender": "M"}} - will return the number of male participants

    2. {"filter": {"diabetes_present": "No"}} - will return the number of participants without diabetes

    3. Use the database connection to the data warehouse to query the table, try other filters, and check the data in the database against the query service results.

    4. You can also add a grouping clause, e.g., {"filter": {"gender": "F"}, "group": ["hypertension_present"]} will return the number of female participants, grouped by whether or not they exhibit symptoms of hypertension.
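The same filters can be sent programmatically; this sketch simply reuses the placeholder query_service URL from the earlier example and is illustrative only.

# Illustrative follow-up queries against query_service -- the URL is a placeholder.
import requests

url = "http://172.16.2.1:31133/get_results"   # placeholder; use your query_service URL

# Count male participants
print(requests.post(url, json={"filter": {"gender": "M"}}).json())

# Count participants without diabetes
print(requests.post(url, json={"filter": {"diabetes_present": "No"}}).json())

# Count female participants, grouped by hypertension status
print(requests.post(url, json={"filter": {"gender": "F"},
                               "group": ["hypertension_present"]}).json())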

Machine Learning (Kubeflow)

Solution Name: sample_project_ml

This solution demonstrates Machine Learning pipelines, Inference Services and A/B Testing.

The models built in this solution are trained to predict the future sales of a store, using sales data for a previous time period, for training and validation.

Two types of models are built using XGBoost and Neural Networks. Once the models have been trained, an Inference Service is deployed for each model, which is used to obtain predictions from the model. The Inference Services are combined to create an A/B Test.

The solution has the following components:

  • data_fetch - a component to fetch data from the data repository for the solution using the Data Versioning component from the xpresso.ai Component Library.

  • xgboost_data_prep - a component of type “pipeline_job” to prepare data for training using the XGBoost library

  • xgboost_train - a component of type “pipeline_job” to train an XGBoost model using the prepared data

  • xgboost_training_pipeline - a pipeline that combines the data_fetch, xgboost_data_prep and xgboost_train components

  • xgboost_infer - a component of type “inference_service” to provide a REST API to perform predictions on input requests using the trained XGBoost model

  • dnn_data_prep - a component of type “pipeline_job” to prepare data for training using a Deep Neural Network (using the Keras and TensorFlow libraries)

  • dnn_train - a component of type “pipeline_job” to train a Deep Neural Network model using the prepared data

  • dnn_training_pipeline - a pipeline that combines the data_fetch, dnn_data_prep and dnn_train components

  • dnn_infer - a component of type “inference_service” to provide a REST API to perform predictions on input requests using the trained DNN

How to use this solution?

You will work on a clone of this solution. The steps to be followed are:

  1. Clone the solution.

    Note

    Cloning a solution does not clone its code, so you need to do this manually by following the steps below.

  2. Clone the code repository of the sample solution

    1. Navigate to the code repository of the sample solution.

    2. Click “Clone” and copy the git clone command.

    3. Execute the command on your machine.

    Note

    Ensure you have Git installed.

  3. Clone the code repository of the cloned solution.

    1. Navigate to the code repository of the cloned solution.

    2. Click “Clone” and copy the git clone command.

    3. Execute the command on your machine.

    Note

    Ensure you have Git installed.

  4. Copy code from the sample solution into the cloned solution

  5. Commit and push the code back into the code repository of the cloned solution

    1. Execute git add -A to add the changed code to the local repository.

    2. Execute git commit -m "Cloned code" to commit the code to the local repository.

    3. Execute git push to push the code into the code repository.

  6. Build the cloned solution components.

    1. Select the “master” branch for each component during the build.

  7. Before deploying the components and pipelines, you need to upload the parameters file to the shared drive of the solution and the data file into the data repository. To do so, perform the following steps:

    1. Download /pipelines/dnn-training-pipeline/params.json from the NFS Drive of the original solution and upload it to the following locations on the NFS Drive of the cloned solution:

      • the root folder of the solution

      • the /pipelines/dnn-training-pipeline folder

      • the /pipelines/xgboost-training-pipeline folder

    2. Download the data files (store.csv, train.csv, test.csv) from the root folder of the NFS Drive of the original solution. These files represent store information, training data and test data respectively.

    3. Push the files into the data repository of the cloned solution using the xpresso.ai Data Versioning library. To do so, perform the following steps:

      1. Navigate to the data repository for the solution.

      2. Create a new branch in the data repository, called “raw_data”.

      3. Upload the three data files into the branch.

  8. Deploy the pipelines of the cloned solution by specifying the following deployment parameters for the components:

    1. data_fetch (in each pipeline)

      1. Advanced Settings (Custom Docker Image) - dockerregistrysb.xpresso.ai/library/data_versioning:2.2

      2. Advanced Settings (Args) - as below:

        Dynamic?   Name
        No         -component-name
        No         data_fetch

    2. Other components

      1. Build Version = latest build version

Note

Any other parameters required by any component of the pipeline will be taken from the parameters file specified when running an experiment on the deployed pipeline.

  9. To run the pipeline, start an experiment by using the deployed version of each pipeline.

  10. Specify the following parameters during the run:

    1. Name of the pipeline - <name of the pipeline>

    2. Version - latest deployed version

    3. Run Name - any run name of your choice (do not use a name which you have already used)

    4. Run Description - any description of your choice

    5. parameters_filename - ml_params.json (this file contains values for parameters required by components of the pipeline)

  11. To ensure the pipeline has run properly, view the run details and ensure that each pipeline has created a model in the model repository. You are now ready to test the inference service for each model.

    Note

    The inference service will accept a set of data points as input and output the sales predicted by the model. Once an inference service has been deployed for each model, they can be combined to create an A/B Test. In an A/B Test, requests are randomly sent to the two inference services and results are obtained.

  12. Combine the deployment of the inference services and A/B Testing as follows:

    1. Open the Inference Services page.

    2. Select both the inference services.

    3. For each inference service, do the following:

      1. Select the latest successful run for the appropriate pipeline.

      2. Select the latest build version of the inference service.

      3. Set the port name to “default” and value to 5000.

      4. Specify any mesh name of your choice.

      5. Specify the weights as “50” each in the routing strategy. This indicates that 50% of the requests will go to the first model, and 50% to the second (on average)

    4. Deploy the inference services.

  13. Note the URL obtained as a result.

  14. To check the deployment, visit the Kubernetes dashboard for the solution.

  15. After the services have been deployed successfully, open a tool such as POSTMAN and follow the test instructions. You can use the sample data below for the request payload (a Python sketch of the request follows the sample response):

{ "input": { "Store": 238.0, "DayOfWeek": 5.0, "Promo": 0.0,
“StateHoliday”: 0.0, “SchoolHoliday”: 0.0, “StoreType”: 3.0,
“Assortment”: 2.0, “CompetitionDistance”: 610.0, “Promo2”: 0.0,
“Day”: 1.0, “Month”: 7.0, “Year”: 1.0, “isCompetition”: 0.0,
“NewAssortment”: 3, “NewStoreType”: 1 } }

Tip

The response should indicate the predicted sales (in dollars), as well as the name of the model which produced the response. As mentioned above, roughly 50% of the requests should be executed by each model.

Sample Response

{"message": "success", "results": [4350.8134765625], "run_name": "run_15"}
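The same request can also be sent from Python. The sketch below is illustrative: the URL (including the request path) is a placeholder to be replaced with the A/B test URL noted in Step 13 plus the path given in the test instructions.

# Illustrative only -- replace the URL with the A/B test endpoint from Step 13 and the test instructions.
import requests

url = "http://172.16.2.1:31800/predict"   # placeholder endpoint; the actual path comes from the test instructions
payload = {
    "input": {
        "Store": 238.0, "DayOfWeek": 5.0, "Promo": 0.0,
        "StateHoliday": 0.0, "SchoolHoliday": 0.0, "StoreType": 3.0,
        "Assortment": 2.0, "CompetitionDistance": 610.0, "Promo2": 0.0,
        "Day": 1.0, "Month": 7.0, "Year": 1.0, "isCompetition": 0.0,
        "NewAssortment": 3, "NewStoreType": 1,
    }
}

# Send the request several times; roughly half should be served by each model in the A/B test
for _ in range(4):
    response = requests.post(url, json=payload)
    print(response.json())   # e.g. {"message": "success", "results": [...], "run_name": "..."}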

Machine Learning (Spark)

Solution Name: sample_project_spark

  • This solution demonstrates Machine Learning pipelines on Spark.

  • The model built in this solution is trained to predict the probability that a specified patient will have a stroke in the next few months.

  • The solution uses a Random Forest classifier provided by pyspark. It consists of the following components, which perform feature engineering and eventually build a model using the Random Forest classifier:

    1. string_indexer - encodes a string column of labels into a column of label indices. It extends pyspark’s StringIndexer.

    2. one_hot_encoder - maps a categorical feature, represented as a label index, to a binary vector with at most a single one-value indicating the presence of a specific feature value from among the set of all feature values. It extends pyspark’s OneHotEncoderEstimator.

    3. vector_assembler - extends pyspark’s VectorAssembler and assembles all the features into a single feature vector.

    4. feature_engg_and_classifier_pipeline - a machine learning pipeline that combines the string_indexer, one_hot_encoder and vector_assembler components.

Shown below is the data snapshot.

[Data snapshot image: sample rows from the input dataset]

  • Each attribute that we want to use as a feature has to go through some transformations using pyspark classes (this is a usual part of feature preparation in any ML workflow).

  • We want to one-hot encode almost all the attributes (gender, age, hypertension, heart_disease, ever_married, work_type, Residence_type, smoking_status).

  • We would rather not work directly with strings for these attributes, so we index each string with integers using pyspark’s StringIndexer. So, for each attribute we have two stages - string_indexer and one_hot_encoder, in that order.

  • Since this is a supervised ML example, our label is the stroke column. We need to turn it into a label index - hence we have a labelindexer stage. We reuse pyspark’s StringIndexer as the label indexer here.

  • Finally, vector_assembler is used to aggregate the prepared features into a single feature vector; hence it is the final stage. It uses pyspark’s VectorAssembler.

Training pipeline: the pipeline has the following component stages, in the order shown below (a pyspark sketch of these stages follows):

gender-string_indexer → gender-one_hot_encoder → age-string_indexer → age-one_hot_encoder → hypertension-string_indexer → hypertension-one_hot_encoder → heart_disease-string_indexer → heart_disease-one_hot_encoder → ever_married-string_indexer → ever_married-one_hot_encoder → work_type-string_indexer → work_type-one_hot_encoder → Residence_type-string_indexer → Residence_type-one_hot_encoder → smoking_status-string_indexer → smoking_status-one_hot_encoder → labelindexer → vector_assembler
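For reference, plain pyspark code for an equivalent chain of stages might look like the sketch below. This is only an illustration of the underlying pyspark APIs (StringIndexer, OneHotEncoderEstimator, VectorAssembler, RandomForestClassifier); the column and output names are assumptions based on the description above, and the actual solution wraps these classes in xpresso.ai components.

# Illustrative pyspark sketch of the feature-engineering and classification stages.
# Column names are taken from the data snapshot above; everything else is an assumption.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

categorical_cols = ["gender", "age", "hypertension", "heart_disease",
                    "ever_married", "work_type", "Residence_type", "smoking_status"]

stages = []
for col in categorical_cols:
    # string_indexer stage: map string values to label indices
    stages.append(StringIndexer(inputCol=col, outputCol=col + "_idx"))
    # one_hot_encoder stage: map label indices to binary vectors
    stages.append(OneHotEncoderEstimator(inputCols=[col + "_idx"], outputCols=[col + "_vec"]))

# labelindexer stage: reuse StringIndexer to turn the stroke column into the label
stages.append(StringIndexer(inputCol="stroke", outputCol="label"))

# vector_assembler stage: combine all encoded features into a single feature vector
stages.append(VectorAssembler(inputCols=[c + "_vec" for c in categorical_cols],
                              outputCol="features"))

# Final estimator: Random Forest classifier
stages.append(RandomForestClassifier(labelCol="label", featuresCol="features"))

pipeline = Pipeline(stages=stages)
# model = pipeline.fit(training_df)   # training_df would be the Spark DataFrame loaded from HDFS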

How to use this solution?

You will work on a clone of this solution. The steps to be followed are:

  1. Clone the solution.

Note

Cloning a solution does not clone its code, so you need to do this manually (Steps 2-4 below)

  2. Clone the code repository of the sample solution

    1. Navigate to the code repository of the sample solution.

    2. Click “Clone” and copy the git clone command.

    3. Execute the command on your machine.

Note

Ensure you have Git installed.

  3. Clone the code repository of the cloned solution

    1. Navigate to the code repository of the cloned solution.

    2. Click “Clone” and copy the git clone command

    3. Execute the command on your machine

    Note

    Ensure you have Git installed.

  4. Copy code from the sample solution into the cloned solution.

  5. Commit and push the code back into the code repository of the cloned solution.

    1. Execute git add -A to add the changed code to the local repository.

    2. Execute git commit -m "Cloned code" to commit the code to the local repository.

    3. Execute git push to push the code into the code repository.

  6. Build the cloned solution components and pipeline.

    1. Select the “master” branch for each component and pipeline during the build.

  7. Before deploying the components and pipelines, you need to upload the data files into the HDFS folder for the solution.

  8. Download the contents of the “input” folder under /pipelines/feature-engg-and-classifier-pipeline from the original solution and upload them into the cloned solution.

  9. Deploy the pipeline of the cloned solution.

Note

You will need to specify the deployment parameters for the pipeline, and not for each component (recall that in Spark, pipelines are executed as a whole, not as combinations of components) - you just need to specify the latest build version of the pipeline as the deployment parameter.

  10. To run the pipeline, start an experiment using the deployed version of the pipeline.

  11. Specify the following parameters during the run:

    1. Name of the pipeline - <name of the pipeline>

    2. Version - latest deployed version

    3. Run Name - any run name of your choice (do not use a name which you have already used)

    4. Run Description - any description of your choice

  12. To ensure the pipeline has run properly, view the run details.

Note

  • In xpresso.ai, Spark is run within Kubernetes. You can view the Kubernetes dashboard to see the Spark worker process in action.

  • The pipeline should have created a model in the “output” folder of the HDFS, as well as in the model repository.

Data Management

Solution Name: sample_project_data_management

This solution demonstrates a pipeline to fetch data from a file, explore it, and visualize the results, without writing a single line of code, by using components from the xpresso.ai Component Library.

The solution has the following components:

  • data_connection - a component of type “pipeline_job” to fetch data from the shared file system for the solution using the Data Connectivity component from the xpresso.ai Component Library

  • data_exploration - a component of type “pipeline_job” to explore data fetched by the data_connection component by using the Data Exploration component from the xpresso.ai Component Library

  • data_visualization - a component of type “pipeline_job” to visualize the explorations results found by the data_exploration component by using the Data Visualization component from the xpresso.ai Component Library

How to use this solution?

You will work on a clone of this solution. The steps to be followed are:

  1. Clone the solution.

    Note

    • You do not need to copy solution code since all the components are from the xpresso.ai Component Library.

    • You do not need to build any of the components since there is no coding required.

  2. Before deploying the pipeline, you need to upload the parameters file and data file to the shared drive of the solution. To do so, perform the following steps:

    1. Download /pipelines/data_con_exp_viz_pl/data_management_params.json from the NFS Drive of the original solution.

    2. Upload it to the NFS Drive of the cloned solution, to the pipelines/data_con_exp_viz_pl folder.

    3. Download /pipelines/data_con_exp_viz_pl/participant_data.csv from the NFS Drive of the original solution.

    4. Upload it to the NFS Drive of the cloned solution, to the /pipelines/data_con_exp_viz_pl folder.

  3. Deploy the pipeline of the cloned solution. Specify the following deployment parameters for the components:

    1. data_connection

      1. Advanced Settings (Custom Docker Image) = docker image specified in the component documentation, as per the instance you are working on.

      2. Advanced Settings (Args) - as below

        Dynamic?   Name
        No         -component-name
        No         data_connection

    2. data_exploration

      1. Advanced Settings (Custom Docker Image) - docker image specified in the component documentation, as per the instance you are working on

      2. Advanced Settings (Args) - as below

        Dynamic?   Name
        No         -component-name
        No         data_exploration

    3. data_visualization

      1. Advanced Settings (Custom Docker Image) - docker image specified in the component documentation, as per the instance you are working on

      2. Advanced Settings (Args) - as below

        Dynamic?   Name
        No         -component-name
        No         data_visualization

Tip

Any other parameters required by any component of the pipeline will be taken from the parameters file specified when running an experiment on the deployed pipeline.

  4. To run the pipeline, start an experiment using the deployed version of the pipeline.

  5. Specify the following parameters during the run:

    1. Name of the pipeline - <name of the pipeline>

    2. Version - latest deployed version

    3. Run Name - any run name of your choice (do not use a name which you have already used)

    4. Run Description - any description of your choice

    5. parameters_filename - data_management_params.json (this file contains values for parameters required by components of the pipeline)

  6. To ensure the pipeline has run properly, view the run details. You should see the exploration and visualization results in the output folders specified in the parameters file.