Sample Solutions¶
Sample solutions have been created to demonstrate various features of xpresso.ai. All users have access to them. The samples include:
sample_project_basic - demonstrates the use of jobs, services and databases
sample_project_etl_bi - demonstrates an ETL pipeline, which fetches data from a text file, cleans it and stores it in a Data Warehouse, as well as a query service to query the data in the Warehouse
sample_project_ml - demonstrates Machine Learning pipelines, Inference Services and A/B Testing
sample_project_data_management - demonstrates a pipeline to fetch data from a file, explore it, and visualize the results, without writing a single line of code, by using components from the xpresso.ai Component Library
sample_project_spark - demonstrates a Machine Learning pipeline run on a Spark cluster
How to use a sample solution?¶
You will work on a clone of the sample solution. The steps to be followed are:
Clone the solution. See the section on how to clone a solution
Note
Cloning a solution does not clone its code, so you need to do this manually.
Copy code from the sample solution into the cloned solution.
Commit and push the code back into the code repository of the cloned solution.
Build the cloned solution components.
Deploy the components and/or pipelines of the cloned solution.
Test the components and/or pipelines of the cloned solution.
Basic Functionality¶
Solution Name: sample_project_basic
This solution demonstrates the use of jobs, services and databases, through the following components:
sample_job - a component of type “job” that implements a counter which counts down from the number provided as input.
sample_database - a component of type “database” (MySQL database)
sample_service - a component of type “service” that echoes the name of the component when a GET request is made to it.
How to use this solution?
You will work on a clone of this solution. The steps to be followed are:
Clone the solution.
Note
Cloning a solution does not clone its code, so you need to do this manually (Steps 2-4 below).
Clone the code repository of the sample solution by performing the following steps:
Navigate to the code repository of the sample solution.
Click “Clone” and copy the git clone command.
Execute the command on your machine.
Note
Ensure you have Git installed.
Clone the code repository of the cloned solution by performing the following steps:
Navigate to the code repository of the cloned solution
Click “Clone” and copy the git clone command
Execute the command on your machine
Note
Ensure you have Git installed.
Copy the code from the sample solution into the cloned solution.
Commit and push the code back into the code repository of the cloned solution by performing the following steps:
Execute git add -A to add the changed code to the local repository.
Execute git commit -m "Cloned code" to commit the code to the local repository.
Execute git push to push the code to the code repository.
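If you want to verify the add/commit/push sequence before touching the real repositories, the flow can be simulated entirely locally. This is a sketch, not xpresso.ai tooling: it assumes Git is installed, and a local bare repository stands in for the cloned solution's remote, with one scratch file standing in for the copied code.

```python
import os
import subprocess
import tempfile

def git(*args, cwd):
    """Run a git command, raising on failure."""
    subprocess.run(["git", *args], cwd=cwd, check=True, capture_output=True)

# A local bare repo plays the role of the cloned solution's code repository.
remote = tempfile.mkdtemp(suffix="-remote")
git("init", "--bare", remote, cwd=remote)

workdir = tempfile.mkdtemp(suffix="-work")
git("clone", remote, "clone", cwd=workdir)
repo = os.path.join(workdir, "clone")

# Stand-in for "copy code from the sample solution".
with open(os.path.join(repo, "sample.py"), "w") as f:
    f.write("print('copied from the sample solution')\n")

# The three commands from the steps above.
git("add", "-A", cwd=repo)
git("-c", "user.name=You", "-c", "user.email=you@example.com",
    "commit", "-m", "Cloned code", cwd=repo)
git("push", "-u", "origin", "HEAD", cwd=repo)
```

Against the real repositories you would simply run the three commands directly in the clone, as listed above.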
Build the cloned solution components by performing the following step:
Select the “master” branch for each component during the build.
Deploy the components of the cloned solution by performing the following steps:
For sample_job, specify the following deployment parameters:
Build Version = latest build version
Advanced Settings - Args = number of seconds you want the job to count (e.g., 100)
For sample_database, specify the following deployment parameters:
Build Version = <Latest Build Version>
Advanced Settings (Environment Variables) - name = MYSQL_ROOT_PASSWORD, value = <any password of your choice>
Advanced Settings (Ports) - name = default, value = 3306
For sample_service, specify the following deployment parameters:
Build Version = <Latest Build Version>
Advanced Settings (Ports) - name = default, value = 5000
Test the components by performing the following steps:
Note
The database and service might take a few minutes to get deployed.
a. Note the URLs output for the sample_service and sample_database components.
b. To test the sample_job, open the Kubernetes dashboard, navigate to the pod for the sample_job and view its logs. When the job completes, you should see the counts from the number you specified, counting down to 0.
c. To test the sample_database, use a database tool (such as Toad or MySQL Manager) to connect to the URL you got in Step 8a.
d. Use the user ID root and the password you specified during deployment to connect to the database. You should see a database called “reporting” with some tables in it.
e. To test the sample_service, click on the URL you got in Step 8a for the service. You should see a "Hello" message specifying the name of the component ("sample_service").
ETL & BI¶
Solution Name: sample_project_etl_bi
This solution demonstrates an ETL pipeline, which fetches data from a text file, cleans it and stores it in a Data Warehouse, as well as a query service to query the data in the Warehouse, through the following components:
data_warehouse - a component of type “database” representing a Data Warehouse (MySQL database)
fetch_data - a component of type “pipeline_job” to fetch data from a CSV file on the NFS shared drive and store it in the data repository
transform_data - a component of type “pipeline_job” that cleans up the fetched data and stores the cleaned data in the data repository
store_data - a component of type “pipeline_job” that fetches cleaned data from the data repository and stores it in the Data Warehouse
etl_pipeline - a pipeline that combines the fetch_data, transform_data and store_data components
query_data - a component of type “service” that queries the data warehouse according to the criteria specified by the request, and returns the number of records found (optionally grouped)
How to use this solution?
You will work on a clone of this solution. The steps to be followed are:
Clone the solution.
Note
Cloning a solution does not clone its code, so you need to do this manually (Steps 2-4 below).
Clone the code repository of the sample solution.
Navigate to the code repository of the sample solution.
Click “Clone” and copy the git clone command.
Execute the command on your machine.
Note
Ensure you have Git installed.
Clone the code repository of the cloned solution.
Navigate to the code repository of the cloned solution.
Click “Clone” and copy the git clone command.
Execute the command on your machine.
Note
Ensure you have Git installed.
Copy code from the sample solution into the cloned solution.
Commit and push the code back into the code repository of the cloned solution
Execute git add -A to add the changed code to the local repository.
Execute git commit -m "Cloned code" to commit the code to the local repository.
Execute git push to push the code to the code repository.
Build the cloned solution components
Select the “master” branch for each component during the build.
Before deploying the components and pipelines, you need to upload the parameters file and the data file to the shared drive of the solution.
Download /pipelines/etl-pipeline/params.json from the NFS Drive of the original solution and upload it to the NFS Drive of the cloned solution, both to the root folder of the solution and to the /pipelines/etl-pipeline folder.
Note
Caution: Before uploading the file, change lines 4, 5, 8 and 12 of the params.json file as follows:
Line 4 - replace <your_user_id> with your xpresso.ai user ID (e.g., the line becomes "xpresso_uid": "john.doe")
Line 5 - replace <your_password> with your xpresso.ai password (e.g., the line becomes "xpresso_pwd": "my_strong_password")
Line 8 - in the value of the "db_url" parameter, replace "sample-project-etl-bi" with your solution name (you will have to make this change twice in the line, so it reads "<solution_name>--data-warehouse.<solution_name>"). Replace any underscores in the solution name with dashes. Example: if the solution is named "sample_solution_john", set db_url to "sample-solution-john--data-warehouse.sample-solution-john"
Line 12 - replace <database_password_you_set> with a suitable database password (make sure you specify the same password in Step 8b)
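The db_url substitution is the easiest of these edits to get wrong. Below is a minimal sketch of the rule, assuming the dash before "data-warehouse" is a double hyphen (it renders as a single dash in some views of this page); the helper name is ours, and the example values are the ones used in the text.

```python
def db_url_for(solution_name: str) -> str:
    """Derive the db_url value: underscores become dashes, and the
    solution name appears twice, around '--data-warehouse.'."""
    s = solution_name.replace("_", "-")
    return f"{s}--data-warehouse.{s}"

# The keys below are the ones named in the edit list; the rest of
# params.json is not shown in the text and is omitted here.
edits = {
    "xpresso_uid": "john.doe",            # line 4: your xpresso.ai user ID
    "xpresso_pwd": "my_strong_password",  # line 5: your xpresso.ai password
    "db_url": db_url_for("sample_solution_john"),  # line 8
}
print(edits["db_url"])
# -> sample-solution-john--data-warehouse.sample-solution-john
```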
Download /pipelines/etl-pipeline/participant.csv from the NFS Drive of the original solution and upload it to the NFS drive of the cloned solution, into the “/pipelines/etl-pipeline” folder. This file represents participants in a clinical trial.
Deploy the components and pipelines of the cloned solution:
For the etl_pipeline, specify the following deployment parameters for each component:
Build Version = latest build version
For data_warehouse, specify the following deployment parameters:
Build Version = <Latest Build Version>
Advanced Settings (Environment Variables) - name = MYSQL_ROOT_PASSWORD, value = <any password of your choice>
Advanced Settings (Ports) - name = default, value = 3306
For query_service, specify the following deployment parameters:
Build Version = <Latest Build Version>
Advanced Settings (Ports) - name = default, value = 5000
Test the components (the database and service might take a few minutes to get deployed):
Note down the URLs output for the query_service and data_warehouse components
To test the data_warehouse, use a database tool (such as Toad or MySQL Manager) to connect to the URL you got in Step 8a. Use the user ID root and the password you specified during deployment to connect to the database. You should see a database called “dwh” with a single table in it.
To test the query_service, issue a POST request to the service URL you got in Step 8 by appending "/get_results" (e.g., 172.16.2.1:31133/get_results), with an empty JSON object ({}) in the request body. The pipeline has been deployed but has not yet run, so you should get a response saying num_particpants = 0.
Note
Use a tool such as POSTMAN or curl.
To run the pipeline, start an experiment using the deployed version of the pipeline. Specify the following parameters during the run:
Name of the pipeline - etl_pipeline
Version - latest deployed version
Run Name - any run name of your choice
Note
Do not use a name which you have already used.
Run Description - any description of your choice
parameters_filename - params.json
To ensure the pipeline has run properly, view the run details.
After the pipeline has run correctly, run the query in the query service again. You should get 10000 as the number of participants.
You can run further queries on the query service by using filters such as:
{"filter": {"gender": "M"}} - will return the number of male participants
{"filter": {"diabetes_present": "No"}} - will return the number of participants without diabetes
Use the database connection to the data warehouse to query the table, try other filters, and check the data in the database against the query service results.
You can even add a grouping clause, e.g., {"filter": {"gender": "F"}, "group": ["hypertension_present"]} will return the number of female participants, grouped by whether or not they exhibit symptoms of hypertension.
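If you prefer scripted testing over POSTMAN, the same queries can be built and sent with the standard library alone. This is a sketch: the service URL is a placeholder for the one you noted during deployment, and build_query/post_query are our own helper names, not part of xpresso.ai.

```python
import json
import urllib.request

# Placeholder: substitute the query service URL from your deployment.
SERVICE_URL = "http://<service-host>:<port>/get_results"

def build_query(filters=None, group=None):
    """Assemble a request body in the shape shown in the text."""
    payload = {}
    if filters:
        payload["filter"] = filters
    if group:
        payload["group"] = group
    return json.dumps(payload).encode("utf-8")

def post_query(body: bytes):
    """POST a query body to the service (not called here, since the
    URL above is a placeholder)."""
    req = urllib.request.Request(
        SERVICE_URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# The example payloads from the text:
print(build_query())                                     # -> b'{}'
print(build_query({"gender": "M"}))
print(build_query({"gender": "F"}, ["hypertension_present"]))
```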
Machine Learning (Kubeflow)¶
Solution Name: sample_project_ml
This solution demonstrates Machine Learning pipelines, Inference Services and A/B Testing.
The models built in this solution are trained to predict the future sales of a store, using sales data from a previous time period for training and validation.
Two types of models are built: one using XGBoost and one using a neural network. Once the models have been trained, an Inference Service is deployed for each model, and is used to obtain predictions from it. The Inference Services are then combined to create an A/B Test.
The solution has the following components:
data_fetch - a component to fetch data from the data repository for the solution using the Data Versioning component from the xpresso.ai Component Library.
xgboost_data_prep - a component of type “pipeline_job” to prepare data for training using the XGBoost library
xgboost_train - a component of type “pipeline_job” to train an XGBoost model using the prepared data
xgboost_training_pipeline - a pipeline that combines the data_fetch, xgboost_data_prep and xgboost_train components
xgboost_infer - a component of type “inference_service” to provide a REST API to perform predictions on input requests using the trained XGBoost model
dnn_data_prep - a component of type “pipeline_job” to prepare data for training using a Deep Neural Network (using the keras and Tensorflow libraries)
dnn_train - a component of type “pipeline_job” to train a Deep Neural Network model using the prepared data
dnn_training_pipeline - a pipeline that combines the data_fetch, dnn_data_prep and dnn_train components
dnn_infer - a component of type “inference_service” to provide a REST API to perform predictions on input requests using the trained DNN
How to use this solution?
You will work on a clone of this solution. The steps to be followed are:
Clone the solution.
Note
Cloning a solution does not clone its code, so you need to do this manually by following the steps below.
Clone the code repository of the sample solution
Navigate to the code repository of the sample solution.
Click “Clone” and copy the git clone command.
Execute the command on your machine.
Note
Ensure you have Git installed.
Clone the code repository of the cloned solution.
Navigate to the code repository of the cloned solution.
Click “Clone” and copy the git clone command.
Execute the command on your machine.
Note
Ensure you have Git installed.
Copy code from the sample solution into the cloned solution
Commit and push the code back into the code repository of the cloned solution
Execute git add -A to add the changed code to the local repository.
Execute git commit -m “Cloned code” to commit the code to the local repository.
Execute git push to push the code into code repository.
Build the cloned solution components.
Select the “master” branch for each component during the build.
Before deploying the components and pipelines, you need to upload the parameters file to the shared drive of the solution and the data file into the data repository. To do so, perform the following steps:
Download /pipelines/dnn-training-pipeline/params.json from the NFS Drive of the original solution and upload it to the NFS Drive of the cloned solution, into both the /pipelines/dnn-training-pipeline and /pipelines/xgboost-training-pipeline folders.
Download the data files (store.csv, train.csv, test.csv) from the root folder of the NFS Drive of the original solution. These files represent store information, training data and test data respectively.
Push the files into the data repository of the cloned solution using the xpresso.ai Data Versioning library. To do so, perform the following steps:
Navigate to the data repository for the solution.
Create a new branch in the data repository, called “raw_data”.
Upload the three data files into the branch.
Deploy the pipelines of the cloned solution by specifying the following deployment parameters for the components:
data_fetch (in each pipeline)
Advanced Settings (Custom Docker Image) - dockerregistrysb.xpresso.ai/library/data_versioning:2.2
Advanced Settings (Args) - as below:
| Dynamic? | Name |
| --- | --- |
| No | -component-name |
| No | data_fetch |
Other components
Build Version = latest build version
Note
Any other parameters required by any component of the pipeline will be taken from the parameters file specified when running an experiment on the deployed pipeline.
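The two Args rows in the table form a single command-line pair, -component-name data_fetch, passed to the container. As a hypothetical sketch of how a component might consume it (argparse accepts the single-dash option name exactly as the table writes it; the parsing code here is ours, not xpresso.ai's):

```python
import argparse

# Hypothetical parser for the Args shown in the deployment table.
parser = argparse.ArgumentParser()
parser.add_argument("-component-name", dest="component_name", required=True,
                    help="name of the pipeline component to run")

# The table's two rows become this argument pair:
args = parser.parse_args(["-component-name", "data_fetch"])
print(args.component_name)  # -> data_fetch
```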
To run the pipeline, start an experiment by using the deployed version of each pipeline.
Specify the following parameters during the run:
Name of the pipeline - <name of the pipeline>
Version - latest deployed version
Run Name - any run name of your choice (do not use a name which you have already used)
Run Description - any description of your choice
parameters_filename - ml_params.json (this file contains values for parameters required by components of the pipeline)
To ensure the pipeline has run properly, view the run details and ensure that each pipeline has created a model in the model repository. You are ready to test the inference service for each model.
Note
The inference service will accept a set of data points as input and output the sales predicted by the model. Once an inference service has been deployed for each model, they can be combined to create an A/B Test. In an A/B Test, requests are randomly sent to the two inference services and results are obtained.
Combine the deployment of the inference services and A/B Testing as follows:
Open the Inference Services page.
Select both the inference services.
For each inference service, do the following:
Select the latest successful run for the appropriate pipeline.
Select the latest build version of the inference service.
Set the port name to “default” and value to 5000.
Specify any mesh name of your choice.
Specify the weights as "50" each in the routing strategy. This indicates that 50% of the requests will go to the first model, and 50% to the second (on average).
Deploy the inference services.
Note the URL obtained as a result.
To check the deployment, visit the Kubernetes dashboard for the solution.
After the services have been deployed successfully, open a tool such as POSTMAN, and follow the test instructions. You can use the sample data below for the request payload:
{ "input": { "Store": 238.0, "DayOfWeek": 5.0, "Promo": 0.0, "StateHoliday": 0.0, "SchoolHoliday": 0.0, "StoreType": 3.0, "Assortment": 2.0, "CompetitionDistance": 610.0, "Promo2": 0.0, "Day": 1.0, "Month": 7.0, "Year": 1.0, "isCompetition": 0.0, "NewAssortment": 3, "NewStoreType": 1 } }
Tip
The response should indicate the predicted sales (in dollars), as well as the name of the model which produced the response. As mentioned above, roughly 50% of the requests should be executed by each model.
Sample Response
{"message": "success", "results": [4350.8134765625], "run_name": "run_15"}
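The request can also be scripted. This is a sketch using only the standard library: the mesh URL is a placeholder for the one you noted after deploying the inference services (the exact endpoint path is given by the test instructions, not reproduced here), and the payload and sample response are the ones quoted above.

```python
import json
import urllib.request

# Placeholder: use the mesh URL noted after deploying the inference services.
AB_TEST_URL = "http://<mesh-host>:<port>/"

# The sample payload from the text.
payload = {"input": {"Store": 238.0, "DayOfWeek": 5.0, "Promo": 0.0,
                     "StateHoliday": 0.0, "SchoolHoliday": 0.0,
                     "StoreType": 3.0, "Assortment": 2.0,
                     "CompetitionDistance": 610.0, "Promo2": 0.0,
                     "Day": 1.0, "Month": 7.0, "Year": 1.0,
                     "isCompetition": 0.0, "NewAssortment": 3,
                     "NewStoreType": 1}}

def predict(url: str = AB_TEST_URL):
    """POST the payload to the A/B test mesh (not called here, since the
    URL above is a placeholder)."""
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Parsing the sample response quoted above:
sample = '{"message": "success", "results": [4350.8134765625], "run_name": "run_15"}'
reply = json.loads(sample)
print(reply["results"][0], reply["run_name"])  # -> 4350.8134765625 run_15
```

The run_name field is what tells you which model served the request; over many requests it should alternate between the two runs roughly evenly.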
Machine Learning (Spark)¶
Solution Name: sample_project_spark
This solution demonstrates Machine Learning pipelines on Spark.
The model built in this solution is trained to predict the probability that a specified patient will have a stroke in the next few months.
The solution uses the Random Forest classifier provided by pyspark. It consists of the following components, which perform feature engineering and ultimately build a model using the Random Forest classifier:
string_indexer - encodes a string column of labels into a column of label indices. Extends pyspark's StringIndexer.
one_hot_encoder - maps a categorical feature, represented as a label index, to a binary vector with at most a single one-value indicating the presence of a specific feature value from among the set of all feature values. Extends pyspark's OneHotEncoderEstimator.
vector_assembler - a component extending pyspark's VectorAssembler that assembles all the features into a feature vector.
feature_engg_and_classifier_pipeline - a Machine Learning pipeline that combines the string_indexer, one_hot_encoder and vector_assembler components
Shown below is the data snapshot.
Each attribute that we want to use as a feature has to go through transformations using the appropriate pyspark class (a standard part of feature preparation in any ML workflow).
We want to one-hot encode almost all the attributes (gender, age, hypertension, heart_disease, ever_married, work_type, Residence_type, smoking_status).
Rather than work directly with strings for these attributes, we first index each string with an integer using pyspark's StringIndexer. So, for each attribute, we have two stages in this order: string_indexer and one_hot_encoder.
Since this is a supervised ML example, our label is the stroke column. We need to index it as a label, hence the labelindexer stage, which reuses pyspark's StringIndexer.
Finally, vector_assembler (which uses pyspark's VectorAssembler) aggregates the prepared features into a feature vector; this is the final stage.
Training pipeline: the component stages run in the order shown below:
gender-string_indexer → gender-one_hot_encoder → age-string_indexer → age-one_hot_encoder → hypertension-string_indexer → hypertension-one_hot_encoder → heart_disease-string_indexer → heart_disease-one_hot_encoder → ever_married-string_indexer → ever_married-one_hot_encoder → work_type-string_indexer → work_type-one_hot_encoder → Residence_type-string_indexer → Residence_type-one_hot_encoder → smoking_status-string_indexer → smoking_status-one_hot_encoder → labelindexer → vector_assembler
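The stage ordering above is mechanical: two stages per feature attribute, then the label indexer, then the assembler. A small sketch that generates the same sequence of stage names (names only; it does not construct the actual pyspark stages):

```python
# Feature attributes listed in the text, in pipeline order.
FEATURES = ["gender", "age", "hypertension", "heart_disease",
            "ever_married", "work_type", "Residence_type", "smoking_status"]

stages = []
for attr in FEATURES:
    stages.append(f"{attr}-string_indexer")   # index strings as integers
    stages.append(f"{attr}-one_hot_encoder")  # then one-hot encode the index
stages.append("labelindexer")       # index the 'stroke' label column
stages.append("vector_assembler")   # assemble features into one vector

print(len(stages))            # -> 18
print(" → ".join(stages[:2]))
```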
How to use this solution?
You will work on a clone of this solution. The steps to be followed are:
Clone the solution.
Note
Cloning a solution does not clone its code, so you need to do this manually (Steps 2-4 below)
Clone the code repository of the sample solution
Navigate to the code repository of the sample solution.
Click “Clone” and copy the git clone command.
Execute the command on your machine.
Note
Ensure you have Git installed.
Clone the code repository of the cloned solution
Navigate to the code repository of the cloned solution.
Click “Clone” and copy the git clone command
Execute the command on your machine
Note
Ensure you have Git installed.
Copy code from the sample solution into the cloned solution
Commit and push the code back into the code repository of the cloned solution
Execute git add -A to add the changed code to the local repository.
Execute git commit -m "Cloned code" to commit the code to the local repository.
Execute git push to push the code to the code repository.
Build the cloned solution components and pipeline
Select the “master” branch for each component and pipeline during the build.
Before deploying the components and pipelines, you need to upload the data files into the HDFS folder for the solution.
Download the contents of the “input” folder under /pipelines/feature-engg-and-classifier-pipeline from the original solution and upload these into the cloned solution.
Deploy the pipeline of the cloned solution.
Note
You need to specify deployment parameters for the pipeline, not for each component (recall that in Spark, pipelines are executed as a whole, not as combinations of components); just specify the latest build version of the pipeline as the deployment parameter.
To run the pipeline, start an experiment using the deployed version of each pipeline.
Specify the following parameters during the run:
Name of the pipeline - <name of the pipeline>
Version - latest deployed version
Run Name - any run name of your choice
Do not use a name which you have already used.
Run Description - any description of your choice
To ensure the pipeline has run properly, view the run details.
Note
In xpresso.ai, Spark is run within Kubernetes. You can view the Kubernetes dashboard to see the Spark worker process in action.
The pipeline should have created a model in the “output” folder of the HDFS, as well as in the model repository.
Data Management¶
Solution Name: sample_project_data_management
This solution demonstrates a pipeline to fetch data from a file, explore it, and visualize the results, without writing a single line of code, by using components from the xpresso.ai Component Library.
The solution has the following components:
data_connection - a component of type “pipeline_job” to fetch data from the shared file system for the solution using the Data Connectivity component from the xpresso.ai Component Library
data_exploration - a component of type “pipeline_job” to explore data fetched by the data_connection component by using the Data Exploration component from the xpresso.ai Component Library
data_visualization - a component of type “pipeline_job” to visualize the explorations results found by the data_exploration component by using the Data Visualization component from the xpresso.ai Component Library
How to use this solution?
You will work on a clone of this solution. The steps to be followed are:
Clone the solution.
Note
You do not need to copy solution code since all the components are from the xpresso.ai Component Library.
You do not need to build any of the components since there is no coding required.
Before deploying the pipeline, you need to upload the parameters file and data file to the shared drive of the solution. To do so, perform the following steps:
Download /pipelines/data_con_exp_viz_pl/data_management_params.json from the NFS Drive of the original solution.
Upload it to the NFS Drive of the cloned solution, to the pipelines/data_con_exp_viz_pl folder.
Download /pipelines/data_con_exp_viz_pl/participant_data.csv from the NFS Drive of the original solution.
Upload it to the NFS Drive of the cloned solution, to the /pipelines/data_con_exp_viz_pl folder.
Deploy the pipeline of the cloned solution. Specify the following deployment parameters for the components:
data_connection
Advanced Settings (Custom Docker Image) = docker image specified in the component documentation, as per the instance you are working on.
Advanced Settings (Args) - as below
| Dynamic? | Name |
| --- | --- |
| No | -component-name |
| No | data_connection |
data_exploration
Advanced Settings (Custom Docker Image) - docker image specified in the component documentation, as per the instance you are working on
Advanced Settings (Args) - as below
| Dynamic? | Name |
| --- | --- |
| No | -component-name |
| No | data_exploration |
data_visualization
Advanced Settings (Custom Docker Image) - docker image specified in the component documentation, as per the instance you are working on
Advanced Settings (Args) - as below
| Dynamic? | Name |
| --- | --- |
| No | -component-name |
| No | data_visualization |
Tip
Any other parameters required by any component of the pipeline will be taken from the parameters file specified when running an experiment on the deployed pipeline.
To run the pipeline, start an experiment using the deployed version of the pipeline.
Specify the following parameters during the run:
Name of the pipeline - <name of the pipeline>
Version - latest deployed version
Run Name - any run name of your choice (do not use a name which you have already used)
Run Description - any description of your choice
parameters_filename - data_management_params.json (this file contains values for parameters required by components of the pipeline)
To ensure the pipeline has run properly, view the run details. You should see the exploration and visualization results in the output folders specified in the parameters file.