PySpark Pipelines


Build Guide

Every PySpark pipeline in an xpresso.ai solution is built by the Jenkins Build Pipeline defined for that solution. The build follows the order of stages specified in the Jenkinsfile, which is located in the xprbuild folder of each PySpark pipeline.

Stages in the Jenkins Build Pipeline for a PySpark Pipeline

| S. No. | Stage | Description | Steps |
|---|---|---|---|
| 1 | Checkout | Checks out source code from the code repository and cleans the target folder | 1. Checks out source code from the code repository. 2. Calls make clobber using the Makefile located at <component root>/xprbuild; this cleans the folder by removing any temporary files and binaries |
| 2 | Prepare | Prepares the build environment | Calls make prepare using the Makefile located at <component root>/xprbuild; this calls <component root>/xprbuild/system/linux/pre_build.sh |
| 3 | Build | Builds the Docker image for the component | Calls make build using the Makefile located at <component root>/xprbuild; this calls <component root>/xprbuild/docker/build.sh |
| 4 | Test | Tests the new Docker image | Calls make unittest using the Makefile located at <component root>/xprbuild; this calls <component root>/xprbuild/docker/test.sh |
| 5 | Docker Push | Pushes the new Docker image into the xpresso.ai Docker registry | Calls make dockerpush using the Makefile located at <component root>/xprbuild; this pushes the new Docker image into the registry |
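
To make the stage order concrete, the following is a minimal Python sketch that reproduces the make calls of the five stages locally. It is an illustration only, not the actual Jenkinsfile: the names STAGES, make_command and run_stages are hypothetical, the use of make -C to select the xprbuild folder is an assumption about how the Makefile is invoked, and the source checkout step of the Checkout stage is left out.

```python
import subprocess
from pathlib import Path

# Stage name -> make target, in the order the stages are listed above.
# The Checkout stage also checks out the source code first; that step is
# omitted here because it depends on the repository setup.
STAGES = [
    ("Checkout", "clobber"),
    ("Prepare", "prepare"),
    ("Build", "build"),
    ("Test", "unittest"),
    ("Docker Push", "dockerpush"),
]


def make_command(component_root, target):
    """Return the make invocation for one stage as an argument list."""
    makefile_dir = Path(component_root) / "xprbuild"
    # Assumption: make is pointed at the xprbuild folder with -C.
    return ["make", "-C", str(makefile_dir), target]


def run_stages(component_root, dry_run=True):
    """Run (or, with dry_run, only collect) the make call of each stage in order."""
    commands = []
    for stage, target in STAGES:
        command = make_command(component_root, target)
        commands.append(command)
        print(f"[{stage}] {' '.join(command)}")
        if not dry_run:
            # check=True stops at the first failing stage.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # "my_pipeline" is a hypothetical component root; dry run only.
    run_stages("my_pipeline")
```

Running the sketch with dry_run=True only prints the commands, which can help when reviewing what each make rule would do before triggering a real build.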

Repository Folder Structure for PySpark Pipelines

The folder structure of a PySpark pipeline repository is described in detail below:


| Folder | File | Description | Developer Tips |
|---|---|---|---|
| / | CHANGELOG.md | Stores a log of changes to the component | Document changes to the component in this file |
| / | Makefile | Makes the solution (see above for details) | Changes to this file are usually not required. However, it is a good idea to review the actions performed by the various make rules, especially clobber, prepare, build and dockerpush |
| / | README.md | Describes the pipeline | Write a brief description of the pipeline and the source files it requires in this file |
| / | VERSION | Stores the pipeline version number | Write the pipeline version number here |
| /app | __init__.py | Dummy source code | |
| /app | .gitignore | Dummy .gitignore file | Populate this as needed to exclude files from git tracking |
| /app | app.py | Dummy source code | app.py is the default entry point for the PySpark pipeline |
| /requirements | requirements.txt | Lists the libraries that must be installed for the pipeline to function properly | Libraries are installed as part of the Build stage of the Jenkins pipeline (see above) |
| /xprbuild | Jenkinsfile | Stores the actions performed by the Jenkins pipeline for the component | See above for details. Review this file, but do not change it. If required, change the scripts called by the pipeline instead |
| /xprbuild/docker | Dockerfile | Stores commands processed when building the Docker image for the component | Default actions: 1. Call <component root>/xprbuild/system/linux/pre_build.sh; 2. Call <component root>/xprbuild/system/linux/build.sh; 3. Call <component root>/xprbuild/system/linux/post_build.sh; 4. Call <component root>/xprbuild/system/linux/run.sh. Change this file as per the component requirements. See Docker documentation for details |
| /xprbuild/docker | build.sh | Called during the Build stage of the Jenkins Build Pipeline | Builds the Docker image as per the instructions in Dockerfile by default. Change as per pipeline requirements |
| /xprbuild/docker | pre-build.sh | Unused | |
| /xprbuild/docker | test.sh | Called during the Test stage of the Jenkins Build Pipeline | Executes pytest on the new Docker image by default. Change as per pipeline requirements |
| /xprbuild/system | Makefile | Unused | |
| /xprbuild/system/linux | build.sh | Called during the Build stage of the Jenkins Build Pipeline | Installs the requirements listed in <pipeline root>/requirements/requirements.txt by default. Change as per pipeline requirements |
| /xprbuild/system/linux | post-build.sh | Called during the Build stage of the Jenkins Build Pipeline | Does nothing by default. Change as per pipeline requirements |
| /xprbuild/system/linux | pre-build.sh | Called during the Prepare and Build stages of the Jenkins Build Pipeline | Installs Python and pytest by default. Change as per pipeline requirements |
| /xprbuild/system/linux | run.sh | Called during the Build stage of the Jenkins Build Pipeline | Runs the code in <pipeline root>/app/app.py by default. Change as per pipeline requirements |
| /xprbuild/system/linux | spark-submit.sh | Called when the Spark pipeline is submitted to the cluster, i.e. at deployment | Submits the Spark pipeline to the cluster, where it starts running from app/main.py |
| /xprbuild/system/linux | test.sh | Unused | |
| /xprbuild/system/windows | Makefile | Unused | May be required in the future to support Windows deployment |
| /xprbuild/system/windows | build.bat | Unused | May be required in the future to support Windows deployment |
| /xprbuild/system/windows | post-build.bat | Unused | May be required in the future to support Windows deployment |
| /xprbuild/system/windows | pre-build.bat | Unused | May be required in the future to support Windows deployment |
| /xprbuild/system/windows | run.bat | Unused | May be required in the future to support Windows deployment |
| /xprbuild/system/windows | test.bat | Unused | May be required in the future to support Windows deployment |
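
The folder structure above can be checked mechanically before a build is triggered. The following is a minimal sketch, not part of xpresso.ai itself: the names EXPECTED_FILES and missing_files are hypothetical, and the list covers only the files described above as used on Linux, leaving out those marked Unused.

```python
from pathlib import Path

# Files taken from the folder structure described above.
# Entries marked "Unused" and the Windows scripts are deliberately left out.
EXPECTED_FILES = [
    "CHANGELOG.md",
    "Makefile",
    "README.md",
    "VERSION",
    "app/__init__.py",
    "app/app.py",
    "requirements/requirements.txt",
    "xprbuild/Jenkinsfile",
    "xprbuild/docker/Dockerfile",
    "xprbuild/docker/build.sh",
    "xprbuild/docker/test.sh",
    "xprbuild/system/linux/build.sh",
    "xprbuild/system/linux/post-build.sh",
    "xprbuild/system/linux/pre-build.sh",
    "xprbuild/system/linux/run.sh",
    "xprbuild/system/linux/spark-submit.sh",
]


def missing_files(pipeline_root):
    """Return the expected files that are absent under pipeline_root."""
    root = Path(pipeline_root)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Checks the current directory; pass another path to missing_files as needed.
    for name in missing_files("."):
        print(f"missing: {name}")
```

An empty result from missing_files means every listed file is present; it says nothing about whether the contents of those files are correct.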