Data Scientists have become newly minted developers in their own right. In becoming developers, it is useful to understand development principles that software engineers use to iteratively test, construct and shape their deployed code.
In this article, we’ll talk about some often-misunderstood development principles that will guide you toward building more resilient, production-ready pipelines using CI/CD tools. Then, we’ll make it concrete with a tutorial on how to set up your own pipeline using Buddy.
A Representation of the Modern Data Science Workflow. Created with Notability for iPad.
Understanding the components of development is the first step to understanding which pieces go where and how they fit together. Each of these elements serves as a crucial building block toward the coveted end-to-end pipeline. In this case, “end-to-end” is jargon for “you make a code-level change, and the end user experiences the effect”.
In a nutshell, you as the data scientist would use the development pipeline to push changes from your local machine to a version control tool, and have these changes reflected in the cloud deployment service for your end users. Next, we’ll break down an example development pipeline step by step.
However, we’re still missing a crucial piece of the puzzle. Uploading code to GitHub and setting up an AWS deployment is great, but if there are changes or upgrades to that codebase, the AWS deployment will not automatically reflect them. Instead, each version for deployment will have to be manually updated. While this is suboptimal in terms of effort and time, there is also a possibility that your flashy new update might break the basic functionality of your original dashboard. This mistake is compounded when working with a team of data scientists to create a product.
To supply this missing puzzle piece, we introduce the concept of Continuous Integration / Continuous Deployment, abbreviated as CI/CD. A CI/CD tool bridges the gap between development and operations activities through automation. It helps you test your new changes and integrate them into the existing body of work. Buddy is an excellent option for this tool when setting up your deployment pipeline.
You might be wondering: how is this going to stop my development pipeline from breaking? Let’s explore the value of using Buddy. CI/CD is really a process rather than a single tool: it involves adding testing, automation, and delivery benchmarks to connect your GitHub repository to the cloud configuration.
Buddy functions as a Swiss Army knife when it comes to deployment operations. Let's examine each element in turn:
Now that we have established the premise of CI/CD and its uses, let’s dive right into a first look at Buddy’s platform and how you can get a basic pipeline off the ground.
Buddy conveniently syncs with all of your GitHub repositories, public and private.
The repository we’ll use for this tutorial is an interactive dashboard built with Streamlit. Once you fork the repository on GitHub, it will show up in the repo list within Buddy. Clicking on the name will lead us to the next screen.
Buddy scans your repo metadata to recommend a relevant environment setup. Here, Buddy has already detected that the repository contains a Python app and shows us more options for setting up the relevant Python environment. At this step, we also have to select how the pipeline should trigger.
I went with the ‘on-push’ trigger to master-branch, so all my latest and greatest changes will be acted on.
This is the home for pipeline building. Add new actions to your pipeline either by searching or clicking on icons.
As mentioned, Buddy has detected that our app is written in Python, so we’ll click on that icon first. Here’s where we can configure the environment and choose the relevant Python version (in this case, python3.7). A quick look at the project’s README.md tells us the Bash lines needed to get the app up and running:
pip install --upgrade streamlit
pip install -r requirements.txt
The first line ensures that we are running the latest version of streamlit, and requirements.txt contains the remaining dependencies we need to run our app.
At the bottom, we can also see the Exit Code Handling section, which lets us define what happens when any step in the pipeline returns an error. We can soldier on (not recommended, for obvious reasons), stop the pipeline where it broke and send a notification that something went wrong, or try running different commands. Identifying where something has broken is perhaps the most frustrating part of fixing a broken process. Setting up error-handling behavior and notifications early will help keep frustration to a minimum when some element inevitably breaks.
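To make the "stop where it broke" option concrete, here is a minimal Python sketch of the general idea: run the steps in order and halt at the first non-zero exit code. It is an illustration only, not how Buddy is implemented; the function name `run_steps` and the three placeholder steps are assumptions made for this example.

```python
import subprocess
import sys


def run_steps(steps):
    """Run each command in order and stop at the first non-zero exit code.

    Returns (index_of_failed_step, exit_code), or (None, 0) if every step passed.
    """
    for index, command in enumerate(steps):
        completed = subprocess.run(command)
        if completed.returncode != 0:
            return index, completed.returncode
    return None, 0


# Hypothetical steps: each one is a tiny Python process so the sketch runs anywhere.
steps = [
    [sys.executable, "-c", "print('install dependencies')"],
    [sys.executable, "-c", "import sys; sys.exit(3)"],  # simulated failing step
    [sys.executable, "-c", "print('never reached')"],
]

failed_step, exit_code = run_steps(steps)
print(f"failed step: {failed_step}, exit code: {exit_code}")
```

Because the second step exits with a non-zero code, the third step never runs. That is the behavior you want from a pipeline that deploys only code that built and tested cleanly.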
The build commands allow you to write any BASH or SH scripts you need to get the environment set up right.
Raw Logs allow you to see exactly what happened during execution, and the timer is a convenient method for estimating runtimes.
Awesome! The build completed without errors. If you have followed the tutorial to this stage and hit errors, check that the Python version is exactly python3.7, because that is required for this particular app’s dependencies.
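If you want the build itself to flag a version mismatch, a small check like the one below could run as an early step. This is a sketch under assumptions: the helper name `version_matches` is made up for illustration, and the required version is simply the 3.7 mentioned in this tutorial.

```python
import sys

REQUIRED_VERSION = (3, 7)  # taken from this tutorial; adjust for your own app


def version_matches(version_info, required=REQUIRED_VERSION):
    """Return True when the major.minor part of version_info equals `required`."""
    return tuple(version_info[:2]) == required


if not version_matches(sys.version_info):
    print(
        f"Warning: expected Python {REQUIRED_VERSION[0]}.{REQUIRED_VERSION[1]}, "
        f"found {sys.version_info.major}.{sys.version_info.minor}"
    )
```

The check only warns here. In a real pipeline you might prefer to exit with a non-zero code so that the exit code handling described earlier stops the build.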
“Unit tests give you the confidence that your code does what you think it does”
Adding unit tests can be as simple as adding Python files to the same repository. To run these tests, we’ll return to Step 3: Building the environment, and add a new line there to run them.
Adding a test command to the build step, for example “python -m unittest” (which discovers and runs the test files in the repository), will run your tests when the environment is built. If all tests pass, the pipeline will continue. Once the tests have been implemented, this is where we would expect to see the results. In this case, the error-handling setup becomes particularly important, as Buddy can send notifications if some tests fail.
If all tests pass, the build log will end with “Build finished successfully”.
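For readers who have not written a unit test before, here is a minimal sketch of what such a test file might look like, using Python's built-in unittest module. The function `fahrenheit_to_celsius` is a hypothetical stand-in for a function in your own app; it does not come from the tutorial repository.

```python
import unittest


def fahrenheit_to_celsius(temp_f):
    """Hypothetical helper standing in for a function from your own app."""
    return (temp_f - 32) * 5 / 9


class TestFahrenheitToCelsius(unittest.TestCase):
    def test_freezing_point(self):
        # 32 degrees Fahrenheit is the freezing point of water, 0 degrees Celsius.
        self.assertAlmostEqual(fahrenheit_to_celsius(32), 0.0)

    def test_boiling_point(self):
        # 212 degrees Fahrenheit is the boiling point of water, 100 degrees Celsius.
        self.assertAlmostEqual(fahrenheit_to_celsius(212), 100.0)
```

Saved under a name such as test_conversion.py (an assumed filename), this file would be picked up by unittest's default test discovery, which looks for files matching test*.py. A failing assertion makes the test command exit with a non-zero code, which is what lets the pipeline stop.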
Adding notifications is critical to ensuring we know where breaks in the pipeline occur and which tests have failed. From the pipeline overview, click on the “Actions Run On Failure” section, where we can decide what actions will run if there is an error anywhere in the pipeline. For our purposes, it will be sufficient to set this up using environment variables that indicate which execution or test broke the pipeline.
$BUDDY_PIPELINE_NAME gives us the name of the pipeline that is broken.
$BUDDY_EXECUTION_ID gives us the unique identifier of the pipeline run that produced the error.
$BUDDY_FAILED_ACTION_LOGS gives an extensive overview of the logs of what went wrong, which helps in diagnosing any issues that pop up. It may even be possible to solve the issue by just glancing at the email, fixing the code, and making a new commit, without needing to visit the CI/CD tool at all.
An extensive array of environment variables is available, and more can be defined with ease.
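As an illustration of how these three variables could be combined into a readable failure message, here is a small Python sketch. It assumes the variables are readable from the process environment of the failing action, which is an assumption for this example. The function name `build_failure_message` and the sample values are made up.

```python
import os


def build_failure_message(env=None):
    """Compose a plain-text failure summary from the three Buddy variables.

    `env` defaults to os.environ; pass a dict to try the function outside Buddy.
    """
    env = os.environ if env is None else env
    pipeline = env.get("BUDDY_PIPELINE_NAME", "<unknown pipeline>")
    execution = env.get("BUDDY_EXECUTION_ID", "<unknown execution>")
    logs = env.get("BUDDY_FAILED_ACTION_LOGS", "<no logs available>")
    return (
        f"Pipeline '{pipeline}' failed (execution {execution}).\n"
        f"Logs from the failed action:\n{logs}"
    )


# Made-up sample values so the sketch runs on any machine.
sample_env = {
    "BUDDY_PIPELINE_NAME": "dashboard-deploy",
    "BUDDY_EXECUTION_ID": "42",
    "BUDDY_FAILED_ACTION_LOGS": "ModuleNotFoundError: No module named 'streamlit'",
}
print(build_failure_message(sample_env))
```

Falling back to placeholder text when a variable is missing keeps the notification useful even if one value is not set.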
Here, the issue is clearly that my build environment is deprecated, so I need to choose a Python version that is still being maintained. In this case, that’s python 3.7.
To upload the tested files to the EC2 machine, select the SFTP action and make the connection between Buddy and the public IPv4 address of the EC2 machine.
Using the Pipeline Filesystem is important here because it makes use of the tested files.
Here, I’ve entered my Hostname & Port and Login information, and used my private SSH key to give Buddy access to the EC2 machine. There are two caveats to mention here:
git commit "app.py" -m "Buddy cicd test"
git push
Hooray! We correctly set up the environment and sent changes from the local machine to GitHub. The code was then built, run through the unit tests, and uploaded to the EC2 machine, where the changes were reflected in our visualization.
Let’s take a look at the final product. You can also visit Streamlit’s version of this app. This is the front-end visualization, powered by Streamlit. To review, we’ve taken Python code and committed it to a versioning tool (in this case, GitHub). This repo is then linked to a CI/CD tool (Buddy), which syncs, tests, and integrates our commits into the overall build, hosted on an AWS EC2 machine.
Full disclosure, this is a sponsored article by Buddy. I do use Buddy CI/CD in my projects, and have leveraged their technology to develop and deliver end-to-end pipelines to a number of data engineering clients.