The Challenge

Traditional ML development is a complex, expensive, and iterative process. One reason for the complexity and expense is that there are few, if any, integrated tools that can manage the entire process. Assembling the right tools and workflows to set up an ML pipeline is time consuming and error prone, and it distracts from the broader business objectives.

A learned approach to priority setting & classification

The Road to Meeting the Challenge

Given the nature of today’s business, data insights are expected in near real time. Considering four factors – the need for speed, the complexity of implementing an ML model, the particular use case, and the urgency of critical insights – a more predictable approach built on a proven framework is the logical choice. To that end, our team has put together a working ML pipeline that can de-risk the investment of implementing ML models at scale. First, let’s look at the components the team selected for our pipeline framework:

Big Data Processing

Spark: Apache Spark is a fast, easy-to-use, general-purpose engine for big data processing with built-in modules for streaming, SQL, machine learning (ML), and graph processing. We use PySpark, Spark’s Python API.

Spark ML: Spark ML leverages Apache Spark and its distributed processing environment. We use it to stack multiple data-cleaning and preprocessing steps into a sequential pipeline.
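The sequential stacking idea behind Spark ML’s Pipeline can be sketched in plain Python. This is a minimal, illustrative analogue, not our production pipeline; the stage names (`DropNulls`, `Lowercase`) are hypothetical:

```python
# Minimal sketch of the sequential-pipeline pattern: each stage
# transforms the data and hands the result to the next stage.

class DropNulls:
    # Remove records that contain any missing value.
    def transform(self, rows):
        return [r for r in rows if None not in r.values()]

class Lowercase:
    # Normalize a text column to lowercase.
    def __init__(self, col):
        self.col = col
    def transform(self, rows):
        return [{**r, self.col: r[self.col].lower()} for r in rows]

class Pipeline:
    # Chain stages so the output of one feeds the next.
    def __init__(self, stages):
        self.stages = stages
    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

pipeline = Pipeline([DropNulls(), Lowercase("name")])
clean = pipeline.transform([
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": None},
])
print(clean)  # [{'name': 'alice', 'age': 30}]
```

In Spark ML the same pattern runs distributed over DataFrames, which is what lets the preprocessing scale with the data.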

AWS Glue: AWS Glue handles the provisioning, configuration, and scaling of the resources required to run your ETL jobs on a fully managed, scaled-out Apache Spark environment. Because it is serverless, you pay only for the resources used while your jobs are running.

Model Training & Prediction

Amazon SageMaker: Amazon SageMaker is a fully managed service that gives every developer and data scientist the ability to build, train, and deploy machine learning (ML) models quickly. Bottom line: it removes the heavy lifting from each step of the machine learning process, making it easier to develop high-quality models.

AWS Inference Pipelines: You can use trained models directly in an inference pipeline to make real-time predictions without performing external preprocessing. This is a real time saver! When you configure the pipeline, you can choose to use SageMaker’s built-in feature transformers, or you can implement your own transformation logic in just a few lines of scikit-learn or Spark code.
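The idea can be sketched in plain Python: preprocessing and the trained model sit behind a single entry point, so a raw request needs no external preprocessing. The functions and numbers below are purely illustrative stand-ins, not SageMaker APIs:

```python
# Sketch of the inference-pipeline idea: a feature transformer and a
# trained model are chained behind one call, so raw input goes straight
# to a prediction. All names and values here are illustrative.

def scale(features, mean=10.0, std=2.0):
    # Feature transformer: standardize raw numeric inputs.
    return [(x - mean) / std for x in features]

def predict(features, weights=(0.5, -0.25), bias=1.0):
    # Stand-in for a trained model's scoring function.
    return sum(w * x for w, x in zip(weights, features)) + bias

def inference_pipeline(raw_request):
    # One call runs transform + predict, as a deployed pipeline would.
    return predict(scale(raw_request))

score = inference_pipeline([12.0, 8.0])
print(score)  # 1.75
```

In SageMaker the same chaining happens server-side: the containers for the transformer and the model are deployed together behind one endpoint.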

The Chenoa ML Pipeline in AWS Cloud:

Benefits of our ML Pipeline

Scalable Process – Built on big data technologies like Spark, the pipeline can handle 1 TB of data.

Faster Turn-Around Time – With the framework in place, we can tweak models quickly. Because the entire pipeline runs in the cloud, with S3 buckets for data and model storage, we can package our ML application into a container and replicate our results across multiple platforms.

Partially Automated Data Preprocessing – Using Spark ML and Python scripts, we have accelerated data preprocessing.

Near Real-Time Prediction – Once input is provided, predictions are available to the end user in near real time. Predictions can be returned for a single input or stacked for multiple inputs in an Excel format.
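The single-versus-batch distinction can be illustrated with a few lines of standard-library Python. The scoring function and field names below are hypothetical stand-ins for a deployed model:

```python
# Illustrative sketch: the same scoring function can serve one record
# or a stacked batch (e.g. rows exported from a spreadsheet as CSV).
import csv
import io

def score_row(row):
    # Hypothetical model stand-in: weighted sum of two numeric fields.
    return 0.5 * float(row["f1"]) + 0.25 * float(row["f2"])

# Single input -> single prediction.
single = score_row({"f1": "2", "f2": "4"})

# Batch input: parse a CSV export and score every row.
sheet = "f1,f2\n2,4\n6,8\n"
batch = [score_row(r) for r in csv.DictReader(io.StringIO(sheet))]
print(single, batch)  # 2.0 [2.0, 5.0]
```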

The Results

With our framework in place, our data science and engineering team benchmarked their processes across multiple use cases and arrived at an average machine learning project life cycle of four weeks. The results demonstrated an overall 70% decrease in elapsed time from data transformation to model deployment.

Taking a closer look, our team saved an average of 80% of elapsed time on data preparation, 60% on model development, and 65% on model deployment. While the time saved varied, the most consistent gains were in scaling the data processing pipeline and in model development.

Next Steps

So far, our efforts have focused on classification models; we will now expand the ML pipeline to include regression, clustering, and explainable AI (XAI) modeling as well. We are most interested in enhancing the pipeline with model-agnostic methods to expedite the evaluation of model outcomes – building confidence both in our work and with our business partners.

Want to save time in model deployment using ML?

Get In Touch
