1,982 reads

The Essential Architectures For Every Data Scientist and Big Data Engineer

by Sharmistha ChatterjeeAugust 11th, 2020

Too Long; Didn't Read

Feature Store has become an important unit of organizations developing predictive services across any industry domain. The essential Architectures For Every Data Scientist and Big Data Engineer are available in this blog. We highlight on the features supported by different Feature Store frameworks, that are primarily developed by different leading industry giants. Feature Store provides a platform for feature-sharing for ML-sharing models across different datasets. It provides a horizontally scalable multi-tenant architecture for multiple models with suitable scaling and monitoring. It provides options to define hierarchical partitioning schemas to train models per partition, that can be deployed as a single logical model.

Companies Mentioned

Coin Mentioned

featured image - The Essential Architectures For Every Data Scientist and Big Data Engineer

Comprehensive List of Feature Store Architectures for Data Scientists and Big Data Professionals

Introduction & Motivation - Why Feature Store

Feature store has become an important unit of organizations developing predictive services across any industry domain. Some of the earlier challenges in deploying ML solutions at scale involves :

Developing and maintaining customized systems by individual teams with little or no coordination.
No collaborative system for sharing features for similar type ML models (models from a similar domain or models addressing. same business use-cases or customer domains).
Increased cognitive burden without the proper scope of scalability.
Limited integration with big-data ecosystems.
Limited scope for model retraining, comparison, model governance, and traceability, limiting agile development life-cycle.
Difficult to track and retrain model which exhibits seasonality.

To overcome the above limitations, Architects. Data scientists, Big Data, and Analytics professionals have felt the necessity to walk under one roof with one unified framework to facilitate easier collaboration, sharing of data, results, reports. Departments, teams and organizations shared some of the similar notions of Feature Engineering:

Feature Engineering is expensive and amortization happens over time and across models.
The increase in cost is non-linear/exponential with the increase in the number of features.
Triggers/Alerts due to addition/removal of feature is high.
Most often dependencies are not documented/tracked which results in an increase of implicit and explicit dependencies getting added over time.

While sharing a similar opinion, it became easier to come together and create a Unified Framework called Feature Store. This would enhance the speed of ML model deployment life-cycle along with the creation of proper documents, required version analysis, and model performance in order to save time and effort.

In this blog, we highlight on the features supported by different Feature Store frameworks, that are primarily developed by different leading industry giants.

Advantages of Feature Store

Ability to re-use and discover features between teams across the organization.
Features should be governed by adding features like access control and versioning.
Ability to precompute and automatically backfill features --- including online computation and offline aggregation.
Helping to create a collaborative environment between data scientists and big data engineers.
Save effort and cost by sharing not only features but also related artifacts, documents, marketing insights of models developed from these features.
Enable consistency between training and serving.

Michaelengelo From Uber

Michaelangelo - a framework developed by Uber that allows feature integration/joining in both offline and online pipelines. Here Hive (Offline) and Cassandra (Online) acts as the main storage unit for raw/transformed features. It provides a horizontally scalable multi-tenant architecture for multiple models with suitable scaling and monitoring. Training jobs can be configured and managed through a web UI or an API,

It further provides options to define hierarchical partitioning schema to train models per partition, that can be deployed as a single logical model. This provides easy bootstrapping and helps to overcome challenges when several models need to be trained based on the hierarchical structure of the data.

At runtime during serving, it finds root to the best model for each node. Further its best known for its ability to support continuous learning, providing integration with AutoML, along with its support for distributed deep learning.

Feast Feature Store

Google released Feast which is primarily built around Google Cloud services: Big Query (offline) and Big Table (online) and Redis (low-latency), with Apache Beam for feature engineering. It allows a clear separation between big data and model development. This online predictive service allows feature sharing among teams with strong consistency between model training and serving.

Further Feast comes with centralized feature management, discovery, feature validation, and feature aggregation. The feature columns reside inside wide-entity tables. In addition, the composite entities separate individual features.

Wix Feature Store

Wix provides a platform for feature-sharing across different ML models for both batch and real-time datasets. It supports a pre-configured set of feature families on the site and user-level for both training and serving models. The different stages of data management, model training and deployment are marked and show in the figure above. It further uses S3 to store real-time extracted features.

FeatureStore from Comcast

The Feature Store developed by Comcast helps data scientists to reuse versioned features, upload online (real-time)/streaming data, and review feature metrics by models. The product is available in multiple pluggable feature store components. The built-in model repository contains artifacts related to data pre-processing (normalization, scaling) displaying the required mapping to the features needed to execute the model. Further, the architecture is built using Spark on Alluxio (open source data orchestration layer that brings data close to compute for big data and AI/ML workloads in the cloud), S3, HDFS, RDBMS, Kafka, Kinesis. The Model deployment with Kubeflow helps to build a resilient, highly available distributed systems with support for rate-limiting, shadow deployments, and auto-scaling.

The integration with Data Lake with suitable API s helps data scientists to use SQL and create training/validation/test datasets that can be versioned and integrated into the full model pipeline. In addition, the framework comes with the support of Seldon Inference Graphs for A/B Testing, Ensembles, Multi-armed bandits, Custom combinations. The end to end system not only provides traceability from use-case, models, features, model to features mapping, versioned datasets, model training codebase, model deployment containers, and prediction/outcome sinks, it is also known for integration with Feature-Store, Container Repository, and Git to integrate data, code and run-time artifacts for CI/CD integration.

Just like any other architecture, it has continuous Feature Aggregation on streaming data + on-demand features. The Online Feature Store uses the following sequences before giving a prediction:

Payload only contains Model Name & Account Number
Model Metadata informs which features are needed for the model
Pull required by features by Account Number
Pass a full set of assembled features for model execution

is a multi-tenant architecture that integrates AWS Sagemaker, Databricks, Kubernetes, and Jupyter Notebook. It also supports integration with Authentication frameworks like LDAP, Kerberos, and Oauth2.

The Batch / Live Streaming functionality is facilitated by Apache Beam, Apache Flink, and Apache Spark, whereas the model governance and monitoring pipeline are built using Kafka and Spark Streaming.

The architecture is composed of several building blocks namely

The Feature Store API - For reading/writing to/from the feature store.
The Feature Store Registry - User-Interface to discover features.
Feature Metadata - Documentation, Analysis and Versioning
Feature Engineering Job - For computationStorage Layer - For feature storage

The feature store developed by Netflix supports both online and offline model training and development. The online micro-services enables the framework to collect the data elements required by the feature encoders in a model. It further passes this downstream for future use by offline predictions. The Fact Logging service of Netflix logs user-related, video-related and computation specific features in a serialized format in appropriate storage units (S3).

The unique point of this architecture is the presence of components that help to:

Develop/Create contexts to snapshot
Snapshot data of various micro-services for the selected context
Build APIs to serve this data for a given time coordinate in the past

As snapshotting data for all contexts (e.g all member profiles, devices, times of day) would incur overhead and cost, Netflix relies on selecting samples of contexts to snapshot periodically (at regular intervals - daily/twice daily), though different algorithms. It achieves this through Spark, by training data on different distributions, and by using stratified samples based on properties such as viewing patterns, devices, time spent on the service, region, etc.

Netflix embraces a fine-grained Service Oriented Architecture for cloud-based deployment model.

FBLearner from Facebook

The FBLearner designed by Facebook is a framework for AI WorkFlow with Model Management and Deployment. It is mainly composed of 3 components - FB Learner Feature Store (runs on CPU), FB Learner Flow (runs on CPU +GPU), and FB Learner Predictor (runs on CPU). It supports building all kinds of deep learning models (Caffe2, Pytorch, Tensorflow, MxNet, CNTK) and models can be stored in ONNX format ( across converters, runtimes, compilers, and visualizers. supports and to) across different hardware/software platforms.

The above broad categories can be seen as creating logical units from hardware to application software.

Frameworks (FB Learner Feature Store) needed to create, migrate and train models.
Platforms (FB Learner Flow) for model deployment and management and
Infrastructure (FB Learner Predictor) needed to compute workloads and store data

Facebook also uses a principle to split development and deployment (production) environments.

Pinterest Feature Store

Pinterest's - Big Data Machine Learning is a classic example of high speed and quality which is scalable, reliable, and secure. This Metadata-driven framework is built using open-source technology with individual building blocks that help in reusability. It also provides governance: enforcement & tracking.

The uniqueness of this architecture lies in capturing relationships and interactions (clicks made by users) between pins (how objects are organized into collections).

The below figure illustrates the different components in model governance and development architecture

Zipline from Airbnb

The predictive system ZipLine created by Airbnb relies on a scoring service based on features gathered in due time and space. The scoring log (acts as debug/audit log) is computed/updated daily to ensure feature consistency and single feature definition both during training ML model and deploying them at production. In addition, it ensures Data Quality monitoring, feature back-filling, and making features searchable and sharable.

The architecture integrated with data sources -- Hive Table, databases and Jitney's Event Bus apart from Apache Spark (batch) and Flink (streaming) with Lambda as serving point. The uniqueness of this platform lies in :

Reduction of custom pipeline creations
Reducing data leaks in custom aggregations
Feature distribution observabilityImproved model iteration workflow

TFX

TensorFlow Extended (TFX), a TensorFlow based general-purpose machine learning platform provides orchestration of many components—a learner for generating models based on training data, modules for analyzing and validating both data as well as models, and finally infrastructure for serving models in production. The platform is particularly known for training, validation, visualization, and deployment of fresh newly trained models in production continuously relatively quickly. The individual components can share utilities that allow them to communicate and share assets. Due to fast training data and deserialization teams and community can share their data, models, tools, visualizations, optimizations, and other techniques

The components are further known for gathering statistics over feature values: for continuous features, the statistics include quantiles, equi-width histograms, the mean and standard deviation, whereas for discrete features they include the top-K values by frequency. In addition, the components support the computation of model metrics on slices of data e.g., on negative and positive examples in a binary classification problem) and cross-feature statistics like correlation and covariance between features. These statistics give insights to users on the shape of each dataset.

Further, the architecture also provides configuration free validation-setup enabled for all users, multi-tenancy to serve multiple machine-learned models concurrently, soft model-isolation to increase model performance.

Apache Airflow

Apache Airflow :

Apache Airflow's entire architecture is based on the concept of DAG (Directed Acyclic Graph), which takes into account the dependencies within them. Its principal responsibility to ensure all things happen at the right time and in the right order. The DAGs define a single logical workflow and they are defined in python files.

Further, it supports Airflow Operators which states what steps are executed over time (e.g. download or transfer operators- GoogleCloudStorageDownloadOperator ). One such Operator is the GoogleCloudStorageObjectSensor which pauses execution until aa key appears in S3.

Apache Airflow guarantees Idempotence (ensuring subsequent execution of any step produces the same end-result, irrespective of the number of times.), Atomicity, and Metadata Exchange. Data exchange between different components of this distributed architecture is facilitated using XCOM (cross-communication) that provided an exchange of small metadata. However, for large volumes of data, it supports shared network storage, data lake (S3) or URI based exchange through XCOM.

Parameterized representations of operators help DAG to run tasks that spawn a TaskInstance at a particular instant of time. Further, the instances within Apache AirFlow DAG are grouped into a DagRun.

Zomato Feature Store

Zomato's restaurant business heavily relies on stream data processing to compute running orders at the restaurant at any given point. The architecture use Apache Flink that provides job level isolation for each ML model as features from each ML model maintain their separate space for research, analysis, logging and do not interact with features from other ML models.

In addition to streaming and online feature extraction, the life-cycle management of ML models are provided by MLFlow. The ML models are served to the external world via API Gateway by means of AWS Sagemaker endpoints.

Overton from Apple

Overton automates the life cycle of model construction, deployment, and monitoring by providing a set of novel high-level, declarative abstractions. It supports multi-task learning to concurrently predict several ML models in both real-time and backend production applications.

Further, the architecture allows separation between model and data with two components the tasks, which capture the tasks the model needs to accomplish, and payloads that represent sources of data, such as tokens or entity embeddings.

The model training is governed by a schema file, which acts as a guide to compile a TensorFlow model and to describe its output for downstream use. Overton also embeds raw data into a payload, which is then used as input to a task or to another payload. The payloads are either singletons (e.g., a query), sequences (e.g. a query tokenized into words or characters), and sets (e.g., a set of candidate entities).

StreamSQL Feature Store

StreamSQL Feature store is alow latency based model development framework with high throughput serving. It allows new model features to be deployed confidently with versioning with much with ease. With the use of feature definitions, consistent feature deployment is ensured across training, in serving and across production.

The architecture is also known for its ability to increase model performance by integrating features from 3rd party. It combines batch and stream processing with an immutable ledger, where each event is appended to the end of the ledger. Further, the framework at any point allows the addition of new data sources/transformations (from Flink and Spark. Files, tables, and stream), modify or create a new set of features and even analyze/discover features from feature registry.

Feature Store from Tecton

Tecton has come up with a unified architecture to develop, deploy, curate/govern and monitor a platform built to standardize high-quality features, labels, and data sets for ML models in production, ensuring the safe operation of models over time, with proper reproducibility, lineage, and logging.

The Tecton platform consists of:

Feature Pipelines for transforming your raw data into features or labels

A Feature Store for storing historical feature and label data
A Feature Server for serving the latest feature values in production
An SDK for retrieving training data and manipulating feature pipelines
A Web UI for managing and tracking features, labels, and data sets
A Monitoring Engine for detecting data quality or drift issues and alerting

Hybrid Feature Store

The above figure illustrates a Hybrid Feature Store with Data Pipeline, BI Platforms (Tableau) using Apache Airflow, S3, Hopsworks Feature Store, and Data Lakes from Cloudera. The platform is capable of ingesting raw data, event or SQL data at the input.

Feature Store from Scribble data

The Feature Store provided by Scribble Data puts lots of stress on Input Data Correctness and Completeness (gaps, duplicates, exceptions, invalid values), as it is known to play an impact on ML models' prediction. Hence it recommends a continuous check/early morning system to prevent poor quality data from coming into the system. On the reactive side, the system undertakes a continuous process to improve ML operations over time.

Conclusion

Here we have discussed about different architectural frameworks using Big Data (some of them are Open Source tools), ML model training and serving tools, along with orchestration layer (such as Kubernetes). Each of the component is equally important and they go hand in hand to create a real-time end to end predictive system.