Friday, January 28 • 6:00pm - 6:50pm
Building data pipelines for Anomaly Detection

Cloud-native applications. Multiple cloud providers. Hybrid cloud. Thousands of VMs and containers. Complex network policies. Millions of connections and requests in any given time window. This is the typical situation a Security Operations Center (SOC) analyst faces every single day. In this talk, the speaker describes the highly available, highly scalable data pipelines he built for the following use cases:
* Denial of service: a device in the network stops working.
* Data loss: for example, a rogue agent in the network transmitting IP data outside the network.
* Data corruption: a device starts sending erroneous data.

The above can be solved with anomaly detection models. The main challenge is the data engineering pipeline: with almost 7 billion events occurring every day, processing and storing them for further analysis is a significant undertaking. The machine learning models for anomaly detection have to be updated every few hours, which requires the pipeline to build the feature store within a very small time window.
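As a rough illustration of the kind of model the pipeline feeds (a toy sketch, not the speaker's actual approach), a sliding-window z-score over per-interval event counts already flags two of the use cases above: a sudden silence (denial of service) and a sudden spike (possible exfiltration):

```python
from collections import deque
from statistics import mean, stdev

def zscore_anomalies(counts, window=5, threshold=3.0):
    """Flag per-interval event counts that deviate sharply from the
    trailing window -- a crude stand-in for an anomaly-detection model."""
    history = deque(maxlen=window)  # trailing baseline of recent counts
    flagged = []
    for i, c in enumerate(counts):
        if len(history) == window:
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(c - mu) / sigma > threshold:
                flagged.append(i)
        history.append(c)  # deque evicts the oldest count automatically
    return flagged

# Index 7 drops to zero (device stopped working); index 8 spikes
# (unusual outbound volume). Both stand out against the baseline.
counts = [100, 102, 98, 101, 99, 100, 101, 0, 900]
print(zscore_anomalies(counts))  # → [7, 8]
```

In production these statistics are computed continuously over the event stream rather than over a Python list, which is what motivates the streaming components described below.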

The core components of the data engineering pipeline are:
* Apache Flink
* Apache Kafka
* Apache Pinot
* Apache Spark
* Apache Cassandra

The event logs are ingested into Pinot through a Kafka topic; Pinot supports an Apache Kafka-based indexing service for real-time data ingestion. Pinot has only basic capabilities for computing sliding-time-window statistics, so more complex real-time statistics are computed in Flink. Apache Flink is a stream-processing engine that provides high throughput and low latency. Spark jobs handle batch processing, and Cassandra serves as both the data warehouse and the final database.
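The keyed sliding-window statistics delegated to Flink can be sketched in plain Python (an illustrative stand-in for the actual Flink job; the per-source-IP keying is an assumption for the example):

```python
from collections import defaultdict, deque

def sliding_window_counts(events, window_ms=60_000):
    """Per-key event counts over a sliding time window.
    `events` is a list of (timestamp_ms, key) pairs in timestamp order;
    one count snapshot is emitted per incoming event."""
    buffer = deque()            # events still inside the window
    counts = defaultdict(int)   # current count per key
    snapshots = []
    for ts, key in events:
        buffer.append((ts, key))
        counts[key] += 1
        # Evict events at or beyond the window horizon behind the newest event.
        while buffer and buffer[0][0] <= ts - window_ms:
            _, old_key = buffer.popleft()
            counts[old_key] -= 1
            if counts[old_key] == 0:
                del counts[old_key]
        snapshots.append(dict(counts))
    return snapshots

events = [(0, "10.0.0.1"), (1_000, "10.0.0.2"), (61_000, "10.0.0.1")]
print(sliding_window_counts(events)[-1])  # → {'10.0.0.1': 1}
```

A real Flink job additionally handles out-of-order events, watermarks, and fault-tolerant state, which is exactly why a dedicated stream processor is used instead of hand-rolled logic like this.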

The speaker walks through the architectural decisions and shows how to build a modern real-time stream-processing data engineering pipeline using the tools above.

* The problem: overview
* Different architecture choices
* The final architecture: a brief explanation
* Real-time processing
  * Apache Flink
    * Micro-batching vs streaming?
    * Basic Spark Streaming micro-batching with state
    * Flink ADS: Asynchronous Distributed Snapshot
    * Why Flink for this application?
  * Apache Pinot
    * What is OLAP?
    * ClickHouse vs Druid vs Pinot
    * Why Pinot for this application?
* Batch processing
  * Apache Spark
    * Data engineering + machine learning
    * ML and MLlib
  * Apache Cassandra
    * What is OLTP?
    * Cassandra vs HBase vs Couchbase vs MongoDB
    * Why Cassandra for this application?
* A short demo
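The "micro-batching vs streaming" question in the outline boils down to when results materialize. A toy sketch in plain Python (standing in for the Spark Streaming and Flink APIs) makes the difference concrete:

```python
def micro_batch_sums(stream, batch_size=3):
    """Spark-Streaming-style micro-batching: a result only materializes
    once a whole batch has been collected, so latency is one batch."""
    out, batch = [], []
    for x in stream:
        batch.append(x)
        if len(batch) == batch_size:
            out.append(sum(batch))
            batch = []
    return out

def streaming_running_sum(stream):
    """Flink-style per-event processing: state is updated and a result
    emitted on every single event, so latency is one event."""
    total, out = 0, []
    for x in stream:
        total += x
        out.append(total)
    return out

stream = [1, 2, 3, 4, 5, 6]
print(micro_batch_sums(stream))       # → [6, 15]
print(streaming_running_sum(stream))  # → [1, 3, 6, 10, 15, 21]
```

Flink's asynchronous distributed snapshots (the ADS item above) are what let the per-event variant keep its state fault-tolerant without pausing the stream.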

Session chairs: Andrei Veselov and Pavel Yadlouski


Tuhin Sharma

Senior Principal Data Scientist, Red Hat
Tuhin Sharma is a Senior Principal Data Scientist at Red Hat in the Corporate Development and Strategy group. Prior to that, he worked at Hypersonix as an AI Architect. He also co-founded Binaize, a website conversion intelligence product for Shopify, and served as its CEO. He received his master's...

Friday January 28, 2022 6:00pm - 6:50pm CET
Session Room 1