Sunday, May 15, 2016

Kappa - Lightweight Lambda Architecture


Lambda Architecture defines a methodology for splitting data processing into a real-time (fast) track and a batch (slow) track and then combining the results, so that data can be processed efficiently in support of near-real-time architectures. The model is well supported by the Hadoop ecosystem and by vendor platforms such as Cloudera CDH.

With the advancement of stream processing, a new set of frameworks has emerged, including Apache Flink, Google Dataflow, and Spark 2.0 (June 2016), which are pushing the boundaries of stream processing. These frameworks are bridging the gap between the fast track and the slow track of the Lambda architecture, at least for lightweight data processing at this time.

This lightweight variant of the Lambda architecture is termed the Kappa architecture. How far the gap between real-time and batch is bridged varies with the needs of the application. This blog takes a business use case, doctor ranking, that fits the Kappa model and goes in depth on the data flow.

Business Use Case

Healthcare providers (doctors, hospitals) would like to know how they are ranked in the results when a customer performs a search for a doctor or hospital for their healthcare needs.

Analysis

The provider results returned when a customer performs a search are limited using pagination. This avoids performance bottlenecks in the search engine, application, and UX layers. The business use case, however, asks for analytics reporting over the entire result set of every search performed by customers. Because that is very expensive for performance reasons, a modified version of the use case is being explored: return the rankings of providers for the top xx searches performed, based on the given criteria, at periodic intervals.

Spark Streaming 1.0

In the 1.0 version of the Spark Streaming framework, a micro-batching model is used for processing streaming data. A real-time stream of data coming from, for example, a Kafka topic is divided into a sequence of RDDs along a configurable batch interval, and the resulting RDDs are processed by the Spark framework.
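
A minimal sketch of this micro-batch model, using the Spark 1.6 direct Kafka stream, is shown below. The broker address, topic name, 10-second batch interval, and the "provider-search" filter string are illustrative assumptions, not values from the actual system.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object SearchLogMicroBatch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("search-log-micro-batch")
    // Every 10 seconds of Kafka messages becomes one RDD (a micro-batch).
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // assumed broker
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("search-logs"))                         // assumed topic

    // Per-batch processing: count the provider-search events in each micro-batch.
    stream.map(_._2)
      .filter(_.contains("provider-search"))
      .count()
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Each batch interval yields one RDD, which the rest of the job processes exactly like batch data; that symmetry is what makes the model simple, and also what limits it.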

The micro-batching model is good enough for streaming 1.0, but things have moved on since, and applications now demand features such as iterative processing, delta-only processing, richer windowing options, and triggers on events.

Apache Flink, Spark 2.0, Google DataFlow

The Apache Flink project started with stream processing as its core operation and has led the way in providing the features mentioned above that were missing in streaming 1.0. With its Structured Streaming enhancements, Spark 2.0 is ready to take streaming to the next level. Google Dataflow, whose programming model has been open-sourced, also supports these advanced features, and the Apache Beam project has been started on top of it.
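
For contrast, here is a minimal Structured Streaming sketch against the Spark 2.0 API. It uses the built-in socket source so the example stays self-contained (the real pipeline would read from Kafka), and the tab-separated log layout, host, and port are assumptions. The point of interest is the event-time window on the log's timestamp, one of the features the RDD-based micro-batch model does not express directly.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

object SearchLogStructured {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("search-log-structured").getOrCreate()
    import spark.implicits._

    // Socket source keeps the sketch self-contained; host and port are placeholders.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Assumed log layout: "<epoch-millis>\t<search-term>"
    val searches = lines.as[String].map { line =>
      val Array(ts, term) = line.split("\t", 2)
      (new java.sql.Timestamp(ts.toLong), term)
    }.toDF("eventTime", "term")

    // Event-time windowing: count searches per term in 10-minute windows.
    val counts = searches
      .groupBy(window($"eventTime", "10 minutes"), $"term")
      .count()

    counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```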

Solution Architecture

The following architecture diagram shows the solution built with the Spark 1.6 streaming framework:


Steps

  1. Customer interactions with search are logged on the search engine nodes and streamed by Logstash to a Kafka topic in real time
  2. The Spark analytics processor application consumes the messages from the Kafka topic for analysis (steps 2 through 4 are sketched in code after this list)
  3. Supplementary data needed for processing the log data is pulled from MongoDB and cached by the Spark application
  4. Provider data, ranked by location and other criteria as needed, is streamed at a specified time interval to another Kafka topic for downstream consumption
  5. Downstream applications process the streamed data and publish it to persistent storage
  6. The Tableau connector for the Hadoop data store pulls, on demand, the data business users need for their reports
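
A skeleton of steps 2 through 4, assuming Spark 1.6 streaming with the direct Kafka stream, might look like the sketch below. The topic names, broker address, 5-minute publish interval, and the loadProviderLookup / parseSearchLog helpers are hypothetical placeholders standing in for the MongoDB lookup and the log-parsing logic.

```scala
import java.util.Properties

import kafka.serializer.StringDecoder
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ProviderRankingPipeline {
  // Hypothetical stand-ins for the MongoDB lookup and the log parser.
  case class SearchLog(providerId: String, location: String)
  def loadProviderLookup(): Map[String, String] = Map.empty
  def parseSearchLog(line: String): SearchLog = SearchLog(line, "")

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("provider-ranking-pipeline")
    val ssc = new StreamingContext(conf, Minutes(5)) // assumed publish interval

    // Step 2: consume the search logs from Kafka.
    val searchLogs = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("search-logs"))

    // Step 3: supplementary provider data pulled from MongoDB once and broadcast
    // so every executor can enrich the log records without further queries.
    val providerLookup = ssc.sparkContext.broadcast(loadProviderLookup())

    searchLogs.map(_._2)
      .map(parseSearchLog)
      .map(log => (providerLookup.value.getOrElse(log.providerId, "unknown"), 1L))
      .reduceByKey(_ + _)
      .foreachRDD { rdd =>
        // Step 4: publish this interval's ranked counts to a downstream topic.
        rdd.foreachPartition { records =>
          val props = new Properties()
          props.put("bootstrap.servers", "broker1:9092")
          props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
          props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
          val producer = new KafkaProducer[String, String](props)
          records.foreach { case (provider, count) =>
            producer.send(new ProducerRecord("provider-rankings", provider, count.toString))
          }
          producer.close()
        }
      }

    ssc.start()
    ssc.awaitTermination()
  }
}
```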

 

Data Processing

Steps 3, 4, and 5 above contain the core application logic, which has the following data flow:
  1. Filter out the logs streamed from Kafka and extract metadata
  2. De-dupe logs to include only single entry per search
  3. Store search criteria in cache
  4. Pull and cache supporting data from MongoDB for processing logs
  5. Generate Top xx entries with the given criteria (example: by location)
  6. Submit searches to search engine for each of the Top xx queries and gather the provider names
  7. Generate the Top xx providers with the given criteria (example: by location); see the sketch after this list
  8. Publish at given interval Top xx providers by location to Kafka topic
  9. Continually update the cache from step 3 with incoming search logs
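
The de-dupe and ranking steps (2 and 7) reduce to standard pair-RDD operations on each micro-batch. A minimal sketch follows, assuming a hypothetical SearchEvent record already parsed from the raw log line; the field names and the top-N signature are illustrative only.

```scala
import org.apache.spark.rdd.RDD

object ProviderRanking {
  // Hypothetical parsed search-log record; field names are assumptions.
  case class SearchEvent(searchId: String, location: String, query: String, provider: String)

  // Steps 2 and 7 applied to one micro-batch of parsed events.
  def topProvidersByLocation(events: RDD[SearchEvent], topN: Int): RDD[(String, Seq[(String, Int)])] = {
    events
      // Step 2: de-dupe, keeping a single entry per search id.
      .keyBy(_.searchId)
      .reduceByKey((first, _) => first)
      .values
      // Step 7: count providers per location and keep the top N for each location.
      .map(e => ((e.location, e.provider), 1))
      .reduceByKey(_ + _)
      .map { case ((location, provider), count) => (location, (provider, count)) }
      .groupByKey()
      .mapValues(_.toSeq.sortBy(-_._2).take(topN))
  }
}
```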

