Kafka and Real-Time Data Integration

Introduction

Hadoop is evolving rapidly from batch to real-time, and that’s bringing new data integration challenges. Fortunately, there’s a new—yet proven—solution available: Apache Kafka. Originally created by LinkedIn to meet their speed and scalability requirements, Kafka is an open source project ready for the enterprise.

If you have (or are contemplating building) a tangle of 1:1 connections between sources and targets, or if you’re looking for a highly available real-time solution, Kafka may be able to help with your enterprise data integration problems.

What is Kafka?

Kafka is a distributed publish-subscribe messaging system. Messages can carry any payload in any format, including strings and JSON.

In publish-subscribe systems, systems producing messages are called publishers and systems consuming messages are called subscribers. Instead of sending messages directly to specific receivers, publishers produce their messages to a topic. Subscribers subscribe to a topic and consume messages from it. More than one publisher can produce messages to a topic, and more than one subscriber can subscribe to a topic. The system that retains messages from the publishers on behalf of the subscribers is called the broker.

(Diagram source: Oracle.com)
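To make these terms concrete, here is a minimal sketch of a publisher using Kafka’s Java producer client. The broker address, the "transactions" topic, and the message contents are placeholders I’ve invented for illustration, not details from any particular deployment.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TransactionPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker address; in practice this lists one or more Kafka brokers.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish a JSON string to the hypothetical "transactions" topic.
            // The key ("client-42") is also made up for illustration.
            producer.send(new ProducerRecord<>("transactions", "client-42",
                    "{\"amount\": 125.00, \"location\": \"NYC\"}"));
        }
    }
}
```

Any number of subscribers can then read from "transactions" without the publisher knowing or caring who they are; the broker sits in between.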

What’s unique about Kafka?

Scalability: It can scale to handle more subscribers in part because it delegates to the subscribers the job of tracking which messages have been read. Instead of attempting to track and delete messages unread by any subscriber, Kafka simply deletes messages that get too old.

Throughput and Latency: These qualities can enable stream processing of incoming data feeds. In a Cloudera blog, Gwen Shapira and Jeff Holoman say “...Kafka is designed for high throughput and low latency; even a small three-node cluster can process close to a million events per second with an average latency of 3ms.”

Because of these advantages, Kafka opens up new use cases by providing a unified platform that avoids the complexity of writing and maintaining many individual pipelines between sources and targets.
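Returning to the scalability point above: because each subscriber tracks its own read position, the subscriber side looks roughly like the sketch below, using a recent version of Kafka’s Java consumer client. The broker simply ages messages out (typically via the log.retention.hours broker setting) regardless of who has read them. Broker address, topic, and group name are again placeholders.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TransactionSubscriber {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("group.id", "example-subscriber");      // hypothetical subscriber group
        props.put("enable.auto.commit", "false");         // this subscriber commits its own offsets
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("transactions"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
                // The subscriber, not the broker, records how far it has read.
                consumer.commitSync();
            }
        }
    }
}
```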

How can Kafka help with Big Data integration?

Kafka can aggregate incoming logs and make them available to multiple services, including ingestion into Hadoop’s HDFS. Real-time stream processing applications (such as Storm) can subscribe, process the data, and publish results to new topics. This can enable real-time operational metrics.
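One hedged sketch of what that fan-out looks like from the subscriber side: in Kafka’s Java client, each downstream service uses its own consumer group (group.id), and each group independently receives the full stream. The topic and group names below are invented for illustration; the real-time processor would more likely be Storm, as noted above.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class FanOutSketch {
    // Builds a subscriber for one downstream service. Each distinct group.id
    // receives its own complete copy of every message on the topic.
    static KafkaConsumer<String, String> consumerFor(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("group.id", groupId);
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("logs")); // hypothetical topic
        return consumer;
    }

    public static void main(String[] args) {
        // One subscriber feeds HDFS ingestion, another feeds real-time processing;
        // neither affects what the other sees.
        KafkaConsumer<String, String> hdfsIngest = consumerFor("hdfs-ingest");
        KafkaConsumer<String, String> streamProcessing = consumerFor("stream-processing");
        // ... each would then poll() on its own thread.
    }
}
```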

In banking and financial services, this could be used to store transaction data for predictive analytics tasks (like loan risk assessment) while simultaneously supporting real-time fraud detection. The first step would be to deliver transaction data from the operational systems (mainframe, web, and others) to Kafka. Once in Kafka, the transaction data would be rapidly published to subscribers. One subscriber would store the data in HDFS. Another (likely Storm) would process the data in real time, filtering it and republishing it into one topic per client. With the data partitioned per client, examining the last N transactions for a client could detect suspicious activity, such as transactions from improbably distant locations on the same day.

Source: Quora
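As a rough illustration of the "republish per client" step, here is a sketch using plain Kafka producer and consumer clients rather than Storm, which the example above would more likely use. It assumes (my assumption, not the source’s) that each transaction record is keyed by a client id, and every topic, group, and broker name is a placeholder.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PerClientSplitter {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092"); // placeholder
        consumerProps.put("group.id", "per-client-splitter");     // hypothetical group
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(Collections.singletonList("transactions"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    // Assumes the record key carries the client id. A fraud-detection
                    // subscriber on the per-client topic can then examine that client's
                    // last N transactions.
                    String clientTopic = "transactions-" + record.key();
                    producer.send(new ProducerRecord<>(clientTopic, record.key(), record.value()));
                }
            }
        }
    }
}
```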

Kafka can also reduce the cost of operations. Because the data is extracted from the source only once, no matter how many subscribers consume it, the burden on the source system is reduced. This helps ensure that the source system (the transaction processing system in our example) continues to perform its critical primary function.

Who should use it?

To date, Flume has been a popular solution for ingesting log data into Hadoop. Flume has many log source formats built in, is optimized for output to HDFS and HBase, and is integrated with Hadoop’s security. In addition, Flume can apply filters to data in flight. In basic use cases, if your data is going only to Hadoop and is coming only from systems already well supported by Flume, the choice is obvious.

But many enterprise data integration use cases are more challenging. Here are some key reasons to look at Kafka for data integration:

  • One or more sources are not supported by Flume

  • The sources include mainframe data (this is common for enterprise transaction data)

  • Your data will be ingested by multiple targets

  • You need high availability

If your data is destined for Hadoop but you also want real-time stream processing, combining Kafka with Flume can be a good solution.

Coming soon

What does Kafka have to do with Veristorm and vStorm Enterprise? How can Kafka help in a cloud environment? Those articles are coming up.

Credit where credit is due

I borrowed pretty liberally from the Apache Kafka for Beginners blog on Cloudera. If you’d like more technical depth, I highly recommend it.