What is Kafka?
Apache Kafka is a distributed publish-subscribe messaging system that can handle a high volume of data and enables you to pass messages from one end-point to another. Kafka is suitable for both offline and online message consumption. Kafka messages are persisted on the disk and replicated within the cluster to prevent data loss. Kafka is built on top of the ZooKeeper synchronization service. It integrates very well with Apache Storm and Sparks for real-time streaming.
Kafka has better throughput, built-in partitioning, replication, and inherent fault-tolerance, which makes it a good fit for large-scale message processing applications.
Kafka is very fast and guarantees zero downtime and zero data loss.
What are Kafka Benefits?
Following are a few benefits of Kafka −
Reliability − Kafka is distributed, partitioned, replicated and fault tolerance.
Scalability − Kafka messaging system scales easily without downtime.
Durability − Kafka uses a Distributed commit log which means messages persists on disk as fast as possible, hence it is durable.
Performance − Kafka has high throughput for both publishing and subscribing messages. It maintains stable performance even many TB of messages are stored.
Publish-Subscribe Messaging System
In the publish-subscribe system, messages are persisted in a topic. Unlike point-to-point systems, consumers can subscribe to one or more topics and consume all the messages on that topic. In the Publish-Subscribe system, message producers are called publishers and message consumers are called subscribers. A real-life example is Dish TV, which publishes different channels like sports, movies, music, etc., and anyone can subscribe to their own set of channels and get them whenever their subscribed channels are available.
Apache Kafka architecture
Kafka stores key-value messages that come from arbitrarily many processes called producers. The data can be partitioned into different "partitions" within different "topics". Within a partition, messages are strictly ordered by their offsets (the position of a message within a partition) and indexed and stored together with a timestamp.
Other processes called "consumers" can read messages from partitions. For stream processing, Kafka offers the Streams API that allows writing Java applications that consume data from Kafka and write results back to Kafka.
Kafka runs on a cluster of one or more servers (called brokers), and the partitions of all topics are distributed across the cluster nodes. Additionally, partitions are replicated to multiple brokers. This architecture allows Kafka to deliver massive streams of messages in a fault-tolerant fashion
- The producer is an application that generates the entries or records and sends them to a Topic in Kafka Cluster.
- Producers are a source of data streams in Kafka Cluster.
- Producers are scalable. Multiple producer applications could be connected to the Kafka Cluster.
- A single producer can write the records to multiple Topics [based on configuration].
- The consumer is an application that feeds on the entries or records of a Topic in Kafka Cluster.
- Consumers are sink to data streams in Kafka Cluster.
- Consumers are scalable. Multiple consumer applications could be connected to the Kafka Cluster.
- A single consumer can subscribe to the records of multiple Topics [based on configuration].
Kafka Stream Processors
- Stream Processor is an application that enrich/transform/modify the entries or records of a Topic (sometimes write these modified records to a new Topic) in Kafka Cluster.
- Stream Processors first act as sink and then as a source in Kafka Cluster.
- Stream Processors are scalable. Multiple Stream Processing applications could be connected to the Kafka Cluster.
- A single Stream Processor can subscribe to the records of multiple Topics [based on configuration] and then write records back to multiple Topics.
- Connectors are those which allow the integration of things like Relational Databases to the Kafka Cluster and automatically monitor the changes. They also help to pull those changes onto the Kafka cluster.
- Connectors provide a single source of ground truth data. This means that we have a record of changes, a Topic has undergone.
Let's first dive into the core abstraction Kafka provides for a stream of records—the topic.
A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.
For each topic, the Kafka cluster maintains a partitioned log that looks like this:
Each partition is an ordered, immutable sequence of records that is continually appended to—a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.
The Kafka cluster durably persists all published records—whether or not they have been consumed—using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size so storing data for a long time is not a problem.
In fact, the only metadata retained on a per-consumer basis is the offset or position of that consumer in the log. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads records, but, in fact, since the position is controlled by the consumer it can consume records in any order it likes. For example a consumer can reset to an older offset to reprocess data from the past or skip ahead to the most recent record and start consuming from "now".
This combination of features means that Kafka consumers are very cheap—they can come and go without much impact on the cluster or on other consumers. For example, you can use our command line tools to "tail" the contents of any topic without changing what is consumed by any existing consumers.
The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second they act as the unit of parallelism—more on that in a bit.
Brokers, Topics and their Partitions – in Apache Kafka Architecture
Observe in the following diagram that there are three topics. Topic 0 has two partitions, Topic 1 and Topic 2 has only a single partition. Topic 0 has a replication factor of 3, Topic 1 and Topic 2 have a replication factor of 2. Zookeeper may elect any of these brokers as a leader for a particular Topic Partition. Whereas the replication factor determines how many backups you want to create for your topic in the Kafka cluster for enabling HA for your data in times of cluster losing a node due to various reasons.
The workflow of Pub-Sub Messaging
Following is the stepwise workflow of the Pub-Sub Messaging −
- Producers send messages to a topic at regular intervals.
- Kafka broker stores all messages in the partitions configured for that particular topic. It ensures the messages are equally shared between partitions. If the producer sends two messages and there are two partitions, Kafka will store one message in the first partition and the second message in the second partition.
- Consumer subscribes to a specific topic.
- Once the consumer subscribes to a topic, Kafka will provide the current offset of the topic to the consumer and also saves the offset in the Zookeeper ensemble.
- The consumer will request the Kafka in a regular interval (like 100 Ms) for new messages.
- Once Kafka receives the messages from producers, it forwards these messages to the consumers.
- The consumer will receive the message and process it.
- Once the messages are processed, the consumer will send an acknowledgment to the Kafka broker.
- Once Kafka receives an acknowledgment, it changes the offset to the new value and updates it in the Zookeeper. Since offsets are maintained in the Zookeeper, the consumer can read the next message correctly even during server outrages.
- This above flow will repeat until the consumer stops the request.
- The consumer has the option to rewind/skip to the desired offset of a topic at any time and read all the subsequent messages.