Kafka - An Overview

Written by Shuvhojit Deb (Full Stack Developer), December 27, 2021

Today we will get to know Apache Kafka.

What is Apache Kafka?

Apache Kafka is an open-source, distributed publish-subscribe messaging platform purpose-built to handle real-time streaming data. It supports distributed streaming, pipelining, and replay of data feeds for fast, scalable operations.

Kafka is a broker-based solution that operates by maintaining streams of data as records within a cluster of servers. Kafka servers can span multiple data centers and provide data persistence by storing streams of records (messages) across multiple server instances in topics. A topic stores records or messages as a series of immutable tuples, each of which consists of a key, a value, and a timestamp.


Concepts of Apache Kafka

Topics: A topic is a fairly universal concept in publish/subscribe messaging. In Apache Kafka and other messaging solutions, a topic is an addressable abstraction that an application uses to show interest in a given data stream (series of records/messages). Applications can both publish to a topic and subscribe to it.

Partitions: In Apache Kafka, topics can be subdivided into a series of ordered queues called partitions. Each partition is continually appended to, forming a sequential commit log. Every record/message is assigned a sequential ID called an offset, which identifies the record/message within its partition.
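To make the idea of an append-only partition with offsets concrete, here is a minimal sketch in plain Python. It is only a toy model, not the Kafka client API: the names `Record` and `PartitionLog` are made up for illustration, and the whole log lives in an in-memory list.

```python
import time
from typing import Any, List, NamedTuple, Optional


class Record(NamedTuple):
    """A record with the three fields described above."""
    key: Optional[str]
    value: Any
    timestamp: float


class PartitionLog:
    """Toy append-only log for one partition (hypothetical, in memory only)."""

    def __init__(self) -> None:
        self._records: List[Record] = []

    def append(self, key: Optional[str], value: Any) -> int:
        """Append a record to the end of the log and return its offset."""
        self._records.append(Record(key, value, time.time()))
        return len(self._records) - 1

    def read(self, offset: int) -> Record:
        """Look up a record by its offset within this partition."""
        return self._records[offset]


log = PartitionLog()
first = log.append("user-1", "page_view")
second = log.append("user-2", "click")
print(first, second)           # offsets grow sequentially: 0 1
print(log.read(second).value)  # click
```

Because records are only ever added at the end, the position of a record in the list can serve as its offset.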

Persistence: Apache Kafka operates by maintaining a cluster of servers that durably persist records/messages as they are published. The Kafka cluster uses a configurable retention timeout to determine how long a given record is persisted regardless of consumption. While the record/message is within the retention timeout the record/message is available for consumption. Once the record/message exceeds this retention timeout the record/message is deleted and space is freed.
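The retention rule can be sketched as follows. This is a simplified illustration under stated assumptions, not how a broker is implemented: `RetainedLog` and `retention_seconds` are hypothetical names, timestamps are passed in by hand so the example is deterministic, and records are expired one at a time, whereas a real broker removes data in larger chunks.

```python
from collections import deque
from typing import Deque, List, Tuple


class RetainedLog:
    """Toy log that drops records once they are older than the retention timeout."""

    def __init__(self, retention_seconds: float) -> None:
        self.retention_seconds = retention_seconds
        self._records: Deque[Tuple[float, str]] = deque()

    def append(self, value: str, now: float) -> None:
        self._records.append((now, value))

    def expire(self, now: float) -> int:
        """Delete records past the retention timeout; return how many were removed."""
        removed = 0
        while self._records and now - self._records[0][0] > self.retention_seconds:
            self._records.popleft()
            removed += 1
        return removed

    def values(self) -> List[str]:
        return [value for _, value in self._records]


retained = RetainedLog(retention_seconds=60)
retained.append("a", now=0)
retained.append("b", now=50)
removed = retained.expire(now=70)  # "a" is 70s old and is dropped, "b" is 20s old and stays
print(removed, retained.values())  # 1 ['b']
```

Note that nothing in `expire` asks whether a record has been consumed: only its age matters, which matches the "regardless of consumption" behavior described above.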

Topic/Partition Scaling: Because Apache Kafka operates as a cluster of servers, topics/partitions can be scaled by spreading the load for a given topic across the servers. This load sharing lets each server in the Kafka cluster handle the distribution and persistence of records/messages for some of the partitions. Data is also replicated between servers, which provides fault tolerance and high availability if a server fails. For each partition, one server is elected the partition leader and the other servers holding a copy act as followers. The partition leader handles all distribution and persistence (reads/writes) of data, and the followers provide replication for fault tolerance.
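The leader/follower arrangement can be pictured with a small toy model. Everything here is an assumption made for illustration: the class `ReplicatedPartition`, the broker names, and especially the "election", which simply promotes the next surviving server and is far simpler than what a real cluster does.

```python
from typing import Dict, List


class ReplicatedPartition:
    """Toy model of one partition replicated across several servers."""

    def __init__(self, servers: List[str]) -> None:
        self.logs: Dict[str, List[str]] = {server: [] for server in servers}
        self.leader = servers[0]

    def write(self, value: str) -> None:
        """The leader takes the write, then every follower copies it."""
        self.logs[self.leader].append(value)
        for server, log in self.logs.items():
            if server != self.leader:
                log.append(value)

    def read_all(self) -> List[str]:
        """Reads are served from the leader's copy."""
        return list(self.logs[self.leader])

    def fail(self, server: str) -> None:
        """Remove a server; if it was the leader, promote a surviving follower."""
        del self.logs[server]
        if not self.logs:
            raise RuntimeError("no replicas left for this partition")
        if server == self.leader:
            self.leader = next(iter(self.logs))  # simplistic stand-in for an election


partition = ReplicatedPartition(["broker-1", "broker-2", "broker-3"])
partition.write("a")
partition.write("b")
partition.fail("broker-1")                   # the current leader goes down
print(partition.leader, partition.read_all())  # broker-2 ['a', 'b']
```

The data survives the failure because the followers already held a full copy of the leader's log.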

Producers: In Apache Kafka, the concept of a producer is no different from that in most messaging systems. A producer of data (records/messages) defines what topic (stream of data) a given record/message should be published on. Since partitions are used to provide additional scalability, a producer can also define what partition a given record/message is published to. Producers do not have to specify a partition. When they do not, records are spread across the topic's partitions: a record with a key is placed according to a hash of its key, and records without a key are balanced across partitions in a round-robin style.
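A rough sketch of how a partition could be picked for each record is shown below. `ToyPartitioner` is a hypothetical name and not part of any Kafka client. The keyed branch reflects how Kafka's default partitioner treats records that carry a key, but it uses `zlib.crc32` as a stand-in hash (Kafka itself uses a murmur2 hash), so the partition numbers it produces will not match a real cluster.

```python
import itertools
import zlib
from typing import Optional


class ToyPartitioner:
    """Pick a partition: explicit choice first, then key hash, then round-robin."""

    def __init__(self, num_partitions: int) -> None:
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def choose(self, key: Optional[str] = None, partition: Optional[int] = None) -> int:
        if partition is not None:
            return partition  # the producer named the partition itself
        if key is not None:
            # The same key always maps to the same partition.
            return zlib.crc32(key.encode("utf-8")) % self.num_partitions
        return next(self._round_robin)  # no key: rotate through the partitions


partitioner = ToyPartitioner(num_partitions=3)
unkeyed = [partitioner.choose() for _ in range(4)]
print(unkeyed)  # [0, 1, 2, 0]
same_key = partitioner.choose(key="user-1") == partitioner.choose(key="user-1")
print(same_key)  # True
```

Hashing on the key keeps all records for one key in one partition, so they stay in order relative to each other.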

Consumers: Consumers in Kafka, like in most messaging systems, are the entities that process records/messages. Consumers can be configured to work independently on individual workloads or cooperatively with other consumers on a given workload (load balancing). Consumers manage how they process a workload based on their consumer group name. Using a consumer group name allows consumers to be distributed within a single process, across multiple processes, and even across multiple systems. Using consumer group names, consumers can either load balance (multiple consumers with the same consumer group name) record/message consumption across the consumer set, or process each record/message uniquely (multiple consumers with unique consumer group names) where every consumer subscribed to a topic/partition gets the message for processing.
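The effect of consumer group names can be sketched with a small assignment function. This is a simplified illustration: the helper `assign_partitions`, the group names `billing` and `audit`, and the consumer names are all made up, and the plain round-robin split stands in for whatever assignment strategy a real deployment uses.

```python
from typing import Dict, List


def assign_partitions(partitions: List[int], consumers: List[str]) -> Dict[str, List[int]]:
    """Spread the partitions of a topic over the consumers of ONE group."""
    assignment: Dict[str, List[int]] = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        assignment[consumers[index % len(consumers)]].append(partition)
    return assignment


# Two hypothetical groups subscribed to the same four-partition topic.
groups = {"billing": ["c1", "c2"], "audit": ["c3"]}
partitions = [0, 1, 2, 3]

# Each group gets its own full assignment, so each group sees every record.
plan = {group: assign_partitions(partitions, members) for group, members in groups.items()}
print(plan["billing"])  # {'c1': [0, 2], 'c2': [1, 3]}
print(plan["audit"])    # {'c3': [0, 1, 2, 3]}
```

Inside the `billing` group the two consumers share the work, while the single `audit` consumer, being in a different group, still receives everything.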

What is Kafka used for?

To stay competitive, businesses today rely increasingly on real-time data analysis allowing them to gain faster insights and quicker response times. Real-time insights allow businesses or organizations to make predictions about what they should stock, promote, or pull from the shelves, based on the most up-to-date information possible.

Traditionally, data has been processed and transmitted across networks in “batches”. This is down to limitations in the pipeline – the speed at which CPUs can handle the calculations involved in reading and transferring information, or at which sensors can detect data. As this interview points out, these “bottlenecks” in our ability to process data have existed since humans first began to record and exchange information in written records.

Due to its distributed nature and the streamlined way it manages incoming data, Kafka is capable of operating very quickly – large clusters can be capable of monitoring and reacting to millions of changes to a dataset every second. This means it becomes possible to start working with – and reacting to – streaming data in real-time.

Kafka was originally designed to track the behavior of visitors to large, busy websites (such as LinkedIn). By analyzing the clickstream data (how the user navigates the site and what functionality they use) of every session, a greater understanding of user behavior is achievable. This makes it possible to predict which news articles, or products for sale, a visitor might be interested in.

Since then, Kafka has become widely used, and it is an integral part of the stack at Spotify, Netflix, Uber, Goldman Sachs, PayPal, and Cloudflare, which all use it to process streaming data and understand customer, or system, behavior. In fact, according to the Kafka website, one out of five Fortune 500 businesses uses Kafka to some extent.

One particular niche where Kafka has gained dominance is the travel industry, where its streaming capability makes it ideal for tracking booking details of millions of flights, package holidays, and hotel vacancies worldwide.

Benefits of Kafka

Following are a few benefits of Kafka −

Reliability − Kafka is distributed, partitioned, replicated, and fault tolerant.

Scalability − The Kafka messaging system scales easily without downtime.

Durability − Kafka uses a distributed commit log, which means messages are persisted to disk as quickly as possible; hence it is durable.

Performance − Kafka has high throughput for both publishing and subscribing to messages. It maintains stable performance even when many terabytes of messages are stored.

Kafka is very fast and is designed to run with zero downtime and zero data loss.
