What is Apache Kafka

If you are working on distributed enterprise applications or microservices architectures, Apache Kafka is probably nothing new to you. Apache Kafka is becoming increasingly popular in message-driven architectures. In any distributed application, data is received from various sources and then needs to be processed and used accordingly. While several other options, such as traditional messaging queues, can serve this purpose, Apache Kafka takes the lead.

So, what is Apache Kafka?

Apache Kafka is a framework written in Scala and Java that provides a unified solution for distributed stream processing and data pipeline management.

Apache Kafka consists of servers and clients that work together to form a distributed event-messaging system. On the server side, Kafka runs as a cluster that can span multiple data centres, making a Kafka deployment highly scalable. Some of these servers, called brokers, form the storage layer for incoming and outgoing messages; others run Kafka Connect to continuously import and export data as event streams.

Apache Kafka clients are used to write the distributed parts of applications. Clients send and receive streams of data in parallel, and Kafka's fault-tolerant design ensures that event processing stays reliable even when individual machines fail.

Key concepts

When building any Kafka application, there are a few concepts to be aware of.

First, there is the producer. A producer is a client that writes events to Kafka. An event is a piece of data consisting of a key, a value, and a timestamp. A consumer is a Kafka application that reads this data. Producers and consumers are fully decoupled: the producer does not wait to see whether the data has been read. Instead, it keeps writing data, and Kafka provides delivery guarantees (including, optionally, exactly-once processing) so that events are handled reliably.
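To make this concrete, here is a minimal producer sketch using Kafka's Java client. It assumes a broker running on localhost:9092; the topic name user-events and the key/value contents are made up for illustration.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class EventProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Assumed broker address; adjust for your cluster.
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // An event is a key-value pair; Kafka attaches the timestamp automatically.
                // The producer does not wait for any consumer to read it.
                producer.send(new ProducerRecord<>("user-events", "user-42", "logged-in"));
            }
        }
    }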

Now, events are organised into topics. A topic is, in layman's terms, a folder for categorising events, and consumers can subscribe to these topics. Once an event has been read, it is not deleted; how long events are retained is configurable per topic, so the technical designers need to decide when data should expire. This differs from many message-queue implementations, where messages are deleted immediately after retrieval. Consumers and producers can read from and write to one or more topics.
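As a counterpart to the producer above, here is a hedged consumer sketch subscribing to the same hypothetical user-events topic; the group id demo-group is also made up for illustration.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class EventConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "demo-group"); // hypothetical consumer group
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("user-events"));
                while (true) {
                    // Reading does not delete events; they stay until retention expires.
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("key=%s value=%s offset=%d%n",
                                record.key(), record.value(), record.offset());
                    }
                }
            }
        }
    }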

Finally, topics are split into partitions, which act as buckets spread across different Kafka brokers. This is what allows a Kafka application to scale horizontally. For fault-tolerant behaviour, each topic can be replicated across multiple brokers and even data centres, so if one node goes down, the remaining replicas keep serving the data.
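Here is a brief sketch of creating a partitioned, replicated topic with the Admin API (covered in the next section); the partition count and replication factor are illustrative and assume a cluster of at least two brokers.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // 3 partitions spread the load; replication factor 2 keeps a copy
                // of each partition on a second broker for fault tolerance.
                NewTopic topic = new NewTopic("user-events", 3, (short) 2);
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }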

Kafka APIs

In addition to its command-line tools for management and administration, Kafka exposes the following core APIs:

  • Admin API to manage and inspect Kafka objects.
  • Producer API to write a stream of events to one or more Kafka topics.
  • Consumer API to read events from one or more topics and process them.
  • Kafka Streams API to implement stream processing applications and microservices (see the sketch after this list).
  • Kafka Connect API to build and run reusable data import/export connectors that consume (read) or produce (write) streams of events from and to external systems and applications, so they can integrate with Kafka.
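As a taste of the Kafka Streams API, the following minimal sketch reads the hypothetical user-events topic used earlier, upper-cases each value, and writes the result to a second made-up topic.

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class UppercaseStream {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            // Read each event, transform its value, and forward it to an output topic.
            KStream<String, String> source = builder.stream("user-events");
            source.mapValues(value -> value.toUpperCase())
                  .to("user-events-uppercase");

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }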

Why should you be using Apache Kafka?

Whether you’re a developer, a tester, or a technical product owner, I am sure you have heard about Apache Kafka. However, implementing Apache Kafka requires expertise. If you are not sure whether you should invest time in Apache Kafka, we’ve listed the most common use cases below. If your application has any use cases falling under these categories, then we recommend you go forward with Apache Kafka.

  • Event-driven use cases

Whenever users or systems emit one or more events that need to be processed in real time, Kafka is a perfect fit.

  • Messaging Use Cases

The perfect example is a microservices architecture. If multiple nodes distributed across the globe have to be connected, consider implementing Apache Kafka.

  • Log collection

Imagine an application running on multiple servers, each generating logs that need to be collected and processed together. Apache Kafka is well suited to this, as consumers and producers can read, write, and process logs in real time.

  • Stream processing

If there is a nonstop, high-frequency input of data, every packet needs to be logged and processed reliably. As mentioned above, Apache Kafka offers high throughput and reliability over TCP, making sure messages are not lost.
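For reliability-sensitive pipelines like this, the producer can be tuned for durability. Below is a hedged sketch, again assuming a local broker; the sensor-readings topic and its contents are made up for illustration.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ReliableProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("acks", "all");                // every in-sync replica must acknowledge
            props.put("enable.idempotence", "true"); // retries cannot create duplicate events

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("sensor-readings", "sensor-1", "42.0"));
            }
        }
    }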

How companies are using Apache Kafka

See below how companies are leveraging the capabilities of Apache Kafka, and check whether their use cases are similar to yours.

  • LinkedIn

LinkedIn, where Apache Kafka was born, uses it for stream processing of data.

  • Uber

Uber uses Apache Kafka to process real-time data.

  • Netflix

Netflix, the popular on-demand video streaming service, uses Apache Kafka to handle its streaming use cases.

  • Adidas

Adidas uses it for real-time analytics and reporting solutions.

  • Cloudflare

Cloudflare is one of the heaviest Apache Kafka users. It uses Kafka for log processing and event aggregation, combining logs from thousands of sources.

  • Expedia

If you have an event-driven architecture, you are not alone: Expedia leverages Apache Kafka for exactly this purpose.