Summary

Overview

This course session provides a hands-on demonstration of monitoring a Kafka-based real-time data streaming system with Prometheus and Grafana. The instructor walks through the architecture of a streaming pipeline involving Kafka topics, producers (triggered via serverless Lambda functions), and ksqlDB streams, then transitions into setting up local monitoring: installing and configuring Prometheus to scrape metrics from a Confluent Kafka cluster and visualizing them in Grafana. The session emphasizes practical setup steps, scrape-job configuration, and the interpretation of real-time stream metrics such as message throughput, partition usage, and storage retention.

Topic (Timeline)

1. Kafka Streaming Architecture and Metrics Overview [00:00:00 - 00:05:04]

The session begins by introducing a real-time data pipeline in which user interactions (e.g., game plays) trigger asynchronous Lambda functions that produce messages to a Kafka topic named “user name.” Each producer invocation sends a single message per event, carrying the user ID, score, and level. The instructor details the topic structure: number of partitions, bytes in/out, messages in/out, and retention settings (a 1-hour time limit with unlimited size). Two ksqlDB queries are introduced: “stats per user” (top score per user) and “summary stats” (aggregate metrics across all players); each writes its output to a separate Kafka topic that acts as a stream. Internal Kafka topics used for offset management are also noted. The instructor highlights the key monitoring concerns: producer and consumer throughput, storage usage, and the need for external visualization.
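To make the event shape concrete, here is a minimal sketch of producing one such message from a terminal with the Confluent CLI. The topic name and JSON field names are assumptions (the topic is quoted as “user name” in the recording), and the session's actual producer is a Lambda function, not the CLI.

```sh
# Hypothetical test event mirroring the Lambda producer's payload.
# Topic name and field names are illustrative, not taken from the session.
echo '{"user_id": "u123", "score": 4200, "level": 7}' | \
  confluent kafka topic produce user-game-events
```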

2. Introduction to Prometheus and Grafana for Monitoring [00:05:04 - 00:09:55]

The instructor explains the roles of Prometheus (an open-source monitoring system that scrapes and stores time-series metrics) and Grafana (an open-source dashboarding tool) in monitoring Kafka clusters. The goal is to collect metrics from the Kafka cluster (including Kafka Connect components) and display them in a custom Grafana dashboard. The session clarifies that the setup runs locally on a VM rather than in the cloud, and that metrics are scraped directly from the cluster's endpoints. The instructor notes the availability of sample dashboards and encourages learners to replicate the setup.

3. Installing and Configuring Prometheus on Local VM [00:09:55 - 00:22:25]

The instructor guides learners through the step-by-step installation of Prometheus on a local VM:

  • Navigating to a designated folder (e.g., “student folder”) and opening a terminal.
  • Downloading the Prometheus release archive and extracting it.
  • Moving the Prometheus executable to /usr/local/bin for system-wide access.
  • Editing the prometheus.yml configuration file to define scrape jobs.
  • Adding a new job configuration to scrape metrics from the Confluent Kafka cluster, using provided credentials (username/password).
  • Ensuring correct YAML formatting to avoid configuration errors.
  • Copying the updated prometheus.yml file to the /etc/prometheus/ directory to apply the configuration.

The session ends mid-configuration, leaving the move of the file into /etc/prometheus/ as the final step; the commands below sketch the full sequence.
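Assuming a standard Prometheus release downloaded from GitHub; the version shown is illustrative rather than the exact one used in the session.

```sh
# Download and unpack a Prometheus release (version is illustrative).
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xzf prometheus-2.45.0.linux-amd64.tar.gz
cd prometheus-2.45.0.linux-amd64

# Make the binaries available system-wide.
sudo mv prometheus promtool /usr/local/bin/

# After editing prometheus.yml (see the scrape-job sketch in the Appendix),
# place it at the system path used in the session.
sudo mkdir -p /etc/prometheus
sudo cp prometheus.yml /etc/prometheus/

# Start Prometheus against the installed configuration.
prometheus --config.file=/etc/prometheus/prometheus.yml
```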

Appendix

Tools Used

  • Kafka: For real-time message streaming with topics and partitions.
  • ksqlDB: For stream processing (stats per user, summary stats).
  • Prometheus: For scraping and storing time-series metrics from Kafka endpoints.
  • Grafana: For visualizing scraped metrics via dashboards.
  • Lambda (serverless): For triggering message production on user events.

Key Configuration Steps

  • Set Kafka topic retention: 1-hour time limit, unlimited size.
  • Configure Prometheus scrape_configs in prometheus.yml to include the Kafka cluster's metrics endpoint (a hedged example follows this list).
  • Use basic auth credentials (username/password) for secure scraping.
  • Place prometheus.yml in /etc/prometheus/ for system-level execution.
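A minimal sketch of such a scrape job, modeled on Confluent Cloud's documented Metrics API export endpoint; the cluster ID, API key, and secret are placeholders, and the exact job shown in the session may differ. It is written here as a standalone file for brevity, whereas the session adds the job to an existing prometheus.yml.

```sh
# Write a minimal prometheus.yml with one Confluent Cloud scrape job.
# All <...> values and the lkc-xxxxx cluster ID are placeholders.
cat > prometheus.yml <<'EOF'
scrape_configs:
  - job_name: confluent-cloud
    scrape_interval: 1m
    scrape_timeout: 1m
    honor_timestamps: true
    metrics_path: /v2/metrics/cloud/export
    scheme: https
    basic_auth:
      username: <CLOUD_API_KEY>
      password: <CLOUD_API_SECRET>
    params:
      "resource.kafka.id": [lkc-xxxxx]
    static_configs:
      - targets: [api.telemetry.confluent.cloud]
EOF
```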

Common Pitfalls

  • YAML formatting errors in prometheus.yml (e.g., incorrect indentation) that prevent the service from starting (a validation command follows this list).
  • Missing or incorrect endpoint URLs or credentials in the scrape job.
  • Forgetting to move the Prometheus binary into /usr/local/bin, so the prometheus command is not found on the PATH.
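One way to catch the YAML errors listed above is promtool, which ships with Prometheus and validates the configuration before the service is (re)started:

```sh
# Validate prometheus.yml before starting Prometheus; reports YAML and
# schema problems such as bad indentation or unknown fields.
promtool check config /etc/prometheus/prometheus.yml
```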

Practice Suggestions

  • Replicate the Prometheus installation and configuration on a local VM.
  • Add additional Kafka metrics (e.g., broker lag, consumer group offsets) to the scrape job.
  • Connect Prometheus to Grafana and build a custom dashboard visualizing topic throughput, producer count, and storage usage (the query sketch below can confirm that metrics are flowing before panels are built).
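To confirm metrics are flowing, the local Prometheus HTTP API can be queried directly. The metric name below follows Confluent Cloud Metrics API naming conventions and is an assumption; check your endpoint's /export output for the names it actually serves.

```sh
# Ask the local Prometheus server for per-topic inbound bytes.
# Metric name assumed from Confluent Cloud Metrics API conventions.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (topic) (confluent_kafka_server_received_bytes)'
```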