Course recordings on the DaDesktop for Training platform
Visit the NobleProg website for related courses
Course outline: Kafka for Administrators (Course code: kafkaadmin)
Categories: Apache Kafka
Summary
Overview
This course session provides a practical guide to monitoring Apache Kafka clusters using open-source tools (primarily Prometheus and Grafana), with a focus on metric collection, dashboard creation, alerting, and the limitations of managed Kafka services such as Confluent Cloud. The instructor demonstrates how to access and interpret Kafka metrics, explains the role of JMX and the Prometheus Java agent for custom metric extraction, and discusses replication strategies using MirrorMaker. The session also addresses audit logging constraints in managed cloud environments and the need for self-hosted Kafka to gain full observability.
Topic (Timeline)
1. Kafka Metric Exploration and Dashboard Access [00:00:01 - 00:02:30]
- Introduced the Metrics Browser interface for exploring Kafka metrics, using “partition count” as a sample metric.
- Noted that metrics only appear after operational activity (e.g., producing/consuming messages), with a 1–2 minute delay.
- Emphasized the need to perform actions on topics to trigger data visibility in the monitoring interface (a small producer sketch follows this list).
- Confirmed participants could view live data and encouraged reporting issues during hands-on exploration.
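As a minimal sketch of that kind of activity, the snippet below uses the kafka-python client to produce a batch of test messages; the broker address and topic name are placeholder assumptions, not values from the session.

```python
# Minimal sketch: generate topic activity so broker metrics start appearing.
# Assumptions: a broker reachable at localhost:9092 and a topic named
# "demo-topic" (both placeholders; adjust to your lab environment).
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Send a handful of messages; partition and throughput metrics should show
# up in the Metrics Browser after the 1-2 minute collection delay.
for i in range(100):
    producer.send("demo-topic", value=f"test message {i}".encode("utf-8"))

producer.flush()   # ensure everything is actually written to the broker
producer.close()
```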
2. Local vs. Cloud Dashboard Hosting [00:03:04 - 00:04:12]
- Clarified that dashboards visible in the instructor’s local environment are not accessible remotely.
- Stressed that for shared or remote access, dashboards must be hosted in a cloud environment (e.g., Confluent Cloud or self-hosted cloud instance).
- Highlighted that local Grafana instances are isolated and cannot be accessed by others without cloud deployment.
3. Prometheus and Grafana Architecture Best Practices [00:04:15 - 00:05:44]
- Recommended deploying Prometheus and Grafana on separate servers for resilience, so that the failure of one host does not take down both metric storage (Prometheus) and the dashboards (Grafana) at once (a quick health-check sketch follows this list).
- Explained Grafana’s popularity as an open-source visualization tool due to its community-built dashboards and flexibility.
- Mentioned historical use cases (e.g., tracking order volume) and noted that out-of-the-box dashboards are often insufficient for deep analysis.
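As a small illustration of the two components running as separate services, the sketch below checks each service's standard health endpoint; the hostnames are assumptions, and the ports shown are the usual defaults (9090 for Prometheus, 3000 for Grafana).

```python
# Minimal sketch: verify Prometheus and Grafana are reachable as separate
# services. Hostnames are assumptions; swap in your own servers.
import requests

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # assumed host
GRAFANA_URL = "http://grafana.example.internal:3000"        # assumed host

# Prometheus exposes a plain-text liveness endpoint.
prom = requests.get(f"{PROMETHEUS_URL}/-/healthy", timeout=5)
print("Prometheus:", prom.status_code, prom.text.strip())

# Grafana exposes a JSON health endpoint (includes database status).
graf = requests.get(f"{GRAFANA_URL}/api/health", timeout=5)
print("Grafana:", graf.status_code, graf.json())
```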
4. Limitations of Managed Kafka Services (Confluent) [00:05:48 - 00:07:07]
- Noted that Confluent’s default dashboards offer limited metrics (e.g., throughput, production rates) and lack granular or custom metrics.
- Clarified that audit logs are not exposed in managed Confluent environments; direct access requires self-hosting Kafka.
- Distinguished between two deployment models: using Confluent Cloud (limited access) vs. self-hosting Kafka on-premises or in private cloud (full control).
5. JMX and Custom Metric Collection with Prometheus [00:07:23 - 00:09:02]
- Explained that Kafka, being Java-based, emits metrics via JMX (Java Management Extensions).
- Described the use of the JMX Prometheus Java Agent to scrape JMX metrics and expose them to Prometheus (a quick check of the agent's endpoint follows this list).
- Stressed that custom rules must be written to define which metrics to capture (e.g., topic creation, configuration changes), as there is no automatic out-of-the-box solution.
- Indicated this will be demonstrated in the next session.
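As a quick sanity check ahead of that demonstration, the sketch below scrapes the agent's HTTP endpoint and filters for broker metrics; the port 7071 is an assumption borrowed from common exporter examples, and the actual metric names depend on the rules configured for the agent.

```python
# Minimal sketch: list which Kafka metrics the JMX Prometheus Java Agent is
# exposing. Assumes the agent was attached to the broker JVM and is listening
# on port 7071 (a common choice in examples, not a fixed default).
import requests

resp = requests.get("http://localhost:7071/metrics", timeout=5)
resp.raise_for_status()

# The exporter returns plain-text Prometheus exposition format; keep only
# lines that look like broker-level Kafka metrics.
kafka_lines = [
    line for line in resp.text.splitlines()
    if line.startswith("kafka_") and not line.startswith("#")
]

for line in kafka_lines[:20]:   # print a small sample
    print(line)
print(f"... {len(kafka_lines)} kafka_* metric samples exposed")
```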
6. Replication: In-Cluster vs. Cross-Cluster with MirrorMaker [00:09:13 - 00:10:50]
- Clarified that intra-cluster replication (between brokers) is handled automatically by Kafka’s leader-follower mechanism.
- Introduced MirrorMaker as the tool for cross-cluster replication (e.g., between data centers or cloud regions).
- Emphasized that MirrorMaker acts as a bridge to copy data between separate Kafka clusters, not within a single cluster (a verification sketch follows this list).
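One way to sanity-check such a setup: with MirrorMaker 2's default replication policy, mirrored topics appear on the target cluster prefixed with the source cluster alias, so listing topics on the target is a quick check. The bootstrap address and the "primary" alias below are placeholder assumptions.

```python
# Minimal sketch: confirm that topics replicated by MirrorMaker 2 have
# appeared on the target cluster. The bootstrap address and the "primary"
# source-cluster alias are assumptions for illustration.
from kafka.admin import KafkaAdminClient

admin = KafkaAdminClient(bootstrap_servers="localhost:9192")  # target cluster

topics = admin.list_topics()
# With the default replication policy, MirrorMaker 2 prefixes mirrored
# topics with the source cluster alias, e.g. "primary.orders".
mirrored = [t for t in topics if t.startswith("primary.")]

print("Mirrored topics on target cluster:", mirrored)
admin.close()
```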
7. Audit Logging: Cluster vs. User-Level Activity [00:11:34 - 00:14:40]
- Distinguished between cluster-level audit logs (e.g., topic modifications, broker events) and user-level audit logs (e.g., who performed an action).
- Noted that Confluent Cloud typically provides cluster-level logs but rarely exposes user-level activity logs unless on a premium plan.
- Advised that to obtain user audit logs, users must either request them from Confluent (if supported by plan) or self-host Kafka to enable full audit logging.
8. Alerting with Grafana and Metric-Based Rules [00:14:42 - 00:16:51]
- Demonstrated how to configure alerts in Grafana using custom queries against Prometheus data sources.
- Explained setting thresholds (e.g., “if partition count exceeds X”) and configuring alert actions (the underlying threshold logic is sketched after this list).
- Highlighted Grafana’s built-in email alerting (delivered through a configured SMTP server) and support for mobile notifications via integrated notification channels.
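The alert rules themselves are defined in Grafana, but the underlying logic is simply a query plus a threshold. The sketch below reproduces that logic by hand against Prometheus's HTTP query API; the Prometheus address, metric name, and threshold are illustrative assumptions.

```python
# Minimal sketch of the logic behind a threshold alert: query Prometheus for
# a metric and compare it to a limit. The metric name and threshold are
# illustrative assumptions; a Grafana alert rule evaluates the same kind of
# condition on a schedule.
import requests

PROMETHEUS_URL = "http://localhost:9090"                     # assumed address
QUERY = "kafka_server_replicamanager_partitioncount"         # assumed metric
THRESHOLD = 1000                                              # assumed limit

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={"query": QUERY},
    timeout=5,
)
result = resp.json()["data"]["result"]

for series in result:
    # Each result carries labels plus a [timestamp, value] pair.
    value = float(series["value"][1])
    broker = series["metric"].get("instance", "unknown")
    if value > THRESHOLD:
        print(f"ALERT: {broker} partition count {value} exceeds {THRESHOLD}")
    else:
        print(f"OK: {broker} partition count {value}")
```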
9. Performance Monitoring Metrics and Tuning [00:17:01 - 00:18:26]
- Listed key Kafka performance metrics: batch size, average batch size, throughput, partition count.
- Explained how these metrics are used during performance testing to tune producer/consumer configurations (a producer-tuning sketch follows this list).
- Noted that while high-level metrics are visible in Confluent dashboards, detailed metrics require self-hosted Prometheus/Grafana setups.
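As an illustration of the tuning side, the sketch below configures a producer with explicit batching settings; the values and topic name are arbitrary assumptions showing where batch size and linger time are set, not recommendations from the session.

```python
# Minimal sketch: producer-side settings that drive the batch-size and
# throughput metrics discussed above. Values are illustrative only; real
# tuning should be based on measured performance tests.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    batch_size=32_768,        # max bytes per batch before it is sent
    linger_ms=20,             # wait up to 20 ms to fill a batch
    compression_type="gzip",  # trade CPU for network/disk throughput
    acks="all",               # wait for all in-sync replicas to acknowledge
)

for i in range(10_000):
    producer.send("perf-test", value=f"payload-{i}".encode("utf-8"))

producer.flush()
producer.close()
```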
10. User Management, Access, and Cost Considerations [00:18:29 - 00:20:09]
- Mentioned that user management features (e.g., read-only dashboard access) may be available but often incur additional costs.
- Acknowledged uncertainty around Confluent’s exact access controls but suggested requesting read-only dashboard access as a baseline.
- Concluded by confirming participants had access to the tools and invited final questions.
11. Session Wrap-up and Closing [00:20:17 - 00:24:58]
- Repeated acknowledgments and thank-yous from participants and instructor.
- No further technical content; session concluded with closing remarks.
Appendix
Key Principles
- Self-hosted Kafka is required for full observability: audit logs, custom metrics, and granular alerting are not available in managed services like Confluent Cloud unless on premium tiers.
- JMX + Prometheus Java Agent is the standard method to extract Kafka metrics for monitoring systems.
- Grafana is the preferred visualization layer due to its flexibility, community dashboards, and alerting capabilities.
Tools Used
- Apache Kafka – Distributed streaming platform.
- Prometheus – Time-series monitoring and alerting system.
- Grafana – Visualization and dashboarding platform.
- JMX Prometheus Java Agent – Bridge to expose Kafka JMX metrics to Prometheus.
- MirrorMaker – Tool for cross-cluster data replication.
Common Pitfalls
- Assuming managed Kafka services provide full audit or performance metrics.
- Running Prometheus and Grafana on the same server, risking data loss if the host fails.
- Not writing custom metric rules in Prometheus, leading to incomplete monitoring coverage.
- Confusing intra-cluster replication (Kafka-native) with cross-cluster replication (MirrorMaker).
Practice Suggestions
- Set up a local Kafka cluster with Prometheus and Grafana to practice metric collection.
- Write custom Prometheus rules to capture broker-level metrics (e.g., under-replicated partitions, request latency).
- Configure Grafana alerts for critical thresholds (e.g., high consumer lag, low disk space).
- Compare Confluent Cloud dashboards with self-hosted Grafana dashboards to identify missing metrics.
- Experiment with MirrorMaker to replicate topics between two local Kafka clusters.