Course recording on the DaDesktop training platform
Course outline: Getting Started with Apache Superset (course code: superset)
Categories: Apache Superset
Summary
Overview
This course provides a comprehensive, hands-on introduction to Apache Superset, a data visualization and business intelligence platform. Participants learn to connect to SQL databases (specifically PostgreSQL), create physical and virtual datasets, define custom metrics, build interactive charts, and assemble dashboards for analytical reporting. The session emphasizes practical workflows for data exploration, fraud analysis, and dashboard creation, with a focus on understanding the distinction between physical datasets (mapping to database tables) and virtual datasets (derived from SQL queries). Key concepts include templating with Jinja, data filtering, annotation layers, and performance considerations. The training concludes with a guided practical exercise on analyzing transaction fraud patterns using real-world data.
Topics (Timeline)
1. Environment Setup & Virtual Desktop Navigation [00:00:00 - 00:04:25]
Participants are guided through connecting to a remote virtual machine (AVM) via a web interface. The trainer explains how to access a dual-screen environment (local browser + AVM), launch Firefox within the AVM, and execute initial commands. Common issues such as incorrect browser usage (local vs. AVM) and connection errors are addressed. Participants confirm successful connection and observe the virtual desktop layout, including terminal access and the “play” button to activate the AVM session.
2. Initial Command Execution & Docker Build Process [00:04:25 - 00:06:39]
The trainer initiates the launch of Superset via terminal commands (cd superset and related scripts). Participants experience delays due to Docker building the environment for the first time (3–4 minutes). The trainer explains that subsequent startups will be faster. A shared screen is used to demonstrate the interface, and participants are encouraged to follow along. The session includes troubleshooting for command failures and re-execution of steps.
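The launch scripts on the training VM are environment-specific, but the first-run delay matches the standard upstream Docker workflow. A minimal sketch (the repository URL is real; the compose file name varies across Superset versions):

    # Fetch Superset and start it with Docker Compose; the first run
    # builds/pulls the images, which explains the 3-4 minute wait,
    # while later startups reuse the cached images.
    git clone https://github.com/apache/superset
    cd superset
    docker compose -f docker-compose-non-dev.yml up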
3. Superset Interface Overview & Core Navigation [00:06:39 - 00:13:38]
Participants are introduced to the Superset web interface. Key components are explained: the persistent top navigation bar, dashboard view, datasets, SQL Lab, settings, and user management. The trainer demonstrates how to navigate between sections, emphasizing that the top bar remains fixed across all views. The default sample dashboards and datasets are shown, and the absence of folder hierarchies is noted, with tagging and search as alternatives for organization.
4. Database Connection & Dataset Creation [00:13:38 - 00:26:23]
The process of connecting to a PostgreSQL database is demonstrated. Participants learn to add a new database connection via “Data > Connect Database,” specifying connection parameters (host, port, username, password). The distinction between “Database” (connection) and “Dataset” (exposed table) is clarified. A participant encounters an error due to an un-saved connection, which is resolved by editing and re-saving the connection. The trainer then guides participants to create a dataset from the “Transaction Full” table, selecting the correct schema (public) and table.
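Behind the connection form, Superset stores the parameters as a SQLAlchemy URI. A sketch with placeholder values (the host, credentials, and database name here are assumptions, not the training environment's actual values):

    postgresql://superset_user:secret_password@db-host:5432/transactions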
5. Dataset Configuration & Column Properties [00:26:23 - 00:31:08]
Participants learn to edit dataset properties: renaming columns, defining labels, setting data types (temporal, categorical), enabling filters and dimensions, and marking columns for deletion. The concept of “metrics” is introduced as numerical aggregations (e.g., count, sum) that can be reused across visualizations. The trainer shows how to create custom metrics (e.g., count_star renamed to “Number of Transactions”) and how to use “calculated columns” for transformations. Certification tags (e.g., “Certified by Finance”) are introduced for data governance.
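A metric is a SQL aggregate expression stored with the dataset, while a calculated column is a row-level expression. A minimal sketch (the relabeled count comes from the session; the boolean-to-integer column is a hypothetical illustration):

    -- Metric "Number of Transactions": the default count_star, relabeled
    COUNT(*)

    -- Calculated column "is_fraud_num" (hypothetical): turns the boolean
    -- flag into 0/1 so it can be summed later
    CASE WHEN is_fraud THEN 1 ELSE 0 END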
6. Chart Creation & Visualization Fundamentals [00:31:08 - 00:34:49]
Using the created dataset, participants build their first chart: a bar chart showing transaction count by country. The trainer demonstrates selecting metrics, dimensions, and chart types. Interactive features are shown: clicking legend items to hide categories, applying filters (e.g., selecting France, Germany, UK), and adjusting axis labels and legends. The chart is saved and added to a new dashboard.
7. Dashboard Assembly & Layout Management [00:34:49 - 00:43:11]
Participants learn to create a dashboard via “+ Dashboard,” then add the saved chart using drag-and-drop. The grid-based layout system is explained, allowing resizing and repositioning of visualizations. The trainer shows how to edit dashboard elements, adjust chart titles, and modify metrics directly on the dashboard. The concept of shared components is introduced: modifying a chart affects all dashboards using it, with a warning alert.
8. SQL Lab for Data Exploration & Virtual Datasets [00:43:11 - 01:02:31]
The SQL Lab interface is introduced as a tool for direct SQL querying without first creating a dataset. Participants write SQL queries (e.g., SELECT country, MAX(amount) FROM transaction_full GROUP BY country). The trainer demonstrates saving queries as “virtual datasets,” which appear in blue (vs. black for physical datasets). Virtual datasets allow combining multiple tables via joins and are managed within Superset, not the underlying database. The trainer explains that virtual datasets support Jinja templating (covered later) and can be reused in charts.
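Since a virtual dataset is simply a saved query, a join across tables can be exposed to charts without touching the database. A sketch (the users table and its columns are illustrative assumptions, echoing the practice suggestion in the appendix):

    -- Saved from SQL Lab as a virtual dataset; Superset stores the query
    -- text, not a new table in PostgreSQL
    SELECT t.transaction_id,
           t.amount,
           t.country,
           u.segment
    FROM transaction_full AS t
    JOIN users AS u ON u.user_id = t.user_id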
9. Dataset Types: Physical vs. Virtual & Best Practices [01:02:31 - 01:14:50]
The distinction between physical datasets (1:1 with database tables) and virtual datasets (SQL-derived, reusable) is reinforced. Best practices are outlined: avoid heavy aggregations or complex joins in virtual datasets; use them for simple joins or filtering. The trainer explains that grouping and ordering should be handled by the chart layer, not the virtual dataset, to optimize performance. Examples include creating calculated fields (e.g., age brackets), adding clickable URLs via HTML formatting, and using Jinja templating for dynamic filtering.
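The age-bracket example translates into a CASE expression stored as a calculated column. A sketch (the age column and the bracket boundaries are assumptions):

    -- Calculated column "age_bracket": evaluated per row, then usable
    -- as a dimension or filter in any chart on this dataset
    CASE
        WHEN age < 25 THEN '18-24'
        WHEN age < 40 THEN '25-39'
        WHEN age < 60 THEN '40-59'
        ELSE '60+'
    END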
10. Jinja Templating & Dynamic Dashboards [01:14:50 - 01:17:02]
The trainer introduces Jinja templating as a way to inject dynamic parameters into SQL queries (e.g., WHERE country = '{{ country }}'). This allows personalized dashboards: users accessing via a URL with ?country=DE see only German data. The trainer notes this is a powerful feature for multi-tenant or role-based views but is not essential for basic use. The session notes that virtual datasets are the only way to use Jinja in Superset.
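A minimal sketch of the pattern, using Superset's url_param Jinja helper (the fallback value 'FR' is an assumption):

    -- Virtual dataset query; the {{ ... }} block is rendered by Superset
    -- before the SQL reaches PostgreSQL, so ?country=DE in the dashboard
    -- URL yields WHERE country = 'DE'
    SELECT *
    FROM transaction_full
    WHERE country = '{{ url_param('country', 'FR') }}'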
11. Data Type Configuration & Metric Formatting [01:17:02 - 01:32:10]
Participants configure column properties in the dataset: marking “Transaction Hour” as temporal, verifying data types (string, boolean, double). The trainer clarifies that data types cannot be changed, but usage (filter, dimension, temporal) can. Formatting of metrics is demonstrated: using “Format” options like “.2f” for two decimal places or “.2%” for percentage display. The difference between column-level formatting and metric-level formatting is explained.
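These format strings follow d3-format syntax; a few illustrative pairings (sample input values chosen arbitrarily):

    .2f     1234.5  ->  1234.50    (fixed point, two decimals)
    ,.2f    1234.5  ->  1,234.50   (adds a thousands separator)
    .2%     0.1234  ->  12.34%     (multiplies by 100, appends %)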
12. Fraud Analysis Metrics & Practical Exercise Setup [01:32:10 - 01:37:55]
The trainer introduces the fraud analysis use case: a dataset with a boolean is_fraud flag. Participants are guided to create two custom metrics:
- Fraud Rate: SUM(is_fraud) / COUNT(*)
- Average Fraud Amount: AVG(amount) WHERE is_fraud = 1
These metrics are defined in the dataset’s “Metrics” section using SQL expressions. The trainer emphasizes that these metrics are context-aware — they recalculate based on filters applied in charts. The practical exercise (TP1) is introduced: build a dashboard analyzing transaction fraud patterns using time, geography, and these metrics. Participants are given time to begin the exercise.
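The expressions above are the session's shorthand; with a boolean is_fraud column, PostgreSQL needs a CASE (and a float cast to avoid integer division) for them to be valid metric SQL. A sketch:

    -- Metric "Fraud Rate": share of fraudulent transactions; the CASE
    -- maps the boolean to 0/1, and ::float prevents integer division
    SUM(CASE WHEN is_fraud THEN 1 ELSE 0 END)::float / COUNT(*)

    -- Metric "Average Fraud Amount": AVG skips NULLs, so non-fraud rows
    -- simply drop out of the average
    AVG(CASE WHEN is_fraud THEN amount END)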
Appendix
Key Principles
- Physical Dataset: Direct mapping to a single database table. Only one per table.
- Virtual Dataset: SQL query result, can combine multiple tables. Multiple per source.
- Metrics: Numerical aggregations (count, sum, avg) defined at the dataset level, reusable across charts.
- Jinja Templating: Enables dynamic SQL via URL parameters (e.g., ?country=FR); only usable in virtual datasets.
- Performance: Avoid heavy SQL logic (GROUP BY, JOINs) in virtual datasets; defer aggregation to the chart layer.
Tools Used
- Apache Superset: Data visualization and dashboarding platform.
- PostgreSQL: SQL database used for transaction data.
- Docker: Containerization tool for environment setup.
- Jinja2: Templating engine for dynamic SQL queries.
- Google Sheets: Optional data source (mentioned for future use).
Common Pitfalls
- Confusing local browser with AVM browser — must use Firefox inside AVM.
- Forgetting to save database connection before creating dataset.
- Attempting to rename physical datasets — only the underlying table name can be changed.
- Using complex aggregations in virtual datasets, causing slow performance.
- Misunderstanding metric vs. label: metrics are calculated values; labels are display names.
- Not marking “Transaction Hour” as temporal — prevents time-based visualizations.
Practice Suggestions
- Recreate the fraud analysis dashboard independently.
- Create a virtual dataset joining two tables (e.g., transactions + users) and visualize it.
- Use Jinja templating to build a dashboard that filters by user role via URL.
- Add annotation layers using a Google Sheet to mark events (e.g., “System Update: 2024-03-01”).
- Experiment with different chart types (line, pie, heatmap) using the same dataset.