Login Register

Getting started with Apache Superset - romain-w5vd-20241122-212409

← Back to Recording

Summary

Overview

This course session is an advanced, hands-on training on data visualization and dashboard creation using Apache Superset, focused on geospatial analysis, data transformation via SQL, and fine-grained control over filter interactions. Participants worked through real-world challenges involving department-level aggregation from address data, handling overseas territories (DOM-TOM), creating virtual datasets, managing cross-filtering behavior, and customizing map visualizations. The session emphasized practical SQL techniques (e.g., regex extraction, CASE statements), data modeling, and dashboard design principles, with troubleshooting of permission issues and performance optimization.

Topic (Timeline)

1. Filter Interactions and Cross-Filtering in Superset [00:00:02.980 - 00:08:54.680]

Discussed limitations of cross-filtering when visualizations rely on different datasets (e.g., transaction vs. virtual datasets).
Explained that filters apply at the dataset level; if a filter is applied to one dataset, it may not propagate to others unless they share a common dimension or are joined.
Demonstrated the use of “scoping” to control which visualizations interact via cross-filters (e.g., allowing only two charts to filter each other).
Highlighted that cross-filtering fails when the filtering field is not a dimension (e.g., axis values in bar charts) but works with categorical dimensions (e.g., country names in pie charts).
Noted that field name case sensitivity (e.g., “country” vs. “Country”) can break filter propagation.
Emphasized that virtual datasets inherit filter behavior from their source tables via joins.

2. Geospatial Data Processing and Department Extraction [00:08:58.900 - 00:29:36.640]

Addressed the challenge of extracting French departments from address fields containing postal codes (5-digit format).
Demonstrated SQL regex pattern matching using substring and regexp to extract 5-digit postal codes from unstructured address strings.
Showed how to use left() and concat() to prepend “FR-” to the first two digits of the postal code to form standardized department codes (e.g., “FR-75”).
Created a virtual dataset using a SQL query to generate a new “department” column from address data, enabling aggregation by region.
Discussed the importance of using count(*) instead of count(distinct *) when no duplicates exist, and clarified the difference between dimensions (grouping fields) and metrics (aggregated values).

3. Handling Overseas Territories (DOM-TOM) and Data Mapping [00:29:39.520 - 00:57:38.000]

Identified that standard 2-digit department extraction fails for overseas territories (e.g., Martinique: 972, Réunion: 974), which require 3-digit codes.
Proposed a solution using a CASE statement to detect 3-digit INSEE codes starting with “97” and extract them as full department codes (e.g., “FR-972”).
Noted that Superset’s built-in GeoJSON map layer does not natively support DOM-TOM codes, resulting in blank regions despite valid data.
Confirmed required ISO codes for DOM-TOM: FR-MQ (Martinique), FR-RE (Réunion), FR-GP (Guadeloupe), FR-GF (Guyane), FR-YT (Mayotte), FR-PM (Saint-Pierre), FR-NC (Nouvelle-Calédonie), FR-TF (Terres australes), FR-BL (Saint-Barthélemy), FR-MF (Saint-Martin).
Advised creating a lookup table or union query to combine metropolitan and overseas department codes for full national coverage.

4. Dashboard Design, Performance, and Best Practices [00:57:58.280 - 01:16:27.640]

Demonstrated creating a new metric (e.g., “document rate”) by combining counts of expected vs. actual documents.
Showed how to add a time dimension (e.g., “date de confirmation”) to enable time-based filtering in dashboards.
Warned against overly granular aggregations (e.g., daily by commune) that cause performance issues due to high cardinality (538 communes).
Recommended aggregating to weekly/monthly levels to improve rendering speed and interpretability.
Advised using Superset’s SQL editor to monitor query load and avoid resource-heavy operations.
Highlighted the inability to customize legend placement on maps as a UI limitation, suggesting manual placement via dashboard layout.

5. Data Access, Permissions, and Virtual Dataset Management [01:16:34.520 - 01:20:30.880]

Troubleshot SQL permission errors: participants lacked access to specific schemas/tables (e.g., geofrontier.rapport_parcelles), despite having user roles.
Clarified that virtual datasets are live queries, not static extracts — changes in source data update visualizations in real time.
Advised participants to explore existing dashboards to understand field names and data structure before writing queries.
Encouraged requesting new tables or views from database administrators if needed, rather than overloading existing ones.
Noted that non-technical users should rely on pre-built virtual datasets rather than writing complex SQL.

6. Course Wrap-up, Feedback, and Next Steps [01:20:35.320 - 01:24:04.960]

Confirmed participants have access to personal Superset instances, enabling continued practice.
Encouraged iterative learning: “Don’t get frustrated — progress takes time.”
Reiterated that hands-on experience with a live instance is superior to theoretical training.
Addressed administrative follow-up: attendees will receive attendance certificates for reporting.
Concluded with appreciation and encouragement to continue exploring Superset’s capabilities for advanced mapping and analytics.

Appendix

Key Principles

Filter Scoping: Use scoping to control which visualizations interact via cross-filters; avoid unintended propagation.
Dimension vs. Metric: Dimensions define grouping (e.g., department); metrics define aggregation (e.g., count of documents).
Virtual Datasets: Are live SQL queries — not snapshots. Changes in source data reflect immediately in visualizations.
Performance: Avoid high-cardinality aggregations (e.g., daily + commune). Aggregate to weekly/monthly levels for better performance and clarity.

Tools Used

Apache Superset (v2+)
PostgreSQL (via SQL queries)
Regex pattern matching (substring, regexp, left, concat)
GeoJSON map layer (with known limitations for DOM-TOM)
SQL-based virtual datasets

Common Pitfalls

Case sensitivity in field names breaking cross-filters.
Using count(distinct *) on non-unique fields when count(*) suffices.
Extracting 2-digit department codes from 3-digit DOM-TOM codes, causing blank map regions.
Querying high-cardinality fields (e.g., 500+ communes) without aggregation, causing server strain.
Assuming map legends are customizable — they are not in standard Superset GeoJSON layers.

Practice Suggestions

Create a virtual dataset that combines metropolitan and overseas department codes using a CASE statement and UNION.
Build a dashboard with two linked charts: one showing document count by department, another showing document rate (actual/expected).
Test filter scoping: link only two charts to interact, leaving others static.
Use Superset’s SQL editor to inspect and optimize queries before adding to dashboards.
Export and review sample data from existing dashboards to understand field structure before writing new queries.