Getting started with Apache Superset - romain-w5vd-20241121-184941

Summary

Overview

This is a collaborative troubleshooting session from a data analytics or business intelligence course, in which participants try to calculate a fraud rate metric in a data visualization tool (Apache Superset, according to the session title). The group hits persistent calculation errors, including absurd values such as 1997%, and explores potential causes: SQL syntax, aggregation functions, caching, and tool-specific behavior. The session ends with the trainer acknowledging that the issues are unresolved and suspending the activity until after a break, with a commitment to revisit it with clearer guidance.

Topic (Timeline)

1. Fraud Rate Calculation Issues and Initial Debugging [00:00:05 - 00:02:17]

Participants identify an incorrect fraud rate output. One user reports that removing a SELECT statement from the calculation (likely a SQL expression in the metric definition) improved the results, which suggests the query structure was at least part of the problem. Another user confirms seeing the same error. The group starts to question whether the metric should be edited directly in the Dataset view rather than in the visualization layer, and whether removing transaction counts affects the fraud rate calculation.

2. Erroneous Results and Suspected Formula Errors [00:02:27 - 00:04:25]

A participant observes a fraud rate of 1997%, which is clearly invalid. The group hypothesizes a double percentage conversion: the formula may multiply by 100 while the tool already converts the value to a percentage for display. They also question whether the formula SUM(is fraud) / COUNT(*) is being misinterpreted by the system. Modifying the queries does not change the results, which points to a further issue, possibly data type handling or aggregation context.
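The double-conversion hypothesis can be illustrated with plain arithmetic. The sketch below uses hypothetical counts (they are not taken from the session data) and Python's percent format, which multiplies by 100 on its own, as a stand-in for a BI tool's percentage display format.

```python
# Hypothetical counts, chosen only to illustrate the arithmetic.
fraud_count = 1997
total_count = 10_000

ratio = fraud_count / total_count      # plain ratio between 0 and 1
ratio_times_100 = ratio * 100          # formula that already multiplies by 100

# The "%" format spec multiplies by 100 itself, like a percent display format.
correct_display = f"{ratio:.2%}"
double_display = f"{ratio_times_100:.2%}"

print(correct_display)   # 19.97%
print(double_display)    # 1997.00%
```

If the displayed value is exactly 100 times a plausible rate, removing either the multiplication or the percent format (not both) is the first thing to try.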

3. Caching, Data Refresh, and Tool Behavior [00:04:46 - 00:06:14]

Participants notice that changes to the formula do not reflect in the output, leading to confusion. One user discovers that manually refreshing the cache (via a top-right refresh button) resolves the issue, resulting in a 0% fraud rate. This suggests the tool was displaying stale or cached results. The group observes Romain’s screen and confirms the refresh action led to a visible change, validating cache as a contributing factor.

4. Aggregation Logic and Field Confusion [00:06:20 - 00:08:41]

The group debates whether SUM(is fraud) or COUNT(case when fraud) is more appropriate. They question their understanding of the underlying fields (the transcript records “caisse” and “zen”, which may be speech-to-text renderings of CASE and WHEN rather than field names) and consider using COUNT(*) as a simpler alternative. Despite multiple attempts, they do not reach a correct result. The trainer acknowledges that the session has become unproductive, apologizes for the confusion, and proposes pausing at 12:00 to regroup after lunch.
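The two aggregation styles under debate can be compared on a tiny table. This is a minimal sketch using SQLite through Python's sqlite3 module; the table name transactions, the column name is_fraud, and the sample rows are assumptions for illustration, not the course dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, is_fraud INTEGER)")
# Hypothetical sample: 2 fraudulent rows out of 8.
conn.executemany(
    "INSERT INTO transactions (is_fraud) VALUES (?)",
    [(1,), (0,), (0,), (1,), (0,), (0,), (0,), (0,)],
)

fraud_by_sum, fraud_by_count_case, count_with_else, total_rows = conn.execute(
    """
    SELECT
        SUM(is_fraud),                                    -- adds up the 0/1 flag
        COUNT(CASE WHEN is_fraud = 1 THEN 1 END),         -- counts only non-NULL results
        COUNT(CASE WHEN is_fraud = 1 THEN 1 ELSE 0 END),  -- ELSE 0 is non-NULL, so every row counts
        COUNT(*)
    FROM transactions
    """
).fetchone()
conn.close()

print(fraud_by_sum, fraud_by_count_case, count_with_else, total_rows)  # 2 2 8 8
```

With a 0/1 flag, SUM and COUNT(CASE WHEN ... THEN 1 END) agree. Adding ELSE 0 inside COUNT silently turns the numerator into the total row count, because COUNT counts every non-NULL value.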

5. Session Closure and Follow-Up Commitment [00:08:45 - 00:08:57]

The trainer ends the session, promising to resolve the issues before the next meeting. They commit to reducing SQL complexity and improving clarity in future sessions, signaling awareness of the group’s frustration and the need for better instructional design.

6. Post-Break Ambient Audio and Technical Glitch [00:08:59 - 00:18:30]

The remainder of the transcript contains repeated audio cues (“Bon appétit”) likely from automated system messages or background noise. A brief, disconnected comment about “volets” (shutters) being broken appears at the end, unrelated to the main content—possibly a technical glitch or off-topic remark from a participant. No substantive instructional content follows the break.

Appendix

Key Principles

Fraud rate = fraudulent transactions / total transactions. Do not multiply by 100 if the tool's display format already converts the ratio to a percentage.
Aggregation functions (SUM, COUNT) must be applied correctly in context; avoid mixing logical and numeric fields without explicit casting.
Data visualization tools often cache results; a manual refresh may be needed before formula changes show up.
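One way the casting principle matters: in some SQL engines, dividing two integer aggregates performs integer division. The sketch below shows this in SQLite (via Python's sqlite3 module) with a hypothetical transactions table and is_fraud column. Whether the database used in the session behaves the same way is not established by the transcript.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (is_fraud INTEGER)")
# Hypothetical sample: 1 fraudulent row out of 4.
conn.executemany("INSERT INTO transactions VALUES (?)", [(1,), (0,), (0,), (0,)])

integer_rate, real_rate = conn.execute(
    """
    SELECT
        SUM(is_fraud) / COUNT(*),               -- integer / integer in SQLite
        CAST(SUM(is_fraud) AS REAL) / COUNT(*)  -- explicit cast forces real division
    FROM transactions
    """
).fetchone()
conn.close()

print(integer_rate)  # 0
print(real_rate)     # 0.25
```

Division semantics are engine-specific, so it is worth checking the result on a few known rows before trusting a rate that comes out as exactly 0.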

Tools Used

Apache Superset, according to the session title; metrics are defined with SQL expressions.
Dataset editing interface for metric definition.

Common Pitfalls

Double-conversion to percentage (e.g., multiplying by 100 when tool already displays %).
Not refreshing the data model after formula changes.
Using SELECT in a calculated column or metric definition, where a full query is not valid syntax.
Confusion between COUNT(*) and COUNTIF/CASE WHEN logic.
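The SELECT pitfall is easier to see with a simplified model of what a BI tool does with a metric definition: it places the expression inside a query that it generates itself. The function run_metric below is a hypothetical stand-in for that step, not Superset's actual query builder, and the table and column names are assumptions.

```python
import sqlite3


def run_metric(conn, metric_expression):
    """Hypothetical stand-in for a BI tool: embed the expression in a generated query."""
    query = f"SELECT {metric_expression} AS fraud_rate FROM transactions"
    return conn.execute(query).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (is_fraud INTEGER)")
conn.executemany("INSERT INTO transactions VALUES (?)", [(1,), (0,), (0,), (0,)])

# A bare aggregate expression fits into the generated query.
expression_result = run_metric(conn, "1.0 * SUM(is_fraud) / COUNT(*)")

# A full SELECT statement ends up nested where only an expression is allowed.
try:
    run_metric(conn, "SELECT 1.0 * SUM(is_fraud) / COUNT(*) FROM transactions")
    select_error = None
except sqlite3.OperationalError as exc:
    select_error = str(exc)

conn.close()
print(expression_result)  # 0.25
print(select_error)
```

Under this model, the metric field should hold only the aggregate expression; the tool supplies the SELECT and FROM parts.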

Next Steps (Inferred)

Review the exact SQL expression used for the fraud rate.
Verify data types of “is fraud” field (boolean vs. integer).
Test with a simplified dataset to isolate the issue.
Disable caching temporarily during development.
Provide clear documentation on metric creation workflow.