Login Register

Data Mining with Excel - jhon-p09s-20251218-215001

← Back to Recording

Summary

Overview

This course session provides a hands-on, step-by-step exploration of data analysis techniques using Power BI, Python, and Power Query, focusing on three core methodologies: time series forecasting (moving average and regression), association rule mining (Apriori), and correlation analysis. The instructor guides learners through real-world data preparation, algorithm configuration, troubleshooting language/region configuration issues, and interpretation of results in business contexts. Emphasis is placed on understanding the difference between data, information, and knowledge, and how analytical outputs must be interpreted through domain expertise.

Topic (Timeline)

1. Introduction to Data Mining, Machine Learning, and AI Concepts [00:10:55 - 00:15:12]

Defined classical data mining as extracting knowledge from historical business data, rooted in statistics.
Differentiated data mining from machine learning: ML enables autonomous decision-making, while data mining relies on predefined algorithms.
Clarified that AI can perform similar tasks but lacks transparency and control over algorithm selection, potentially leading to uninterpretable outcomes.
Highlighted two universal pillars of any analytics project: (1) data cleaning and quality validation, and (2) algorithm execution to generate insights.
Introduced Power Query and Excel for basic statistical analysis, and noted the shift to Python for ML tasks in modern Power BI versions.
Acknowledged a failed prediction example from the prior session due to configuration issues, setting up the next segment for troubleshooting.

2. Troubleshooting Python-Power BI Integration: Regional and Encoding Issues [00:15:12 - 00:32:01]

Shared screen to debug a failed time series prediction in Power BI using Python.
Identified root cause: regional configuration conflict between English OS, Spanish Power BI, and Python’s decimal/magnitude parsing (e.g., “98.69” misread as “9869”).
Demonstrated debugging by executing the same Python script in Visual Studio Code to isolate the issue.
Fixed encoding by adding encoding='latin1' and delimiter=',' in the Python script.
Corrected column name mismatches (e.g., “ejemplo 02” with space → renamed to “ejemplo_02”).
Explained the difference between moving average (weights recent periods more heavily) and regression (fits trend across full window).
Interpreted results: moving average gave 8.9 (indicating recent upward trend), regression gave 8.1 (indicating historical downward trend).
Emphasized that outputs are information; business knowledge is required to explain why trends occur (e.g., budget cuts, seasonality).

3. Association Rule Mining: Detecting Frequent Co-occurrences [00:32:02 - 00:44:56]

Introduced association rule mining using the Apriori algorithm to find frequent itemsets (e.g., “service + insumo” purchased together).
Used synthetic dataset of purchase orders with categories (e.g., service, license, security).
Defined key metrics:
- Support: frequency of an itemset in transactions (e.g., 20% = appears in 20/100 orders).
- Confidence: conditional probability (e.g., 60% confidence = if service is bought, 60% of the time insumo is also bought).
- Lift: ratio of observed vs. expected co-occurrence (indicates strength of association).
Interpreted results: “license + security” → “service” with 100% confidence (synthetic data), implying budgeting for service costs when licensing/security is planned.
Stressed that business context determines actionability: analysts must interpret rules based on internal operations.

4. Transition to Correlation Analysis: From Presence-Absence to Monetary Values [00:44:56 - 01:08:46]

Shifted from binary (1/0) presence/absence matrices to monetary value matrices for correlation analysis.
Reconfigured dataset: grouped by period and category, then summed spending per category per period.
Used Power Query’s “Column from Examples” to clean category names (e.g., “585 - Mantenimiento” → “Mantenimiento”).
Applied “Pivot Column” to transform categories into columns, with periods as rows and summed spending as values.
Created a custom column in M language to calculate total spending per row using List.Sum(List.RemoveItems(Record.ToList(_), {"periodo"})) — dynamically summing all category columns while excluding the period column.
Cleaned final matrix by removing redundant columns (period, total) to leave only category spending values.

5. Computing and Interpreting Correlation Matrices [01:08:46 - 01:17:39]

Exported cleaned matrix as CSV and imported into Power BI for Python-based correlation analysis.
Executed Python script to compute Pearson correlation coefficients between spending categories.
Interpreted output: values range from -1 (perfect negative correlation) to +1 (perfect positive correlation).
Analyzed examples:
- Mantenimiento & Seguridad: positive correlation (e.g., 0.78) — spending increases together.
- Mantenimiento & Viajes: negative correlation (e.g., -0.2) — when maintenance spending rises, travel spending falls.
Emphasized that correlation does not imply causation; business context is needed to understand drivers (e.g., cost-cutting in travel during maintenance investments).
Noted that 12 periods of data (not 10) were used, indicating a full year of observations.
Concluded with a personal note on regional climate, ending the session.

Appendix

Key Principles

Data vs. Information vs. Knowledge: Data is raw numbers; information is derived patterns (e.g., trends); knowledge is contextual interpretation (e.g., why a trend exists).
Algorithm Selection: Use Apriori for small datasets; consider FPGrowth for large ones. Use moving average for short-term stable trends; regression for long-term trend analysis.
Regional Configuration Matters: Decimal separators, encodings, and language settings must align across OS, tools, and scripts to prevent parsing errors.

Tools Used

Power BI: Data loading, transformation, Python execution, visualization.
Power Query (M language): Data cleaning, pivoting, dynamic column operations.
Python (Visual Studio Code): Scripting for association rules and correlation analysis.
CSV: Standard data exchange format; requires careful encoding and delimiter handling.

Common Pitfalls

Encoding/Region Conflicts: Spanish Power BI + English OS + Python → decimal parsing errors.
Static Column References: Hardcoding column names in M scripts breaks when schema changes.
Misinterpreting Correlation: Assuming causation from correlation without business context.
Data Format Assumptions: Assuming binary (1/0) data is sufficient for correlation — monetary values are required.

Practice Suggestions

Recreate the association rule analysis using your own purchase/order data.
Build a correlation matrix from your department’s monthly spending data.
Practice M language functions: List.Sum, Record.ToList, Record.RemoveFields.
Always validate Python script output in a standalone environment (e.g., VS Code) before embedding in Power BI.