Summary
Overview
This session is a hands-on technical course focused on data processing workflows using a big data tool (likely Talend or a similar ETL platform). The instructor guides participants through building a data pipeline to analyze NBA shot data, including data ingestion from HDFS, schema definition, conditional logic for converting boolean shot outcomes into numeric counts, aggregation by player and position, and troubleshooting common configuration errors. The session concludes with setup instructions for Docker and Docker Compose to prepare a local environment for the next day’s session on integrating traditional databases with big data systems.
Topic (Timeline)
1. Initial Setup and Data Connection Troubleshooting [00:00:00 - 00:02:47]
- Instructor begins by attempting to edit and reconfigure a job whose name is garbled in the transcript (rendered as “yoke”), running into naming and connectivity issues.
- Attempts to connect components via subjob triggers (transcribed as “subyacht”, most likely “subjob”) fail due to misconfigured triggers or missing connections.
- Issues with virtual machine access and scroll settings are noted; the instructor confirms connectivity using “nba_unido” as the key identifier.
- Emphasis on ensuring correct access permissions to the virtual machine and proper component linking.
2. Job Configuration and HDFS Input Setup [00:02:47 - 00:05:48]
- Job is renamed to “job integración datos nba” for clarity.
- Instructor connects the HDFS connection component (transcribed as “t_hdfs_cis3”) to “t_hdfs_input_1” via a subjob trigger (likely Talend’s “On Subjob OK”).
- An error occurs due to a missing leading slash in the HDFS path; it is corrected to “/nba_unido” (see the sketch after this list).
- Execution is attempted, and successful data flow is confirmed after path fix.
- New job “job calcular estadísticas nba” is created, reusing connection settings from prior job but removing S3 components.
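For reference, a minimal sketch of reading the corrected “/nba_unido” path with the plain Hadoop Java client rather than the Talend component used in class; the namenode URI and the number of previewed lines are assumptions:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadNbaUnido {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode address; the session's actual cluster URI was not captured.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/nba_unido")), StandardCharsets.UTF_8))) {
            // The leading slash matters: "nba_unido" (relative) is not the same as "/nba_unido" (absolute).
            for (int i = 0; i < 5; i++) {
                String line = reader.readLine();
                if (line == null) break;
                System.out.println(line);
            }
        }
    }
}
```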
3. Data Flow Design: HDFS Input, Map, and Schema Reuse [00:05:48 - 00:07:54]
- New HDFS input component added to the job.
- A “tRow” component (likely tLogRow, used to print rows for inspection) is introduced to structure the data flow.
- The HDFS input is connected to “tMap” and then to the “tRow” component.
- Schema is reused from previously loaded metadata file “nba_metadata” to ensure field consistency.
- Instructor confirms schema includes: player name, position, shot made (as string).
4. Field Selection and Data Preview [00:07:54 - 00:10:43]
- Only three fields selected for output: “player name”, “position”, and “shot made”.
- Instructor realizes the input file was compressed and incorrectly named; corrected to “nba_unido”.
- File format is set to “delimited” with metadata from “nba” file.
- Schema preview confirms correct field extraction: player name, position, shot made.
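A small sketch of the three kept fields as a Java record, assuming for illustration that they are the first three comma-separated columns; the real file may carry additional columns, and the sample row is hypothetical:

```java
public class ParseShotLine {

    // Minimal holder for the three fields kept in the session's schema.
    record NbaShot(String playerName, String position, String shotMade) {}

    static NbaShot parse(String line) {
        // The file is comma-delimited; splitting on ";" here would return a single field.
        String[] fields = line.split(",");
        return new NbaShot(fields[0], fields[1], fields[2]);
    }

    public static void main(String[] args) {
        // Hypothetical example row, for illustration only.
        NbaShot shot = parse("Stephen Curry,PG,true");
        System.out.println(shot.playerName() + " | " + shot.position() + " | " + shot.shotMade());
    }
}
```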
5. Boolean-to-Numeric Conversion and Aggregation Logic [00:10:43 - 00:17:27]
- “shot made” field is a string (“true”/“false”), not boolean; must be converted to numeric for counting.
- Two new variables are created using the ternary operator in tMap (sketched after this list):
  - anotaciones_acertadas: 1 if “shot made” == “true”, else 0
  - anotaciones_erradas: 1 if “shot made” == “false”, else 0
- Data types changed from string to integer for both variables.
- Output fields passed to “tAggregate” for grouping.
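tMap expressions are written in Java, so the two ternary expressions can be sketched as follows; the row and column identifiers are assumptions, not names captured from the recording:

```java
public class ShotFlags {
    public static void main(String[] args) {
        // Value as it arrives from the file: the string "true" or "false", not a boolean.
        String shotMade = "true";

        // tMap-style ternary expressions; in the job they would read roughly:
        //   "true".equals(row1.shot_made)  ? 1 : 0   -> anotaciones_acertadas
        //   "false".equals(row1.shot_made) ? 1 : 0   -> anotaciones_erradas
        // (the row/column names above are assumptions).
        int anotacionesAcertadas = "true".equals(shotMade) ? 1 : 0;
        int anotacionesErradas = "false".equals(shotMade) ? 1 : 0;

        System.out.println(anotacionesAcertadas + " made, " + anotacionesErradas + " missed");
    }
}
```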
6. Aggregation, Grouping, and Sorting [00:17:27 - 00:20:38]
- tAggregate groups by “player name” and “position”.
- Two sum functions applied:
  - Sum of anotaciones_acertadas → total shots made
  - Sum of anotaciones_erradas → total shots missed
- Output passed to “tSortRow” to sort by player name and position in ascending order.
- Execution initiated to generate final statistics.
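The same grouping and sorting, sketched outside Talend in plain Java to make the logic explicit; the input rows and player names are placeholders, and in the job this work is done by tAggregate and tSortRow rather than hand-written code:

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AggregateShots {

    record Shot(String playerName, String position, int made, int missed) {}
    record Totals(int made, int missed) {}

    public static void main(String[] args) {
        // Hypothetical rows standing in for the tMap output (player, position, made flag, missed flag).
        List<Shot> rows = List.of(
                new Shot("Player A", "PG", 1, 0),
                new Shot("Player A", "PG", 0, 1),
                new Shot("Player B", "SF", 1, 0));

        // Group by (player name, position) and sum both flags, mirroring the tAggregate step.
        Map<String, Totals> totals = new LinkedHashMap<>();
        for (Shot s : rows) {
            String key = s.playerName() + "|" + s.position();
            totals.merge(key, new Totals(s.made(), s.missed()),
                    (a, b) -> new Totals(a.made() + b.made(), a.missed() + b.missed()));
        }

        // Sort by the composite key ascending, mirroring tSortRow on player name and position.
        totals.entrySet().stream()
                .sorted(Map.Entry.comparingByKey(Comparator.naturalOrder()))
                .forEach(e -> System.out.println(
                        e.getKey() + " -> made=" + e.getValue().made() + ", missed=" + e.getValue().missed()));
    }
}
```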
7. Error Diagnosis: Data Type and Separator Mismatches [00:20:38 - 00:25:12]
- Execution fails due to “shot made” field still being string in output; corrected in tMap.
- The field separator in the HDFS input was incorrectly set to semicolon (;) instead of comma (,); this is fixed (see the sketch after this list).
- Re-execution confirms successful data flow and correct aggregation output.
- Output shows aggregated shot statistics per player and position; example rows include a PG with 3 made and 9 missed, and an SF with 265 made and 312 missed.
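A quick illustration of why the separator mattered, using a hypothetical comma-delimited row:

```java
import java.util.Arrays;

public class DelimiterCheck {
    public static void main(String[] args) {
        // Hypothetical comma-delimited row from the NBA file.
        String line = "Player A,PG,true";

        // With the wrong separator (";"), the whole line comes back as one field,
        // so "shot made" never lands in its own column and the aggregation breaks.
        System.out.println(Arrays.toString(line.split(";"))); // [Player A,PG,true]

        // With the correct separator (","), the three expected fields appear.
        System.out.println(Arrays.toString(line.split(","))); // [Player A, PG, true]
    }
}
```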
8. HDFS File Integrity and Connection Validation [00:25:12 - 00:27:11]
- Instructor investigates file size anomaly (9.8 MB vs expected ~10 B); suspects corruption or misconfiguration.
- Realizes file was not properly transferred; confirms need to re-upload or validate source.
- Decides to defer full resolution to next session due to time constraints.
9. Docker Environment Setup for Next Session [00:27:11 - 00:33:48]
- Instructor instructs participants to close all applications and free memory.
- A compressed file named “docker-highmaster” (likely a Docker Compose project archive; the name may be an ASR artifact) is extracted from the Downloads folder.
- A terminal is opened in the extracted directory, and the “docker-compose up -d” command is executed to start the containers.
- Multiple participants experience command syntax issues (e.g., “docker” vs “docker-compose”).
- The command runs successfully, and Docker begins pulling the required images.
- Memory usage monitored; system remains stable at ~68% usage.
10. Session Wrap-up and Next Steps [00:33:48 - 00:46:30]
- Instructor confirms job and VM are saved and left running for next session.
- Next session will focus on integrating traditional databases with big data systems (HDFS/Spark).
- Instructor offers to share virtual machine snapshot to synchronize progress.
- Session ends with informal conversation about unrelated purchases (gloves, glasses, jeans), likely due to ASR errors or off-topic audio.
Appendix
Key Principles
- Schema Reuse: Always reuse metadata schemas from previously validated files to avoid field mismatches.
- String vs Boolean: Treat boolean-like strings (“true”/“false”) as strings in ETL tools; convert explicitly using ternary operators.
- Path Syntax: HDFS paths must include a leading slash (“/”), and delimited files must use the correct field separator (comma, not semicolon).
- Component Triggering: Ensure subjob triggers (e.g., “On Subjob OK”) are properly connected so downstream components are initiated.
Tools Used
- ETL Tool: Likely Talend Open Studio (based on component names: tHDFSInput, tMap, tAggregate, tSortRow)
- Data Source: NBA shot data in CSV format stored in HDFS
- Environment Setup: Docker, Docker Compose for containerized big data services
Common Pitfalls
- Missing leading slash in HDFS paths → file not found
- Incorrect delimiter in delimited file → parsing errors
- Unconverted string booleans → aggregation fails
- Unsaved job configurations → rework required
- Leftover processes consuming memory → system instability
Practice Suggestions
- Rebuild the entire pipeline from scratch using a clean HDFS file.
- Test each component individually before chaining.
- Use tLogRow to preview data at each stage.
- Always validate file encoding and delimiters before ingestion.
- Practice Docker Compose setup with sample YAML files to prepare for next session.