15 videos 📅 2025-01-27 09:00:00 America/Bahia_Banderas

  • 24:24 · 2025-01-27 13:13:59
  • 2:06:12 · 2025-01-27 13:42:41
  • 3:36:29 · 2025-01-28 09:08:14
  • 4:33 · 2025-01-28 13:48:42
  • 55:46 · 2025-01-28 14:06:51
  • 2:02 · 2025-01-29 10:22:33
  • 1:02:14 · 2025-01-29 10:25:14
  • 2:10 · 2025-01-29 11:38:26
  • 2:26 · 2025-01-29 12:03:00
  • 1:23:37 · 2025-01-29 12:05:56
  • 35:40 · 2025-01-29 15:01:26
  • 1:40:43 · 2025-01-30 09:07:07
  • 1:08:48 · 2025-01-30 11:20:20
  • 1:10:50 · 2025-01-30 13:15:56
  • 3:50:03 · 2025-01-31 07:20:07

Course recordings on the DaDesktop for Training platform

Visit NobleProg websites for related courses

Course outline: Talend Big Data Integration (Course code: talendbigdata)

Categories: Big Data · Talend

Summary

Overview

This course session provides a comprehensive, hands-on tutorial on using Talend Big Data to interact with a Hadoop cluster via HDFS. It covers job design principles, component configuration (tHDFSConnection, tHDFSPut, tHDFSInput), hierarchical job execution using sub-jobs and components, error handling with tMsgBox, environment setup including virtual machine networking and Hadoop cluster initialization, and the implementation of context variables for flexible, reusable configurations. The session progresses from basic job structure to advanced practices such as metadata-based cluster connections and environment-aware deployments.

Topic (Timeline)

1. Job Structure and Execution Hierarchy [00:00:00 - 00:08:22]

  • Introduction to job naming and execution flow in Talend Big Data.
  • Explanation of sub-job hierarchy: sequential execution order matters even under multi-threading.
  • Demonstration of using components (tHDFSPut/tHDFSInput) vs. sub-jobs: components execute sequentially only after successful completion of prior steps; sub-jobs are independent executable units.
  • Use of tMsgBox for error handling: triggers only on sub-job failure, enabling notifications or logging on errors (a conceptual sketch of this behavior follows this list).
  • Correction of connection logic: removing incorrectly drawn connections and properly linking components via “On SubJob Error” triggers.
  • Creation of a new job (“job_on_the_record_integrator”) and demonstration of two methods to execute nested jobs: (1) using tRunJob component with job reference, and (2) dragging and dropping jobs from the repository with “in sequence” execution mode.
  • Warning about undefined schema in components and preliminary file structure checks.
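
A conceptual Java sketch of the execution-order and error-trigger behavior described above (this is not the code Talend generates; the sub-job names and messages are invented for illustration): each "sub-job" runs strictly in sequence, and the error handler fires only when one of them fails, much like a tMsgBox reached through “On SubJob Error”.

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class SubJobSequenceSketch {
        public static void main(String[] args) {
            // Hypothetical sub-jobs, executed strictly in sequence.
            Map<String, Runnable> subJobs = new LinkedHashMap<>();
            subJobs.put("subir_archivo_hdfs", () -> System.out.println("uploading files..."));
            subJobs.put("leer_archivo_hdfs", () -> { throw new RuntimeException("connection refused"); });

            for (Map.Entry<String, Runnable> subJob : subJobs.entrySet()) {
                try {
                    subJob.getValue().run();
                } catch (RuntimeException e) {
                    // Rough equivalent of a tMsgBox wired via "On SubJob Error":
                    // it runs only because a sub-job failed.
                    System.out.println("Error HDFS: " + e.getMessage());
                    break; // subsequent sub-jobs are not executed after the failure
                }
            }
        }
    }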

2. Hadoop Cluster Setup and Network Configuration [00:08:29 - 00:27:32]

  • Setup of a virtualized Hadoop cluster (single NameNode and DataNode) on a Windows host using Hyper-V.
  • Verification of network connectivity between Windows host and VM: IP address confirmation via ipconfig (Windows) and ip addr (Linux VM), ping test between 10.0.3.15 (host) and 10.0.3.16 (VM).
  • Initialization of Hadoop services: formatting HDFS (hdfs namenode -format), starting HDFS (start-dfs.sh) and YARN (start-yarn.sh) daemons.
  • Verification of running services using jps command (NameNode, DataNode, ResourceManager, NodeManager, SecondaryNameNode, Jps).
  • Confirmation of required open ports: 9870 (HDFS Web UI) and 9000 (HDFS RPC).
  • Accessing HDFS Web UI via browser at http://10.0.3.16:9870 to confirm cluster status.
  • Creation of the HDFS directory /axa using hdfs dfs -mkdir /axa and granting full permissions via hdfs dfs -chmod 777 /axa.
  • Refreshing the HDFS file browser to confirm the directory creation (a programmatic connectivity check is sketched below).
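
As a complement to the browser check, a minimal Java sketch of the same verification against the Hadoop client API (assuming the hadoop-client libraries are on the classpath and the session's NameNode at 10.0.3.16:9000 is reachable; this is not part of the course material):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsConnectivityCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Connect as user "hdfs", matching the open-permission setup above.
            FileSystem fs = FileSystem.get(URI.create("hdfs://10.0.3.16:9000"), conf, "hdfs");

            // List the root directory; /axa should appear after the mkdir step.
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath() + "  perms=" + status.getPermission());
            }
            fs.close();
        }
    }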

3. File Upload to HDFS Using Talend Components [00:27:35 - 00:46:27]

  • Creation of a new Talend job: “job_on_the_record_subir_archivo_hdfs”.
  • Configuration of tHDFSConnection component: set distribution to “Universal”, version 3.x, and IP address to 10.0.3.16 (no authentication required due to open permissions).
  • Configuration of tHDFSPut component: use existing connection, set local directory to C:\temp\ (Windows path), and HDFS directory to /axa.
  • Upload of CSV files: production.csv, fill_actor.csv, super_market_on_the_record_sales.csv — ensuring exact filename matching (case-sensitive on Linux).
  • Use of tMsgBox for error messaging: configured with title “Error HDFS” and message “Error de conexión al cluster HDFS” (i.e., “HDFS cluster connection error”).
  • Execution of job and verification via HDFS Web UI: files successfully uploaded and replicated (replication factor = 3).
  • Correction of the “Action File” setting from “Create” to “Overwrite” to avoid errors on re-execution (the equivalent overwrite behavior is shown in the sketch below).
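
For intuition only, a sketch of the same upload performed directly against the Hadoop Java API (not Talend's generated code; same assumed cluster address and local path as above). The overwrite flag mirrors the “Overwrite” action chosen in tHDFSPut:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsPutSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://10.0.3.16:9000"), conf, "hdfs");

            // delSrc = false (keep the local copy), overwrite = true, so a
            // re-run does not fail when /axa/production.csv already exists.
            fs.copyFromLocalFile(false, true,
                    new Path("C:/temp/production.csv"),
                    new Path("/axa/production.csv"));
            fs.close();
        }
    }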

4. Context Variables and Flexible Job Design [00:46:31 - 00:58:11]

  • Introduction to Talend context variables for environment flexibility.
  • Creation of a project-level context group named “AXA” with environments: “QA” and “Production”.
  • Definition of context variable url_cluster_hdfs with values: 10.0.3.16:9000 (QA), 10.0.3.17:9000 (Production).
  • Import of project context into job “leer_archivo_hdfs” via context import button.
  • Configuration of tHDFSInput component to use context variable: context.url_cluster_hdfs (no quotes required in context field).
  • Benefit: centralized IP/port management; changing the context value updates all dependent jobs automatically (a sketch of how such a variable is selected per environment follows this list).
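
A hypothetical stand-alone Java sketch of the idea behind the “AXA” context group (the class, map, and argument handling here are inventions for illustration; in Talend the value is simply referenced as context.url_cluster_hdfs, and built jobs select the environment with a --context argument):

    import java.util.Map;

    public class ContextSketch {
        // One value of url_cluster_hdfs per environment, as in the AXA context group.
        static final Map<String, String> URL_CLUSTER_HDFS = Map.of(
                "QA",         "10.0.3.16:9000",
                "Production", "10.0.3.17:9000");

        public static void main(String[] args) {
            // Pick the environment at launch time; default to QA.
            String environment = args.length > 0 ? args[0] : "QA";
            String urlClusterHdfs = URL_CLUSTER_HDFS.get(environment);
            System.out.println("Connecting to hdfs://" + urlClusterHdfs);
        }
    }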

5. Schema Definition and File Reading with Metadata [00:58:11 - 01:14:41]

  • Reading structured CSV files via tHDFSInput: requires schema definition.
  • Creation of metadata via delimited file connection: used sample file fill_actor.csv to extract schema.
  • Exported schema as XML (fill_actor.xml) and imported into tHDFSInput component.
  • Correction of file separator: changed from comma to semicolon (;) as per actual file format.
  • Enabled “Header” option (row 1) to skip column names during parsing.
  • Use of tLogRow component to display output in “Table” mode for data verification.
  • Troubleshooting execution errors: resolved by correcting context environment (Production vs QA), ensuring correct IP, and verifying schema alignment.
  • Final successful execution: data displayed correctly in tLogRow after fixing the separator and header settings (a manual read of the same file is sketched below).
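
For reference, a minimal Java sketch doing by hand what tHDFSInput plus tLogRow do here: open the file on HDFS, skip the header row, and split each line on the semicolon separator (again assuming the hadoop-client libraries and the 10.0.3.16:9000 NameNode; not Talend's generated code):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadCsvSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://10.0.3.16:9000"), conf, "hdfs");

            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(new Path("/axa/fill_actor.csv"))))) {
                reader.readLine();                        // row 1 holds column names: skip it
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] fields = line.split(";");    // semicolon-delimited, per the file format
                    System.out.println(String.join(" | ", fields));
                }
            }
            fs.close();
        }
    }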

6. Hadoop Cluster Metadata Configuration [01:14:41 - 01:21:26]

  • Creation of Hadoop cluster metadata in Talend: “hadoop_on_the_record_axa”.
  • Configuration of metadata: distribution = Universal, version = 3.0.x, NameNode URI = 10.0.3.16:9000, ResourceManager URI = 10.0.3.16:8032, JobHistory URI = 10.0.3.16:19888.
  • Username for the connection set to hdfs (note that usernames are case-sensitive).
  • Validation via “Check Services” button: all services confirmed green.
  • Final note: metadata allows reuse across multiple jobs and simplifies cluster connection management (the equivalent client-side properties are sketched below).
  • Session concludes with break announcement at 13:00.
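
As a rough illustration (not Talend internals), the URIs captured in the metadata correspond to standard Hadoop client properties; the sketch below sets them on a plain Configuration and runs a simple “check services”-style probe against the NameNode:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ClusterMetadataSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://10.0.3.16:9000");                  // NameNode URI
            conf.set("yarn.resourcemanager.address", "10.0.3.16:8032");         // ResourceManager URI
            conf.set("mapreduce.jobhistory.webapp.address", "10.0.3.16:19888"); // JobHistory web UI

            // Cheap reachability check: can we see /axa through the NameNode?
            try (FileSystem fs = FileSystem.get(conf)) {
                System.out.println("/axa exists: " + fs.exists(new Path("/axa")));
            }
        }
    }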

Appendix

Key Principles

  • Job Hierarchy: Use sub-jobs for modular, reusable logic; use components for sequential, dependency-driven flows.
  • Error Handling: Always connect tMsgBox via “On SubJob Error” to capture failures in sub-jobs.
  • Context Variables: Use project-level contexts for environment-specific configurations (dev/QA/prod) to avoid hardcoding IPs or paths.
  • Schema Definition: Always define schema via metadata export/import for CSV/JSON files — never rely on auto-detection.
  • Path Handling: Use forward slashes (/) in HDFS paths; escape backslashes (\\) in Windows local paths if needed (see the small example below).
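
A tiny Java illustration of that last point (the variable names are arbitrary):

    public class PathExamples {
        public static void main(String[] args) {
            String localDir = "C:\\temp\\"; // Windows local path: backslashes must be escaped in string literals
            String hdfsDir  = "/axa";       // HDFS path: always forward slashes
            System.out.println(localDir + " -> " + hdfsDir);
        }
    }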

Tools Used

  • Talend Big Data Studio
  • Hadoop HDFS (version 3.x)
  • Hyper-V (Windows VM host)
  • Linux terminal (Ubuntu-based Hadoop VM)
  • Web browser (for HDFS UI at port 9870)

Common Pitfalls

  • Incorrect IP/port in HDFS connection (e.g., using localhost or wrong port).
  • Forgetting to set “Header” in tHDFSInput when CSV has column names.
  • Using “Create” instead of “Overwrite” in tHDFSPut when re-running jobs.
  • Misconfiguring context environment (e.g., job runs in Production but context points to QA IP).
  • Case sensitivity in filenames on Linux-based HDFS.
  • Not importing project context into job — causing “undefined variable” errors.

Practice Suggestions

  • Recreate the entire workflow using different CSV files and verify data integrity.
  • Modify context variables to simulate a cloud-based Hadoop cluster (e.g., AWS EMR endpoint).
  • Build a job that reads from HDFS, transforms data (e.g., filter rows), and writes back — introducing tMap.
  • Test error handling by intentionally corrupting a file on HDFS and observing tMsgBox behavior.
  • Export and import metadata for multiple clusters (dev, staging, prod) and switch between them dynamically.