Summary

Overview

This course session provides a hands-on tutorial on integrating Apache Hadoop’s HDFS and Amazon S3 with Talend Data Integration. The instructor demonstrates how to configure HDFS and S3 connections through metadata repositories, build and execute ETL jobs that download files from S3, transfer them to HDFS, and clean up temporary local files. The session emphasizes practical workflow design using Talend components such as tPrejob, tPostjob, tS3Get, tHDFSPut, and tFileDelete, while highlighting configuration best practices, path syntax, connection reuse, and error handling. It concludes with guidance on saving the virtual machine state to preserve the configurations.


Topic (Timeline)

1. HDFS Connection Setup and Metadata Reuse [00:00:00 - 00:06:08]

  • Introduced Hadoop cluster metadata in Talend, including Hive, HBase, and HCatalog as optional components.
  • Focused on creating an HDFS connection named HDFS_AXA via right-click on the Hadoop cluster → “Create HDFS”.
  • Configured default HDFS input settings: line separator (\n), field delimiter (;), and verified successful connection via “Check”.
  • Demonstrated how to reference pre-configured HDFS metadata connections instead of rebuilding them manually.
  • Emphasized that HDFS connections are stored in metadata for reuse across jobs (a plain-Java equivalent of this setup is sketched after this list).
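
For reference, the “Check” step in the HDFS_AXA metadata corresponds roughly to opening a Hadoop FileSystem handle. The sketch below is a minimal plain-Java equivalent, assuming a hypothetical NameNode host/port and user rather than values from the session:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

import java.net.URI;

public class HdfsConnectionSketch {
    public static void main(String[] args) throws Exception {
        // NameNode URI and user are placeholders; use the values stored in the HDFS_AXA metadata.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        // Rough equivalent of Talend's "Check" button: if this call succeeds, the cluster is reachable.
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf, "hdfs")) {
            System.out.println("Connected, home directory: " + fs.getHomeDirectory());
        }
    }
}
```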

2. HDFS File Download Job with tHDFSGet [00:06:09 - 00:09:51]

  • Created a new job named Job_Descargar_Fichero_Hadoop to download a file (production.csv) from HDFS.
  • Added tHDFSGet and tMsgBox components; connected them with a trigger link (not a row link).
  • Configured tHDFSGet to use the HDFS_AXA metadata connection by setting the property type to “Repository” and selecting HDFS_AXA.
  • Set the HDFS directory to AXA/ (the directory, not the file itself), the local destination to temporal/, and the action to “Overwrite”.
  • Named the output file production.copy.csv.
  • Resolved a configuration error: the components were initially connected with a row link, which triggered a “component has output” warning; corrected to an On Component OK trigger link.
  • Executed the job successfully and confirmed the file appeared in the local temporal/ folder (a plain-Java equivalent of this download is sketched below).
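
A minimal sketch of what tHDFSGet does in this job, using the Hadoop FileSystem API directly; the NameNode address is a placeholder, while the paths mirror the settings described above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsGetSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // placeholder; comes from HDFS_AXA metadata in practice

        try (FileSystem fs = FileSystem.get(conf)) {
            // Remote source and local target mirror the tHDFSGet settings used in the job.
            Path remote = new Path("/AXA/production.csv");
            Path local  = new Path("temporal/production.copy.csv");
            // delSrc = false (keep the HDFS file), useRawLocalFileSystem = true (skip .crc checksum files)
            fs.copyToLocalFile(false, remote, local, true);
        }
    }
}
```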

3. S3 Integration and Job Architecture with tPrejob / tPostjob [00:10:57 - 00:19:29]

  • Introduced a more complex job, Job_Formatear_Archivo_NBA: download from S3 → process → upload to HDFS.
  • Created tPrejob (for initialization) and tPostjob (for cleanup) to structure the job logic.
  • Reused the existing HDFS connection by copying and pasting it from the previous job.
  • Added a tS3Connection component; installed its missing Java libraries via the “Install” button.
  • Configured the S3 connection with static credentials (Access Key and Secret Key) taken from a local file (credenciales_s3), enclosed in double quotes as Java strings.
  • Connected tPrejob → tHDFSConnection → tS3Connection with On Component OK triggers to ensure sequential initialization.
  • Configured tS3Close and tPostjob to close the connections cleanly (an SDK-level sketch of the S3 connection follows this list).
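
For reference, a rough AWS-SDK-for-Java (v1) equivalent of the tS3Connection setup with static credentials; the keys and region below are placeholders, not values from the session:

```java
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class S3ConnectionSketch {
    public static void main(String[] args) {
        // Placeholder keys; in the session they are pasted from the credenciales_s3 file as quoted strings.
        BasicAWSCredentials creds = new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY");

        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withCredentials(new AWSStaticCredentialsProvider(creds))
                .withRegion(Regions.EU_WEST_1) // region is an assumption; not specified in the session
                .build();

        System.out.println("Buckets visible to these credentials: " + s3.listBuckets().size());
    }
}
```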

4. S3 File Download and Local File Management [00:19:59 - 00:26:41]

  • Added a tS3Get component; configured the bucket name (NoblePro) and the file key (team_nba.csv).
  • Set the local destination path manually using Windows-style escaped backslashes: C:\\temporal\\team.csv.
  • Used a Note to document the job’s purpose: “Download team_nba.csv from S3”.
  • Added a tFileDelete component to remove the local copy after the upload, linked to tS3Get via On Component OK (see the sketch after this list).
  • Copied the local file path from tS3Get into tFileDelete so the correct file is deleted.
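
A minimal sketch, assuming the bucket and key named above and an already-configured S3 client, of the tS3Get download plus the tFileDelete-style cleanup:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;

import java.io.File;
import java.nio.file.Files;

public class S3GetAndCleanupSketch {
    public static void main(String[] args) throws Exception {
        // Assumes credentials and region are provided via the default provider chain, not shown here.
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Bucket, key, and local path mirror the tS3Get settings (escaped backslashes on Windows).
        File local = new File("C:\\temporal\\team.csv");
        s3.getObject(new GetObjectRequest("NoblePro", "team_nba.csv"), local);

        // ... the upload to HDFS happens between these two steps in the real job ...

        // tFileDelete-style cleanup; deleteIfExists avoids failing when the file is already gone.
        Files.deleteIfExists(local.toPath());
    }
}
```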

5. HDFS Upload and Cross-Platform Path Issues [00:27:41 - 00:33:57]

  • Added a tHDFSPut component to upload team_nba.csv from the local temporal/ folder to the HDFS AXA/ directory (see the sketch after this list).
  • Configured it to reuse the tHDFSConnection opened in tPrejob.
  • Set the local source to C:\temporal\team.csv and the remote target to /AXA/team_nba.csv.
  • Executed the job and confirmed the file appeared in HDFS (79 MB).
  • Confirmed the local file was deleted successfully.
  • Debugged an error on Fernanda’s machine: tFileDelete failed because the file did not exist, caused by an incorrect bucket name or a case mismatch in the path (e.g., Team vs. team).
  • Noted that the Windows file system is case-insensitive, whereas HDFS and Linux (Ubuntu) paths are case-sensitive, so the same job can behave differently depending on the underlying OS.
  • Concluded that path casing and bucket names must match exactly.
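
A minimal plain-Java sketch of the tHDFSPut step, with a placeholder NameNode address; the paths mirror the session’s settings, and the comments flag the case-sensitivity issue noted above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // placeholder NameNode address

        try (FileSystem fs = FileSystem.get(conf)) {
            // The local Windows path is case-insensitive, but the HDFS target path is case-sensitive:
            // /AXA/team_nba.csv and /axa/Team_nba.csv are different files.
            Path local  = new Path("C:\\temporal\\team.csv");
            Path remote = new Path("/AXA/team_nba.csv");
            // delSrc = false (keep the local file for tFileDelete), overwrite = true
            fs.copyFromLocalFile(false, true, local, remote);
        }
    }
}
```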

6. Virtual Machine State Preservation and Next Steps [00:34:47 - 00:35:39]

  • Instructed students to save the Hyper-V virtual machine state via right-click → “Save” to preserve all configurations.
  • Advised to shut down the Windows VM after saving to avoid usage charges.
  • Mentioned an upcoming session on version control in Talend (built-in job versioning and Git integration), to be demonstrated next time.

Appendix

Key Principles

  • Metadata Reuse: Always define connections (HDFS, S3) in metadata and reference them via “Repository” to avoid duplication and ensure consistency.
  • Job Structure: Use tPrejob for initialization (connections, variables) and tPostjob for cleanup (disconnections, file deletion).
  • Trigger Links: Use On Component OK triggers instead of row links when components do not emit data rows.
  • Path Syntax: Use escaped backslashes (C:\\folder\\file) for absolute Windows paths in Talend components.
  • Credential Security: Store sensitive keys outside the job, e.g. in an external file or context; Talend encodes saved values (Base64) after the first save, which is obfuscation rather than strong encryption (see the sketch below).
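
One way to keep keys out of the job, sketched with a hypothetical properties file and key names (Talend context variables loaded with tContextLoad achieve the same effect):

```java
import java.io.FileInputStream;
import java.util.Properties;

public class ExternalCredentialsSketch {
    public static void main(String[] args) throws Exception {
        // File name and property keys are hypothetical, not taken from the session.
        Properties ctx = new Properties();
        try (FileInputStream in = new FileInputStream("C:\\temporal\\credenciales_s3.properties")) {
            ctx.load(in);
        }

        String accessKey = ctx.getProperty("s3.access.key");
        String secretKey = ctx.getProperty("s3.secret.key");

        // Pass accessKey/secretKey into the S3 connection instead of hard-coding them in the job.
        System.out.println("Loaded access key ending in ..."
                + accessKey.substring(Math.max(0, accessKey.length() - 4)));
    }
}
```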

Tools Used

  • Apache Hadoop (HDFS)
  • Amazon S3
  • Talend Data Integration (Talend Studio)
  • Hyper-V Virtual Machine (Windows/Ubuntu)

Common Pitfalls

  • Connecting tHDFSGet with a row link → causes a “component has output” warning.
  • Incorrect bucket or file key names → silent failure (no error shown).
  • Case sensitivity in file paths between Windows and Linux environments.
  • tFileDelete fails if the file doesn’t exist → disable “Fail on error” if deletion is optional.
  • Missing Java libraries for the S3 components → must be installed manually via the “Install” button.

Practice Suggestions

  • Recreate the S3 → HDFS job using different file types (JSON, Parquet).
  • Add a tFileInputDelimited component to read and transform team_nba.csv before uploading it.
  • Test job with multiple S3 buckets and validate error handling.
  • Use context variables (loaded with tContextLoad) to externalize paths and credentials for portability.