Summary
Overview
This course session provides a hands-on tutorial on integrating Apache Hadoop’s HDFS and Amazon S3 with Talend Data Integration tools. The instructor demonstrates how to configure HDFS and S3 connections via metadata repositories, create and execute ETL jobs to download files from S3, transfer them to HDFS, and clean up temporary local files. The session emphasizes practical workflow design using Talend components such as T-Pre-Job, T-Post-Job, T-S3_Get, T-HDFS_Put, and T-File_Delete, while highlighting configuration best practices, path syntax, connection reuse, and error handling. The session concludes with guidance on saving virtual machine states to preserve configurations.
Topic (Timeline)
1. HDFS Connection Setup and Metadata Reuse [00:00:00 - 00:06:08]
- Introduced Hadoop cluster metadata in Talend, including Hive, HBase, and HCatalog as optional components.
- Focused on creating an HDFS connection named HDFS_AXA via right-click on the Hadoop cluster → “Create HDFS”.
- Configured the default HDFS input settings: line separator (\n), field delimiter (;), and verified the connection via “Check” (a plain-Java equivalent is sketched after this list).
- Demonstrated how to reference pre-configured HDFS metadata connections instead of rebuilding them manually.
- Emphasized that HDFS connections are stored in metadata for reuse across jobs.
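For orientation, here is a minimal sketch of what such a connection amounts to underneath Talend, using the Hadoop FileSystem API directly. The NameNode URI and user are placeholders (assumptions); in the session the real values live in the HDFS_AXA repository entry.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsConnectionSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder NameNode URI and user: in the session these come from the
        // HDFS_AXA metadata entry in the Talend repository, not from code.
        URI nameNode = new URI("hdfs://namenode-host:8020");
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", nameNode.toString());

        // Rough equivalent of the "Check" button: open the connection and list "/".
        try (FileSystem fs = FileSystem.get(nameNode, conf, "hadoop")) {
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath());
            }
        }
    }
}
```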
2. HDFS File Download Job with T-HDFS_Get [00:06:09 - 00:09:51]
- Created a new job named Job_Descargar_Fichero_Hadoop to download a file (production.csv) from HDFS.
- Added T-HDFS_Get and T-MsgBox components and connected them via a trigger link (not a row link).
- Configured T-HDFS_Get to use the metadata connection HDFS_AXA via the “Repository” option.
- Set the HDFS directory to AXA/ (not the file itself), the local destination to temporal/, and the action to “Overwrite”.
- Named the output file production.copy.csv.
- Resolved a configuration error: the components were initially connected via a row link → corrected to a trigger link (on-component ok) to avoid the “component has output” warning.
- Executed the job successfully; confirmed the file appeared in the local temporal/ folder (see the sketch below for the underlying HDFS call).
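What T-HDFS_Get does here corresponds roughly to a single copyToLocalFile call. The sketch below assumes the same paths as the job and a placeholder NameNode URI.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsGetSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // placeholder NameNode

        try (FileSystem fs = FileSystem.get(conf)) {
            // Rough equivalent of T-HDFS_Get with action "Overwrite": copy the file
            // from the AXA/ HDFS directory into the local temporal/ folder.
            fs.copyToLocalFile(false,                               // keep the HDFS source
                    new Path("/AXA/production.csv"),                // HDFS source
                    new Path("C:/temporal/production.copy.csv"));   // local destination
        }
    }
}
```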
3. S3 Integration and Job Architecture with T-Pre-Job / T-Post-Job [00:10:57 - 00:19:29]
- Introduced a more complex job, Job_Formatear_Archivo_NBA: download from S3 → process → upload to HDFS.
- Created T-Pre-Job (for initialization) and T-Post-Job (for cleanup) to structure the job logic.
- Reused the existing HDFS connection by copying and pasting it from a previous job.
- Added a T-S3_Connection component; installed its missing Java dependencies via the “Install” button.
- Configured the S3 connection using static credentials (Access Key and Secret Key) taken from a local file (credenciales_s3) and enclosed in double quotes as strings.
- Connected T-Pre-Job → T-HDFS_Connection → T-S3_Connection via on-component ok to ensure sequential initialization.
- Configured T-S3_Close and T-Post-Job to close the connections cleanly (see the sketch below).
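A minimal standalone sketch of the same connect-then-close pattern with the AWS SDK for Java (v1) follows; the key values and region are placeholders, and the real keys in the session come from the credenciales_s3 file.

```java
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class S3ConnectionSketch {
    public static void main(String[] args) {
        // Placeholder credentials and region; never hard-code real keys.
        BasicAWSCredentials credentials =
                new BasicAWSCredentials("ACCESS_KEY_HERE", "SECRET_KEY_HERE");

        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withCredentials(new AWSStaticCredentialsProvider(credentials))
                .withRegion("eu-west-1")
                .build();

        // Rough connection check: list the buckets this key pair can see.
        s3.listBuckets().forEach(bucket -> System.out.println(bucket.getName()));

        // Counterpart of T-S3_Close in T-Post-Job: release the client's resources.
        s3.shutdown();
    }
}
```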
4. S3 File Download and Local File Management [00:19:59 - 00:26:41]
- Added a T-S3_Get component; configured the bucket name (NoblePro) and the file key (team_nba.csv).
- Set the local destination path manually using Windows-style escaped backslashes: C:\\temporal\\team.csv.
- Used a Note component to document the job purpose: “Download team_nba.csv from S3”.
- Added a T-File_Delete component to remove the local copy after the upload, linked via on-component ok to T-S3_Get.
- Copied the local file path from T-S3_Get into T-File_Delete to ensure the correct deletion target (a plain-SDK equivalent of the download is sketched below).
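For comparison, downloading the same object with the AWS SDK for Java (v1) is a single getObject call. Bucket, key, and local path below mirror the session's values; the client here falls back to the default credentials chain rather than the explicit keys configured earlier.

```java
import java.io.File;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;

public class S3GetSketch {
    public static void main(String[] args) {
        // Uses the default credentials chain instead of explicit keys.
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Note the doubled backslashes required inside a Java string literal,
        // matching the path syntax used in the Talend component.
        File localCopy = new File("C:\\temporal\\team.csv");

        // Rough equivalent of T-S3_Get: download the object straight to disk.
        s3.getObject(new GetObjectRequest("NoblePro", "team_nba.csv"), localCopy);

        System.out.println("Downloaded " + localCopy.length() + " bytes");
        s3.shutdown();
    }
}
```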
5. HDFS Upload and Cross-Platform Path Issues [00:27:41 - 00:33:57]
- Added a T-HDFS_Put component to upload team_nba.csv from the local temporal/ folder to HDFS AXA/.
- Configured the connection via T-HDFS_Connection (opened in T-Pre-Job).
- Set the local source to C:\temporal\team.csv and the remote target to /AXA/team_nba.csv.
- Executed the job; confirmed the file appeared in HDFS (79 MB); see the sketch after this list for the underlying calls.
- Local file deleted successfully.
- Debugged an error on Fernanda’s machine: T-File_Delete failed because the file didn’t exist, caused by an incorrect bucket name or a case-sensitive path (Tim vs tim).
- Noted the inconsistency: the Windows file system is case-insensitive, but Talend/Java behavior may vary depending on the underlying OS (Ubuntu vs Windows).
- Concluded that path casing and bucket names must be exact.
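The upload-then-clean-up sequence of this section reduces to roughly the sketch below (placeholder NameNode URI, same local and HDFS paths as the job); Files.deleteIfExists also shows one way to avoid the failure seen when the local file is missing.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // placeholder NameNode

        try (FileSystem fs = FileSystem.get(conf)) {
            // Rough equivalent of T-HDFS_Put: keep the local source, overwrite the target.
            fs.copyFromLocalFile(false, true,
                    new Path("C:/temporal/team.csv"),   // local source
                    new Path("/AXA/team_nba.csv"));     // HDFS target
        }

        // Counterpart of T-File_Delete; deleteIfExists simply returns false
        // instead of failing when the path (or its casing) does not match a file.
        boolean deleted = Files.deleteIfExists(Paths.get("C:\\temporal\\team.csv"));
        System.out.println("Local copy removed: " + deleted);
    }
}
```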
6. Virtual Machine State Preservation and Next Steps [00:34:47 - 00:35:39]
- Instructed students to save the Hyper-V virtual machine state via right-click → “Save” to preserve all configurations.
- Advised to shut down the Windows VM after saving to avoid usage charges.
- Mentioned upcoming session on Talend’s proprietary version control system (Git integration and job versioning), to be demonstrated next time.
Appendix
Key Principles
- Metadata Reuse: Always define connections (HDFS, S3) in metadata and reference them via “Repository” to avoid duplication and ensure consistency.
- Job Structure: Use T-Pre-Job for initialization (connections, variables) and T-Post-Job for cleanup (disconnections, file deletion).
- Trigger Links: Use on-component ok (trigger) instead of row links when components do not emit data rows.
- Path Syntax: Use escaped backslashes (C:\\folder\\file) for absolute Windows paths in Talend components.
- Credential Security: Store sensitive keys externally; Talend base64-encodes them after the first save, which is obfuscation rather than true encryption (see the sketch below).
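To make the last point concrete, the short sketch below (with a placeholder secret) shows that base64 is trivially reversible and should not be mistaken for protection.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64IsNotEncryption {
    public static void main(String[] args) {
        // Placeholder secret, not a real credential.
        String secret = "MY_SECRET_KEY";

        // Base64 only re-encodes the bytes; anyone who can read the stored value
        // can decode it back in one line.
        String stored = Base64.getEncoder()
                .encodeToString(secret.getBytes(StandardCharsets.UTF_8));
        String recovered = new String(Base64.getDecoder().decode(stored),
                StandardCharsets.UTF_8);

        System.out.println(stored);     // TVlfU0VDUkVUX0tFWQ==
        System.out.println(recovered);  // MY_SECRET_KEY
    }
}
```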
Tools Used
- Apache Hadoop (HDFS)
- Amazon S3
- Talend Data Integration (Talend Studio)
- Hyper-V Virtual Machine (Windows/Ubuntu)
Common Pitfalls
- Connecting T-HDFS_Get via a row link → causes the “component has output” warning.
- Incorrect bucket or file key names → silent failure (no error shown).
- Case sensitivity in file paths between Windows and Linux environments.
- T-File_Delete fails if the file doesn’t exist → disable “Fail on error” if deletion is optional.
- Missing Java dependencies for the S3 component → requires a manual “Install” step.
Practice Suggestions
- Recreate the S3 → HDFS job using different file types (JSON, Parquet).
- Add a T-File_Input component to read and transform team_nba.csv before upload.
- Test the job with multiple S3 buckets and validate error handling.
- Use T-Context to externalize paths and credentials for portability (a plain-Java analogue is sketched below).
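As a rough analogue of what context variables provide inside a job, the sketch below loads paths and credentials from an external properties file; the file name and keys are hypothetical, not Talend's.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class ContextLoadSketch {
    public static void main(String[] args) throws IOException {
        // Hypothetical context file (context_dev.properties), e.g.:
        //   s3.bucket=NoblePro
        //   s3.key=team_nba.csv
        //   local.dir=C:\\temporal
        Properties context = new Properties();
        try (FileInputStream in = new FileInputStream("context_dev.properties")) {
            context.load(in);
        }

        String bucket = context.getProperty("s3.bucket");
        String key = context.getProperty("s3.key");
        String localDir = context.getProperty("local.dir", "C:\\temporal");
        System.out.println("Would read " + key + " from bucket " + bucket
                + " into " + localDir);
    }
}
```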