Summary
Overview
This course session provides a hands-on demonstration of setting up a Big Data infrastructure using Hadoop on a virtualized Windows environment, integrating it with Talend Big Data for data ingestion and extraction. The instructor walks through the configuration of Hyper-V, Ubuntu-based Hadoop (HDFS and YARN), network connectivity verification, HDFS file system initialization, and the creation of Talend jobs to upload and download files between a local Windows system and HDFS. The session emphasizes practical workflow, context-based configuration for environment portability, and troubleshooting common setup issues.
Topic (Timeline)
1. Infrastructure Overview and Environment Setup [00:00:00 - 00:05:31]
The session begins with an explanation of the target architecture: a Windows host machine running a Hyper-V hypervisor, which hosts an Ubuntu virtual machine (VM) with Hadoop installed. The instructor outlines the layered structure: Windows → Hyper-V → Ubuntu (Hadoop) → Docker/Hive (for data warehousing). The goal is to enable Talend Big Data to communicate with Hadoop via HDFS and Hive. The instructor initiates the Ubuntu VM and prepares the workspace by opening draw.io (for diagramming), PowerShell (on Windows), and a terminal on Ubuntu. A Notepad file is created to log configuration commands.
2. Network Connectivity and Hadoop Initialization [00:05:31 - 00:18:09]
The instructor verifies network connectivity between the Windows host (IP: 10.0.3.15) and the Ubuntu VM (IP: 10.0.3.250) using ping commands in both directions. The Ubuntu VM is accessed using the user “Hedu” with password “hedu” (lowercase). The Hadoop HDFS file system is formatted using the command hdfs namenode -format to ensure a clean start. Hadoop services (NameNode, DataNode, ResourceManager, NodeManager) are started using start-dfs.sh and start-yarn.sh. The jps command is used to confirm running Hadoop daemons. Network ports 9000 (HDFS) and 9870 (HDFS web UI) are confirmed open via netstat -tuln, with 0.0.0.0 binding indicating external accessibility.
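For reference, a condensed sketch of the command sequence from this segment, assuming the session's IPs and a standard Hadoop 3.x install on the VM:

```bash
# On Windows (PowerShell): check that the Ubuntu VM answers
ping 10.0.3.250

# On the Ubuntu VM (user hedu): check that the Windows host answers
ping -c 4 10.0.3.15

# One-time clean format of the NameNode metadata (wipes any existing HDFS content)
hdfs namenode -format

# Start HDFS (NameNode/DataNode) and YARN (ResourceManager/NodeManager)
start-dfs.sh
start-yarn.sh

# Confirm the daemons are up and the ports are listening
jps
netstat -tuln | grep -E '9000|9870'   # a 0.0.0.0 binding means reachable from outside the VM
```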
3. HDFS File System Configuration and Web UI Validation [00:18:09 - 00:23:15]
The Hadoop web interface (http://10.0.3.250:9870) is accessed from the Windows browser to validate the Hadoop cluster status. The DataNode count is confirmed as 1. The HDFS directory structure is explored using the “Browse File System” utility, showing an empty filesystem after formatting. A new directory /inputs is created in HDFS using hdfs dfs -mkdir /inputs. Permissions for /inputs are set to 777 (read/write/execute for all) for academic purposes. The directory structure is listed with hdfs dfs -ls / to confirm creation.
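The corresponding HDFS commands, as run in the session (the wide-open 777 mode is for the exercise only):

```bash
# Create the target directory, open its permissions for the exercise, and confirm it exists
hdfs dfs -mkdir /inputs
hdfs dfs -chmod 777 /inputs
hdfs dfs -ls /
```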
4. Talend Big Data Job Creation: Upload to HDFS [00:23:15 - 00:48:39]
A new Talend job named Job_subir_archivo_Hadoop is created. Components are added: tHDFSConnection, tHDFSPut, and tWarn. The tHDFSConnection component is configured with:
- Distribution: Universal
- NameNode URI: hdfs://10.0.3.250:9000
- Authentication: Anonymous (default)
The tHDFSPut component is linked via OnComponentOk to ensure it executes only if the connection succeeds. The local source directory is set to C:\temp\descargas (created on Windows), and two files (productium.csv and team_nba.csv) are selected for upload to /inputs in HDFS. The "Action" is set to "Overwrite" to allow repeated job runs. Job execution is triggered, and the HDFS web UI is refreshed to confirm the successful upload.
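The upload itself is configured in the Talend Studio GUI, but it can be cross-checked (or mimicked) from the VM's command line; a sketch, assuming copies of the CSV files are available on the Ubuntu side:

```bash
# Command-line counterpart of tHDFSPut; -f behaves like the "Overwrite" action
hdfs dfs -put -f productium.csv team_nba.csv /inputs/

# Confirm the upload (same information as "Browse File System" in the 9870 web UI)
hdfs dfs -ls /inputs
```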
5. Troubleshooting and Component Configuration Fixes [00:48:39 - 00:52:37]
Several participants encounter errors during job execution. Common issues are diagnosed and resolved:
- Incorrect HDFS connection: tHDFSPut not using the defined tHDFSConnection (the "Use an existing connection" checkbox was unchecked).
- Incorrect local path separators: Windows backslashes (\) used instead of forward slashes (/) in paths.
- Files uploaded to the HDFS root instead of /inputs due to a misconfigured target directory.
The instructor guides participants to correct these via component reconfiguration and re-execution. All participants successfully upload files to /inputs.
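For the third issue, an alternative to re-running the job is a quick command-line cleanup; a sketch, assuming the misplaced files sit directly under the HDFS root:

```bash
# Locate files that were uploaded to the root by mistake
hdfs dfs -ls /

# Move them into the intended directory (or delete them and re-run the corrected job)
hdfs dfs -mv /productium.csv /team_nba.csv /inputs/
```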
6. Context-Based Configuration for Environment Portability [00:52:37 - 00:56:14]
To avoid hardcoding IP addresses and paths, the instructor introduces Talend contexts. A context group named “AXA” is created with two variables:
- direccion_servidor_hadoop: 10.0.3.250
- ruta_raiz: /inputs
These variables are imported into the job, replacing the hardcoded values in tHDFSConnection and tHDFSPut. This enables seamless migration to QA/production environments by simply changing the context values.
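Inside the job, the component fields reference these context variables instead of literals. As a loose command-line analogue of the same idea (not part of the session's Talend setup), the server address and root path can be parameterized so that only the values change per environment:

```bash
# Shell analogue of the Talend context: only these two values change between DEV/QA/PROD
direccion_servidor_hadoop=10.0.3.250
ruta_raiz=/inputs

# Addressing HDFS through the full URI makes the parameterization explicit
hdfs dfs -ls "hdfs://${direccion_servidor_hadoop}:9000${ruta_raiz}"
```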
7. Talend Job Creation: Download from HDFS [00:56:14 - 01:03:36]
A second job, Job_descargar_archivo_Hadoop, is created using tHDFSGet to download files from HDFS. The tHDFSGet component is configured directly (no connection component) using context variables:
- HDFS directory: context.ruta_raiz
- Local directory: C:\temp\descargas
- File to download: productium.csv
- New name: copy.csv
- Action: "Overwrite"
A tMsgBox component is connected via OnSubjobOk with a conditional Run If trigger. The condition checks tHDFSGet.NUMBER_OF_FILES == 0 to detect failed transfers, triggering a message if no files were downloaded. The "Outline" view is used to access job variables (e.g., NUMBER_OF_FILES) for the conditional logic. The session ends with a note about a participant access issue, resolved by confirming the user's inclusion in the course group.
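A command-line cross-check of the download step, run on the Ubuntu VM (tHDFSGet itself writes to the Windows path configured in the job); the /tmp target is illustrative:

```bash
# Rough counterpart of the NUMBER_OF_FILES == 0 check: verify the file exists before fetching
hdfs dfs -test -e /inputs/productium.csv || echo "no file to download"

# Counterpart of tHDFSGet with "Overwrite" and the new name: -f overwrites the local copy
hdfs dfs -get -f /inputs/productium.csv /tmp/copy.csv
```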
Appendix
Key Principles
- Layered Virtualization: Windows → Hyper-V → Ubuntu → Hadoop → Talend
- Hadoop Architecture: HDFS (storage) + YARN (processing) must be running for Talend integration.
- Network Isolation: VMs must be on the same subnet; static IPs are critical for reliable connectivity.
- HDFS Permissions: Use 777 in development; production requires stricter ACLs (see the sketch after this list).
- Context Variables: Never hardcode IPs or paths. Use Talend contexts for environment portability.
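A sketch of what stricter settings could look like for /inputs, assuming HDFS ACLs are enabled (dfs.namenode.acls.enabled=true); the group and the "talend" service user are hypothetical:

```bash
# Restrict the directory to its owner and group instead of world-writable 777
hdfs dfs -chown hedu:hadoop /inputs      # assumes a "hadoop" group exists on the cluster
hdfs dfs -chmod 750 /inputs

# Grant a specific integration user write access via an ACL (hypothetical "talend" user)
hdfs dfs -setfacl -m user:talend:rwx /inputs
```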
Tools Used
- Virtualization: Hyper-V (Windows)
- OS: Ubuntu 20.04+
- Big Data: Hadoop 3.x (HDFS, YARN)
- Data Integration: Talend Big Data 8.x
- Network Tools: ping, ipconfig, netstat, ifconfig
- File Management: hdfs dfs -mkdir, hdfs dfs -chmod, hdfs dfs -ls
Common Pitfalls
- Forgetting to check "Use an existing connection" in tHDFSPut → the component attempts to connect to a local HDFS instead.
- Using backslashes (\) in Linux/HDFS paths → use forward slashes (/).
- Not formatting HDFS before the first startup → corrupted state.
- Hardcoding IP addresses → breaks when moving to different environments.
- Not verifying connectivity with ping before starting Hadoop services.
Practice Suggestions
- Rebuild the entire environment from scratch without following the video.
- Create a second job to download team_nba.csv and rename it to nba_teams.csv.
- Add a tLogRow component to log the number of files transferred.
- Set up a second VM with a different IP and test context switching.
- Use tHDFSList to list files in /inputs before and after upload.