Summary
Overview
This course session provides a hands-on demonstration of setting up a Big Data infrastructure using Hadoop on a virtualized Windows environment, integrating it with Talend Big Data for data ingestion and extraction. The instructor walks through the configuration of Hyper-V, Ubuntu-based Hadoop (HDFS and YARN), network connectivity verification, HDFS file system initialization, and the creation of Talend jobs to upload and download files between a local Windows system and HDFS. The session emphasizes practical workflow, context-based configuration for environment portability, and troubleshooting common setup issues.
Topic (Timeline)
1. Infrastructure Overview and Environment Setup [00:00:00 - 00:05:31]
The session begins with an explanation of the target architecture: a Windows host machine running a Hyper-V hypervisor, which hosts an Ubuntu virtual machine (VM) with Hadoop installed. The instructor outlines the layered structure: Windows → Hyper-V → Ubuntu (Hadoop) → Docker/Hive (for data warehousing). The goal is to enable Talend Big Data to communicate with Hadoop via HDFS and Hive. The instructor initiates the Ubuntu VM and prepares the workspace by opening draw.io (for diagramming), PowerShell (on Windows), and a terminal on Ubuntu. A Notepad file is created to log configuration commands.
2. Network Connectivity and Hadoop Initialization [00:05:31 - 00:18:09]
The instructor verifies network connectivity between the Windows host (IP: 10.0.3.15) and the Ubuntu VM (IP: 10.0.3.250) using ping commands in both directions. The Ubuntu VM is accessed using the user “Hedu” with password “hedu” (lowercase). The Hadoop HDFS file system is formatted using the command hdfs namenode -format to ensure a clean start. Hadoop services (NameNode, DataNode, ResourceManager, NodeManager) are started using start-dfs.sh and start-yarn.sh. The jps command is used to confirm running Hadoop daemons. Network ports 9000 (HDFS) and 9870 (HDFS web UI) are confirmed open via netstat -tuln, with 0.0.0.0 binding indicating external accessibility.
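For reference, a condensed sketch of the command sequence from this segment, assuming the session's IPs and a standard Hadoop 3.x install on the VM:

```bash
# On Windows (PowerShell): check that the Ubuntu VM answers
ping 10.0.3.250

# On the Ubuntu VM (user hedu): check that the Windows host answers
ping -c 4 10.0.3.15

# One-time clean format of the NameNode metadata (wipes any existing HDFS content)
hdfs namenode -format

# Start HDFS (NameNode/DataNode) and YARN (ResourceManager/NodeManager)
start-dfs.sh
start-yarn.sh

# Confirm the daemons are up and the ports are listening
jps
netstat -tuln | grep -E '9000|9870'   # a 0.0.0.0 binding means reachable from outside the VM
```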
3. HDFS File System Configuration and Web UI Validation [00:18:09 - 00:23:15]
The Hadoop web interface (http://10.0.3.250:9870) is accessed from the Windows browser to validate the Hadoop cluster status. The DataNode count is confirmed as 1. The HDFS directory structure is explored using the “Browse File System” utility, showing an empty filesystem after formatting. A new directory /inputs is created in HDFS using hdfs dfs -mkdir /inputs. Permissions for /inputs are set to 777 (read/write/execute for all) for academic purposes. The directory structure is listed with hdfs dfs -ls / to confirm creation.
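The corresponding HDFS commands, as run in the session (the wide-open 777 mode is for the exercise only):

```bash
# Create the target directory, open its permissions for the exercise, and confirm it exists
hdfs dfs -mkdir /inputs
hdfs dfs -chmod 777 /inputs
hdfs dfs -ls /
```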
4. Talend Big Data Job Creation: Upload to HDFS [00:23:15 - 00:48:39]
A new Talend job named Job_subir_archivo_Hadoop is created. Components are added: tHDFSConnection, tHDFSPut, and tWarn. The tHDFSConnection component is configured with:
- Distribution: Universal
- NameNode URI: hdfs://10.0.3.250:9000
- Authentication: Anonymous (default)
The tHDFSPut component is linked via OnComponentOk to ensure it executes only if the connection succeeds. The local source directory is set to C:\temp\descargas (created on Windows), and two files (productium.csv and team_nba.csv) are selected for upload to /inputs in HDFS. The "Action" is set to "Overwrite" to allow repeated job runs. Job execution is triggered, and the HDFS web UI is refreshed to confirm the successful upload.
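The upload itself is configured in the Talend Studio GUI, but it can be cross-checked (or mimicked) from the VM's command line; a sketch, assuming copies of the CSV files are available on the Ubuntu side:

```bash
# Command-line counterpart of tHDFSPut; -f behaves like the "Overwrite" action
hdfs dfs -put -f productium.csv team_nba.csv /inputs/

# Confirm the upload (same information as "Browse File System" in the 9870 web UI)
hdfs dfs -ls /inputs
```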
5. Troubleshooting and Component Configuration Fixes [00:48:39 - 00:52:37]
Several participants encounter errors during job execution. Common issues are diagnosed and resolved:
- Incorrect HDFS connection: tHDFSPut not using the defined tHDFSConnection (the "Use an existing connection" checkbox was unchecked).
- Incorrect local path separators: Windows backslashes (\) used instead of forward slashes (/) in paths.
- Files uploaded to the HDFS root instead of /inputs due to a misconfigured target directory.
The instructor guides participants to correct these via component reconfiguration and re-execution. All participants successfully upload files to /inputs.
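For the third issue, an alternative to re-running the job is a quick command-line cleanup; a sketch, assuming the misplaced files sit directly under the HDFS root:

```bash
# Locate files that were uploaded to the root by mistake
hdfs dfs -ls /

# Move them into the intended directory (or delete them and re-run the corrected job)
hdfs dfs -mv /productium.csv /team_nba.csv /inputs/
```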
6. Context-Based Configuration for Environment Portability [00:52:37 - 00:56:14]
To avoid hardcoding IP addresses and paths, the instructor introduces Talend contexts. A context group named “AXA” is created with two variables:
- direccion_servidor_hadoop: 10.0.3.250
- ruta_raiz: /inputs
These variables are imported into the job, replacing the hardcoded values in tHDFSConnection and tHDFSPut. This enables seamless migration to QA/production environments by simply changing the context values.
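Inside the job, the component fields reference these context variables instead of literals. As a loose command-line analogue of the same idea (not part of the session's Talend setup), the server address and root path can be parameterized so that only the values change per environment:

```bash
# Shell analogue of the Talend context: only these two values change between DEV/QA/PROD
direccion_servidor_hadoop=10.0.3.250
ruta_raiz=/inputs

# Addressing HDFS through the full URI makes the parameterization explicit
hdfs dfs -ls "hdfs://${direccion_servidor_hadoop}:9000${ruta_raiz}"
```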
7. Talend Job Creation: Download from HDFS [00:56:14 - 01:03:36]
A second job, Job_descargar_archivo_Hadoop, is created using tHDFSGet to download files from HDFS. The tHDFSGet component is configured directly (no connection component) using context variables:
- HDFS directory: context.ruta_raiz
- Local directory: C:\temp\descargas
- File to download: productium.csv
- New name: copy.csv
- Action: "Overwrite"
A tMsgBox component is connected via OnSubjobOk with a conditional Run If trigger. The condition checks tHDFSGet.NUMBER_OF_FILES == 0 to detect failed transfers, triggering a message if no files were downloaded. The "Outline" view is used to access job variables (e.g., NUMBER_OF_FILES) for the conditional logic. The session ends with a note about a participant access issue, resolved by confirming the user's inclusion in the course group.
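A command-line cross-check of the download step, run on the Ubuntu VM (tHDFSGet itself writes to the Windows path configured in the job); the /tmp target is illustrative:

```bash
# Rough counterpart of the NUMBER_OF_FILES == 0 check: verify the file exists before fetching
hdfs dfs -test -e /inputs/productium.csv || echo "no file to download"

# Counterpart of tHDFSGet with "Overwrite" and the new name: -f overwrites the local copy
hdfs dfs -get -f /inputs/productium.csv /tmp/copy.csv
```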
Appendix
Key Principles
- Layered Virtualization: Windows → Hyper-V → Ubuntu → Hadoop → Talend
- Hadoop Architecture: HDFS (storage) + YARN (processing) must be running for Talend integration.
- Network Isolation: VMs must be on the same subnet; static IPs are critical for reliable connectivity.
- HDFS Permissions: Use 777 in development; production requires stricter ACLs (see the sketch after this list).
- Context Variables: Never hardcode IPs or paths. Use Talend contexts for environment portability.
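A sketch of what stricter settings could look like for /inputs, assuming HDFS ACLs are enabled (dfs.namenode.acls.enabled=true); the group and the "talend" service user are hypothetical:

```bash
# Restrict the directory to its owner and group instead of world-writable 777
hdfs dfs -chown hedu:hadoop /inputs      # assumes a "hadoop" group exists on the cluster
hdfs dfs -chmod 750 /inputs

# Grant a specific integration user write access via an ACL (hypothetical "talend" user)
hdfs dfs -setfacl -m user:talend:rwx /inputs
```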
Tools Used
- Virtualization: Hyper-V (Windows)
- OS: Ubuntu 20.04+
- Big Data: Hadoop 3.x (HDFS, YARN)
- Data Integration: Talend Big Data 8.x
- Network Tools: ping, ipconfig, netstat, ifconfig
- File Management: hdfs dfs -mkdir, hdfs dfs -chmod, hdfs dfs -ls
Common Pitfalls
- Forgetting to check "Use an existing connection" in tHDFSPut → the component attempts to connect to a local HDFS instead.
- Using backslashes (\) in Linux/HDFS paths → use forward slashes (/).
- Not formatting HDFS before the first startup → corrupted state.
- Hardcoding IP addresses → breaks when moving to different environments.
- Not verifying connectivity with ping before starting Hadoop services.
Practice Suggestions
- Rebuild the entire environment from scratch without following the video.
- Create a second job to download team_nba.csv and rename it to nba_teams.csv.
- Add a tLogRow component to log the number of files transferred.
- Set up a second VM with a different IP and test context switching.
- Use tHDFSList to list files in /inputs before and after upload.