15 videos 📅 2025-01-27 09:00:00 America/Bahia_Banderas

Course recordings on DaDesktop for Training platform

Visit outline: Talend Big Data Integration (Course code: talendbigdata)

Categories: Big Data · Talend

Summary

Overview

This course session is a hands-on tutorial on using Talend Big Data Studio to extract, process, and analyze structured data from CSV files stored in a Hadoop cluster, and to integrate with an external AI service (OpenAI) for dynamic decision-making. The instructor demonstrates end-to-end workflows: JSON parsing with JSONPath, passing values between components through the global map (globalMap), email notification via SMTP, and data pipelines that filter, aggregate, sort, and write output. The session culminates in a business use case: identifying the 10 least-sold products for a specific city and gender segment, then preparing that output for AI-driven strategic recommendations.

Topic (Timeline)

1. JSON Extraction with JSONPath and Schema Configuration [00:00:01.840 - 00:05:59.120]

The session begins with a demonstration of extracting nested data from a JSON response using JSONPath. The instructor configures a “Test JSON” component in Talend, specifying a JSONPath that navigates from the root ($) through the first element of an array (choices[0]) to a nested object (message) and finally to its content field, i.e. $.choices[0].message.content. The schema is edited manually to define a single output field named “respuesta” of type String. The instructor emphasizes that paths are case-sensitive, that the path syntax must be correct, and that field names must match exactly. Running the job confirms that only the desired field is extracted and the rest of the JSON response is discarded.
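The path navigation described above can be illustrated outside Talend with a minimal sketch. It assumes the JSON has already been parsed into nested Map and List objects and that the nested object is named message, as in a chat-completion style response. The class JsonPathSketch, the resolve helper, and the sample data are hypothetical and support only dotted keys and [n] indexes; this is not the evaluator Talend uses.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Minimal JSONPath-style lookup over already-parsed JSON (Maps and Lists). */
final class JsonPathSketch {

    /** Resolves paths such as "$.choices[0].message.content"; returns null if any step is missing. */
    static Object resolve(Object root, String path) {
        if (!path.startsWith("$")) {
            throw new IllegalArgumentException("path must start at the root ($)");
        }
        Object current = root;
        // "$.choices[0].message.content" becomes the steps "choices[0]", "message", "content".
        for (String step : path.substring(1).split("\\.")) {
            if (step.isEmpty()) {
                continue;
            }
            int bracket = step.indexOf('[');
            String key = bracket < 0 ? step : step.substring(0, bracket);
            if (!key.isEmpty()) {
                if (!(current instanceof Map)) {
                    return null;
                }
                current = ((Map<?, ?>) current).get(key); // key lookup is case-sensitive
            }
            if (bracket >= 0) {
                int index = Integer.parseInt(step.substring(bracket + 1, step.length() - 1));
                if (!(current instanceof List)) {
                    return null;
                }
                List<?> list = (List<?>) current;
                if (index < 0 || index >= list.size()) {
                    return null;
                }
                current = list.get(index); // array indexes are 0-based
            }
            if (current == null) {
                return null;
            }
        }
        return current;
    }

    /** Hypothetical response shaped like the one described in the session. */
    static Map<String, Object> sampleResponse() {
        Map<String, Object> message = new LinkedHashMap<>();
        message.put("role", "assistant");
        message.put("content", "Hola");
        Map<String, Object> choice = new LinkedHashMap<>();
        choice.put("index", 0);
        choice.put("message", message);
        List<Object> choices = new ArrayList<>();
        choices.add(choice);
        Map<String, Object> root = new LinkedHashMap<>();
        root.put("choices", choices);
        return root;
    }

    public static void main(String[] args) {
        Object respuesta = resolve(sampleResponse(), "$.choices[0].message.content");
        System.out.println(respuesta);
    }
}
```

A path with the wrong letter case, such as $.choices[0].Message.content, resolves to null in this sketch, which mirrors the case-sensitivity point made in the session.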

2. Sending Extracted Data via Email with SMTP Configuration [00:09:08.120 - 00:14:40.750]

The instructor introduces the tSendMail component to send the extracted JSON content by email. Configuration includes the recipient address, the sender credentials (read from a local credenciales_correo file), the SMTP server (smtp.gmail.com) on port 587 with TLS enabled, and authentication with a generated app-specific password rather than the account password. The subject is set to “test” for initial validation. The instructor highlights security best practices: using app passwords, avoiding plaintext credentials, and verifying the server settings for other providers (e.g., Outlook, Yahoo).
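For readers who want to see the same settings outside the component dialog, here is a minimal sketch that collects them as standard JavaMail property names. It only builds the configuration and does not send mail. Whether the email component sets exactly these properties internally is not stated in the session, and the class name, method names, and the SMTP_APP_PASSWORD environment variable are hypothetical.

```java
import java.util.Properties;

/** Collects the Gmail SMTP settings from the session as JavaMail-style properties. */
final class SmtpSettingsSketch {

    /** Host, port, authentication, and STARTTLS for Gmail on port 587. */
    static Properties gmailStartTls(String user) {
        Properties props = new Properties();
        props.setProperty("mail.smtp.host", "smtp.gmail.com");
        props.setProperty("mail.smtp.port", "587");
        props.setProperty("mail.smtp.auth", "true");
        // Port 587 starts in plain text and upgrades to TLS via STARTTLS.
        props.setProperty("mail.smtp.starttls.enable", "true");
        props.setProperty("mail.smtp.user", user);
        return props;
    }

    /**
     * Reads the app-specific password from the environment instead of hard-coding it.
     * The variable name SMTP_APP_PASSWORD is an assumption made for this sketch.
     */
    static String appPasswordFromEnv() {
        return System.getenv("SMTP_APP_PASSWORD");
    }

    public static void main(String[] args) {
        Properties props = gmailStartTls("someone@example.com");
        props.forEach((key, value) -> System.out.println(key + " = " + value));
        System.out.println("app password set: " + (appPasswordFromEnv() != null));
    }
}
```

Keeping the password out of the property set makes it harder to print or log it by accident, which is in line with the advice to avoid plaintext credentials.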

3. Dynamic Data Flow Using Global Variables and TJavaRow [00:16:37.690 - 00:22:25.690]

Because the tSendMail component does not accept a direct input flow, the instructor uses a global variable (globalMap) to pass data between components. A tJavaRow component is inserted between the JSON extractor and the email sender. Its Java code stores the extracted value with globalMap.put("respuesta", input_row.respuesta); and the email component later reads it back with ((String)globalMap.get("respuesta")), where the cast is required because globalMap stores values as Object. The session includes debugging steps: fixing missing semicolons in the Java code and resolving null pointer errors caused by an incorrect JSONPath (e.g., a path that omits the content key).
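The put/get pattern can be tried in isolation with a minimal sketch. Here a plain HashMap stands in for the job-wide globalMap, and the null check shows one way to guard against the null pointer errors mentioned above. The class and method names are hypothetical; inside a real job only the put and get expressions would appear in the components.

```java
import java.util.HashMap;
import java.util.Map;

/** Simulates passing a value between two components through a shared map. */
final class GlobalMapSketch {

    // Stand-in for the job-wide globalMap, which holds values as Object.
    static final Map<String, Object> globalMap = new HashMap<>();

    /** What the Java row step does: store the extracted field under a fixed key. */
    static void storeRespuesta(String respuesta) {
        globalMap.put("respuesta", respuesta);
    }

    /** What the email step does: read the value back, casting from Object to String. */
    static String buildMessage() {
        String respuesta = (String) globalMap.get("respuesta");
        if (respuesta == null) {
            // A wrong JSONPath upstream leaves the value null; fail soft instead of throwing.
            return "(no response extracted)";
        }
        return "Respuesta: " + respuesta;
    }

    public static void main(String[] args) {
        System.out.println(buildMessage());   // nothing stored yet
        storeRespuesta("Hola");
        System.out.println(buildMessage());   // value now available
    }
}
```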

4. Big Data Pipeline: Aggregating and Filtering Sales Data from HDFS [00:32:16.370 - 01:03:52.400]

The instructor shifts to a Big Data use case: analyzing supermarket sales data from a CSV file in HDFS. A pipeline is built using:

  • tFileInputDelimited to read the CSV (comma-separated, with header)
  • tMap to filter rows for city = "Yangon" and gender = "male" using a Java expression (row.city.equals("Yangon") && row.gender.equals("male"))
  • tAggregateRow to group by product_line and sum total sales
  • tSortRow to sort by total sales in ascending order (least to most)
  • tRowGenerator to assign row numbers (1 to N)
  • tFilterRow to retain only the first 10 rows (row_number <= 10)
  • tFileOutputDelimited to write the final 10 products to a CSV file in local storage, with header and overwrite enabled.

The instructor emphasizes schema consistency, data type alignment, and the use of metadata to avoid configuration drift.
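The logic of this pipeline (filter, group and sum, sort ascending, keep the first N) can be sketched in plain Java streams to make each step concrete. This is an illustration of the data flow only, not the code the job generates. The Sale class, the leastSold method, and the sample rows are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Plain-Java sketch of the filter -> aggregate -> sort -> first-N steps. */
final class LeastSoldSketch {

    /** One CSV row, reduced to the columns the pipeline uses. */
    static final class Sale {
        final String city;
        final String gender;
        final String productLine;
        final double total;

        Sale(String city, String gender, String productLine, double total) {
            this.city = city;
            this.gender = gender;
            this.productLine = productLine;
            this.total = total;
        }
    }

    /** Returns up to {@code limit} product lines with the lowest summed sales for one segment. */
    static List<Map.Entry<String, Double>> leastSold(
            List<Sale> sales, String city, String gender, int limit) {
        // Filter step, then group by product line and sum the totals.
        Map<String, Double> totals = sales.stream()
                .filter(s -> s.city.equals(city) && s.gender.equals(gender))
                .collect(Collectors.groupingBy(s -> s.productLine,
                        Collectors.summingDouble(s -> s.total)));
        // Sort ascending by total (ties broken by name), then keep the first rows.
        return totals.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue()
                        .thenComparing(Map.Entry.<String, Double>comparingByKey()))
                .limit(limit)
                .collect(Collectors.toList());
    }

    /** Made-up rows for demonstration; not data from the course file. */
    static List<Sale> sampleSales() {
        return Arrays.asList(
                new Sale("Yangon", "male", "Health and beauty", 30.0),
                new Sale("Yangon", "male", "Health and beauty", 20.0),
                new Sale("Yangon", "male", "Sports and travel", 10.0),
                new Sale("Yangon", "female", "Sports and travel", 99.0),
                new Sale("Mandalay", "male", "Sports and travel", 5.0));
    }

    public static void main(String[] args) {
        int rowNumber = 1; // 1-based row number, as in the pipeline
        for (Map.Entry<String, Double> e : leastSold(sampleSales(), "Yangon", "male", 10)) {
            System.out.println(rowNumber++ + "," + e.getKey() + "," + e.getValue());
        }
    }
}
```

Applying the limit after the sort is what makes the result "least sold"; limiting before sorting would return an arbitrary subset.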

5. Debugging Data Type Mismatches and Schema Propagation [01:03:55.520 - 01:09:15.120]

The instructor encounters and resolves multiple runtime errors caused by schema mismatches:

  • A tRowGenerator outputting row_number as String instead of Integer
  • tFilterRow expecting an integer but receiving a string
  • tSortRow attempting to sort product_line (string) as a numeric field

Solutions include: changing the types in the tMap and tFileOutputDelimited schemas, ensuring consistent data types across all connected components, and reapplying schema changes after component edits. The session concludes with a successful execution of the pipeline, outputting the 10 least-sold products for men in Yangon.
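Why the type of a column such as row_number matters can be seen in a small plain-Java sketch: the same values order differently as text than as numbers, and a numeric condition like row_number <= 10 needs an integer to compare. The class and method names are hypothetical, and the sketch illustrates the underlying Java behavior rather than the exact error messages seen in the session.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Shows how String-typed numbers behave differently from Integer-typed ones. */
final class TypeMismatchSketch {

    /** Sorts the values as text: comparison is character by character. */
    static List<String> sortAsText(List<String> rowNumbers) {
        List<String> copy = new ArrayList<>(rowNumbers);
        Collections.sort(copy);
        return copy;
    }

    /** Converts to Integer first, then sorts numerically. */
    static List<Integer> sortAsNumbers(List<String> rowNumbers) {
        List<Integer> parsed = new ArrayList<>();
        for (String value : rowNumbers) {
            parsed.add(Integer.parseInt(value.trim()));
        }
        Collections.sort(parsed);
        return parsed;
    }

    /** The filter condition row_number <= 10, applied after an explicit conversion. */
    static boolean keepFirstTen(String rowNumber) {
        return Integer.parseInt(rowNumber.trim()) <= 10;
    }

    public static void main(String[] args) {
        List<String> values = new ArrayList<>();
        values.add("10");
        values.add("9");
        values.add("2");
        System.out.println(sortAsText(values));     // [10, 2, 9]
        System.out.println(sortAsNumbers(values));  // [2, 9, 10]
        System.out.println(keepFirstTen("11"));     // false
    }
}
```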

6. Course Wrap-up and Virtual Machine Management [01:09:17.600 - 01:10:49.010]

The instructor concludes the session by instructing participants to save and power down the virtual machine (not shut it off) to preserve state. A final note is made about an upcoming course evaluation, with the promise to continue the next day with examples on job versioning and Talend Studio features.

Appendix

Key Principles

  • JSONPath Navigation: Always start from $ (root), use [n] for array indexing (0-based), and respect case sensitivity.
  • GlobalMap for Dynamic Data: Use globalMap.put() in tJavaRow to store a value and globalMap.get() (with a cast) to read it in components that don’t accept a direct input flow.
  • Schema Consistency: Ensure data types (String, Integer, Double) match across all connected components. Mismatches cause runtime errors.
  • SMTP Security: Never use the account password for third-party apps. Use an app-specific password and enable TLS (STARTTLS) on port 587 for Gmail.
  • Filtering Logic: Use tMap for complex Java-based filtering; use tFilterRow for simple, UI-based conditions.

Tools Used

  • Talend Big Data Studio
  • HDFS (Hadoop Distributed File System)
  • OpenAI API (via HTTP request)
  • Gmail SMTP server
  • JSONPath evaluator (embedded in Talend’s Test JSON component)

Common Pitfalls

  • Missing semicolon in Java code within tJavaRow
  • Incorrect JSONPath due to a missing nested key (e.g., omitting the content key)
  • Using String instead of Integer for numeric fields in tRowGenerator or tFilterRow
  • Sorting string fields as numbers (e.g., product names as numeric values)
  • Forgetting to refresh metadata after CSV schema changes
  • Not enabling TLS or using wrong SMTP port for email delivery

Practice Suggestions

  1. Recreate the JSON extraction pipeline with a different API response structure.
  2. Modify the supermarket pipeline to filter by “female” and export results to a database instead of CSV.
  3. Add a tLogRow component to log the top 3 products before sending them by email.
  4. Integrate a retry mechanism for failed email sends using tLoop and tJavaRow error handling.
  5. Use tMap to calculate average sales per product and compare with the least-sold list.