15 videos 📅 2025-01-27 to 2025-01-31 (America/Bahia_Banderas)

  • 2025-01-27 09:00:00 (24:24)
  • 2025-01-27 13:13:59 (2:06:12)
  • 2025-01-27 13:42:41 (3:36:29)
  • 2025-01-28 09:08:14 (4:33)
  • 2025-01-28 13:48:42 (55:46)
  • 2025-01-28 14:06:51 (2:02)
  • 2025-01-29 10:22:33 (1:02:14)
  • 2025-01-29 10:25:14 (2:10)
  • 2025-01-29 11:38:26 (2:26)
  • 2025-01-29 12:03:00 (1:23:37)
  • 2025-01-29 12:05:56 (35:40)
  • 2025-01-29 15:01:26 (1:40:43)
  • 2025-01-30 09:07:07 (1:08:48)
  • 2025-01-30 11:20:20 (1:10:50)
  • 2025-01-30 13:15:56 (3:50:03)
  • 2025-01-31 07:20:07


Course recordings hosted on the DaDesktop training platform.

Related courses are listed on the NobleProg websites.

Course outline: Talend Big Data Integration (course code: talendbigdata)

Categories: Big Data · Talend

Summary

Overview

This course provides comprehensive, hands-on training on the Talend Data Fabric platform, focusing on two core modules: Talend Data Stewardship (for data curation and campaign management) and Talend Data Preparation (for data cleaning, transformation, and preparation). The session guides participants through practical exercises in setting up trial accounts, creating data pipelines, managing data campaigns (joins, field mapping, task assignment), and using Talend Data Preparation to clean, enrich, and format structured datasets for downstream analytics tools. The instructor emphasizes real-world data quality challenges, such as column name mismatches, inconsistent formats, missing values, and data privacy, while demonstrating the platform’s self-service capabilities for non-technical users.

Topic (Timeline)

1. Environment Setup & Talend Data Stewardship Overview [00:00:00 - 00:29:43]

  • Participants are instructed to access the Talend Data Fabric platform via Chrome or Edge, register for a free trial using a corporate email (e.g., @daxa), and select a cloud provider (AWS or Azure) once the trial is activated.
  • The instructor confirms that participants have imported two datasets from S3 into Talend Cloud, “pasajeros” and “vuelos,” for use in the exercises.
  • The core roles in Talend Data Stewardship are explained: Campaign Creator (defines the data model, roles, and workflow), Campaign Operator (populates tasks), and Data Steward (reviews and corrects tasks).
  • A practical workflow is demonstrated: creating a pipeline in Pipeline Designer to join the “pasajeros” and “vuelos” datasets on “número de vuelo,” with emphasis on saving changes manually and handling column name mismatches (e.g., “nombre” → “nombre_cliente”).
  • The Field Selector component is used to remap source column names to the target campaign model’s names, ensuring data populates correctly.
  • Task assignment is configured: tasks are inserted into a “revisión” (review) state and assigned to a specific steward (Luis Martínez), with priority levels explained.
  • The pipeline is executed on the Spark Local engine; participants observe task generation in the Data Stewardship interface.
  • Common pitfalls are highlighted: case sensitivity in column names, forgetting to click “Save” after edits, and delays caused by shared infrastructure.
  • The session concludes with demonstrations of other campaign types: Merging (identifying duplicate records), Arbitration (yes/no decision workflows), and Grouping (segmenting data by attributes such as location or social stratum).
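The join-and-remap step above can be sketched in plain Python (this is an illustration of the logic, not Talend's API; the sample rows are made up, while the dataset names, join key, and the “nombre” → “nombre_cliente” remap come from the exercise):

```python
# Plain-Python sketch (not Talend's API) of the pipeline's join-and-remap step.
pasajeros = [
    {"numero_de_vuelo": 101, "nombre": "Luis"},
    {"numero_de_vuelo": 102, "nombre": "Ana"},
]
vuelos = [
    {"numero_de_vuelo": 101, "destino": "CDMX"},
    {"numero_de_vuelo": 102, "destino": "GDL"},
]

# Join both datasets on the flight-number key, as in the Pipeline Designer exercise.
vuelos_by_key = {v["numero_de_vuelo"]: v for v in vuelos}
joined = [{**p, **vuelos_by_key[p["numero_de_vuelo"]]} for p in pasajeros]

# Field Selector equivalent: remap a source column name to the campaign model's
# name. Note the exact-case match; column matching is case-sensitive.
remap = {"nombre": "nombre_cliente"}
joined = [{remap.get(k, k): v for k, v in row.items()} for row in joined]
print(joined[0])  # → {'numero_de_vuelo': 101, 'nombre_cliente': 'Luis', 'destino': 'CDMX'}
```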

2. Talend Data Preparation: Core Concepts & Interface [00:29:43 - 00:49:28]

  • Transition to Talend Data Preparation: participants are instructed to access the module and upload six sample CSV datasets (ejercicio1.csv to ejercicio6.csv) via drag-and-drop, working across multiple browser tabs.
  • The purpose of Talend Data Preparation is defined: enabling non-technical users (citizen data scientists) to clean, format, and transform structured data (CSV files, tables) for tools such as Power BI, Tableau, or ML platforms, without depending on developers.
  • Key concepts introduced:
    • Dataset: Raw structured data (e.g., a CSV file) stored in Talend Cloud.
    • Preparation: A link between a dataset and a recipe (sequence of functions).
    • Recipe: A series of predefined functions (e.g., concatenate, rename, filter) applied to columns, rows, or the entire dataset.
    • Semantic Types: Auto-classification of columns (e.g., “email,” “first_name”) based on content patterns, shared with Talend Data Stewardship.
  • The interface is explored: left panel (recipes), center (data preview), right panel (function menu). Data is previewed on 10,000 rows by default; users can adjust this.
  • Semantic type detection is demonstrated: the system infers data types (integer, text) and semantic categories (e.g., “first_name”) from sample data, with manual override possible.
  • The instructor warns of known limitations: JSON/XML file support is unreliable, and the system is case-sensitive in matching.
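The dataset/preparation/recipe split above can be illustrated with a minimal sketch (plain Python, not Talend's API; the sample data and function names are invented): a recipe is just an ordered list of column functions replayed against a dataset.

```python
# Minimal illustration (not Talend's API): a "recipe" as an ordered list of steps.
dataset = [
    {"first_name": "ana", "email": "ANA@EXAMPLE.COM"},
    {"first_name": "luis", "email": "Luis@Example.com"},
]

# Each recipe step names a target column and a transformation function,
# analogous to the predefined functions in the right-hand menu.
recipe = [
    ("first_name", str.capitalize),  # e.g. an "uppercase first letter" function
    ("email", str.lower),            # normalize email casing
]

def apply_recipe(rows, steps):
    """Replay every recipe step over every row, like running a preparation."""
    for column, func in steps:
        for row in rows:
            row[column] = func(row[column])
    return rows

prepared = apply_recipe(dataset, recipe)
print(prepared[0])  # → {'first_name': 'Ana', 'email': 'ana@example.com'}
```

In the real product the recipe is stored separately from the dataset, which is why one recipe can be reapplied when the underlying data is refreshed.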

3. Data Preparation: Core Functions & Data Quality [00:49:28 - 01:56:10]

  • String Functions:
    • concatenate: Merges columns (e.g., nombre + apellido) with optional separator; preview and submit for permanence.
    • contains text: Case-sensitive substring search; users learn to use regex for case-insensitive matching.
    • extract by index / from n before: Extract substrings (e.g., domain from URL) using position-based logic.
    • magic fill: Learns transformation patterns from example input-output pairs (e.g., “John Doe” → “J. Doe”) to auto-format names.
    • match pattern: Regex-based search (e.g., find URLs starting with “con”) with case-insensitive flag ((?i)).
    • replace: Regex-based replacement (e.g., remove everything after the first space in names using the pattern \s.*$).
    • remove trailing/leading characters: Standardizes whitespace.
  • Data Quality & Profiling:
    • Column profiling shows statistics: count, distinct, duplicates, nulls, min/max, standard deviation.
    • Color coding: green (valid), red (invalid), black (null).
  • Advanced String Functions:
    • match similar text: Uses fuzzy logic to detect typos (e.g., “Emily” vs. “Emilý”); configurable by edit distance.
    • remove consecutive characters: Standardizes repeated characters (e.g., “kelly” → “kely”).
    • remove non-numeric/non-alphanumeric: Cleans fields (e.g., extract only numbers from ID, only letters from fruit names).
    • simplify text: Removes accents and converts to lowercase (e.g., “José María” → “jose maria”).
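Several of the string functions above map onto standard regex and Unicode operations. A Python sketch of the equivalents (the sample values are made up; the patterns mirror the ones used in the session):

```python
import re
import unicodedata

# "match pattern" with the (?i) inline flag: case-insensitive search.
assert re.search(r"(?i)con", "CONTOSO.com") is not None

# "replace": drop everything after the first whitespace (name-cleaning exercise).
assert re.sub(r"\s.*$", "", "John Doe Jr.") == "John"

# "remove non-numeric": keep only the digits of an ID field.
assert re.sub(r"\D", "", "ID-4521/MX") == "4521"

def simplify_text(value: str) -> str:
    """Strip accents and lowercase, like the 'simplify text' function."""
    decomposed = unicodedata.normalize("NFKD", value)  # split base char + accent
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

print(simplify_text("José María"))  # → 'jose maria'
```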

4. Data Preparation: Conversions, Cleaning & Masking [01:56:10 - 03:13:52]

  • Conversions:
    • convert country code: Translates country names to ISO 3166-1 codes (e.g., “Francia” → “FR”) and vice versa; case sensitivity and language (English) are critical.
    • convert distance: Converts meters to kilometers, miles to meters with configurable decimal precision.
    • convert duration: Converts hours to days (e.g., 5 hours → 0.2 days).
    • convert temperature: Converts Fahrenheit to Celsius.
  • Cleaning Functions:
    • clear: Removes content from cells matching a value or regex (e.g., clear all “Carlos” entries).
    • delete row: Removes entire rows where a column matches a condition (e.g., delete rows where gender = “M”).
    • fill cell with value: Replaces nulls with a specified value (e.g., “F”).
    • fill in front/back: Propagates the last/next non-null value upward/downward (e.g., for smoothing time-series data).
  • Data Masking:
    • hash data: Converts sensitive fields (e.g., passwords) into irreversible hash values.
    • mask data: Obscures parts of sensitive data (e.g., email: “a****@domain.com”) using “replace n first characters” with customizable mask (e.g., “x”).
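The masking ideas above can be sketched with Python's standard library (the helper names and the exact mask shape are my own, not Talend's; Talend's "mask data" offers more patterns than this):

```python
import hashlib

def hash_field(value: str) -> str:
    """Irreversible one-way hash of a sensitive field, as with 'hash data'."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def mask_email(email: str, mask_char: str = "*") -> str:
    """Keep the first character of the local part and mask the rest,
    similar to the "a****@domain.com" example from the session."""
    local, domain = email.split("@", 1)
    return local[0] + mask_char * (len(local) - 1) + "@" + domain

print(mask_email("alicia@domain.com"))  # → 'a*****@domain.com'
print(len(hash_field("secret")))        # → 64 (hex digest length of SHA-256)
```

Hashing is appropriate when the value must never be recovered (passwords); masking preserves part of the value for human review.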

5. Date Functions & Final Exercises [03:13:52 - 03:36:23]

  • Date Functions:
    • calculate time since: Computes age from birth date to “now” in years.
    • change date format: Converts dates to custom formats using Java date pattern syntax (e.g., “yyyy-MM-dd” → “MM/dd/yyyy”).
    • compare dates: Compares two dates with operators (e.g., “registration_date > birth_date”).
    • convert to epoch/julian: Transforms dates to numeric representations (Unix epoch, Julian day) for storage or analysis.
  • Extract Date Parts: Extracts year, month, day from a date field (e.g., extract year from “fecha_registro”).
  • Final Demonstration: Participants are shown how to connect to external data sources (e.g., databases) via Talend Studio (to be covered the next day), and the session ends with a reminder that Talend Big Data (unstructured/semi-structured data) will be covered the following day.
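The date functions above have close standard-library equivalents; Java date patterns such as yyyy-MM-dd correspond to strptime/strftime codes in Python (sample dates are made up):

```python
from datetime import date, datetime, timezone

# "change date format": "yyyy-MM-dd" in, "MM/dd/yyyy" out.
d = datetime.strptime("2025-01-29", "%Y-%m-%d")
print(d.strftime("%m/%d/%Y"))  # → '01/29/2025'

# "calculate time since": whole years between a birth date and a reference date.
def age_in_years(birth: date, today: date) -> int:
    # Subtract one year if this year's birthday has not happened yet.
    return today.year - birth.year - ((today.month, today.day) < (birth.month, birth.day))

print(age_in_years(date(1990, 6, 15), date(2025, 1, 29)))  # → 34

# "compare dates": registration must follow birth.
assert date(2020, 1, 1) > date(1990, 6, 15)

# "convert to epoch": seconds since 1970-01-01 UTC.
print(int(datetime(2025, 1, 29, tzinfo=timezone.utc).timestamp()))  # → 1738108800
```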

Appendix

Key Principles

  • Self-Service (autoservicio): Talend Data Preparation empowers non-technical users to perform data transformation without developer intervention.
  • Semantic Consistency: Semantic types (e.g., “email,” “first_name”) are shared across Talend Data Stewardship and Talend Data Preparation, ensuring unified data governance.
  • Case Sensitivity: All string operations (search, match, replace) are case-sensitive by default; use regex with (?i) flag for case-insensitive matching.
  • Save Explicitly: Changes in pipelines and recipes require clicking “Save” or “Submit” to persist; auto-save is not enabled.
  • Preview Before Commit: Always use the preview function to validate transformations before applying them permanently.

Tools Used

  • Talend Data Fabric: Platform for data stewardship and preparation.
  • Talend Data Stewardship: For campaign-based data curation and arbitration.
  • Talend Data Preparation: For dataset cleaning, transformation, and formatting.
  • Talend Studio: For automating preparation jobs (mentioned but not demonstrated).
  • Talend Data Catalog / Inventory: For data discovery and semantic classification (referenced but not used in depth).

Common Pitfalls

  • Forgetting to click “Save” after editing pipeline or recipe names.
  • Case mismatches between source column names and campaign model names.
  • JSON/XML file upload failures (unsupported in this version).
  • Shared infrastructure delays during job execution (long wait times).
  • Semantic type misclassification due to small preview size (e.g., integer inferred from first 10K rows, but later data contains text).
  • Regex syntax errors (e.g., using ? instead of (?i) for case-insensitive matching).

Practice Suggestions

  • Practice remapping column names using Field Selector in pipelines.
  • Use Magic Fill to standardize inconsistent names (e.g., “john,” “John,” “JOHN” → “J. Doe”).
  • Apply regex to clean phone numbers, emails, and URLs.
  • Use “fill in front/back” to handle missing values in time-series data.
  • Always test data masking (hashing, obscuring) on sample data before production use.
  • Use “keep row order” only when sequence matters (e.g., ranking); disable it for performance on large datasets.