Course recordings on DaDesktop for Training platform
Visit NobleProg websites for related courses
Visit outline: Talend Big Data Integration (Course code: talendbigdata)
Summary
Overview
This course session provides a hands-on demonstration of configuring and managing data campaigns in a data governance platform (likely Talend Data Stewardship or a similar tool), focusing on workflow-based task assignment, data validation, and correction workflows. The trainer walks through setting up a campaign with defined states, assigning tasks to users by role, executing data pipelines on Spark infrastructure, and correcting data quality issues through visual inspection and transformation functions. Key topics include user permissions, data validation rules, search-and-replace operations, date and phone number formatting, and notification settings for data stewards. The session emphasizes practical, visual data cleaning and the integration of Big Data processing via Spark.
Topic (Timeline)
1. Campaign Configuration and Task Assignment [00:00:00 - 00:02:52]
- The trainer demonstrates how to configure a campaign via the right-side panel, where only two actions are available: inserting or deleting records in the Talend pipeline.
- Inserting records requires selecting a workflow state; only states assigned to the user’s role are visible in production.
- User permissions determine visibility: if a user is assigned to only one workflow state, only that state appears.
- The “assigned” field must display the correct username; if it doesn’t, the user was not added to the role after creation.
- Emphasis on clicking “Save” explicitly—changes are not auto-saved.
- Verification step: navigate back to the client view to confirm the inserted record appears in the correct state (“revisión”, i.e., review).
2. Pipeline Execution and Infrastructure Options [00:02:55 - 00:04:46]
- Running the campaign triggers a background job; preview fails if the campaign has no data.
- Introduction to “profiles” and infrastructure configuration: pipelines can run on Talend’s cloud or on-premises Spark clusters.
- Spark is selected as the execution engine; cluster initialization takes ~20 seconds due to scale (millions of records).
- The system automatically converts standard jobs into Spark code when Spark is selected.
- Trade-off noted: evaluate tool cost vs. time savings from automated Big Data processing.
- Job completes successfully, but preview may delay due to data volume.
3. Navigating Campaigns, Tasks, and User Roles [00:04:50 - 00:07:55]
- A 500 error occurs when switching to “Data Teamwork” tool; refresh (F5) is suggested as workaround.
- Campaigns and tasks are separate views: campaigns show aggregated data; tasks allow editing.
- A user with the “operador de campaña” (campaign operator) role can create and modify tasks; the “propietario de campaña” (campaign owner) creates models and assigns datastores.
- Manual task creation vs. automated population via file (e.g., .csv) demonstrated.
- Data sources can connect to 600+ components: Azure, AWS, BigQuery, databases, and files.
- Trainer notes desire to demonstrate another connection type (e.g., database) but time is limited.
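The automated population step above can be sketched in a few lines: a .csv file is parsed into one record per task. This is a minimal illustration, not the platform's actual loader; the column names below are hypothetical sample data, not the real model labels from the session.

```python
import csv
import io

# Hypothetical sample of the kind of .csv used to populate tasks;
# column names are illustrative, not the actual Talend model labels.
SAMPLE_CSV = """poliza,email,telefono,estado
2024-0101,cliente@ejemplo.mx,+52 55 1234 5678,activa
2024-0102,cliente@ejemplo.com,+57 1 234 5678,suspendida
"""

def load_task_records(text):
    """Parse CSV text into a list of dicts, one per task record."""
    return list(csv.DictReader(io.StringIO(text)))

records = load_task_records(SAMPLE_CSV)
print(len(records))          # 2
print(records[1]["estado"])  # suspendida
```

In the platform the same connection mechanism extends to the 600+ components mentioned above (Azure, AWS, BigQuery, databases); the file route is simply the quickest to demonstrate.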
4. Data Validation, Error Detection, and Manual Correction [00:08:47 - 00:13:02]
- Data is displayed with column names matching labels, not internal IDs; data types (text, integer, date) are defined in the model.
- Green lines = no validation errors; red lines = validation failures.
- Clicking a red line applies a filter to show only invalid records.
- Hovering over a red line reveals the specific validation rule violated (e.g., email must end in .mx, not .com).
- Manual cell editing: double-click to correct values (e.g., change .com → .mx).
- Comments can be added via right-click → “Add comment” to document changes for reviewers.
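The email rule the red lines report (domain must end in .mx, not .com) is the kind of check a regex-based validation rule expresses. The pattern below is an assumption for illustration; the real rule is defined in the Talend model, not exposed as this code.

```python
import re

# Illustrative stand-in for the campaign's email rule (domain must end in
# .mx). The actual rule lives in the Talend model; this regex is an
# assumption for demonstration only.
EMAIL_MX = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)*\.mx$")

def validate_email(value):
    """Return True if the value passes the .mx domain rule."""
    return bool(EMAIL_MX.match(value))

print(validate_email("cliente@aseguradora.mx"))   # True  -> green line
print(validate_email("cliente@aseguradora.com"))  # False -> red line
```

Hovering a red line in the UI surfaces exactly this kind of rule description, which is what guides the manual double-click correction.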
5. Automated Data Transformation with Search and Replace [00:15:05 - 00:19:41]
- Use “search and replace” function to batch-correct data (e.g., replace all .com with .mx in email column).
- Option to override entire cell or replace only matching substring.
- Póliza (policy) number format corrected by inserting hyphens (e.g., 20240101 → 2024-01-01).
- Phone number validation fails due to inconsistent formatting (e.g., +57 vs. +52, spaces).
- Trainer notes that correcting phone formats requires complex regex or multiple functions; no built-in function exists for this specific pattern.
- Demonstrates partial correction on one record to illustrate process.
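The two behaviours described above (whole-cell override vs. substring replace) and the phone-format problem can be sketched as follows. This is a hypothetical re-implementation in plain Python, not a Talend API: the platform has no built-in function for the phone pattern, which is why the trainer resorts to partial manual correction.

```python
import re

def replace_in_column(rows, column, old, new, whole_cell=False):
    """Mimic search-and-replace: override the entire cell when the old
    value is found, or replace only the matching substring."""
    for row in rows:
        if whole_cell:
            if old in row[column]:
                row[column] = new
        else:
            row[column] = row[column].replace(old, new)
    return rows

def normalize_phone(value, default_cc="+52"):
    """One possible normalization: strip everything but digits and a
    leading +, then prepend a default country code if none is present.
    Hypothetical helper, not a built-in platform function."""
    digits = re.sub(r"[^\d+]", "", value)
    if not digits.startswith("+"):
        digits = default_cc + digits
    return digits

rows = [{"email": "ana@x.com"}, {"email": "luis@y.com"}]
replace_in_column(rows, "email", ".com", ".mx")
print(rows[0]["email"])                    # ana@x.mx
print(normalize_phone("+57 1 234 5678"))   # +5712345678
```

As the trainer notes, a full phone normalization (mixed +57/+52 prefixes, spaces) needs regex logic like this rather than a single replace operation.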
6. Resolving Data Type and Format Errors [00:20:24 - 00:24:23]
- State field value “suspendida” (suspended) corrected to “cancelada” (cancelled) to match the allowed values in the model.
- Vehicle model field contains non-standard separators (e.g., thin spaces); corrected via search-and-replace to remove whitespace.
- Year field (1980–2025) fails due to out-of-range values; corrected manually.
- Date fields have inconsistent separators (slashes vs. hyphens); use search and replace to convert / to -.
- Date format mismatch: system expects YYYY-MM-DD, but input is DD/MM/YYYY; no built-in converter available in this interface.
- Trainer notes that date format conversion is better handled in Talend Data Preparation and suggests manual correction for now.
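The two remaining checks from this section, converting DD/MM/YYYY to the expected YYYY-MM-DD form and the 1980–2025 year range, are trivial outside the interface. This sketch shows the kind of step the trainer defers to Talend Data Preparation; it is an assumption about the target format, not platform code.

```python
from datetime import datetime

def to_iso_date(value, source_fmt="%d/%m/%Y"):
    """Convert e.g. 31/12/2024 to 2024-12-31, the form the model expects.
    The stewardship interface has no built-in converter for this."""
    return datetime.strptime(value, source_fmt).strftime("%Y-%m-%d")

def year_in_range(value, lo=1980, hi=2025):
    """Check the year field against the model's allowed range."""
    return lo <= int(value) <= hi

print(to_iso_date("31/12/2024"))  # 2024-12-31
print(year_in_range("2030"))      # False -> flagged as a red line
```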
7. User Preferences and Notification Setup [00:09:11 - 00:10:30]
- Users must enable email notifications under “Profiles Preferences” → “Data Stewards” to receive alerts when tasks are assigned.
- Language settings are available (English, French, Japanese, German); Spanish is not supported.
- Notification system ensures data stewards are proactively informed of pending data review tasks.
Appendix
Key Principles
- Role-Based Access Control: Users only see workflow states and data they are explicitly assigned to.
- Explicit Save Required: No auto-save; users must click “Save” to persist changes.
- Validation-Driven Correction: Red indicators highlight data violations; hover reveals rule details.
- Visual Data Cleaning: Primary workflow is interactive, point-and-click correction without scripting.
- Spark Integration: Big Data pipelines are triggered by selecting Spark engine; code is auto-generated.
Tools Used
- Talend Data Stewardship (or similar platform)
- Spark cluster for distributed processing
- Search and Replace functions
- Data validation rules (regex-based patterns)
- Email notification system
Common Pitfalls
- Forgetting to click “Save” → changes lost.
- Not assigning users to roles → no visibility in workflow.
- Using incorrect date/phone formats → validation fails.
- Confusing “Campaign” view (read-only) with “Tasks” view (editable).
- Assuming Spanish language support → not available.
Practice Suggestions
- Recreate the campaign setup with a sample .csv file.
- Practice correcting 3 different data types: emails, phone numbers, dates.
- Enable notifications and simulate a task assignment to test email alert.
- Compare results between running a job on cloud vs. on-premises Spark.
- Try using regex in “search and replace” for complex patterns (e.g., phone normalization).