Summary

Overview

This course provides a comprehensive, hands-on walkthrough of the Talend Data Stewardship platform, focusing on data quality governance through the creation of data models, campaigns, and workflows. The session guides learners through defining data schemas, configuring data quality rules (including regex validation and semantic dictionaries), populating campaigns via pipelines or manual entry, and assigning roles (data stewards, analysts, verifiers) to enable human-in-the-loop data correction. It also covers integration with external data sources (S3, MySQL), user role management, and best practices for reusability and automation. The instructor demonstrates real-time troubleshooting of platform limitations, such as case sensitivity, hardware constraints, and workflow state transitions.

Topics (Timeline)

1. Initial Setup and Platform Navigation [00:00:00 - 00:02:51]

  • Instructor resolves connectivity and interface issues with participants’ machines.
  • Guides users to locate and access the “Talend Data Stewardship” platform and the “Data Inventory” interface.
  • Demonstrates the file upload process for clientes1.csv and notes access delays caused by shared infrastructure.
  • Explains that data quality scoring is computationally intensive and delayed due to shared cluster resources.

2. Case Study Introduction and Role Definitions [00:05:28 - 00:08:05]

  • Introduces the Retail S.A. case study: improving customer data quality using Talend Data Stewardship.
  • Defines three core roles:
    • Data Analyst / Campaign Creator: Defines data models and campaigns.
    • Data Engineer: Populates campaigns with data (via pipeline, API, or manual entry).
    • Data Steward: Corrects data quality issues in assigned tasks.
  • Emphasizes that each campaign must be associated with a data model.

3. Creating a Data Model: Attributes and Standards [00:09:02 - 00:17:12]

  • Creates a new data model named modelo_clientes_retail, following standardized naming: lowercase identifiers, title-case labels.
  • Defines attributes with strict standards:
    • Identifiers: lowercase, no spaces or accents (e.g., identificación → identificacion); see the normalization sketch after this list.
    • Labels: human-readable, with accents and capitalization (e.g., “Identificación”).
  • Sets data types: text, integer, boolean.
  • Configures validation rules using regex patterns.
  • Advises saving the model frequently to prevent data loss due to platform instability.
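The naming convention above can be captured in a few lines: a minimal sketch, assuming the goal is to derive a lowercase, accent-free identifier from a human-readable label (the helper `to_identifier` is illustrative, not a platform function).

```python
import re
import unicodedata

def to_identifier(label: str) -> str:
    """Derive a lowercase, accent-free identifier from a human-readable label."""
    # Decompose accented characters and drop the combining marks (á -> a).
    no_accents = "".join(
        c for c in unicodedata.normalize("NFD", label)
        if unicodedata.category(c) != "Mn"
    )
    # Lowercase and collapse anything non-alphanumeric into underscores.
    return re.sub(r"[^a-z0-9]+", "_", no_accents.lower()).strip("_")

print(to_identifier("Identificación"))       # -> identificacion
print(to_identifier("Correo Electrónico"))   # -> correo_electronico
```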

4. Advanced Data Validation: Regex and Semantic Rules [00:18:08 - 00:36:21]

  • Configures regex validation for email and phone fields (tested in the sketch after this list):
    • Uses AI (ChatGPT) to generate regex: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$ for email.
    • Phone validation: ^\d{3}-\d{4}$ for Colombian format.
  • Introduces semantic tags: dictionary, regular expression, compound type.
  • Demonstrates the “Remove leading and trailing whitespace” function to clean data.
  • Emphasizes importance of standardization: mismatched case (e.g., Email vs email) causes validation failures.
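Before loading regex patterns into the model, it helps to test them against a handful of sample values. A minimal sketch using Python's `re` module with the two patterns quoted above (the sample values are illustrative):

```python
import re

EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")
PHONE_RE = re.compile(r"^\d{3}-\d{4}$")

samples = [
    ("ana@retail.com", EMAIL_RE),   # valid email
    ("ana@retail", EMAIL_RE),       # missing top-level domain -> invalid
    ("355-1234", PHONE_RE),         # matches the ddd-dddd pattern
    ("3551234", PHONE_RE),          # missing hyphen -> invalid
]

for value, pattern in samples:
    status = "valid" if pattern.match(value) else "invalid"
    print(f"{value!r:20} -> {status}")
```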

5. Campaign Creation and Workflow Design [00:47:19 - 00:56:38]

  • Creates campaign campaña_clientes_retail with:
    • Type: Resolution (for data quality fixes).
    • Deadline: 1 hour.
    • Roles: Analista de Datos, Verificador de Datos, Gerente de Datos.
  • Defines the workflow (see the state-transition sketch after this list):
    1. Analizar y corregir (analyze and correct) → assigned to the Analista.
    2. Verificar (verify) → assigned to the Verificador.
    3. Terminar (finish) → assigned to the Gerente.
  • Explains that a campaign's data model cannot be changed after creation; the campaign must be duplicated to make changes.
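One way to reason about the three-step workflow is as a small state machine that maps each state to its assigned role and successor state. A sketch under that assumption; the names mirror the workflow above, but the dictionary structure is not part of the platform:

```python
# Each workflow state maps to (assigned role, next state); None marks the end.
WORKFLOW = {
    "Analizar y corregir": ("Analista de Datos", "Verificar"),
    "Verificar": ("Verificador de Datos", "Terminar"),
    "Terminar": ("Gerente de Datos", None),
}

state = "Analizar y corregir"
while state is not None:
    role, next_state = WORKFLOW[state]
    print(f"{state} -> handled by {role}")
    state = next_state
```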

6. Populating Campaigns via Pipeline [00:56:47 - 01:20:46]

  • Uses Pipeline Designer (ETL tool) to populate the campaign with data (an equivalent API-based sketch follows this list):
    • Source: clientes1.csv (after quality scoring completes).
    • Destination: campaña_clientes_retail.
    • Action: Insert, State: Analizar y corregir, Assign to user, Priority: High.
  • Explains pipeline limitations: only the Resolution and Merging campaign types are supported.
  • Runs pipeline; notes delays due to shared Spark cluster resource contention.
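Since campaigns can also be populated via API (see section 2), the same insert action, initial state, assignee, and priority could be expressed as a JSON payload sent over REST. The sketch below is purely illustrative: the endpoint path, payload fields, and token handling are placeholders, not the documented Talend Data Stewardship API.

```python
import requests  # third-party HTTP client: pip install requests

# Placeholders -- the real endpoint, payload schema, and authentication come
# from the Talend Data Stewardship API documentation, not from this sketch.
TDS_TASKS_URL = "https://tds.example.com/api/v1/campaigns/campana_clientes_retail/tasks"
ACCESS_TOKEN = "REPLACE_WITH_PERSONAL_ACCESS_TOKEN"

task = {
    "action": "insert",
    "state": "Analizar y corregir",
    "assignee": "john.jaime.mendes@gmail.com",
    "priority": "High",
    "record": {
        "identificacion": "1001",
        "email": "ana@retail.com",
        "edad": 31,
    },
}

response = requests.post(
    TDS_TASKS_URL,
    json=task,
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    timeout=30,
)
response.raise_for_status()
print("Task created, HTTP status:", response.status_code)
```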

7. Data Stewardship: Reviewing and Correcting Tasks [01:22:03 - 02:17:38]

  • Data stewards access tasks in the “Tareas” (Tasks) view and see color-coded data quality issues:
    • Red: validation failure (e.g., invalid email, age < 18).
    • Gray: leading/trailing whitespace.
  • Corrects data manually (approximated in the cleaning sketch after this list):
    • Edits email to add .com.
    • Changes ages from -23 to 23 and from 31.5 to 31.
    • Uses “Remove leading and trailing” function for city/state fields.
  • Adds comments to tasks for an audit trail (e.g., “Revisar edad: asumí 31 años”, i.e., “review age: assumed 31 years”).
  • Uses “Mark tasks as ready” to advance tasks to the next workflow stage.
  • Uses “Validate” to finalize tasks and move them to the next state.
  • Demonstrates audit trail via clock icon: shows who changed what and when.
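The manual corrections above can be approximated in code when pre-cleaning a batch before it reaches the campaign. A sketch, assuming records arrive as plain dictionaries and using the example fields from this section:

```python
def clean_record(record: dict) -> dict:
    """Apply the manual corrections described above to a single record."""
    cleaned = dict(record)

    # Gray issues: trim leading/trailing whitespace on text fields.
    for field in ("ciudad", "departamento"):
        value = cleaned.get(field)
        if isinstance(value, str):
            cleaned[field] = value.strip()

    # Red issues on age: force a positive integer (-23 -> 23, 31.5 -> 31).
    if "edad" in cleaned:
        cleaned["edad"] = abs(int(float(cleaned["edad"])))

    return cleaned

print(clean_record({"ciudad": "  barranquilla ", "edad": "-23"}))
# -> {'ciudad': 'barranquilla', 'edad': 23}
```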

8. User Role Management and Multi-User Simulation [02:58:27 - 03:09:28]

  • Creates a new user, john.jaime.mendes@gmail.com, with the Data Steward role via the Talend Management Console.
  • Demonstrates role separation:
    • Original user: full access (creator, operator, steward).
    • New user: can only view and correct tasks, cannot create campaigns.
  • Troubleshoots login conflicts: users must log out of multiple sessions to avoid permission overlap.

9. External Data Source Integration: S3 and MySQL [03:12:57 - 03:27:04]

  • Configures an S3 connection (a connectivity sketch for both sources follows this list):
    • Uses AWS access key and secret key.
    • Creates dataset clientes_on_the_record from S3 file clientes.csv.
  • Configures MySQL connection:
    • Host: sql5.freemysqlhosting.net
    • Port: 3306
    • Database: nombre_base
    • Username/Password: from credentials file.
  • Notes: Data quality scoring occurs after ingestion from external sources.
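A minimal connectivity check for the two external sources, assuming boto3 and PyMySQL are available. The bucket name and credentials are placeholders; the host, port, and database name are the values noted above:

```python
import boto3    # pip install boto3
import pymysql  # pip install pymysql

# --- S3: confirm the clientes.csv object is reachable (bucket name is a placeholder).
s3 = boto3.client(
    "s3",
    aws_access_key_id="REPLACE_WITH_ACCESS_KEY",
    aws_secret_access_key="REPLACE_WITH_SECRET_KEY",
)
s3.head_object(Bucket="replace-with-bucket-name", Key="clientes.csv")
print("S3 object found")

# --- MySQL: confirm the database accepts connections with the shared credentials.
connection = pymysql.connect(
    host="sql5.freemysqlhosting.net",
    port=3306,
    database="nombre_base",
    user="REPLACE_WITH_USERNAME",
    password="REPLACE_WITH_PASSWORD",
)
with connection.cursor() as cursor:
    cursor.execute("SELECT 1")
    print("MySQL connection OK:", cursor.fetchone())
connection.close()
```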

10. Semantic Data Types and Dictionary Management [03:28:58 - 03:46:16]

  • Introduces semantic data types:
    • Dictionary: Predefined list of valid values (e.g., cities).
    • Regular Expression: Custom validation patterns.
    • Compound Type: Combinations of above.
  • Creates dictionary ciudades with 2 values: barranquilla, cartagena.
  • Explains three matching modes (compared in the sketch after this list):
    • Exact value: case- and accent-sensitive.
    • Ignore case and accents: ignores case/accents, but not whitespace.
    • Simplify text: removes whitespace, accents, and case for comparison.
  • Demonstrates importing dictionary from .txt file (one column, line-separated values).
  • Advises using AI to generate city lists for large dictionaries.
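The three matching modes can be approximated in a few lines, which makes their differences easy to compare. The normalization below mirrors the descriptions above, not the platform's exact implementation:

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Drop combining accent marks (á -> a) while keeping the base letters."""
    return "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    )

def simplify(text: str) -> str:
    """Remove accents, case, and all whitespace for comparison."""
    return "".join(strip_accents(text).lower().split())

def matches(value: str, entry: str, mode: str) -> bool:
    if mode == "exact":                 # case- and accent-sensitive
        return value == entry
    if mode == "ignore_case_accents":   # ignores case/accents, keeps whitespace
        return strip_accents(value).lower() == strip_accents(entry).lower()
    if mode == "simplify":              # also ignores whitespace
        return simplify(value) == simplify(entry)
    raise ValueError(f"unknown mode: {mode}")

for mode in ("exact", "ignore_case_accents", "simplify"):
    print(mode, matches("  Barranquilla ", "barranquilla", mode))
# exact: False; ignore_case_accents: False (whitespace kept); simplify: True
```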

11. Platform Limitations and Best Practices [03:46:16 - 03:47:40]

  • Notes platform instability: frequent timeouts, 500 errors, session loss.
  • Recommends:
    • Saving models frequently.
    • Using AI to generate regex and dictionary lists.
    • Duplicating campaigns to modify models.
    • Using JSON import for bulk model creation via the API (see the template sketch after this list).
  • Concludes session: next day will cover dictionary import and final campaign validation.
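A sketch of what such a JSON model template might look like. The key names (`name`, `attributes`, `type`, `mandatory`, `validation`) are illustrative and would need to be aligned with the schema the platform's import API actually expects:

```python
import json

# Illustrative template -- align the keys with the schema the import API expects.
model_template = {
    "name": "modelo_clientes_retail",
    "attributes": [
        {"id": "identificacion", "label": "Identificación", "type": "text", "mandatory": True},
        {"id": "email", "label": "Email", "type": "text",
         "validation": r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"},
        {"id": "edad", "label": "Edad", "type": "integer"},
        {"id": "activo", "label": "Activo", "type": "boolean"},
    ],
}

# Write the template so it can be sent to the model-creation endpoint or reused as a base.
with open("modelo_clientes_retail.json", "w", encoding="utf-8") as f:
    json.dump(model_template, f, ensure_ascii=False, indent=2)
```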

Appendix

Key Principles

  • Data Model First: Every campaign requires a pre-defined data model. Models are reusable across campaigns.
  • Standardization is Critical: Use lowercase identifiers, consistent naming, and avoid accents in field names to prevent validation failures.
  • Human-in-the-Loop: The platform is designed for human correction of data quality issues, not automated fixes.
  • Role Separation: Creators define models/campaigns; operators populate data; stewards correct tasks.

Tools Used

  • Talend Data Stewardship: Core platform for data quality governance.
  • Pipeline Designer: ETL tool for bulk task population (supports only Resolution and Merging campaigns).
  • ChatGPT: Used to generate regex patterns and dictionary lists.
  • S3 and MySQL: External data sources connected via credentials.
  • Talend Management Console: Admin tool for user and role management.

Common Pitfalls

  • Case Sensitivity: Field names in the dataset must exactly match model identifiers (e.g., email vs. Email).
  • Whitespace Issues: Leading/trailing spaces in text fields cause validation failures; use “remove leading and trailing” function.
  • Dictionary Matching: Exact value mode is too strict; Ignore case and accents is recommended for real-world data.
  • Platform Instability: Frequent timeouts and 500 errors; save work often and use duplicate tabs.
  • Shared Infrastructure: Data quality scoring is slow due to resource contention; expect delays.

Practice Suggestions

  • Create a template JSON file to auto-generate data models via API.
  • Build a dictionary of cities, states, or departments using AI-generated lists and import via .txt.
  • Simulate multi-user workflows: one user as creator, another as steward.
  • Test regex patterns with real data before applying to production campaigns.
  • Always validate data source column names against model identifiers before pipeline execution.
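For the last suggestion, a small pre-flight check that compares CSV headers against the model identifiers before running the pipeline (the file name matches the session's clientes1.csv; the identifier set is illustrative):

```python
import csv

# Illustrative identifier set -- replace with the identifiers defined in the data model.
MODEL_IDENTIFIERS = {"identificacion", "nombre", "email", "telefono", "edad", "ciudad"}

with open("clientes1.csv", newline="", encoding="utf-8") as f:
    headers = next(csv.reader(f))

# Matching is case-sensitive, so "Email" will not map to "email".
mismatches = [h for h in headers if h not in MODEL_IDENTIFIERS]
if mismatches:
    print("Columns that do not match any model identifier:", mismatches)
else:
    print("All source columns match the model identifiers.")
```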