Summary
Overview
This course provides a comprehensive theoretical and practical introduction to the Talend Data Catalog (TDC), a data cataloging tool designed to centralize metadata management, improve data governance, and enable data discovery across distributed systems. The session begins with a diagnostic phase to verify user access and permissions, followed by an in-depth exploration of data cataloging concepts: metadata types (technical, business, lineage, and quality), data governance frameworks, and enterprise architecture. The instructor uses a case study of Acme Sportingwear to demonstrate how to configure TDC, import technical metadata from a PostgreSQL database, define data classifications, and interpret profiling results. The session emphasizes practical implementation through live screen sharing, with participants creating individual project configurations and exploring object relationships, labels, and versioning. The course concludes by establishing foundational workflows for metadata management and setting the stage for future hands-on practice.
Topic (Timeline)
1. Access Verification and Permission Check [00:00:00 - 00:06:02]
The session begins with troubleshooting participant access to the Talend Data Catalog software. The trainer guides users to verify login credentials, confirm correct URL access, and validate permissions for the “Test Job Migration” configuration. Issues with password entry and user account status are addressed, with instructions to test connectivity from local machines versus virtual machines. The goal is to ensure all participants have proper access before proceeding, since permission misconfigurations had previously hindered course delivery.
2. Project Context and Business Motivation for Data Cataloging [00:06:03 - 00:17:55]
The trainer introduces the business rationale for data cataloging using a narrative of organizational growth: from manual record-keeping to ERP, CRM, and SCM systems, leading to data silos and duplication. The core problem—lack of data governance—is illustrated through a scenario where a marketing team cannot segment customers by age due to missing birthdate fields. The solution is a centralized data catalog that provides inventory, lineage, and policy compliance visibility. Key benefits include: discovering data assets, understanding data meaning, ensuring regulatory compliance (GDPR/CCPA), and enabling data lineage/tracing.
3. Core Concepts: Metadata, Data Governance, and Enterprise Architecture [00:17:56 - 00:30:13]
The trainer defines metadata as “data about data,” using analogies like image EXIF data and document authorship. Four metadata types are introduced:
- Technical: Column names, data types, table structures (from databases, data lakes).
- Business: Glossaries, definitions, and terminology (e.g., “customer” = employee + family).
- Lineage (Data Provenance): Transformation workflows, ETL jobs, and data movement.
- Quality: Rules and monitoring for data integrity.
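Technical metadata of the kind imported later in the session can be harvested directly from a database's system catalog. A minimal, self-contained sketch follows; it uses SQLite's `PRAGMA table_info` as a stand-in, whereas against the session's PostgreSQL source the same facts would come from `information_schema.columns`. The table and column names are illustrative.

```python
import sqlite3

def harvest_columns(conn, table):
    """Collect technical metadata (column name, type, nullability) for one table.

    A stand-in for a catalog import: against PostgreSQL the same facts come
    from information_schema.columns; SQLite is used here so the sketch runs
    without an external database.
    """
    return [
        {"column": name, "type": col_type, "nullable": not notnull}
        for _, name, col_type, notnull, _, _ in conn.execute(
            f"PRAGMA table_info({table})"
        )
    ]

# Illustrative schema loosely modeled on the session's Anonymized Order table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE anonymized_order ("
    " order_id INTEGER PRIMARY KEY,"
    " city TEXT,"
    " amount REAL NOT NULL)"
)
cols = harvest_columns(conn, "anonymized_order")
print(cols)
```

This is exactly the shape of information a catalog import collects per table: names, declared types, and constraints, before any profiling statistics are layered on top.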
Enterprise architecture is explained via TOGAF’s four layers: business, application, data, and technology architecture. The trainer critiques misaligned tech implementations (e.g., time-tracking systems in Colombian universities) to emphasize strategic alignment.
4. Talend Data Catalog: Features, Versions, and User Roles [00:30:14 - 00:49:12]
The trainer details TDC’s three license tiers (Standard, Advanced, Advanced Plus), noting that Advanced lacks custom glossary schemas, requiring generic models. Core functionalities are outlined:
- Discovery: Find metadata across systems.
- Description: Document meaning and context.
- Lineage: Visualize data flow.
- Versioning: Track changes to metadata objects.
- Collaboration: Share insights and comments.
The dashboard interface is introduced, with navigation to key menus: Objects, Collections, Worksheets (saved queries), Dashboard (custom UI), and Manage (admin controls). The trainer clarifies that most users will be end-users, not administrators.
5. Project Setup: Repository, Configuration, and Labels [00:49:13 - 01:33:24]
Participants create a project structure:
- A folder named `test_record_course` to organize all objects.
- A configuration named `TDC_config_[name]` to internally link metadata objects.
Users connect to their personal configuration via the top-right dropdown.
Labels are introduced as tags for filtering and status tracking (e.g., “in review,” “approved”). The trainer recommends using status-based labels rather than domain-specific ones. The Responsibilities and Version menus are explored, emphasizing that versioning is per-object, not per-project. The History menu enables audit trails.
6. Technical Metadata Import: Connection, Schema, and Profiling [01:33:25 - 02:43:59]
The trainer guides participants through importing technical metadata from a PostgreSQL database:
- Connection Setup: Choosing between `default server` (cloud-to-database) and `RFDCA` (on-premises agent).
- Import Setup: Configuring host, database, schema (`ACME_SportingWeb`), and credentials from a provided file.
- Import Options: Enabling data sampling (10 rows), data profiling (1,000 rows), and data classification.
- Data Classification: Using predefined classes (email, first name, city) based on dictionaries and regex patterns. The tool auto-classifies columns (e.g., “city” → PII) with 60% threshold logic.
- Profiling Results: Viewing statistics (nulls, duplicates), histograms (for numeric fields), and conditional labels (PII, confidential).
Participants observe how metadata is structured: tables → columns → data types → constraints → statistics. The Dataflow and Semantic Flow menus show no lineage or business glossary links yet.
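The 60% threshold logic behind auto-classification can be sketched as follows. The dictionary entries, patterns, and threshold handling are illustrative of the mechanism described in the session, not TDC's actual implementation; TDC ships its own, far larger dictionaries.

```python
import re

# Illustrative class definitions: a regex pattern per class.
CLASSES = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "city": re.compile(r"^(london|paris|berlin|bogota|madrid)$", re.I),
}

def classify(values, threshold=0.6):
    """Assign every class whose pattern matches >= threshold of the sample.

    Mirrors the behaviour described in the session: a 60% match rate
    triggers auto-classification, which is also why overlapping
    dictionaries (city names that are also surnames) produce false
    positives that users must remove manually.
    """
    non_null = [v for v in values if v is not None]
    hits = []
    for label, pattern in CLASSES.items():
        matched = sum(1 for v in non_null if pattern.match(v))
        if non_null and matched / len(non_null) >= threshold:
            hits.append(label)
    return hits

sample = ["London", "Paris", "Berlin", "N/A", "Madrid"]
print(classify(sample))  # 4/5 = 80% match -> classified as "city"
```

Because classification is sample-based and threshold-based, a column can clear the bar for more than one class at once, matching the misclassification behaviour observed later in the session.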
7. Object Exploration and Metadata Interpretation [02:44:00 - 02:50:04]
Participants explore the imported Anonymized Order table:
- Columns: View names, types, nullability, length.
- Column Statistics: Null count, distinct values, duplicates.
- Histograms: Distribution of numeric values.
- Conditional Labels: PII tagging on “city” column.
- Labels: Custom label “ubicación” (“location”) added at column level.
- Data Classification: Tool misclassifies “city” as “last name” due to dictionary overlap; users manually remove incorrect classifications.
- Lineage & Semantics: No links to jobs or glossaries yet.
- Diagram: Entity-Relationship diagram created for tables.
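The column statistics explored above can be reproduced with a short sketch. The bin count and equal-width histogram logic are illustrative; TDC computes these figures during import on the configured profiling sample.

```python
from collections import Counter

def profile(values, bins=5):
    """Compute the column statistics surfaced by TDC profiling:
    null count, distinct count, duplicate count, and (for numeric
    columns) a simple equal-width histogram."""
    nulls = sum(1 for v in values if v is None)
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    stats = {
        "nulls": nulls,
        "distinct": len(counts),
        "duplicates": sum(c - 1 for c in counts.values()),
    }
    if non_null and all(isinstance(v, (int, float)) for v in non_null):
        lo, hi = min(non_null), max(non_null)
        width = (hi - lo) / bins or 1  # avoid zero width for constant columns
        hist = [0] * bins
        for v in non_null:
            hist[min(int((v - lo) / width), bins - 1)] += 1
        stats["histogram"] = hist
    return stats

print(profile([10, 20, 20, 30, None, 90]))
# -> {'nulls': 1, 'distinct': 4, 'duplicates': 1, 'histogram': [3, 1, 0, 0, 1]}
```

Non-numeric columns get counts only, which matches the session's observation that histograms appear for numeric fields.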
The session ends with participants confirming successful import and understanding metadata structure, with a 10-minute break scheduled before continuing with business metadata and lineage.
Appendix
Key Principles
- Metadata is the DNA of data: Describes structure, meaning, and context—not the data itself.
- Data cataloging enables data governance: Centralizes discovery, lineage, and compliance.
- Versioning is per-object: Each table, column, or model has independent version history.
- Labels are contextual: Global labels (tool-wide) vs. configuration-specific labels (project-wide).
- Data classification relies on thresholds: 60% match to dictionary/regex triggers auto-classification (can be inaccurate).
Tools Used
- Talend Data Catalog (TDC), Advanced edition
- PostgreSQL (source database)
- AWS S3 / Azure (hypothetical data lake platforms)
- Tableau (target reporting tool)
- RFDCA (on-premises agent for secure connections)
- Default Server (cloud-based connection mode)
Common Pitfalls
- Misconfigured connection strings: Host URL errors prevent metadata import.
- Over-reliance on auto-classification: Tools may mislabel columns (e.g., city → last name).
- Incorrect versioning settings: “Copy model description” can overwrite historical documentation.
- Using default server for on-prem databases: Security risks; RFDCA agent required for internal networks.
- Ignoring labels: Failing to use status labels (e.g., “in review”) reduces searchability.
Practice Suggestions
- Create a personal configuration and import metadata from a local database.
- Add custom labels to objects (e.g., “deprecated,” “critical,” “reviewed”).
- Manually override auto-classifications to improve accuracy.
- Document objects with comments and attachments (e.g., schema diagrams, data dictionaries).
- Test data profiling with small datasets to understand statistical outputs.
- Explore lineage by linking tables to ETL jobs in future sessions.