Summary

Overview

This course provides a comprehensive, hands-on exploration of integrating Big Data technologies (Hadoop, Hive, Spark) with Artificial Intelligence (specifically OpenAI) and demonstrates the construction of a data warehouse using Hive. The session walks through end-to-end workflows: connecting Big Data pipelines to AI models via dynamic prompts, configuring Hive metadata and storage formats, building dimension tables for a data warehouse, and managing data lineage and versioning in proprietary tools. The focus is on practical implementation using Talend Big Data Studio, with emphasis on data transformation, schema alignment, and troubleshooting real-world integration challenges.

Topics (Timeline)

1. Introduction and Big Data-AI Integration Setup [00:03:13 - 00:14:23]

The session begins with an overview of integrating Big Data outputs with OpenAI for analytical insights. The trainer sets up a virtual machine with a Hadoop cluster and introduces the goal: to use Big Data processing results (e.g., top 3 least sold products filtered by city and gender) as input to an AI model. The workflow involves creating a custom PROMPT file (prom_less_sold_items.json.txt) with a Java-style placeholder (%s) to inject dynamic data from Big Data outputs into the AI prompt. The trainer demonstrates how to copy and paste a pre-existing OpenAI connection job into a new job (List 10 Least Sold Items), then disconnects the original PROMPT to reconfigure it for the new data source.
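
The summary does not reproduce the template's contents; below is a minimal sketch of what a file like prom_less_sold_items.json.txt might hold, assuming it is the JSON request body posted to the API with the %s placeholder inside the user message (the wording and the model placeholder are illustrative, not taken from the session):

    {
      "model": "<model-name>",
      "messages": [
        {
          "role": "user",
          "content": "These are the least sold items, filtered by city and gender: %s. Propose a plan to improve their sales."
        }
      ]
    }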

2. Data Flow Configuration: HDFS Input, Schema Mapping, and PROMPT Injection [00:14:27 - 00:27:03]

The trainer configures the data pipeline to feed filtered Big Data results into the AI model. A T-File Input Delimited component is added to read the output file (e.g., menos_vendidos.xml) from HDFS. The schema is imported from a previously exported T-File Output schema to ensure field alignment. The file path is copied from the output to the input component. The trainer emphasizes that file paths must be wrapped in double quotes, since component fields are evaluated as Java string expressions. The PROMPT component is reconnected to the T-Map, and a string variable named “prompt” is created to hold the dynamically generated prompt.

3. Data Aggregation for Single AI Request: File Input Delimited Tricky Configuration [00:27:06 - 00:34:10]

To avoid one AI call per row, the trainer modifies the T-File Input Delimited component to read all rows as a single string. This is achieved by setting both the field and row separators to a custom character (@) that does not occur in the data, effectively concatenating all records into one line. The column is renamed “resultado”. The original T-File Input is deleted and replaced with a new one created from metadata, named “salida_big_data”. The connection from this input to the T-Map is updated, and the variable name is changed to “resultado_big_data”. The T-Map expression is updated to String.format(plantilla, resultado_big_data), injecting the aggregated data into the PROMPT template.
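
The resulting T-Map expression is ordinary Java. A minimal sketch of the same idea outside Talend, reading the files from the local filesystem instead of HDFS and joining the rows in code rather than via the '@' separator trick:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class PromptBuilder {
        public static void main(String[] args) throws IOException {
            // Template with a %s placeholder, as in prom_less_sold_items.json.txt
            String plantilla = new String(
                    Files.readAllBytes(Paths.get("prom_less_sold_items.json.txt")));

            // All Big Data result rows concatenated into a single string
            // (the Talend job achieves this with the '@' field/row separator trick)
            String resultado_big_data = String.join(" ",
                    Files.readAllLines(Paths.get("menos_vendidos.xml")));

            // Same expression the T-Map uses: inject the aggregated result into the template
            String prompt = String.format(plantilla, resultado_big_data);
            System.out.println(prompt);
        }
    }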

4. AI Prompt Formatting and JSON Sanitization [00:34:12 - 00:53:41]

The trainer connects the PROMPT output to the OpenAI component and executes the job. The initial request fails because raw newline characters (\n) in the injected data break the JSON payload sent to OpenAI. The solution is to append .replace("\n", " ") to the PROMPT expression in the T-Map, replacing line breaks with spaces. After the correction, the AI returns a culturally informed sales plan for men in Yangon. The trainer notes OpenAI’s sensitivity to special characters and the importance of sanitizing input data before transmission.
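
The fix applied in the session is the single .replace("\n", " ") shown above. A small helper along the same lines; the extra backslash and quote escaping goes beyond what the session covered but guards against the same class of JSON errors:

    public class PromptSanitizer {
        /** Replace characters that break a hand-built JSON payload. */
        public static String sanitize(String prompt) {
            return prompt
                    .replace("\\", "\\\\")   // escape backslashes first (additional safeguard)
                    .replace("\"", "\\\"")   // escape double quotes (additional safeguard)
                    .replace("\n", " ");     // the fix applied in the session
        }
    }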

5. Hive Architecture and Container Setup [00:55:46 - 01:05:28]

The trainer shifts focus to Hive, explaining its role as a data warehouse engine atop Hadoop. The architecture includes three Docker containers: PostgreSQL (for metastore metadata), Hive (query engine), and Hadoop (data storage). The containers are started in order: PostgreSQL → Metastore → Hive. The trainer demonstrates connecting to Hive via Beeline using the JDBC URL jdbc:hive2://localhost:10000/default. The metastore (backed by PostgreSQL) holds only table metadata; Hive's SQL-like queries (HiveQL) are translated into MapReduce or Spark jobs that run on Hadoop.
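
Beeline connects over the same JDBC endpoint that any Java client can use. A minimal sketch, assuming the Hive JDBC driver (org.apache.hive.jdbc.HiveDriver) is on the classpath and that the containers accept a default "hive" user with an empty password (credentials depend on the container setup):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveConnect {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver"); // register the Hive JDBC driver
            // Same URL used with Beeline in the session
            String url = "jdbc:hive2://localhost:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SHOW DATABASES")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }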

6. Hive Table Creation and Data Insertion [01:05:30 - 01:15:49]

The trainer creates a new database axa and a table empleados with columns id (integer) and nombre (string). Data is inserted using SQL INSERT INTO statements. The trainer confirms data persistence by navigating the Hadoop file system browser, where Hive stores table data under /user/hive/warehouse/axa.db/empleados/. The trainer emphasizes that Hive abstracts Hadoop storage — SQL commands are translated into distributed jobs, and data resides in HDFS, not in the metastore database.
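
The same steps can be reproduced over JDBC. A sketch reusing a connection like the one in the previous example; the inserted row is illustrative, not the data used in the session:

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class HiveEmpleadosDemo {
        // Assumes a Connection obtained as in the previous sketch
        static void createAndQuery(Connection conn) throws SQLException {
            try (Statement stmt = conn.createStatement()) {
                stmt.execute("CREATE DATABASE IF NOT EXISTS axa");
                stmt.execute("CREATE TABLE IF NOT EXISTS axa.empleados (id INT, nombre STRING)");
                // Hive turns the INSERT into a distributed job; the data files end up
                // under /user/hive/warehouse/axa.db/empleados/ in HDFS, not in PostgreSQL
                stmt.execute("INSERT INTO axa.empleados VALUES (1, 'Ana')"); // illustrative row
                try (ResultSet rs = stmt.executeQuery("SELECT * FROM axa.empleados")) {
                    while (rs.next()) {
                        System.out.println(rs.getInt("id") + "\t" + rs.getString("nombre"));
                    }
                }
            }
        }
    }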

7. Hive Connection and Schema Extraction in Talend [01:16:45 - 01:21:33]

In Talend Big Data, the trainer creates a Hive connection named axa(hive) and configures it to point to localhost:10000 with database axa. The schema is manually extracted to populate the connection with table metadata. The trainer explains that Talend does not auto-discover schemas by default — users must explicitly extract them to avoid performance issues with large databases.

8. Hive Job: Creating and Populating a Table with Talend Components [01:21:33 - 01:43:38]

A new job Creacion Tabla Hive is created with components: T-Hive Pre-Job, T-Hive Create Table, T-Row Generator, T-Hive Row, and T-Hive Post-Job. The T-Hive Create Table component is configured to create a table clientes stored in TEXTFILE format. The T-Row Generator creates 20 synthetic rows with id (numeric sequence) and nombre (random first name). The T-Hive Row component inserts the data via dynamic SQL: INSERT INTO clientes VALUES (%{id}, '%{nombre}'), where the placeholders are filled from the incoming row. The job executes successfully, confirming the data is written to HDFS.
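
The generator-plus-insert pair can be approximated with a parameterized JDBC statement. A sketch assuming a connection that points at the axa database (as the Talend Hive connection does); the name list is illustrative:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class ClientesLoader {
        // Mirrors the T-Row Generator + T-Hive Row pair: 20 rows of (id, nombre)
        static void loadClientes(Connection conn) throws SQLException {
            String[] nombres = {"Ana", "Luis", "Marta", "Pedro"}; // illustrative names
            String sql = "INSERT INTO clientes VALUES (?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                for (int id = 1; id <= 20; id++) {
                    ps.setInt(1, id);
                    ps.setString(2, nombres[id % nombres.length]);
                    ps.executeUpdate(); // each insert becomes a Hive job writing to HDFS
                }
            }
        }
    }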

9. Reading Hive Data with Talend: T-Hive Input and T-Log Row [01:49:59 - 01:55:28]

The trainer demonstrates reading the clientes table by extracting its schema via the Hive connection, then using a T-Hive Input component to read the table and connect it to a T-Log Row for console output. The trainer notes that schema extraction must be refreshed after new tables are created. The job successfully outputs the 20 inserted records.

10. Big Data Processing with Hive: Aggregation and Bulk Load [01:55:31 - 02:10:42]

A new job Job_Sales_Products_By_Genre is built to process supermarket sales data. Components include: T-HDFS Input (reads supermarket_sales.csv), T-Aggregator Row (groups by gender and product line, sums sales), T-HDFS Output (writes to sales_by_gender.csv), T-Hive Create Table, and T-Hive Load. The schema from the output file is imported into the Hive table definition. The T-Hive Load component is configured to bulk-load the CSV file into the Hive table using the LOAD DATA command. The trainer encounters and resolves a separator mismatch (comma vs. semicolon) and an overwrite issue.
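
The two issues the trainer hits correspond to the table's declared field delimiter and the OVERWRITE clause of LOAD DATA. A sketch of the equivalent HiveQL issued over JDBC; the column names, types, and HDFS path are assumptions for illustration, not taken from the session:

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class SalesByGenderLoad {
        static void createAndLoad(Connection conn) throws SQLException {
            try (Statement stmt = conn.createStatement()) {
                // The delimiter declared here must match the separator of the file
                // produced by T-HDFS Output (comma vs. semicolon was the mismatch in the session)
                stmt.execute("CREATE TABLE IF NOT EXISTS axa.sales_by_gender ("
                        + "gender STRING, product_line STRING, total_sales DOUBLE) "
                        + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ';' "
                        + "STORED AS TEXTFILE");
                // OVERWRITE replaces previous contents instead of appending on re-runs;
                // LOAD DATA INPATH moves the HDFS file into the table's warehouse directory
                stmt.execute("LOAD DATA INPATH '/user/talend/sales_by_gender.csv' "
                        + "OVERWRITE INTO TABLE axa.sales_by_gender");
            }
        }
    }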

11. Spark Integration in Proprietary Talend [02:10:44 - 02:23:40]

The trainer explains that while the open-source Talend version lacks Spark support, the commercial version includes a Spark Configuration component. This component allows users to define a Spark cluster endpoint (e.g., AWS EMR). Jobs are designed conventionally using Talend components, and the tool automatically translates them into Spark code (Java) and submits them to the cluster. The trainer notes that Talend’s commercial version also includes built-in machine learning components, reducing dependency on Python libraries.

12. Version Control and Project Management in Commercial Talend [02:23:40 - 02:32:35]

The trainer contrasts open-source Talend (local file-based) with the commercial version (Git-integrated). Projects are linked to Git repositories via the Talend Management Console (TMC). Users can create branches (forks), version jobs (e.g., 0.1 → 0.2), and push/pull changes. Documentation is auto-generated as HTML. The trainer highlights that component availability (e.g., Hadoop, ML) is license-dependent — users pay to unlock modules.

13. Case Study: Data Warehouse Design with Hive — Dimension: Players [02:34:58 - 03:02:20]

The case study begins with creating a data warehouse in Hive. A folder structure (case_study/bronze/silver/gold) is established. The first dimension, players, is built:

  • A job Job_Dimension_Players is created with T-Hive Pre/Post-Job, T-HDFS Input (reads team_nba.csv), T-Map (filters player_id, name, team), T-Hive Row (eliminates duplicates via player_id), T-HDFS Output (writes players.csv), and T-Hive Load.
  • The HDFS Input schema is imported from the downloaded CSV.
  • The T-Hive Create Table defines players with id, name, team (all string).
  • The T-Hive Load component uses the output file path and table name.
  • Execution confirms data is loaded into Hive. The trainer resolves a null issue caused by incorrect field separator (comma vs. semicolon) and schema mismatch.

14. Case Study: Dimension: Time with Date Parsing and Synthetic Keys [03:03:26 - 03:44:25]

The second dimension, time, is created:

  • A job Job_Dimension_Time uses T-HDFS Input (same team_nba.csv), T-Map (extracts date), T-Hive Row (removes duplicates), T-Hive Create Table, T-HDFS Output, and T-Hive Load.
  • In the T-Map, variables are created: an identifier (sequence number), the year extracted from the date, mes (month), trimestre (quarter, computed as (month-1)/3 + 1), nombre_mes (month name via TalendDate.formatDate), and the date itself (kept for joins); see the sketch after this list.
  • The input date format is converted from dd/MM/yyyy to yyyy-MM-dd to match the format Hive expects for its DATE type.
  • The T-Hive Create Table schema is updated to include date as DATE type.
  • The output file tiempos.csv is generated with correct separator (semicolon) and header.
  • After dropping the table and re-executing, the job successfully loads 20 unique time records into Hive. The trainer resolves a final null issue by ensuring date format consistency between input and Hive schema.
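
The derived fields follow directly from the parsed date. A sketch of the same calculations in plain Java, where java.time stands in for the TalendDate routines used in the job and the sample date is illustrative:

    import java.time.LocalDate;
    import java.time.format.DateTimeFormatter;
    import java.time.format.TextStyle;
    import java.util.Locale;

    public class TimeDimensionRow {
        public static void main(String[] args) {
            // Source dates arrive as dd/MM/yyyy; Hive's DATE type expects yyyy-MM-dd
            String raw = "05/01/2019"; // illustrative value
            LocalDate fecha = LocalDate.parse(raw, DateTimeFormatter.ofPattern("dd/MM/yyyy"));

            int anio = fecha.getYear();                // year extracted from the date
            int mes = fecha.getMonthValue();           // mes
            int trimestre = (mes - 1) / 3 + 1;         // quarter formula from the T-Map
            String nombreMes = fecha.getMonth()
                    .getDisplayName(TextStyle.FULL, new Locale("es")); // nombre_mes
            String fechaHive = fecha.format(DateTimeFormatter.ISO_LOCAL_DATE); // yyyy-MM-dd for Hive

            // Semicolon-separated output, matching the separator used for tiempos.csv
            System.out.printf("%d;%d;%d;%s;%s%n", anio, mes, trimestre, nombreMes, fechaHive);
        }
    }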

15. Finalization, Submission, and Course Wrap-up [03:44:25 - 03:49:58]

The trainer instructs participants to save their Word document with screenshots of the two completed case study points and compress their Talend project folder (workspace/local_project) for submission. Participants are asked to email the document and compressed folder. The trainer confirms receipt of submissions, thanks participants for their engagement, and emphasizes the value of collaborative knowledge transfer. The session concludes with well-wishes and an invitation for future questions.

Appendix

Key Principles

  • AI-Big Data Integration: Use dynamic placeholders (%s) in prompts to inject structured Big Data outputs into AI models. Sanitize data (e.g., replace \n with spaces) to avoid JSON parsing errors.
  • Hive Architecture: Hive uses a metastore (e.g., PostgreSQL) to store schema metadata while data resides in HDFS. SQL queries are translated into MapReduce/Spark jobs.
  • Data Warehouse Design: Dimensions (e.g., Players, Time) are built using star schema principles. Use synthetic keys (e.g., sequence numbers) for surrogate keys since Hive does not enforce primary keys.
  • Date Handling: Always align date formats between the source data, Talend transformations, and Hive table definitions. Use the TalendDate.formatDate and TalendDate.getPartOfDate routines for parsing and extraction.

Tools Used

  • Big Data: Hadoop (HDFS), Hive, Docker (for containerized Hive/PostgreSQL)
  • AI: OpenAI API via custom HTTP integration
  • ETL Tool: Talend Big Data Studio (Open Source and Commercial versions)
  • Version Control: Git (via Talend Management Console in commercial version)
  • Data Formats: CSV, XML, TextFile, ORC (mentioned), Parquet (mentioned)

Common Pitfalls

  • Separator Mismatch: the field delimiter declared for a Hive table (FIELDS TERMINATED BY) must match the separator of the file being loaded; in the session, a comma vs. semicolon mismatch produced null columns. Always verify and align.
  • Schema Not Refreshed: Talend does not auto-refresh Hive table metadata. Always use “Extract Schema” after creating new tables.
  • Null Output: Caused by incorrect date format, mismatched column order, or unescaped special characters in AI prompts.
  • File Path Confusion: T-Hive Load (LOAD DATA) can reference either a local path or an HDFS path; in the session the files resided in HDFS, so the “Local” option had to be disabled. Make sure the toggle matches where the file actually lives.
  • Versioning: In open-source Talend, project files are local. Always compress and back up the entire workspace folder.

Practice Suggestions

  • Recreate the OpenAI integration with a different dataset (e.g., customer reviews → sentiment analysis).
  • Build a third dimension (e.g., products) and create a fact table linking players, time, and sales.
  • Experiment with Hive storage formats: compare performance of TEXTFILE vs. ORC vs. PARQUET.
  • Use Talend’s commercial version (if available) to convert a job to Spark and compare execution time.
  • Practice Git integration: create a fork, modify a job, commit, and merge back.