Data Engineering Integration for Developers

Instructor Led | Data Engineering | 3 Days | Version 10.5.1

Course Overview

Learn to accelerate Data Engineering Integration through mass ingestion, incremental loads, transformations, processing of complex files, creating dynamic mappings, and integrating data science using Python. Optimize Data Engineering system performance through monitoring, troubleshooting, and best practices, and learn how to reuse application logic for Data Engineering use cases.
This course applies to software version 10.5.1.

Objectives

After successfully completing this course, students should be able to:

  • Mass ingest data to Hive and HDFS
  • Perform incremental loads in Mass Ingestion
  • Perform initial and incremental loads
  • Integrate with relational databases using SQOOP
  • Perform transformations across various engines
  • Execute a mapping using JDBC in Spark mode
  • Perform stateful computing and windowing
  • Process complex files
  • Parse hierarchical data on Spark engine
  • Run profiles and choose sampling options on Spark engine
  • Execute Dynamic Mappings
  • Create Audits on Mappings
  • Monitor logs using REST Operations Hub
  • Monitor logs using Log Aggregation and troubleshoot
  • Run mappings in Databricks environment
  • Create mappings to access Delta Lake tables
  • Tune the performance of Spark and Databricks jobs

Target Audience

  • Developer

Prerequisites

Agenda
Module 1: Informatica Data Engineering Integration Overview
  • Data Engineering concepts
  • Data Engineering Integration features
  • Benefits of Data Engineering Integration
  • Data Engineering Integration architecture
  • Data Engineering Integration developer tasks
  • Data Engineering Integration 10.5 new features
Module 2: Ingestion and Extraction in Hadoop
  • Integrating DEI with Hadoop cluster
  • Hadoop file systems
  • Data Ingestion to HDFS and Hive using SQOOP
  • Mass Ingestion to HDFS and Hive – Initial load
  • Mass Ingestion to HDFS and Hive – Incremental load
  • Lab: Configure SQOOP for processing data between an Oracle database and HDFS
  • Lab: Configure SQOOP for processing data between an Oracle database and Hive
  • Lab: Creating Mapping Specifications using Mass Ingestion Service 
Module 3: Native and Hadoop Engine Strategy
  • DEI engine strategy
  • Hive Engine architecture
  • MapReduce
  • Tez
  • Spark architecture
  • Blaze architecture
  • Lab: Executing a mapping in Spark mode
  • Lab: Connecting to a Deployed Application
Module 4: Data Engineering Development Process
  • Advanced Transformations in DEI – Python, Update Strategy, and Macro
  • Hive ACID Use Case
  • Stateful Computing and Windowing
  • Lab: Creating a Reusable Python Transformation
  • Lab: Creating an Active Python Transformation
  • Lab: Performing Hive Upserts
  • Lab: Using Windowing Function LEAD
  • Lab: Using Windowing Function LAG
  • Lab: Creating a Macro Transformation
Module 5: Complex File Processing
  • Data Engineering file formats – Avro, Parquet, JSON
  • Complex file data types – Structs, Arrays, Maps
  • Complex Configuration, Operators and Functions
  • Lab: Converting Flat File data object to an Avro file
  • Lab: Using complex data types - Arrays, Structs, and Maps in a mapping
Module 6: Hierarchical Data Processing
  • Hierarchical Data Processing
  • Flatten Hierarchical Data
  • Dynamic Flattening with Schema Changes
  • Hierarchical Data Processing with Schema Changes
  • Complex Configuration, Operators and Functions
  • Dynamic Ports
  • Dynamic Input Rules
  • Lab: Flattening a complex port in a Mapping
  • Lab: Building dynamic mappings using dynamic ports
  • Lab: Building dynamic mappings using input rules
  • Lab: Performing Dynamic Flattening of complex ports
  • Lab: Parsing Hierarchical Data on the Spark Engine
Module 7: Mapping Optimization and Performance Tuning
  • Validation Environments
  • Execution Environment
  • Mapping Optimization
  • Mapping Recommendations and Insight
  • Scheduling, Queuing, and Node Labeling
  • Mapping Audits
  • Lab: Implementing Recommendation
  • Lab: Implementing Insight
  • Lab: Implementing Mapping Audits
Module 8: Monitoring Logs and Troubleshooting in Hadoop
  • Hadoop Environment Logs
  • Spark Engine Monitoring
  • Blaze Engine Monitoring
  • REST Operations Hub
  • Log Aggregator
  • Troubleshooting
  • Lab: Monitoring Mappings using REST Operations Hub
  • Lab: Viewing and analyzing logs using Log Aggregator
Module 9: Intelligent Structure Model
  • Intelligent Structure Discovery Overview
  • Intelligent Structure Model
  • Lab: Use an Intelligent Structure Model in a Mapping
Module 10: Databricks Overview
  • Databricks overview
  • Steps to configure Databricks
  • Databricks clusters
  • Notebooks, Jobs, and Data
  • Delta Lake
Module 11: Databricks Integration
  • Databricks Integration
  • Components of the Informatica and Databricks environments
  • Run-time process on the Databricks Spark Engine
  • Databricks Integration Task Flow
  • Prerequisites for Databricks integration
  • Cluster Workflows


