Big Data with Hadoop/Spark (English)

This course is no longer in our catalog.

Processing and analyzing big data on a distributed cluster is widespread in industry and one of the most sought-after skills. This course is a 3-day intensive introduction to big data with Apache Hadoop and Spark. Participants gain an understanding of the insights big data can deliver through hands-on experience with tools and systems used by big data and machine learning engineers. By the end of the course, participants will be able to build a complete end-to-end data pipeline, from data ingestion and storage to data processing and analysis. Topics covered include HDFS, Hive, Impala, Sqoop, MapReduce, HBase, Spark and SparkML. All hands-on exercises are conducted in Python and Shell, so some prior experience is advised. Participants will have an opportunity to run code on a real Hadoop/Spark cluster.

Objective

This course gives an introduction to big data technologies using the Cloudera stack (Cloud/On-Prem). The Cloudera stack was chosen for its ease of use and development. Because the infrastructure is already in place, students can focus on the concepts and on applying them in lab exercises.

Target audience

The course is intended for early-career professionals in Big Data and Business Analytics.

Prerequisites

The course "Introductie in Python – Mogelijkheden en Code Begrijpen", or equivalent knowledge.
Prior programming experience is not required, but Python basics are recommended. Instructions and code samples will be provided by the instructor.

Additional information

The course material for this course is largely digital. To make the best use of it, we recommend bringing your own device (laptop, tablet). This is only a recommendation, not a requirement.

Topics

Day 1: Big Data Fundamentals
- Understanding Big Data:
  - Definitions
  - The V’s of Big Data
  - Sources of Big Data
  - Types of Big Data: Structured / Unstructured / Semi structured
- Applications: Examples from Retail / Financial Services / Healthcare / Manufacturing
- Overview of (Big) Data Technologies (storage models)
  - relational, e.g. MySQL
  - key-value, e.g. Redis, DynamoDB
  - columnar, e.g. HBase
  - document, e.g. MongoDB
  - graph, e.g. Neo4j
  - time series, e.g. InfluxDB
  - factors to consider when selecting a (big) data store
- Introduction to Hadoop
  - Scaling: Vertical vs Horizontal
  - Origins of Hadoop: Google File System and MapReduce
  - Hadoop Landscape and Components
  - Hadoop Distributions
  - Hadoop in the Cloud
- HDFS:
  - Design of HDFS
  - Storing and Reading Files in HDFS
  - Fault Tolerance and Replication
  - HDFS Storage Options: File Formats (CSV/TXT/Parquet/Avro) / Row vs Columnar / Compression and Serialization: what they are and how they work
- HDFS Schema Design:
  - Location
  - Partitioning
  - Bucketing
- Lab: Working with HDFS: Technical Commands
- AWS S3 (Object store)
  - Genealogy and Design
  - Design considerations
  - Comparison between S3 and HDFS
- Lab: Interacting with S3 on AWS
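
The partitioning and bucketing topics under HDFS Schema Design can be illustrated with a minimal sketch. It assumes the common Hive-style convention of key=value directory names for partitions and a hash-modulo rule for buckets. The base path, table name, and function names are hypothetical, and CRC32 is only a stand-in for whatever hash function a real engine uses.

```python
import zlib
from datetime import date


def partition_path(base, table, day):
    """Build a Hive-style partition directory using key=value path segments."""
    return f"{base}/{table}/year={day.year}/month={day.month:02d}/day={day.day:02d}"


def bucket_for(key, num_buckets):
    """Assign a key to a bucket by hashing it and taking the remainder.

    CRC32 is an illustrative stand-in; real engines use their own hash functions.
    """
    return zlib.crc32(key.encode("utf-8")) % num_buckets


if __name__ == "__main__":
    # Hypothetical warehouse location and table name.
    print(partition_path("/data/warehouse", "sales", date(2024, 1, 15)))
    # The same key always lands in the same bucket.
    print(bucket_for("customer-42", 8))
```

Partitioning lets a query skip whole directories (for example, every month except one), while bucketing spreads rows with the same key into a fixed number of files, which can help with joins and sampling.
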
Day 2: Big Data Fundamentals
- MapReduce (Quick Conceptual Overview)
  - What is it?
  - Processing Data with MapReduce (The Algorithm)
  - A Word Count Example (in Python rather than Java)
  - Brief explanation of YARN
  - Introduction to Spark
- Hive (Detailed Overview)
  - Introduction
  - Architecture
  - Different Hive Query Engines (MR/Tez/Spark)
  - Data Flow in Hive
  - MapReduce flow in Hive
- Lab: Load Data on HDFS. Create Tables on HDFS. Querying SQL (Joins etc)
- Pig (Quick Mention/Overview)
  - Introduction
  - Architecture
  - Data Flow in Pig
  - MapReduce flow in Pig
- Impala (Detailed Overview)
  - Introduction
  - Architecture
  - Data Flow in Impala
- Lab: Load Data on HDFS. Create Tables on Impala. Querying SQL (Joins etc)
- HBase (Detailed Overview)
  - Genealogy (built on HDFS) and Architecture
  - Schema design of HBase
  - Differences between HBase and MySQL
  - Interacting with HBase using the shell
  - Retrieving data using the HBase Shell and REST API (with a brief explanation of what an API is)
- Lab: Setting up an HBase Table, loading data, retrieving data
- Summary:
  - Comparison of Hive / Pig / Impala / HBase
  - When to use which?
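
The word count example listed under MapReduce can be sketched in plain Python to show the three conceptual phases (map, shuffle, reduce). This is an illustrative single-machine sketch, not the course's lab code and not how Hadoop runs jobs. The function names are assumptions chosen for readability.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data with Hadoop", "big data with Spark"]
    print(word_count(sample))
```

On a cluster, the map and reduce functions run in parallel on different nodes and the shuffle moves data over the network. Here everything runs in one process so the data flow is easy to follow.
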

Day 3: Big Data Fundamentals
- Spark:
  - Overview
  - Key concepts and ideas
  - Difference between Hadoop MapReduce and Spark
  - SparkSQL
- Lab: PySpark exercise using DataFrames and SparkSQL
- SparkML:
  - Quick overview
  - Running cluster analysis on PySpark
  - Other components within a Cluster: Sqoop / Tour of Ambari or Cloudera Manager / Oozie / Solr
- Lab: Moving Data into and out of HDFS
- Real Life Case Application Architecture (End to End Pipeline)
- Wrap Up

Reviews

"The training was fine; I got good tips, with the occasional joke along the way. The location was fine and well catered with coffee/tea, fruit and biscuits. The staff were also very friendly. Lunch was perfect and very extensive."

9

"The course was good, and so was the catering! I learned a lot from it! The location in Nieuwegein is easy to reach by public transport, which is convenient. Until next time."

10

"I found the training very instructive. The content was of a high level and the instructor knew the subject well. I especially appreciate that the concepts were covered in depth."

9
