Big Data Essentials Bootcamp (L: ENG)

3,200.00  excl. VAT

Duration: 5 days;

Instructor: TBD;

Delivery dates: Upon request;

Location: TBD.

 

Description

Big Data requires the right tools and skills. This workshop takes you “from zero to hero” by providing the necessary working knowledge of Hadoop, Spark, and NoSQL. With these three fundamentals, you will be able to build systems that process massive amounts of data in archival, batch, interactive, and finally real-time modes. The workshop also lays the foundations for analytics, so that you can extract insights from your data.

What You Will Learn:

  • Hadoop: HDFS, MapReduce, Pig, Hive
  • Spark: Spark core, SparkSQL, Spark Java API, Spark Streaming
  • NoSQL: Cassandra/HBase architecture, Java API, drivers, data modeling

Format: lectures (50%) and hands-on labs (50%).

Prerequisites:

  • comfortable with the Java programming language (most programming exercises are in Java)
  • comfortable in a Linux environment (able to navigate the Linux command line and edit files using vi / nano)

Lab environment:

Zero Install: There is no need to install Hadoop, Spark, or other software on students’ machines! Working clusters and environments will be provided.

Students will need the following:

  • an SSH client (Linux and Mac already have SSH clients; for Windows, PuTTY is recommended)
  • a browser to access the cluster

Course Outline:

Introduction to Hadoop
Hadoop history, concepts
ecosystem
distributions
High-level architecture
Hadoop myths
Hadoop challenges
hardware / software

HDFS Overview
concepts (horizontal scaling, replication, data locality, rack awareness)
architecture (NameNode, Secondary NameNode, DataNode)
data integrity
future of HDFS: NameNode HA, Federation
lab exercises

MapReduce Overview
MapReduce concepts
phases: driver, mapper, shuffle/sort, reducer
thinking in MapReduce
future of MapReduce (YARN)
lab exercises

Pig
Pig vs Java vs MapReduce
Pig Latin language
user-defined functions
understanding Pig job flow
basic data analysis with Pig
complex data analysis with Pig
multi datasets with Pig
advanced concepts
lab exercises

Hive
Hive concepts
architecture
data types
Hive data management
Hive vs SQL
lab exercises

Spark

Spark Basics
Background and history
Spark and Hadoop
Spark concepts and architecture
Spark ecosystem (Core, Spark SQL, MLlib, Streaming)
First look at Spark
Spark in local mode
Spark web UI
Spark shell
Analyzing dataset – part 1
Inspecting RDDs

RDDs In Depth
Partitions
RDD Operations / transformations
RDD types
MapReduce on RDD
Caching and persistence
Sharing cached RDDs

Spark API programming
Introduction to Spark API / RDD API
Submitting the first program to Spark
Debugging / logging
Configuration properties

Spark Streaming

Streaming overview
Streaming operations
Sliding window operations
Writing Spark Streaming applications
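
As a taste of the sliding window topic above, here is a minimal sketch in plain Java. It does not use the Spark Streaming API; the class name `SlidingWindowSketch`, the method `windowedSums`, and the sample counts are all hypothetical and chosen for illustration only. It assumes a stream has already been cut into batches with one count per batch, and it emits a sum only for full windows.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: plain Java, not the Spark Streaming API.
public class SlidingWindowSketch {

    // Sums per-batch counts over windows that span `windowLength` batches,
    // moving forward `slide` batches at a time. Only full windows are emitted.
    public static List<Integer> windowedSums(int[] batchCounts, int windowLength, int slide) {
        List<Integer> sums = new ArrayList<>();
        for (int end = windowLength; end <= batchCounts.length; end += slide) {
            int sum = 0;
            for (int i = end - windowLength; i < end; i++) {
                sum += batchCounts[i];
            }
            sums.add(sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // Made-up event counts for six consecutive batches.
        int[] batchCounts = {4, 1, 3, 2, 5, 0};
        // Window of 3 batches, sliding by 2 batches.
        System.out.println(windowedSums(batchCounts, 3, 2)); // prints [8, 10]
    }
}
```

The two parameters mirror the usual window length and slide interval, both expressed here as a number of batches.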

NoSQL

Introduction to Big Data / NoSQL
NoSQL overview
CAP theorem
When is NoSQL appropriate
NoSQL ecosystem

Cassandra Basics
Cassandra nodes, clusters, datacenters
Keyspaces, tables, rows and columns
Partitioning, replication, tokens
Quorum and consistency levels
Labs
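
The arithmetic behind the quorum and consistency level topic can be sketched in a few lines of plain Java. This is an illustration under stated assumptions, not driver code: the class `QuorumSketch` and its methods are hypothetical names, and a single datacenter with one replication factor is assumed. A quorum is floor(RF / 2) + 1 replicas, and a read is guaranteed to overlap the latest write when the read and write replica counts add up to more than the replication factor.

```java
// Hypothetical illustration only: plain Java arithmetic, no Cassandra driver involved.
public class QuorumSketch {

    // Quorum size for a replication factor: floor(RF / 2) + 1.
    public static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // Read and write replica sets must intersect when R + W > RF.
    public static boolean overlaps(int readReplicas, int writeReplicas, int replicationFactor) {
        return readReplicas + writeReplicas > replicationFactor;
    }

    public static void main(String[] args) {
        int rf = 3; // assumed replication factor
        int q = quorum(rf);
        System.out.println("RF=" + rf + " quorum=" + q);                          // RF=3 quorum=2
        System.out.println("QUORUM read + QUORUM write: " + overlaps(q, q, rf)); // true
        System.out.println("ONE read + ONE write: " + overlaps(1, 1, rf));       // false
    }
}
```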

Cassandra drivers
Introduction to Java driver
CRUD (Create / Read / Update / Delete) operations using Java client
Asynchronous queries
Labs

Data Modeling – part 1
Introduction to CQL
CQL data types
Creating keyspaces & tables
Choosing columns and types
Choosing primary keys
Data layout for rows and columns
Time to live (TTL), create, insert, update
Querying with CQL
CQL updates
Labs

Data Modeling – part 2
Creating and using secondary indexes
Denormalization and join avoidance
Composite keys (partition keys and clustering keys)
Time series data
Best practices for time series data
Counters
Lightweight transactions (LWT)

Data Modeling Labs: Group design sessions
multiple use cases from various domains are presented
students work in groups to come up with designs and models
discuss various designs, analyze decisions
Lab: implement ‘Netflix’ data models, generate data