Processing and analyzing big data over a distributed cluster is now widespread in industry, and it is one of the most sought-after skills. This course is a three-day intensive introduction to big data with Apache Hadoop and Spark. Participants gain an understanding of the insights big data can deliver through hands-on experience with the tools and systems used by big data and machine learning engineers. By the end of the course, participants will be able to build a complete end-to-end data pipeline, from data ingestion and storage through data processing and analysis. Topics covered include HDFS, Hive, Impala, Sqoop, MapReduce, HBase, Spark, and SparkML. All hands-on exercises are conducted in Python and Shell, so some prior experience with both is advised. Participants will have an opportunity to
run code on a real Hadoop/Spark cluster.
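As a taste of the MapReduce programming model covered in the course, here is a minimal word-count sketch in pure Python. It is an illustration only, not course material: it mimics the map, shuffle, and reduce phases locally, without a Hadoop cluster, and the function names (`map_phase`, `shuffle`, `reduce_phase`) are illustrative.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Mapper: emit a (word, 1) pair for each word in the line
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reducer: sum the counts for each word
    return (key, sum(values))

lines = ["big data with Hadoop", "big data with Spark"]
mapped = chain.from_iterable(map_phase(line) for line in lines)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'big': 2, 'data': 2, 'with': 2, 'hadoop': 1, 'spark': 1}
```

On a real cluster the same logic runs in parallel across many machines, with the framework handling the shuffle over the network; the course exercises use the actual Hadoop and Spark APIs rather than this toy version.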