Online & Classroom

Data Pre-Processing with Spark and Hadoop Training Course

Duration: 3 Months
38 Modules
Get In Touch

Key Features

OSACAD is committed to bringing you the best learning experience, with high-standard features including:

Real-time Practice Labs

We believe in learning by doing: state-of-the-art labs facilitate competent, hands-on training.

Physical & Virtual Online Classrooms

Providing the flexibility to learn from our classrooms or from anywhere you wish in these turbulent times.

24/7 Support On Slack

Technical or otherwise, we give you round-the-clock assistance with every challenge you face.

Job & Interview Assistance

Guidance at every step, until you get placed in your dream job.

Live projects with our industry partners

An inside look & feel at industry environments by handling real-time projects.

Internship after course

An opportunity to prove your talent as an intern at our partner firms, and a pathway to permanent roles.

Why Data Pre-Processing with Spark and Hadoop?

Support for Multiple Languages

Using Spark, developers can write data-analysis jobs in Java, Scala, or Python, drawing on a set of more than 80 high-level operators.
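
These operators follow a familiar functional style (map, filter, reduce) regardless of language. As a plain-Python illustration of the same chained-transformation style (made-up order data; in PySpark the identically named RDD methods apply), not a Spark program itself:

```python
from functools import reduce

# Hypothetical order items: (order_id, subtotal)
order_items = [(1, 199.99), (1, 250.0), (2, 129.99), (1, 49.98), (2, 299.95)]

# Spark-style pipeline in plain Python: filter, then map, then reduce.
# In PySpark the same chain would be rdd.filter(...).map(...).reduce(...).
revenue_order_1 = reduce(
    lambda a, b: a + b,                      # reduce: sum the subtotals
    map(lambda item: item[1],                # map: keep only the subtotal
        filter(lambda item: item[0] == 1,    # filter: keep order_id 1
               order_items)))

print(round(revenue_order_1, 2))  # → 499.97
```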

Apache Spark Compatibility with Hadoop

Apache Spark is fully compatible with Hadoop's Distributed File System (HDFS), as well as with other Hadoop components such as YARN (Yet Another Resource Negotiator) and the HBase distributed database.

Alternative to MapReduce

Spark provides a faster alternative to MapReduce by keeping intermediate data in memory, and Spark Streaming executes jobs as short micro-batches spaced five seconds or less apart.
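
As a rough illustration of micro-batching, the sketch below buckets a made-up stream of timestamped events into 5-second batches and counts the events in each; Spark Streaming does the equivalent continuously over live data:

```python
from collections import Counter

# Hypothetical event stream: (timestamp_seconds, event)
events = [(0.5, "a"), (1.2, "b"), (4.9, "c"), (5.1, "d"), (9.9, "e"), (10.0, "f")]

BATCH_SECONDS = 5  # micro-batch interval, as described above

# Assign each event to its 5-second micro-batch and count events per batch.
batches = Counter(int(ts // BATCH_SECONDS) for ts, _ in events)

print(dict(batches))  # → {0: 3, 1: 2, 2: 1}
```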

Who Is This Program For?

  • Data Scientists, Data Engineers, Data Analysts
  • BI Professionals, Research Professionals, Software Architects
  • Software Developers, Testing Professionals
  • Anyone looking to upgrade their Big Data skills


Best-in-class content by leading faculty and industry leaders in the form of videos,
cases and projects, assignments and live sessions.

  • Using labs for preparation
  • Setup Development Environment (Windows 10) - Introduction
  • Setup Development Environment - Python and Spark - Prerequisites
  • Setup Development Environment - Python Setup on Windows
  • Setup Development Environment - Configure Environment Variables
  • Setup Development Environment - Setup PyCharm for developing Python application
  • Setup Development Environment - Pass run time arguments or parameters
  • Setup Development Environment - Download Spark compressed tar ball
  • Setup Development Environment - Install 7z to uncompress and untar on Windows
  • Setup Development Environment - Setup Spark
  • Setup Development Environment - Install JDK
  • Setup Development Environment - Configure environment variables for Spark
  • Setup Development Environment - Install WinUtils - integrate Windows and HDFS
  • Setup Development Environment - Integrate PyCharm and Spark on Windows
  • Introduction and Setting up Python
  • Basic Programming Constructs
  • Functions in Python
  • Python Collections
  • Map Reduce operations on Python Collections
  • Setting up Data Sets for Basic I/O Operations
  • Basic I/O operations and processing data using Collections
  • Get revenue for given order id - as application
  • Setup Environment - Options
  • Setup Environment - Locally
  • Setup Environment - using Cloudera Quickstart VM
  • Using Itversity platforms - Big Data Developer labs and forum
  • Using itversity's big data lab
  • Using Windows - PuTTY and WinSCP
  • Using Windows - Cygwin
  • HDFS Quick Preview
  • YARN Quick Preview
  • Setup Data Sets
  • Introduction
  • Introduction to Spark
  • Setup Spark on Windows
  • Quick overview about Spark documentation
  • Connecting to the environment
  • Initializing Spark job using pyspark
  • Create RDD from HDFS files
  • Create RDD from collection - using parallelize
  • Read data from different file formats - using sqlContext
  • Row level transformations - String Manipulation
  • Row Level Transformations - map
  • Row Level Transformations - flatMap
  • Filtering data using filter
  • Joining Data Sets - Introduction
  • Joining Data Sets - Inner Join
  • Joining Data Sets - Outer Join
  • Aggregations - Introduction
  • Aggregations - count and reduce - Get revenue for order id
  • Aggregations - reduce - Get order item with minimum subtotal for order id
  • Aggregations - countByKey - Get order count by status
  • Aggregations - understanding combiner
  • Aggregations - groupByKey - Get revenue for each order id
  • groupByKey - Get order items sorted by order_item_subtotal for each order id
  • Aggregations - reduceByKey - Get revenue for each order id
  • Aggregations - aggregateByKey - Get revenue and count of items for each order id
  • Sorting - sortByKey - Sort data by product price
  • Sorting - sortByKey - Sort data by category id and then by price descending
  • Ranking - Introduction
  • Ranking - Global Ranking using sortByKey and take
  • Ranking - Global using takeOrdered or top
  • Ranking - By Key - Get top N products by price per category - Introduction
  • Ranking - By Key - Get top N products by price per category - Python collection
  • Ranking - By Key - Get top N products by price per category - using flatMap
  • Ranking - By Key - Get top N priced products - Introduction
  • Ranking - By Key - Get top N priced products - using Python collections API
  • Ranking - By Key - Get top N priced products - Create Function
  • Ranking - By Key - Get top N priced products - integrate with flatMap
  • Set Operations - Introduction
  • Set Operations - Prepare data
  • Set Operations - union and distinct
  • Set Operations - intersect and minus
  • Saving data into HDFS - text file format
  • Saving data into HDFS - text file format with compression
  • Saving data into HDFS using Data Frames - json
  • Problem Statement
  • Launching pyspark
  • Reading data from HDFS and filtering
  • Joining orders and order_item
  • Aggregate to get daily revenue per product id
  • Load products and convert into RDD
  • Join and sort the data
  • Save to HDFS and validate in text file format
  • Saving data in avro file format
  • Get data to local file system using get or copyToLocal
  • Develop as application to get daily revenue per product
  • Run as application on the cluster
  • Different interfaces to run SQL - Hive, Spark SQL
  • Create database and tables of text file format - orders and order_items
  • Create database and tables of ORC file format - orders and order_items
  • Running SQL/Hive Commands using pyspark
  • Functions - Getting Started
  • Functions - String Manipulation
  • Functions - Date Manipulation
  • Functions - Aggregate Functions in brief
  • Functions - case and nvl
  • Row level transformations
  • Joining data between multiple tables
  • Group by and aggregation
  • Sorting the data
  • Set operations - union and union all
  • Analytics functions - aggregations
  • Analytics functions - ranking
  • Windowing functions
  • Creating Data Frames and register as temp tables
  • Write Spark Application - Processing Data using Spark SQL
  • Write Spark Application - Saving Data Frame to Hive tables
  • Data Frame Operations
  • Introduction to Setting up Environment for Practice
  • Overview of ITVersity Boxes GitHub Repository
  • Creating Virtual Machine
  • Starting HDFS and YARN
  • Gracefully Stopping Virtual Machine
  • Understanding Datasets provided in Virtual Machine
  • Using GitHub Content for the practice
  • Using Resources for Practice
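
Several of the ranking exercises above ("Get top N products by price per category") are first solved with plain Python collections before moving to flatMap on RDDs. A minimal sketch with made-up product data, using only the standard library:

```python
import heapq
from itertools import groupby

# Hypothetical products: (category_id, product_name, price)
products = [
    (1, "keyboard", 29.99), (1, "mouse", 9.99), (1, "monitor", 199.99),
    (2, "helmet", 49.99), (2, "gloves", 14.99),
]

def top_n_by_category(records, n):
    # groupby requires its input sorted by the grouping key (category_id)
    records = sorted(records, key=lambda r: r[0])
    result = {}
    for category, items in groupby(records, key=lambda r: r[0]):
        # heapq.nlargest keeps the n highest-priced products per category
        result[category] = heapq.nlargest(n, items, key=lambda r: r[2])
    return result

print(top_n_by_category(products, 2))
# category 1 → monitor then keyboard; category 2 → helmet then gloves
```

The same per-key top-N logic, wrapped in a function like this, is what gets integrated with flatMap in the RDD version of the exercise.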

Apache Spark 3.x - Data Processing - Getting Started

  • Introduction
  • Review of Setup Steps for Spark Environment
  • Using ITVersity labs
  • Apache Spark Official Documentation (Very Important)
  • Quick Review of Spark APIs
  • Spark Modules
  • Spark Data Structures - RDDs and Data Frames
  • Develop Simple Application
  • Apache Spark - Framework

Apache Spark 3.x - Data Frame and Predefined Functions

  • Introduction
  • Data Frames - Overview
  • Create Data Frames from Text Files
  • Create Data Frames from Hive Tables
  • Create Data Frames using JDBC
  • Data Frame Operations - Overview
  • Spark SQL - Overview
  • Overview of Functions to manipulate data in Data Frame fields or columns
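
The predefined string and date functions covered above (Spark SQL's substring, to_timestamp, date_format, and similar) have close analogues in Python's standard library. A plain-Python sketch on a made-up order timestamp, to show what these functions compute:

```python
from datetime import datetime

# Hypothetical order record timestamp in the retail dataset's style
order_date = "2013-07-25 00:00:00.0"

# Spark SQL to_timestamp ≈ datetime.strptime
ts = datetime.strptime(order_date, "%Y-%m-%d %H:%M:%S.%f")

# Spark SQL date_format(ts, 'yyyyMM') ≈ strftime("%Y%m")
month_key = ts.strftime("%Y%m")

# Spark SQL substring(order_date, 1, 10) ≈ Python slicing
day = order_date[:10]

print(month_key, day)  # → 201307 2013-07-25
```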

Apache Spark 3.x - Processing Data using Data Frames - Basic Transformations

  • Define Problem Statement - Get Daily Product Revenue
  • Selection or Projection of Data in Data Frames
  • Filtering Data from Data Frames
  • Joining multiple Data Frames
  • Perform Aggregations using Data Frames
  • Sorting Data in Data Frames
  • Development Life Cycle using Data Frames
  • Run applications using Spark Submit
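
The development life cycle above revolves around the "Get Daily Product Revenue" problem: join order_items to orders, then aggregate by date and product. A plain-Python sketch of the same pipeline on made-up rows (in Spark this would be a join followed by groupBy and agg on Data Frames):

```python
from collections import defaultdict

# Hypothetical rows mirroring the orders and order_items tables
orders = {1: "2013-07-25", 2: "2013-07-25", 3: "2013-07-26"}  # order_id → date
order_items = [  # (order_id, product_id, subtotal)
    (1, 100, 199.99), (1, 200, 50.0), (2, 100, 399.98), (3, 100, 199.99),
]

# Join order_items to orders on order_id, then aggregate revenue per
# (date, product_id) — the same shape as:
# join → groupBy("order_date", "product_id") → sum("subtotal") → orderBy(...)
daily_revenue = defaultdict(float)
for order_id, product_id, subtotal in order_items:
    daily_revenue[(orders[order_id], product_id)] += subtotal

for key in sorted(daily_revenue):
    print(key, round(daily_revenue[key], 2))
```
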
Hours of Content
Case Study & Projects
Live Sessions
Coding Assignments
Capstone Projects to Choose From
Tools, Languages & Libraries

Languages and Tools covered


Hands On Projects

Airline Dataset Analysis

Perform basic big data analysis on an airline dataset using big data tools - Pig, Hive, and Impala.

Data Analysis and Visualisation using Spark and Zeppelin

In this big data project, we will explore Apache Zeppelin: writing code, taking notes, building charts, and sharing results, all in a single data analytics environment using Hive, Spark, and Pig.

Big Data Project on Processing Unstructured Data using Spark

In this project, we will evaluate and demonstrate how to handle unstructured data using Spark


Our training is based on the latest cutting-edge infrastructure and technology, which makes you industry-ready. Osacad will present this certificate to students or employee trainees upon successful completion of the course; it will strengthen a trainee's resume and open up opportunities beyond their current position.

Enroll Now
Learn From Home

First-Ever Hybrid Learning System

Enjoy the flexibility of selecting online or offline classes with Osacad's first-ever hybrid learning model.
Choose between the privilege of learning from home and the advantage of one-on-one,
in-person learning - all in one place.

Learn From Home

Learn from Home

Why leave the comfort and safety of your home when you can access our courses right at your fingertips? Gear up to upskill yourself from home with Osacad online courses.

Learn From Home

Learn from Classroom

Enjoy a high-tech, face-to-face learning experience with esteemed professional educators at Osacad. Our well-equipped, safe, and secure classrooms are waiting to welcome you on board!

Our Alumni Work At



Artificial Intelligence is a global company headquartered in Chicago, USA. It has partnered with GamaSec, a leading cyber security product company, and is focused on building cyber security awareness and skills in India, where demand in consulting and product support is predicted to grow exponentially over the next 3 years. The Artificial Intelligence training programs are conducted by individuals with in-depth domain experience. These training sessions will equip you with the fundamental knowledge and skills required to be a professional cyber security consultant.

All graduates of commerce, law, science and engineering who want to build a career in cyber security can take this training.

There are a number of courses, which are either 3 or 6 months long. To become a cyber security consultant, we recommend at least 6 to 9 months of training followed by 6 months of actual project work. During project work, you will work under a mentor and experience real-life customer scenarios.

You can get started by enrolling yourself. Enrollment can be initiated from this website by clicking on "ENROLL NOW". If you have questions or difficulties, you can talk to our counselors, who will be happy to help.

Once you enroll with us you will receive access to our Learning Center. All online classrooms, recordings, assignments, etc. can be accessed here.

Get in touch with us

What you gain from this program
  • Master the concepts of the Apache Spark framework & development
  • In-depth exercises and real-time projects on Apache Spark
  • Learn about Apache Spark Core, Spark Internals, RDDs, Spark SQL, etc.
  • Get comprehensive knowledge of the Scala programming language

I’m Interested

Related Courses


Python

Python is a powerful general-purpose programming language. It is used in web development, data science, creating software... Read More

R For Data Analysis

R is a programming language that is designed and used mainly in the statistics, data science, and scientific communities. R has... Read More

Python For Data Analysis

Python for Data Analysis is concerned with the nuts and bolts of manipulating, processing, cleaning, and crunching data in.... Read More


Tableau

Tableau is a powerful and fast-growing data visualization tool used in the Business Intelligence industry. It helps in simplifying... Read More

Our Locations

4th floor, Khajaguda Main Road, next to Andhra Bank, near DPS, Khajaguda, Gachibowli, Hyderabad, Telangana 500008

Madhapur (Headquarters, Hyderabad)

Plot No. 430, Sri Ayyappa Society, Khanamet, Madhapur, Hyderabad-500081


Uptown Cyberabad Building, Block-C, 1st Floor Plot – 532 & 533, 100 Feet Road Sri Swamy Ayyappa Housing Society, Madhapur, Hyderabad, Telangana 500081


5999 S New Wilke Rd, Bldg 3, #308 Rolling Meadows, IL 60008

Call Us