RDD Transformations and Actions APIs in Apache Spark

1. Objective

This Spark API guide explains the important APIs of Apache Spark. The tutorial describes the transformations and actions used to process data in Spark. Spark is a next-generation Big Data tool; to learn more about Apache Spark, follow this introductory guide.

2. Transformation

Transformations build a new RDD (Resilient Distributed Dataset) from an existing RDD with the help of operations like filter, map, flatMap, etc. Transformations are lazy operations on an RDD, i.e. they do not execute immediately; they are only executed once an action is called. Transformations are functions that take an RDD as input and produce one or many "new" output RDDs.
The resulting RDD is always different from its parent RDD, and it can be smaller, bigger, or of the same size. To improve the performance of computations, transformations are pipelined, which is an optimization technique.

2.1. Map:

It passes each element of the RDD through a user-defined function and returns a new dataset formed from the results, i.e. the function is applied to each row/item of the RDD. The size of the input and output remains the same.
One -> one; size of output = size of input; one element in -> one element out.
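For example, a minimal sketch in Scala (assuming a SparkContext named sc, as provided by spark-shell, and made-up sample data):
// map applies the function to every element; the result has the same number of elements.
val nums = sc.parallelize(Seq(1, 2, 3, 4))
val squares = nums.map(x => x * x)
squares.collect()   // Array(1, 4, 9, 16)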

2.2. FlatMap:

It does a similar job to map, but the difference is that flatMap returns a list of elements (0 or more) for each input element, and the output of flatMap is flattened. The function in flatMap returns a list, array, or sequence of elements.
One -> many; one element in -> 0 or more elements out, so the output can be larger or smaller than the input.
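A small illustrative sketch (same assumptions: an existing SparkContext sc and made-up data):
// Each line yields zero or more words; the results are flattened into a single RDD.
val lines = sc.parallelize(Seq("spark is fast", "rdd api"))
val words = lines.flatMap(line => line.split(" "))
words.collect()   // Array(spark, is, fast, rdd, api)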

2.3. Filter:

It returns a new dataset formed by selecting those elements of the source on which the function returns true. It returns only the elements that satisfy a predicate; a predicate is a function that accepts a parameter and returns a Boolean value, either true or false. It keeps only the elements that pass/satisfy the condition and filters out those that don't, so the new RDD is the set of elements for which the function returns true.
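For example (a sketch, assuming an existing SparkContext sc):
// Keep only the elements for which the predicate returns true.
val nums = sc.parallelize(1 to 10)
val evens = nums.filter(x => x % 2 == 0)
evens.collect()   // Array(2, 4, 6, 8, 10)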

2.4. MapPartitions:

It runs once per partition (block) of the RDD, so the function must take an Iterator[T] and return an iterator of results. It can improve performance by reducing the object creation that would otherwise happen for every element in map.
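A minimal sketch (assuming sc; the per-partition setup shown is purely illustrative):
val nums = sc.parallelize(1 to 10, 2)   // 2 partitions
// The function receives an Iterator over a whole partition and returns an Iterator;
// setup work here runs once per partition instead of once per element.
val result = nums.mapPartitions { iter =>
  val factor = 10   // stands in for per-partition setup, e.g. opening a connection
  iter.map(_ * factor)
}
result.collect()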

2.5. MapPartitionsWithIndex:

It is similar to mapPartitions, with one difference: its function takes two parameters, where the first parameter is the index of the partition and the second is an iterator over all the items within that partition (Int, Iterator[T]).
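For example (a sketch, assuming sc):
val nums = sc.parallelize(1 to 6, 3)
// The function gets the partition index plus an iterator over that partition's items.
val tagged = nums.mapPartitionsWithIndex { (index, iter) =>
  iter.map(x => s"partition $index -> $x")
}
tagged.collect()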

2.6. Union:

It performs the standard union operation. It is the same as the ++ operator: it returns a new RDD containing the elements of this RDD together with those of the other RDD.

2.7. Distinct:

Returns a new dataset containing the unique elements of the source RDD, i.e. duplicate values are removed.

2.8. Intersection:

It returns the elements that are present in both RDDs, with duplicates removed.
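A short sketch covering union, distinct, and intersection together (assuming sc and made-up data):
val a = sc.parallelize(Seq(1, 2, 3, 3))
val b = sc.parallelize(Seq(3, 4, 5))
a.union(b).collect()              // all elements of both RDDs, duplicates kept
a.union(b).distinct().collect()   // duplicates removed
a.intersection(b).collect()       // elements present in both RDDs: Array(3)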

2.9. GroupBy:

It groups the data by a key, returning a new dataset of grouped items. The new RDD is made up of keys (each key identifying a group) paired with the list of items in that group. The order of elements within a group may not be the same when you apply the same operation on the same RDD over and over. It is a wide operation, as it shuffles data across multiple partitions/divisions to create the new RDD.
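For example (a sketch, assuming sc and made-up data):
// groupBy derives the key from each element and pairs it with the items of that group.
val words = sc.parallelize(Seq("apple", "ant", "bat", "bear"))
val byFirstLetter = words.groupBy(w => w.head)
byFirstLetter.collect()   // e.g. Array((a, [apple, ant]), (b, [bat, bear])) -- order may vary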

2.10. ReduceByKey:

It merges the values of each key using an associative reduce function. It can be used only with RDDs of key-value pairs. It is a wide operation that shuffles data across multiple partitions/divisions and creates another RDD, but it first merges data locally within each partition using the associative function, which optimizes the data shuffling. The result of the combination (e.g. a sum) is of the same type as the values, and the operation used when combining results from different partitions is the same as the operation used when combining values inside a partition.
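A minimal sketch (assuming sc and made-up key-value data):
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 2)))
// Values are first combined inside each partition, then the partial results are shuffled and merged.
val totals = pairs.reduceByKey(_ + _)
totals.collect()   // Array((a, 3), (b, 1)) -- order may vary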

2.11. AggregateByKey:

It combines the values for a particular key, and the result of such a combination can be any object that you specify. You need to specify how values are combined inside one partition (which is executed on a single node) and how you combine the results from different partitions (which may be on different nodes).
It aggregates the values of each key in an RDD, using the given combine functions and a neutral "zero value". This function can return a different result type, U, than the type of the values in the RDD, V. Thus, we need one operation for merging a V into a U and one operation for merging two U's. The former operation is used for merging values within a partition, and the latter is used for merging values between partitions. To avoid memory allocation, both of these functions are allowed to modify and return their first argument instead of creating a new U.
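For example, a sketch that computes a per-key average, where the accumulator type (sum, count) differs from the value type (assuming sc and made-up data):
val scores = sc.parallelize(Seq(("math", 80), ("math", 90), ("eng", 70)))
// Zero value, a function merging a value into the accumulator (within a partition),
// and a function merging two accumulators (across partitions).
val sumCount = scores.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),
  (a, b)   => (a._1 + b._1, a._2 + b._2)
)
val averages = sumCount.mapValues { case (sum, count) => sum.toDouble / count }
averages.collect()   // Array((math, 85.0), (eng, 70.0)) -- order may vary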

2.12. SortByKey:

It works with any key type K that has an implicit Ordering[K] in scope. Ordering objects already exist for all of the standard primitive types. Users can also define their own orderings for custom types, or override the default ordering. The implicit ordering that is in the closest scope will be used.
When called on a dataset of (K, V) pairs where K is ordered, it returns a dataset of (K, V) pairs sorted by key in ascending or descending order, as specified by the ascending argument.
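For example (a sketch, assuming sc):
val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))
pairs.sortByKey().collect()                    // ascending: Array((1,a), (2,b), (3,c))
pairs.sortByKey(ascending = false).collect()   // descending order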

2.13. Join:

It joins two datasets. When called on datasets of type (K, V) and (K, W), it returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
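A minimal sketch (assuming sc; the names and values are made up):
val salaries = sc.parallelize(Seq(("alice", 5000), ("bob", 4000)))
val depts    = sc.parallelize(Seq(("alice", "hr"), ("carol", "it")))
salaries.join(depts).collect()            // Array((alice, (5000, hr))) -- inner join on the key
salaries.leftOuterJoin(depts).collect()   // bob is kept, with None for the missing side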

2.14. Coalesce:

It changes the number of partitions in which the data is stored. It combines the original partitions into a new, smaller number of partitions, so it reduces the number of partitions. It is an optimized version of repartition that minimizes data movement and can only be used to decrease the number of RDD partitions. It lets operations run more efficiently after filtering down a large dataset.

2.15. Repartition:

Repartition reshuffles the data in your RDD to produce the number of partitions you request. It may reduce or increase the number of partitions, and it shuffles data all over the network.
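A sketch contrasting coalesce and repartition (assuming sc):
val big = sc.parallelize(1 to 1000, 100)
val filtered = big.filter(_ % 10 == 0)
// coalesce only decreases the partition count and avoids a full shuffle.
val small = filtered.coalesce(4)
// repartition can increase or decrease the count but shuffles data across the network.
val rebalanced = filtered.repartition(50)
(small.getNumPartitions, rebalanced.getNumPartitions)   // (4, 50)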
Before using transformations/actions you have to install Spark; to install Spark, follow this installation guide.

3. Actions

An action triggers computation and returns a final result of the RDD computations. It uses the lineage graph to load the data from the original RDD, carries out all the intermediate transformations, and returns the final value either to the driver program or writes it out to the file system. Actions are synchronous, and only an action can materialize a value in a Spark program with real data. Actions run jobs using SparkContext.runJob or directly via DAGScheduler.runJob.

3.1. Count ():

It returns the number of elements or items in the RDD. It basically counts the number of items present in the dataset and returns that number.

3.2. Collect():

It returns all the data/elements present in the RDD to the driver in the form of an array. It is often used to print the values of the array back to the console and to debug programs.

3.3. Reduce():

It takes a function with two arguments, an accumulator and a value, which should be commutative and associative in mathematical nature. It reduces a list of elements into a single result. This function produces the same result when applied repeatedly to the same RDD data across multiple partitions, irrespective of the order of the elements. It is a wide operation.
It executes the provided function to combine the elements into the result; the function takes two arguments and returns one. The function should be both commutative and associative so that it can generate a reproducible result in parallel.

3.4. Take(n):

It fetches the first n requested elements of the RDD and returns them as an array.

3.5. First():

It retrieves the very first element of the RDD. It is similar to take(1).
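As a single sketch pulling together the basic actions from sections 3.1 to 3.5 (assuming sc and made-up data):
val nums = sc.parallelize(Seq(5, 1, 4, 2, 3))
nums.count()        // 5 -- number of elements in the RDD
nums.collect()      // Array(5, 1, 4, 2, 3) -- everything returned to the driver
nums.reduce(_ + _)  // 15 -- combine function must be commutative and associative
nums.take(3)        // Array(5, 1, 4) -- the first three elements
nums.first()        // 5 -- same as take(1)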

3.6. TakeSample():

It is an action that returns a fixed-size random sample subset of an RDD; it includes a Boolean option for sampling with or without replacement and a random generator seed. It returns an array and internally randomizes the order of the elements returned.

3.7. TakeOrdered (count&ordering):

It fetches the first n items of the RDD ordered by the specified ordering function, based on either the default natural order or a custom comparator.

3.8. CountByKey():

It counts the values of an RDD consisting of two-component tuples for each distinct key. It actually counts the number of elements for each key and returns the result to the master as a map of (key, count) pairs.
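For example (a sketch, assuming sc):
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 7)))
// Counts how many elements exist per key and returns a local map to the driver.
pairs.countByKey()   // Map(a -> 2, b -> 1)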

3.9. Foreach():

It executes the given function on each item in the RDD. It is useful for side effects such as writing to a database or publishing to a web service. The function is executed for each data item and does not return a result.

3.10. SaveAsTextfile():

It writes the content of the RDD to a text file, i.e. it saves the RDD as a text file in the given directory path using the string representation of each element.
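A short sketch covering sections 3.9 and 3.10 (assuming sc; the output path is hypothetical):
val nums = sc.parallelize(1 to 5)
// foreach runs the function on the executors for its side effects and returns nothing;
// on a cluster the println output appears in the executor logs, not the driver console.
nums.foreach(x => println(x))
// saveAsTextFile writes one text file per partition under the given directory.
nums.saveAsTextFile("/tmp/nums-output")   // hypothetical output directory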
To practically implement and use these APIs, follow this beginner's guide.
See also: http://data-flair.training/blogs/introduction-spark-tutorial-quick-start/

Hive Installation and Quick Start Guide

1. Objective

This Hive tutorial contains simple steps for installing and running Hive on Ubuntu. Hive is a data warehousing infrastructure on top of Hadoop. This Hive quick start will help you set up and configure Hive and run several HiveQL queries to learn the concepts of Hive.

2. Introduction

Apache Hive is a warehouse infrastructure built on top of Hadoop for providing data summarization, query, and ad-hoc analysis. Hence, in order to get Hive running successfully, Java and Hadoop must be pre-installed and functioning well on your Linux OS. For the installation procedure of Java and Hadoop you can refer to the Hadoop installation guide.

3. Hive Installation

Now, in order to get Hive successfully installed on your system, please follow the steps below and execute them on your Linux OS:

3.1. Download Hive

In this tutorial we will use hive-0.13.1-cdh5.3.2 (you can also use any later version of Hive). Download Hive using the link below: http://apache.petsads.us/hive/hive-0.13.1-cdh5.3.2/apache-hive-0.13.1-cdh5.3.2.tar.gz. The file gets downloaded into your Downloads directory.
After the successful download of Hive, we will get the following response:
apache-hive-0.13.1-cdh5.3.2 hive-0.13.1-cdh5.3.2.tar.gz

3.1.1. Untar the file

Move the setup file to your home directory and untar/unzip the downloaded file by executing the command below:
$ tar zxvf hive-0.13.1-cdh5.3.2.tar.gz

3.2. Setting up Hive Environment Variables

3.2.1. Editing .bashrc file

In order to set up the Hive environment we need to append the following lines at the end of the ~/.bashrc file.
export HADOOP_USER_CLASSPATH_FIRST=true
export HADOOP_HOME=/home/dataflair/hadoop-2.6.0-cdh5.5.1
export HIVE_HOME=/home/dataflair/hive-0.13.1-cdh5.3.2
export PATH=$PATH:$HIVE_HOME/bin
Note: Enter the correct name and version of your Hive directory and its correct path. Here "/home/dataflair/hive-0.13.1-cdh5.3.2" is the path of my Hive directory and "hive-0.13.1-cdh5.3.2" is its name, so please enter the correct path and name of your Hive directory. HIVE_HOME must be set before it is used in the PATH line. After adding these lines, save the file.
In order to apply these changes, execute the following command:
$ source ~/.bashrc

4. Launching HIVE

$ hive
The following output gets displayed:
Logging initialized using configuration in jar:file:/home/dataflair/HADOOP/hive-0.13.1-cdh5.3.2/lib/hive-common-0.13.1-cdh5.3.2.jar!/hive-log4j.properties
hive>

5. Exit from Hive:

hive> exit;
Congratulations! Hive has been successfully installed on your system. Now you can easily execute your queries.

Before using Hive, you should change its metastore layer; follow this tutorial to change the Hive metastore from Derby to MySQL.

6. Hive Queries

Below are some basic Hive queries that you will need while using Hive.

6.1. Show Databases

Syntax:
show databases;
Usage:
show databases;
This query gives a list of the databases present in your Hive. If you have newly installed Hive and have not created any database, then by default a database named "default" is present and will show up after executing the above query.

6.2. Create Database

Syntax:
create database database_name;
Usage:
create database test;
This will create a new database named "test". You can check this database by running the "show databases;" query.

6.3. Use

The USE query is used to switch to the database created by you.
Syntax:
USE database_name;
Usage:
USE test;

6.4. Current Database

Syntax:
set hive.cli.print.current.db=true;
This setting is used to display the name of the database in which you are currently working.

6.5. DROP

The DROP query is used to delete a database.
Syntax:
DROP database database_name;
Usage:
DROP database test1;

6.6. CREATE TABLE

This command is used to create a new table.
Syntax:
CREATE TABLE TABLE_NAME (Parameters)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
Usage:
create table employee ( Name String comment 'Employee Name', Id int, MobileNumber String, Salary Float) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile;

6.7. View tables

Syntax:
show tables;
It lists all the tables you have created in the current database.

6.8. Alter Table

It is used to change attributes inside a table.
Syntax: We can change a number of attributes of a table, depending on what we want to modify.
ALTER TABLE TableName RENAME TO new_name
ALTER TABLE TableName ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE TableName DROP [COLUMN] column_name
ALTER TABLE TableName CHANGE column_name new_name new_type
ALTER TABLE TableName REPLACE COLUMNS (col_spec[, col_spec ...])
Usage:
ALTER TABLE employee RENAME TO demo1;

6.9. Describe table

Syntax:
desc TableName;
Usage:
desc employee;
This command gives a description of the columns of the table.

6.10. Load data

Syntax:
LOAD DATA LOCAL INPATH 'Path of the File' OVERWRITE INTO TABLE table_name;
Usage:
LOAD DATA LOCAL INPATH '/home/dataflair/Desktop/details.txt' OVERWRITE INTO TABLE employee;
This command loads the data from the given file path into the table you created in Hive.