7. Docker Container


       Containers are instances of Docker images that can be run using the Docker run command. The basic purpose of Docker is to run containers. Let’s discuss how to work with containers.

Run a Container-

Containers are run and managed with the docker run command. To run a container in interactive mode, launch it as follows.

    #sudo docker run -it centos /bin/bash

Then press Ctrl+P followed by Ctrl+Q to detach from the container and return to your OS shell.

Listing of Containers-

One can list the containers on the machine via the docker ps command. By default, this command returns only the currently running containers.

#docker ps

docker ps -a-
This command is used to list all of the containers on the system

#docker ps -a

  -a – This tells the docker ps command to list all of the containers on the system.

docker history-
With this command, you can see all the commands that were run with an image via a container.

#docker history ImageID

ImageID – This is the Image ID for which you want to see all the commands that were run against it.

docker top-
With this command, you can see the top processes within a container.

#docker top ContainerID

ContainerID – This is the Container ID for which you want to see the top processes

docker stop-
This command is used to stop a running container.

#docker stop ContainerID

ContainerID – This is the Container ID which needs to be stopped.

docker rm-
This command is used to delete a container.

#docker rm ContainerID

ContainerID – This is the Container ID which needs to be removed.

Return Value
The output will give the ID of the removed container.

docker stats-
This command is used to provide the statistics of a running container.

#docker stats ContainerID

ContainerID – This is the Container ID for which the stats need to be provided.

Return Value
The output will show the CPU and Memory utilization of the Container.

docker attach-
This command is used to attach to a running container.

#docker attach ContainerID

Once you have attached to the Docker container, your terminal is connected to the container's standard input and output, so you can work inside the running container directly.


docker pause-
This command is used to pause the processes in a running container.

#docker pause ContainerID


docker unpause-
This command is used to unpause the processes in a running container.

#docker unpause ContainerID

The docker unpause command resumes the processes of a container that was previously paused with docker pause.

docker kill-
This command is used to kill the processes in a running container.

#docker kill ContainerID

Docker – Container Lifecycle-
The following points describe the entire lifecycle of a Docker container.

* Initially, the Docker container is in the created state.
* The Docker container goes into the running state when the docker run command is used.
* The docker kill command is used to kill an existing Docker container immediately.
* The docker stop command is used to gracefully stop an existing Docker container and move it to the stopped state.
* The docker start command is used to put a container back from the stopped state into the running state.

Docker Image

        In Docker, everything is based on Images. An image is a combination of a file system and parameters. Let’s take an example of the following command in Docker.





   #docker run hello-world

1. The docker command tells the Docker program on the operating system that something needs to be done.
2. The run sub-command is used to indicate that we want to create an instance of an image, which is then called a container.
3. Finally, "hello-world" represents the image from which the container is made.
    Now let’s look at how we can use the CentOS image available in Docker Hub to run CentOS on our Ubuntu machine. We can do this by executing the following command on our Ubuntu machine.

     #sudo docker run -it centos /bin/bash

       Note the following points about the above sudo command:
* We are using the sudo command to ensure that it runs with root access.
* Here, centos is the name of the image we want to download from Docker Hub and install on our Ubuntu machine.
* -it is used to mention that we want to run in interactive mode.
* /bin/bash is used to run the bash shell once CentOS is up and running.

Displaying Docker Images:-
To see the list of Docker images on the system, you can issue the following command.

   #docker images

This command is used to display all the images currently installed on the system.

Return Value
The output lists all the images currently installed on the system.


From the output of the docker images command, you can see the images present on the server, for example centos, newcentos, and jenkins. Each image has the following attributes:
·     TAG – This is used to logically tag images.
·     Image ID – This is used to uniquely identify the image.
·     Created – The number of days since the image was created.
·     Virtual Size – The size of the image.

Downloading Docker Images:- 

Images can be downloaded from Docker Hub using the Docker run command. Let’s see in detail how we can do this.

Syntax
The following syntax is used to run a command in a Docker container.

#docker run imageName

Return Value
The command will run the given image in a new container, downloading the image from Docker Hub first if it is not already present locally.


You will now see the CentOS Docker image downloaded. Now, if we run the Docker images command to see the list of images on the system, we should be able to see the centos image as well.


Removing Docker Images:-
The Docker images on the system can be removed via the docker rmi command. Let’s look at this command in more detail.

#docker rmi ImageID

docker images -q-
This command is used to return only the Image IDs of the images.

#docker images -q

Docker Inspect-

This command is used to see the details of an image or container.
#docker inspect Repository

Repository – This is the name of the image.

Example-
#sudo docker inspect jenkins

When we run the above command, it will output the low-level configuration details of the jenkins image in JSON format.


For More Details..

Top Important Docker Commands

  • How Do You Use Docker?

The biggest advantage of VMs is that they can create snapshots which can be revisited instantly later.
Docker containers take lightweight process virtualization further by using the Linux kernel's functionality directly, so no full guest OS is needed per container. Containers are created from Docker images – much like snapshots. Docker images are built from a Dockerfile, which can be customized or used as is. The default execution driver for creating a Docker container is ‘libcontainer’. Docker Hub can be used for searching Docker images and seeing the way they have been built.


Command – Description
docker attach – Attach local standard input, output, and error streams to a running container
docker build – Build an image from a Dockerfile
docker checkpoint – Manage checkpoints
docker commit – Create a new image from a container’s changes
docker config – Manage Docker configs
docker container – Manage containers
docker cp – Copy files/folders between a container and the local filesystem
docker create – Create a new container
docker deploy – Deploy a new stack or update an existing stack
docker diff – Inspect changes to files or directories on a container’s filesystem
docker events – Get real time events from the server
docker exec – Run a command in a running container
docker export – Export a container’s filesystem as a tar archive
docker history – Show the history of an image
docker image – Manage images
docker images – List images
docker import – Import the contents from a tarball to create a filesystem image
docker info – Display system-wide information
docker inspect – Return low-level information on Docker objects
docker kill – Kill one or more running containers
docker load – Load an image from a tar archive or STDIN
docker login – Log in to a Docker registry
docker logout – Log out from a Docker registry
docker logs – Fetch the logs of a container
docker network – Manage networks
docker node – Manage Swarm nodes
docker pause – Pause all processes within one or more containers
docker plugin – Manage plugins
docker port – List port mappings or a specific mapping for the container
docker ps – List containers
docker pull – Pull an image or a repository from a registry
docker push – Push an image or a repository to a registry
docker rename – Rename a container
docker restart – Restart one or more containers
docker rm – Remove one or more containers
docker rmi – Remove one or more images
docker run – Run a command in a new container
docker save – Save one or more images to a tar archive (streamed to STDOUT by default)
docker search – Search the Docker Hub for images
docker secret – Manage Docker secrets
docker service – Manage services
docker stack – Manage Docker stacks
docker start – Start one or more stopped containers
docker stats – Display a live stream of container(s) resource usage statistics
docker stop – Stop one or more running containers
docker swarm – Manage Swarm
docker system – Manage Docker
docker tag – Create a tag TARGET_IMAGE that refers to SOURCE_IMAGE
docker top – Display the running processes of a container
docker unpause – Unpause all processes within one or more containers
docker update – Update configuration of one or more containers
docker version – Show the Docker version information
docker volume – Manage volumes
docker wait – Block until one or more containers stop, then print their exit codes
Note – All of the above commands are basic Docker commands. Use them according to the permissions you have on the system.
For More Details-

33 Frequently used HDFS shell commands

# Open a terminal window to the current working directory.
# /home/training
# 1. Print the Hadoop version
hadoop version
# 2. List the contents of the root directory in HDFS
#
hadoop fs -ls /
# 3. Report the amount of space used and
# available on currently mounted filesystem
#
hadoop fs -df hdfs:/
# 4. Count the number of directories,files and bytes under
# the paths that match the specified file pattern
#
hadoop fs -count hdfs:/

# 5. Run a DFS filesystem checking utility
#
hadoop fsck /
# 6. Run a cluster balancing utility
#
hadoop balancer
# 7. Create a new directory named “hadoop” below the
# /user/training directory in HDFS. Since you’re
# currently logged in with the “training” user ID,
# /user/training is your home directory in HDFS.
#
hadoop fs -mkdir /user/training/hadoop
# 8. Add a sample text file from the local directory
# named “data” to the new directory you created in HDFS
# during the previous step.
#
hadoop fs -put data/sample.txt /user/training/hadoop
# 9. List the contents of this new directory in HDFS.
#
hadoop fs -ls /user/training/hadoop
# 10. Add the entire local directory called “retail” to the
# /user/training directory in HDFS.
#
hadoop fs -put data/retail /user/training/hadoop
# 11. Since /user/training is your home directory in HDFS,
# any command that does not have an absolute path is
# interpreted as relative to that directory. The next
# command will therefore list your home directory, and
# should show the items you’ve just added there.
#
hadoop fs -ls
# 12. See how much space this directory occupies in HDFS.
#
hadoop fs -du -s -h hadoop/retail
# 13. Delete a file ‘customers’ from the “retail” directory.
#
hadoop fs -rm hadoop/retail/customers
# 14. Ensure this file is no longer in HDFS.
#
hadoop fs -ls hadoop/retail/customers
# 15. Delete all files from the “retail” directory using a wildcard.
#
hadoop fs -rm hadoop/retail/*
# 16. To empty the trash
#
hadoop fs -expunge
# 17. Finally, remove the entire retail directory and all
# of its contents in HDFS.
#
hadoop fs -rm -r hadoop/retail
# 18. List the hadoop directory again
#
hadoop fs -ls hadoop
# 19. Add the purchases.txt file from the local directory
# named “/home/training/” to the hadoop directory you created in HDFS
#
hadoop fs -copyFromLocal /home/training/purchases.txt hadoop/
# 20. To view the contents of your text file purchases.txt
# which is present in your hadoop directory.
#
hadoop fs -cat hadoop/purchases.txt
# 21. Copy the purchases.txt file from the “hadoop” directory in HDFS
# to the “data” directory inside your local home directory
#
hadoop fs -copyToLocal hadoop/purchases.txt /home/training/data
# 22. cp is used to copy files between directories present in HDFS
#
hadoop fs -cp /user/training/*.txt /user/training/hadoop
# 23. The ‘-get’ command can be used as an alternative to the ‘-copyToLocal’ command
#
hadoop fs -get hadoop/sample.txt /home/training/
# 24. Display last kilobyte of the file “purchases.txt” to stdout.
#
hadoop fs -tail hadoop/purchases.txt
# 25. Default file permissions are 666 in HDFS
# Use ‘-chmod’ command to change permissions of a file
#
hadoop fs -ls hadoop/purchases.txt
sudo -u hdfs hadoop fs -chmod 600 hadoop/purchases.txt
# 26. Default names of owner and group are training,training
# Use ‘-chown’ to change owner name and group name simultaneously
#
hadoop fs -ls hadoop/purchases.txt
sudo -u hdfs hadoop fs -chown root:root hadoop/purchases.txt
# 27. Default name of group is training
# Use ‘-chgrp’ command to change group name
#
hadoop fs -ls hadoop/purchases.txt
sudo -u hdfs hadoop fs -chgrp training hadoop/purchases.txt
# 28. Move a directory from one location to other
#
hadoop fs -mv hadoop apache_hadoop
# 29. Default replication factor to a file is 3.
# Use ‘-setrep’ command to change replication factor of a file
#
hadoop fs -setrep -w 2 apache_hadoop/sample.txt
# 30. Copy a directory from one cluster to another
# Use the ‘distcp’ command to copy,
# the -overwrite option to overwrite existing files,
# and the -update option to synchronize both directories
#
hadoop distcp hdfs://namenodeA/apache_hadoop hdfs://namenodeB/hadoop
# 31. Command to make the name node leave safe mode
#
hadoop fs -expunge
sudo -u hdfs hdfs dfsadmin -safemode leave
# 32. List all the hadoop file system shell commands
#
hadoop fs
# 33. Last but not least, always ask for help!
#
hadoop fs -help

RDD Transformations and Actions APIs in Apache Spark

1. Objective

This Spark API guide explains the important APIs of Apache Spark. The tutorial describes the transformations and actions used to process data in Spark. Spark is a next-generation Big Data tool; to learn more about Apache Spark, follow this introductory guide.

2. Transformation

Transformations build a new RDD (Resilient Distributed Dataset) from a previous RDD with the help of operations like filter, map, flatMap, etc. Transformations are lazy operations on an RDD, i.e. they do not execute immediately; they are executed only after an action is called. Transformations are functions that take an RDD as input and produce one or many “new” output RDDs.
The resulting RDD is always different from its parent RDD, and it can be smaller, bigger, or of the same size. To improve the performance of computations, transformations are pipelined, which is an optimization technique.

2.1. Map:

It passes each element of the RDD through a user-defined function and returns a new dataset formed from the results. The function is applied to every row / item of the RDD, so the size of the input and the output remains the same.
One -> one, size of A = size of B: one element in -> one element out.
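
For illustration, a minimal PySpark (Python API) sketch; it assumes sc is an existing SparkContext (as in the pyspark shell) and uses made-up sample data:

    nums = sc.parallelize([1, 2, 3, 4])
    doubled = nums.map(lambda x: x * 2)   # one element in -> one element out
    print(doubled.collect())              # [2, 4, 6, 8]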

2.2. FlatMap:

It does a similar job to map, but the difference is that flatMap returns a list of elements (0 or more) for each input element, and the output of flatMap is flattened. The function in flatMap returns a list, array, or sequence of elements.
One -> many: one element in -> 0 or more elements out, so the output RDD B can be larger or smaller than the input RDD A.
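
A PySpark sketch showing how flatMap differs from map (same assumptions as above):

    lines = sc.parallelize(["hello world", "hi"])
    words = lines.flatMap(lambda line: line.split(" "))   # each line -> 0 or more words, flattened
    print(words.collect())                                # ['hello', 'world', 'hi']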

2.3. Filter:

It returns a new dataset formed by selecting those elements of the source on which the function returns true. It returns only the elements that satisfy a predicate; a predicate is a function that accepts a parameter and returns a Boolean value, either true or false. It keeps only the elements which pass / satisfy the condition and filters out those which don't, so the new RDD will be the set of elements for which the function returns true.
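
A short PySpark sketch of filter (same assumptions as above):

    nums = sc.parallelize([1, 2, 3, 4, 5])
    evens = nums.filter(lambda x: x % 2 == 0)   # keep only elements for which the predicate is True
    print(evens.collect())                      # [2, 4]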

2.4. MapPartitions:

It runs the function once per partition or block of the RDD, so the function must be of type Iterator<T> => Iterator<U>. It improves performance by reducing object creation, since the function is invoked once per partition instead of once for every element as in map.

2.5. MapPartitionsWithIndex:

It is similar to mapPartitions, but with one difference: the function takes two parameters, where the first parameter is the partition index and the second is an iterator through all the items within that partition (Int, Iterator<T>).
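
A PySpark sketch illustrating both mapPartitions and mapPartitionsWithIndex; the per-partition results in the comments assume the sample data splits evenly into the 2 requested partitions:

    nums = sc.parallelize([1, 2, 3, 4], 2)   # request 2 partitions
    print(nums.mapPartitions(lambda it: [sum(it)]).collect())
    # e.g. [3, 7] – one sum per partition
    print(nums.mapPartitionsWithIndex(lambda idx, it: [(idx, sum(it))]).collect())
    # e.g. [(0, 3), (1, 7)] – the function also receives the partition index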

2.6. Union:

It performs the standard set union operation. It is the same as the ‘++’ operator. It returns a new RDD made by taking the union with the other RDD; duplicates are not removed.

2.7. Distinct:

Returns a new dataset containing the unique elements of the source RDD, i.e. it returns the distinct values, with duplicates removed.

2.8. Intersection:

It returns the values or elements that are present in both RDDs (the identical elements), with de-duplication applied.
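
A combined PySpark sketch of union, distinct, and intersection (same assumptions as above; element order in the results may vary):

    a = sc.parallelize([1, 2, 2, 3])
    b = sc.parallelize([3, 4])
    print(a.union(b).collect())          # [1, 2, 2, 3, 3, 4] – duplicates are kept
    print(a.distinct().collect())        # e.g. [1, 2, 3] – duplicates removed
    print(a.intersection(b).collect())   # [3] – common elements, de-duplicated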

2.9. GroupBy:

It groups the elements of the RDD according to a key and returns a new dataset of grouped items. The new RDD is made up of a key (which identifies a group) and the list of items belonging to that group. The order of elements within a group may not be the same when you apply the same operation on the same RDD over and over. It is a wide operation, as it shuffles data from multiple partitions / divisions and creates another RDD.
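
A PySpark sketch of groupBy, grouping numbers by their remainder modulo 2 (order within groups may vary):

    nums = sc.parallelize([1, 2, 3, 4, 5])
    groups = nums.groupBy(lambda x: x % 2)    # key = x % 2, value = iterable of items
    print(groups.mapValues(list).collect())   # e.g. [(0, [2, 4]), (1, [1, 3, 5])]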

2.10. ReduceByKey:

It uses an associative reduce function to merge the values of each key. It can be used only with RDDs of key-value pairs. It is a wide operation which shuffles data from multiple partitions / divisions and creates another RDD. It first merges data locally within each partition using the associative function, which optimizes data shuffling. The result of the combination (e.g. a sum) is of the same type as the values, and the operation used to combine results from different partitions is the same as the operation used to combine values inside a partition.
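
A PySpark sketch of reduceByKey on a pair RDD (same assumptions as above):

    pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
    totals = pairs.reduceByKey(lambda x, y: x + y)   # values merged per partition, then across partitions
    print(totals.collect())                          # e.g. [('a', 3), ('b', 1)]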

2.11. AggregateByKey:

It combines the values for a particular key, and the result of such a combination can be any object that you specify. You need to specify how values are combined or added inside one partition (which is executed on a single node) and how you combine the results from different partitions (which may be on different nodes).
It aggregates the values of each key in an RDD, using the given combine functions and a neutral “zero value”. This function can return a different result type, U, than the type of the values in this RDD, V. Thus, we need one operation for merging a V into a U and one operation for merging two U’s. The former operation is used for merging values within a partition, and the latter is used for merging values between partitions. To avoid memory allocation, both of these functions are allowed to modify and return their first argument instead of creating a new U.
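
A PySpark sketch of aggregateByKey that builds a (sum, count) pair per key, i.e. the result type differs from the value type (same assumptions as above):

    pairs = sc.parallelize([("a", 1), ("a", 3), ("b", 5)])
    # zero value (0, 0); the first function merges a value V into a (sum, count) U within a partition;
    # the second function merges two (sum, count) U's coming from different partitions
    sum_count = pairs.aggregateByKey((0, 0),
                                     lambda acc, v: (acc[0] + v, acc[1] + 1),
                                     lambda a, b: (a[0] + b[0], a[1] + b[1]))
    print(sum_count.collect())   # e.g. [('a', (4, 2)), ('b', (5, 1))]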

2.12. SortByKey:

They will work with any key type K that has an implicit Ordering[K] in scope. Ordering objects already exist for all of the standard primitive types. Users can also define their own orderings for custom types, or override the default ordering. The implicit ordering that is in the closest scope will be used.
When called on a dataset of (K, V) pairs where K is ordered, it returns a dataset of (K, V) pairs sorted by key in ascending or descending order, as specified by the ascending argument.
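
A PySpark sketch of sortByKey (same assumptions as above):

    pairs = sc.parallelize([("b", 2), ("a", 1), ("c", 3)])
    print(pairs.sortByKey().collect())                  # [('a', 1), ('b', 2), ('c', 3)]
    print(pairs.sortByKey(ascending=False).collect())   # [('c', 3), ('b', 2), ('a', 1)]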

2.13. Join:

It joins two datasets. When called on datasets of type (K, V) and (K, W), it returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
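
A PySpark sketch of join and leftOuterJoin (same assumptions as above):

    left = sc.parallelize([("a", 1), ("b", 2)])
    right = sc.parallelize([("a", "x"), ("a", "y"), ("c", "z")])
    print(left.join(right).collect())            # e.g. [('a', (1, 'x')), ('a', (1, 'y'))]
    print(left.leftOuterJoin(right).collect())   # e.g. also includes ('b', (2, None))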

2.14. Coalesce:

It changes the number of partitions where the data is stored. It combines the original partitions into a new, smaller number of partitions, so it reduces the number of partitions. It is an optimized version of repartition that avoids a full shuffle, but it can only be used to decrease the number of RDD partitions. It lets operations run more efficiently after filtering down a large dataset.

2.15. Repartition:

Repartition will reshuffle the data in your RDD to produce the final number of partitions you request. It may decrease or increase the number of partitions, and it shuffles data all over the network.
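
A PySpark sketch contrasting coalesce and repartition; the partition counts used here are arbitrary example values:

    nums = sc.parallelize(range(100), 8)             # start with 8 partitions
    print(nums.coalesce(2).getNumPartitions())       # 2 – merged down without a full shuffle
    print(nums.repartition(16).getNumPartitions())   # 16 – full shuffle to redistribute the data
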
Before using these transformations / actions, you have to install Spark; to install Spark, follow this Installation Guide.

3. Actions

An action triggers computation and returns the final result of the RDD computations. It uses the lineage graph to load the data from the original RDD, carries out all the intermediate transformations, and returns the final value to the driver program or writes it out to the file system. Actions are synchronous, and only an action can materialize a value with real data in a Spark program. Actions run jobs using SparkContext.runJob or directly DAGScheduler.runJob.

3.1. Count ():

It returns the number of elements or items in the RDD. So it basically counts the number of items present in the dataset and returns that number.

3.2. Collect():

It returns all the data / elements present in an RDD in the form of an array (a list in Python). It is commonly used to print values back to the console and for debugging programs.
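
A PySpark sketch covering both count() and collect() (same assumptions as above):

    nums = sc.parallelize([10, 20, 30])
    print(nums.count())     # 3 – the number of elements
    print(nums.collect())   # [10, 20, 30] – all elements brought back to the driver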

3.3. Reduce():

It takes a function with two arguments, an accumulator and a value, which should be commutative and associative in the mathematical sense. It reduces a list of elements into a single result. The function produces the same result when applied repeatedly to the same set of RDD data with multiple partitions, irrespective of the order of the elements. It is a wide operation.
It executes the provided function to combine the elements into a result. It takes two arguments and returns one. The function should be both commutative and associative so that it can generate a reproducible result in parallel.
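
A PySpark sketch of reduce using addition, which is both commutative and associative:

    nums = sc.parallelize([1, 2, 3, 4])
    total = nums.reduce(lambda x, y: x + y)   # combine elements pairwise into a single result
    print(total)                              # 10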

3.4. Take(n):

It fetches or extracts the first n requested elements of the RDD and returns them as an array.

3.5. First():

Retrieves the very first data item or element of the RDD. It is similar to take(1).
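
A PySpark sketch covering take(n) and first() (same assumptions as above):

    nums = sc.parallelize([5, 1, 4, 2])
    print(nums.take(2))   # [5, 1] – the first 2 elements, returned as a list
    print(nums.first())   # 5 – the very first element, similar to take(1)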

3.6. TakeSample():

It is an action that returns a fixed-size random sample subset of an RDD, with a Boolean option for sampling with or without replacement and an optional random generator seed. It returns an array, and the order of the returned elements is internally randomized.

3.7. TakeOrdered(count & ordering):

Fetches the specified number of first n items, ordered either by the default natural ordering or by a specified custom comparator.
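
A PySpark sketch covering takeSample and takeOrdered; the sampled values depend on the seed, so they are not shown:

    nums = sc.parallelize(range(10))
    print(nums.takeSample(False, 3, 7))            # 3 random elements, without replacement, seed 7
    print(nums.takeOrdered(3))                     # [0, 1, 2] – smallest 3 under natural ordering
    print(nums.takeOrdered(3, key=lambda x: -x))   # [9, 8, 7] – custom ordering (largest first)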

3.8. CountByKey():

It works on an RDD consisting of two-component (key, value) tuples. It counts the number of elements for each distinct key and returns the result to the master as a list of (key, count) pairs.
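
A PySpark sketch of countByKey (same assumptions as above):

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    print(dict(pairs.countByKey()))   # e.g. {'a': 2, 'b': 1} – number of elements per key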

3.9. Foreach():

It executes a function on each item in the RDD. It is good for writing to a database or publishing to a web service. The function is executed for its side effect on each data item; nothing is returned to the driver.
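
A PySpark sketch of foreach; the function here only prints, but in practice it would typically write to an external system. Note that the print output appears in the executor logs, not on the driver:

    def log_item(x):
        # side effect only – e.g. write x to a database or web service
        print(x)

    sc.parallelize([1, 2, 3]).foreach(log_item)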

3.10. SaveAsTextFile():

It writes the content of the RDD to a text file, i.e. it saves the RDD as a text file in the given file path directory using the string representation of each element.
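
A PySpark sketch of saveAsTextFile; the output path is a hypothetical example and must not already exist:

    nums = sc.parallelize([1, 2, 3])
    nums.saveAsTextFile("/tmp/numbers_out")   # hypothetical directory; one part-* file is written per partition
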
To practically implement and use these APIs, follow this beginner’s guide.
other: http://data-flair.training/blogs/introduction-spark-tutorial-quick-start/