Big Data Open Source for the Impatient, Part 1: Hadoop Tutorial: Hello World with Java, Pig, Hive, Flume, Fuse, Oozie, and Sqoop with Informix, DB2, and MySQL

How to start with Hadoop and your favorite databases
This article focuses on explaining Big Data and then providing simple worked examples in Hadoop, the major open source player in the Big Data space. You will be happy to hear that Hadoop is NOT a replacement for Informix® or DB2®; in fact, it interacts very well with the existing infrastructure. There are multiple components in the Hadoop family and this article will detail specific code samples that show the capabilities. There will be no elephant stampede if you try these examples on your own PC.

There is a lot of excitement about Big Data, but also a lot of confusion. This article will provide a working definition of Big Data and then work through a series of examples so you can have a first-hand understanding of some of the capabilities of Hadoop, the leading open source technology in the Big Data domain. Let's focus specifically on the following questions.

  • What is Big Data, Hadoop, Sqoop, Hive, and Pig, and why is there so much excitement in this area?
  • How does Hadoop relate to IBM DB2 and Informix? Can these technologies work together?
  • How can I start with Big Data? What are some simple examples that run on a single PC?
  • For the super impatient, if you can already define Hadoop and want to start immediately with code samples, then do the following.


  1. Launch your Informix or DB2 instance.
  2. Download the VMWare image from the Cloudera Web site and increase the virtual machine RAM value to 1.5 GB.
  3. Jump to the section that contains code samples.
  4. There is a MySQL instance built into the VMWare image. If you are performing exercises without network connectivity, use MySQL examples.

For everyone else, read on ...

What is Big Data?

Big Data is data that is large in quantity, is captured at a rapid rate, and is structured or unstructured, or some combination of the above. These factors make Big Data difficult to capture, mine, and manage using traditional methods. There is so much hype in this space that there could be an extended debate just about the definition of Big Data.
Use of Big Data technology is not restricted to large volumes. The examples in this article use small samples to illustrate the capabilities of the technology. As of the year 2012, clusters that are considered big are in the 100-petabyte range.
Big Data can be both structured and unstructured. Traditional relational databases, such as Informix and DB2, provide proven solutions for structured data. Via extensibility they also manage unstructured data. Hadoop technology brings new and more accessible programming techniques for working on massive data stores with either structured or unstructured data.

Why all the excitement?
There are many factors contributing to the hype around Big Data, including the following.
Bringing compute and storage together on commodity hardware: the result is blazing speed at low cost.
Price performance: The Hadoop technology provides significant cost savings (think a factor of approximately 10) with significant performance improvements (again, think factor of 10). Your mileage may vary. If the existing technology can be so dramatically trounced, it is worth examining whether Hadoop can complement or replace aspects of your current architecture.
Linear Scalability: Every parallel technology makes claims about scale-up. Hadoop has genuine scalability since the latest release is expanding the limit on the number of nodes to beyond 4,000.
Full access to unstructured data: A highly scalable data store with a good parallel programming model, MapReduce, has been a challenge for the industry for some time. Hadoop's programming model does not solve all problems, but it is a robust solution for many tasks.



What is Hadoop?
Following are several definitions of Hadoop, each one targeting a different audience within the company:
For executives: Hadoop is an Apache open source software project to get value out of the incredible volume/velocity/variety of data about your organization. Use the data instead of throwing most of it away.
For technical managers: An open source suite of software that mines the structured and unstructured Big Data about your company. It integrates with your existing Business Intelligence ecosystem.
For Legal: An open source suite of software that is packaged and supported by multiple suppliers. See the Resources section regarding IP indemnification.
Engineering: A massively parallel, shared-nothing, Java-based map-reduce execution environment. Think of hundreds to thousands of computers working on the same problem, with built-in failure resilience. Projects in the Hadoop ecosystem provide data loading, higher-level languages, automated cloud deployment, and other capabilities.
Security: A software suite with Kerberos security.



What are the components of Hadoop?
The Apache Hadoop project has two core components, the file store called the Hadoop Distributed File System (HDFS), and the programming framework called MapReduce. There are a number of supporting projects that leverage HDFS and MapReduce. This article will provide a summary, and encourages you to get the O'Reilly book "Hadoop The Definitive Guide", 3rd Edition, for more detail.
The definitions below are meant to provide just enough background for you to use the code examples that follow. This article is really meant to get you started with hands-on experience with the technology. This is more of a "how to" article than a "what is" or "let's discuss" article.
HDFS: If you want 4000+ computers to work on your data, then you'd better spread your data across 4000+ computers. HDFS does this for you. HDFS has a few moving parts. The Datanodes store your data, and the Namenode keeps track of where stuff is stored. There are other pieces, but you have enough to get started.
MapReduce: This is the programming model for Hadoop. There are two phases, not surprisingly called Map and Reduce. To impress your friends, tell them there is a shuffle-sort between the Map phase and the Reduce phase. The JobTracker manages the 4000+ components of your MapReduce job. The TaskTrackers take orders from the JobTracker. If you like Java then code in Java. If you like SQL or other non-Java languages you are still in luck, you can use a utility called Hadoop Streaming.
Hadoop Streaming: A utility to enable MapReduce code in any language: C, Perl, Python, C++, Bash, etc. The examples include a Python mapper and an AWK reducer.
Hive and Hue: If you like SQL, you will be delighted to hear that you can write SQL and have Hive convert it to a MapReduce job. No, you don't get a full ANSI-SQL environment, but you do get 4000 nodes and multi-petabyte scalability. Hue gives you a browser-based graphical interface to do your Hive work.
Pig: A higher-level programming environment for MapReduce coding. The Pig language is called Pig Latin. You may find the naming conventions somewhat unconventional, but you get incredible price-performance and high availability.
Sqoop: Provides bidirectional data transfer between Hadoop and your favorite relational database.
Oozie: Manages Hadoop workflow. This doesn't replace your scheduler or BPM tooling, but it does provide if-then-else branching and control within your Hadoop jobs.
HBase: A super-scalable key-value store. It works very much like a persistent hash-map (for Python fans, think of a dictionary). It is not a relational database, despite the name HBase.
FlumeNG: A real-time loader for streaming your data into Hadoop. It stores data in HDFS and HBase. You'll want to get started with FlumeNG, which improves on the original flume.
Whirr: Cloud provisioning for Hadoop. You can start up a cluster in just a few minutes with a very short configuration file.
Mahout: Machine learning for Hadoop. Used for predictive analytics and other advanced analysis.
Fuse: Makes the HDFS system look like a regular file system, so you can use ls, rm, cd, and others directly on HDFS data (a short mount sketch follows this list).
Zookeeper: Used to manage synchronization for the cluster. You won't be working much with Zookeeper, but it is working hard for you. If you think you need to write a program that uses Zookeeper you are either very, very smart and could be a committee for an Apache project, or you are about to have a terrible day.
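Since Fuse is named in the title but is not exercised in the listings below, here is a rough sketch of what a mount looks like on the Cloudera image. Treat it as illustrative only: the package name and the namenode port are assumptions that vary by CDH release, so check your own distribution's documentation.

# install the fuse package (name varies by CDH release) and mount HDFS
sudo yum install hadoop-0.20-fuse
sudo mkdir -p /mnt/hdfs
sudo hadoop-fuse-dfs dfs://localhost:8020 /mnt/hdfs

# ordinary file commands now work against HDFS
ls /mnt/hdfs/user/cloudera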
Figure 1 shows the key parts of Hadoop.

Figure 1. Hadoop architecture

HDFS, the bottom layer, sits on a cluster of commodity hardware. Simple rack-mounted servers, each with 2-Hex core CPUs, 6 to 12 disks, and 32 gig of RAM. For a map-reduce job, the mapper layer reads from the disks at very high speed. The mapper emits key-value pairs that are sorted and presented to the reducer, and the reducer layer summarizes the key-value pairs. No, you don't have to summarize, you can actually have a map-reduce job that has only mappers. This should become easier to understand when you get to the python-awk example.
How does Hadoop integrate with my Informix or DB2 infrastructure?
Hadoop integrates very well with your Informix and DB2 databases via Sqoop. Sqoop is the leading open source implementation for moving data between Hadoop and relational databases. It uses JDBC to read and write Informix, DB2, MySQL, Oracle, and other sources. There are optimized adapters for several databases, including Netezza and DB2. See the Resources section for how to download these adapters. The examples here are all Sqoop-specific.

Getting Started: How to Run Simple Hadoop, Hive, Pig, Oozie, and Sqoop Examples


With the introductions and definitions out of the way, it is now time for the good stuff. To continue, you'll need to download the VMWare, VirtualBox, or other image from the Cloudera Web site and start doing MapReduce! The virtual image assumes you have a 64-bit computer and one of the popular virtualization environments. Most of the virtualization environments have a free download. When you try to boot up a 64-bit virtual image you may get complaints about BIOS settings. Figure 2 shows the required change in the BIOS, in this case on a Thinkpad™. Use caution when making changes. Some corporate security packages will require a passcode after a BIOS change before the system will reload.
Figure 2. BIOS settings for a 64-bit virtual guest
The big data used here is actually rather small. The point is not to make your laptop catch fire by grinding through a massive file, but to show sources of data that are interesting, and map-reduce jobs that answer meaningful questions.
Downloading the Hadoop virtual image
It is highly recommended that you use the Cloudera image to run these examples. Hadoop is a technology that solves problems. The Cloudera image packaging allows you to focus on big-data questions. But if you decide to assemble all the parts by yourself, Hadoop has become the problem, not the solution.
Download an image. The CDH4 image, the latest offering, is available here: CDH4 image. The prior version, CDH3, is available here: CDH3 image.
You have your choice of virtualization technologies. You can download a free virtualization environment from VMWare and others. For example, go to vmware.com and download the vmware-player. Your laptop is probably running Windows, so you'd likely download the vmware-player for Windows. The examples in this article will be using VMWare and running Ubuntu Linux, using "tar" instead of "winzip" or equivalent.
Once downloaded, untar/unzip as follows: tar -zxvf cloudera-demo-vm-cdh4.0.0-vmware.tar.gz.
Or, if you use CDH3, then use the following: tar -zxvf cloudera-demo-vm-cdh3u4-vmware.tar.gz
Unzip typically works on tar files too. Once unpacked, you can start the image as follows:
vmplayer cloudera-demo-vm.vmx .
Now you will have a screen that looks like the one shown in Figure 3.
Figure 3. Virtual machine image

The vmplayer command dives right in and starts the virtual machine. If you are using CDH3, then you will need to shut down the machine and change the memory settings. Use the power button icon next to the clock at the bottom middle of the screen to power off the virtual machine. You then have edit access to the virtual machine settings.
For CDH3 the next step is to super-charge the virtual image with more RAM. Most settings can only be changed with the virtual machine powered off. Figure 4 shows how to access the settings and increase the RAM allocated to over 2GB.
Figure 4. Adding RAM to the virtual machine
As shown in Figure 5, you can change the network setting to bridged. With this setting the virtual machine will get its own IP address. If this creates problems on your network, then you can optionally use Network Address Translation (NAT). You will be using the network to connect to the database.
Figure 5. Changing the network setting to bridged


You are limited by RAM on the host system, so do not try to allocate more RAM than what is on your machine. If you do, the computer will work very slowly.
Now for the moment you've been waiting for: power on the virtual machine. The user cloudera is automatically logged in at startup. If you need it, the cloudera password is: cloudera.

Installing Informix and DB2

You will need a database to work with. If you don't already have a database, you can download the Informix Developer Edition here, or the free DB2 Express-C Edition.
Another alternative to installing DB2 is to download the VMWare image that already has DB2 installed on a SuSE Linux operating system. Log in as root with the password: password.
Switch to the db2inst1 userid. Working as root is like driving a car without a seatbelt. Please talk to your friendly local DBA about getting the database running; this article will not cover that here. Don't try to install the database inside the Cloudera virtual image because there isn't enough free disk space.
The virtual machine will be connecting to the database using Sqoop, which requires a JDBC driver. You will need to have the JDBC driver for your database in the virtual image. You can install the Informix driver here.
The installation of the Informix JDBC driver (remember, only the driver within the virtual image, not the database) is shown in Listing 1.
Listing 1. Informix JDBC Driver Installation
tar -xvf ../JDBC.3.70.JC5DE.tar
followed by
java -jar setup.jar
Note: Choose a subdirectory relative to /home/cloudera so as not to require root permission for the installation.
The DB2 JDBC driver is in zip format, so simply unzip it to the destination directory, as shown in Listing 2.
Listing 2. DB2 JDBC driver install
  mkdir db2jdbc
 cd db2jdbc
 unzip ../ibm_data_server_driver_for_jdbc_sqlj_v10.1.zip 

A Quick Introduction to HDFS and MapReduce

Before you start moving data between your relational database and Hadoop, you need a quick introduction to HDFS and MapReduce. There are many "hello world" style tutorials for Hadoop, so the examples here are intended to give you just enough background for the database exercises to make sense to you.
HDFS provides storage across the nodes in your cluster. The first step in using Hadoop is putting data into HDFS. The code shown in Listing 3 gets a copy of a book by Mark Twain and a book by James Fenimore Cooper and copies these texts into HDFS.
Listing 3. Load Mark Twain and James Fenimore Cooper into HDFS

# Install wget utility into the virtual image
sudo yum install wget

# Use wget to download the Twain and Cooper works
$ wget -U firefox http://www.gutenberg.org/cache/epub/76/pg76.txt
$ wget -U firefox http://www.gutenberg.org/cache/epub/3285/pg3285.txt

# Load both into the HDFS file system
# First give the files better names
# DS for Deerslayer
# HF for Huckleberry Finn
$ mv pg3285.txt DS.txt
$ mv pg76.txt HF.txt

# This next command will fail if the directory already exists
$ hadoop fs -mkdir /user/cloudera

# Now put the text into the directory
$ hadoop fs -put HF.txt /user/cloudera


# Way too much typing, create aliases for hadoop commands
$ alias hput="hadoop fs -put"
$ alias hcat="hadoop fs -cat"
$ alias hls="hadoop fs -ls"
# For CDH4
$ alias hrmr="hadoop fs -rm -r"
# For CDH3
$ alias hrmr="hadoop fs -rmr"

# Load the other book
# But add some compression because we can

$ gzip DS.txt

# The . in the next command references the cloudera home directory
# in hdfs, /user/cloudera

$ hput DS.txt.gz .

# Now take a look at the files we have in place
$ hls
Found 2 items
-rw-r--r-- 1 cloudera supergroup  459386 2012-08-08 19:34 /user/cloudera/DS.txt.gz
-rw-r--r-- 1 cloudera supergroup  597587 2012-08-08 19:35 /user/cloudera/HF.txt
You now have two files in a directory in HDFS. Please contain your excitement. Seriously, on a single node and with only about 1 megabyte of data, this is about as exciting as watching paint dry. But if this were a 400-node cluster and you had 5 petabytes live, then you really would have trouble containing your excitement.
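If you want to poke at what just landed in HDFS before running any jobs, a few more standard hadoop fs subcommands are handy; a small optional sketch using the files loaded above:

# sizes of what was just loaded
hadoop fs -du /user/cloudera

# peek at the tail end of the uncompressed file
hadoop fs -tail /user/cloudera/HF.txt

# -text decompresses the gzipped file on the fly while reading it
hadoop fs -text /user/cloudera/DS.txt.gz | head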
Many of the Hadoop tutorials use the word count example that is included in the example jar file. It turns out that a lot of analysis involves counting and aggregating. The example in Listing 4 shows you how to invoke the word counter.
Listing 4. Counting Words in Twain and Cooper
# Hadoop comes with some examples
# This next line uses the provided java implementation of a
# word count program

# For CDH4:
hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar wordcount HF.txt HF.out

# For CDH3:
hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount HF.txt HF.out

# For CDH4:
hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar wordcount DS.txt.gz DS.out

# For CDH3:
hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount DS.txt.gz DS.out
The .gz suffix on DS.txt.gz tells Hadoop to deal with decompression as part of the Map-Reduce processing. Cooper is a bit verbose so well deserves the compaction.
There is quite a stream of messages from running your word count job. Hadoop is happy to provide a lot of detail about the Mapping and Reducing programs running on your behalf. The critical lines you want to look for are shown in Listing 5, including a second listing of a failed job and how to fix one of the most common errors you'll encounter when running MapReduce.



Listing 5. MapReduce messages - the "happy path"
$ hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount HF.txt HF.out
12/08/08 19:23:46 INFO input.FileInputFormat: Total input paths to process : 1
12/08/08 19:23:47 WARN snappy.LoadSnappy: Snappy native library is available
12/08/08 19:23:47 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/08/08 19:23:47 INFO snappy.LoadSnappy: Snappy native library loaded
12/08/08 19:23:47 INFO mapred.JobClient: Running job: job_201208081900_0002
12/08/08 19:23:48 INFO mapred.JobClient:  map 0% reduce 0%
12/08/08 19:23:54 INFO mapred.JobClient:  map 100% reduce 0%
12/08/08 19:24:01 INFO mapred.JobClient:  map 100% reduce 33%
12/08/08 19:24:03 INFO mapred.JobClient:  map 100% reduce 100%
12/08/08 19:24:04 INFO mapred.JobClient: Job complete: job_201208081900_0002
12/08/08 19:24:04 INFO mapred.JobClient: Counters: 26
12/08/08 19:24:04 INFO mapred.JobClient:   Job Counters
12/08/08 19:24:04 INFO mapred.JobClient:     Launched reduce tasks=1
12/08/08 19:24:04 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=5959
12/08/08 19:24:04 INFO mapred.JobClient:     Total time spent by all reduces ...
12/08/08 19:24:04 INFO mapred.JobClient:     Total time spent by all maps waiting ...
12/08/08 19:24:04 INFO mapred.JobClient:     Launched map tasks=1
12/08/08 19:24:04 INFO mapred.JobClient:     Data-local map tasks=1
12/08/08 19:24:04 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=9433
12/08/08 19:24:04 INFO mapred.JobClient:   FileSystemCounters
12/08/08 19:24:04 INFO mapred.JobClient:     FILE_BYTES_READ=192298
12/08/08 19:24:04 INFO mapred.JobClient:     HDFS_BYTES_READ=597700
12/08/08 19:24:04 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=498740
12/08/08 19:24:04 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=138218
12/08/08 19:24:04 INFO mapred.JobClient:   Map-Reduce Framework
12/08/08 19:24:04 INFO mapred.JobClient:     Map input records=11733
12/08/08 19:24:04 INFO mapred.JobClient:     Reduce shuffle bytes=192298
12/08/08 19:24:04 INFO mapred.JobClient:     Spilled Records=27676
12/08/08 19:24:04 INFO mapred.JobClient:     Map output bytes=1033012
12/08/08 19:24:04 INFO mapred.JobClient:     CPU time spent (ms)=2430
12/08/08 19:24:04 INFO mapred.JobClient:     Total committed heap usage (bytes)=183701504
12/08/08 19:24:04 INFO mapred.JobClient:     Combine input records=113365
12/08/08 19:24:04 INFO mapred.JobClient:     SPLIT_RAW_BYTES=113
12/08/08 19:24:04 INFO mapred.JobClient:     Reduce input records=13838
12/08/08 19:24:04 INFO mapred.JobClient:     Reduce input groups=13838
12/08/08 19:24:04 INFO mapred.JobClient:     Combine output records=13838
12/08/08 19:24:04 INFO mapred.JobClient:     Physical memory (bytes) snapshot=256479232
12/08/08 19:24:04 INFO mapred.JobClient:     Reduce output records=13838
12/08/08 19:24:04 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1027047424
12/08/08 19:24:04 INFO mapred.JobClient:     Map output records=113365
What do all these messages mean? Hadoop has done a lot of work and is trying to tell you about it, including the following.
Checked to see if the input file exists.
Checked to see if the output directory exists, and if it does, abort the job. Nothing is worse than overwriting hours of computation because of a simple typing error.
Distributed the Java jar file to all the nodes responsible for doing the work. In this case, that is just one node.
Ran the mapper phase of the job. Typically this parses the input file and emits a key-value pair. Note that the key and value can be objects.
Ran the sort phase, which sorts the mapper output based on the key.
Ran the reduce phase, which typically summarizes the key-value stream and writes output to HDFS.
Created many metrics along the way.
Figure 6 shows an example page of the Hadoop job metrics after running the Hive exercise.
Figure 6. Hadoop web page sample
What did the job do, and where is the output? Both are good questions, and are answered in Listing 6.
Listing 6. Map-Reduce Output
# Way too much typing, create aliases for hadoop commands
$ alias hput="hadoop fs -put"
$ alias hcat="hadoop fs -cat"
$ alias hls="hadoop fs -ls"
$ alias hrmr="hadoop fs -rmr"

# First list the output directory
$ hls /user/cloudera/HF.out
Found 3 items
-rw-r--r-- 1 cloudera supergroup      0 2012-08-08 19:38 /user/cloudera/HF.out/_SUCCESS
drwxr-xr-x - cloudera supergroup      0 2012-08-08 19:38 /user/cloudera/HF.out/_logs
-rw-r--r-- 1 cl... sup...        138218 2012-08-08 19:38 /user/cloudera/HF.out/part-r-00000

# Now cat the file and pipe it to the less command
$ hcat /user/cloudera/HF.out/part-r-00000 | less

# Here are a few lines from the file, the word elephants only got used twice
elder   1
eldest  1
elect   1
elected 1
electronic      27
electronically  1
Electronically  1
elegant 1
elegant!--'deed 1
elegant,        1
elephants       2
In the event that you run the same job twice and forget to delete the output directory, you will receive the error messages shown in Listing 7. Fixing this error is as simple as deleting the directory.
Listing 7. MapReduce messages - failure due to the output directory already existing in HDFS
# Way too much typing, create aliases for hadoop commands
$ alias hput="hadoop fs -put"
$ alias hcat="hadoop fs -cat"
$ alias hls="hadoop fs -ls"
$ alias hrmr="hadoop fs -rmr"

$ hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount HF.txt HF.out
12/08/08 19:26:23 INFO mapred.JobClient:
Cleaning up the staging area hdfs://0.0.0.0/var/l...
12/08/08 19:26:23 ERROR security.UserGroupInformation: PriviledgedActionException
as:cloudera (auth:SIMPLE)
cause:org.apache.hadoop.mapred.FileAlreadyExistsException:
Output directory HF.out already exists
org.apache.hadoop.mapred.FileAlreadyExistsException:
Output directory HF.out already exists
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.
checkOutputSpecs(FileOutputFormat.java:132)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:872)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)

lines deleted

# The simple fix is to remove the existing output directory

$ hrmr HF.out

# Now you can re-run the job successfully

# If you run short of space and the namenode enters safemode
# clean up some file space and then

$ hadoop dfsadmin -safemode leave
Hadoop includes a user interface to inspect HDFS status. Figure 7 shows the output of the word count job.
Figure 7. Exploring HDFS with a browser
A more sophisticated management console is available as a free download from the Cloudera web site. It provides a number of capabilities beyond the standard Hadoop web interfaces. Notice that the health status of HDFS in Figure 8 is shown as Bad.
Figure 8. Hadoop services managed by Cloudera Manager

Why bad? Because in a single virtual machine, HDFS cannot make three copies of the data blocks. When blocks are under-replicated there is a risk of data loss, so the health of the system is bad. Good thing you aren't trying to run production Hadoop jobs on a single node.
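If you are curious about the under-replication on your own single-node image, a couple of standard HDFS commands will show it. This is a minimal, optional sketch using the files loaded earlier:

# cluster report, including under-replicated blocks
hadoop dfsadmin -report

# fsck lists the under-replicated blocks for the files loaded earlier
hadoop fsck /user/cloudera -blocks

# optionally lower the replication target to 1 for a file on a single node
hadoop fs -setrep -w 1 /user/cloudera/HF.txt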
You are not limited to Java for your MapReduce jobs. This last example of MapReduce uses Hadoop Streaming to support a mapper written in Python and a reducer written in AWK. No, you don't have to be a Java guru to write Map-Reduce!
Mark Twain was not a big fan of Cooper. In this use case, Hadoop will provide some simple literary criticism comparing Twain and Cooper. The Flesch-Kincaid test calculates the reading level of a particular text. One of the factors in this analysis is the average sentence length. Parsing sentences turns out to be more complicated than just looking for the period character. The openNLP package and the Python NLTK package have excellent sentence parsers. For simplicity, the example shown in Listing 8 will use word length as a surrogate for the number of syllables in a word. If you want to take this to the next level, implement the Flesch-Kincaid test in MapReduce, crawl the web, and calculate reading levels for your favorite news sites.
Listing 8. A Python-based mapper for literary criticism
# here is the mapper we'll connect to the Hadoop streaming interface

# the mapper is reading the text in the file - not really appreciating Twain's mood
#

# modified from
# http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
$ cat mapper.py
#!/usr/bin/env python
import sys

# read stdin
for linein in sys.stdin:
    # strip blanks
    linein = linein.strip()
    # split into words
    mywords = linein.split()
    # loop on mywords, output the length of each word
    for word in mywords:
        # the reducer only cares about the first column,
        # normally there is a key - value pair
        print '%s %s' % (len(word), 0)
The mapper output for the word "Twain" would be: 5 0. The numerical word lengths are sorted and delivered to the reducer in order. In the examples shown in Listings 9 and 10, sorting the data isn't required to get the correct output, but the sort is built into the MapReduce infrastructure and will happen anyway.
Listing 9. An AWK reducer for literary criticism
# the awk code is modified from http://www.commandlinefu.com

# awk is calculating
#  NR - the number of words in all
#  sum/NR - the average word length
#  sqrt(mean2/NR) - the standard deviation

$ cat statsreducer.awk
awk '{delta = $1 - avg; avg += delta / NR; \
mean2 += delta * ($1 - avg); sum = $1 + sum } \
END { print NR, sum/NR, sqrt(mean2/NR); }'
Listing 10. Running the Python mapper and AWK reducer with Hadoop Streaming
# test locally

# because we're using Hadoop Streaming, we can test the
# mapper and reducer with simple pipes

# the "sort" in the pipeline is a reminder that the keys are sorted
# before presentation to the reducer
# in this example it doesn't matter what order the
# word length values are presented in for calculating the std deviation

$ zcat ../DS.txt.gz | ./mapper.py | sort | ./statsreducer.awk
215107 4.56068 2.50734

# now run in hadoop with streaming

# CDH4
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input HF.txt -output HFstats -file ./mapper.py -file ./statsreducer.awk \
-mapper ./mapper.py -reducer ./statsreducer.awk

# CDH3
$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u4.jar \
-input HF.txt -output HFstats -file ./mapper.py -file ./statsreducer.awk \
-mapper ./mapper.py -reducer ./statsreducer.awk

$ hls HFstats
Found 3 items
-rw-r--r-- 1 cloudera supergroup  0 2012-08-12 15:38 /user/cloudera/HFstats/_SUCCESS
drwxr-xr-x - cloudera supergroup  0 2012-08-12 15:37 /user/cloudera/HFstats/_logs
-rw-r--r-- 1 cloudera ...        24 2012-08-12 15:37 /user/cloudera/HFstats/part-00000

$ hcat /user/cloudera/HFstats/part-00000
113365 4.11227 2.17086

# now for cooper

$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u4.jar \
-input DS.txt.gz -output DSstats -file ./mapper.py -file ./statsreducer.awk \
-mapper ./mapper.py -reducer ./statsreducer.awk

$ hcat /user/cloudera/DSstats/part-00000
215107 4.56068 2.50734
Mark Twain fans can relax knowing that Hadoop finds that Cooper uses longer words, and with an "alarming standard deviation" (humor intended). That does of course assume that shorter words are better. Let's move on: next up is writing data from HDFS into Informix and DB2.
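If you do take up the earlier Flesch-Kincaid suggestion, the final arithmetic is simple once your MapReduce jobs have produced total word, sentence, and syllable counts. Here is a minimal sketch of that last step, in the same awk style as the reducer above; the echoed counts are made-up placeholders, and counting sentences and syllables is the part you would implement yourself in the mapper:

# Flesch-Kincaid grade level from three totals: words, sentences, syllables
echo "113365 6000 150000" | awk '{ words=$1; sentences=$2; syllables=$3;
  grade = 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59;
  print "Flesch-Kincaid grade level:", grade }'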

Using Sqoop to write data from HDFS to Informix, DB2, or MySQL via JDBC

The Apache Sqoop project is an open source, JDBC-based, Hadoop-to-database data movement utility. Sqoop was originally created in a hackathon at Cloudera and then open sourced.
Moving data from HDFS into a relational database is a common use case. HDFS and Map-Reduce are great at doing the heavy lifting. For simple queries or a back-end store for a web site, caching the Map-Reduce output in a relational store is a good design pattern. You can avoid re-running the Map-Reduce word count by just Sqooping the results into Informix and DB2. You have generated data about Twain and Cooper, now let's move it into a database, as shown in Listing 11.
Listing 11. JDBC Driver Configuration
# Sqoop needs access to the JDBC driver for every
# database that it will access

# please copy the driver for each database you plan to use for these exercises
# the MySQL database and driver are already installed in the virtual image
# but you still need to copy the driver to the sqoop/lib directory

# one-time copy of the jdbc drivers to the sqoop lib directory
$ sudo cp Informix_JDBC_Driver/lib/ifxjdbc*.jar /usr/lib/sqoop/lib/
$ sudo cp db2jdbc/db2jcc*.jar /usr/lib/sqoop/lib/
$ sudo cp /usr/lib/hive/lib/mysql-connector-java-5.1.15-bin.jar /usr/lib/sqoop/lib/
The examples shown in Listings 12 through 15 are presented for each database. Please skip to the example of interest to you: Informix, DB2, or MySQL. Database polyglots can have fun doing every example. If your database of choice is not included here, it won't be a big challenge to make these samples work elsewhere.
Listing 12. Informix users: Sqoop writing the word count results to Informix
# create a target table to put the data into
# fire up dbaccess and use this sql
# create table wordcount ( word char(36) primary key, n int);

# now run the sqoop command
# this is best put in a shell script to help avoid typos...

$ sqoop export -D sqoop.export.records.per.statement=1 \
--fields-terminated-by '\t' --driver com.informix.jdbc.IfxDriver \
--connect \
"jdbc:informix-sqli://myhost:54321/stores_demo:informixserver=i7;user=me;password=mypw" \
--table wordcount --export-dir /user/cloudera/HF.out
Listing 13. Informix users: output of the Sqoop export to Informix, including failures and retries
12/08/08 21:39:42 INFO manager.SqlManager: Using default fetchSize of 1000
12/08/08 21:39:42 INFO tool.CodeGenTool: Beginning code generation
12/08/08 21:39:43 INFO manager.SqlManager: Executing SQL statement: SELECT t.*
FROM wordcount AS t WHERE 1=0
12/08/08 21:39:43 INFO manager.SqlManager: Executing SQL statement: SELECT t.*
FROM wordcount AS t WHERE 1=0
12/08/08 21:39:43 INFO orm.CompilationManager: HADOOP_HOME is /usr/lib/hadoop
12/08/08 21:39:43 INFO orm.CompilationManager: Found hadoop core jar at:
/usr/lib/hadoop/hadoop-0.20.2-cdh3u4-core.jar
12/08/08 21:39:45 INFO orm.CompilationManager: Writing jar file:
/tmp/sqoop-cloudera/compile/248b77c05740f863a15e0136accf32cf/wordcount.jar
12/08/08 21:39:45 INFO mapreduce.ExportJobBase: Beginning export of wordcount
12/08/08 21:39:45 INFO manager.SqlManager: Executing SQL statement: SELECT t.*
FROM wordcount AS t WHERE 1=0
12/08/08 21:39:46 INFO input.FileInputFormat: Total input paths to process : 1
12/08/08 21:39:46 INFO input.FileInputFormat: Total input paths to process : 1
12/08/08 21:39:46 INFO mapred.JobClient: Running job: job_201208081900_0012
12/08/08 21:39:47 INFO mapred.JobClient:  map 0% reduce 0%
12/08/08 21:39:58 INFO mapred.JobClient:  map 38% reduce 0%
12/08/08 21:40:00 INFO mapred.JobClient:  map 64% reduce 0%
12/08/08 21:40:04 INFO mapred.JobClient:  map 82% reduce 0%
12/08/08 21:40:07 INFO mapred.JobClient:  map 98% reduce 0%
12/08/08 21:40:09 INFO mapred.JobClient: Task Id :
attempt_201208081900_0012_m_000000_0, Status : FAILED
java.io.IOException: java.sql.SQLException:
    Encoding or code set not supported.
at ...SqlRecordWriter.close(AsyncSqlRecordWriter.java:187)
at ...$NewDirectOutputCollector.close(MapTask.java:540)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:649)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at ....doAs(UserGroupInformation.java:1177)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: java.sql.SQLException: Encoding or code set not supported.
at com.informix.util.IfxErrMsg.getSQLException(IfxErrMsg.java:413)
at com.informix.jdbc.IfxChar.toIfx(IfxChar.java:135)
at com.informix.jdbc.IfxSqli.a(IfxSqli.java:1304)
at com.informix.jdbc.IfxSqli.d(IfxSqli.java:1605)
at com.informix.jdbc.IfxS
12/08/08 21:40:11 INFO mapred.JobClient:  map 0% reduce 0%
12/08/08 21:40:15 INFO mapred.JobClient: Task Id :
attempt_201208081900_0012_m_000000_1, Status : FAILED
java.io.IOException: java.sql.SQLException:
    Unique constraint (informix.u169_821) violated.
at .mapreduce.AsyncSqlRecordWriter.write(AsyncSqlRecordWriter.java:223)
at .mapreduce.AsyncSqlRecordWriter.write(AsyncSqlRecordWriter.java:49)
at .mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:531)
at .mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at com.cloudera.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:82)
at com.cloudera.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:40)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at .mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:189)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.a
12/08/08 21:40:20 INFO mapred.JobClient:
Task Id : attempt_201208081900_0012_m_000000_2, Status : FAILED
java.sql.SQLException: Unique constraint (informix.u169_821) violated.
at .mapreduce.AsyncSqlRecordWriter.write(AsyncSqlRecordWriter.java:223)
at .mapreduce.AsyncSqlRecordWriter.write(AsyncSqlRecordWriter.java:49)
at .mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:531)
at .mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at com.cloudera.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:82)
at com.cloudera.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:40)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at .mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:189)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.a
12/08/08 21:40:27 INFO mapred.JobClient: Job complete: job_201208081900_0012
12/08/08 21:40:27 INFO mapred.JobClient: Counters: 7
12/08/08 21:40:27 INFO mapred.JobClient:   Job Counters
12/08/08 21:40:27 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=38479
12/08/08 21:40:27 INFO mapred.JobClient:
Total time spent by all reduces waiting after reserving slots (ms)=0
12/08/08 21:40:27 INFO mapred.JobClient:
Total time spent by all maps waiting after reserving slots (ms)=0
12/08/08 21:40:27 INFO mapred.JobClient:     Launched map tasks=4
12/08/08 21:40:27 INFO mapred.JobClient:     Data-local map tasks=4
12/08/08 21:40:27 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
12/08/08 21:40:27 INFO mapred.JobClient:     Failed map tasks=1
12/08/08 21:40:27 INFO mapreduce.ExportJobBase:
Transferred 0 bytes in 41.5758 seconds (0 bytes/sec)
12/08/08 21:40:27 INFO mapreduce.ExportJobBase: Exported 0 records.
12/08/08 21:40:27 ERROR tool.ExportTool: Error during export: Export job failed!

# despite the errors above, rows are inserted into the wordcount table
# one row is missing
# the retry and the duplicate key exception are most likely related, but
# troubleshooting will be saved for a later article

# check how we did
# nothing like a "here document" in a shell script

$ dbaccess stores_demo - <<eoj
> select count(*) from wordcount;
> eoj

Database selected.
(count(*))
13837
1 row(s) retrieved.
Database closed.
Listing 14. DB2 users: Sqoop writing the word count results into DB2
# here is the db2 syntax
# create a destination table for db2
#
# db2 => connect to sample
#
#   Database Connection Information
#
# Database server        = DB2/LINUXX8664 10.1.0
# SQL authorization ID   = DB2INST1
# Local database alias   = SAMPLE
#
# db2 => create table wordcount ( word char(36) not null primary key, n int)
# DB20000I  The SQL command completed successfully.
#

sqoop export -D sqoop.export.records.per.statement=1 \
--fields-terminated-by '\t' \
--driver com.ibm.db2.jcc.DB2Driver \
--connect "jdbc:db2://192.168.1.131:50001/sample" \
--username db2inst1 --password db2inst1 \
--table wordcount --export-dir /user/cloudera/HF.out

12/08/09 12:32:59 WARN tool.BaseSqoopTool: Setting your password on the
command-line is insecure. Consider using -P instead.
12/08/09 12:32:59 INFO manager.SqlManager: Using default fetchSize of 1000
12/08/09 12:32:59 INFO tool.CodeGenTool: Beginning code generation
12/08/09 12:32:59 INFO manager.SqlManager: Executing SQL statement:
SELECT t.* FROM wordcount AS t WHERE 1=0
12/08/09 12:32:59 INFO manager.SqlManager: Executing SQL statement:
SELECT t.* FROM wordcount AS t WHERE 1=0
12/08/09 12:32:59 INFO orm.CompilationManager: HADOOP_HOME is /usr/lib/hadoop
12/08/09 12:32:59 INFO orm.CompilationManager: Found hadoop core jar
at: /usr/lib/hadoop/hadoop-0.20.2-cdh3u4-core.jar
12/08/09 12:33:00 INFO orm.CompilationManager: Writing jar
file: /tmp/sqoop-cloudera/compile/5532984df6e28e5a45884a21bab245ba/wordcount.jar
12/08/09 12:33:00 INFO mapreduce.ExportJobBase: Beginning export of wordcount
12/08/09 12:33:01 INFO manager.SqlManager: Executing SQL statement:
SELECT t.* FROM wordcount AS t WHERE 1=0
12/08/09 12:33:02 INFO input.FileInputFormat: Total input paths to process : 1
12/08/09 12:33:02 INFO input.FileInputFormat: Total input paths to process : 1
12/08/09 12:33:02 INFO mapred.JobClient: Running job: job_201208091208_0002
12/08/09 12:33:03 INFO mapred.JobClient:  map 0% reduce 0%
12/08/09 12:33:14 INFO mapred.JobClient:  map 24% reduce 0%
12/08/09 12:33:17 INFO mapred.JobClient:  map 44% reduce 0%
12/08/09 12:33:20 INFO mapred.JobClient:  map 67% reduce 0%
12/08/09 12:33:23 INFO mapred.JobClient:  map 86% reduce 0%
12/08/09 12:33:24 INFO mapred.JobClient:  map 100% reduce 0%
12/08/09 12:33:25 INFO mapred.JobClient: Job complete: job_201208091208_0002
12/08/09 12:33:25 INFO mapred.JobClient: Counters: 16
12/08/09 12:33:25 INFO mapred.JobClient:   Job Counters
12/08/09 12:33:25 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=21648
12/08/09 12:33:25 INFO mapred.JobClient:     Total time spent by all
reduces waiting after reserving slots (ms)=0
12/08/09 12:33:25 INFO mapred.JobClient:     Total time spent by all
maps waiting after reserving slots (ms)=0
12/08/09 12:33:25 INFO mapred.JobClient:     Launched map tasks=1
12/08/09 12:33:25 INFO mapred.JobClient:     Data-local map tasks=1
12/08/09 12:33:25 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
12/08/09 12:33:25 INFO mapred.JobClient:   FileSystemCounters
12/08/09 12:33:25 INFO mapred.JobClient:     HDFS_BYTES_READ=138350
12/08/09 12:33:25 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=69425
12/08/09 12:33:25 INFO mapred.JobClient:   Map-Reduce Framework
12/08/09 12:33:25 INFO mapred.JobClient:     Map input records=13838
12/08/09 12:33:25 INFO mapred.JobClient:     Physical memory (bytes) snapshot=105148416
12/08/09 12:33:25 INFO mapred.JobClient:     Spilled Records=0
12/08/09 12:33:25 INFO mapred.JobClient:     CPU time spent (ms)=9250
12/08/09 12:33:25 INFO mapred.JobClient:     Total committed heap usage (bytes)=42008576
12/08/09 12:33:25 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=596447232
12/08/09 12:33:25 INFO mapred.JobClient:     Map output records=13838
12/08/09 12:33:25 INFO mapred.JobClient:     SPLIT_RAW_BYTES=126
12/08/09 12:33:25 INFO mapreduce.ExportJobBase: Transferred 135.1074 KB
in 24.4977 seconds (5.5151 KB/sec)
12/08/09 12:33:25 INFO mapreduce.ExportJobBase: Exported 13838 records.

# check on the results...
#
# db2 => select count(*) from wordcount
#
# 1
# -----------
#       13838
#
#   1 record(s) selected.
#
#
Listing 15. MySQL users: Sqoop writing the word count results into MySQL
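For MySQL, the export follows the same pattern as the Informix and DB2 examples above. The following is only a rough sketch: the database name, user, and password are placeholders for the MySQL instance built into the virtual image, and the wordcount table must be created first.

# in mysql, create a target table first, for example:
#   create table wordcount ( word char(36) primary key, n int);

sqoop export -D sqoop.export.records.per.statement=1 \
--fields-terminated-by '\t' \
--connect "jdbc:mysql://localhost/mydb" \
--username myuser --password mypw \
--table wordcount --export-dir /user/cloudera/HF.out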

Importing data into HDFS from Informix and DB2 with Sqoop

Getting data into Hadoop HDFS is also done with Sqoop. The bidirectional functionality is controlled via the import parameter.
The sample databases that come with both products have some simple datasets that you can use for this purpose. Listing 16 shows the syntax and results for Sqooping each server.
For MySQL users, please adapt the syntax from the Informix or DB2 examples that follow; a rough sketch is included below.
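As an illustration only, a MySQL import of a table from the bundled instance would look roughly like this; the database, table, and user names are hypothetical placeholders:

# import a MySQL table into HDFS; db, table, and credentials are placeholders
sqoop import \
--connect "jdbc:mysql://localhost/mydb" \
--username myuser -P \
--table mytable -m 1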
Listing 16. Sqoop import of an Informix sample database to HDFS
$ sqoop import --driver com.informix.jdbc.IfxDriver \
--connect \
"jdbc:informix-sqli://192.168.1.143:54321/stores_demo:informixserver=ifx117" \
--table orders \
--username informix --password useyours

12/08/09 14:39:18 WARN tool.BaseSqoopTool: Setting your password on the command-line
is insecure. Consider using -P instead.
12/08/09 14:39:18 INFO manager.SqlManager: Using default fetchSize of 1000
12/08/09 14:39:18 INFO tool.CodeGenTool: Beginning code generation
12/08/09 14:39:19 INFO manager.SqlManager: Executing SQL statement:
SELECT t.* FROM orders AS t WHERE 1=0
12/08/09 14:39:19 INFO manager.SqlManager: Executing SQL statement:
SELECT t.* FROM orders AS t WHERE 1=0
12/08/09 14:39:19 INFO orm.CompilationManager: HADOOP_HOME is /usr/lib/hadoop
12/08/09 14:39:19 INFO orm.CompilationManager: Found hadoop core jar
at: /usr/lib/hadoop/hadoop-0.20.2-cdh3u4-core.jar
12/08/09 14:39:21 INFO orm.CompilationManager: Writing jar
file: /tmp/sqoop-cloudera/compile/0b59eec7007d3cff1fc0ae446ced3637/orders.jar
12/08/09 14:39:21 INFO mapreduce.ImportJobBase: Beginning import of orders
12/08/09 14:39:21 INFO manager.SqlManager: Executing SQL statement:
SELECT t.* FROM orders AS t WHERE 1=0
12/08/09 14:39:22 INFO db.DataDrivenDBInputFormat: BoundingValsQuery:
SELECT MIN(order_num), MAX(order_num) FROM orders
12/08/09 14:39:22 INFO mapred.JobClient: Running job: job_201208091208_0003
12/08/09 14:39:23 INFO mapred.JobClient:  map 0% reduce 0%
12/08/09 14:39:31 INFO mapred.JobClient:  map 25% reduce 0%
12/08/09 14:39:32 INFO mapred.JobClient:  map 50% reduce 0%
12/08/09 14:39:36 INFO mapred.JobClient:  map 100% reduce 0%
12/08/09 14:39:37 INFO mapred.JobClient: Job complete: job_201208091208_0003
12/08/09 14:39:37 INFO mapred.JobClient: Counters: 16
12/08/09 14:39:37 INFO mapred.JobClient:   Job Counters
12/08/09 14:39:37 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=22529
12/08/09 14:39:37 INFO mapred.JobClient:     Total time spent by all reduces
waiting after reserving slots (ms)=0
12/08/09 14:39:37 INFO mapred.JobClient:     Total time spent by all maps
waiting after reserving slots (ms)=0
12/08/09 14:39:37 INFO mapred.JobClient:     Launched map tasks=4
12/08/09 14:39:37 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
12/08/09 14:39:37 INFO mapred.JobClient:   FileSystemCounters
12/08/09 14:39:37 INFO mapred.JobClient:     HDFS_BYTES_READ=457
12/08/09 14:39:37 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=278928
12/08/09 14:39:37 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=2368
12/08/09 14:39:37 INFO mapred.JobClient:   Map-Reduce Framework
12/08/09 14:39:37 INFO mapred.JobClient:     Map input records=23
12/08/09 14:39:37 INFO mapred.JobClient:     Physical memory (bytes) snapshot=291364864
12/08/09 14:39:37 INFO mapred.JobClient:     Spilled Records=0
12/08/09 14:39:37 INFO mapred.JobClient:     CPU time spent (ms)=1610
12/08/09 14:39:37 INFO mapred.JobClient:     Total committed heap usage (bytes)=168034304
12/08/09 14:39:37 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2074587136
12/08/09 14:39:37 INFO mapred.JobClient:     Map output records=23
12/08/09 14:39:37 INFO mapred.JobClient:     SPLIT_RAW_BYTES=457
12/08/09 14:39:37 INFO mapreduce.ImportJobBase: Transferred 2.3125 KB in 16.7045
seconds (141.7585 bytes/sec)
12/08/09 14:39:37 INFO mapreduce.ImportJobBase: Retrieved 23 records.

# now look at the results

$ hls
Found 4 items
-rw-r--r-- 1 cloudera supergroup 459386 2012-08-08 19:34 /user/cloudera/DS.txt.gz
drwxr-xr-x - cloudera supergroup      0 2012-08-08 19:38 /user/cloudera/HF.out
-rw-r--r-- 1 cloudera supergroup 597587 2012-08-08 19:35 /user/cloudera/HF.txt
drwxr-xr-x - cloudera supergroup      0 2012-08-09 14:39 /user/cloudera/orders
$ hls orders
Found 6 items
-rw-r--r-- 1 cloudera supergroup   0 2012-08-09 14:39 /user/cloudera/orders/_SUCCESS
drwxr-xr-x - cloudera supergroup   0 2012-08-09 14:39 /user/cloudera/orders/_logs
-rw-r--r-- 1 cloudera ...roup    630 2012-08-09 14:39 /user/cloudera/orders/part-m-00000
-rw-r--r-- 1 cloudera supergroup 564 2012-08-09 14:39 /user/cloudera/orders/part-m-00001
-rw-r--r-- 1 cloudera supergroup 527 2012-08-09 14:39 /user/cloudera/orders/part-m-00002
-rw-r--r-- 1 cloudera supergroup 647 2012-08-09 14:39 /user/cloudera/orders/part-m-00003

# wow there are four part-m-0000x files
# look inside one

# some of the lines are edited to fit on the screen
$ hcat /user/cloudera/orders/part-m-00002
1013,2008-06-22,104,express,n,B77930,2008-07-10,60.80,12.20,2008-07-31
1014,2008-06-25,106,ring bell,n,8052,2008-07-03,40.60,12.30,2008-07-10
1015,2008-06-27,110,n,MA003,2008-07-16,20.60,6.30,2008-08-31
1016,2008-06-29,119,St.,n,PC6782,2008-07-12,35.00,11.80,null
1017,2008-07-09,120,use,n,DM354331,2008-07-13,60.00,18.00,null
Why are there four different files, each containing only part of the data? Sqoop is a highly parallelized utility. If a 4000-node cluster running Sqoop did a full-throttle import from a database, the 4000 connections would look very much like a denial-of-service attack against the database. Sqoop's default connection limit is four JDBC connections. Each connection generates a data file in HDFS, hence the four files. Not to worry, you'll see how Hadoop works across these files without any difficulty.
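If you want to tune that parallelism, or pull the pieces back together locally, two standard options help. A brief optional sketch:

# -m (or --num-mappers) controls the number of parallel connections and part files;
# for example, add:  -m 8 --split-by order_num   to the import command above

# pull the four part files back together into one local file
hadoop fs -getmerge /user/cloudera/orders ./orders_all.txt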
The next step is to import a DB2 table. As shown in Listing 17, by specifying the option -m 1 , you can import a table without a primary key, and the result is a single file.
Listing 17. Sqoop import of a DB2 sample table to HDFS
# very much the same as above, just a different jdbc connection
# and a different table name

sqoop import --driver com.ibm.db2.jcc.DB2Driver \
--connect "jdbc:db2://192.168.1.131:50001/sample" \
--table staff --username db2inst1 \
--password db2inst1 -m 1

# here is another example
# in this case Sqoop is set to use a default schema different from
# the login user schema

sqoop import --driver com.ibm.db2.jcc.DB2Driver \
--connect "jdbc:db2://192.168.1.3:50001/sample:currentSchema=DB2INST1;" \
--table helloworld \
--target-dir "/user/cloudera/sqoopin2" \
--username marty \
-P -m 1

# the schema name is CASE SENSITIVE
# the -P option prompts for a password that will not be visible in
# a "ps" listing

Using Hive: Joining Informix and DB2 data

There is an interesting use case to join data from Informix to DB2. Not very exciting for two trivial tables, but a huge win for multiple terabytes or petabytes of data.
There are two fundamental approaches for joining different data sources: leaving the data where it is and using federation technology, versus moving the data to a single store to perform the join. The economics and performance of Hadoop make moving the data into HDFS and doing the heavy lifting with MapReduce an easy choice. Network bandwidth limitations create a fundamental barrier for federation-style technology if the data stays in place. For more information about federation, please see the Resources section.
Hive provides a subset of SQL for operating on a cluster. It does not provide transaction semantics. It is not a replacement for Informix or DB2. If you have some heavy lifting in the form of table joins, even if you have some smaller tables but need to do nasty Cartesian products, Hadoop is the tool of choice.
To use the Hive query language, a SQL subset called HiveQL, requires table metadata. You can define the metadata against existing files in HDFS. Sqoop provides a convenient shortcut with the create-hive-table option; a rough sketch follows.
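Here is a hedged sketch of that shortcut, which is not used in the listing below (where the tables are defined by hand): it imports a table and creates the matching Hive table in one step. The connection string follows the same placeholder style as the earlier Informix examples.

# import the Informix customer table and create the matching Hive table in one step
sqoop import --driver com.informix.jdbc.IfxDriver \
--connect "jdbc:informix-sqli://myhost:54321/stores_demo:informixserver=i7;user=me;password=mypw" \
--table customer \
--hive-import --create-hive-table --hive-table customer \
-m 1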
MySQL users should feel free to adapt the examples shown in Listing 18. An interesting exercise would be joining MySQL, or any other relational database table, to a large spreadsheet.
Listing 18. Joining the informix.customer table to the db2.staff table
# import the customer table into Hive
$ sqoop import --driver com.informix.jdbc.IfxDriver \
--connect \
"jdbc:informix-sqli://myhost:54321/stores_demo:informixserver=ifx;user=me;password=you" \
--table customer

# now tell hive where to find the informix data

# to get to the hive command prompt just type in hive

$ hive
Hive history file=/tmp/cloudera/yada_yada_log123.txt
hive>

# here is the hiveql you need to create the tables
# using a file is easier than typing

create external table customer (
    cn int,
    fname string,
    lname string,
    company string,
    addr1 string,
    addr2 string,
    city string,
    state string,
    zip string,
    phone string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/customer'
;

# we already imported the db2 staff table above

# now tell hive where to find the db2 data
create external table staff (
    id int,
    name string,
    dept string,
    job string,
    years string,
    salary float,
    comm float)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/staff'
;

# you can put the commands in a file
# and execute them as follows:

$ hive -f hivestaff
Hive history file=/tmp/cloudera/hive_job_log_cloudera_201208101502_2140728119.txt
OK
Time taken: 3.247 seconds
OK
10 Sanders 20 Mgr 7 98357.5 NULL
20 Pernal 20 Sales 8 78171.25 612.45
30 Marenghi 38 Mgr 5 77506.75 NULL
40 O'Brien 38 Sales 6 78006.0 846.55
50 Hanes 15 Mgr 10 80659.80 NULL
... lines deleted

# now for the join we've all been waiting for :-)

# this is a simple case, Hadoop scales well into the petabyte range!

$ hive
Hive history file=/tmp/cloudera/hive_job_log_cloudera_201208101548_497937669.txt
hive> select customer.cn, staff.name,
    > customer.addr1, customer.city, customer.phone
    > from customer join staff
    > on (staff.id = customer.cn);
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=number
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=number
In order to set a constant number of reducers:
  set mapred.reduce.tasks=number
Starting Job = job_201208101425_0005,
Tracking URL = http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201208101425_0005
Kill Command = /usr/lib/hadoop/bin/hadoop
job -Dmapred.job.tracker=0.0.0.0:8021 -kill job_201208101425_0005
2012-08-10 15:49:07,538 Stage-1 map = 0%, reduce = 0%
2012-08-10 15:49:11,569 Stage-1 map = 50%, reduce = 0%
2012-08-10 15:49:12,574 Stage-1 map = 100%, reduce = 0%
2012-08-10 15:49:19,686 Stage-1 map = 100%, reduce = 33%
2012-08-10 15:49:20,692 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201208101425_0005
OK
110 Ngan 520 Topaz Way Redwood City 415-743-3611
120 Naughton 6627 N. 17th Way Phoenix 602-265-8754
Time taken: 22.764 seconds
It is much nicer when you use Hue for a graphical browser interface, as shown in Figures 9, 10, and 11.
Figure 9. Hue Beeswax GUI for Hive in CDH4, viewing a HiveQL query
Figure 10. Hue Beeswax GUI for Hive, viewing a HiveQL query
Figure 11. Hue Beeswax GUI, viewing the result of the Informix-DB2 join

Using Pig: Joining Informix and DB2 data

Pig is a procedural language. Just like Hive, under the covers it generates MapReduce code. Hadoop ease-of-use will keep improving as more projects become available. As much as some of us really like the command line, there are several graphical user interfaces that work very well with Hadoop.
Listing 19 shows the Pig code that is used to join the customer table and the staff table from the previous example.
Listing 19. Pig example for joining the Informix table to the DB2 table
 $ pig
Grunt> staffdb2 = load 'staff' using PigStorage ( ',') 
>> As (id, name, dept, job, years, salary, comm); 
Grunt> custifx2 = load 'customer' using PigStorage ( ',') as  
>> (Cn, fname, lname, company, addr1, addr2, city, state, zip, phone)
>>;
Grunt> = join custifx2 joined by cn, staffdb2 by id;
                
# To make pig generate a result set using the dump command
# Not work up till now has Happened
                
Grunt> dump joined;
2012-08-11 21: 24: 51.848 [main] INFO org.apache.pig.tools.pigstats.ScriptState 
- Pig features used in the script: HASH_JOIN
2012-08-11 21: 24: 51.848 [main] INFO org.apache.pig.backend.hadoop.executionengine
.HExecutionEngine - Pig.usenewlogicalplan is Set to true. 
New logical plan will be used.
                
HadoopVersion    PigVersion    UserId    StartedAt            FinishedAt           Features
0.20.2-cdh3u4    0.8.1-cdh3u4  cloudera  2012-08-11 21:24:51  2012-08-11 21:25:19  HASH_JOIN
                
Success!
                
Job Stats (time in seconds):
JobId  Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime
MaxReduceTime  MinReduceTime  AvgReduceTime  Alias  Feature  Outputs
job_201208111415_0006  2  1  8  8  8  10  10  10
custifx,joined,staffdb2  HASH_JOIN  hdfs://0.0.0.0/tmp/temp1785920264/tmp-388629360,

Input(s):
Successfully read 35 records from: "hdfs://0.0.0.0/user/cloudera/staff"
Successfully read 28 records from: "hdfs://0.0.0.0/user/cloudera/customer"

Output(s):
Successfully stored 2 records (377 bytes) in:
"hdfs://0.0.0.0/tmp/temp1785920264/tmp-388629360"

Counters:
Total records written : 2
Total bytes written : 377
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
                
Job DAG:
job_201208111415_0006                
                
2012-08-11 21:25:19,145 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2012-08-11 21:25:19,149 [main] INFO org.apache.hadoop.mapreduce.lib.
input.FileInputFormat - Total input paths to process : 1
2012-08-11 21:25:19,149 [main] INFO org.apache.pig.backend.hadoop.
executionengine.util.MapRedUtil - Total input paths to process : 1
(110,Roy,Jaeger,AA Athletics,520 Topaz Way,null,Redwood City,CA,94062,415-743-3611,
110,Ngan,15,Clerk,5,42508.20,206.60)
(120,Fred,Jewell,Century Pro Shop,6627 N. 17th Way,null,Phoenix,AZ,85016,602-265-8754,
120,Naughton,38,Clerk,null,42954.75,180.00)
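
A dump only streams the join to the console. To keep the result in HDFS for later use, Pig's store operator writes it out; here is a minimal sketch, where the output path /user/cloudera/joinedout is made up for illustration.

grunt> store joined into '/user/cloudera/joinedout' using PigStorage(',');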

How do you choose Java, Hive, or Pig?

You have multiple choices for programming Hadoop, and it is best to look at the use case to choose the right tool for the job. You are not limited to working with relational data, but this article is focused on Informix, DB2, and Hadoop playing well together. Writing hundreds of lines of Java to implement a relational-style hash join is a complete waste of time since that Hadoop MapReduce algorithm is already available. How do you choose? It is largely a matter of personal preference. Some like set-oriented operations expressed in SQL. Others prefer procedural code. You should pick the language that makes you most productive. If you have multiple relational systems and want to combine all the data with great performance at a low price point, Hadoop, MapReduce, Hive, and Pig are ready to help.
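To make the trade-off concrete, here is the same Informix-DB2 join condensed from the listings above, once as a single declarative HiveQL statement and once as a procedural Pig script. Both compile down to MapReduce; which one reads better is mostly a matter of taste.

# HiveQL: declare the result you want
hive> select customer.cn, staff.name, customer.addr1, customer.city, customer.phone
    > from customer join staff on (staff.id = customer.cn);

# Pig Latin: build the result one step at a time
grunt> staffdb2 = load 'staff' using PigStorage(',')
>>     as (id, name, dept, job, years, salary, comm);
grunt> custifx2 = load 'customer' using PigStorage(',')
>>     as (cn, fname, lname, company, addr1, addr2, city, state, zip, phone);
grunt> joined = join custifx2 by cn, staffdb2 by id;
grunt> dump joined;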

Do not delete your data: Rolling an Informix partition into HDFS

Most modern relational databases can partition data. A common use case is to partition by time period. A fixed window of data is kept, for example a rolling 18-month interval, after which the oldest data is detached. The detach-partition capability is very powerful. But after the partition is detached, what do you do with the data?
Tape archiving of the old data is a very expensive way of discarding the old bytes. Once it is moved to a less accessible medium, the data is rarely accessed unless there is a legal audit requirement. Hadoop provides a far better alternative.
Moving the archive bytes from the old partition into Hadoop provides high-performance access at a much lower cost than keeping the data in the original transactional system or data mart/data warehouse. The data is too old to be of transactional value, but it is still very valuable to the organization for long-term analysis. The Sqoop examples shown previously provide the basics for moving this data from a relational partition into HDFS.
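As a sketch of what that archive step can look like, the following Sqoop invocation pulls the detached partition's rows into an HDFS directory where Hive, Pig, or MapReduce can still reach them. The table name orders_2010, the host name, and the target directory are made-up placeholders; the connection parameters follow the same pattern used in the Sqoop examples earlier in the article.

$ sqoop import \
    --driver com.informix.jdbc.IfxDriver \
    --connect jdbc:informix-sqli://myhost:54321/stores_demo:informixserver=ifx117 \
    --username informix --password useyours \
    --table orders_2010 \
    --target-dir /user/cloudera/archive/orders_2010 \
    --verbose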

Fuse - accessing your HDFS files via NFS

The Informix/DB2/flat-file data sitting in HDFS can be accessed via NFS, as shown in Listing 20. This provides command-line operations without the "hadoop fs -yadayada" style of interface. From a technology use-case perspective, NFS is severely limited in a Big Data environment, but the examples are included for developers and for not-so-big data.
Listing 20. Setting up Fuse - accessing your HDFS data via NFS
 # this is for CDH4, the CDH3 image does not have fuse installed...
$ mkdir fusemnt
$ sudo hadoop-fuse-dfs dfs://localhost:8020 fusemnt/
INFO fuse_options.c:162 Adding FUSE arg fusemnt/
$ ls fusemnt
tmp  user  var
$ ls fusemnt/user
cloudera  hive
$ ls fusemnt/user/cloudera
customer  DS.txt.gz  HF.out  HF.txt  orders  staff
$ cat fusemnt/user/cloudera/orders/part-m-00001
1007,2008-05-31,117,null,n,278693,2008-06-05,125.90,25.20,null
1008,2008-06-07,110,closed Monday,y,LZ230,2008-07-06,45.60,13.80,2008-07-21
1009,2008-06-14,111,next door to grocery,n,4745,2008-06-21,20.40,10.00,2008-08-21
1010,2008-06-17,115,deliver 776 King St. if no answer,n,429Q,2008-06-29,40.60,12.30,2008-08-22
1011,2008-06-18,104,express,n,B77897,2008-07-03,10.40,5.00,2008-08-29
1012,2008-06-18,117,null,n,278701,2008-06-29,70.80,14.20,null
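
Once the mount is in place, ordinary shell tools work on the HDFS files just as they would on local files, which is the whole point of the exercise. For example:

# count the order records without any hadoop-specific syntax
$ wc -l fusemnt/user/cloudera/orders/part-m-00001

# pull out the orders for customer 117
$ grep ',117,' fusemnt/user/cloudera/orders/part-m-00001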

Flume - create a load-ready file

Flume next generation, or flume-ng, is a high-speed parallel loader. Databases have high-speed loaders, so how do these play well together? The relational use case for flume-ng is creating a load-ready file, locally or remotely, so a relational server can use its high-speed loader. Yes, this functionality overlaps with Sqoop, but the script shown in Listing 21 was created at the request of a client specifically for this style of database load.
Listing 21. Exporting HDFS data to a flat file for loading into a database
 $ sudo yum install flume-ng

$ cat flumeconf/hdfs2dbloadfile.conf
 #
# started with the example from the flume-ng documentation
# modified to do hdfs source to file sink
 #

# define a memory channel called ch1 on agent1
 agent1.channels.ch1.type = memory

# define an exec source called exec-source1 on agent1 and tell it
# to bind to 0.0.0.0:31313. Connect it to channel ch1.
agent1.sources.exec-source1.channels = ch1
agent1.sources.exec-source1.type = exec
agent1.sources.exec-source1.command = hadoop fs -cat /user/cloudera/orders/part-m-00001
# this also works for all the files in an hdfs directory
# agent1.sources.exec-source1.command = hadoop fs
# -cat /user/cloudera/tsortin/*
agent1.sources.exec-source1.bind = 0.0.0.0
agent1.sources.exec-source1.port = 31313

# define a file roll sink
# and connect it to the other end of the same channel.
agent1.sinks.fileroll-sink1.channel = ch1
agent1.sinks.fileroll-sink1.type = FILE_ROLL
agent1.sinks.fileroll-sink1.sink.directory = /tmp

# finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
agent1.channels = ch1
agent1.sources = exec-source1
agent1.sinks = fileroll-sink1

# now time to run the script

$ flume-ng agent --conf ./flumeconf/ -f ./flumeconf/hdfs2dbloadfile.conf -n agent1
                
# here is the output file
# do not forget to stop flume - it will keep polling by default and generate
# more files

$ cat /tmp/1344780561160-1
1007,2008-05-31,117,null,n,278693,2008-06-05,125.90,25.20,null
1008,2008-06-07,110,closed Monday,y,LZ230,2008-07-06,45.60,13.80,2008-07-21
1009,2008-06-14,111,next door to,n,4745,2008-06-21,20.40,10.00,2008-08-21
1010,2008-06-17,115,deliver 776 King St. if no answer,n,429Q,2008-06-29,40.60,12.30,2008-08-22
1011,2008-06-18,104,express,n,B77897,2008-07-03,10.40,5.00,2008-08-29
1012,2008-06-18,117,null,n,278701,2008-06-29,70.80,14.20,null

# jump over to dbaccess and use the greatest
# data loader in informix: the external table
# external tables were developed for
# Informix XPS back in the 1996 timeframe
# and are now available in many servers
                
 #
drop table eorders;
create external table eorders
(on char(10),
mydate char(18),
foo char(18),
bar char(18),
f4 char(18),
f5 char(18),
f6 char(18),
f7 char(18),
f8 char(18),
f9 char(18)
)
using (datafiles ("disk:/tmp/myFoo") delimiter ",");
select * from eorders;
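
The eorders external table only points at the flat file that Flume produced; to land the rows in a permanent Informix table you would normally follow it with an insert-select. Here is a minimal sketch, assuming a regular table named orders_archive with the same column layout (the table name is made up for illustration):

create table orders_archive
(on char(10), mydate char(18), foo char(18), bar char(18), f4 char(18),
 f5 char(18), f6 char(18), f7 char(18), f8 char(18), f9 char(18));

insert into orders_archive select * from eorders;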

Oozie - adding work flow for multiple jobs

Oozie will chain together multiple Hadoop jobs. There is a nice set of examples included with Oozie that are used in the code sample shown in Listing 22.
Listing 22. Job control with Oozie
 # this sample is for CDH3

# untar the examples

# CDH4
$ tar -zxvf /usr/share/doc/oozie-3.1.3+154/oozie-examples.tar.gz

# CDH3
$ tar -zxvf /usr/share/doc/oozie-2.3.2+27.19/oozie-examples.tar.gz

# cd to the directory where the examples live
# you must put these examples into the hdfs store to run them

$ hadoop fs -put examples examples

# start up the oozie server - you need to be the oozie user
# since the oozie user is a non-login id use the following trick

# CDH4
$ sudo su - oozie -s /usr/lib/oozie/bin/oozie-sys.sh start

# CDH3
$ sudo su - oozie -s /usr/lib/oozie/bin/oozie-start.sh

# check the status
oozie admin -oozie http://localhost:11000/oozie -status
System mode: NORMAL

# some housekeeping so oozie can find the jars it needs

$ cp /usr/lib/sqoop/sqoop-1.3.0-cdh3u4.jar examples/apps/sqoop/lib/
$ cp /home/cloudera/Informix_JDBC_Driver/lib/ifxjdbc.jar examples/apps/sqoop/lib/
$ cp /home/cloudera/Informix_JDBC_Driver/lib/ifxjdbcx.jar examples/apps/sqoop/lib/

# edit the workflow.xml file to use your relational database:

#################################
<command> import
--driver com.informix.jdbc.IfxDriver
--connect jdbc:informix-sqli://192.168.1.143:54321/stores_demo:informixserver=ifx117
--table orders --username informix --password useyours
--target-dir /user/${wf:user()}/${examplesRoot}/output-data/sqoop --verbose </command>
#################################

# from the directory where you un-tarred the examples file do the following:

$ hrmr examples; hput examples examples
                
# now you can run your sqoop job by submitting it to oozie

$ oozie job -oozie http://localhost:11000/oozie -config \
    examples/apps/sqoop/job.properties -run

job: 0000000-120812115858174-oozie-oozi-W

# get the job status from the oozie server

$ oozie job -oozie http://localhost:11000/oozie -info 0000000-120812115858174-oozie-oozi-W
Job ID : 0000000-120812115858174-oozie-oozi-W
------------------------------------------------------------------------
Workflow Name : sqoop-wf
App Path      : hdfs://localhost:8020/user/cloudera/examples/apps/sqoop/workflow.xml
Status        : SUCCEEDED
Run           : 0
User          : cloudera
Group         : users
Created       : 2012-08-12 16:05
Started       : 2012-08-12 16:05
Last Modified : 2012-08-12 16:05
Ended         : 2012-08-12 16:05

Actions
------------------------------------------------------------------------
ID                                               Status  Ext ID                 Ext Status  Err Code
------------------------------------------------------------------------
0000000-120812115858174-oozie-oozi-W@sqoop-node  OK      job_201208120930_0005  SUCCEEDED   -
------------------------------------------------------------------------

# how to kill a job may come in useful at some point

oozie job -oozie http://localhost:11000/oozie -kill 0000013-120812115858174-oozie-oozi-W
                
# job output will be in the file tree
$ hcat /user/cloudera/examples/output-data/sqoop/part-m-00003
1018,2008-07-10,121,SW corner of Biltmore Mall,n,S22942,2008-07-13,70.50,20.00,2008-08-06
1019,2008-07-11,122,closed till noon Mondays,n,Z55709,2008-07-16,90.00,23.00,2008-08-06
1020,2008-07-11,123,express,n,W2286,2008-07-16,14.00,8.50,2008-09-20
1021,2008-07-23,124,ask for Elaine,n,C3288,2008-07-25,40.00,12.00,2008-08-22
1022,2008-07-24,126,express,n,W9925,2008-07-30,15.00,13.00,2008-09-02
1023,2008-07-24,127,no deliveries after 3pm,n,KF2961,2008-07-30,60.00,18.00,2008-08-22


# if you run into this error there is a good chance that your
# oozie database lock file is owned by root
$ oozie job -oozie http://localhost:11000/oozie -config \
examples/apps/sqoop/job.properties -run

Error: E0607 : E0607: Other error in operation [<openjpa-1.2.1-r752877:753278
fatal store error> org.apache.openjpa.persistence.RollbackException:
The transaction has been rolled back. See the nested exceptions for
details on the errors that occurred.], {1}

# fix this as follows
$ sudo chown oozie:oozie /var/lib/oozie/oozie-db/db.lck

# and restart the oozie server
$ sudo su - oozie -s /usr/lib/oozie/bin/oozie-stop.sh
$ sudo su - oozie -s /usr/lib/oozie/bin/oozie-start.sh
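
For orientation, the <command> element edited above lives inside an Oozie workflow action. The sketch below shows roughly how the bundled Sqoop example's workflow.xml is organized; element names follow the standard Oozie workflow and Sqoop-action schemas, but the exact schema versions and the full command text in your copy of the examples may differ.

<workflow-app xmlns="uri:oozie:workflow:0.2" name="sqoop-wf">
    <start to="sqoop-node"/>
    <action name="sqoop-node">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>import --driver com.informix.jdbc.IfxDriver ... --verbose</command>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Sqoop failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>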

HBase, a high-performance key-value store

HBase is a high-performance key-value store. If your use case requires scalability and only needs the database equivalent of auto-commit transactions, HBase may well be the technology to use. HBase is not a database. The name is unfortunate since, for some, the term base implies database. It does, however, do an excellent job of high-performance key-value storage. There is some overlap between the functionality of HBase and that of Informix, DB2, and other relational databases. For ACID transactions, full SQL compliance, and multiple indexes, a traditional relational database is the obvious choice.
This last code exercise is intended to provide basic familiarity with HBase. It is simple by design and in no way represents the scope of HBase's functionality. Please use this example to understand some of the basic capabilities of HBase. "HBase: The Definitive Guide", by Lars George, is required reading if you plan to implement or reject HBase for your particular use case.
This last example, shown in Listings 23 and 24, uses the REST interface provided with HBase to insert key-value pairs into an HBase table. The test harness is based on curl.
Listing 23. Create an HBase table and insert a row
 # enter the hbase command-line shell

$ hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.90.6-cdh3u4, r, Mon May  7 13:14:00 PDT 2012

# create a table with a single column family

hbase(main):001:0> create 'mytable', 'mycolfamily'

# if you get errors from hbase you need to fix the
# network config

# here is a sample of the error:

ERROR: org.apache.hadoop.hbase.ZooKeeperConnectionException: HBase
is able to connect to ZooKeeper but the connection closes immediately.
This could be a sign that the server has too many connections
(30 is the default). Consider inspecting your ZK server logs for
that error and then make sure you are reusing HBaseConfiguration
as often as you can. See HTable's javadoc for more information.

# fix networking:

# add the eth0 interface to /etc/hosts with a hostname

$ sudo su -
# ifconfig | grep addr
eth0  Link encap:Ethernet  HWaddr 00:0C:29:8C:C7:70
      inet addr:192.168.1.134  Bcast:192.168.1.255  Mask:255.255.255.0
      Interrupt:177 Base address:0x1400
      inet addr:127.0.0.1  Mask:255.0.0.0
[root@myhost ~]# hostname myhost
[root@myhost ~]# echo "192.168.1.134 myhost" >> /etc/hosts
[root@myhost ~]# cd /etc/init.d
                
# now that the host and address are defined restart Hadoop

[root@myhost init.d]# for i in hadoop*
> do
> ./$i restart
> done

# now try creating the table again:

$ hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.90.6-cdh3u4, r, Mon May  7 13:14:00 PDT 2012

hbase(main):001:0> create 'mytable', 'mycolfamily'
0 row(s) in 1.0920 seconds

hbase(main):002:0>

# insert a row into the table you created
# use some simple telephone call log data
# notice that mycolfamily can have multiple cells
# this is very troubling for DBAs at first, but
# you do get used to it

hbase(main):001:0> put 'mytable', 'key123', 'mycolfamily:number', '6175551212'
0 row(s) in 0.5180 seconds
hbase(main):002:0> put 'mytable', 'key123', 'mycolfamily:duration', '25'

# now do a describe and then a scan of the table

hbase(main):005:0> describe 'mytable'
DESCRIPTION                                          ENABLED
{NAME => 'mytable', FAMILIES => [{NAME => 'mycolfam  true
ily', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '
0', COMPRESSION => 'NONE', VERSIONS => '3', TTL =>
'2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'f
alse', BLOCKCACHE => 'true'}]}
1 row(s) in 0.2250 seconds

# notice that timestamps are included

hbase(main):007:0> scan 'mytable'
ROW     COLUMN+CELL
key123  column=mycolfamily:duration,
        timestamp=1346868499125, value=25
key123  column=mycolfamily:number,
        timestamp=1346868540850, value=6175551212
1 row(s) in 0.0250 seconds
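If you only want a single row back rather than a full table scan, the HBase shell's get command fetches it by key; given the puts above you would see something like the following:

hbase(main):008:0> get 'mytable', 'key123'
COLUMN                  CELL
 mycolfamily:duration   timestamp=1346868499125, value=25
 mycolfamily:number     timestamp=1346868540850, value=6175551212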
Listing 24. Using the HBase REST interface
 # HBase includes a REST server

$ hbase rest start -p 9393 &

# you get a bunch of messages....

# get the status of the HBase server

$ curl http://localhost:9393/status/cluster

# lots of output...
# many lines deleted...

mytable,,1346866763530.a00f443084f21c0eea4a075bbfdfc292.
stores=1
storefiles=0
storefileSizeMB=0
memstoreSizeMB=0
storefileIndexSizeMB=0

# now scan the contents of mytable

$ curl http://localhost:9393/mytable/*

# lines deleted
12/09/05 15:08:49 DEBUG client.HTable$ClientScanner:
Finished with scanning at REGION =>
# lines deleted
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<CellSet><Row key="a2V5MTIz">
<Cell timestamp="1346868499125" column="bXljb2xmYW1pbHk6ZHVyYXRpb24=">MjU=</Cell>
<Cell timestamp="1346868540850" column="bXljb2xmYW1pbHk6bnVtYmVy">NjE3NTU1MTIxMg==</Cell>
<Cell timestamp="1346868425844" column="bXljb2xmYW1pbHk6bnVtYmVy">NjE3NTU1MTIxMg==</Cell>
</Row></CellSet>
                
# the values from the REST interface are base64 encoded
$ echo a2V5MTIz | base64 -d
key123
$ echo bXljb2xmYW1pbHk6bnVtYmVy | base64 -d
mycolfamily:number

# the table scan above gives the schema needed to insert into the HBase table

$ echo RESTinsertedKey | base64
UkVTVGluc2VydGVkS2V5Cg==

$ echo 7815551212 | base64
NzgxNTU1MTIxMgo=

# add a table entry with a key value of "RESTinsertedKey" and
# a phone number of "7815551212"

# note - the curl command is all on one line
$ curl -H "Content-Type: text/xml" -d '<CellSet>
<Row key="UkVTVGluc2VydGVkS2V5Cg==">
<Cell column="bXljb2xmYW1pbHk6bnVtYmVy">NzgxNTU1MTIxMgo=</Cell>
</Row></CellSet>' http://192.168.1.134:9393/mytable/dummykey

12/09/05 15:52:34 DEBUG rest.RowResource: POST http://192.168.1.134:9393/mytable/dummykey
12/09/05 15:52:34 DEBUG rest.RowResource: PUT row=RESTinsertedKey\x0A,
families={(family=mycolfamily,
keyValues=(RESTinsertedKey\x0A/mycolfamily:number/9223372036854775807/Put/vlen=11)}
                
# trust, but verify

hbase(main):002:0> scan 'mytable'
ROW                   COLUMN+CELL
RESTinsertedKey\x0A   column=mycolfamily:number, timestamp=1346874754883, value=7815551212\x0A
key123                column=mycolfamily:duration, timestamp=1346868499125, value=25
key123                column=mycolfamily:number, timestamp=1346868540850, value=6175551212
2 row(s) in 0.5610 seconds

# notice the \x0A at the end of the key and value
# this is the newline generated by the "echo" command
# let's fix that

$ printf 8885551212 | base64
ODg4NTU1MTIxMg==

$ printf mykey | base64
bXlrZXk=

# note - the curl statement is all on one line!
curl -H "Content-Type: text/xml" -d '<CellSet><Row key="bXlrZXk=">
<Cell column="bXljb2xmYW1pbHk6bnVtYmVy">ODg4NTU1MTIxMg==</Cell>
</Row></CellSet>' http://192.168.1.134:9393/mytable/dummykey

# trust but verify
hbase(main):001:0> scan 'mytable'
ROW                   COLUMN+CELL
RESTinsertedKey\x0A   column=mycolfamily:number, timestamp=1346875811168, value=7815551212\x0A
key123                column=mycolfamily:duration, timestamp=1346868499125, value=25
key123                column=mycolfamily:number, timestamp=1346868540850, value=6175551212
mykey                 column=mycolfamily:number, timestamp=1346877875638, value=8885551212
3 row(s) in 0.6100 seconds
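When you are finished experimenting, the test table can be cleaned up from the HBase shell; a table must be disabled before it can be dropped:

hbase(main):003:0> disable 'mytable'
hbase(main):004:0> drop 'mytable'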

Conclusion

Wow, you made it all the way to the end, congratulations! This is just the beginning of understanding Hadoop and how it interacts with Informix and DB2. Here are some suggestions for your next steps.

Take the examples shown previously and adapt them to your servers. You will want to use small data sets because there is not much space in the virtual image.

Get certified as a Hadoop Administrator. Visit the Cloudera site for course and testing information.

Get certified as a Hadoop Developer.

Start up a cluster using the free edition of Cloudera Manager.

Get started with IBM Big Sheets running on top of CDH4.
