Database Access with Apache Hadoop

The DBInputFormat and DBOutputFormat classes introduced in Hadoop 0.19 allow easy import and export of data between Hadoop and many relational databases, making it easier to incorporate relational data into your data processing pipeline.

To import and export data between Hadoop and MySQL, you need working Hadoop and MySQL installations on your machine.

  • My System Configuration

UBUNTU 13.04

JAVA 1.7.0_25

HADOOP 1.1.2

MySQL

Download the mysql-connector-java-5.0.5.jar file, copy it to $HADOOP_HOME/lib, and restart the Hadoop daemons.
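For example, assuming the jar has been downloaded to the current directory and you use the start/stop scripts that ship with Hadoop 1.x:

$ cp mysql-connector-java-5.0.5.jar $HADOOP_HOME/lib
$ $HADOOP_HOME/bin/stop-all.sh
$ $HADOOP_HOME/bin/start-all.sh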

  • Database and table creation in MySQL
mysql> create database testDb;

mysql> use testDb;

mysql> create table studentinfo (  id integer ,  name varchar(32) );

mysql> insert into studentinfo values(1,'archana');

mysql> insert into studentinfo values(2,'XYZ');

mysql> insert into studentinfo values(3,'archana');
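You can verify the inserted rows before running the job:

mysql> select * from studentinfo;

This should list the three rows inserted above.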

  • Project Structure

The project contains the following Java files:

Main.java
Map.java
Reduce.java
DBInputWritable.java
DBOutputWritable.java

To access data from the DB, we have to create classes that define the records we are going to fetch from and write back to the DB. In my project I created the classes DBInputWritable.java and DBOutputWritable.java to accomplish this.

DBInputWritable.java

package example;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.ResultSet;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class DBInputWritable implements Writable, DBWritable
{
   private int id;
   private String name;

   //no-op: DBInputFormat populates fields via readFields(ResultSet) below
   public void readFields(DataInput in) throws IOException {   }

   //ResultSet object represents the data returned from a SQL statement
   public void readFields(ResultSet rs) throws SQLException
   {
     id = rs.getInt(1);
     name = rs.getString(2);
   }

   //no-op: not needed when the class is used only as DBInputFormat input
   public void write(DataOutput out) throws IOException {  }

   public void write(PreparedStatement ps) throws SQLException
   {
     ps.setInt(1, id);
     ps.setString(2, name);
   }

   public int getId()
   {
     return id;
   }

   public String getName()
   {
     return name;
   }
}

This class “DBInputWritable” will be used in our Map class. Now let’s write our Mapper class.

Map.java

package example;

import java.io.IOException;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.IntWritable;

public class Map extends Mapper<LongWritable, DBInputWritable, Text, IntWritable>
{
   private IntWritable one = new IntWritable(1);

   //emit (word, 1) for each whitespace-separated token in the name column
   protected void map(LongWritable id, DBInputWritable value, Context ctx)
   {
     try
     {
        String[] keys = value.getName().split(" ");

        for(String key : keys)
        {
           ctx.write(new Text(key),one);
        }
     } catch(IOException e)
     {
        e.printStackTrace();
     } catch(InterruptedException e)
     {
        e.printStackTrace();
     }
   }
}

DBOutputWritable.java

package example;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.ResultSet;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class DBOutputWritable implements Writable, DBWritable
{
   private String name;
   private int count;

   public DBOutputWritable(String name, int count)
   {
     this.name = name;
     this.count = count;
   }

   //no-op: not needed when the class is used only as DBOutputFormat output
   public void readFields(DataInput in) throws IOException {   }

   public void readFields(ResultSet rs) throws SQLException
   {
     name = rs.getString(1);
     count = rs.getInt(2);
   }

   //no-op: DBOutputFormat writes records via write(PreparedStatement) below
   public void write(DataOutput out) throws IOException {    }

   public void write(PreparedStatement ps) throws SQLException
   {
     ps.setString(1, name);
     ps.setInt(2, count);
   }
}

This class “DBOutputWritable” will be used in our Reduce class. Now let’s write our Reducer class.

Reduce.java

package example;

import java.io.IOException;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;

public class Reduce extends Reducer<Text, IntWritable, DBOutputWritable, NullWritable>
{
   //sum the counts for each word and write the total back to the database
   protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
   {
     int sum = 0;

     for(IntWritable value : values)
     {
       sum += value.get();
     }

     try
     {
     ctx.write(new DBOutputWritable(key.toString(), sum), NullWritable.get());
     } catch(IOException e)
     {
       e.printStackTrace();
     } catch(InterruptedException e)
     {
       e.printStackTrace();
     }
   }
}
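For the sample data above, the mappers emit (archana, 1), (XYZ, 1) and (archana, 1); the reducer then sums these per word to (archana, 2) and (XYZ, 1), which are the totals written back to the database.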

Main.java

package example;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;

public class Main
{
   public static void main(String[] args) throws Exception
   {
     Configuration conf = new Configuration();
     DBConfiguration.configureDB(conf,
     "com.mysql.jdbc.Driver",   // driver class
     "jdbc:mysql://localhost:3306/testDb", // db url
     "root",    // user name
     "hadoop123"); //password

     Job job = new Job(conf);
     job.setJarByClass(Main.class);
     job.setMapperClass(Map.class);
     job.setReducerClass(Reduce.class);
     job.setMapOutputKeyClass(Text.class);
     job.setMapOutputValueClass(IntWritable.class);
     job.setOutputKeyClass(DBOutputWritable.class);
     job.setOutputValueClass(NullWritable.class);
     job.setInputFormatClass(DBInputFormat.class);
     job.setOutputFormatClass(DBOutputFormat.class);

     DBInputFormat.setInput(
     job,
     DBInputWritable.class,
     "studentinfo",   //input table name
     null,
     null,
     new String[] { "id", "name" }  // table columns
     );

     DBOutputFormat.setOutput(
     job,
     "output",    // output table name
     new String[] { "name", "count" }   //table columns
     );

     System.exit(job.waitForCompletion(true) ? 0 : 1);
   }
}
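Note that DBOutputFormat writes into an existing table, so create the output table in testDb before running the job (the column types below are an assumption; adjust them to your data):

mysql> create table output ( name varchar(32), count integer );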

Now you are ready to run the program. Package the code into a jar file and run it.
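A minimal sketch of compiling and packaging, assuming the source files live in an example/ directory and Hadoop's core jar provides the compile-time classpath:

$ mkdir classes
$ javac -classpath $HADOOP_HOME/hadoop-core-1.1.2.jar -d classes example/*.java
$ jar -cvf DbIpOp.jar -C classes .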

Execute jar file

$ hadoop jar /home/hduser/DbIpOp.jar

Result

mysql> select * from output;

+---------+-------+
| name    | count |
+---------+-------+
| archana |     2 |
| XYZ     |     1 |
+---------+-------+
2 rows in set (0.19 sec)

Reference

http://blog.cloudera.com/blog/2009/03/database-access-with-hadoop/


Sqoop: Exporting Data from HDFS to MySQL

Step 1: Install and start MySQL if you have not already done so

See the MySQL Installation Tutorial for instructions on how to install MySQL.

Step 2: Configure the MySQL Service and Connector

Download the mysql-connector-java-5.0.5.jar file and copy it to the $SQOOP_HOME/lib directory.

Step 3: Sqoop Installation

See the Sqoop Installation Tutorial for instructions on how to install Sqoop.

  • Database and table creation in MySQL

First connect to MySQL

$ mysql -u root -p

Enter password:

Create the database ‘testDb’ and select it as the current database.

mysql> create database testDb;

mysql> use testDb;

Create table ‘stud1’

mysql> create table stud1(id integer,name char(20)); 

mysql> exit; 
  • HDFS File ‘student’
$ hadoop dfs -cat /user/hduser/student

1,Archana 

2,XYZ 
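If the file is not already in HDFS, you can copy it from the local filesystem first (assuming a local file named student in the current directory):

$ hadoop dfs -put student /user/hduser/student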

Sqoop Export

$ sqoop export --connect jdbc:mysql://localhost/testDb --table stud1 -m 1 --export-dir /user/hduser/student

This example takes the files in /user/hduser/student and inserts their contents into the “stud1” table in the “testDb” database. The target table must already exist in the database.

Note:

If you get the following

Error

com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Access denied for user ''@'localhost' to database 'testDb' 

Solution

Grant all privileges on the testDb database to the anonymous user:

mysql> grant all privileges on testDb.* to ''@localhost ;

Table Contents in MySQL

mysql> use testDb; 

mysql> select * from stud1; 

+------+----------+ 
| id   | name     | 
+------+----------+ 
| 1    | Archana  | 
| 2    | XYZ      | 
+------+----------+ 
2 rows in set (0.00 sec) 

Reference:

http://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html

Note: The information provided here is to the best of my knowledge and experience; if any corrections are needed, please help me with your valuable suggestions, which are always welcome…. 🙂

Sqoop: Importing Data from MySQL into HDFS

Step 1: Install and start MySQL if you have not already done so

See the MySQL Installation Tutorial for instructions on how to install MySQL.

Step 2: Configure the MySQL Service and Connector

Download the mysql-connector-java-5.0.5.jar file and copy it to the $SQOOP_HOME/lib directory.

Step 3: Sqoop Installation

See the Sqoop Installation Tutorial for instructions on how to install Sqoop.

  • Database and table creation in MySQL

First connect to MySQL

$ mysql -u root -p

Enter password:

Create the database ‘testDb’ and select it as the current database.

mysql> create database testDb;

mysql> use testDb;

Create table ‘student’

mysql> create table student(id integer,name char(20));

Add the following 2 records to the table:

mysql> insert into student values(1,'Archana');

mysql> insert into student values(2,'XYZ');

 Exit from MySQL

mysql> exit;

Sqoop import
1. Importing a table into HDFS

  • 1st way

Command for import

sqoop import --connect jdbc:mysql://localhost/databasename --username $USER_NAME --password $PASSWORD --table tablename -m 1

Execute the sqoop import

Here we are using database ‘testDb’, username ‘root’, password ‘hadoop123’, and table ‘student’.

$ sqoop import --connect jdbc:mysql://localhost/testDb --username root --password hadoop123 --table student -m 1

------------------- NOTE -------------------

If you have not defined a primary key for your table, you have to give the ‘-m 1’ option for import.
Otherwise it gives this error:
ERROR tool.ImportTool: Error during import: No primary key could be found for table student1. Please specify one with --split-by or perform a sequential import with '-m 1'.
  • 2nd Way

Create a config file $HOME/import.txt and add the following to it:

              import.txt

import
--connect
jdbc:mysql://localhost/testDb
--username
root
--password
hadoop123

Execute the sqoop import

sqoop --options-file /home/hduser/import.txt --table student -m 1

Once the import is done, you can find student.jar, student.class and student.java at the following location: /tmp/sqoop-hduser/compile/—-/student.jar

Files created in HDFS

$ hadoop dfs -lsr student

Found 3 items

-rw-r--r--   1 hduser supergroup          0 2013-09-13 15:38 /user/hduser/student/_SUCCESS

drwxr-xr-x   - hduser supergroup          0 2013-09-13 15:38 /user/hduser/student/_logs

-rw-r--r--   1 hduser supergroup         16 2013-09-13 15:38 /user/hduser/student/part-m-00000

Data file contents

$ hadoop dfs -cat /user/hduser/student/part-m-00000

1,Archana
2,XYZ

2. Import all rows of a table in MySQL, but only specific columns of the table

$ sqoop import --connect jdbc:mysql://localhost/testDb --username root --password hadoop123 --table student --columns "name" -m 1

Data file contents

$ hadoop dfs -cat  /user/hduser/student/part-m-00000

Archana
XYZ

3. Import all columns, filter rows using a where clause

$ sqoop import --connect jdbc:mysql://localhost/testDb --username root --password hadoop123 --table student --where "id>1" -m 1 --target-dir /user/hduser/ar

Data file contents

$ hadoop dfs -cat  /user/hduser/ar/part-m-00000
2,XYZ

------------------- NOTE -------------------

If you are facing a FileExistsException, then use “--target-dir /user/hduser/give_diff_name”
Error:
org.apache.commons.io.FileExistsException: Destination '/home/hduser/./student.java' already exists

Reference

http://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html

Note: The information provided here is to the best of my knowledge and experience; if any corrections are needed, please help me with your valuable suggestions, which are always welcome…. :)

Changing the Default Metastore [Derby] of Hive to MySQL

Hive comes configured to use Derby as its metastore, but Derby can support only one active user at a time and is not certified for production use. The solution for supporting multiple users is to use a standalone database; MySQL is a popular choice for the standalone metastore.

  • Step 1: Install and start MySQL if you have not already done so

See the MySQL Installation Tutorial for instructions on how to install MySQL.

  • Step 2: Configure the MySQL Service and Connector

Download the mysql-connector-java-5.0.5.jar file and copy it to the $HIVE_HOME/lib directory.

  • Step 3: Create the Database and User

Create a metastore_db database in MySQL using the root user:

$ mysql -u root -p

Enter password:

mysql> CREATE DATABASE metastore_db;

Create a user [hiveuser] in MySQL using the root user:

mysql> CREATE USER 'hiveuser'@'%' IDENTIFIED BY 'hivepassword';

mysql> GRANT all on *.* to 'hiveuser'@localhost identified by 'hivepassword';

[Here the user hiveuser is the same as the ConnectionUserName in the hive-site.xml file.]

mysql>  flush privileges;
  • Step 4: Install Hive if you have not already done so

Hive Installation Tutorial for instructions of how to install Hive.

  • Step 5: Configure the Metastore Service to Communicate with the MySQL Database

Create a hive-site.xml file in the $HIVE_HOME/conf directory and add the following configuration:

            <configuration>
              <property>
                <name>javax.jdo.option.ConnectionURL</name>
                <value>jdbc:mysql://localhost/metastore_db?createDatabaseIfNotExist=true</value>
                <description>metadata is stored in a MySQL server</description>
              </property>
              <property>
                <name>javax.jdo.option.ConnectionDriverName</name>
                <value>com.mysql.jdbc.Driver</value>
                <description>MySQL JDBC driver class</description>
              </property>
              <property>
                <name>javax.jdo.option.ConnectionUserName</name>
                <value>hiveuser</value>
                <description>user name for connecting to mysql server</description>
              </property>
              <property>
                <name>javax.jdo.option.ConnectionPassword</name>
                <value>hivepassword</value>
                <description>password for connecting to mysql server</description>
              </property>
            </configuration>
  • That’s all. Now Hive will create the schema in the backend database. Let us check:

Hive console:

hive> create table hivetesting(id string);

MySql console:

There are 2 ways to access metastore_db

1.      mysql -u root -p
        Enter password:
        mysql> use metastore_db;
        mysql> show tables ;

2.     mysql -u hiveuser -p metastore_db
       Enter password:
       mysql> show tables;


You can query the metastore schema in your MySQL database. Something like:

mysql> select * from TBLS;

On your MySQL database you will see the names of your Hive tables.

Add a new table and verify that the above query returns updated results.
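For example (the new table name here is just an illustration):

hive> create table hivetesting2(name string);

mysql> select TBL_NAME from TBLS;

The new table should now appear in the query result.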

  • Reference:

http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Installation-Guide/cdh4ig_topic_18_4.html

https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin#AdminManualMetastoreAdmin-Introduction

How to install MySQL on Ubuntu

The MySQL installation below was done on the following version of Ubuntu.

UBUNTU 13.04

  • First of all, make sure your package management tools are up-to-date. Also make sure you install all the latest software available.

sudo apt-get update

sudo apt-get dist-upgrade

  • Install the MySQL server and client packages:

sudo apt-get install mysql-server mysql-client

The apt-get command will also install the mysql-client package, which is necessary to log in to MySQL from the server itself.

During the installation, MySQL will ask you to set a root password.


  • You can now access your MySQL server like this:

mysql -u root -p

 mysql>

  •  Have fun using MySQL Server 🙂
  • What are mysql-server and mysql-client?

The mysql-server package installs the MySQL database server, which you can interact with using a MySQL client. You can use the mysql client to send commands to any MySQL server, on a remote computer or your own.

The MySQL server is used to persist the data and provide a query interface for it (SQL). The MySQL client's purpose is to allow you to use that query interface.
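For example, to connect with the client to a MySQL server on another machine (the hostname here is a placeholder):

mysql -h db.example.com -u root -p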