Yann's Blog - Technology at large

July 19, 2008

Using Facebook’s Thrift with Python and HBase

Filed under: Python, Software | Yann @ 1:25 am

Today I’m going to show you how to interface Python with Apache HBase using Facebook’s Thrift package. HBase is a column-oriented database that is very similar to Google’s BigTable (in fact, it is more or less a clone of BigTable as described in the BigTable paper). HBase has two primary interfaces: a REST API, which is relatively slow, and a Thrift interface, which is recommended for high-speed communication. For speed and other reasons, we’re going to use the Thrift API.

Note that I am going to touch on some HBase jargon (such as column families). It’s not essential to understand those terms if you are just trying to build a Python Thrift client. But if you’re trying to use HBase, I would consider that knowledge essential.

Getting Setup

First things first: you need to grab a copy of both HBase and Thrift. For this tutorial, I am using the Subversion copy of HBase (as of July 18th) and Thrift version 20080411p1. Thrift is shipped as a source package, so you will need a compiler toolchain, as well as any Python development packages or header files your system may require (such as python-dev on Debian/Ubuntu). You’ll also need a Java JDK package (such as sun-java6-jdk on Ubuntu).

Thrift can be compiled using the standard routine:

./configure
make -j4
sudo make install

After installing Thrift, you should have a system-wide ‘thrift’ command available; running it should print some usage information. Thrift describes the communication layer in a descriptor file with a .thrift extension. I’m not going to describe how to create such a descriptor file here (perhaps in a later blog post), as we’ll be using the one provided by HBase (with one small tweak). You will need the HBase source package for this exercise.

Build a Thrift Client Package

Open up [hbasesrc]/src/java/org/apache/hadoop/hbase/thrift/Hbase.thrift in your favorite text editor. Search for lines containing ruby_namespace, and add the following line in the same region:

namespace py hbase

(Alert readers will wonder why we didn’t use py_namespace. The reason is simple: the xxx_namespace Thrift commands are deprecated and have been replaced with namespace xxx.)

Next up, we’ll generate our Python HBase Thrift interface. Open a shell in the same directory and run:
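If you would rather script the edit than make it by hand, it could be sketched roughly as below. The function name add_py_namespace is made up for this illustration, and the sample text is an invented fragment rather than the real Hbase.thrift; the only thing taken from the tutorial is the namespace py hbase line itself.

```python
def add_py_namespace(idl_text, module="hbase"):
    """Return idl_text with a 'namespace py <module>' line added.

    The new line goes right after the last existing namespace declaration
    (old 'xxx_namespace' style or new 'namespace xxx' style). If there is
    none, it is placed at the top. Text that already declares a py
    namespace is returned unchanged.
    """
    new_line = "namespace py " + module
    lines = idl_text.splitlines()
    last_ns = -1
    for i, line in enumerate(lines):
        words = line.split()
        if not words:
            continue
        if words[:2] == ["namespace", "py"]:
            return idl_text  # nothing to do
        if words[0] == "namespace" or words[0].endswith("_namespace"):
            last_ns = i
    lines.insert(last_ns + 1, new_line)
    return "\n".join(lines) + "\n"


# Invented fragment, only here to show the effect of the helper.
sample = "ruby_namespace Example.Hbase\n\ntypedef binary Text\n"
patched = add_py_namespace(sample)
print(patched)
```

Calling the helper a second time on its own output leaves the text untouched, so it should be safe to rerun after updating your HBase checkout.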

thrift --gen py Hbase.thrift

Now we have generated a set of Python classes in the gen-py folder, which will allow you to talk to the HBase Thrift server. Let’s set up our Python Thrift client now. Grab the hbase folder inside the gen-py folder and move it to a project directory of your choosing.

Building a Client

Next up, we’ll need to work on the Python Thrift client application. I suggest starting with the Thrift server tutorial for a boilerplate template. Below is the file we’re going to use (let’s just assume it is called client.py for this discussion):

#!/usr/bin/env python
import sys
 
from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol
 
from hbase import Hbase
from hbase.ttypes import *

This is general Thrift boilerplate. The application-specific portions so far are the last two lines. Hbase is the name of the service as defined in the Hbase.thrift file.

Next up, we’re going to try to connect to our HBase instance. To do that, we will first create a TSocket, then add a TBufferedTransport over the raw socket, and then wrap that in a TBinaryProtocol. If someone has studied too much Java, it was the Thrift developers ;).

# Make socket
transport = TSocket.TSocket('localhost', 9090)
 
# Buffering is critical. Raw sockets are very slow
transport = TTransport.TBufferedTransport(transport)
 
# Wrap in a protocol
protocol = TBinaryProtocol.TBinaryProtocol(transport)
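To see why the comment above calls buffering critical, here is a toy model of the layering. It is not Thrift's implementation: CountingSink and BufferedSink are made-up stand-ins for a raw socket and a buffered transport, and the point is only to count how many writes reach the bottom layer.

```python
class CountingSink(object):
    """Stand-in for a raw socket: counts how often write() is called."""

    def __init__(self):
        self.calls = 0
        self.data = b""

    def write(self, chunk):
        self.calls += 1
        self.data += chunk


class BufferedSink(object):
    """Collects writes in memory and passes them on in one piece on flush()."""

    def __init__(self, inner):
        self.inner = inner
        self.pending = []

    def write(self, chunk):
        self.pending.append(chunk)

    def flush(self):
        if self.pending:
            self.inner.write(b"".join(self.pending))
            self.pending = []


def send_fields(sink, fields):
    """Write each field separately, the way a protocol layer emits small pieces."""
    for field in fields:
        sink.write(field)


fields = [b"createTable", b"our_table", b"foo:"]

raw = CountingSink()
send_fields(raw, fields)        # every field hits the "socket" on its own

backing = CountingSink()
buffered = BufferedSink(backing)
send_fields(buffered, fields)
buffered.flush()                # all fields reach the "socket" together

print(raw.calls, backing.calls)
```

Both sinks end up with the same bytes, but the unbuffered one is written to once per field. On a real socket each of those writes can turn into a separate system call, which is where the slowdown comes from.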

Now for two application-specific lines: we build an Hbase.Client() object and then, finally, open our transport.

client = Hbase.Client(protocol)
 
transport.open()
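The script never closes the transport again. One possible way to pair open() with close() is a small context manager like the sketch below. The name opened is made up; the only assumption about the transport is that it offers open() and close(), as the Thrift transports used above do.

```python
from contextlib import contextmanager


@contextmanager
def opened(transport):
    """Open a transport, hand it to the with-block, and always close it."""
    transport.open()
    try:
        yield transport
    finally:
        # Runs on normal exit and when the block raises
        transport.close()


# Intended use in client.py, in place of the bare transport.open():
#
#     with opened(transport):
#         ... calls on client go here ...
```

The connection is then released even if one of the calls inside the block raises an exception.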

We can do a quick validation pass now and start up HBase (if you have a running HBase server somewhere, you can of course skip this step). If you have a source checkout of HBase, compiling is as simple as running the ant tool. Assuming you have the JDK installed, HBase should be ready for action in under a minute. Start up a master HBase instance by running bin/hbase master start &. Then start up a Thrift server for HBase by running bin/hbase thrift start.

Running our client script now should produce no errors. If it does produce errors, stop and try to figure out what is wrong (did you move the gen-py/hbase directory to where your client.py script is, or set the Python path appropriately?).
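The Thrift server may need a moment before it accepts connections. If you start everything from a script, a small polling helper along these lines can wait for it. The name wait_for_port and its timeout defaults are arbitrary choices for this sketch; 9090 is the port used in the client code above.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once host:port accepts a TCP connection, else False."""
    deadline = time.time() + timeout
    while True:
        try:
            conn = socket.create_connection((host, port), timeout=interval)
        except socket.error:
            # Refused or timed out: give up after the deadline, else retry
            if time.time() >= deadline:
                return False
            time.sleep(interval)
        else:
            conn.close()
            return True


# Intended use before opening the transport:
#
#     if not wait_for_port('localhost', 9090):
#         sys.exit('Thrift server is not reachable')
```

This only proves that something is listening on the port, not that HBase itself is healthy, so treat it as a convenience rather than a health check.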
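If you prefer to leave the generated code where the compiler put it, setting the Python path from inside the script is one option. A minimal sketch, assuming the gen-py folder sits inside your project directory (add_gen_py is a made-up helper name):

```python
import os
import sys


def add_gen_py(project_dir):
    """Put <project_dir>/gen-py on sys.path so 'import hbase' can find it.

    Returns the directory that was added.
    """
    gen_py = os.path.join(os.path.abspath(project_dir), "gen-py")
    if gen_py not in sys.path:
        sys.path.insert(0, gen_py)
    return gen_py


# Call this before the 'from hbase import Hbase' line in client.py:
#
#     add_gen_py(os.path.dirname(os.path.abspath(__file__)))
```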

Using the Client

Let’s call our first method: getTableNames(). Add this to the end of our script:

print(client.getTableNames())

By default, it will simply print a blank list ([]), unless of course you have created tables. This is the simplest example of using Thrift with HBase and Python, where no special data structures are needed or passed around. But if we look at the HBase Thrift API (not up to date – for full details look at the Hbase.thrift file), we can see some methods will require parameters in the form of Thrift structs.

Let’s try to create a table in HBase. When we consult Hbase.thrift, we can see that createTable requires a list of ColumnDescriptors.

  /**
   * Create a table with the specified column families.  The name
   * field for each ColumnDescriptor must be set and must end in a
   * colon (:).  All other fields are optional and will get default
   * values if not explicitly specified.
   *
   * @param tableName name of table to create
   * @param columnFamilies list of column family descriptors
   *
   * @throws IllegalArgument if an input parameter is invalid
   * @throws AlreadyExists if the table name already exists
   */
  void createTable(1:Text tableName, 2:list<ColumnDescriptor> columnFamilies)
    throws (1:IOError io, 2:IllegalArgument ia, 3:AlreadyExists exist)

Luckily, the Thrift compiler has generated a Python class for this ColumnDescriptor (which we acquired by importing hbase.ttypes.*). Sadly, this isn’t the most Pythonic of classes, but it will be quite serviceable for our needs. Let’s build a ColumnDescriptor for a column family called foo. For HBase, we need to specify the column family in the name: format, so don’t forget that colon, or you will be faced with an IllegalArgument exception.

desc = ColumnDescriptor( { 'name' : 'foo:' } )

Note that there are many more fields you can use. Either consult the Hbase.thrift file or the hbase/ttypes.py file for details.
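The commenters below report that later Thrift releases generate constructors that take keyword arguments instead of a dict. A helper that copes with both styles, and that also adds the trailing colon when it is missing, might look like the following sketch. Both function names are made up for this illustration, and the fallback logic is an assumption based on the two constructor styles mentioned in this post and its comments.

```python
def family_name(name):
    """Return the column family name with the trailing colon HBase expects."""
    return name if name.endswith(":") else name + ":"


def make_descriptor(cls, name, **fields):
    """Build a descriptor object with either constructor style.

    Keyword arguments are tried first. If the class rejects them with a
    TypeError, fall back to passing a single dict, as in this tutorial.
    """
    fields["name"] = family_name(name)
    try:
        return cls(**fields)
    except TypeError:
        return cls(fields)


# Intended use with the generated class:
#
#     desc = make_descriptor(ColumnDescriptor, 'foo')
```

The order matters: a keyword-style constructor would happily accept the dict as its first positional argument, so the dict form is only tried after keywords have failed. One caveat is that a misspelled field name also raises TypeError and would trigger the fallback, so double-check field names against hbase/ttypes.py.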

Now we’re ready to create our table!

client.createTable('our_table', [desc])
 
print(client.getTableNames())

Running this script should yield a [] followed by ['our_table']. Now we have a table in HBase! Congratulations!

Handling Errors

If you run the script again, you’ll notice that you get an exception since the table name is already in use. This is of course expected, but also highlights Thrift’s ability to propagate exceptions from the remote system.
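While experimenting, it can be convenient to skip the creation when the table is already listed. A minimal sketch using only the two client methods shown so far (ensure_table is a made-up helper name):

```python
def ensure_table(client, table_name, families):
    """Create table_name unless getTableNames() already lists it.

    Returns True if the table was created, False if it was already there.
    """
    if table_name in client.getTableNames():
        return False
    client.createTable(table_name, families)
    return True


# Intended use in client.py:
#
#     ensure_table(client, 'our_table', [desc])
```

The check and the creation are two separate calls, so another client could still create the table in between. The exception can therefore still occur and remains worth handling.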

Exceptions must be predefined in the .thrift interface file. For the case of the createTable method, there are three possible exceptions. Catching them is much like any other exception. Here is our program, changed to catch the AlreadyExists exception:

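try:
    # Describe a single column family named foo (note the trailing colon)
    desc = ColumnDescriptor( d = { 'name' : 'foo:' } )

    client.createTable('our_table', [desc])

    print(client.getTableNames())

except AlreadyExists as tx:
    # Raised by the server when the table name is already taken
    print("Thrift exception")
    print('%s' % (tx.message))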

Note specifically the presence of the message attribute. The Thrift compiler doesn’t generate a nice __str__ or __repr__ method for Python exceptions, so in many cases to determine the exact cause of the error, you need to grab the message attribute.
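Since the generated exceptions print poorly on their own, a tiny formatting helper can save some typing. This is only a sketch: describe_error is a made-up name, and it relies on nothing more than the message attribute described above.

```python
def describe_error(exc):
    """Return a readable one-line description of an exception.

    Prefers a non-empty 'message' attribute and falls back to str(exc),
    or to the bare class name when both are empty.
    """
    detail = getattr(exc, "message", None) or str(exc)
    name = type(exc).__name__
    return "%s: %s" % (name, detail) if detail else name


# Intended use inside an except block:
#
#     except AlreadyExists as tx:
#         print(describe_error(tx))
```

Because it falls back to str(), the same helper also works for ordinary Python exceptions such as socket errors.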

Wrapping Up

Before this turns into an exhaustive documentation of the HBase Thrift API, I’m going to bring this post to a close :). I hope this short example will help you with using HBase from Python, and with combining HBase and Thrift. In a future post, I will touch upon how to create a Python Thrift server and define your own Thrift interface file.

18 Comments

  1. You’re a champ.
    I’ve been itching to try out HBase since I wrote something about the AppEngine Datastore API that illicited the kind of comments that made me think that either I’m missing something, or other people are.

    Hbase is a BigTable copy, but I’m flabbergasted at how simple the GAE Datastore API makes BigTable. Sure, they control the deployment environment, but there’s more lessons in that than I can begin to understand.

    Thanks very much for taking the time with this.
    Cheers
    -Rich

    Comment by Rich — July 20, 2008 @ 5:22 am

  2. Thank you very much for this tutorial!

    Comment by Paolo — August 5, 2008 @ 1:21 am

  3. I just tried building Thrift 20080411p1 on Debian Lenny (installed libboost) but after spitting out many warnings make fails finally with:

    src/main.cc:1114: error: ‘PATH_MAX’ was not declared in this scope
    src/main.cc:1115: error: ‘rp’ was not declared in this scope
    src/main.cc:1118: error: ‘rp’ was not declared in this scope

    I wonder how you managed to build it.

    Comment by Marek Kubica — August 24, 2008 @ 1:07 am

  4. Update: It was a problem with GCC 4.3, see the Thrift FAQ. I could get the Thrift SVN version from the Apache Incubator to run.

    Comment by Marek Kubica — August 25, 2008 @ 4:16 am

  5. Great post, any luck in using thrift to get a HDFS API??

    http://wiki.apache.org/hadoop/HDFS-APIs

    Comment by Goran Cetusic — December 17, 2008 @ 3:39 am

  6. Not yet but will try it soon!

    Comment by Yann — December 23, 2008 @ 10:46 am

  7. @Goran: There’s also Dumbo which lets you do map/reduce on hadoop in python. Though it uses hadoop streaming, not thrift.

    http://wiki.github.com/klbostee/dumbo

    Comment by Tim — March 13, 2009 @ 2:59 am

  8. Hey, Great tutorial – Thanks!.

    It didnt work out of the box for me however with latest HBase (0.20.2) and thrift (0.2.0-incubator).
    It seems that either the thrift auto-generated code or Hbase thrift API changed somewhat.

    in particular, one should use kwargs parameter style when creating ColumnDescriptor:

    Instead of this:
    desc = ColumnDescriptor( { 'name' : 'foo:' } )
    Use this:
    desc = ColumnDescriptor(name='foo')

    Adam

    Comment by Adam — December 25, 2009 @ 11:28 am

  9. One suggestion: If your client.py (as described in this blog) is throwing error saying “…… thrift.transport.TTransport.TTransportException: Could not connect to localhost:9090″ . This could be due to using ‘localhost’…Fix: replace localhost with your IP.
    Note: you may required to make some changes in you /etc/hosts file

    Comment by Guru Prasad — August 10, 2010 @ 1:50 am

  10. Awesome writeup, thanks so much!

    I’ve been playing around with HBase to see if it’s feasible to replace MySQL in our current project; This blog post stopped just a little short of what I needed to know, and I had to figure out how to mutate and delete rows by hand.

    Here’s my benchmark code, I hope it might serve someone else!
    http://pastebin.com/AuuDCKTi

    Comment by Leeward Bound — November 14, 2010 @ 10:13 am

  11. Just wanted to update my previous post –
    I’ve noticed that my server’s not bottlenecked on either processor or memory while running the create and delete operations in the above script. I can only assume the socket connections to thrift are the bottleneck; I was primarily interested in the random seek times, so I didn’t bother modifying the script to multithread.

    Comment by Leeward Bound — November 14, 2010 @ 8:10 pm

  12. When I write a client to insert data into table, I found the CPU is just totally 30% on 8 core system.
    The bottleneck seems that the client read data from local disk file and insert data into htable are serial, so when read , inserting is suspended, and when insert read is suspended.

    After I use multi-thread, the thread is hang on thrift’s hbase.py: recv_mutateRows.

    Does thrift doesn’t support multi-thread?

    Comment by adebug — November 26, 2010 @ 5:54 am

  13. Hi,

    I am trying to use a perl script to connect to my Hbase setup just to perform reads from tables. I have the thrift server running and all. Also compiled thrift successfully and generated the Hbase.pm, Constants.pm and Types.pm files for a perl client. What i cant seem to figure out is where to get the following CPAN packages:

    use Thrift::BinaryProtocol;
    use Thrift::BufferedTransport;
    use Thrift::Socket;

    Any help will be appreciated.

    Thanks,
    Sami

    Comment by Sami — February 13, 2011 @ 2:26 am

  14. I figured out the anwer to my last post. Under the lib/perl directory after extracting Thrift and running make. You need to run make in the lib/perl directory to install these .pm packages.

    Thanks,
    Sami

    Comment by Sami — February 13, 2011 @ 2:53 am

  15. When i configured thrift 0.5.0 and build a client as given here, i got a type error as

    “TypeError: write() argument 1 must be string or read-only character buffer, not dict” for createTable method

    I solved this by using client.createTable("table_name", [ColumnDescriptor(name="foo:")])

    Thought this might be useful for some newbie like me :)

    Cheers,
    Varadharajan

    Comment by Varadharajan — February 26, 2011 @ 2:54 am

  16. @Varadharajan I also got that error “TypeError: write() argument 1 must be string or read-only character buffer, not dict” for createTable method

    I solved it with

    desc = ColumnDescriptor( **{'name':'foo:'})
    the double star translates it into kwargs

    Comment by Oscar — March 15, 2011 @ 2:17 pm

  17. There is also HappyBase, a developer-friendly Python library to interact with Apache HBase: https://github.com/wbolster/happybase

    Comment by Wouter Bolsterlee — May 24, 2012 @ 3:10 pm

  18. Hbase provides Big table-like capabilities on top of Hadoop and HDFS.

    Comment by benslinkard — September 26, 2012 @ 3:18 am
