<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Yann&#039;s Blog &#187; Python</title>
	<atom:link href="http://yannramin.com/tag/python/feed/" rel="self" type="application/rss+xml" />
	<link>http://yannramin.com</link>
	<description></description>
	<lastBuildDate>Sat, 21 Jan 2012 05:23:08 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Using Facebook&#8217;s Thrift with Python and HBase</title>
		<link>http://yannramin.com/2008/07/19/using-facebook-thrift-with-python-and-hbase/</link>
		<comments>http://yannramin.com/2008/07/19/using-facebook-thrift-with-python-and-hbase/#comments</comments>
		<pubDate>Sat, 19 Jul 2008 09:25:47 +0000</pubDate>
		<dc:creator>Yann</dc:creator>
				<category><![CDATA[Python]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[Thrift]]></category>
		<category><![CDATA[Tome]]></category>

		<guid isPermaLink="false">http://yannramin.com/?p=58</guid>
		<description><![CDATA[Today I&#8217;m going to show you how to interface Python to Apache HBase using Facebook&#8217;s Thrift package. HBase is a column-oriented database which is very similar to Google&#8217;s BigTable (in fact it&#8217;s more or less a clone of BigTable as seen in the BigTable paper). HBase has two primary interfaces &#8211; a REST API [...]]]></description>
			<content:encoded><![CDATA[<p>Today I&#8217;m going to show you how to interface Python to <a href="http://www.hbase.org">Apache HBase</a> using <a href="http://developers.facebook.com/thrift/">Facebook&#8217;s Thrift</a> package. HBase is a column-oriented database which is very similar to Google&#8217;s BigTable (in fact it&#8217;s more or less a clone of BigTable as seen in the <a href="http://labs.google.com/papers/bigtable.html">BigTable paper</a>). HBase has two primary interfaces &#8211; a REST API, which is relatively slow, and a Thrift interface, which is recommended for high-speed communication. For speed and other reasons, we&#8217;re going to be using the Thrift API.</p>
<p>Note that I am going to be touching on some HBase jargon (such as column families). It&#8217;s not essential to understand those terms if you are just trying to build a Python Thrift client. But if you&#8217;re trying to use HBase, I would consider that knowledge essential.</p>
<h2>Getting Set Up</h2>
<p>First things first: you need to grab a copy of both HBase and Thrift. For this tutorial, I am using the Subversion copy of HBase (as of July 18th) and Thrift version 20080411p1. Thrift is shipped as a source package, so you will need a compiler toolchain, as well as any Python development packages or header files your system may require (such as <em>python-dev</em> on Debian/Ubuntu). You&#8217;ll also need a Java JDK package (such as <em>sun-java6-jdk</em> on Ubuntu).</p>
<p>Thrift can be compiled using the standard routine:</p>
<pre>./configure
make -j4
sudo make install</pre>
<p>After installing Thrift, you should have a system-wide &#8216;thrift&#8217; command available, which prints some usage information when run. Thrift describes the communication layer in a descriptor file with a .thrift extension. I&#8217;m not going to describe how to create such a descriptor file here (perhaps in a later blog post), as we&#8217;ll be using the one provided by HBase (with one small tweak). You will need the HBase source package for this exercise.</p>
<h2>Build a Thrift Client Package</h2>
<p>Open up <em>[hbasesrc]/src/java/org/apache/hadoop/hbase/thrift/Hbase.thrift</em> in your favorite text editor. Search for lines containing <em>ruby_namespace</em>, and add the following line in the same region:</p>
<pre>namespace py hbase</pre>
<p>(Alert readers will wonder why we didn&#8217;t use py_namespace. The reason is simple: the xxx_namespace Thrift directives are deprecated and have been replaced with namespace xxx.)</p>
<p>Next up, we&#8217;ll generate our Python HBase Thrift interface. Open a shell in the same directory, and run:</p>
<pre>thrift --gen py Hbase.thrift</pre>
<p>We now have a set of generated Python classes in the <em>gen-py</em> folder, which will allow us to talk to the HBase Thrift server. Let&#8217;s set up our Python Thrift client now. Grab the <em>hbase</em> folder inside the gen-py folder, and move it to a project directory of your choosing.</p>
<h2>Building a Client</h2>
<p>Next up, we&#8217;ll need to work on the Python Thrift client application. I suggest starting with the Thrift server tutorial for a boilerplate template. Below is the file we&#8217;re going to use (let&#8217;s just assume it is called <em>client.py</em> for this discussion):</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #808080; font-style: italic;">#!/usr/bin/env python</span>
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">sys</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">from</span> thrift <span style="color: #ff7700;font-weight:bold;">import</span> Thrift
<span style="color: #ff7700;font-weight:bold;">from</span> thrift.<span style="color: black;">transport</span> <span style="color: #ff7700;font-weight:bold;">import</span> TSocket
<span style="color: #ff7700;font-weight:bold;">from</span> thrift.<span style="color: black;">transport</span> <span style="color: #ff7700;font-weight:bold;">import</span> TTransport
<span style="color: #ff7700;font-weight:bold;">from</span> thrift.<span style="color: black;">protocol</span> <span style="color: #ff7700;font-weight:bold;">import</span> TBinaryProtocol
&nbsp;
<span style="color: #ff7700;font-weight:bold;">from</span> hbase <span style="color: #ff7700;font-weight:bold;">import</span> Hbase
<span style="color: #ff7700;font-weight:bold;">from</span> hbase.<span style="color: black;">ttypes</span> <span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #66cc66;">*</span></pre></div></div>

<p>This is general Thrift boilerplate. The only application-specific portions so far are the last two lines. Hbase is the name of the service as described in the <em>Hbase.thrift</em> file.</p>
<p>Next up, we&#8217;re going to try to connect to our HBase instance. To do that, we will first create a TSocket, then add a TBufferedTransport over the raw socket, and then wrap that in a TBinaryProtocol. If someone has studied too much Java, it was the Thrift developers <img src='http://yannramin.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> .</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #808080; font-style: italic;"># Make socket</span>
transport = TSocket.<span style="color: black;">TSocket</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'localhost'</span>, <span style="color: #ff4500;">9090</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #808080; font-style: italic;"># Buffering is critical. Raw sockets are very slow</span>
transport = TTransport.<span style="color: black;">TBufferedTransport</span><span style="color: black;">&#40;</span>transport<span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #808080; font-style: italic;"># Wrap in a protocol</span>
protocol = TBinaryProtocol.<span style="color: black;">TBinaryProtocol</span><span style="color: black;">&#40;</span>transport<span style="color: black;">&#41;</span></pre></div></div>

<p>Now for two application-specific lines &#8211; we&#8217;re going to build an Hbase.Client() object, and then finally open up our transport.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">client = Hbase.<span style="color: black;">Client</span><span style="color: black;">&#40;</span>protocol<span style="color: black;">&#41;</span>
&nbsp;
transport.<span style="color: #008000;">open</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></pre></div></div>
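<p>The script above opens the transport but never closes it. One way to guarantee cleanup is a small context manager. This is only a sketch: the name <em>opened</em> is hypothetical and not part of Thrift, and it assumes nothing beyond the transport having the usual open() and close() methods.</p>

```python
from contextlib import contextmanager


@contextmanager
def opened(transport):
    """Open `transport`, yield it, and always close it afterwards."""
    transport.open()
    try:
        yield transport
    finally:
        # Runs even if the body raises, so the socket is not leaked.
        transport.close()
```

<p>With this in place, the calls to the client can go inside a <em>with opened(transport):</em> block instead of calling transport.open() directly.</p>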

<p>We can do a quick validation pass now, and start up HBase (if you have a running HBase server somewhere, you can of course omit this step). If you have a source checkout of HBase, compiling is as simple as running the <em>ant</em> tool. Assuming you have the JDK installed, HBase should be ready for action in under a minute. Start up a master HBase instance by running <em>bin/hbase master start &amp;</em>. Then, start up a Thrift server for HBase by running <em>bin/hbase thrift start</em>.</p>
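<p>The Thrift server can take a moment to start listening. If you want to check from Python that something is accepting connections, a rough sketch using only the standard library might look like this. The helper name <em>wait_for_port</em> and its retry parameters are my own invention, not part of HBase or Thrift.</p>

```python
import socket
import time


def wait_for_port(host, port, attempts=10, delay=0.5):
    """Return True once a TCP connection to (host, port) succeeds.

    Tries up to `attempts` times, sleeping `delay` seconds between tries,
    and returns False if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            conn = socket.create_connection((host, port), timeout=1.0)
        except (socket.error, socket.timeout):
            # Nothing is listening yet; wait before the next attempt
            # (there is no point sleeping after the final one).
            if attempt + 1 != attempts:
                time.sleep(delay)
        else:
            conn.close()
            return True
    return False
```

<p>For the setup in this post you would call <em>wait_for_port('localhost', 9090)</em>, since 9090 is the port the client script connects to. Note that this only confirms that a socket is listening, not that the service behind it is healthy.</p>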
<p>Running our client script now should produce no errors. If it does fail, stop and try to figure out what is wrong (did you move the gen-py/hbase directory to where your client.py script is, or set the Python path appropriately?).</p>
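<p>If you would rather leave the generated code where the compiler put it, one option is to add the <em>gen-py</em> folder to the module search path at the top of client.py, before the <em>from hbase import Hbase</em> line. A minimal sketch (the function name is hypothetical):</p>

```python
import os
import sys


def add_generated_package_dir(gen_dir):
    """Put `gen_dir` (for example the gen-py folder) at the front of sys.path.

    Returns the absolute path that was added so the caller can log it.
    """
    gen_dir = os.path.abspath(gen_dir)
    if gen_dir not in sys.path:
        sys.path.insert(0, gen_dir)
    return gen_dir
```

<p>Calling <em>add_generated_package_dir('gen-py')</em> resolves the folder relative to the current working directory, so adjust the argument if you run the script from somewhere else.</p>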
<h2>Using the Client</h2>
<p>Let&#8217;s call our first method: getTableNames(). Add this to the end of our script:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">print</span> client.<span style="color: black;">getTableNames</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></pre></div></div>

<p>By default, it will simply print an empty list ([]), unless of course you have created tables. This is the simplest example of using Thrift with HBase and Python, where no special data structures are needed or passed around. But if we look at the <a href="http://wiki.apache.org/hadoop/Hbase/ThriftApi">HBase Thrift API</a> (not up to date &#8211; for full details look at the <em>Hbase.thrift</em> file), we can see that some methods require parameters in the form of Thrift structs.</p>
<p>Let&#8217;s try to create a <em>table</em> in HBase. When we consult <em>Hbase.thrift</em>, we can see that createTable requires a list of ColumnDescriptors.</p>
<pre>  /**
   * Create a table with the specified column families.  The name
   * field for each ColumnDescriptor must be set and must end in a
   * colon (:).  All other fields are optional and will get default
   * values if not explicitly specified.
   *
   * @param tableName name of table to create
   * @param columnFamilies list of column family descriptors
   *
   * @throws IllegalArgument if an input parameter is invalid
   * @throws AlreadyExists if the table name already exists
   */
  void createTable(1:Text tableName, 2:list&lt;ColumnDescriptor&gt; columnFamilies)
    throws (1:IOError io, 2:IllegalArgument ia, 3:AlreadyExists exist)</pre>
<p>Luckily, the Thrift compiler has generated a Python class for this ColumnDescriptor (which we pulled in with <em>from hbase.ttypes import *</em>). Sadly, this isn&#8217;t the most Pythonic of classes, but it will be quite serviceable for our needs. Let&#8217;s build a ColumnDescriptor for a column family called <em>foo</em>. For HBase, we need to specify the column family in the <em>name</em><strong>:</strong> format &#8211; so don&#8217;t forget that colon, or you will be faced with an IllegalArgument exception.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">desc = ColumnDescriptor<span style="color: black;">&#40;</span> <span style="color: black;">&#123;</span> <span style="color: #483d8b;">'name'</span> : <span style="color: #483d8b;">'foo:'</span> <span style="color: black;">&#125;</span> <span style="color: black;">&#41;</span></pre></div></div>

<p>Note that there are many more fields you can use. Either consult the <em>Hbase.thrift</em> file or the <em>hbase/ttypes.py</em> file for details.</p>
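<p>Since the trailing colon is easy to forget, a tiny helper can normalize family names before they go into a ColumnDescriptor. This is just a sketch. <em>family_name</em> is a name I made up, and it only handles the trailing colon described above.</p>

```python
def family_name(name):
    """Return `name` with exactly one trailing colon, e.g. 'foo' becomes 'foo:'."""
    stripped = name.rstrip(':')
    if not stripped:
        raise ValueError('column family name must not be empty')
    return stripped + ':'
```

<p>Both <em>family_name('foo')</em> and <em>family_name('foo:')</em> return 'foo:', so the result can be used as the <em>name</em> value either way.</p>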
<p>Now we&#8217;re ready to create our table!</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">client.<span style="color: black;">createTable</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'our_table'</span>, <span style="color: black;">&#91;</span>desc<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">print</span> client.<span style="color: black;">getTableNames</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></pre></div></div>

<p>Running this script should yield a [] followed by ['our_table']. Now we have a table in Hbase! Congratulations!</p>
<h2>Handling Errors</h2>
<p>If you run the script again, you&#8217;ll notice that you get an exception since the table name is already in use. This is of course expected, but also highlights Thrift&#8217;s ability to <em>propagate exceptions</em> from the remote system.</p>
<p>Exceptions must be predefined in the .thrift interface file. For the case of the <em>createTable</em> method, there are three possible exceptions. Catching them is much like any other exception. Here is our program, changed to catch the AlreadyExists exception:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">try</span>:
    desc = ColumnDescriptor<span style="color: black;">&#40;</span> d = <span style="color: black;">&#123;</span> <span style="color: #483d8b;">'name'</span> : <span style="color: #483d8b;">'foo:'</span> <span style="color: black;">&#125;</span> <span style="color: black;">&#41;</span>
&nbsp;
    client.<span style="color: black;">createTable</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'our_table'</span>, <span style="color: black;">&#91;</span>desc<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
&nbsp;
    <span style="color: #ff7700;font-weight:bold;">print</span> client.<span style="color: black;">getTableNames</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">except</span> AlreadyExists, tx:
    <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;Thrift exception&quot;</span>
    <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">'%s'</span> <span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span>tx.<span style="color: black;">message</span><span style="color: black;">&#41;</span></pre></div></div>

<p>Note specifically the presence of the <em>message</em> attribute. The Thrift compiler doesn&#8217;t generate a nice __str__ or __repr__ method for Python exceptions, so in many cases to determine the exact cause of the error, you need to grab the <em>message</em> attribute.</p>
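<p>If you end up catching several of these exceptions, it may be convenient to pull the message lookup into one place. The sketch below is an illustration only: <em>error_text</em> is a hypothetical helper, and it assumes nothing about the exception beyond an optional <em>message</em> attribute.</p>

```python
def error_text(exc):
    """Return a readable description of a Thrift-generated exception.

    Prefers a non-empty `message` attribute and falls back to repr().
    """
    message = getattr(exc, 'message', None)
    if message:
        return str(message)
    return repr(exc)
```

<p>Inside the except block above, printing <em>error_text(tx)</em> then works whether or not the server filled in a message.</p>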
<h2>Wrapping Up</h2>
<p>Before this turns into an exhaustive documentation of the HBase Thrift API, I&#8217;m going to bring this post to a close <img src='http://yannramin.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> . I hope this short example will help you with using HBase from Python, and with combining HBase and Thrift. In a future post, I will touch upon how to create a Python Thrift server, and define your own Thrift interface file.</p>
]]></content:encoded>
			<wfw:commentRss>http://yannramin.com/2008/07/19/using-facebook-thrift-with-python-and-hbase/feed/</wfw:commentRss>
		<slash:comments>16</slash:comments>
		</item>
	</channel>
</rss>


