Reposted

Introduction to Starbase, a Python API Module for HBase

The following guest post is provided by Artur Barseghyan, a web developer currently employed by Goldmund, Wyldebeast & Wunderliebe in The Netherlands.

Python is my personal (and primary) programming language of choice and also happens to be the primary programming language at my company. So, when starting to work with a new technology, I prefer to use a clean and easy (Pythonic!) API.

After studying tons of articles on the web, reading (and writing) white papers, and doing basic performance tests (sometimes hard if you’re on a tight schedule), my company recently selected Cloudera for our Big Data platform (including using Apache HBase as our data store for Apache Hadoop), with Cloudera Manager serving a role as “one console to rule them all.”

However, I was surprised shortly thereafter to learn that there was no working Python wrapper around the REST API for HBase (aka Stargate). I decided to write one in my free time, and the result, ladies and gentlemen, was Starbase (GPL).

In this post, I will provide some code samples and briefly explain what work has been done on Starbase. I assume that the reader already has a basic understanding of HBase (that is, of tables, column families, qualifiers, and so on).

1. Installation

Next, I’ll show you some frequently used commands and use cases. But first, install the current version of Starbase from the Cheese Shop (PyPI).

# pip install starbase

Import the module:

>>> from starbase import Connection

…and create a connection instance. Starbase defaults to 127.0.0.1:8000; if your settings are different, specify them here.

>>> c = Connection()

2. API Usage Examples

2.1 Listing all tables

Assuming that two tables named table1 and table2 already exist, the following would be printed:

>>> c.tables()
 ['table1', 'table2']

2.2 Table schema operations

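Whenever you need to operate on a table, you first need to create a table instance.

Create a table instance (note that no table is actually created at this step):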

>>> t = c.table('table3')

Create a new table with columns 'column1', 'column2' and 'column3' (here the table is actually created):

>>> t.create('column1', 'column2', 'column3')
 201
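
The integers returned by calls such as create look like HTTP status codes coming back from the REST API. The post does not say so explicitly, so treat this as an assumption. Under that assumption, a minimal sketch using Python 3's standard library can turn them into readable labels (describe_status is a hypothetical helper, not part of Starbase):

```python
from http import HTTPStatus


def describe_status(code):
    """Return a readable label for an HTTP status code (hypothetical helper)."""
    status = HTTPStatus(code)
    return '%d %s' % (status.value, status.phrase)


print(describe_status(201))  # the value shown for t.create(...)
print(describe_status(200))  # the value shown for most other calls
```

Read this way, 201 would mean the resource was created and 200 that the request succeeded.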

Check whether the table exists:

>>> t.exists()
 True

Show the table's columns:

>>> t.columns()
 ['column1', 'column2', 'column3']

Add columns to the table ('column4', 'column5', 'column6', 'column7'):

>>> t.add_columns('column4', 'column5', 'column6', 'column7')
 200

Drop columns ('column6', 'column7'):

>>> t.drop_columns('column6', 'column7')
 201

Drop the entire table:

>>> t.drop()
 200

2.3 Table data operations

Insert data into a single row:

>>> t.insert(
...     'my-key-1',
...     {
...         'column1': {'key11': 'value 11', 'key12': 'value 12', 'key13': 'value 13'},
...         'column2': {'key21': 'value 21', 'key22': 'value 22'},
...         'column3': {'key31': 'value 31', 'key32': 'value 32'}
...     }
... )
 200

Note that you may also use the native way of naming columns and cells (qualifiers). The following stores the same data as the previous example.

>>> t.insert(
...     'my-key-1a',
...     {
...         'column1:key11': 'value 11', 'column1:key12': 'value 12', 'column1:key13': 'value 13',
...         'column2:key21': 'value 21', 'column2:key22': 'value 22',
...         'column3:key31': 'value 31', 'column3:key32': 'value 32'
...     }
... )
 200
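
To make the relationship between the two naming styles concrete, here is a minimal pure-Python sketch that flattens the nested form into the native 'family:qualifier' form. The helper name to_native is hypothetical and is not part of Starbase. It only illustrates how the two dictionaries correspond.

```python
def to_native(nested):
    """Flatten {'family': {'qualifier': value}} into {'family:qualifier': value}."""
    flat = {}
    for family, cells in nested.items():
        for qualifier, value in cells.items():
            flat['%s:%s' % (family, qualifier)] = value
    return flat


nested = {
    'column1': {'key11': 'value 11', 'key12': 'value 12'},
    'column2': {'key21': 'value 21'},
}
print(to_native(nested))
```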

Update row data:

>>> t.update(
 >>>     'my-key-1',
 >>>     {'column4': {'key41': 'value 41', 'key42': 'value 42'}}
 >>> )
 200

Remove a row cell (qualifier):

>>> t.remove('my-key-1', 'column4', 'key41')
 200

Remove a row column (column family):

>>> t.remove('my-key-1', 'column4')
 200

Remove an entire row:

>>> t.remove('my-key-1')
 200

Fetch a single row with all columns:

>>> t.fetch('my-key-1')
   {
       'column1': {'key11': 'value 11', 'key12': 'value 12', 'key13': 'value 13'},
       'column2': {'key21': 'value 21', 'key22': 'value 22'},
       'column3': {'key31': 'value 31', 'key32': 'value 32'}
   }

Fetch a single row with selected columns (limit to the 'column1' and 'column2' columns):

>>> t.fetch('my-key-1', ['column1', 'column2'])
   {
       'column1': {'key11': 'value 11', 'key12': 'value 12', 'key13': 'value 13'},
       'column2': {'key21': 'value 21', 'key22': 'value 22'},
   }

Narrow the result set even more (limit to cells 'key11' and 'key13' of column 'column1' and cell 'key32' of column 'column3'):

>>> t.fetch('my-key-1', {'column1': ['key11', 'key13'], 'column3': ['key32']})
   {
       'column1': {'key11': 'value 11', 'key13': 'value 13'},
       'column3': {'key32': 'value 32'}
   }

Note that you may also use the native means of naming the columns and cells (qualifiers). The example below does exactly the same thing as the example above.

>>> t.fetch('my-key-1', ['column1:key11', 'column1:key13', 'column3:key32'])
   {
       'column1': {'key11': 'value 11', 'key13': 'value 13'},
       'column3': {'key32': 'value 32'}
   }

If you set the perfect_dict argument to False, you’ll get the native data structure:

>>> t.fetch('my-key-1', ['column1:key11', 'column1:key13', 'column3:key32'], perfect_dict=False)
 {
     'column1:key11': 'value 11', 'column1:key13': 'value 13',
     'column3:key32': 'value 32'
 }
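
The nested result and the native result carry the same information. As a rough sketch of how the native structure maps onto the nested one, the hypothetical helper to_nested below (not part of Starbase) groups 'family:qualifier' names by column family:

```python
def to_nested(flat):
    """Group {'family:qualifier': value} into {'family': {'qualifier': value}}."""
    nested = {}
    for name, value in flat.items():
        # Split only on the first colon: family on the left, qualifier on the right.
        family, _, qualifier = name.partition(':')
        nested.setdefault(family, {})[qualifier] = value
    return nested


native = {
    'column1:key11': 'value 11', 'column1:key13': 'value 13',
    'column3:key32': 'value 32',
}
print(to_nested(native))
```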

2.4 Batch operations on table data

Batch operations (insert and update) work similarly to routine insert and update, but are done in a batch. You are advised to operate in batch as much as possible.

In the example below, we will insert 5,000 records in a batch:

>>> data = {
...     'column1': {'key11': 'value 11', 'key12': 'value 12', 'key13': 'value 13'},
...     'column2': {'key21': 'value 21', 'key22': 'value 22'},
... }
>>> b = t.batch()
>>> for i in range(0, 5000):
...     b.insert('my-key-%s' % i, data)
...
>>> b.commit(finalize=True)
 {'method': 'PUT', 'response': [200], 'url': 'table3/bXkta2V5LTA='}
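
The last path segment of the url value looks like base64 text. Decoding it yields the first row key produced by the loop. This is an observation about the sample output rather than documented behavior, but it suggests that row keys are base64-encoded when sent to the REST API. A quick check with the standard library:

```python
import base64

encoded = 'bXkta2V5LTA='  # last path segment of the 'url' value above
decoded = base64.b64decode(encoded).decode('ascii')
print(decoded)

# Encoding the first row key of the loop gives back the same token.
reencoded = base64.b64encode(b'my-key-0').decode('ascii')
print(reencoded == encoded)
```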

In the example below, we will update 5,000 records in a batch:

>>> data = {
...     'column3': {'key31': 'value 31', 'key32': 'value 32'},
... }
>>> b = t.batch()
>>> for i in range(0, 5000):
...     b.update('my-key-%s' % i, data)
...
>>> b.commit(finalize=True)
 {'method': 'POST', 'response': [200], 'url': 'table3/bXkta2V5LTA='}

Note: The table batch method accepts an optional size argument (int). If set, an auto-commit is fired each time the stack is full.
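
The auto-commit idea can be illustrated without a running HBase instance. The class below is a hypothetical stand-in written for this example, not Starbase's implementation. It buffers rows on a stack and calls a commit function whenever the stack reaches the configured size.

```python
class BufferedBatch(object):
    """Hypothetical stand-in for a batch with an optional size limit."""

    def __init__(self, commit_func, size=None):
        self._commit_func = commit_func  # called with the list of pending rows
        self._size = size
        self._stack = []

    def insert(self, row_key, data):
        self._stack.append((row_key, data))
        # Auto-commit as soon as the stack reaches the configured size.
        if self._size is not None and len(self._stack) >= self._size:
            self.commit()

    def commit(self):
        # Send whatever is pending, then start with an empty stack.
        if self._stack:
            self._commit_func(list(self._stack))
            self._stack = []


sent = []  # records the size of every committed chunk
batch = BufferedBatch(lambda rows: sent.append(len(rows)), size=1000)
for i in range(2500):
    batch.insert('my-key-%s' % i, {'column1': {'key11': 'value 11'}})
batch.commit()  # flush the rows left over after the last auto-commit
print(sent)
```

With a size of 1000 and 2,500 rows, the sketch commits two full chunks automatically and the final explicit commit flushes the remainder.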

2.5 Table data search (row scanning)

A table scanning feature is in development. At the moment it’s only possible to fetch all rows from a table (a full table scan); row-key range scans are not yet supported. The result set is returned as a generator.
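
Because the result set is a generator, rows are produced lazily and can be consumed a few at a time instead of being loaded into memory all at once. The sketch below uses a hypothetical stand-in generator, fake_rows, in place of a real scan, since the scanning call itself is not shown in this post.

```python
from itertools import islice


def fake_rows(count):
    """Hypothetical stand-in for a scan result: yields rows one at a time."""
    for i in range(count):
        yield {'column1': {'key11': 'value %d' % i}}


rows = fake_rows(5000)                # nothing has been produced yet
first_three = list(islice(rows, 3))   # pulls only the first three rows
remaining = sum(1 for _ in rows)      # consumes the rest of the generator
print(len(first_three), remaining)
```

Note that a generator can only be iterated once, so anything that needs a second pass has to keep its own copy of the rows.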