Background
HBase Version: 0.90.6
RegionSplitter
Usage
The org.apache.hadoop.hbase.util.RegionSplitter class provides several utilities to create a table with a specified number of pre-split regions.
For example, to create a table named ‘myTable’ with 60 pre-split regions and two column families, ‘test’ and ‘rs’, run the following (a programmatic equivalent is sketched below):
bin/hbase org.apache.hadoop.hbase.util.RegionSplitter -c 60 -f test:rs myTable
- -c 60 specifies the requested number of pre-split regions (60)
- -f specifies the column families to create in the table, separated by “:”
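If you prefer to create the pre-split table from application code rather than the shell, the same split points can be computed with RegionSplitter.MD5StringSplit and handed to HBaseAdmin.createTable(desc, splitKeys). A minimal sketch, assuming MD5StringSplit's default constructor and the split(int) method excerpted in the next section; the class name is illustrative and error handling is omitted:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.RegionSplitter;

public class CreatePresplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("myTable");
    desc.addFamily(new HColumnDescriptor("test"));
    desc.addFamily(new HColumnDescriptor("rs"));

    // 59 split keys => 60 regions, the same boundaries the command line requests
    byte[][] splits = new RegionSplitter.MD5StringSplit().split(60);
    admin.createTable(desc, splits);
  }
}
The idea mirrors the command-line tool: compute n - 1 split keys, then create the table with them.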
SplitAlgorithm
MD5StringSplit
is the default RegionSplitter.SplitAlgorithm for creating pre-split tables. The format of an MD5StringSplit boundary is the ASCII representation of an MD5 checksum: rows are long values in the range “00000000” => “7FFFFFFF”, left-padded with zeros so they sort lexicographically in the same order as the underlying binary values.
// excerpt from org.apache.hadoop.hbase.util.RegionSplitter (0.90.x)
public static class MD5StringSplit implements SplitAlgorithm {
  final static String MAXMD5 = "7FFFFFFF";
  final static BigInteger MAXMD5_INT = new BigInteger(MAXMD5, 16);

  public byte[][] split(int n) {
    // n regions require n - 1 boundary keys
    BigInteger[] splits = new BigInteger[n - 1];
    BigInteger sizeOfEachSplit = MAXMD5_INT.divide(BigInteger.valueOf(n));
    for (int i = 1; i < n; i++) {
      // NOTE: this means the last region gets all the slop.
      // This is not a big deal if we're assuming n << MAXMD5
      splits[i - 1] = sizeOfEachSplit.multiply(BigInteger.valueOf(i));
    }
    return convertToBytes(splits);
  }
  ...
}
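The elided convertToBytes step renders each boundary as a zero-padded, 8-character hex string (the “00000000” => “7FFFFFFF” format described above). A standalone sketch, not the HBase source, that prints the 59 boundaries split(60) would produce:
import java.math.BigInteger;

// Standalone illustration: print the 59 boundary keys for a 60-region table,
// rendered as zero-padded 8-character hex strings.
public class PrintSplitBoundaries {
  public static void main(String[] args) {
    final BigInteger MAXMD5_INT = new BigInteger("7FFFFFFF", 16);
    int n = 60;
    BigInteger sizeOfEachSplit = MAXMD5_INT.divide(BigInteger.valueOf(n));
    for (int i = 1; i < n; i++) {
      String boundary = sizeOfEachSplit.multiply(BigInteger.valueOf(i)).toString(16);
      while (boundary.length() < 8) {   // left-pad so keys sort lexicographically
        boundary = "0" + boundary;
      }
      System.out.println(boundary);     // the first boundary is "02222222"
    }
  }
}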
Rowkey Design
To spread writes evenly across the pre-split regions, prefix each rowkey with the same kind of hash the split algorithm expects: take the MD5 of the id, mask it into the range “00000000” => “7FFFFFFF”, and left-pad the hex string to 8 characters (a sketch mapping the prefix back to a region index follows the code).
String id = "your_id";
// MD5 of the id as a 32-character hex string
String md5 = org.apache.commons.codec.digest.DigestUtils.md5Hex(id);
BigInteger x1 = new BigInteger(md5, 16);
BigInteger x2 = new BigInteger("7FFFFFFF", 16);
// keep only the low 31 bits so the hash falls in [00000000, 7FFFFFFF]
BigInteger x3 = x1.and(x2);
String rowkey_hash = x3.toString(16);
// left-pad to 8 characters so keys sort like the split boundaries
while (rowkey_hash.length() < 8) {
  rowkey_hash = "0" + rowkey_hash;
}
String rowkey = rowkey_hash + "_" + id;
System.out.println(rowkey);
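As a sanity check, the 8-character prefix can be mapped back to a region index for the 60-region layout above; this is for illustration only and is not an HBase API:
import java.math.BigInteger;

// Illustration only: estimate which of the 60 pre-split regions (0..59) a
// rowkey with the given hash prefix falls into. Boundaries are multiples of
// MAXMD5_INT / 60, so the index is an integer division; values past the last
// boundary land in the final region (it "gets all the slop").
public class RegionForRowkey {
  public static void main(String[] args) {
    String rowkeyHash = "02222221";   // hypothetical prefix produced by the code above
    BigInteger maxMd5 = new BigInteger("7FFFFFFF", 16);
    int n = 60;
    BigInteger sizeOfEachSplit = maxMd5.divide(BigInteger.valueOf(n));
    int region = new BigInteger(rowkeyHash, 16).divide(sizeOfEachSplit).intValue();
    if (region > n - 1) {
      region = n - 1;                 // clamp into the last region
    }
    System.out.println("region index: " + region);   // prints 0 for this prefix
  }
}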
Scenario
- Data Size: 4 TB
- HBase Cluster: 10 nodes
- hbase.hregion.max.filesize: 2 GB
- Cluster Region Count: 4 TB / 2 GB = 2048 regions
- Node Region Count: 2048 / 10 ≈ 205 regions per node (see the sketch below)
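The same sizing arithmetic as a tiny sketch, so other data sizes or filesize settings can be plugged in:
// The scenario's sizing arithmetic, parameterized. Values mirror the list
// above: 4 TB of data, 2 GB hbase.hregion.max.filesize, 10 nodes.
public class RegionSizing {
  public static void main(String[] args) {
    long dataSizeGb = 4L * 1024;   // 4 TB expressed in GB
    long maxFileSizeGb = 2;        // hbase.hregion.max.filesize = 2 GB
    int nodes = 10;

    long clusterRegions = dataSizeGb / maxFileSizeGb;          // 2048
    double regionsPerNode = (double) clusterRegions / nodes;   // 204.8

    System.out.println("cluster regions:  " + clusterRegions);
    System.out.println("regions per node: " + regionsPerNode);
  }
}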