Background
HBase Version: 0.90.6
RegionSplitter
Usage
The org.apache.hadoop.hbase.util.RegionSplitter class provides several utilities to create a table with a specified number of pre-split regions.
For example, to create a table named ‘myTable’ with 60 pre-split regions and two column families, ‘test’ and ‘rs’, run the following (a programmatic equivalent is sketched below):
bin/hbase org.apache.hadoop.hbase.util.RegionSplitter -c 60 -f test:rs myTable
- -c 60 specifies the requested number of pre-split regions (60)
- -f specifies the column families to create in the table, separated by “:”
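If you prefer to create the pre-split table from application code rather than the shell, the same split points can be computed with RegionSplitter.MD5StringSplit and handed to HBaseAdmin.createTable(desc, splitKeys). A minimal sketch, assuming MD5StringSplit's default constructor and the split(int) method excerpted in the next section; the class name is illustrative and error handling is omitted:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.RegionSplitter;

public class CreatePresplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("myTable");
    desc.addFamily(new HColumnDescriptor("test"));
    desc.addFamily(new HColumnDescriptor("rs"));

    // 59 split keys => 60 regions, the same boundaries the command line requests
    byte[][] splits = new RegionSplitter.MD5StringSplit().split(60);
    admin.createTable(desc, splits);
  }
}
The idea mirrors the command-line tool: compute n - 1 split keys, then create the table with them.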
SplitAlgorithm
MD5StringSplit
is the default RegionSplitter.SplitAlgorithm for creating pre-split tables. The format of an MD5StringSplit boundary is the ASCII representation of an MD5 checksum: rows are long values in the range “00000000” => “7FFFFFFF”, left-padded with zeros so they sort lexicographically in the same order as the underlying binary values.
// excerpt from org.apache.hadoop.hbase.util.RegionSplitter (0.90.x)
public static class MD5StringSplit implements SplitAlgorithm {
  final static String MAXMD5 = "7FFFFFFF";
  final static BigInteger MAXMD5_INT = new BigInteger(MAXMD5, 16);

  public byte[][] split(int n) {
    // n regions require n - 1 boundary keys
    BigInteger[] splits = new BigInteger[n - 1];
    BigInteger sizeOfEachSplit = MAXMD5_INT.divide(BigInteger.valueOf(n));
    for (int i = 1; i < n; i++) {
      // NOTE: this means the last region gets all the slop.
      // This is not a big deal if we're assuming n << MAXMD5
      splits[i - 1] = sizeOfEachSplit.multiply(BigInteger.valueOf(i));
    }
    return convertToBytes(splits);
  }
  ...
}
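The elided convertToBytes step renders each boundary as a zero-padded, 8-character hex string (the “00000000” => “7FFFFFFF” format described above). A standalone sketch, not the HBase source, that prints the 59 boundaries split(60) would produce:
import java.math.BigInteger;

// Standalone illustration: print the 59 boundary keys for a 60-region table,
// rendered as zero-padded 8-character hex strings.
public class PrintSplitBoundaries {
  public static void main(String[] args) {
    final BigInteger MAXMD5_INT = new BigInteger("7FFFFFFF", 16);
    int n = 60;
    BigInteger sizeOfEachSplit = MAXMD5_INT.divide(BigInteger.valueOf(n));
    for (int i = 1; i < n; i++) {
      String boundary = sizeOfEachSplit.multiply(BigInteger.valueOf(i)).toString(16);
      while (boundary.length() < 8) {   // left-pad so keys sort lexicographically
        boundary = "0" + boundary;
      }
      System.out.println(boundary);     // the first boundary is "02222222"
    }
  }
}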
Rowkey Design
To spread writes evenly across the pre-split regions, prefix each rowkey with the same kind of hash the split algorithm expects: take the MD5 of the id, mask it into the range “00000000” => “7FFFFFFF”, and left-pad the hex string to 8 characters (a sketch mapping the prefix back to a region index follows the code).
String id = "your_id";
// MD5 of the id as a 32-character hex string
String md5 = org.apache.commons.codec.digest.DigestUtils.md5Hex(id);
BigInteger x1 = new BigInteger(md5, 16);
BigInteger x2 = new BigInteger("7FFFFFFF", 16);
// keep only the low 31 bits so the hash falls in [00000000, 7FFFFFFF]
BigInteger x3 = x1.and(x2);
String rowkey_hash = x3.toString(16);
// left-pad to 8 characters so keys sort like the split boundaries
while (rowkey_hash.length() < 8) {
  rowkey_hash = "0" + rowkey_hash;
}
String rowkey = rowkey_hash + "_" + id;
System.out.println(rowkey);
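As a sanity check, the 8-character prefix can be mapped back to a region index for the 60-region layout above; this is for illustration only and is not an HBase API:
import java.math.BigInteger;

// Illustration only: estimate which of the 60 pre-split regions (0..59) a
// rowkey with the given hash prefix falls into. Boundaries are multiples of
// MAXMD5_INT / 60, so the index is an integer division; values past the last
// boundary land in the final region (it "gets all the slop").
public class RegionForRowkey {
  public static void main(String[] args) {
    String rowkeyHash = "02222221";   // hypothetical prefix produced by the code above
    BigInteger maxMd5 = new BigInteger("7FFFFFFF", 16);
    int n = 60;
    BigInteger sizeOfEachSplit = maxMd5.divide(BigInteger.valueOf(n));
    int region = new BigInteger(rowkeyHash, 16).divide(sizeOfEachSplit).intValue();
    if (region > n - 1) {
      region = n - 1;                 // clamp into the last region
    }
    System.out.println("region index: " + region);   // prints 0 for this prefix
  }
}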
Scenario
- Data Size: 4 TB
- HBase Cluster: 10 nodes
- hbase.hregion.max.filesize: 2 GB
- Cluster Region Count: 4 TB / 2 GB = 2048 regions
- Node Region Count: 2048 / 10 ≈ 205 regions per node (see the sketch below)
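The same sizing arithmetic as a tiny sketch, so other data sizes or filesize settings can be plugged in:
// The scenario's sizing arithmetic, parameterized. Values mirror the list
// above: 4 TB of data, 2 GB hbase.hregion.max.filesize, 10 nodes.
public class RegionSizing {
  public static void main(String[] args) {
    long dataSizeGb = 4L * 1024;   // 4 TB expressed in GB
    long maxFileSizeGb = 2;        // hbase.hregion.max.filesize = 2 GB
    int nodes = 10;

    long clusterRegions = dataSizeGb / maxFileSizeGb;          // 2048
    double regionsPerNode = (double) clusterRegions / nodes;   // 204.8

    System.out.println("cluster regions:  " + clusterRegions);
    System.out.println("regions per node: " + regionsPerNode);
  }
}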