The distributed database HBase has a clear advantage over traditional relational databases when loading large-scale data, but considerable room for optimization remains. We build an HBase environment on the Hadoop distributed platform and optimize a custom (self-defined) data loading algorithm. First, this paper analyzes HBase's underlying data storage, and experiments show that HBase's existing data loading methods fall short in efficiency and flexibility. It then proposes a custom parallel data loading algorithm and optimizes the cluster. The experimental results show that the optimized custom parallel data loading method makes full use of cluster performance and achieves good loading efficiency and data operation capability.