Abstract: In massive data retrieval applications, hashing-based approximate nearest neighbor (ANN) search has become popular because of its computational and memory efficiency for online search. The semi-supervised hashing (SSH) framework minimizes empirical error over the labeled set and applies an information-theoretic regularizer over both the labeled and unlabeled sets. However, training the hash functions in this framework is slow because the training process is large-scale and complex. HAMA is a parallel framework built on top of Hadoop and based on the Bulk Synchronous Parallel (BSP) model. In this paper, we analyze the calculation of the adjusted covariance matrix in the SSH training process, split it into two parts (an unsupervised data variance part and a supervised pairwise labeled data part), and explore its parallelization on HAMA. Experiments show the performance and scalability of the approach on general-purpose commercial hardware and network environments.
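The two-part split described above can be illustrated with a minimal sketch. It assumes the adjusted covariance matrix has the form commonly given in the SSH literature, M = X_l S X_l^T + eta * X X^T, which the abstract itself does not state. The BSP supersteps are only simulated in plain Python; every function name below is hypothetical and none of it is HAMA API.

```python
# Illustrative sketch only: simulates BSP-style supersteps in plain Python.
# Assumed form: M = X_l S X_l^T + eta * X X^T (not stated in the abstract).
# All names are hypothetical; nothing here is part of HAMA.

def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def mat_add(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def mat_scale(a, c):
    """Multiply every entry of matrix a by scalar c."""
    return [[c * x for x in row] for row in a]

def zeros(d):
    """d x d matrix of zeros."""
    return [[0.0] * d for _ in range(d)]

def partition(items, n_parts):
    """Round-robin split of items into n_parts shards (some may be empty)."""
    return [items[k::n_parts] for k in range(n_parts)]

def local_variance_part(shard, d):
    """Unsupervised part on one worker: sum of x x^T over its data shard."""
    acc = zeros(d)
    for x in shard:
        acc = mat_add(acc, outer(x, x))
    return acc

def local_pairwise_part(pair_shard, points, d):
    """Supervised part on one worker: sum of S_ij * x_i x_j^T over its pairs."""
    acc = zeros(d)
    for i, j, s in pair_shard:
        acc = mat_add(acc, mat_scale(outer(points[i], points[j]), s))
    return acc

def adjusted_covariance(points, labeled_pairs, eta, n_workers=2):
    """Combine per-worker partial sums into the adjusted covariance matrix.

    labeled_pairs lists the nonzero entries of S as (i, j, S_ij) triples,
    so a symmetric S needs both (i, j, s) and (j, i, s).
    """
    d = len(points[0])
    # Superstep 1: every worker computes partial sums on its own shard.
    var_partials = [local_variance_part(shard, d)
                    for shard in partition(points, n_workers)]
    pair_partials = [local_pairwise_part(shard, points, d)
                     for shard in partition(labeled_pairs, n_workers)]
    # Barrier, then superstep 2: one worker reduces the partial sums.
    var_total = zeros(d)
    for p in var_partials:
        var_total = mat_add(var_total, p)
    pair_total = zeros(d)
    for p in pair_partials:
        pair_total = mat_add(pair_total, p)
    return mat_add(pair_total, mat_scale(var_total, eta))
```

Because both parts are plain sums of outer products, the partial results are independent of how the data is sharded, so the reduction step only needs to add D x D matrices. For example, with points `[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]`, pairs `[(0, 2, 1), (2, 0, 1), (0, 1, -1), (1, 0, -1)]`, and `eta=0.5`, the function returns `[[3.0, 0.5], [0.5, 1.0]]` for any worker count.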