Abstract: As XML is used more and more widely for data exchange and integration, the quality of XML data has become a growing concern. Entity Resolution (ER) is critical to overcoming the problems caused by poor data quality. To address the deficiencies of current methods and to perform entity resolution efficiently and effectively on massive XML data sets, this paper presents an ER-based duplicate detection algorithm for XML data on the Hadoop platform. The method describes each entity by its attributes; by comparing these attributes, all objects that share the same attributes can be found quickly. The method also takes advantage of the Hadoop platform's ability to process massive data in parallel. Experiments show that the method achieves excellent scalability, flexibility, and efficiency.
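
The idea of grouping XML entities by their attributes can be illustrated with a minimal single-machine sketch in Python. This is not the paper's algorithm and it does not use Hadoop: the function names (`map_phase`, `reduce_phase`, `find_duplicates`), the `person` element, the `id` attribute, and the exact-match rule on normalized child values are all assumptions made for illustration. The map step emits an attribute signature per entity, and the reduce step groups entities that share a signature, mirroring the key/value flow of a MapReduce job.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

# Hypothetical sample data; element and attribute names are assumptions.
SAMPLE = """<people>
  <person id="p1"><name>Ann Lee</name><city>Oslo</city></person>
  <person id="p2"><name>ann lee</name><city>Oslo </city></person>
  <person id="p3"><name>Bob Ray</name><city>Rome</city></person>
</people>"""


def map_phase(xml_text, entity_tag):
    """Map step: emit (attribute signature, entity id) for each entity."""
    root = ET.fromstring(xml_text)
    for elem in root.iter(entity_tag):
        # The signature is the sorted set of (child tag, normalized text)
        # pairs, so child order, case, and surrounding spaces are ignored.
        signature = tuple(sorted(
            (child.tag, (child.text or "").strip().lower())
            for child in elem
        ))
        yield signature, elem.get("id")


def reduce_phase(pairs):
    """Reduce step: group entity ids by signature, keep groups of 2 or more."""
    groups = defaultdict(list)
    for signature, entity_id in pairs:
        groups[signature].append(entity_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


def find_duplicates(xml_text, entity_tag="person"):
    """Run the map and reduce steps in-process on one XML document."""
    return reduce_phase(map_phase(xml_text, entity_tag))


if __name__ == "__main__":
    print(find_duplicates(SAMPLE))
```

On a real cluster, the map and reduce steps would run as distributed tasks over partitions of the XML data, with the framework handling the grouping by key; exact matching on normalized values is the simplest possible comparison rule and stands in for whatever attribute comparison the method actually applies.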