Abstract: Lattice quantum chromodynamics (LQCD) is a non-perturbative method for studying the low-energy strong interactions between quarks and gluons. The statistical and systematic uncertainties of LQCD results are, in principle, all under control and can be reduced steadily. In LQCD, a larger lattice volume allows physical processes to be computed over a larger region of space, and a finer subdivision of that space yields more accurate results. Large-scale LQCD calculations are therefore of great significance for the study of QCD, but they place heavy demands on program performance. In this work, we study the large-scale parallelization and performance optimization of an LQCD program for configuration generation and glueball measurement. Building on the blocking and even-odd algorithms used in LQCD simulation, we design a parallel algorithm based on MPI and OpenMP, together with an optimized data communication module. To address the bottleneck in configuration file output, we propose a parallel output scheme for configuration files. The simulation programs are tested and analyzed on an Intel KNL platform and on the x86_64 queues of the “Tianhe 2” supercomputer. The results verify the effectiveness of these optimizations, and the parallel efficiency of the simulation is also analyzed. The largest test uses 1728 nodes (i.e. 41 472 CPU cores).
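As an illustration of the even-odd idea mentioned above, the following is a minimal Python sketch of the standard checkerboard decomposition, not the implementation used in this work. It assumes a periodic four-dimensional lattice with even extents in every direction; the function names `site_parity`, `split_even_odd`, and `neighbors` and the lattice size are hypothetical choices made for the example. A site is called even or odd according to the parity of the sum of its coordinates, so every nearest neighbor of an even site is odd. Sites of one parity can therefore be updated concurrently while the other parity is held fixed.

```python
import itertools


def site_parity(coords):
    """Return 0 for an even site, 1 for an odd site (sum of coordinates mod 2)."""
    return sum(coords) % 2


def split_even_odd(dims):
    """Partition all lattice sites into even and odd sublattices."""
    even, odd = [], []
    for site in itertools.product(*(range(n) for n in dims)):
        (even if site_parity(site) == 0 else odd).append(site)
    return even, odd


def neighbors(site, dims):
    """Nearest neighbors of a site with periodic boundary conditions."""
    result = []
    for mu, n in enumerate(dims):
        for step in (1, -1):
            shifted = list(site)
            shifted[mu] = (shifted[mu] + step) % n  # periodic wrap-around
            result.append(tuple(shifted))
    return result


# Hypothetical small lattice; even extents keep the parity rule valid
# across the periodic boundary.
dims = (4, 4, 4, 4)
even_sites, odd_sites = split_even_odd(dims)

# Every neighbor of an even site is odd, so the even sublattice can be
# updated in parallel, followed by the odd sublattice.
even_is_independent = all(
    site_parity(nb) == 1 for s in even_sites for nb in neighbors(s, dims)
)
```

In a distributed setting, each process would hold one block of the lattice and exchange boundary sites with its neighbors between the even and odd half-sweeps; that communication step is omitted here.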