Abstract: In Spark, the cached data associated with a checkpoint must be cleaned up manually by the programmer after the job completes; if it is not, stale cached data accumulates in memory. This study analyzes the execution mechanism of the checkpoint and shows that the existing checkpoint cache cleaning method does not scale as the number of checkpoints increases. A utility entropy model of the checkpoint cache is used to measure the matching degree between the checkpoint cache and memory slots, and the optimal checkpoint cache cleaning time is derived from the principle of best utility matching. The resulting PCC strategy, based on utility entropy, optimizes memory resources by making the checkpoint cache cleanup time approximately coincide with the time at which the checkpoint is written to HDFS. Experimental results show that in a multi-job execution environment under fair scheduling, the execution efficiency of the unoptimized program deteriorates as the number of checkpoints grows, whereas with the PCC strategy the program execution time, power consumption, and GC time are reduced by 10.1%, 9.5%, and 19.5%, respectively, effectively improving the execution efficiency of multi-checkpoint programs.
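The manual cleanup pattern that motivates this work can be illustrated with a minimal Spark sketch (an illustrative example only, not the paper's implementation; the paths and RDD names are hypothetical). After checkpoint() materializes the data to HDFS, the cached copy remains in executor memory unless the programmer releases it explicitly with unpersist().

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    object CheckpointCleanupSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("checkpoint-cleanup-sketch").getOrCreate()
        val sc = spark.sparkContext
        sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // checkpoint data is written to HDFS

        // Hypothetical job: cache before checkpointing so the lineage is not
        // recomputed when the checkpoint is materialized.
        val data = sc.textFile("hdfs:///input/data.txt")
          .map(_.split(",")(0))
          .persist(StorageLevel.MEMORY_ONLY)

        data.checkpoint()   // mark the RDD for checkpointing
        data.count()        // an action triggers the write to HDFS

        // Without this explicit call, the cached blocks stay in executor memory
        // even though the checkpoint file on HDFS has already superseded them.
        data.unpersist()

        spark.stop()
      }
    }

The PCC strategy targets exactly this gap: instead of relying on an explicit unpersist() placed by the programmer, the cleanup is timed to coincide with the completion of the checkpoint write to HDFS.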