Abstract: The traditional Variational AutoEncoder (VAE) takes the flattened sample directly as its input. When the samples are images, learning in this way is weak. In this study, a VAE with convolution optimization is proposed, in which image data are preprocessed by multiple convolution networks with varying numbers of layers. Each convolution network processes the input data with its own parameter settings, and the results of the different layers are then concatenated to form the input of the VAE. Clustering is implemented by adding a category encoder and calculating the distance between the category label distribution of the original dataset and the category distribution of each sample. The experimental results show that, compared with the VAE without this optimization, the proposed convolution optimization method improves clustering accuracy, raises the quality of the generated images, and increases the diversity of the generated samples in edge and shape.
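The preprocessing idea can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the 8x8 toy image, the fixed averaging kernels, the ReLU activation, and the helper names (`conv2d_valid`, `branch`, `multi_branch_features`, `box_kernel`) are all assumptions made for illustration, and a real model would learn its kernels. The sketch only shows how several convolution branches with different depths and kernel sizes can each process the same image, with their flattened outputs concatenated into one vector that replaces the flattened raw image as the VAE input.

```python
# Illustrative sketch only: kernels, sizes, and names are assumptions,
# not the settings used in the paper.

def conv2d_valid(image, kernel):
    """Stride-1 2D cross-correlation without padding, on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def relu(feature_map):
    """Element-wise ReLU (an assumed activation)."""
    return [[max(0.0, v) for v in row] for row in feature_map]

def branch(image, kernels):
    """One convolution network: a stack of conv + ReLU layers, then flatten."""
    x = image
    for k in kernels:
        x = relu(conv2d_valid(x, k))
    return [v for row in x for v in row]

def multi_branch_features(image, branches):
    """Run every branch on the same image and concatenate the flat outputs."""
    features = []
    for kernels in branches:
        features.extend(branch(image, kernels))
    return features

def box_kernel(size):
    """Fixed averaging kernel standing in for a learned one."""
    return [[1.0 / (size * size)] * size for _ in range(size)]

# Toy 8x8 "image" with a simple repeating pattern.
image = [[float((i + j) % 4) for j in range(8)] for i in range(8)]

# Three branches with different depths and kernel sizes.
branches = [
    [box_kernel(3)],                 # 1 layer,  3x3: 8x8 -> 6x6 (36 values)
    [box_kernel(3), box_kernel(3)],  # 2 layers, 3x3: 8x8 -> 6x6 -> 4x4 (16 values)
    [box_kernel(5)],                 # 1 layer,  5x5: 8x8 -> 4x4 (16 values)
]

vae_input = multi_branch_features(image, branches)
print(len(vae_input))  # 68 = 36 + 16 + 16
```

In this toy setup the concatenated vector has 68 entries, compared with 64 for the flattened raw image; the difference that matters is that each entry now summarizes a local neighbourhood at a particular scale rather than a single pixel.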