Abstract: To address the difficulty of mapping between the text and image modalities in high-dimensional space, this study proposes a generative adversarial network (GAN) for text-to-image generation that is built on a stacked architecture and takes global sentence vectors as input. The network incorporates a dual attention mechanism to better fuse features along the spatial and channel dimensions. In addition, a fidelity loss is added to the discriminator as a further constraint. The proposed method is evaluated on the Caltech-UCSD Birds (CUB) dataset, with Inception Score and SSIM as the evaluation metrics. The results show that the generated images exhibit more realistic texture detail, and their visual appearance is closer to that of real images.
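To make the dual attention idea concrete, the following PyTorch sketch shows one common way to combine a channel-attention branch with a spatial-attention branch on a generator feature map. This is only an illustrative, CBAM-style design under assumed hyperparameters (reduction ratio, kernel size); the paper's actual module and its placement in the stacked generator may differ.

```python
# Illustrative sketch (not the paper's exact module): dual attention that refines
# a feature map along the channel dimension and then the spatial dimension.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel gating (assumed design)."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                      # (B, C, H, W) -> (B, C, 1, 1)
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),                                 # per-channel weights in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.mlp(x)                            # reweight each channel


class SpatialAttention(nn.Module):
    """Spatial gating from pooled channel statistics (assumed design)."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg_map = x.mean(dim=1, keepdim=True)             # (B, 1, H, W)
        max_map = x.amax(dim=1, keepdim=True)             # (B, 1, H, W)
        attn = self.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn                                   # reweight each spatial location


class DualAttention(nn.Module):
    """Channel attention followed by spatial attention."""
    def __init__(self, channels: int):
        super().__init__()
        self.channel_attn = ChannelAttention(channels)
        self.spatial_attn = SpatialAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.spatial_attn(self.channel_attn(x))


if __name__ == "__main__":
    feats = torch.randn(4, 64, 32, 32)                    # e.g. intermediate generator features
    print(DualAttention(64)(feats).shape)                 # torch.Size([4, 64, 32, 32])
```

The attention-refined feature map keeps the same shape as its input, so a block like this can be dropped between stages of a stacked generator without changing the surrounding architecture.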