Abstract: In recent years, with the development of deep learning techniques, convolutional neural networks (CNNs) and Transformers have made significant progress in image super-resolution (SR). However, to extract the global features of an image, existing methods typically stack individual operators and repeat the computation to gradually expand the receptive field. To better utilize global information, this study proposes explicitly modeling local, regional, and global features. Specifically, the local, regional-local, and global-regional information of an image is extracted and fused hierarchically and progressively through three components: a channel attention-enhanced convolution, a dual-branch parallel architecture consisting of a window-based Transformer and a CNN, and a dual-branch parallel architecture consisting of a standard Transformer and a window-based Transformer. In addition, a hierarchical feature fusion method is designed to fuse the local information extracted by the CNN branch with the regional information extracted by the window-based Transformer branch. Extensive experiments show that the proposed network achieves better results on lightweight SR. For example, in 4× upscaling experiments on the Manga109 dataset, the proposed network improves the peak signal-to-noise ratio (PSNR) by 0.51 dB over SwinIR.
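
The dual-branch design described above can be illustrated with a minimal sketch. The following PyTorch-style code (PyTorch itself is an assumption; the abstract does not name a framework) shows one way a CNN branch for local features and a window-based self-attention branch for regional features could run in parallel and be fused. All module names, the 1×1 fusion convolution, and the dimensions are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of a dual-branch block: a CNN branch (local scope) in
# parallel with a window-based self-attention branch (regional scope),
# followed by a simple fusion. Illustrative only; not the paper's code.
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """Multi-head self-attention restricted to non-overlapping windows."""
    def __init__(self, dim, window_size=8, num_heads=4):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, C, H, W); H and W divisible by window_size
        B, C, H, W = x.shape
        w = self.window_size
        # Partition the feature map into (H/w * W/w) windows of w*w tokens.
        x = x.view(B, C, H // w, w, W // w, w)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, w * w, C)
        x, _ = self.attn(x, x, x)  # attention computed within each window
        # Reverse the window partition back to (B, C, H, W).
        x = x.view(B, H // w, W // w, w, w, C)
        x = x.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
        return x

class DualBranchBlock(nn.Module):
    """Parallel CNN (local) and window-attention (regional) branches."""
    def __init__(self, dim):
        super().__init__()
        self.cnn_branch = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1),
        )
        self.attn_branch = WindowAttention(dim)
        # A 1x1 convolution is one simple (assumed) way to fuse the branches.
        self.fuse = nn.Conv2d(2 * dim, dim, 1)

    def forward(self, x):
        local = self.cnn_branch(x)        # local features
        regional = self.attn_branch(x)    # regional features
        return x + self.fuse(torch.cat([local, regional], dim=1))

if __name__ == "__main__":
    block = DualBranchBlock(dim=32)
    out = block(torch.randn(1, 32, 64, 64))
    print(out.shape)  # torch.Size([1, 32, 64, 64])
```

In this sketch the two branches see the same input and their outputs are concatenated and projected back to the original channel count; the abstract's hierarchical feature fusion between the CNN and window-based Transformer branches would replace the plain 1×1 convolution used here.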