Abstract: To improve the realism of Chinese lip-synchronized facial animation videos, this study proposes a text- and audio-driven facial animation generation method based on an improved Wav2Lip model. First, a Chinese lip-sync dataset is constructed and used to pre-train the lip-sync discriminator, making it more accurate at judging lip synchronization in Chinese facial animations. Then, text features are introduced into the Wav2Lip model to improve the temporal synchronization of lip movements and thereby the realism of the generated videos. The proposed model fuses the extracted text features, audio features, and speaker facial features, and generates highly realistic lip-synchronized facial animation videos under the supervision of the pre-trained lip-sync discriminator and a video quality discriminator. Comparative experiments against the ATVGnet and Wav2Lip models show that the videos generated by the proposed model achieve better synchronization between lip shape and audio and greater overall realism. This paper thus provides a practical solution for facial animation generation.
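To make the fusion step concrete, the sketch below shows one way a Wav2Lip-style generator could combine text, audio, and face embeddings before decoding the mouth region. It is an illustrative assumption, not the paper's published architecture: the encoder names (`TextEncoder`-style branches), layer sizes, and concatenation-based fusion are all hypothetical choices; only the 96x96 face crops and mel-spectrogram audio chunks follow the original Wav2Lip convention.

```python
# Illustrative sketch only: the paper does not publish code, so the encoder
# shapes, layer sizes, and fusion-by-concatenation below are assumptions,
# not the authors' actual architecture.
import torch
import torch.nn as nn

class FusionGenerator(nn.Module):
    """Wav2Lip-style generator extended with a text branch (hypothetical)."""
    def __init__(self, text_dim=128, audio_dim=256, face_dim=512):
        super().__init__()
        # Each encoder maps its modality to a fixed-length embedding.
        self.text_enc = nn.Sequential(nn.Linear(300, text_dim), nn.ReLU())
        self.audio_enc = nn.Sequential(nn.Linear(80 * 16, audio_dim), nn.ReLU())
        self.face_enc = nn.Sequential(nn.Linear(96 * 96 * 6, face_dim), nn.ReLU())
        # The fused embedding is decoded into the lower-face (mouth) region.
        self.decoder = nn.Sequential(
            nn.Linear(text_dim + audio_dim + face_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 96 * 96 * 3), nn.Sigmoid(),
        )

    def forward(self, text_feat, mel_chunk, face_frames):
        # Fuse the three modalities by concatenating their embeddings.
        z = torch.cat([
            self.text_enc(text_feat),
            self.audio_enc(mel_chunk.flatten(1)),
            self.face_enc(face_frames.flatten(1)),
        ], dim=1)
        return self.decoder(z).view(-1, 3, 96, 96)

# Toy forward pass: batch of 2, 300-dim text vectors (assumed), 80x16 mel
# chunks, and 96x96 reference + masked face pairs (6 channels), as in Wav2Lip.
gen = FusionGenerator()
out = gen(torch.randn(2, 300), torch.randn(2, 80, 16), torch.randn(2, 6, 96, 96))
print(out.shape)  # torch.Size([2, 3, 96, 96])
```

In training, the decoded frames would then be scored by the pre-trained lip-sync discriminator and the video quality discriminator described in the abstract, with their losses backpropagated into all three branches.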