Abstract: As voice conversion technology becomes increasingly prevalent in human-computer interaction, the demand for highly expressive speech continues to grow. Current voice conversion approaches rely primarily on decoupling acoustic features, emphasizing the separation of content and timbre, but they often neglect the emotional features of speech, resulting in converted audio with insufficient emotional expressiveness. To address this problem, this study introduces a novel model for highly expressive voice conversion with multiple mutual information constraints (MMIC-EVC). In addition to decoupling content and timbre features, the model incorporates an expressiveness module that captures discourse-level prosody and rhythm features, enabling the conveyance of emotion. It constrains each encoder to focus on its own acoustic embedding by minimizing variational upper bounds on the mutual information between features. Experiments on the CSTR-VCTK and ESD speech datasets show that audio converted by the proposed model achieves a mean opinion score of 3.78 for naturalness and a mel-cepstral distortion of 5.39 dB, significantly outperforming the baseline models in the best-worst sensitivity test. The MMIC-EVC model effectively decouples rhythmic and prosodic features, facilitating highly expressive voice conversion and thereby providing a more natural user experience in human-computer interaction.
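The core mechanism named in the abstract, constraining each encoder by minimizing a variational upper bound on the mutual information between decoupled features, can be illustrated with a minimal sketch. The code below assumes a CLUB-style estimator in PyTorch; the class name, network sizes, and the way it is wired into a training loss are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class MIUpperBound(nn.Module):
    """Sketch of a variational (CLUB-style) upper bound on I(X; Y).

    A variational network q(y|x), here a diagonal Gaussian parameterised by
    small MLPs, approximates the conditional p(y|x). The bound is the gap
    between the conditional log-likelihood on matched pairs and on shuffled
    (marginal) pairs. Constants of the Gaussian log-density are dropped.
    """

    def __init__(self, x_dim: int, y_dim: int, hidden: int = 256):
        super().__init__()
        self.mu_net = nn.Sequential(
            nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim)
        )
        self.logvar_net = nn.Sequential(
            nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim), nn.Tanh()
        )

    def log_likelihood(self, x, y):
        # log q(y|x) under the diagonal Gaussian, summed over feature dims
        mu, logvar = self.mu_net(x), self.logvar_net(x)
        return (-(y - mu) ** 2 / logvar.exp() - logvar).sum(dim=-1)

    def mi_upper_bound(self, x, y):
        # positive term: matched (x, y) pairs; negative term: y shuffled
        # within the batch to approximate samples from the marginals
        positive = self.log_likelihood(x, y)
        negative = self.log_likelihood(x, y[torch.randperm(y.size(0))])
        return (positive - negative).mean()


# Hypothetical usage: penalise shared information between two embeddings,
# e.g. content and timbre, alongside a reconstruction objective.
# estimator = MIUpperBound(x_dim=128, y_dim=128)
# mi_loss = estimator.mi_upper_bound(content_emb, timbre_emb)
# total_loss = reconstruction_loss + lambda_mi * mi_loss
```

In practice such an estimator is usually trained in two alternating steps, fitting q(y|x) by maximum likelihood on matched pairs and then minimizing the resulting bound with respect to the encoders; applying one estimator per feature pair would yield the multiple mutual information constraints the abstract refers to.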