Abstract: In the all-media era, recommendation based on multimodal data is of great significance. This study proposes a recommendation method based on data in three modalities: text, audio, and image. Tensor fusion is implemented in two stages: in the first stage, the correlation between each pair of modalities is modeled and fused by three parallel branches; in the second stage, the outputs of the three branches are fused. This approach not only captures the local interaction between each pair of modalities but also eliminates the influence of the modality fusion order on the result. In the recommendation module, the fused features are input to a stacked denoising autoencoder and then used as auxiliary features for collaborative filtering. The constructed recommendation system adopts an end-to-end training process for modality fusion and recommendation. Moreover, to overcome the high similarity and poor diversity of recommendation results, this study also constructs a similarity matrix from the two-stage tensor-fused features to further refine the available recommendation results, thereby achieving rapid diversified recommendation. The experimental results show that the recommendation model based on the proposed multimodal fused features not only effectively improves recommendation performance but also enhances the diversity of the recommendation results.
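The two-stage fusion described above can be illustrated with a minimal sketch. This is an assumption-laden toy example, not the paper's implementation: feature dimensions are illustrative, pairwise fusion is approximated by an outer product, and stage-two fusion by concatenation (which is symmetric over the three branches, so no pairwise fusion order is privileged).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-item feature vectors for the three modalities
# (dimension 8 is illustrative, not taken from the paper).
text = rng.standard_normal(8)
audio = rng.standard_normal(8)
image = rng.standard_normal(8)

def pairwise_fuse(a, b):
    """Stage 1: model the correlation between two modalities
    via an outer product, flattened to a vector."""
    return np.outer(a, b).ravel()

# Three parallel branches, one per modality pair.
branches = [
    pairwise_fuse(text, audio),
    pairwise_fuse(audio, image),
    pairwise_fuse(text, image),
]

# Stage 2: fuse the three branch outputs; concatenating all three
# pairwise results at once makes the outcome independent of any
# particular fusion order.
fused = np.concatenate(branches)
print(fused.shape)  # (192,) = 3 branches x 8*8 per branch
```

In a full system, `fused` would then be fed to the stacked denoising autoencoder whose output serves as the auxiliary feature for collaborative filtering.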