Our manuscript presents five key contributions that advance the field of multimodal fusion learning:
- Universal Skew-Pair Fusion Theory: We introduce a novel framework that closes the gaps in fusion-pair alignment and sparsity assignment by disentangling the dual cross-modal heterogeneity paradigm.
- Interpretable Mechanism: We propose a holistic framework that formalizes a dual interpretable mechanism, comprising universal skew-layer alignment and bootstrapping sparsity, to enhance fusion gain in hybrid neural networks.
- Comprehensive Validation: Our extensive experiments across multiple fusion tasks, spanning text-audio, audio-video, image-text, and text-text fusion, demonstrate the empirical advantages of our approach over conventional late- and pairwise-fusion strategies.
- Sparsest Skew-Pair Fusion Network (SSFN): We develop SSFN, a lightweight neural network that outperforms its late- and pairwise-fusion counterparts, even in seemingly “unimodal” fusion scenarios such as text-text fusion (see the illustrative sketch after this list).
- Broader Implications: Our bioinspired framework has the potential to serve as a benchmark for reframing the multidisciplinary perspective on multimodal fusion and multisensory integration, with implications for a broad interdisciplinary audience.
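To make the contrast between late fusion and sparse pairwise fusion concrete, the following is a minimal PyTorch sketch, not the SSFN architecture itself: a `LateFusion` baseline concatenates unimodal features before classification, while a hypothetical `SparsePairwiseFusion` module forms all cross-modal feature interactions and retains only the `k` strongest, a crude stand-in for the sparsity assignment discussed above. All module and parameter names (`LateFusion`, `SparsePairwiseFusion`, `d`, `k`) are illustrative assumptions, not identifiers from the manuscript or its codebase.

```python
# Hedged sketch only: illustrates late fusion vs. a sparse pairwise fusion,
# not the SSFN architecture or training procedure from the manuscript.
import torch
import torch.nn as nn


class LateFusion(nn.Module):
    """Late-fusion baseline: concatenate unimodal features, then classify."""

    def __init__(self, d: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(2 * d, num_classes)

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        return self.classifier(torch.cat([x_a, x_b], dim=-1))


class SparsePairwiseFusion(nn.Module):
    """Hypothetical pairwise fusion: form all cross-modal feature
    interactions and keep only the k strongest (a simple stand-in for
    sparsity assignment)."""

    def __init__(self, d: int, num_classes: int, k: int = 8):
        super().__init__()
        self.k = k
        self.classifier = nn.Linear(k, num_classes)

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        # Outer-product-style pairwise interactions: (batch, d, d)
        pair = x_a.unsqueeze(-1) * x_b.unsqueeze(-2)
        flat = pair.flatten(start_dim=1)      # (batch, d * d)
        topk, _ = flat.topk(self.k, dim=-1)   # retain the k strongest pairs
        return self.classifier(topk)


if __name__ == "__main__":
    x_a, x_b = torch.randn(4, 16), torch.randn(4, 16)
    print(LateFusion(16, 2)(x_a, x_b).shape)            # torch.Size([4, 2])
    print(SparsePairwiseFusion(16, 2)(x_a, x_b).shape)  # torch.Size([4, 2])
```

Retaining only the top-k interactions is one simple way to keep the quadratic pairwise interaction space tractable; it is meant purely as an illustration of why sparsity assignment matters when moving beyond late fusion.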
The datasets used in this study are available as follows:
- Text-Audio fusion experiments: the Interactive Emotional Dyadic Motion Capture (IEMOCAP) Database (https://sail.usc.edu/iemocap/iemocap_publication.htm).
- Audio-Video fusion experiments: the AVE (Audio-Visual Event Localization) dataset (https://github.com/YashNita/Audio-Visual-Event-Localization-in-Unconstrained-Videos).
- Image-Text fusion experiments: the Hateful Memes dataset (https://github.com/facebookresearch/mmf).
- Text-Text fusion experiments: the C4EL parallel corpus (https://github.com/Computational-social-science/C4EL).
This work is licensed under a CC BY 4.0 License.