Our manuscript presents five key contributions that advance the field of multimodal fusion learning:
- Universal Skew-Pair Fusion Theory: We introduce a novel framework that closes the gaps in fusion-pair alignment and sparsity assignment by disentangling the dual cross-modal heterogeneity paradigm.
- Interpretable Mechanism: We propose a holistic framework that formalizes a dual interpretable mechanism, comprising universal skew-layer alignment and bootstrapping sparsity, to enhance fusion gain in hybrid neural networks.
- Comprehensive Validation: Our extensive experiments across multiple fusion tasks, spanning text-audio, audio-video, image-text, and text-text fusion, demonstrate the empirical advantages of our approach over conventional late- and pairwise-fusion strategies.
- Sparsest Skew-Pair Fusion Network (SSFN): We develop SSFN, a lightweight neural network that outperforms its late- and pairwise-fusion counterparts, even in seemingly “unimodal” fusion scenarios such as text-text fusion (see the illustrative sketch after this list).
- Broader Implications: Our bioinspired framework has the potential to serve as a benchmark for reframing the multidisciplinary perspective on multimodal fusion and multisensory integration, with implications for a broad interdisciplinary audience.
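To make the contrast between late fusion and sparse pairwise fusion concrete, the following is a minimal PyTorch sketch, not the SSFN architecture itself: a `LateFusion` baseline concatenates unimodal features before classification, while a hypothetical `SparsePairwiseFusion` module forms all cross-modal feature interactions and retains only the `k` strongest, a crude stand-in for the sparsity assignment discussed above. All module and parameter names (`LateFusion`, `SparsePairwiseFusion`, `d`, `k`) are illustrative assumptions, not identifiers from the manuscript or its codebase.

```python
# Hedged sketch only: illustrates late fusion vs. a sparse pairwise fusion,
# not the SSFN architecture or training procedure from the manuscript.
import torch
import torch.nn as nn


class LateFusion(nn.Module):
    """Late-fusion baseline: concatenate unimodal features, then classify."""

    def __init__(self, d: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(2 * d, num_classes)

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        return self.classifier(torch.cat([x_a, x_b], dim=-1))


class SparsePairwiseFusion(nn.Module):
    """Hypothetical pairwise fusion: form all cross-modal feature
    interactions and keep only the k strongest (a simple stand-in for
    sparsity assignment)."""

    def __init__(self, d: int, num_classes: int, k: int = 8):
        super().__init__()
        self.k = k
        self.classifier = nn.Linear(k, num_classes)

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        # Outer-product-style pairwise interactions: (batch, d, d)
        pair = x_a.unsqueeze(-1) * x_b.unsqueeze(-2)
        flat = pair.flatten(start_dim=1)      # (batch, d * d)
        topk, _ = flat.topk(self.k, dim=-1)   # retain the k strongest pairs
        return self.classifier(topk)


if __name__ == "__main__":
    x_a, x_b = torch.randn(4, 16), torch.randn(4, 16)
    print(LateFusion(16, 2)(x_a, x_b).shape)            # torch.Size([4, 2])
    print(SparsePairwiseFusion(16, 2)(x_a, x_b).shape)  # torch.Size([4, 2])
```

Retaining only the top-k interactions is one simple way to keep the quadratic pairwise interaction space tractable; it is meant purely as an illustration of why sparsity assignment matters when moving beyond late fusion.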
The datasets used in this study are available as follows:
- Text-Audio fusion experiments: the Interactive Emotional Dyadic Motion Capture (IEMOCAP) Database (https://sail.usc.edu/iemocap/iemocap_publication.htm).
- Audio-Video fusion experiments: the AVE (Audio-Visual Event Localization) dataset (https://github.com/YashNita/Audio-Visual-Event-Localization-in-Unconstrained-Videos).
- Image-Text fusion experiments: the Hateful Memes dataset (https://github.com/facebookresearch/mmf).
- Text-Text fusion experiments: the C4EL parallel corpus (https://github.com/Computational-social-science/C4EL).
This work is licensed under a CC BY 4.0 License.