Skew-pair fusion theory: An interpretable multimodal fusion framework

Our manuscript presents five key findings that advance the field of multimodal fusion learning:

  1. Universal Skew-Pair Fusion Theory: We introduce a novel framework that addresses the gaps in fusion-pair alignment and sparsity assignment by disentangling a dual cross-modal heterogeneity paradigm.
  2. Interpretable Mechanism: We propose a holistic framework that formalizes a dual interpretable mechanism, comprising universal skew-layer alignment and bootstrapping sparsity, to enhance fusion gain in hybrid neural networks (a minimal illustrative sketch follows this list).
  3. Comprehensive Validation: Our extensive experiments across multiple fusion tasks, spanning text-audio, audio-video, image-text, and text-text fusion, demonstrate the empirical advantages of our approach over conventional late- and pairwise-fusion strategies.
  4. Sparsest Skew-Pair Fusion Network (SSFN): We develop SSFN, a lightweight neural network that outperforms its late- and pairwise-fusion counterparts, even in seemingly “unimodal” fusion scenarios such as text-text fusion.
  5. Broader Implications: Our bioinspired framework has the potential to serve as a benchmark for reframing the multidisciplinary perspective on multimodal fusion and multisensory integration, with implications for a broader audience.
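
For orientation only, the following is a minimal, hypothetical PyTorch sketch of the two general ideas named in finding 2: per-layer cross-modal alignment projections, plus a sparse gate that lets only a few candidate layer pairs contribute to the fused representation. Every name and shape here (`PairAlignFusion`, `k_active`, the top-k gate) is an assumption made for illustration; this is not the SSFN architecture or the code released with the manuscript.

```python
# Illustrative sketch only: a generic two-stream fusion block combining
# (a) per-pair alignment projections into a shared space and
# (b) a learnable sparse gate over candidate layer pairs.
# Names, shapes, and the top-k gate are hypothetical, not the authors' SSFN.
import torch
import torch.nn as nn


class PairAlignFusion(nn.Module):
    def __init__(self, dim_a: int, dim_b: int, dim_fused: int,
                 n_pairs: int, k_active: int):
        super().__init__()
        # One alignment projection per candidate (layer_a, layer_b) pair.
        self.align_a = nn.ModuleList([nn.Linear(dim_a, dim_fused) for _ in range(n_pairs)])
        self.align_b = nn.ModuleList([nn.Linear(dim_b, dim_fused) for _ in range(n_pairs)])
        # Gate logits over candidate pairs; only the top-k pairs are kept (sparsity).
        self.gate_logits = nn.Parameter(torch.zeros(n_pairs))
        self.k_active = k_active

    def forward(self, feats_a, feats_b):
        # feats_a, feats_b: lists of per-layer features, one (batch, dim) tensor per pair.
        scores = torch.softmax(self.gate_logits, dim=0)
        active = torch.topk(scores, self.k_active).indices
        fused = 0.0
        for i in active.tolist():
            # Align both streams into the shared space, then combine with the gate weight.
            fused = fused + scores[i] * (self.align_a[i](feats_a[i]) + self.align_b[i](feats_b[i]))
        return fused


if __name__ == "__main__":
    # Random features stand in for the per-layer outputs of two modality encoders.
    n_pairs = 4
    model = PairAlignFusion(dim_a=768, dim_b=128, dim_fused=256, n_pairs=n_pairs, k_active=2)
    feats_a = [torch.randn(8, 768) for _ in range(n_pairs)]  # e.g. text encoder layers
    feats_b = [torch.randn(8, 128) for _ in range(n_pairs)]  # e.g. audio encoder layers
    print(model(feats_a, feats_b).shape)  # torch.Size([8, 256])
```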

Data availability

The datasets used in this study are available as follows:

  1. Text-audio fusion experiments: the Interactive Emotional Dyadic Motion Capture (IEMOCAP) Database (https://sail.usc.edu/iemocap/iemocap_publication.htm).
  2. Audio-video fusion experiments: the Audio-Visual Event Localization (AVE) dataset (https://github.com/YashNita/Audio-Visual-Event-Localization-in-Unconstrained-Videos).
  3. Image-text fusion experiments: the Hateful Memes dataset (https://github.com/facebookresearch/mmf).
  4. Text-text fusion experiments: the parallel corpus C4EL (https://github.com/Computational-social-science/C4EL).

License

This work is licensed under a CC BY 4.0 License.
