Earlier works in scene graph generation (SGG) such as IMP [1], VTransE [2], and MotifNet [3] achieved high overall recall by innovating the relation reasoning component of the neural network pipeline. However, recent works such as VCTree [4] and KERN [5] pointed out an important caveat to the seemingly high performance brought about by deep neural architectures. Specifically, they highlighted the long-tailed distribution of relation labels in the primary SGG benchmark, the Visual Genome dataset [6]. In fact, sophisticated neural architectures barely outperform a simple frequency baseline that guesses the most common relation for a given object pair [3]. Since then, the SGG research community has focused on rectifying this bias. BPL-SA [7] proposed a simple yet effective method to rebalance the distribution of relation labels in the training dataset, together with a logit adjustment method to further improve SGG performance. GB-Net [8] proposed a dual scene graph-knowledge graph formulation to incorporate external knowledge into the relational reasoning process. Total Direct Effect (TDE) [9] applied counterfactual causal reasoning to separate the contribution of visual features from biased semantic guessing. Causal Property-based Anti-Conflict Modeling (CPAM) [10] models a belief function over relationships instead of a probability distribution. More recently, some approaches have addressed the learning pipeline itself. For example, Iterative Scene Graph Generation [11] conditions each generated scene graph on the previous one using a Markov Random Field, and BGNN [12] uses a bipartite graph neural network for message passing between entities and predicates. BAI [13] uses multi-head attention to learn feature maps between visual features and relation triplets, applies contrastive learning to maximize the separation between predicate classes, and discards outlier training samples based on variance and loss. We continue on this path of unbiased SGG by building on the seminal work of BPL-SA [7]; however, our approach differs significantly from it. In contrast to previous methods, we propose a data-oriented approach that applies semantic and visual augmentation to the tail relations in the training dataset.
Data augmentation has been successfully applied to tasks such as object detection and object localization. In particular, MixUp [14] generates synthetic training examples by combining random pairs of images from the training data: it forms a weighted combination of both the image features and the labels of each pair. CutOut [15] augments and regularizes training images by randomly obscuring square regions of the input, with the goal of enhancing the resilience and overall effectiveness of convolutional neural networks. The primary inspiration behind CutOut is object occlusion, a frequent occurrence in computer vision applications such as object recognition, tracking, and human pose estimation. By generating images that replicate occluded scenarios, CutOut not only prepares the model for real-world occlusions but also encourages it to consider a broader context of the image when making decisions. CutMix [16] also augments image data, but unlike CutOut, which simply removes pixels, CutMix replaces the removed region with a patch extracted from a different image; the ground-truth labels are mixed in proportion to the pixel area contributed by each image. Introducing such patches further improves the localization capability of the model, since it must now identify objects from partial views. Our work is inspired by MixUp and CutMix. However, unlike the visually concrete problems of object detection and object localization, our problem uniquely requires abstract relational reasoning about the visual objects. Furthermore, we apply augmentation not as a general regularization technique but as a targeted remedy for the long-tailed relation-label problem in SGG. To this end, we propose a novel method to upsample the tail relations in scene graphs by augmenting both visual features and semantic labels.
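For concreteness, the following is a minimal sketch of MixUp and CutMix for a batch of images and one-hot labels. The tensor layout, the Beta-distribution parameter `alpha`, and the rectangle-sampling details are illustrative assumptions rather than the exact implementations of the cited papers.

```python
import torch

def mixup(images, labels, alpha=1.0):
    """MixUp: convexly combine random pairs of images and their one-hot labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))            # random pairing within the batch
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]
    return mixed_images, mixed_labels

def cutmix(images, labels, alpha=1.0):
    """CutMix: paste a random rectangle from a partner image and mix labels by area."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    _, _, h, w = images.shape
    # Rectangle whose area is roughly (1 - lam) of the image, centered at a random point.
    cut_h, cut_w = int(h * (1 - lam) ** 0.5), int(w * (1 - lam) ** 0.5)
    cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    images = images.clone()
    images[:, :, y1:y2, x1:x2] = images[perm][:, :, y1:y2, x1:x2]
    # Recompute lam from the actual pasted area so the mixed labels match the pixels.
    lam = 1.0 - ((y2 - y1) * (x2 - x1)) / float(h * w)
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]
    return images, mixed_labels
```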
**Knowledge-based augmentation.** Earlier works have studied incorporating various types of knowledge into the SGG problem; the triplet sample distribution itself can be considered a form of knowledge. GB-Net [8] introduces commonsense knowledge (CSK) into SGG by connecting the scene graph to an external knowledge graph through additional knowledge edges. EB-Net [17] takes further advantage of CSK by performing long-tail logit adjustment in the relation reasoning module. IETrans [18] rebalances the label distribution by transferring samples from general to more informative predicates ("Internal Transfer") and by creating labels for unannotated (NA) pairs whose relations were missed by annotators ("External Transfer").
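To illustrate what long-tail logit adjustment means in this context, below is a generic sketch that shifts predicate logits by the log of each class's empirical frequency so that tail predicates are not swamped by head predicates. The temperature `tau` and the use of raw training counts are illustrative assumptions and do not reproduce any specific method cited above.

```python
import torch

def frequency_adjusted_logits(logits, class_counts, tau=1.0):
    """Generic long-tail logit adjustment for predicate classification.

    logits:        (N, C) raw predicate scores from the relation head
    class_counts:  (C,)   number of training samples per predicate class
    Subtracting tau * log(prior) boosts rare classes and damps frequent ones.
    """
    prior = class_counts.float() / class_counts.sum()    # empirical label prior
    return logits - tau * torch.log(prior + 1e-12)

# Example: three predicate classes with a heavily skewed label distribution.
logits = torch.tensor([[2.0, 0.5, 0.1]])
counts = torch.tensor([10000, 500, 20])
print(frequency_adjusted_logits(logits, counts, tau=1.0))
```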
**Resampling-based augmentation strategies.** Data augmentation for scene graph generation has not yet been extensively studied. Although the Balanced Predicate Learning (BPL) approach [7] is related to data augmentation, it does not directly target tail relations; instead, it rebalances the distribution of relation labels by downsampling triplets with common relation labels. GAN-Neigh [19] considers a different SGG problem: rather than relation labels, it balances the biased triplet distribution to improve zero-shot generalization. Because triplet diversity depends on both objects and relations, GAN-Neigh augments triplets by perturbing the objects. PFRL+A-PFG [20] uses dual VAEs to learn latent features for objects and predicates and resamples the data via bi-level rebalancing. Our motivation and methods differ from both BPL and GAN-Neigh in that we directly address the long-tailed distribution by upsampling the tail relation labels. Dark Knowledge Balance Learning (DKBL) [21] uses dark knowledge (learned activations) to adjust the logits of non-target predicate categories and explicitly balance head and tail samples. TsCM [22] applies two stages of causal modeling to tackle semantic confusion bias and uses factorized semantic confusion to perform logit adjustment. NICE [23] applies out-of-distribution detection to create pseudo-labels for unannotated (background) object-subject pairs, and uses visual-similarity-based clustering to detect and reassign noisy in-distribution labels. DeC [24] uses a conditional variational auto-encoder to disentangle visual features into an object's intrinsic identity features and its relation-dependent state features, then applies a compositional learning strategy to generate additional relation samples from the learned state and identity features. BGNN [12] also applies bi-level data resampling for debiasing. Model-based techniques, however, are typically not transferable across backbone relation reasoning models. We therefore adopt a dataset-oriented resampling approach and use both semantic and visual information to perturb labels and upsample the tail relations.
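As a point of reference for the resampling strategies discussed above, the following is a minimal, method-agnostic sketch of upsampling relation triplets whose predicate falls in the tail of the label distribution. The tail threshold and the square-root repeat factor (in the spirit of repeat-factor sampling) are illustrative assumptions, not the scheme of any cited work.

```python
import math
from collections import Counter

def upsample_tail_triplets(triplets, tail_fraction=0.01):
    """Repeat-factor upsampling of tail predicates.

    triplets: list of (subject, predicate, object) annotations from the training set.
    A triplet whose predicate frequency f is below `tail_fraction` is repeated
    roughly sqrt(tail_fraction / f) times; head triplets are kept once.
    """
    counts = Counter(p for _, p, _ in triplets)
    total = len(triplets)
    augmented = []
    for s, p, o in triplets:
        freq = counts[p] / total
        repeat = max(1, math.ceil(math.sqrt(tail_fraction / freq))) if freq < tail_fraction else 1
        augmented.extend([(s, p, o)] * repeat)
    return augmented

# Example: "on" dominates the distribution, "riding" is in the tail and gets repeated.
data = [("man", "on", "bike")] * 98 + [("man", "riding", "bike")] * 2
print(Counter(p for _, p, _ in upsample_tail_triplets(data, tail_fraction=0.05)))
```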
**Visual augmentation.** Perturbation-based augmentation strategies have also been studied. The GAN component of GAN-Neigh [19] uses a generative adversarial network to create new visual samples in the latent visual embedding space of the relation reasoning model. CV-SGG [25] performs augmentation through visual perturbation, making minor modifications to the positions of a triplet's objects in the image. Most closely related to our work, CFA [26] is motivated by triplet diversity and constructs new visual samples by replacing ROIs with those of different object categories. Our work is distinct from the above visual augmentation approaches: our proposed RelAug technique targets rare relations and borrows existing objects of the same category from elsewhere in the dataset.
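To make the contrast concrete, here is a heavily hedged sketch of the kind of category-preserving ROI borrowing described above: for a triplet with a rare predicate, one endpoint's pooled ROI feature is replaced by the feature of another instance of the same object category taken from a different image. The feature bank, the per-role sampling policy, and the data layout are all illustrative assumptions and are not the exact RelAug procedure.

```python
import random
import torch

def borrow_same_category_roi(triplet, feature_bank, tail_predicates):
    """Swap in a same-category ROI feature for triplets with a tail predicate.

    triplet:         dict with 'subj_feat', 'obj_feat' (D-dim tensors),
                     'subj_label', 'obj_label', and 'predicate'
    feature_bank:    dict mapping object category -> list of pooled ROI features
                     collected from other training images
    tail_predicates: set of rare predicate labels to augment
    """
    if triplet["predicate"] not in tail_predicates:
        return triplet                                   # leave head-relation triplets untouched

    new_triplet = dict(triplet)
    role = random.choice(["subj", "obj"])                # perturb one endpoint of the triplet
    category = triplet[f"{role}_label"]
    candidates = feature_bank.get(category, [])
    if candidates:                                       # borrow a same-category instance
        new_triplet[f"{role}_feat"] = random.choice(candidates).clone()
    return new_triplet

# Usage example with random features (D = 4 for brevity).
bank = {"horse": [torch.randn(4) for _ in range(3)]}
t = {"subj_feat": torch.randn(4), "obj_feat": torch.randn(4),
     "subj_label": "man", "obj_label": "horse", "predicate": "riding"}
print(borrow_same_category_roi(t, bank, tail_predicates={"riding"}))
```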
Footnotes
1. Xu et al., Scene Graph Generation by Iterative Message Passing, CVPR 2017.
2. Zhang et al., Visual Translation Embedding Network for Visual Relation Detection, CVPR 2017.
3. Zellers et al., Neural Motifs: Scene Graph Parsing with Global Context, CVPR 2018.
4. Tang et al., Learning to Compose Dynamic Tree Structures for Visual Contexts, CVPR 2019.
5. Chen et al., Knowledge-Embedded Routing Network for Scene Graph Generation, CVPR 2019.
6. Krishna et al., Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, IJCV 2017.
7. Guo et al., From General to Specific: Informative Scene Graph Generation via Balance Adjustment, ICCV 2021.
8. Zareian et al., Bridging Knowledge Graphs to Generate Scene Graphs, ECCV 2020.
9. Tang et al., Unbiased Scene Graph Generation From Biased Training, CVPR 2020.
10. Zhang et al., Causal Property-based Anti-Conflict Modeling with Hybrid Data Augmentation for Unbiased Scene Graph Generation, ACCV 2022.
11. Khandelwal et al., Iterative Scene Graph Generation, NeurIPS 2022.
12. Li et al., Bipartite Graph Network With Adaptive Message Passing for Unbiased Scene Graph Generation, CVPR 2021.
13. Li et al., Biased-Predicate Annotation Identification via Unbiased Visual Predicate Representation, ACM MM 2023.
14. Zhang et al., MixUp: Beyond Empirical Risk Minimization, ICLR 2018.
15. DeVries et al., Improved Regularization of Convolutional Neural Networks with Cutout, arXiv 2017.
16. Yun et al., CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features, ICCV 2019.
17. Chen et al., More Knowledge, Less Bias: Unbiasing Scene Graph Generation With Explicit Ontological Adjustment, WACV 2023.
18. Zhang et al., Fine-Grained Scene Graph Generation with Data Transfer, ECCV 2022.
19. Knyazev et al., Generative Compositional Augmentations for Scene Graph Prediction, ICCV 2021.
20. Wang et al., Learning to Generate an Unbiased Scene Graph by Using Attribute-Guided Predicate Features, AAAI 2023.
21. Chen et al., Dark Knowledge Balance Learning for Unbiased Scene Graph Generation, ACM MM 2023.
22. Sun et al., Unbiased Scene Graph Generation via Two-Stage Causal Modeling, IEEE TPAMI 2023.
23. Li et al., The Devil is in the Labels: Noisy Label Correction for Robust Scene Graph Generation, CVPR 2022.
24. He et al., State-Aware Compositional Learning Towards Unbiased Training for Scene Graph Generation, IEEE TIP 2022.
25. Jin et al., Fast Contextual Scene Graph Generation With Unbiased Context Augmentation, CVPR 2023.
26. Li et al., Compositional Feature Augmentation for Unbiased Scene Graph Generation, ICCV 2023.