Given the large-scale multi-modal training of recent vision-based models and their generalization capabilities, understanding the extent of their robustness is critical for their real-world deployment. In this work, our goal is to evaluate the resilience of current vision-based models against diverse object-to-background context variations. The majority of robustness evaluation methods have introduced synthetic datasets to induce changes to object characteristics (viewpoints, scale, color) or utilized image transformation techniques (adversarial changes, common corruptions) on real images to simulate shifts in distributions. Recent works have explored leveraging large language models and diffusion models to generate changes in the background. However, these methods either lack control over the changes to be made or distort the object semantics, making them unsuitable for the task. Our method, on the other hand, can induce diverse object-to-background changes while preserving the original semantics and appearance of the object. To achieve this goal, we harness the generative capabilities of text-to-image, image-to-text, and image-to-segment models to automatically generate a broad spectrum of object-to-background changes. We induce both natural and adversarial background changes by either modifying the textual prompts or optimizing the latents and textual embeddings of the text-to-image model. This allows us to quantify the role of background context in understanding the robustness and generalization of deep neural networks. We produce various versions of standard vision datasets (ImageNet, COCO), incorporating either diverse and realistic backgrounds into the images or introducing color, texture, and adversarial changes in the background. We conduct thorough experimentation and provide an in-depth analysis of the robustness of vision-based models against object-to-background context variations across different tasks.
Overview of the ObjectCompose framework. ObjectCompose consists of an inpainting-based diffusion model that generates the counterfactual background of an image. The object mask is obtained from a segmentation model (SAM) by providing the class label as an input prompt. The segmentation mask, along with the original image caption (generated via BLIP-2), is then processed by the diffusion model. For generating adversarial examples, both the latents and the conditional embedding are optimized during the denoising process.
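The pipeline above can be approximated with off-the-shelf components. Below is a minimal, hedged sketch, assuming Hugging Face diffusers and transformers checkpoints for the inpainting model and BLIP-2, and a hypothetical `get_object_mask(image, label)` helper standing in for the text-prompted SAM step (plain SAM takes point or box prompts, so a grounding step is needed to turn the class label into a mask). It is an illustrative approximation, not the exact ObjectCompose implementation.

```python
# Sketch of an ObjectCompose-style background edit, assuming:
#   - get_object_mask(image, label): hypothetical text-prompted SAM wrapper
#     returning an "L"-mode PIL mask with the object in white (not shown).
#   - Standard Hugging Face diffusers / transformers checkpoints.
import torch
from PIL import Image, ImageOps
from diffusers import StableDiffusionInpaintPipeline
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Caption the original image with BLIP-2 to anchor the text prompt.
blip_proc = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to(device)

def caption_image(image: Image.Image) -> str:
    inputs = blip_proc(images=image, return_tensors="pt").to(device, torch.float16)
    out = blip.generate(**inputs, max_new_tokens=30)
    return blip_proc.decode(out[0], skip_special_tokens=True)

# 2. The inpainting diffusion model repaints everything the mask marks in white.
inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to(device)

def compose_background(image: Image.Image, object_mask: Image.Image,
                       background_prompt: str) -> Image.Image:
    """Keep the object, repaint the background described by the prompt."""
    prompt = f"{caption_image(image)}, {background_prompt}"
    background_mask = ImageOps.invert(object_mask)  # white = region to inpaint
    return inpaint(prompt=prompt, image=image,
                   mask_image=background_mask).images[0]

# Usage (get_object_mask is the assumed SAM-based helper):
# image = Image.open("dog.jpg").convert("RGB")
# mask = get_object_mask(image, "dog")
# edited = compose_background(image, mask, "in a snowy forest")
```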
Our natural object-to-background changes, including color and texture edits, perform favorably against state-of-the-art methods. Furthermore, our adversarial object-to-background changes cause a significant drop in performance across vision models.
[Table: results across background variants — Original, ImageNet-E (λ=-20), ImageNet-E (λ=20), ImageNet-E (λ=20-adv), LANCE, Class label (ours), BLIP-2 Caption (ours), Color (ours), Texture (ours), and Adversarial (ours); numeric results are in the main paper.]
We evaluate the resilience of zero-shot CLIP models against object-to-background compositional changes; a minimal sketch of this zero-shot evaluation protocol follows the table.
[Table: zero-shot CLIP results across the same background variants — Original, ImageNet-E (λ=-20), ImageNet-E (λ=20), ImageNet-E (λ=20-adv), LANCE, Class label (ours), BLIP-2 Caption (ours), Color (ours), Texture (ours), and Adversarial (ours); numeric results are in the main paper.]
Qualitative comparison of our method (top row) with previous related work (bottom row). Our method enables diverse and controlled background edits.
In this study, we propose ObjectCompose, a method for generating object-to-background compositional changes. Our method addresses the limitations of current works, specifically the distortion of object semantics and the lack of diversity in background changes. We accomplish this by utilizing the capabilities of image-to-text and image-to-segmentation foundational models to preserve the object semantics, while we optimize for diverse object-to-background compositional changes by modifying the textual prompts or optimizing the latents of the text-to-image model. ObjectCompose offers an evaluation protocol complementary to existing ones, enabling comprehensive evaluations of current vision-based models that reveal their vulnerability to background alterations. We anticipate that our insights will pave the way for a more thorough evaluation of vision models, consequently driving the development of more effective methods for improving their resilience.
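For the adversarial variant mentioned above, the optimization can be sketched as a gradient-based loop over the diffusion inputs. The sketch below is a hedged approximation: `denoise_and_decode` is an assumed differentiable wrapper around the inpainting model's denoising loop and VAE decoder (not shown here), and the ResNet-50 classifier simply stands in for whichever pretrained vision model is under evaluation.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

def adversarial_background(latents, text_embeds, true_label, classifier,
                           denoise_and_decode, steps=20, lr=1e-2):
    """Maximize the classifier's loss on the composed image by updating the
    initial diffusion latents and the conditional text embedding."""
    latents = latents.clone().requires_grad_(True)
    text_embeds = text_embeds.clone().requires_grad_(True)
    opt = torch.optim.Adam([latents, text_embeds], lr=lr)
    target = torch.tensor([true_label], device=latents.device)

    for _ in range(steps):
        # denoise_and_decode is the assumed differentiable wrapper; it should
        # return a classifier-ready image tensor (resized and normalized).
        image = denoise_and_decode(latents, text_embeds)
        loss = -F.cross_entropy(classifier(image), target)  # gradient ascent on CE
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():
        return denoise_and_decode(latents, text_embeds)

# Usage sketch (ResNet-50 stands in for any model under evaluation):
# classifier = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
# classifier.eval().requires_grad_(False)
# adv_image = adversarial_background(z_T, cond_embeds, label, classifier,
#                                    denoise_and_decode)
```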
For additional details about the ObjectCompose framework, datasets, and results, please refer to our main paper. Thank you!
@article{malik2024objectcompose,
title={ObjectCompose: Evaluating Resilience of Vision-Based Models on Object-to-Background Compositional Changes},
  author={Malik, Hashmat Shadab and Huzaifa, Muhammad and Naseer, Muzammal and Khan, Salman and Khan, Fahad Shahbaz},
journal={arXiv:2403.04701},
year={2024}
}