Evaluating Resilience of Vision-Based Models on Object-to-Background Compositional Changes

1 Mohamed bin Zayed University of AI, 2 Australian National University, 3 Linköping University

Object-to-background variations generated by our approach. Each column in the figure shows a specific background generated from the prompt listed below each image.

Abstract

Given the large-scale multi-modal training of recent vision-based models and their generalization capabilities, understanding the extent of their robustness is critical for their real-world deployment. In this work, our goal is to evaluate the resilience of current vision-based models against diverse object-to-background context variations. The majority of robustness evaluation methods have introduced synthetic datasets to induce changes to object characteristics (viewpoints, scale, color) or utilized image transformation techniques (adversarial changes, common corruptions) on real images to simulate shifts in distributions. Recent works have explored leveraging large language models and diffusion models to generate changes in the background. However, these methods either offer little control over the changes to be made or distort the object semantics, making them unsuitable for the task. Our method, on the other hand, can induce diverse object-to-background changes while preserving the original semantics and appearance of the object. To achieve this goal, we harness the generative capabilities of text-to-image, image-to-text, and image-to-segment models to automatically generate a broad spectrum of object-to-background changes. We induce both natural and adversarial background changes by either modifying the textual prompts or optimizing the latents and textual embeddings of text-to-image models. This allows us to quantify the role of background context in understanding the robustness and generalization of deep neural networks. We produce various versions of standard vision datasets (ImageNet, COCO), incorporating either diverse and realistic backgrounds into the images or introducing color, texture, and adversarial changes in the background. We conduct thorough experimentation and provide an in-depth analysis of the robustness of vision-based models against object-to-background context variations across different tasks.

ObjectCompose Framework

Overview of the ObjectCompose framework. ObjectCompose uses an inpainting-based diffusion model to generate a counterfactual background for an image. The object mask is obtained from a segmentation model (SAM) by providing the class label as an input prompt. The segmentation mask, along with the original image caption (generated via BLIP-2), is then processed through the diffusion model. For generating adversarial examples, both the latent and the conditional embedding are optimized during the denoising process.
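For a concrete picture of this pipeline, the following is a minimal Python sketch built on Hugging Face transformers and diffusers. It is not the paper's exact configuration: the checkpoints, the `segment_object` helper (a hypothetical stand-in for the SAM-based, class-label-prompted mask extraction), and the way the caption and background edit are combined into one prompt are all illustrative assumptions.

```python
import torch
from PIL import Image, ImageOps
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from diffusers import StableDiffusionInpaintPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Caption the original image with BLIP-2 so the prompt preserves the scene description.
blip_processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to(device)

def caption_image(image: Image.Image) -> str:
    inputs = blip_processor(images=image, return_tensors="pt").to(device)
    out = blip.generate(**inputs, max_new_tokens=30)
    return blip_processor.batch_decode(out, skip_special_tokens=True)[0].strip()

# 2) Object mask. In the paper this comes from SAM prompted with the class label;
#    here `segment_object` is a hypothetical stand-in that should return a PIL mask
#    that is white on the object and black on the background.
def segment_object(image: Image.Image, class_label: str) -> Image.Image:
    raise NotImplementedError("plug in a text-promptable segmentation model (e.g. SAM-based)")

# 3) Regenerate the *background* with an inpainting diffusion model, keeping the object intact.
inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to(device)

def compose_background(image: Image.Image, class_label: str, background_prompt: str) -> Image.Image:
    object_mask = segment_object(image, class_label).convert("L")
    background_mask = ImageOps.invert(object_mask)  # white pixels mark the region to repaint
    caption = caption_image(image)
    prompt = f"{caption}, {background_prompt}"      # one simple way to combine caption and edit
    return inpaint(prompt=prompt, image=image, mask_image=background_mask).images[0]
```

Natural, color, and texture variations then amount to swapping the background prompt, while adversarial variations additionally optimize the diffusion latents and text embedding (a sketch of that step appears after the insights list).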

ObjectCompose: Achieving Diversity via Text Prompts.


ObjectCompose: Automating diverse background changes in real images.
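As an illustration of this prompt-driven diversity, the snippet below drives the `compose_background` helper sketched earlier with a handful of background prompts. The prompt wording and the example class label are assumptions made for this sketch, not the paper's exact prompt set.

```python
from PIL import Image

# Reuse `compose_background` from the sketch above on a single example image.
image = Image.open("input.jpg").convert("RGB")

# Illustrative background prompts (wording assumed, not copied from the paper).
background_prompts = {
    "color":   "a background in a single vivid red color",
    "texture": "a background filled with intricate repeating textures",
    "natural": "placed in a dense forest with tall trees",
}

for name, prompt in background_prompts.items():
    edited = compose_background(image, class_label="dog", background_prompt=prompt)
    edited.save(f"composed_{name}.png")
```

In the result tables below, the "Class label (ours)" and "BLIP-2 Caption (ours)" rows correspond to conditioning the diffusion model on just the class name or on the original BLIP-2 caption, without an explicit background description.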

Our Insights:
  1. Robustness of Vision Models: Vision-based models are vulnerable to diverse background changes, such as texture and color variations, and are most vulnerable to adversarial background changes (a simplified sketch of how these adversarial backgrounds are generated follows this list).
  2. Model Capacity: Increasing model capacity, for both CNN- and transformer-based architectures, improves robustness against varying background contexts. This suggests that distilling from a more robust model can improve the robustness of smaller models: DeiT-T, which distills knowledge from a strong CNN-based teacher, shows improved robustness compared to ViT-T.
  3. Adversarially Trained Models: Our study indicates that adversarially trained models offer only limited robustness: while they hold up well under adversarial background changes, their effectiveness drops for other types of object-to-background compositions. This highlights a significant gap in current training approaches when it comes to handling diverse background changes.
  4. Evaluation across various Vision Tasks: Object detection and segmentation models, which incorporate object-to-background context, display comparatively better robustness to background changes than classification models, as evidenced by our quantitative and qualitative results.
  5. Robustness of Dinov2 Models: Recent training approaches for vision-transformer-based classification models that learn more interpretable attention maps, such as Dinov2 (with registers), show improved robustness to background changes.
  6. Large-Scale Training: Models trained on large-scale datasets with more scalable and stable training show better robustness against background variations.
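For context on the adversarial setting referenced above, the framework optimizes both the diffusion latents and the conditional text embedding during denoising so that the regenerated background misleads a surrogate classifier while the object region stays fixed through the inpainting mask. Below is a heavily simplified conceptual sketch: `denoise_and_decode` is a hypothetical differentiable wrapper around the inpainting pipeline's denoising and VAE decoding, and the optimizer, step count, and learning rate are assumptions rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def adversarial_background(latents, text_emb, denoise_and_decode, classifier, label,
                           steps=20, lr=1e-2):
    """Jointly optimize diffusion latents and the conditional text embedding so the
    generated background maximizes the surrogate classifier's loss on the true label."""
    latents = latents.clone().detach().requires_grad_(True)
    text_emb = text_emb.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([latents, text_emb], lr=lr)

    for _ in range(steps):
        # The object region is preserved by the inpainting mask inside the wrapper;
        # only the background is re-synthesized from the current latents/embedding.
        image = denoise_and_decode(latents, text_emb)
        logits = classifier(image)
        loss = -F.cross_entropy(logits, label)  # ascend on the classification loss
        opt.zero_grad()
        loss.backward()
        opt.step()

    return latents.detach(), text_emb.detach()
```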

ObjectCompose results comparison

Our natural object-to-background changes, including color and texture, perform favorably against state-of-the-art methods. Furthermore, our adversarial object-to-background changes induce a significant drop in performance across vision models.

ObjectCompose performs favorably relative to state-of-the-art methods across unimodal (image classification) models. Classification accuracy (%) is reported for each model.

| Dataset | ViT-T | ViT-S | Swin-T | Swin-S | Res-50 | Res-152 | Dense-161 | Average |
|---|---|---|---|---|---|---|---|---|
| Original | 95.5 | 97.5 | 97.9 | 98.3 | 98.5 | 99.1 | 97.2 | 97.1 |
| ImageNet-E (λ=-20) | 91.3 | 94.5 | 96.5 | 97.7 | 96.0 | 97.6 | 95.4 | 95.5 |
| ImageNet-E (λ=20) | 90.4 | 94.5 | 95.9 | 97.4 | 95.4 | 97.4 | 95.0 | 95.1 |
| ImageNet-E (λ=20-adv) | 82.8 | 88.8 | 90.7 | 92.8 | 91.6 | 94.2 | 90.4 | 90.21 |
| LANCE | 80.0 | 83.8 | 87.6 | 87.7 | 86.1 | 87.4 | 85.1 | 85.3 |
| Class label (ours) | 90.5 | 94.0 | 95.1 | 95.4 | 96.7 | 96.5 | 94.7 | 94.7 |
| BLIP-2 Caption (ours) | 85.5 | 89.1 | 91.9 | 92.1 | 93.9 | 94.5 | 90.6 | 91.0 |
| Color (ours) | 67.1 | 83.8 | 85.8 | 86.1 | 88.2 | 91.7 | 80.9 | 83.37 |
| Texture (ours) | 64.7 | 80.4 | 84.1 | 85.8 | 85.5 | 90.1 | 80.3 | 81.55 |
| Adversarial (ours) | 18.4 | 32.1 | 25.0 | 31.7 | 2.0 | 14.0 | 28.0 | 21.65 |

We also evaluated the resilience of zero-shot CLIP models against object-to-background compositional changes.
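As a reference for how such zero-shot numbers are typically obtained, the sketch below scores an image against text prompts built from the class names using Hugging Face's CLIP implementation. The checkpoint and prompt template are common defaults assumed here, not necessarily the exact setup used in the paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def zero_shot_predict(image: Image.Image, class_names: list[str]) -> str:
    # Standard zero-shot classification: embed the image and one text prompt per class,
    # then pick the class with the highest image-text similarity.
    prompts = [f"a photo of a {name}" for name in class_names]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)
    return class_names[probs.argmax().item()]
```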

ObjectCompose compares favorably with state-of-the-art methods when evaluated on multimodal models (CLIP). Zero-shot classification accuracy (%) is reported for each CLIP backbone.

| Dataset | ViT-B/32 | ViT-B/16 | ViT-L/14 | Res50 | Res101 | Res50x4 | Res50x16 | Average |
|---|---|---|---|---|---|---|---|---|
| Original | 73.9 | 79.4 | 87.7 | 70.6 | 71.8 | 76.2 | 82.1 | 77.4 |
| ImageNet-E (λ=-20) | 69.7 | 76.7 | 82.8 | 67.8 | 69.9 | 72.7 | 77.0 | 73.8 |
| ImageNet-E (λ=20) | 67.9 | 76.1 | 82.1 | 67.3 | 69.8 | 72.6 | 77.0 | 73.3 |
| ImageNet-E (λ=20-adv) | 62.8 | 70.5 | 77.5 | 59.9 | 65.8 | 67.0 | 67.0 | 68.2 |
| LANCE | 54.9 | 54.1 | 57.4 | 58.0 | 60.0 | 60.3 | 73.3 | 59.7 |
| Class label (ours) | 78.4 | 83.6 | 81.5 | 76.6 | 77.0 | 82.0 | 84.5 | 80.5 |
| BLIP-2 Caption (ours) | 68.7 | 72.2 | 71.4 | 65.2 | 68.4 | 71.2 | 75.4 | 70.4 |
| Color (ours) | 48.3 | 61.0 | 69.5 | 50.5 | 54.8 | 60.3 | 69.2 | 59.1 |
| Texture (ours) | 49.6 | 62.4 | 58.8 | 51.6 | 53.2 | 60.7 | 67.4 | 57.7 |
| Adversarial (ours) | 25.5 | 34.8 | 48.1 | 18.2 | 24.4 | 30.2 | 48.4 | 32.8 |

Qualitative comparison

Qualitative comparison of our method (top row) with previous related work (bottom row). Our method enables diverse and controlled background edits.

Conclusion

In this study, we propose ObjectCompose, a method for generating object-to-background compositional changes. Our method addresses the limitations of current works, specifically the distortion of object semantics and the lack of diversity in background changes. We accomplish this by utilizing the capabilities of image-to-text and image-to-segmentation foundational models to preserve the object semantics, while we optimize for diverse object-to-background compositional changes by modifying the textual prompts or optimizing the latents of the text-to-image model. ObjectCompose offers an evaluation protocol complementary to existing ones, enabling comprehensive evaluations of current vision-based models that reveal their vulnerability to background alterations. We anticipate that our insights will pave the way for a more thorough evaluation of vision models, consequently driving the development of more effective methods for improving their resilience.


For additional details about the ObjectCompose framework, dataset, and results, please refer to our main paper. Thank you!

BibTeX

@article{malik2024objectcompose,
    title={ObjectCompose: Evaluating Resilience of Vision-Based Models on Object-to-Background Compositional Changes},
    author={Malik, Hashmat Shadab and Huzaifa, Muhammad and Naseer, Muzammal and Khan, Salman and Khan, Fahad Shahbaz},
    journal={arXiv:2403.04701},
    year={2024}
}