Query-guided Attention in Vision Transformers for Localizing Objects Using a Single Sketch
Aditay Tripathi
Anand Mishra
Anirban Chakraborty
[Paper]
[GitHub]
Goal: Consider a scenario where a user wishes to localize all instances of the object broccoli in a set of natural images, where (i) images of broccoli are never seen during training, (ii) even at inference time the user does not have a natural image of broccoli to use as a query, and (iii) the category name ("broccoli") is unknown to the user. In such a situation, the user draws a sketch of broccoli by hand to localize all of its instances in the natural images. This is the sketch-guided object localization problem. This work significantly improves performance on this challenging task.
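To make the task concrete, the sketch below shows only the expected input/output contract; the function name, types, and box format are assumptions for illustration, not part of the released code.

# Hypothetical interface for sketch-guided object localization (illustrative only).
# Input: a hand-drawn sketch of the query object and a natural image;
# output: a bounding box for every instance of the sketched object in the image.
from typing import List, Tuple
from PIL import Image

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def localize_with_sketch(sketch: Image.Image, image: Image.Image) -> List[Box]:
    """Return boxes for all instances of the sketched object found in `image`."""
    raise NotImplementedError  # stands in for the model described below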

Abstract

In this study, we explore sketch-based object localization on natural images. Given a crude hand-drawn object sketch, the task is to locate all instances of that object in the target image. This problem proves difficult due to the abstract nature of hand-drawn sketches, variations in the style and quality of sketches, and the large domain gap between the sketches and the natural images. Existing solutions address this using attention-based frameworks to merge query information into image features. Yet, these methods often integrate query features after independently learning the image features, causing inadequate alignment and, as a result, incorrect localization. In contrast, we propose a novel sketch-guided vision transformer encoder that uses cross-attention after each block of the transformer-based image encoder to learn query-conditioned image features, leading to stronger alignment with the query sketch. Furthermore, at the decoder’s output, object and sketch features are refined to better align the representation of objects with the sketch query, thereby improving localization. The proposed model also generalizes to object categories not seen during training, as the target image features learned by the proposed model are query-aware. Our framework can utilize multiple sketch queries via a novel trainable sketch fusion strategy. The model is evaluated on images from the public MS-COCO benchmark, using sketch queries from the QuickDraw! and Sketchy datasets. Compared with existing localization methods, the proposed approach gives a 6.6% and 8.0% improvement in mAP for seen objects using sketch queries from the QuickDraw! and Sketchy datasets, respectively, and a 12.2% improvement in AP@50 for large objects that are ‘unseen’ during training.
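The paper defines the exact multi-query fusion strategy; the snippet below is only a minimal, hedged sketch of one plausible trainable fusion of several sketch-query embeddings (attention-weighted pooling), with the module name, dimensions, and number of sketches assumed for illustration.

import torch
import torch.nn as nn

class SketchQueryFusion(nn.Module):
    """Illustrative trainable pooling of several sketch embeddings into one query."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # learns how much each sketch contributes

    def forward(self, sketch_feats: torch.Tensor) -> torch.Tensor:
        # sketch_feats: (batch, num_sketches, dim)
        weights = self.score(sketch_feats).softmax(dim=1)   # (batch, num_sketches, 1)
        return (weights * sketch_feats).sum(dim=1)          # (batch, dim)

# Example: fuse 3 sketches of the same object into a single 256-d query vector.
fused_query = SketchQueryFusion(dim=256)(torch.randn(2, 3, 256))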

Model


The proposed sketch-guided object localization model contains two primary components: (a) a sketch-guided vision transformer encoder (Sec. 3.1.1) and (b) object and sketch refinement (Sec. 3.1.2). The sketch-guided transformer encoder takes the target image as input and generates sketch-conditioned features by fusing the sketch features into the image features after each block of the image encoder using cross-attention. After obtaining object-level features at the output of the transformer decoder, the object features and the query sketch features are further refined to bring the features of the relevant object closer to the query sketch, leading to better localization scores.
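A minimal PyTorch sketch of the encoder-side fusion described above: a cross-attention layer injects sketch features into the image tokens after every self-attention block of the image encoder, so the image representation is query-conditioned throughout. Module names, dimensions, and depth are assumptions for illustration, not the released implementation.

import torch
import torch.nn as nn

class SketchGuidedBlock(nn.Module):
    """One image-encoder block followed by cross-attention to the sketch query."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, sketch_tokens):
        # Standard self-attention over image tokens.
        x = self.self_block(img_tokens)
        # Cross-attention: image tokens (queries) attend to sketch tokens
        # (keys/values), yielding sketch-conditioned image features.
        attended, _ = self.cross_attn(query=x, key=sketch_tokens, value=sketch_tokens)
        return self.norm(x + attended)

class SketchGuidedEncoder(nn.Module):
    def __init__(self, depth: int = 6, dim: int = 256):
        super().__init__()
        self.blocks = nn.ModuleList([SketchGuidedBlock(dim) for _ in range(depth)])

    def forward(self, img_tokens, sketch_tokens):
        for block in self.blocks:
            img_tokens = block(img_tokens, sketch_tokens)
        return img_tokens  # query-conditioned features passed to the decoder

# Illustrative shapes: batch of 2, 196 image tokens, 49 sketch tokens, dim 256.
feats = SketchGuidedEncoder()(torch.randn(2, 196, 256), torch.randn(2, 49, 256))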


Code


 [GitHub]


Paper and Supplementary Material

A. Tripathi, A. Mishra, A. Chakraborty.
Query-guided Attention in Vision Transformers for Localizing Objects Using a Single Sketch
In WACV, 2024.
(hosted on ArXiv)


[Bibtex]


This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project; the code can be found here.