In this study, we explore sketch-based object localization on natural images. Given a crude hand-drawn sketch of an object, the task is to locate all instances of that object in the target image. This problem is difficult due to the abstract nature of hand-drawn sketches, variations in sketch style and quality, and the large domain gap between sketches and natural images. Existing solutions address this using attention-based frameworks to merge query information into image features. However, these methods often integrate query features only after the image features have been learned independently, causing inadequate alignment and, as a result, incorrect localization. In contrast, we propose a novel sketch-guided vision transformer encoder that applies cross attention after each block of the transformer-based image encoder to learn query-conditioned image features, leading to stronger alignment with the query sketch. Further, at the decoder's output, object and sketch features are refined to better align object representations with the sketch query, thereby improving localization. The proposed model also generalizes to object categories not seen during training, since the target image features it learns are query-aware. Our framework can utilize multiple sketch queries via a novel trainable sketch fusion strategy. The model is evaluated on images from the public MS-COCO benchmark, using sketch queries from the QuickDraw! and Sketchy datasets. Compared with existing localization methods, the proposed approach yields 6.6% and 8.0% improvements in mAP for seen objects using sketch queries from QuickDraw! and Sketchy, respectively, and a 12.2% improvement in AP@50 for large objects that are 'unseen' during training.
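
To illustrate the central idea of query-conditioned encoding, the following is a minimal PyTorch sketch, with hypothetical module names, dimensions, and layer counts not taken from the paper: each image-encoder block is followed by a cross-attention step in which image tokens attend to sketch-query tokens, so the image features remain conditioned on the query throughout encoding rather than being fused only at the end.

import torch
import torch.nn as nn

class SketchGuidedEncoderBlock(nn.Module):
    # Hypothetical block: self attention over image tokens, then cross
    # attention from image tokens to sketch-query tokens, then a feed-forward layer.
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, img_tokens, sketch_tokens):
        # standard self attention over image patch tokens
        x = self.norm1(img_tokens + self.self_attn(img_tokens, img_tokens, img_tokens)[0])
        # cross attention: image tokens (queries) attend to sketch tokens (keys/values)
        x = self.norm2(x + self.cross_attn(x, sketch_tokens, sketch_tokens)[0])
        return self.norm3(x + self.ffn(x))

class SketchGuidedEncoder(nn.Module):
    # Stacks the blocks so query conditioning happens after every encoder block.
    def __init__(self, depth=6, dim=256):
        super().__init__()
        self.blocks = nn.ModuleList(SketchGuidedEncoderBlock(dim) for _ in range(depth))

    def forward(self, img_tokens, sketch_tokens):
        for blk in self.blocks:
            img_tokens = blk(img_tokens, sketch_tokens)
        return img_tokens

# Example usage with assumed sizes: 196 image patch tokens, 50 sketch tokens, dim 256.
enc = SketchGuidedEncoder()
out = enc(torch.randn(2, 196, 256), torch.randn(2, 50, 256))
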