\(\texttt{O3SLM}\): Open Weight, Open Data, and Open Vocabulary Sketch-Language Model

Visual Computing Lab, IISc Bangalore
*Indicates Equal Contribution

AAAI 2026

Capabilities of our model - \(\texttt{O3SLM}\). Our model is the first Large Vision-Language Model (LVLM) to demonstrate advanced alignment between sketches, images, and text—where existing LVLMs consistently fail. Through extensive pretraining on our proposed \(\texttt{SketchVCL}\) dataset, the model develops a robust understanding of crude hand-drawn sketches and how they relate to the visual and textual modalities in which current LVLMs already excel. This training enables cross-modal transfer, allowing the model to handle fine-grained queries using sketch-text pairs, even though it was originally trained with sketches alone. \(\texttt{O3SLM}\) is trained across multiple tasks, including Visual Question Answering (VQA), Sketch-based Image Retrieval (SBIR), sketch-based counting, and sketch-based object detection.

Abstract

While Large Vision-Language Models (LVLMs) are increasingly deployed in real-world applications, their ability to interpret abstract visual inputs remains limited. Specifically, they struggle to comprehend hand-drawn sketches, a modality that offers an intuitive means of expressing concepts that are difficult to describe textually.

We identify the primary bottleneck as the absence of a large-scale dataset that jointly models sketches, photorealistic images, and corresponding natural language instructions. To address this, we present two key contributions:
(1) a new, large-scale dataset of image-sketch-instruction triplets designed to facilitate both pretraining and instruction tuning, and
(2) \(\texttt{O3SLM}\), an LVLM trained on this dataset. Comprehensive evaluations across multiple sketch-based tasks, namely (a) object localization, (b) counting, (c) image retrieval (SBIR and fine-grained SBIR), and (d) visual question answering (VQA), conducted on the three existing sketch datasets QuickDraw!, Sketchy, and TU-Berlin along with our generated \(\texttt{SketchVCL}\) dataset, show that \(\texttt{O3SLM}\) achieves state-of-the-art performance, substantially outperforming existing LVLMs in sketch comprehension and reasoning.

Method


Summary of \(\texttt{O3SLM}\). We use CLIP-L-336 as the visual backbone. The hand-drawn sketch and the natural image are encoded with this backbone, and a multimodal connector then projects the sketch and image features into the input space of the LLM. Finally, the sketch, image, and text tokens are concatenated and passed through the LLM, which generates the output.
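The forward pass described above can be pictured as a small composition of three components. Below is a minimal PyTorch sketch; it is not the released implementation. The encoder, connector, and LLM are lightweight stand-ins (the actual model uses a CLIP-L-336 encoder, a learned projection connector, and a pretrained LLM), and all dimensions, patch sizes, and vocabulary sizes are toy values chosen for illustration.

import torch
import torch.nn as nn

class O3SLMForward(nn.Module):
    """Illustrative forward pass: sketch + image + text tokens -> LLM output (stand-in modules)."""
    def __init__(self, vis_dim=256, llm_dim=512, vocab=32000):
        super().__init__()
        # Stand-in for the shared CLIP-L-336 visual backbone (maps flattened patches to features).
        self.vision_encoder = nn.Linear(3 * 14 * 14, vis_dim)
        # Multimodal connector: projects visual features into the LLM input space.
        self.connector = nn.Sequential(nn.Linear(vis_dim, llm_dim), nn.GELU(),
                                       nn.Linear(llm_dim, llm_dim))
        # Stand-in decoder-only LLM: embedding + transformer + LM head.
        self.embed = nn.Embedding(vocab, llm_dim)
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True), num_layers=2)
        self.lm_head = nn.Linear(llm_dim, vocab)

    def encode_visual(self, patches):
        # patches: (B, N, 3*14*14) flattened sketch or image patches.
        return self.connector(self.vision_encoder(patches))

    def forward(self, sketch_patches, image_patches, text_ids):
        sketch_tok = self.encode_visual(sketch_patches)   # (B, Ns, D)
        image_tok = self.encode_visual(image_patches)     # (B, Ni, D)
        text_tok = self.embed(text_ids)                   # (B, Nt, D)
        # Concatenate sketch, image, and text tokens and decode with the LLM.
        seq = torch.cat([sketch_tok, image_tok, text_tok], dim=1)
        return self.lm_head(self.llm(seq))

# Toy usage with random inputs (shapes only; no pretrained weights).
model = O3SLMForward()
out = model(torch.randn(1, 16, 588), torch.randn(1, 16, 588), torch.randint(0, 32000, (1, 8)))
print(out.shape)  # torch.Size([1, 40, 32000])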

\(\texttt{SketchVCL}\) Dataset


Automated Large-Scale Sketch Generation Pipeline. For each object instance, we use the SAM2-generated segmentation map to mask the background and pass the foreground through Pix2Pix for sketch generation. These sketches are enhanced with edges extracted via morphological gradients. The final sketch is an aggregation of these edges and the Pix2Pix sketch.
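The edge-enhancement and aggregation step can be illustrated with standard OpenCV operations. The snippet below is a minimal sketch under stated assumptions, not the paper's exact code: mask is assumed to be a precomputed SAM2 binary foreground mask, pix2pix_sketch is assumed to be the precomputed Pix2Pix output, and the kernel size and max-based aggregation are illustrative choices.

import cv2
import numpy as np

def enhance_sketch(image, mask, pix2pix_sketch, ksize=3):
    """Combine morphological-gradient edges of the masked foreground with a Pix2Pix sketch.

    image          -- HxWx3 uint8 natural image
    mask           -- HxW uint8 foreground mask (e.g. from SAM2), 1 = object
    pix2pix_sketch -- HxW uint8 sketch produced by Pix2Pix for the masked foreground
    """
    # Keep only the foreground object before extracting edges.
    fg = cv2.bitwise_and(image, image, mask=mask.astype(np.uint8))
    gray = cv2.cvtColor(fg, cv2.COLOR_BGR2GRAY)

    # Morphological gradient = dilation - erosion; highlights object boundaries.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (ksize, ksize))
    edges = cv2.morphologyEx(gray, cv2.MORPH_GRADIENT, kernel)

    # Aggregate: keep the stronger response of the edge map and the Pix2Pix sketch per pixel.
    return np.maximum(edges, pix2pix_sketch)

# Toy usage with synthetic inputs (real inputs come from SAM2 and Pix2Pix).
img = np.zeros((128, 128, 3), np.uint8); cv2.circle(img, (64, 64), 30, (200, 200, 200), -1)
msk = np.zeros((128, 128), np.uint8); cv2.circle(msk, (64, 64), 30, 1, -1)
print(enhance_sketch(img, msk, np.zeros((128, 128), np.uint8)).shape)  # (128, 128)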



Training Data Composition. The distribution of data for each task and corresponding datasets is shown. The total pretraining size is 600k, while the total finetuning size is 215k. Instruction tuning data is curated based on the downstream tasks. Detailed instruction formatting prompts for each task are provided in the supplementary.
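The exact prompt templates are given in the supplementary; purely as a hypothetical illustration of how a single instruction-tuning record could be laid out, a sketch-based counting sample might resemble the Python dictionary below. All field names, placeholder tokens, paths, and the answer are made-up assumptions, not the paper's actual format.

# Hypothetical instruction-tuning record for the sketch-based counting task.
example = {
    "image": "images/example_scene.jpg",      # natural image (hypothetical path)
    "sketch": "sketches/example_query.png",   # query sketch (hypothetical path)
    "conversations": [
        {"from": "human",
         "value": "<image>\n<sketch>\nHow many objects matching the sketch appear in the image?"},
        {"from": "gpt", "value": "3"},
    ],
}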

See sketch samples from the \(\texttt{SketchVCL}\) dataset, generated using our proposed sketch-generation pipeline.

Results


Evaluation on Sketch-Based Counting. We evaluate performance on images from the COCO and PixMo-Count datasets. COCO presents a more challenging setting, with a greater number of object categories per image, forcing the model to rely more heavily on the sketch as a query. We sample sketches from four datasets that represent varying levels of abstraction and difficulty in hand-drawn sketches; for example, QuickDraw! is known for highly abstract and often incomplete sketches. Sketch datasets marked as unseen were not used during training and assess our model's ability to generalize to new sketch styles.



Sketch-Based Object Detection. We evaluate sketch-based object detection on COCO val2017 images using sketches from four different datasets: Sketchy, QuickDraw!, TU-Berlin, and SketchVCL-C. We report results on the TU-Berlin and Sketchy datasets. Sketch datasets marked as unseen were not used during training and assess our model's ability to generalize to new sketch styles.

Sketch-Based Image Retrieval (SBIR) performance on Sketchy. The substantial gains indicate that, although the original LLaVA has very limited sketch understanding, our training data and methodology align sketches and text in \(\texttt{O3SLM}\).

Qualitative Results for SBIR.

More qualitative results for the detection and VQA tasks.

BibTeX

@InProceedings{gupta_2026_AAAI,
  author    = {Gupta, Rishi and Karuppasamy, Mukilan and Marjit, Shyam and Tripathi, Aditay and Chakraborty, Anirban},
  title     = {O3SLM: Open Weight, Open Data, and Open Vocabulary Sketch-Language Model},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  year      = {2026},
}