Instance Segmentation of Scene Sketches
Using Natural Image Priors

Anonymous Authors

Paper | Code (Coming Soon) | Dataset (Coming Soon)

INKi performs instance segmentation of raster sketches.
It effectively handles diverse types of sketches, accommodating variations in stroke style and complexity.

Here we present high-level descriptions and examples of our core contributions. Please refer to the paper for more in-depth details and results.📝

But first, why instance segmentation of sketches?

🧑‍🎨 Sketching complex scenes is an expressive and iterative process, but editing and manipulating individual elements can be tedious without proper segmentation. This work is motivated by the need for a robust and precise instance segmentation method for sketches—one that empowers artists to refine their work with ease and flexibility.

What does INKi enable?

Interactive interface for sketch editing. INKi's segmentation and layering technique makes sketch editing effortless. With our interactive interface, users can upload a sketch, which is automatically segmented into ordered layers. Artists can then move, copy, or delete objects without manually redrawing the affected areas, which streamlines editing and gives them more control over their compositions.

What can INKi do?

We present instance segmentation results achieved by INKi across various stroke styles, demonstrating its robustness to variations in style and complexity.

INKi on Real-World Sketches

We demonstrate results on real-world sketches done by artists. These scenes span cozy indoor spaces, lively outdoor environments, expressive characters, and intricate object arrangements. Some sketches are minimal with clean strokes, while others are richly detailed with overlapping elements and fine textures, reflecting a range of artistic styles and complexity.

INKi on Dataset Sketches

We demonstrate results on sketches sourced from semi-synthetic and fully synthetic datasets, covering a diverse range of structured and organic scenes. These include charming suburban houses, bustling townscapes, and serene parks filled with people and animals. Some sketches depict everyday moments, such as dining scenes and workspace setups, while others capture dynamic action, including sports and outdoor activities. The dataset also features a variety of subjects like farm animals, intricate floral arrangements, and highly detailed objects. These synthetic sketches balance realism with stylization, making them a valuable benchmark for testing segmentation across different levels of abstraction and complexity.

How does INKi work?

✴️ Given a raster sketch, our goal is to produce a segmentation map such that pixels belonging to the same object instance are grouped together. Based on the segmentation map, we also divide the sketch into layers, sorted by depth.
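To make the relation between the segmentation map and the layers concrete, here is a minimal sketch that splits a sketch into per-instance layers and orders them by depth. It is an illustration, not the released implementation. It assumes the segmentation map is an integer label image where 0 marks unassigned pixels, the paper is white (255), each instance already has a scalar depth score where a larger value means closer to the viewer, and layers are returned back to front. The function name `split_into_layers` and its parameters are hypothetical.

```python
import numpy as np

def split_into_layers(sketch, seg_map, depth_score):
    """Split a sketch into one layer per instance, ordered back to front.

    sketch:      (H, W) uint8 image, 255 = white paper (assumption).
    seg_map:     (H, W) int label map, 0 = unassigned (assumption).
    depth_score: dict mapping instance id -> score, larger = closer (assumption).
    """
    ids = [int(i) for i in np.unique(seg_map) if i != 0]
    # Farthest objects first, so later layers can be drawn on top of earlier ones.
    ids.sort(key=lambda i: depth_score[i])

    layers = []
    for i in ids:
        layer = np.full_like(sketch, 255)   # start from blank paper
        keep = seg_map == i
        layer[keep] = sketch[keep]          # copy only this instance's pixels
        layers.append((i, layer))
    return layers
```

Compositing the returned layers in order would reproduce the assigned pixels of the original sketch, and moving or deleting an object amounts to shifting or dropping one layer.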

Overall pipeline

Given an input sketch image, our framework first detects bounding boxes using a customized Grounding DINO model to obtain region proposals, and then performs segmentation with SAM models. The localization and segmentation are refined by incorporating depth features. The resulting segmentation can be viewed as a layered decomposition of the objects in the original sketch.

Sketch-aware object detection

🟥 Challenge: Grounding DINO is an object detection model which outputs bounding boxes for recognized object instances, based on a given text prompt describing the scene. While effective for natural images, the model in its original configuration demonstrates limited generalization to sketches.

✅ Our Solution: We fine-tune a Grounding DINO model specifically for sketches. We train the model to distinguish between instances based on their visual characteristics, pushing it to rely on Gestalt properties such as closure, continuity, and emergence to group together the strokes that form a single object. ⭐ Fine-tuning substantially improves Grounding DINO's detection performance on sketches, raising Average Precision from 24% to 75%. Please refer to the paper for technical details and numerical evaluations.

Resolving ambiguities in overlapping regions

🟥 Challenge: We extract segmentation masks by prompting Segment Anything (SAM) with the bounding boxes. However, overlapping bounding boxes often lead to problematic mask overlaps. For example, a couch's mask might extend over a cat's ear. To ensure accurate segmentation, pixels in these regions must be assigned to a single object.

✅ Our Solution: We use an off-the-shelf depth estimation model to extract the sketch's depth map $D$. The depth map $D$ is sampled at evenly spaced points along the sketch pixels, and the sampled points are grouped by their corresponding object (e.g., $P_i$ corresponds to the $i$-th object). Each object is assigned a depth score based on the majority of depth values among its sampled points. Each ambiguous pixel is then assigned to the overlapping mask with the highest depth score, prioritizing foreground objects.
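The depth-based assignment described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the method's actual code: the depth map is taken to store larger values for closer surfaces, the "majority" depth is approximated by the most frequent bin of a quantized depth map, and evenly spaced sampling is approximated by taking every `step`-th stroke pixel in raster order. The names `depth_scores` and `resolve_overlaps` and the parameters `step` and `bins` are hypothetical.

```python
import numpy as np

def depth_scores(depth, stroke, masks, step=4, bins=16):
    """One depth score per mask: the most common quantized depth among
    stroke pixels sampled inside that mask (larger = closer, by assumption)."""
    ys, xs = np.nonzero(stroke)          # stroke pixels in raster order
    ys, xs = ys[::step], xs[::step]      # keep every `step`-th one
    lo, hi = depth.min(), depth.max()
    # Quantize depth into `bins` levels so that a majority vote is well defined.
    q = np.floor((depth - lo) / max(hi - lo, 1e-8) * (bins - 1)).astype(int)
    scores = []
    for m in masks:
        inside = m[ys, xs]               # sampled points that fall in this mask
        vals = q[ys[inside], xs[inside]]
        if vals.size == 0:
            scores.append(-1)            # no samples: treat as farthest
        else:
            scores.append(int(np.bincount(vals, minlength=bins).argmax()))
    return np.array(scores)

def resolve_overlaps(masks, scores):
    """Give every pixel claimed by several masks to the highest-scoring one."""
    stack = np.stack(masks)                       # (N, H, W) boolean
    ambiguous = stack.sum(axis=0) > 1
    # Masks that do not cover a pixel cannot win it.
    weighted = np.where(stack, scores[:, None, None], -np.inf)
    winner = weighted.argmax(axis=0)
    resolved = stack.copy()
    for i in range(len(masks)):
        resolved[i] &= ~ambiguous | (winner == i)
    return resolved
```

In the couch and cat example, the cat would receive the higher depth score, so the contested pixels around its ear would be removed from the couch mask and kept in the cat mask. Ties go to the mask that comes first in the list, which is an arbitrary choice made here for simplicity.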

Object layer completion

🟥 Challenge: In complex scenes with multiple objects, occlusion often leaves some objects incomplete, making it tricky to move them elsewhere.

✅ Our Solution: As a final step, we extract a complete layer for each object in the sketch, inpainting any occluded regions using a pretrained SDXL inpainting model. The goal of this stage is to support basic sketch editing operations, such as translation and scaling. The inpainting mask for each object is defined by intersecting the masks of overlapping objects with the object's bounding box.
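One possible reading of the inpainting mask construction is sketched below: the region to inpaint for an object is the part of its bounding box that is covered by other objects' masks. This is an illustration only. Treating every other mask that reaches into the box as "overlapping" is an assumption, as are the box format `(x0, y0, x1, y1)` with exclusive maxima and the function name `inpainting_mask`. The SDXL inpainting call itself is not shown.

```python
import numpy as np

def inpainting_mask(masks, boxes, i):
    """Region of object i's bounding box that is occupied by other objects.

    masks: (N, H, W) boolean array of non-overlapping instance masks.
    boxes: list of (x0, y0, x1, y1) pixel boxes, max coordinates exclusive
           (assumption).
    """
    x0, y0, x1, y1 = boxes[i]
    # Union of all masks except object i's own.
    others = np.delete(masks, i, axis=0).any(axis=0)
    box = np.zeros(masks.shape[1:], dtype=bool)
    box[y0:y1, x0:x1] = True
    return others & box
```

For a couch partly hidden behind a cat, this mask would cover the cat-shaped area inside the couch's box, which is the area an inpainting model would need to fill so that the couch layer stays intact when the cat is moved.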

Synthetic Scene Sketch Dataset

✴️ To evaluate our performance across a diverse set of sketches, we construct a synthetic annotated scene-sketch dataset. This dataset focuses on three key axes of variation, designed to extend existing datasets: (1) drawing style, (2) stroke style, and (3) object categories.