[Figure 1 caption, fragment: … multiple annotator clicks to multiple instances and multiple classes of objects (each object class is depicted with a different color), in order to reduce overall annotation cost.]
Annotating tens or hundreds of tiny objects in a given image is laborious yet crucial for a multitude of Computer Vision tasks. Such imagery typically contains objects from various categories, yet the multi-class interactive annotation setting for the detection task has thus far been unexplored. To address these needs, we propose a novel interactive annotation method for multiple instances of tiny objects from multiple classes, based on a few point-based user inputs. Our approach, C3Det, relates the full image context with annotator inputs in a local and global manner via late-fusion and feature-correlation, respectively. We perform experiments on the Tiny-DOTA and LCell datasets using both two-stage and one-stage object detection architectures to verify the efficacy of our approach. Our approach outperforms existing approaches in interactive annotation, achieving higher mAP with fewer clicks. Furthermore, we validate the annotation efficiency of our approach in a user study where it is shown to be 2.85x faster and yield only 0.36x task load (NASA-TLX, lower is better) compared to manual annotation. The code is available at this https URL.
Large-scale data and annotations are crucial for successful deep learning. However, in many real-world problems, annotations are very labor-intensive and expensive to acquire. Annotation costs rise even higher when handling numerous tiny objects, such as in remote sensing [7, 15, 33], extreme weather research, and microscope image analysis [12, 16]. These settings often require highly skilled annotators and, accordingly, high compensation. For instance, cell annotation in Computational Pathology requires expert physicians (pathologists), whose training involves several years of clinical residency [3, 31]. Reducing the cost and effort for these annotators would directly enable the collection of new large-scale tiny-object datasets and contribute to higher model performance.
Several prior works have been proposed to reduce annotation cost in other tasks. Interactive segmentation methods [23, 35] focus on reducing the number of interactions needed to segment a single foreground object, which can be classified as a “many interactions to one instance” approach. However, tiny-object annotation can benefit from a “many interactions to many instances” approach, as a single image can contain many instances. Object counting methods [4, 26] count multiple instances from a few user clicks and do follow a “many interactions to many instances” approach. However, these methods highlight only objects of the same class as the one being counted, and can thus be classified as a “one class to one class” approach, whereas images with tiny objects are often composed of objects from multiple classes. Tiny-object annotation should therefore implement a “many classes to many classes” approach.
To address the above needs, we propose C3Det, an effective interactive annotation framework for tiny-object detection. Fig. 1 shows how a user interacts with C3Det to create bounding boxes for numerous tiny objects from multiple classes. Once a user clicks on a few objects and provides their class information, C3Det takes those clicks as input and detects the bounding boxes of many objects, including even object classes that the user did not specify. The user repeats this process until the annotation is complete. By utilizing user inputs in this “many interactions to many instances” and “many classes to many classes” way, C3Det can significantly speed up annotation.
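To make the point-based input concrete, the sketch below shows one common way such class-labeled clicks can be rendered as per-class Gaussian heatmaps and concatenated with intermediate network features. This is a minimal illustration, not the paper's implementation: the function name, the Gaussian sigma, and the feature shapes are all assumptions.

```python
import numpy as np

def clicks_to_heatmaps(clicks, num_classes, height, width, sigma=3.0):
    """Render point clicks as per-class Gaussian heatmaps.

    clicks: list of (y, x, class_id) annotator inputs.
    Returns an array of shape (num_classes, height, width).
    """
    ys, xs = np.mgrid[0:height, 0:width]
    maps = np.zeros((num_classes, height, width), dtype=np.float32)
    for y, x, c in clicks:
        g = np.exp(-((ys - y) ** 2 + (xs - x) ** 2) / (2.0 * sigma ** 2))
        maps[c] = np.maximum(maps[c], g)  # keep the strongest response per pixel
    return maps

# Late fusion: concatenate the click heatmaps with intermediate backbone
# features along the channel axis (spatial sizes assumed to match).
feats = np.random.rand(64, 128, 128).astype(np.float32)  # hypothetical features
maps = clicks_to_heatmaps([(40, 60, 0), (90, 20, 2)], num_classes=3,
                          height=128, width=128)
fused = np.concatenate([feats, maps], axis=0)  # shape (64 + 3, 128, 128)
```

Encoding the class identity as a separate heatmap channel is what lets the detector associate each click with a specific category rather than treating all clicks alike.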
A key aspect of our approach is making each user click influence objects that are nearby (local context) as well as far away (global context). To encourage the annotator-specified class to be consistent with model predictions, we insert user inputs (in heatmap form) at an intermediate stage of the model (late fusion) and apply a class-consistency loss between user inputs and model predictions. This alone captures local context well, but may miss far-away objects. We therefore introduce the C3 (Class-wise Collated Correlation) module, a novel feature-correlation scheme that propagates local information to far-away objects (see Fig. 2), allowing us to learn many-to-many instance-wise relations while retaining class information. Through extensive experiments, we show that these components, combined, result in significant performance improvements.
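The global-context idea can be sketched as follows: pool a feature vector per class from the clicked locations, correlate it with every spatial position of the feature map, and collate the resulting per-class correlation maps. This is a simplified stand-in for the C3 module, with assumed names, shapes, and a cosine-style normalization; the actual module operates on learned features inside the network.

```python
import numpy as np

def class_wise_collated_correlation(feats, clicks, num_classes):
    """Sketch of class-wise collated correlation.

    feats: (C, H, W) feature map; clicks: list of (y, x, class_id).
    Returns per-class correlation maps of shape (num_classes, H, W),
    so distant instances resembling a clicked class get a high response.
    """
    C, H, W = feats.shape
    corr = np.zeros((num_classes, H, W), dtype=np.float32)
    for k in range(num_classes):
        pts = [(y, x) for y, x, c in clicks if c == k]
        if not pts:
            continue  # no user input for this class yet
        # Average the feature vectors at the clicked positions into one query.
        q = np.mean([feats[:, y, x] for y, x in pts], axis=0)
        q /= (np.linalg.norm(q) + 1e-8)
        # Cosine-style correlation of the query against every location.
        norms = np.linalg.norm(feats, axis=0) + 1e-8
        corr[k] = np.einsum('c,chw->hw', q, feats) / norms
    return corr

# Toy check: two spatial locations share the same feature vector; clicking
# one of them should light up the other, far-away location as well.
feats = np.zeros((4, 8, 8), dtype=np.float32)
feats[0, 2, 2] = 1.0  # clicked instance
feats[0, 6, 6] = 1.0  # distant instance with matching features
corr = class_wise_collated_correlation(feats, [(2, 2, 0)], num_classes=2)
```

Collating the maps class-by-class (rather than pooling all clicks together) is what preserves class information while still relating every click to every spatial location.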
To validate whether our performance improvements translate to lower annotation cost in the real world, we conduct a user study with 10 human annotators. Our approach, C3Det, when combined with further manual bounding-box corrections, is shown to be 2.85× faster and to yield only 0.36× the task load (NASA-TLX) of manual annotation, while achieving the same or even better annotation quality as measured against the ground truth. This verifies that C3Det not only shows improvements in simulated experiments, but also reduces annotation cost in the real world.
In summary, we make the following contributions: (a) we address the problem of multi-class, multi-instance interactive annotation of tiny objects; (b) we introduce a training-data synthesis and evaluation procedure for this setting; (c) we propose a novel architecture for interactive tiny-object detection that considers both the local and global implications of provided user inputs; and finally, (d) our experimental results and user study verify that our method reduces annotation cost while achieving high annotation quality.
Seonwook Park, Heon Song, Jeongun Ryu, Sanghoon Kim, Haejoon Kim, Sérgio Pereira, Donggeun Yoo
Lunit Inc., Seoul, South Korea