AGILE3D: Attention Guided Interactive Multi-object 3D Segmentation

ICLR 2024

¹Photogrammetry and Remote Sensing, ETH Zurich; ²ETH AI Center, ETH Zurich; ³Computer Vision Group, RWTH Aachen University; ⁴Google

AGILE3D is a novel method for interactive multi-object 3D segmentation that supports: (1) Click sharing: clicks on one object also help segment other objects. (2) Holistic reasoning: AGILE3D captures the contextual relationships among objects. For example, clicks on the armrest of one chair also help correct the armrests of other chairs. (3) Globally consistent mask: AGILE3D encourages different regions to compete directly for space in the whole scene, so that each point is assigned exactly one label. (4) Fast inference: AGILE3D can pre-compute backbone features once per scene and run only a lightweight decoder per iteration.


During interactive segmentation, a model and a user work together to delineate objects of interest in a 3D point cloud. In an iterative process, the model assigns each data point to an object (or the background), while the user corrects errors in the resulting segmentation and feeds them back into the model. The current best practice formulates the problem as binary classification and segments objects one at a time. The model expects the user to provide positive clicks on regions wrongly assigned to the background and negative clicks on regions wrongly assigned to the object. Visiting objects sequentially is wasteful because it disregards synergies between objects: a positive click for a given object can, by definition, serve as a negative click for nearby objects. Moreover, direct competition between adjacent objects can speed up the identification of their common boundary. We introduce AGILE3D, an efficient, attention-based model that (1) supports simultaneous segmentation of multiple 3D objects, (2) yields more accurate segmentation masks with fewer user clicks, and (3) offers faster inference. Our core idea is to encode user clicks as spatial-temporal queries and enable explicit interactions between click queries, as well as between the queries and the 3D scene, through a click attention module. Every time new clicks are added, we only need to run a lightweight decoder that produces updated segmentation masks. In experiments with four different 3D point cloud datasets, AGILE3D sets a new state of the art. We also verify its practicality in real-world setups through user studies.
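The interaction loop described above can be illustrated with a small, self-contained sketch. It is not the AGILE3D implementation: `toy_backbone` and `toy_decoder` are hypothetical stand-ins (the real model uses a learned feature backbone and an attention-based decoder), and the `Click` and `InteractiveSession` names are assumptions made for this example. The sketch only shows two ideas from the text: scene features are computed once and reused for every new click, and a click placed on one object implicitly acts as a negative click for its neighbours because all regions are decoded together.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]


@dataclass
class Click:
    position: Point  # where the user clicked (spatial part)
    region: int      # 0 = background, 1..K = object ids
    step: int        # position in the click sequence (temporal part)


def toy_backbone(points: List[Point]) -> List[Point]:
    # Hypothetical stand-in for the feature backbone:
    # the raw coordinates are used as per-point "features".
    return [tuple(p) for p in points]


def toy_decoder(features: List[Point], clicks: List[Click]) -> List[int]:
    # Hypothetical stand-in for the lightweight decoder: every point takes
    # the region of its nearest click, so all regions compete for each point.
    labels = []
    for f in features:
        nearest = min(
            clicks,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(f, c.position)),
        )
        labels.append(nearest.region)
    return labels


@dataclass
class InteractiveSession:
    points: List[Point]
    clicks: List[Click] = field(default_factory=list)
    backbone_calls: int = 0
    _features: Optional[List[Point]] = None

    def add_click(self, position: Point, region: int) -> List[int]:
        # The expensive feature extraction runs once per scene ...
        if self._features is None:
            self._features = toy_backbone(self.points)
            self.backbone_calls += 1
        # ... while only the cheap decoder runs again for each new click.
        self.clicks.append(Click(position, region, step=len(self.clicks)))
        return toy_decoder(self._features, self.clicks)


points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
session = InteractiveSession(points)
first = session.add_click((0.0, 0.0, 0.0), region=1)   # everything -> object 1
second = session.add_click((3.0, 0.0, 0.0), region=2)  # object 2 claims its side
print(first, second, session.backbone_calls)
```

In this toy run, the single click on object 2 removes two points from object 1 without any explicit negative click, and the backbone stand-in is called only once across both interactions.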


Model of AGILE3D. Given a 3D scene and a user click sequence, (a) the feature backbone extracts per-point features and (b) the click-as-query module converts user clicks into high-dimensional query vectors. (c) The click attention module refines the click queries and point features through multiple attention mechanisms. (d) The query fusion module first fuses the per-click mask logits into region-specific mask logits and then produces a final mask through a softmax.
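Step (d) can be made concrete with a minimal sketch in plain Python. The caption only states that per-click logits are fused into region-specific logits and then passed through a softmax; the choice of a per-point maximum over all clicks of the same region is an assumption made here for illustration, as are the function names `fuse_click_logits` and `softmax_labels` and the toy logit values.

```python
import math
from typing import Dict, List, Tuple


def fuse_click_logits(
    click_logits: List[List[float]], click_regions: List[int]
) -> Dict[int, List[float]]:
    # click_logits[i][p] is the mask logit of click i at point p;
    # click_regions[i] is the region of click i (0 = background).
    # Assumption: clicks of the same region are fused by a per-point max.
    fused: Dict[int, List[float]] = {}
    for logits, region in zip(click_logits, click_regions):
        if region not in fused:
            fused[region] = list(logits)
        else:
            fused[region] = [max(a, b) for a, b in zip(fused[region], logits)]
    return fused


def softmax_labels(
    region_logits: Dict[int, List[float]]
) -> Tuple[List[int], List[List[float]]]:
    # Softmax across regions at every point, then pick the most likely
    # region, so each point receives exactly one label.
    regions = sorted(region_logits)
    num_points = len(region_logits[regions[0]])
    labels, probs = [], []
    for p in range(num_points):
        column = [region_logits[r][p] for r in regions]
        peak = max(column)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in column]
        total = sum(exps)
        point_probs = [e / total for e in exps]
        probs.append(point_probs)
        labels.append(regions[point_probs.index(max(point_probs))])
    return labels, probs


# Toy input: four clicks over four points (values are made up).
click_logits = [
    [2.0, 0.1, -1.0, -2.0],   # first click on object 1
    [0.5, 1.5, -0.5, -1.0],   # second click on object 1
    [-1.0, -0.5, 3.0, 0.0],   # click on object 2
    [-2.0, -1.0, 0.0, 1.0],   # background click
]
click_regions = [1, 1, 2, 0]

fused = fuse_click_logits(click_logits, click_regions)
labels, probs = softmax_labels(fused)
print(labels)
```

Because the softmax is taken across regions at each point, the regions compete directly for every point, which is what yields a single, globally consistent label per point.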

Interactive Segmentation Tool

We provide an interactive tool that allows users to segment and annotate multiple 3D objects together in an open-world setting. Although the model was trained only on the ScanNet training set, it can be used to segment unseen datasets such as S3DIS and ARKitScenes, and even outdoor scans such as KITTI-360. Feel free to try your own scans!






  title     = {{AGILE3D: Attention Guided Interactive Multi-object 3D Segmentation}},
  author    = {Yue, Yuanwen and Mahadevan, Sabarinath and Schult, Jonas and Engelmann, Francis and Leibe, Bastian and Schindler, Konrad and Kontogianni, Theodora},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2024}


We sincerely thank all volunteers who participated in our user study. Francis Engelmann and Theodora Kontogianni are postdoctoral research fellows at the ETH AI Center. This project is partially funded by the ETH Career Seed Award - Towards Open-World 3D Scene Understanding, NeuroSys-D (03ZU1106DA) and BMBF projects 6GEM (16KISK036K).