Unsupervised semantic segmentation aims to discover and localize semantically meaningful categories within image corpora without any form of annotation. To solve this task, algorithms must produce features for every pixel that are both semantically meaningful and compact enough to form distinct clusters. Unlike previous works which achieve this with a single end-to-end framework, we propose to separate feature learning from cluster compactification. Empirically, we show that current unsupervised feature learning frameworks already generate dense features whose correlations are semantically consistent. This observation motivates us to design STEGO (Self-supervised Transformer with Energy-based Graph Optimization), a novel framework that distills unsupervised features into high-quality discrete semantic labels. At the core of STEGO is a novel contrastive loss function that encourages features to form compact clusters while preserving their relationships across the corpora. STEGO yields a significant improvement over the prior state of the art, on both the CocoStuff (+14 mIoU) and Cityscapes (+9 mIoU) semantic segmentation challenges.

Unsupervised semantic segmentation

Real-world images can be cluttered with multiple objects, which makes image-level classification feel arbitrary. Furthermore, objects in the real world don't always fit in bounding boxes. Semantic segmentation methods aim to avoid these challenges by assigning each pixel of an image its own class label. Conventional semantic segmentation methods are notoriously difficult to train because they depend on densely labeled images, which can take 100x longer to create than bounding boxes or class annotations. This makes it hard to gather sizable and diverse datasets in domains where humans don't know the structure a priori. We sidestep these challenges by learning an ontology of objects with pixel-level semantic segmentation through self-supervision alone.

Deep features connect objects across images

Self-supervised contrastive learning enables algorithms to learn intelligent representations for images without supervision. STEGO builds on this work by showing that representations from self-supervised visual transformers like Caron et al.’s DINO are already aware of the relationships between objects. By computing the cosine similarity between image features, we can see that similar semantic regions such as grass, motorcycles, and sky are “linked” together by feature similarity. We also show that these connections are a particular case of a broader theory connecting game theory, economics, and model explainability.
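The cosine-similarity computation described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the STEGO implementation: the toy two-dimensional vectors stand in for real backbone features, and the function names `normalize` and `feature_correspondence` are hypothetical.

```python
import math

def normalize(v):
    """Scale a feature vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def feature_correspondence(feats_a, feats_b):
    """Cosine similarity between every location in image A and every location in image B.

    feats_a, feats_b: lists of per-location feature vectors (a flattened H*W grid).
    Returns a len(feats_a) x len(feats_b) matrix of similarities.
    """
    a = [normalize(v) for v in feats_a]
    b = [normalize(v) for v in feats_b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]

# Toy stand-ins for dense features (assumed values, not real DINO outputs):
# image A has two "grass-like" locations and one "sky-like" location.
image_a = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]]
image_b = [[0.8, 0.0], [0.1, 0.9], [0.1, 1.0]]

corr = feature_correspondence(image_a, image_b)

# For each location in A, the index of the most similar location in B.
best_match = [max(range(len(row)), key=row.__getitem__) for row in corr]
```

In this toy case, locations with similar features are linked across the two images, which is the effect the paragraph describes for regions such as grass and sky.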

STEGO

The STEGO unsupervised segmentation system learns by distilling correspondences between images into a set of class labels using a contrastive loss. In particular, we aim to learn a segmentation that respects the induced correspondences between objects. To achieve this, we train a shallow segmentation network on top of the DINO ViT backbone with three contrastive terms that distill connections between an image and itself, similar images, and random other images, respectively. If two regions are strongly coupled by deep features, we encourage them to share the same class.
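One way to picture a single contrastive term is sketched below. The exact loss is not given on this page, so the form used here is an assumption: a pair of locations contributes `-(F - bias) * max(S, 0)`, where `F` is the backbone similarity, `S` is the similarity between segmentation-head features, and `bias` is a hypothetical threshold. The function name `correspondence_loss` and all numeric values are illustrative.

```python
def correspondence_loss(backbone_corr, head_corr, bias):
    """Average of -(F - bias) * max(S, 0) over all location pairs.

    backbone_corr (F): similarities computed from frozen backbone features.
    head_corr (S): similarities computed from the segmentation head's features.
    bias: assumed threshold; pairs with F > bias are rewarded for high S,
          pairs with F < bias are penalized for high S.
    """
    total, count = 0.0, 0
    for f_row, s_row in zip(backbone_corr, head_corr):
        for f, s in zip(f_row, s_row):
            total += -(f - bias) * max(s, 0.0)
            count += 1
    return total / count

# Backbone says location 0 matches 0 and location 1 matches 1.
F = [[0.9, 0.1], [0.1, 0.9]]

# A head that agrees with the backbone, and one that swaps the pairing.
S_agree = [[1.0, 0.0], [0.0, 1.0]]
S_swap = [[0.0, 1.0], [1.0, 0.0]]

loss_agree = correspondence_loss(F, S_agree, bias=0.3)
loss_swap = correspondence_loss(F, S_swap, bias=0.3)
```

Under this assumed form, the head that respects the backbone's correspondences gets the lower loss. The three terms described above would apply the same idea to different image pairs: an image with itself, with a similar image, and with a random other image.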

Results

We evaluate the STEGO algorithm on the CocoStuff, Cityscapes, and Potsdam semantic segmentation datasets. Because unsupervised methods see no labels, we use a Hungarian matching algorithm to find the best mapping between clusters and dataset classes. We find that STEGO is capable of segmenting complex and cluttered scenes with much higher spatial resolution and sensitivity than the prior art, PiCIE. This not only yields a substantial qualitative improvement, but also more than doubles the mean intersection over union (mIoU). For results on Cityscapes and Potsdam, see our paper.
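The evaluation protocol above can be illustrated with a small sketch. To stay dependency-free, it replaces the Hungarian algorithm with an exhaustive search over all one-to-one mappings, which gives the same optimum but only scales to a handful of classes. It also assumes the matching maximizes summed IoU, which may differ from the criterion used in the paper. The function names and toy labels are hypothetical.

```python
from itertools import permutations

def iou_matrix(pred, truth, k):
    """iou[c][g] = IoU between predicted cluster c and ground-truth class g."""
    inter = [[0] * k for _ in range(k)]
    pred_count, truth_count = [0] * k, [0] * k
    for p, t in zip(pred, truth):
        inter[p][t] += 1
        pred_count[p] += 1
        truth_count[t] += 1
    iou = []
    for c in range(k):
        row = []
        for g in range(k):
            union = pred_count[c] + truth_count[g] - inter[c][g]
            row.append(inter[c][g] / union if union else 0.0)
        iou.append(row)
    return iou

def best_mapping_miou(pred, truth, k):
    """Try every one-to-one cluster-to-class mapping and keep the best mIoU."""
    iou = iou_matrix(pred, truth, k)
    best = max(permutations(range(k)),
               key=lambda m: sum(iou[c][m[c]] for c in range(k)))
    return list(best), sum(iou[c][best[c]] for c in range(k)) / k

# Flattened per-pixel labels: cluster ids are arbitrary, class ids come from the dataset.
pred = [2, 2, 0, 0, 1, 1]
truth = [0, 0, 1, 1, 2, 1]

mapping, miou = best_mapping_miou(pred, truth, k=3)
# mapping[c] is the dataset class assigned to predicted cluster c.
```

For realistic class counts, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` applied to the negated IoU matrix would be the practical choice.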

Paper

Bibtex

@inproceedings{hamilton2022unsupervised,
    title={Unsupervised Semantic Segmentation by Distilling Feature Correspondences},
    author={Mark Hamilton and Zhoutong Zhang and Bharath Hariharan and Noah Snavely and William T. Freeman},
    booktitle={International Conference on Learning Representations},
    year={2022},
    url={https://openreview.net/forum?id=SaKO6z6Hl0c}
}

Related Projects

Axiomatic Explanations for Visual Search, Retrieval, and Similarity Learning


The feature correspondences of STEGO arise from a generalization of Shapley Values for contrastive image similarity networks. We explore this theory and show that it provides a unique axiomatic characterization of contrastive model explanation methods.

MosAIc: Finding Artistic Connections across Culture with Conditional Image Retrieval


We introduce a new K-Nearest Neighbor data structure to enable fast computation of an image's conditional nearest neighbors in deep feature space. We show the approach can find "hidden connections" in the visual arts, as well as "blind spots" in trained Generative Adversarial Networks.
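To make "conditional nearest neighbors" concrete, here is a brute-force sketch: filter the collection by a condition, then rank the survivors by distance to the query. It is not the data structure introduced in MosAIc, which exists precisely to avoid this linear scan. The function name, metadata keys, and feature values are all hypothetical.

```python
import math

def conditional_nearest_neighbors(query, items, condition, k=1):
    """Return metadata of the k items closest to `query` among those satisfying `condition`.

    items: list of (feature_vector, metadata) pairs.
    condition: predicate on metadata, e.g. lambda m: m["culture"] == "Egyptian".
    """
    candidates = [(math.dist(query, feat), meta)
                  for feat, meta in items if condition(meta)]
    candidates.sort(key=lambda pair: pair[0])
    return [meta for _, meta in candidates[:k]]

# Toy collection with assumed features and metadata.
items = [
    ([0.0, 0.0], {"id": "a", "culture": "Dutch"}),
    ([0.1, 0.0], {"id": "b", "culture": "Egyptian"}),
    ([1.0, 1.0], {"id": "c", "culture": "Egyptian"}),
]
query = [0.0, 0.0]

unconditional = conditional_nearest_neighbors(query, items, lambda m: True)
conditional = conditional_nearest_neighbors(
    query, items, lambda m: m["culture"] == "Egyptian")
```

The unconditional search returns the closest item overall, while the conditional search returns the closest item within the requested subset.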

Contact

For feedback, questions, or press inquiries, please contact Mark Hamilton.

Unsupervised Semantic Segmentation
by Distilling Feature Correspondences

ICLR 2022

Mark Hamilton, Zhoutong Zhang, Bharath Hariharan, Noah Snavely, William T. Freeman

