Abstract

Deep features are a cornerstone of computer vision research, capturing image semantics and enabling the community to solve downstream tasks even in the zero- or few-shot regime. However, these features often lack the spatial resolution to directly perform dense prediction tasks like segmentation and depth prediction because models aggressively pool information over large areas. In this work, we introduce FeatUp, a task- and model-agnostic framework to restore lost spatial information in deep features. We introduce two variants of FeatUp: one that guides features with high-resolution signal in a single forward pass, and one that fits an implicit model to a single image to reconstruct features at any resolution. Both approaches use a multi-view consistency loss with deep analogies to NeRFs. Our features retain their original semantics and can be swapped into existing applications to yield resolution and performance gains even without re-training. We show that FeatUp significantly outperforms other feature upsampling and image super-resolution approaches in class activation map generation, transfer learning for segmentation and depth prediction, and end-to-end training for semantic segmentation.
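To make the resolution gap concrete, the short sketch below computes the feature-grid size for two common backbone configurations. The backbones and the 224x224 input size are illustrative assumptions, not settings taken from this page.

```python
def feature_grid(image_hw, stride):
    """Spatial size of the feature map for a backbone with the given total stride."""
    height, width = image_hw
    return height // stride, width // stride

# Illustrative (assumed) configurations: total stride of each backbone's final features.
backbones = {"ViT with 16x16 patches": 16, "ResNet-50 final stage": 32}

for name, stride in backbones.items():
    grid_h, grid_w = feature_grid((224, 224), stride)
    print(f"{name}: {grid_h}x{grid_w} features, each covering {stride * stride} pixels")
```

Under these assumptions, a single feature vector summarizes hundreds of input pixels, which is the kind of detail loss that dense prediction tasks are sensitive to.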

FeatUp: Upsampling Model Representations with Self-Supervision

FeatUp upsamples any deep network's features to arbitrary resolution while retaining the original semantics. We learn a high-res feature map by enforcing consistency across many low-res "views", which are formed by perturbing and featurizing the input image.
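The idea of multi-view consistency can be sketched with a toy example. Everything below is an assumption made for illustration: the "backbone" is plain average pooling on a single-channel image, the perturbations are small circular shifts, and the downsampler is also average pooling. This page does not specify those components, so the sketch shows only the shape of the objective, not FeatUp's implementation.

```python
import random

def roll(img, dy, dx):
    """Circularly shift a 2D list by (dy, dx); stands in for a small jitter."""
    h, w = len(img), len(img[0])
    return [[img[(y - dy) % h][(x - dx) % w] for x in range(w)] for y in range(h)]

def avg_pool(img, s):
    """Average-pool a 2D list with an s x s window and stride s."""
    h, w = len(img), len(img[0])
    return [[sum(img[y + i][x + j] for i in range(s) for j in range(s)) / (s * s)
             for x in range(0, w, s)] for y in range(0, h, s)]

def upsample_nearest(feat, s):
    """Blocky baseline: repeat every low-res value s times along both axes."""
    return [[feat[y // s][x // s] for x in range(len(feat[0]) * s)]
            for y in range(len(feat) * s)]

def consistency_loss(hi_res, image, shifts, s):
    """Mean squared error between each low-res view and the matching downsampled hi-res map."""
    total, count = 0.0, 0
    for dy, dx in shifts:
        target = avg_pool(roll(image, dy, dx), s)   # toy "backbone" applied to the perturbed image
        pred = avg_pool(roll(hi_res, dy, dx), s)    # same perturbation on the hi-res map, then downsample
        for row_t, row_p in zip(target, pred):
            for t, p in zip(row_t, row_p):
                total += (t - p) ** 2
                count += 1
    return total / count

random.seed(0)
s = 4
image = [[random.random() for _ in range(16)] for _ in range(16)]
shifts = [(dy, dx) for dy in range(s) for dx in range(s)]

loss_true = consistency_loss(image, image, shifts, s)
loss_blocky = consistency_loss(upsample_nearest(avg_pool(image, s), s), image, shifts, s)
print(loss_true, loss_blocky)
```

In this toy setup the true high-res map gives zero loss, while the blocky nearest-neighbor guess does not: it matches the unshifted view but disagrees with the jittered ones, which is how many low-res views can constrain detail finer than a single feature cell.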

Inspired by NeRF, which learns an implicit scene representation by enforcing image consistency across multiple views, we learn a view-consistent implicit network that outputs features at any queried resolution. This upsampler can also be parameterized as a feedforward module that is usable in any existing pipeline and trainable end-to-end.
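To illustrate what "any queried resolution" means for an implicit model, here is a minimal sketch of a coordinate network that maps a continuous position to a feature vector. The class name, the Fourier encoding, the layer sizes, and the random (untrained) weights are all hypothetical choices for illustration; the architecture FeatUp uses is not described on this page.

```python
import math
import random

def fourier_features(x, y, num_freqs=4):
    """Encode a coordinate in [0, 1]^2 with sines and cosines at several frequencies."""
    feats = []
    for k in range(num_freqs):
        freq = (2 ** k) * math.pi
        feats += [math.sin(freq * x), math.cos(freq * x),
                  math.sin(freq * y), math.cos(freq * y)]
    return feats

class ImplicitFeatureField:
    """Toy coordinate MLP: (x, y) -> feature vector. Weights are random, not trained."""

    def __init__(self, feat_dim=3, hidden=16, num_freqs=4, seed=0):
        rng = random.Random(seed)
        in_dim = 4 * num_freqs
        self.num_freqs = num_freqs
        self.w1 = [[rng.gauss(0, 1 / math.sqrt(in_dim)) for _ in range(in_dim)]
                   for _ in range(hidden)]
        self.w2 = [[rng.gauss(0, 1 / math.sqrt(hidden)) for _ in range(hidden)]
                   for _ in range(feat_dim)]

    def __call__(self, x, y):
        enc = fourier_features(x, y, self.num_freqs)
        hid = [max(0.0, sum(w * e for w, e in zip(row, enc))) for row in self.w1]  # ReLU layer
        return [sum(w * h for w, h in zip(row, hid)) for row in self.w2]

    def render(self, height, width):
        """Query the field at the pixel centers of a height x width grid."""
        return [[self((j + 0.5) / width, (i + 0.5) / height) for j in range(width)]
                for i in range(height)]

field = ImplicitFeatureField()
low = field.render(14, 14)    # same function, coarse grid
high = field.render(56, 56)   # same function, finer grid
print(len(low), len(high), len(high[0][0]))
```

Because the model is a function of continuous coordinates, the output grid size is chosen at query time rather than fixed by the network. In a real setting the weights would be fit to one image using a consistency objective instead of being left random.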

Results

Above: both variants of FeatUp (implicit and feedforward JBU module) resolve high-res details that other methods cannot. Additionally, our features lie in the same space as the input features, making them usable in downstream architectures without re-training.
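For readers unfamiliar with JBU, the sketch below shows classical joint bilateral upsampling on a 1D signal: each high-res output is a weighted average of nearby low-res values, with weights that depend on spatial distance and on similarity in a high-res guidance signal. The function name, the fixed Gaussian kernels, and the parameter values are assumptions for illustration; FeatUp's feedforward module may differ, for example in how its kernels are parameterized.

```python
import math

def jbu_1d(low_feat, guide, scale, sigma_spatial=1.0, sigma_range=0.1, radius=2):
    """Joint bilateral upsampling of a 1D low-res signal, steered by a high-res guide.

    Assumes len(guide) == len(low_feat) * scale.
    """
    out = []
    for p in range(len(guide)):
        center = (p + 0.5) / scale - 0.5              # position of p in low-res coordinates
        nearest = int(round(center))
        num, den = 0.0, 0.0
        for q in range(max(0, nearest - radius), min(len(low_feat), nearest + radius + 1)):
            guide_q = guide[q * scale + scale // 2]   # guide value near the center of cell q
            w_spatial = math.exp(-((center - q) ** 2) / (2 * sigma_spatial ** 2))
            w_range = math.exp(-((guide[p] - guide_q) ** 2) / (2 * sigma_range ** 2))
            num += w_spatial * w_range * low_feat[q]
            den += w_spatial * w_range
        out.append(num / den)
    return out

# The guide has a sharp edge at index 6, which falls inside the second low-res cell.
guide = [0.0] * 6 + [1.0] * 10
low_feat = [0.0, 0.5, 1.0, 1.0]   # average-pooled version of a signal with the same edge
up = jbu_1d(low_feat, guide, scale=4)
print([round(v, 2) for v in up])
```

Nearest-neighbor upsampling would assign the blurred value 0.5 to both sides of the edge inside the second cell. Here the range weight suppresses low-res cells whose guidance differs from the output pixel, so the upsampled signal changes abruptly where the guide does.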

Above: Upsampled features from a variety of vision backbones. FeatUp introduces spatial resolution while preserving semantics.

Downstream Evaluations

We evaluate FeatUp on a variety of downstream tasks from the broader literature, including linear probe transfer learning, where features are directly used for depth estimation and semantic segmentation. Additionally, we evaluate CAM quality and weakly-supervised object localization performance. Across these tasks, our methods improve performance both qualitatively and quantitatively; see our supplementary material for more examples.
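As a reminder of what a linear probe involves, the sketch below fits a per-pixel logistic regression on frozen feature vectors, so only a weight vector and a bias are learned. The synthetic two-dimensional features, the binary labels, and the training settings are assumptions for illustration; in practice the features would come from a frozen backbone, optionally upsampled, and this is not the evaluation code behind the reported results.

```python
import math
import random

def train_linear_probe(features, labels, epochs=200, lr=0.5):
    """Fit per-pixel logistic regression on frozen features; only w and b are learned."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for f, y in zip(features, labels):
            logit = sum(wi * fi for wi, fi in zip(w, f)) + b
            p = 1.0 / (1.0 + math.exp(-logit))
            for i in range(dim):
                grad_w[i] += (p - y) * f[i] / n
            grad_b += (p - y) / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, f):
    """Predict the binary class of one pixel from its feature vector."""
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0

# Synthetic 8x8 "feature map": the left half is class 0, the right half is class 1.
rng = random.Random(0)
features, labels = [], []
for y in range(8):
    for x in range(8):
        label = 1 if x >= 4 else 0
        mean = 1.0 if label else -1.0
        features.append([mean + rng.gauss(0, 0.3), rng.gauss(0, 0.3)])
        labels.append(label)

w, b = train_linear_probe(features, labels)
accuracy = sum(predict(w, b, f) == y for f, y in zip(features, labels)) / len(labels)
print(f"probe accuracy: {accuracy:.2f}")
```

Because the probe is linear and the features stay fixed, the quality of the per-pixel predictions reflects the features themselves, which is why this setup is a common way to compare feature maps.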

Paper

BibTeX

@inproceedings{
    fu2024featup,
    title={FeatUp: A Model-Agnostic Framework for Features at Any Resolution},
    author={Stephanie Fu and Mark Hamilton and Laura E. Brandt and Axel Feldmann and Zhoutong Zhang and William T. Freeman},
    booktitle={The Twelfth International Conference on Learning Representations},
    year={2024},
    url={https://openreview.net/forum?id=GkJiNn2QDF}
}

Related Projects

Unsupervised Semantic Segmentation by Distilling Feature Correspondences

We show that inner products between deep features hold a key to solving unsupervised semantic segmentation. In particular, we distill these features into high-quality unsupervised semantic segmentations.

Axiomatic Explanations for Visual Search, Retrieval, and Similarity Learning

We show that inner products between deep vision features can be interpreted as a generalization of Shapley Values for contrastive image similarity networks. We explore this theory and show that it provides a unique axiomatic characterization of contrastive model explanation methods.

Contact

For feedback, questions, or press inquiries, please contact Mark Hamilton and Stephanie Fu.

FeatUp: A Model-Agnostic Framework for Features at Any Resolution

ICLR 2024

Stephanie Fu*, Mark Hamilton*, Laura Brandt, Axel Feldmann, Zhoutong Zhang, William T. Freeman

*Equal Contribution

Examples

Any Backbone:

Improve Downstream Tasks without Retraining:

Upsamples Every Feature Dimension:
