Separating the "Chirp" from the "Chat":
Self-supervised Visual Grounding
of Sound and Language

CVPR 2024

Paper Code 🤗 Demo Colab Notebook Dataset

Mark Hamilton, Andrew Zisserman, John R. Hershey, William T. Freeman

TL;DR: Our model, DenseAV, learns the meaning of words and the location of sounds (visual grounding) without supervision or text.

Examples

DenseAV Discovers Language from Watching Videos:
(Unmute videos for Audio)

DenseAV Localizes Sound without Supervision:

Unsupervised Disentanglement of Sound and Language:

Abstract

We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos. We show that DenseAV can discover the "meaning" of words and the "location" of sounds without explicit localization supervision. Furthermore, it automatically discovers and distinguishes between these two types of associations without supervision. We show that DenseAV's localization abilities arise from a new multi-head feature aggregation operator that directly compares dense image and audio representations for contrastive learning. In contrast, many other systems that learn "global" audio and video representations cannot localize words and sounds. Finally, we contribute two new datasets to improve the evaluation of AV representations through speech and sound prompted semantic segmentation. On these and other datasets, we show that DenseAV dramatically outperforms the prior art on speech and sound prompted semantic segmentation. DenseAV also outperforms the previous state-of-the-art, ImageBind, on cross-modal retrieval using fewer than half of the parameters.

Audio-Video Contrastive Learning

DenseAV can learn the meaning of words and the location of sounds using only self-supervision from video. To learn these patterns, DenseAV uses audio-video contrastive learning to associate sound with the visual world. Intuitively speaking, it's much easier to predict what you are seeing from what you are hearing when you understand language and can recognize sounds. This is how DenseAV can learn without labels.
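To make this concrete, here is a minimal sketch of an InfoNCE-style audio-video contrastive objective: audio and video features from the same clip are pulled together, while all other pairings in the batch act as negatives. This is a generic illustration of the idea, not DenseAV's exact loss; the clip-level pooling and temperature value are illustrative assumptions.

import torch
import torch.nn.functional as F

def av_contrastive_loss(audio_feats, video_feats, temperature=0.07):
    """audio_feats, video_feats: (B, D) clip-level embeddings where row i
    of each tensor comes from the same video clip."""
    a = F.normalize(audio_feats, dim=-1)
    v = F.normalize(video_feats, dim=-1)
    logits = a @ v.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(a.shape[0], device=a.device)  # positives on the diagonal
    # Symmetric cross-entropy: audio-to-video and video-to-audio retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))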

Most Contrastive Learners Cannot Localize Sound or Language

Interestingly, contrastive learning with CLS tokens or average-pooled representations isn't enough to localize objects from sound and language. DenseAV uses a contrastive similarity based on inner products between local audio and visual representation tokens. This dramatically improves its ability to localize information.
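The sketch below contrasts the two choices: a "global" similarity that pools each modality before the inner product (and therefore has nothing left to localize with), versus a dense similarity that compares every audio token with every image patch. The max-then-mean aggregation is one plausible way to collapse the similarity volume into a clip-level score and is an assumption here, not necessarily DenseAV's exact operator.

import torch

def global_similarity(audio_tokens, image_tokens):
    """audio_tokens: (T, D), image_tokens: (HW, D).
    Pooling before the inner product discards localization information."""
    return audio_tokens.mean(dim=0) @ image_tokens.mean(dim=0)

def dense_similarity(audio_tokens, image_tokens):
    """Compare every audio token with every image patch, then aggregate.
    The (T, HW) similarity volume is what lets the model point at the
    patch responsible for each word or sound."""
    sim_volume = audio_tokens @ image_tokens.t()  # (T, HW)
    return sim_volume.max(dim=1).values.mean()    # best patch per audio token, then average over time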

Unsupervised Disentanglement of Sound and Language

There are many ways a sound can be related to a visual object. For instance, the word "dog" and the sound of a bark both conjure the image of a dog despite being very different types of sound. In analogy with multi-head attention, we provide DenseAV with multiple features for computing inner products. Amazingly, DenseAV naturally organizes its features into sound features and language features without knowing a priori what is sound and what is language.
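A minimal sketch of the multi-head idea follows: each head projects the audio and visual tokens separately, builds its own dense similarity volume, and the heads are combined additively. The head count, projection sizes, and summation are illustrative assumptions rather than DenseAV's exact parameterization.

import torch
import torch.nn as nn

class MultiHeadDenseSimilarity(nn.Module):
    def __init__(self, dim=512, num_heads=2):
        super().__init__()
        self.audio_proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_heads)])
        self.image_proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_heads)])

    def forward(self, audio_tokens, image_tokens):
        """audio_tokens: (T, D), image_tokens: (HW, D).
        Returns per-head (T, HW) similarity volumes and their combined score.
        Trained without labels, one head can come to handle language
        (the word "dog") and another natural sound (a bark)."""
        volumes = []
        for a_proj, i_proj in zip(self.audio_proj, self.image_proj):
            volumes.append(a_proj(audio_tokens) @ i_proj(image_tokens).t())
        combined = torch.stack(volumes).sum(dim=0)  # heads contribute additively
        return volumes, combined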

Paper

Bibtex

@misc{hamilton2024separating,
     title={Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language},
     author={Mark Hamilton and Andrew Zisserman and John R. Hershey and William T. Freeman},
     year={2024},
     eprint={2406.05629},
     archivePrefix={arXiv},
     primaryClass={cs.CV}
}

Contact

For feedback, questions, or press inquiries, please contact Mark Hamilton.