TriCoLo: Trimodal Contrastive Loss for Fine-grained Text to Shape Retrieval

Yue Ruan*, Han-Hung Lee*, Ke Zhang, Angel X. Chang


Recent work on contrastive losses for learning joint embeddings over multi-modal data has been successful at downstream tasks such as retrieval and classification. In contrast, work on joint representation learning for 3D shapes and text has so far mostly focused on improving embeddings by modeling complex attention between representations or through multi-task learning. We show that large-batch contrastive learning achieves state-of-the-art results on text-shape retrieval without complex attention mechanisms or losses. Prior work on 3D and text representations has also focused on bimodal representation learning, pairing text with either voxels or multi-view images. We instead propose a trimodal learning scheme that achieves even higher performance and better representations for all modalities.
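As a rough illustration of the contrastive objective described above, the following NumPy sketch computes a symmetric InfoNCE-style loss between two batches of embeddings and sums it over modality pairs. The function names, the temperature value of 0.1, and the choice to sum over all three pairs are assumptions made for this example; they are not taken from the released code.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(a, b, temperature=0.1):
    """InfoNCE-style loss where row i of `a` and row i of `b` form the positive pair.

    All other rows in the batch act as negatives, which is why larger
    batches supply more negatives per positive. `temperature` is an
    assumed value, not one reported in the text above.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(len(a))                  # positives sit on the diagonal
    loss_a_to_b = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_b_to_a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a_to_b + loss_b_to_a)


def trimodal_loss(text, image, voxel, temperature=0.1):
    """Assumed trimodal objective: sum of the pairwise losses over all three pairs."""
    return (
        symmetric_contrastive_loss(text, image, temperature)
        + symmetric_contrastive_loss(text, voxel, temperature)
        + symmetric_contrastive_loss(image, voxel, temperature)
    )
```

With correctly aligned batches the loss is small, and it grows when the rows of one modality are shuffled relative to the other, which is the signal that pulls matching text and shape embeddings together.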



For each modality, we define an encoder that takes the input and outputs an embedding. The text encoder is a bidirectional Gated Recurrent Unit (Bi-GRU) that takes a text description and outputs a text embedding. For voxels, we use a 3D CNN that takes a 3D voxel grid and outputs a voxel embedding. Finally, the image encoder processes M views of the object with an MVCNN architecture and a pretrained ResNet18 backbone to obtain the image embedding. Model variants use either two modalities (Bi) or all three (Tri). The bimodal models pair text with either images (I) or voxels (V); the trimodal model uses text, images, and voxels (I+V).
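The sketch below illustrates two pieces of this setup in NumPy: how per-view image features might be aggregated into a single embedding, and which modality pairs each model variant would contrast. It assumes MVCNN-style element-wise max pooling across the M views and assumes that every pair of available modalities is contrasted; the names `VARIANTS`, `pool_views`, and `modality_pairs` are hypothetical and chosen only for this example.

```python
from itertools import combinations

import numpy as np

# Modalities used by each model variant, following the naming in the text.
VARIANTS = {
    "Bi(I)": ("text", "image"),
    "Bi(V)": ("text", "voxel"),
    "Tri(I+V)": ("text", "image", "voxel"),
}


def pool_views(view_features):
    """Collapse per-view features of shape (batch, M, dim) into (batch, dim).

    Uses an element-wise max over the M views, as in the original MVCNN
    design. Whether this model uses max or another aggregation is an
    assumption of this sketch.
    """
    view_features = np.asarray(view_features)
    if view_features.ndim != 3:
        raise ValueError("expected an array of shape (batch, M, dim)")
    return view_features.max(axis=1)


def modality_pairs(variant):
    """List the modality pairs a variant would contrast (assumed: all pairs)."""
    return list(combinations(VARIANTS[variant], 2))
```

For example, `modality_pairs("Bi(V)")` yields the single text-voxel pair, while the trimodal variant yields three pairs, one for each combination of text, image, and voxel embeddings.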



@misc{ruan2022tricolo,
  title={TriCoLo: Trimodal Contrastive Loss for Fine-grained Text to Shape Retrieval},
  author={Yue Ruan and Han-Hung Lee and Ke Zhang and Angel X. Chang},
  year={2022},
  eprint={2201.07366},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}


This work is funded by the Canada CIFAR AI Chair program and an NSERC Discovery Grant. This research was enabled in part by support provided by WestGrid and Compute Canada.