Text-to-3D Shape Generation Paper List
Eurographics STAR 2024
Han-Hung Lee¹, Manolis Savva¹ and Angel Xuan Chang¹,²
¹ Simon Fraser University  ² Canada-CIFAR AI Chair, Amii
Abstract
Recent years have seen an explosion of work and interest in text-to-3D shape generation. Much of the progress is driven by advances in 3D representations, large-scale pretraining and representation learning for text and image data enabling generative AI models, and differentiable rendering. Computational systems that can perform text-to-3D shape generation have captivated the popular imagination as they enable non-expert users to easily create 3D content directly from text. However, there are still many limitations and challenges remaining in this problem space. In this state-of-the-art report, we provide a survey of the underlying technology and methods enabling text-to-3D shape generation to summarize the background literature. We then derive a systematic categorization of recent work on text-to-3D shape generation based on the type of supervision data required. Finally, we discuss limitations of the existing categories of methods, and delineate promising directions for future work.
We list the datasets commonly used to train these methods below.
The methods are divided into four families, covered in the sections that follow: 1) Paired Text to 3D (3DPT); 2) Unpaired 3D Data (3DUT); 3) Text-to-3D without 3D data (NO3D); and 4) Hybrid3D.
Finally, we include works focused on generating multi-object 3D scenes, editing 3D shapes, and evaluating text-to-3D methods.
Datasets
3D
- ShapeNet: An Information-Rich 3D Model Repository,
Chang et al., Arxiv 2015
Website
- ABO: Dataset and Benchmarks for Real-World 3D Object Understanding,
Collins et al., CVPR 2022
Website
- Objaverse: A Universe of Annotated 3D Objects,
Deitke et al., CVPR 2023
Website
- Objaverse-XL: A Universe of 10M+ 3D Objects,
Deitke et al., NeurIPS 2023
Website
Text-3D
- Text2Shape: Generating Shapes from Natural Language by Learning Joint Embeddings,
Chen et al., Arxiv 2018
- ShapeGlot: Learning Language for Shape Differentiation,
Achlioptas et al., ICCV 2019
- ShapeTalk: A Language Dataset and Framework for 3D Shape Edits and Deformations,
Achlioptas et al., CVPR 2023
- OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding,
Liu et al., NeurIPS 2023
- Scalable 3D Captioning with Pretrained Models,
Luo et al., NeurIPS 2023
Paired Text to 3D (3DPT)
- Text2Shape: Generating Shapes from Natural Language by Learning Joint Embeddings,
Chen et al., Arxiv 2018
- Towards Implicit Text-Guided 3D Shape Generation,
Liu et al., CVPR 2022
Autoregressive Prior
- AutoSDF: Shape Priors for 3D Completion, Reconstruction and Generation,
Mittal et al., CVPR 2022
- ShapeCrafter: A Recursive Text-Conditioned 3D Shape Generation Model,
Fu et al., NeurIPS 2022
Diffusion Prior
- SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation,
Cheng et al., CVPR 2023
- Diffusion-SDF: Text-to-Shape via Voxelized Diffusion,
Li et al., CVPR 2023
- 3DQD: Generalized Deep 3D Shape Prior via Part-Discretized Diffusion Process,
Li et al., CVPR 2023
- Shap-E: Generating Conditional 3D Implicit Functions,
Jun et al., Arxiv 2023
- Michelangelo: Conditional 3D Shape Generation based on Shape-Image-Text Aligned Latent Representation,
Zhao et al., NeurIPS 2023
Structure Aware
- ShapeScaffolder: Structure-Aware 3D Shape Generation from Text,
Tian et al., ICCV 2023
- SALAD: Part-Level Latent Diffusion for 3D Shape Generation and Manipulation,
Koo et al., ICCV 2023
- Neural Shape Compiler: A Unified Framework for Transforming between Text, Point Cloud, and Program,
Luo et al., Arxiv 2022
Unpaired 3D Data (3DUT)
- CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation,
Sanghi et al., CVPR 2022
- CLIP-Sculptor: Zero-Shot Generation of High-Fidelity and Diverse Shapes from Natural Language,
Sanghi et al., CVPR 2023
- ISS: Image as Stepping Stone for Text-Guided 3D Shape Generation,
Liu et al., ICLR 2023
- TAPS3D: Text-Guided 3D Textured Shape Generation from Pseudo Supervision,
Wei et al., CVPR 2023
Text-to-3D without 3D data (NO3D)
Unsupervised CLIP Guidance
- Zero-Shot Text-Guided Object Generation with Dream Fields,
Jain et al., CVPR 2022
- CLIP-Mesh: Generating textured meshes from text using pretrained image-text models,
Khalid et al., SIGGRAPH Asia 2022
- Understanding Pure CLIP Guidance for Voxel Grid NeRF Models,
Lee et al., Arxiv 2022
- Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models,
Xu et al., CVPR 2023
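
The papers in this subsection share one recipe: repeatedly render a randomly posed view of a differentiable 3D representation, and take gradient steps that increase the CLIP similarity between the rendering and the text prompt. The sketch below illustrates that loop; it is not any one paper's implementation, and the direct pixel-space `scene` parameter is a deliberately simplified stand-in for what the actual methods optimize through a renderer (NeRF weights in Dream Fields, mesh vertices and texture in CLIP-Mesh).

```python
# Minimal sketch of unsupervised CLIP guidance, assuming OpenAI's CLIP package
# (pip install git+https://github.com/openai/CLIP.git). A real system would
# replace `scene` + torch.sigmoid with a differentiable 3D renderer.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()  # avoid fp16 weights when backpropagating through CLIP
for p in model.parameters():
    p.requires_grad_(False)  # CLIP stays frozen; gradients flow to the scene only

with torch.no_grad():
    text_feat = model.encode_text(clip.tokenize(["a matte red armchair"]).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

# Stand-in "scene": raw pixels. Dream Fields would hold NeRF MLP weights here.
scene = torch.nn.Parameter(torch.randn(1, 3, 224, 224, device=device) * 0.1)
optimizer = torch.optim.Adam([scene], lr=1e-2)

for step in range(500):
    image = torch.sigmoid(scene)  # a real loop renders a fresh random view here
    img_feat = model.encode_image(image)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    loss = -(img_feat * text_feat).sum()  # maximize CLIP image-text similarity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

On its own this objective tends to produce degenerate, view-inconsistent geometry, which is why the papers above add random augmentations and transmittance regularization (Dream Fields) or initialize from a 3D shape prior (Dream3D).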
Unsupervised Diffusion Guidance
Loss Formulation
- DreamFusion: Text-to-3D using 2D Diffusion,
Poole et al., ICLR 2023
- Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D Generation,
Wang et al., CVPR 2023
- ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation,
Wang et al., NeurIPS 2023
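
As background for this subsection: DreamFusion introduced score distillation sampling (SDS), which optimizes the parameters θ of a differentiable 3D representation so that renderings x = g(θ) are judged likely by a frozen text-to-image diffusion model ε̂_φ conditioned on the prompt y. Its gradient, as given in the paper, drops the diffusion U-Net Jacobian:

```latex
% SDS gradient (DreamFusion); x_t = \alpha_t x + \sigma_t \epsilon is the
% rendering noised to timestep t, and w(t) is a timestep weighting.
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}\big(\phi,\, x = g(\theta)\big)
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\big(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\big)\,
      \tfrac{\partial x}{\partial \theta}
    \right]
```

Score Jacobian Chaining derives a closely related update from the score-matching perspective, while ProlificDreamer's variational score distillation (VSD) optimizes a distribution over θ rather than a point estimate, mitigating the over-saturation and low diversity typical of plain SDS.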
3D Representation Improvements
- Magic3D: High-Resolution Text-to-3D Content Creation,
Lin et al., CVPR 2023
- TextMesh: Generation of Realistic 3D Meshes From Text Prompts,
Tsalicoglou et al., 3DV 2024
- Fantasia3D: Disentangling Geometry and Appearance for High-quality Text-to-3D Content Creation,
Chen et al., ICCV 2023
- DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation,
Tang et al., ICLR 2024
- Text-to-3D using Gaussian Splatting,
Chen et al., CVPR 2024
- GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models,
Yi et al., CVPR 2024
Janus Problem Mitigation
(The Janus problem: because the 2D diffusion prior favors canonical front views, lifted 3D objects often repeat the "front" from several directions, e.g. animals generated with multiple faces.)
- Debiasing Scores and Prompts of 2D Diffusion for View-consistent Text-to-3D Generation,
Hong et al., NeurIPS 2023
- Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation,
Seo et al., ICLR 2024
- Re-imagine the Negative Prompt Algorithm: Transform 2D Diffusion into 3D, alleviate Janus problem and Beyond,
Armandpour et al., Arxiv 2023
Generative Modeling
- ATT3D: Amortized Text-to-3D Object Synthesis,
Lorraine et al., ICCV 2023
- Instant3D: Instant Text-to-3D Generation,
Li et al., Arxiv 2023
Further Reading
- LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching,
Liang et al., CVPR 2024
- Noise-Free Score Distillation,
Katzir et al., Arxiv 2023
- SteinDreamer: Variance Reduction for Text-to-3D Score Distillation via Stein Identity,
Wang et al., Arxiv 2023
- Text-to-3D with Classifier Score Distillation,
Yu et al., ICLR 2024
- Taming Mode Collapse in Score Distillation for Text-to-3D Generation,
Wang et al., Arxiv 2023
- Stable Score Distillation for High-Quality 3D Generation,
Tang et al., Arxiv 2023
- DreamPropeller: Supercharge Text-to-3D Generation with Parallel Sampling,
Zhou et al., Arxiv 2023
- Enhancing High-Resolution 3D Generation through Pixel-wise Gradient Clipping,
Pan et al., ICLR 2024
- Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior,
Wu et al., Arxiv 2024
Open Source Text-to-3D Re-implementations
- stable-dreamfusion
- threestudio
Hybrid3D
- Point-E: A System for Generating 3D Point Clouds from Complex Prompts,
Nichol et al., Arxiv 2022
3D-aware T2I
Text Conditioning
- MVDream: Multi-view Diffusion for 3D Generation,
Shi et al., Arxiv 2023
- SweetDreamer: Aligning Geometric Priors in 2D Diffusion for Consistent Text-to-3D,
Li et al., Arxiv 2023
- Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion,
Lu et al., Arxiv 2023
- UniDream: Unifying Diffusion Priors for Relightable Text-to-3D Generation,
Liu et al., Arxiv 2023
- Instant3D: Fast Text-to-3D with Sparse-View Generation and Large Reconstruction Model,
Li et al., ICLR 2024
Image Conditioning
- Zero-1-to-3: Zero-shot One Image to 3D Object,
Liu et al., ICCV 2023
- SyncDreamer: Generating Multiview-consistent Images from a Single-view Image,
Liu et al., Arxiv 2023
- Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model,
Shi et al., Arxiv 2023
- Wonder3D: Single Image to 3D using Cross-Domain Diffusion,
Long et al., Arxiv 2023
- LRM: Large Reconstruction Model for Single Image to 3D,
Hong et al., ICLR 2024
- One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization,
Liu et al., NeurIPS 2023
- One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion,
Liu et al., Arxiv 2023
Further Reading
- DMV3D: Denoising Multi-View Diffusion using 3D Large Reconstruction Model,
Xu et al., Arxiv 2023
Multi Object Scene Generation
Compositional Generation
- Set-the-Scene: Global-Local Training for Generating Controllable NeRF Scenes,
Cohen-Bar et al., ICCVW 2023
- CompoNeRF: Text-guided Multi-object Compositional NeRF with Editable 3D Scene Layout,
Bai et al., Arxiv 2023
- Compositional 3D Scene Generation using Locally Conditioned Diffusion,
Po et al., Arxiv 2023
- CG3D: Compositional Generation for Text-to-3D via Gaussian Splatting,
Vilesov et al., Arxiv 2023
- GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs,
Gao et al., CVPR 2024
- SceneWiz3D: Towards Text-guided 3D Scene Composition,
Zhang et al., Arxiv 2023
RGBD Fusion for Scenes
- SceneScape: Text-Driven Consistent Scene Generation,
Fridman et al., NeurIPS 2023
- Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models,
Höllein et al., ICCV 2023
- Text2NeRF: Text-Driven 3D Scene Generation with Neural Radiance Fields,
Zhang et al., TVCG 2024
Editing
Shape Editing with CLIP
- CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields,
Wang et al., CVPR 2022
- Text2Mesh: Text-Driven Neural Stylization for Meshes,
Michel et al., CVPR 2022
Scene Editing with Text-to-image Models
- SKED: Sketch-guided Text-based 3D Editing,
Mikaeili et al., ICCV 2023
- Vox-E: Text-guided Voxel Editing of 3D Objects,
Sella et al., ICCV 2023
- Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions,
Haque et al., ICCV 2023
- Instruct 3D-to-3D: Text Instruction Guided 3D-to-3D conversion,
Kamata et al., Arxiv 2023
- RoomDreamer: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture,
Song et al., Arxiv 2023
Texturing
- TEXTure: Text-Guided Texturing of 3D Shapes,
Richardson et al., SIGGRAPH 2023
- Text2Tex: Text-driven Texture Synthesis via Diffusion Models,
Chen et al., ICCV 2023
- SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors,
Chen et al., Arxiv 2023
Evaluation
- GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation,
Yang et al., CVPR 2024
Citing
@misc{lee2024textto3d,
title={Text-to-3D Shape Generation},
author={Han-Hung Lee and Manolis Savva and Angel X. Chang},
year={2024},
eprint={2403.13289},
archivePrefix={arXiv},
primaryClass={cs.CV}
}