CityDreamer: Compositional Generative Model of Unbounded 3D Cities

arXiv: 2309.00610

Published 6/7/2024 by Haozhe Xie, Zhaoxi Chen, Fangzhou Hong, Ziwei Liu

Abstract

3D city generation is a desirable yet challenging task, since humans are more sensitive to structural distortions in urban environments. Additionally, generating 3D cities is more complex than 3D natural scenes since buildings, as objects of the same class, exhibit a wider range of appearances compared to the relatively consistent appearance of objects like trees in natural scenes. To address these challenges, we propose CityDreamer, a compositional generative model designed specifically for unbounded 3D cities. Our key insight is that 3D city generation should be a composition of different types of neural fields: 1) various building instances, and 2) background stuff, such as roads and green lands. Specifically, we adopt the bird's eye view scene representation and employ a volumetric render for both instance-oriented and stuff-oriented neural fields. The generative hash grid and periodic positional embedding are tailored as scene parameterization to suit the distinct characteristics of building instances and background stuff. Furthermore, we contribute a suite of CityGen Datasets, including OSM and GoogleEarth, which comprises a vast amount of real-world city imagery to enhance the realism of the generated 3D cities both in their layouts and appearances. CityDreamer achieves state-of-the-art performance not only in generating realistic 3D cities but also in localized editing within the generated cities.

Overview

  • 3D city generation is a challenging task due to human sensitivity to structural distortions in urban environments and the wider range of building appearances compared to natural scenes.
  • To address these challenges, the researchers propose CityDreamer, a compositional generative model designed specifically for 3D city generation.
  • The key insight is that 3D city generation should be a composition of different types of neural fields: building instances and background stuff like roads and green spaces.

Plain English Explanation

The researchers developed a system called CityDreamer to generate realistic 3D cities. Generating 3D cities is more complex than generating natural 3D scenes because buildings can have a wide variety of appearances, while objects in nature tend to look more similar.

The researchers' approach involves breaking down the 3D city into two main components: the individual buildings and the background elements like roads and parks. They use specialized techniques to model each of these components, which allows the system to create more believable and diverse 3D cities.

The researchers also created a large dataset of real-world city imagery, called the CityGen Datasets, to help the system generate cities that look and feel more realistic.

Technical Explanation

The researchers propose CityDreamer, a compositional generative model for 3D city generation. The key insight is that 3D city generation should be a composition of different types of neural fields: 1) building instances and 2) background stuff, such as roads and green lands.
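The composition idea can be illustrated with a small sketch. It assumes, purely for illustration, that the background and each building instance have already been rendered to images and that every building comes with a per-pixel mask. The function name `composite` and the mask-based blending are hypothetical and are not taken from the paper's code.

```python
from typing import List, Tuple

Pixel = Tuple[float, float, float]   # RGB values in [0, 1]
Image = List[List[Pixel]]            # rows of pixels
Mask = List[List[float]]             # 1.0 where the building is visible, else 0.0


def composite(background: Image, instances: List[Tuple[Image, Mask]]) -> Image:
    """Blend each building render over the background using its mask.

    Hypothetical helper: a pixel keeps the background colour where the
    mask is 0 and takes the building colour where the mask is 1.
    """
    out = [row[:] for row in background]  # copy so the input is not modified
    for image, mask in instances:
        for r, mask_row in enumerate(mask):
            for c, m in enumerate(mask_row):
                fg, bg = image[r][c], out[r][c]
                out[r][c] = tuple(m * f + (1.0 - m) * b for f, b in zip(fg, bg))
    return out


# Toy 1x2 image: a grey background and one red building covering the left pixel.
grey_background = [[(0.5, 0.5, 0.5), (0.5, 0.5, 0.5)]]
red_building = [[(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
building_mask = [[1.0, 0.0]]
result = composite(grey_background, [(red_building, building_mask)])
```

Because each building is handled as its own instance in this sketch, replacing one building only requires re-rendering that instance and compositing again, which is one way to picture the localized editing mentioned in the abstract.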

Specifically, the system uses a bird's eye view scene representation and employs a volumetric rendering approach for both the instance-oriented and stuff-oriented neural fields. The researchers tailor the generative hash grid and periodic positional embedding techniques to suit the distinct characteristics of building instances and background stuff.
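Two of the ingredients named above can be sketched in a few lines. The block below assumes a standard sinusoidal encoding as a stand-in for the periodic positional embedding, and the usual alpha-compositing quadrature for volumetric rendering along a single ray. The paper's exact formulations may differ, and the function names `periodic_embedding` and `render_ray` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Color = Tuple[float, float, float]


def periodic_embedding(x: float, num_freqs: int) -> List[float]:
    """Sinusoidal embedding of a scalar coordinate (assumed form).

    Returns [sin(2^k * pi * x), cos(2^k * pi * x)] for k = 0 .. num_freqs - 1.
    """
    out: List[float] = []
    for k in range(num_freqs):
        freq = (2.0 ** k) * math.pi
        out.append(math.sin(freq * x))
        out.append(math.cos(freq * x))
    return out


def render_ray(
    densities: Sequence[float],
    colors: Sequence[Color],
    deltas: Sequence[float],
) -> Tuple[Color, List[float]]:
    """Alpha-composite samples along one ray, ordered front to back.

    Each sample contributes weight T * (1 - exp(-sigma * delta)), where T is
    the transmittance accumulated from the samples in front of it.
    """
    transmittance = 1.0
    weights: List[float] = []
    pixel = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        weights.append(weight)
        for ch in range(3):
            pixel[ch] += weight * color[ch]
        transmittance *= 1.0 - alpha
    return (pixel[0], pixel[1], pixel[2]), weights


# Toy ray: an empty sample, then a nearly opaque blue sample, then a red one.
pixel, weights = render_ray(
    densities=[0.0, 1000.0, 5.0],
    colors=[(0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0, 0.0)],
    deltas=[1.0, 1.0, 1.0],
)
```

In this toy ray the opaque blue sample absorbs all remaining transmittance, so the red sample behind it does not contribute to the pixel.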

Additionally, the researchers contribute the CityGen Datasets, which include a large amount of real-world city data from OpenStreetMap (OSM) and Google Earth. These datasets help the system generate 3D cities that are more realistic in both layout and appearance.

Critical Analysis

The researchers acknowledge that generating realistic 3D cities is a challenging task, as humans are highly sensitive to structural distortions in urban environments. They also note that 3D city generation is more complex than 3D natural scene generation due to the wider range of building appearances.

While the CityDreamer model and the CityGen Datasets represent significant advances in the field, the researchers do not discuss limitations or directions for further research in detail. For example, it would be interesting to explore how the system handles cities with unique architectural styles or cultural influences.

Additionally, a comparison with other recent work on 3D scene and city generation, such as RealmDreamer, DreamScene, or StyleCity, would give a more complete picture of the state of the art in this field.

Conclusion

The researchers have developed CityDreamer, a compositional generative model that addresses the challenges of 3D city generation. By breaking the task down into building instances and background stuff, the system can generate more realistic and diverse 3D cities.

The CityGen Datasets, which include a large amount of real-world city imagery, are also a valuable contribution that can help advance the field of 3D city generation. There are still opportunities for further exploration and improvement, such as generating cities with unique architectural styles or cultural influences.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Urban Architect: Steerable 3D Urban Scene Generation with Layout Prior

Fan Lu, Kwan-Yee Lin, Yan Xu, Hongsheng Li, Guang Chen, Changjun Jiang

Text-to-3D generation has achieved remarkable success via large-scale text-to-image diffusion models. Nevertheless, there is no paradigm for scaling up the methodology to urban scale. Urban scenes, characterized by numerous elements, intricate arrangement relationships, and vast scale, present a formidable barrier to the interpretability of ambiguous textual descriptions for effective model optimization. In this work, we surmount the limitations by introducing a compositional 3D layout representation into text-to-3D paradigm, serving as an additional prior. It comprises a set of semantic primitives with simple geometric structures and explicit arrangement relationships, complementing textual descriptions and enabling steerable generation. Upon this, we propose two modifications -- (1) We introduce Layout-Guided Variational Score Distillation to address model optimization inadequacies. It conditions the score distillation sampling process with geometric and semantic constraints of 3D layouts. (2) To handle the unbounded nature of urban scenes, we represent 3D scene with a Scalable Hash Grid structure, incrementally adapting to the growing scale of urban scenes. Extensive experiments substantiate the capability of our framework to scale text-to-3D generation to large-scale urban scenes that cover over 1000m driving distance for the first time. We also present various scene editing demonstrations, showing the powers of steerable urban scene generation. Website: https://urbanarchitect.github.io.

4/11/2024

CityCraft: A Real Crafter for 3D City Generation

Jie Deng, Wenhao Chai, Junsheng Huang, Zhonghan Zhao, Qixuan Huang, Mingyan Gao, Jianshu Guo, Shengyu Hao, Wenhao Hu, Jenq-Neng Hwang, Xi Li, Gaoang Wang

City scene generation has gained significant attention in autonomous driving, smart city development, and traffic simulation. It helps enhance infrastructure planning and monitoring solutions. Existing methods have employed a two-stage process involving city layout generation, typically using Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), or Transformers, followed by neural rendering. These techniques often exhibit limited diversity and noticeable artifacts in the rendered city scenes. The rendered scenes lack variety, resembling the training images, resulting in monotonous styles. Additionally, these methods lack planning capabilities, leading to less realistic generated scenes. In this paper, we introduce CityCraft, an innovative framework designed to enhance both the diversity and quality of urban scene generation. Our approach integrates three key stages: initially, a diffusion transformer (DiT) model is deployed to generate diverse and controllable 2D city layouts. Subsequently, a Large Language Model (LLM) is utilized to strategically make land-use plans within these layouts based on user prompts and language guidelines. Based on the generated layout and city plan, we utilize the asset retrieval module and Blender for precise asset placement and scene construction. Furthermore, we contribute two new datasets to the field: 1) CityCraft-OSM dataset including 2D semantic layouts of urban areas, corresponding satellite images, and detailed annotations. 2) CityCraft-Buildings dataset, featuring thousands of diverse, high-quality 3D building assets. CityCraft achieves state-of-the-art performance in generating realistic 3D cities.

6/10/2024

GaussianCity: Generative Gaussian Splatting for Unbounded 3D City Generation

Haozhe Xie, Zhaoxi Chen, Fangzhou Hong, Ziwei Liu

3D city generation with NeRF-based methods shows promising generation results but is computationally inefficient. Recently 3D Gaussian Splatting (3D-GS) has emerged as a highly efficient alternative for object-level 3D generation. However, adapting 3D-GS from finite-scale 3D objects and humans to infinite-scale 3D cities is non-trivial. Unbounded 3D city generation entails significant storage overhead (out-of-memory issues), arising from the need to expand points to billions, often demanding hundreds of Gigabytes of VRAM for a city scene spanning 10 km². In this paper, we propose GaussianCity, a generative Gaussian Splatting framework dedicated to efficiently synthesizing unbounded 3D cities with a single feed-forward pass. Our key insights are two-fold: 1) Compact 3D Scene Representation: We introduce BEV-Point as a highly compact intermediate representation, ensuring that the growth in VRAM usage for unbounded scenes remains constant, thus enabling unbounded city generation. 2) Spatial-aware Gaussian Attribute Decoder: We present spatial-aware BEV-Point decoder to produce 3D Gaussian attributes, which leverages Point Serializer to integrate the structural and contextual characteristics of BEV points. Extensive experiments demonstrate that GaussianCity achieves state-of-the-art results in both drone-view and street-view 3D city generation. Notably, compared to CityDreamer, GaussianCity exhibits superior performance with a speedup of 60 times (10.72 FPS vs. 0.18 FPS).

6/11/2024

Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model

Xiaolong Li, Jiawei Mo, Ying Wang, Chethan Parameshwara, Xiaohan Fei, Ashwin Swaminathan, CJ Taylor, Zhuowen Tu, Paolo Favaro, Stefano Soatto

In this paper, we propose an effective two-stage approach named Grounded-Dreamer to generate 3D assets that can accurately follow complex, compositional text prompts while achieving high fidelity by using a pre-trained multi-view diffusion model. Multi-view diffusion models, such as MVDream, have shown to generate high-fidelity 3D assets using score distillation sampling (SDS). However, applied naively, these methods often fail to comprehend compositional text prompts, and may often entirely omit certain subjects or parts. To address this issue, we first advocate leveraging text-guided 4-view images as the bottleneck in the text-to-3D pipeline. We then introduce an attention refocusing mechanism to encourage text-aligned 4-view image generation, without the necessity to re-train the multi-view diffusion model or craft a high-quality compositional 3D dataset. We further propose a hybrid optimization strategy to encourage synergy between the SDS loss and the sparse RGB reference images. Our method consistently outperforms previous state-of-the-art (SOTA) methods in generating compositional 3D assets, excelling in both quality and accuracy, and enabling diverse 3D from the same text prompt.

4/30/2024