Continuous Object Representation Networks: Novel View Synthesis without Target View Supervision

NeurIPS 2020

Nicolai Häni, Selim Engin, Jun-Jee Chao, Volkan Isler

University of Minnesota, Robotic Sensor Network Laboratory,


We present Continuous Object Representation Networks (CORNs), a continuous, 3D-geometry-aware scene representation that can be learned from as few as two images per object. CORNs represent object geometry and appearance by conditionally extracting global and local features, using transformation chains and 3D feature consistency as self-supervision, and require 50× less training data than current state-of-the-art models. By formulating novel view synthesis as a neural rendering algorithm, CORNs are end-to-end trainable from only two source views, without access to 3D data or 2D target views. This formulation naturally generalizes across scenes, and our encoder-decoder framework avoids latent-code optimization at test time, drastically improving inference speed.
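The transformation-chain self-supervision can be illustrated with a toy geometric example: mapping a set of 3D features from view A to view B and back with the inverse transform must reproduce the originals, giving a training signal without target views. The sketch below is illustrative only (assumed helper names, plain NumPy on raw 3D points); CORN applies this idea to learned feature representations.

```python
import numpy as np

def transform(points, R, t):
    """Apply a rigid transform (rotation R, translation t) row-wise."""
    return points @ R.T + t

def chain_consistency_loss(feats_a, R_ab, t_ab):
    """Self-supervised consistency: A -> B -> A must be the identity.
    (Toy stand-in for CORN's 3D feature consistency loss.)"""
    feats_b = transform(feats_a, R_ab, t_ab)
    # inverse of a rigid transform: R^T, -R^T t
    feats_a_rec = transform(feats_b, R_ab.T, -R_ab.T @ t_ab)
    return np.mean(np.linalg.norm(feats_a - feats_a_rec, axis=-1))

# toy example: rotation about z plus a translation
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
t = np.array([0.1, -0.2, 0.5])
feats = np.random.default_rng(0).normal(size=(100, 3))
loss = chain_consistency_loss(feats, R, t)  # ~0 up to float error
```

With exact poses the round trip is lossless, so the loss vanishes; during training, a nonzero loss penalizes features that are not 3D-consistent across views.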

Novel View Synthesis

CORNs can be trained with only two source images per object on a variety of categories. We show CORN-generated novel views of cars, chairs, and human faces. CORNs generalize naturally across objects within the same category, requiring no additional training. To perform novel view synthesis, we take a randomly selected image of an object, generate the intermediate scene representation, and project it to the target view.
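The geometric core of the final projection step can be sketched with a pinhole camera model. This is a hypothetical stand-in (assumed function name and intrinsics): CORN projects a learned feature representation through a neural renderer, not raw 3D points.

```python
import numpy as np

def project_to_view(points3d, K, R, t):
    """Project 3D points into a target camera (pinhole model).
    Geometric stand-in for CORN's learned projection/rendering step."""
    cam = points3d @ R.T + t            # world -> camera coordinates
    uv_h = cam @ K.T                    # apply intrinsics (homogeneous)
    return uv_h[:, :2] / uv_h[:, 2:3]   # perspective divide -> pixels

# sanity check: a point on the optical axis lands at the
# principal point (cx, cy) of the target camera
K = np.array([[100.0,   0.0, 16.0],
              [  0.0, 100.0, 16.0],
              [  0.0,   0.0,  1.0]])
uv = project_to_view(np.array([[0.0, 0.0, 0.0]]),
                     K, np.eye(3), np.array([0.0, 0.0, 2.0]))
```

Changing `R` and `t` re-renders the same representation from a different viewpoint, which is exactly how novel views are generated at test time.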

Baseline comparisons

We compare CORN against multiple state-of-the-art baselines, namely VIGAN, TBN, and SRN. In contrast to the baselines, our model does not use target-view supervision and uses only two of the 108 images per object. Still, our model shows a comparable level of detail.

Out-of-domain Generalization

CORN generalizes to some degree beyond the training data domain. Here we take a model trained on the synthetic ShapeNet v2 dataset and test it on images from the Stanford Cars dataset without retraining. Our model generates novel views that preserve the appearance and shape of the input object.

3D Reconstruction

In addition to novel view synthesis, a possible application of our method is single-image 3D reconstruction. We synthesize $N$ novel views on the viewing hemisphere from a single image, then sample $k$ 3D points uniformly at random from a cubic volume. Our goal is to predict the occupancy of each of these $k$ points from the synthesized views.
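One simple geometric way to realize this idea is visual-hull carving: a sampled point is marked occupied only if its projection falls inside the object silhouette in every synthesized view. The sketch below is an assumption for illustration (hypothetical helper, silhouette masks as input); the occupancy predictor used with CORN may differ.

```python
import numpy as np

def occupancy_from_silhouettes(points, cameras, masks, K):
    """Visual-hull sketch: a point is occupied iff it projects inside
    the object silhouette in all views. Illustrative stand-in only."""
    occ = np.ones(len(points), dtype=bool)
    H, W = masks[0].shape
    for (R, t), mask in zip(cameras, masks):
        cam = points @ R.T + t                 # world -> camera
        uv_h = cam @ K.T                       # apply intrinsics
        px = np.round(uv_h[:, :2] / uv_h[:, 2:3]).astype(int)
        inside = ((px[:, 0] >= 0) & (px[:, 0] < W) &
                  (px[:, 1] >= 0) & (px[:, 1] < H))
        hit = np.zeros(len(points), dtype=bool)
        hit[inside] = mask[px[inside, 1], px[inside, 0]]
        occ &= hit                             # carve away misses
    return occ

# toy check: one camera looking down -z at the origin
K = np.array([[100.0,   0.0, 16.0],
              [  0.0, 100.0, 16.0],
              [  0.0,   0.0,  1.0]])
mask = np.ones((32, 32), dtype=bool)           # full-frame silhouette
pts = np.array([[0.0, 0.0, 0.0],               # projects on-screen
                [10.0, 0.0, 0.0]])             # projects off-screen
occ = occupancy_from_silhouettes(
    pts, [(np.eye(3), np.array([0.0, 0.0, 2.0]))], [mask], K)
```

Adding more synthesized views tightens the hull, so reconstruction quality depends directly on the consistency of the generated views.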



@inproceedings{haeni2020corn,
  author    = {H{\"a}ni, Nicolai and Engin, Selim and Chao, Jun-Jee and Isler, Volkan},
  title     = {Continuous Object Representation Networks: Novel View Synthesis without Target View Supervision},
  booktitle = {Proc. NeurIPS},
  year      = {2020}
}