CamP: Camera Preconditioning
for Neural Radiance Fields

Google Research
SIGGRAPH Asia 2023
(a) Ground truth with annotation
(b) Zip-NeRF
(c) BARF-NGP
(d) Ours
(e) Ground truth

CamP preconditions camera optimization in camera-optimizing Neural Radiance Fields, significantly improving their ability to jointly recover the scene and camera parameters.

Here we show a NeRF reconstructed from a cellphone capture, using camera poses estimated by ARKit. We apply our method to Zip-NeRF (b), a state-of-the-art NeRF approach that normally relies entirely on camera poses obtained from COLMAP. Adding joint camera optimization to Zip-NeRF (c), using an improved version of the camera parameterization of BARF adapted to the Instant NGP setting, improves image quality, but many artifacts still remain. Our proposed preconditioned camera optimization technique (d) improves the camera estimates, which results in a higher-fidelity scene reconstruction.


Abstract

Neural Radiance Fields (NeRF) can be optimized to obtain high-fidelity 3D scene reconstructions of objects and large-scale scenes. However, NeRFs require accurate camera parameters as input --- inaccurate camera parameters result in blurry renderings. Extrinsic and intrinsic camera parameters are usually estimated using Structure-from-Motion (SfM) methods as a pre-processing step to NeRF, but these techniques rarely yield perfect estimates. Thus, prior works have proposed jointly optimizing camera parameters alongside a NeRF, but these methods are prone to local minima in challenging settings.

In this work, we analyze how different camera parameterizations affect this joint optimization problem, and observe that standard parameterizations exhibit large differences in magnitude with respect to small perturbations, which can lead to an ill-conditioned optimization problem. We propose using a proxy problem to compute a whitening transform that eliminates the correlation between camera parameters and normalizes their effects, and we propose to use this transform as a preconditioner for the camera parameters during joint optimization.

Our preconditioned camera optimization significantly improves reconstruction quality on scenes from the Mip-NeRF 360 dataset: we reduce error rates (RMSE) by 67% compared to state-of-the-art NeRF approaches that do not optimize for cameras like Zip-NeRF, and by 29% relative to state-of-the-art joint optimization approaches using the camera parameterization of SCNeRF. Our approach is easy to implement, does not significantly increase runtime, can be applied to a wide variety of camera parameterizations, and can straightforwardly be incorporated into other NeRF-like models.


Camera Preconditioning

We demonstrate the impact of our method on the SE3+Focal camera parameterization. We apply a small perturbation to each parameter and visualize how preconditioning unifies the scale of the parameters (right), whereas the original parameterization exhibits large differences in scale (left).

Rotation
Translation
Focal length
(a) SE3+Focal
(b) SE3+Focal+CamP (Ours)
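The preconditioner can be computed from a proxy projection problem: differentiate the projections of a set of proxy 3D points with respect to the camera parameters, and whiten with the inverse square root of the resulting covariance. Below is a minimal JAX sketch under illustrative assumptions (a simple axis-angle SE3+Focal parameterization, a pinhole projection of hand-picked proxy points, and a damping term `lam`); it is not the paper's exact implementation.

```python
import jax
import jax.numpy as jnp

def project(params, points):
    """Pinhole projection of proxy points under 7 parameters:
    axis-angle rotation r, translation t, focal length f."""
    r, t, f = params[:3], params[3:6], params[6]
    theta = jnp.linalg.norm(r) + 1e-9
    k = r / theta
    # Rodrigues' rotation formula.
    K = jnp.array([[0.0, -k[2], k[1]],
                   [k[2], 0.0, -k[0]],
                   [-k[1], k[0], 0.0]])
    R = jnp.eye(3) + jnp.sin(theta) * K + (1.0 - jnp.cos(theta)) * (K @ K)
    cam = points @ R.T + t
    return f * cam[:, :2] / cam[:, 2:3]

def camera_preconditioner(params, points, lam=1e-1):
    """Whitening transform P = (J^T J + lam*I)^(-1/2) from the proxy problem."""
    n = params.shape[0]
    # Jacobian of all projected point coordinates w.r.t. the camera parameters.
    J = jax.jacfwd(project)(params, points).reshape(-1, n)
    sigma = J.T @ J + lam * jnp.eye(n)
    w, V = jnp.linalg.eigh(sigma)   # sigma is symmetric positive definite
    return (V * w ** -0.5) @ V.T    # inverse matrix square root
```

During joint optimization one would then optimize a preconditioned variable `phi` and use `params = params0 + P @ phi`, so that a unit step in any direction of `phi` perturbs the projected proxy points by a comparable amount.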

Mobile Phone Captures

Modern cellphones can estimate pose information using visual-inertial odometry, which is often used for Augmented Reality (AR) applications. While these poses are good enough for AR effects, it can be challenging to recover high-quality NeRFs from such sequences without running expensive, offline SfM pipelines.

Here we show reconstructions computed from casually captured scenes using the open source NeRF Capture app on an iPhone 13 Pro, which exports camera poses estimated by ARKit. We show qualitative comparisons that highlight the benefits of camera optimization.

(a) ARKit Poses (w/o COLMAP)
(b) ARKit Poses + CamP (Ours)

Mip-NeRF 360 Dataset

(a) Ground truth with annotation
(b) Zip-NeRF
(c) +SCNeRF
(d) +CamP (ours)
(e) Ground truth

Static NeRF methods such as Zip-NeRF (b) rely on poses from a preprocessing step such as COLMAP to produce good results. Even these poses may be imperfect, leading to blurring and artifacts. The treehill scene in the Mip-NeRF 360 dataset is an example of a scene with noisy camera estimates. Adding a state-of-the-art camera optimization method such as SCNeRF (c) ameliorates this slightly, but artifacts in distant parts of the scene still remain. Our method enhances SCNeRF by preconditioning its parameterization, thereby achieving significantly sharper results.

Side-by-Side Comparisons on Perturbed Mip-NeRF 360 Dataset

Next we show a side-by-side comparison between SCNeRF with and without preconditioning on the perturbed Mip-NeRF 360 dataset. Preconditioning allows the optimization to converge to better estimates of the camera poses, resulting in sharper details and fewer artifacts.


NeRF-Synthetic Dataset

CamP can converge quickly even when the intrinsics of a scene are unknown. We evaluate on a more challenging version of the perturbed NeRF-Synthetic benchmark proposed in BARF, where we also perturb the focal length and perspective of the cameras.
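A perturbation of this kind might be generated as follows; this is an illustrative sketch (the noise model and scales are assumptions, not the benchmark's actual settings): a small random rotation and translation are applied to each camera-to-world pose, and the focal length is jittered multiplicatively.

```python
import numpy as np

def perturb_camera(c2w, focal, rot_std=0.05, trans_std=0.05,
                   focal_std=0.05, rng=None):
    """Apply random noise to a 4x4 camera-to-world pose and a focal length.

    Noise scales are illustrative, not the benchmark's actual settings.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Small random rotation via Rodrigues' formula on an axis-angle vector.
    r = rng.normal(scale=rot_std, size=3)
    theta = np.linalg.norm(r)
    k = r / (theta + 1e-12)
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    R = np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
    out = c2w.copy()
    out[:3, :3] = R @ c2w[:3, :3]                       # rotate the pose
    out[:3, 3] += rng.normal(scale=trans_std, size=3)   # jitter the position
    return out, focal * np.exp(rng.normal(scale=focal_std))
```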

We compare our method to an improved version of BARF that has been adapted to the Instant NGP setting (BARF-NGP) for faster convergence. BARF-NGP is unable to converge to the correct camera poses, leading to visual artifacts and floaters in the reconstruction.

Convergence

(a) BARF-NGP
(b) Ours

When the focal lengths of the cameras are unknown, the BARF formulation fails to find the correct camera poses due to perspective ambiguities. Our preconditioned camera optimization is able to quickly converge to the correct solution.
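The perspective ambiguity can be seen in a toy example (an illustrative sketch, not the benchmark itself): for a nearly planar scene, doubling both the focal length and the camera distance leaves the projected points almost unchanged, so a photometric loss alone gives little signal to separate the two cameras.

```python
import numpy as np

def project(f, d, pts):
    """Pinhole projection of points at depth offset d with focal length f."""
    return f * pts[:, :2] / (pts[:, 2:] + d)

# A shallow, nearly fronto-parallel scene (small depth variation).
pts = np.array([[0.3, 0.2, 0.01],
                [-0.4, 0.1, -0.02],
                [0.1, -0.3, 0.015]])

u_a = project(1.0, 4.0, pts)  # focal 1.0, camera at distance 4
u_b = project(2.0, 8.0, pts)  # focal 2.0, camera at distance 8
# The two cameras produce nearly identical images of this scene.
```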

Even in the presence of perspective ambiguities, the reconstruction may look correct at first glance. However, compared to a reconstruction with the correct cameras, there are significantly more artifacts around object boundaries. Floaters caused by inaccurate boundaries are clearly visible in the rendered depth maps.

(a) BARF-NGP
(b) Ours
(a) BARF-NGP
(b) Ours

Acknowledgements

Thanks to Rick Szeliski and Sameer Agarwal for their comments on the text; and Ben Poole, Aleksander Holynski, Pratul Srinivasan, Ben Attal, Peter Hedman, Matthew Burruss, Laurie Zhang, Matthew Levine, and Forrester Cole for their advice and help. Thanks to Dor Verbin for providing code for the video comparison tool and helping with the preparation of the NeRF-synthetic dataset.

BibTeX

@article{park2023camp,
  author     = {Park, Keunhong and Henzler, Philipp and Mildenhall, Ben and Barron, Jonathan T. and Martin-Brualla, Ricardo},
  title      = {CamP: Camera Preconditioning for Neural Radiance Fields},
  journal    = {ACM Trans. Graph.},
  publisher  = {ACM},
  year       = {2023},
  issue_date = {December 2023},
}