Limitations of NeRF with Pre-trained Vision Features for Few-Shot 3D Reconstruction
Ankit Sanjyal
arXiv preprint arXiv:2506.18208  ·  cs.CV  ·  June 2025
Neural Radiance Fields DINO Few-Shot 3D Reconstruction LoRA Negative Results
Figure: rendered novel-view comparison across the evaluated model variants in the few-shot setting.
Abstract. Neural Radiance Fields (NeRF) have revolutionized 3D scene reconstruction from sparse image collections. Recent work has explored integrating pre-trained vision features, particularly from DINO, to enhance few-shot reconstruction capabilities. However, the effectiveness of such approaches remains unclear, especially in extreme few-shot scenarios. We present a systematic evaluation of DINO-enhanced NeRF models, comparing baseline NeRF, frozen DINO features, LoRA fine-tuned features, and multi-scale feature fusion. Surprisingly, all DINO variants perform worse than the baseline NeRF, achieving PSNR values around 12.9–13.0 compared to the baseline's 14.71. This counterintuitive result suggests that pre-trained vision features may not be beneficial for few-shot 3D reconstruction and may even introduce harmful biases.
⚠️ Key Finding: This is a negative-result paper, and that is the point. Pre-trained DINO features, widely believed to improve few-shot NeRF, consistently hurt performance, challenging a common assumption in the field.

Background

NeRF learns a continuous volumetric representation of a scene from posed images, enabling novel view synthesis. In the few-shot regime (3–10 views), NeRF suffers from underdetermination: there are not enough observations to constrain the MLP. A natural idea is to condition NeRF on rich semantic features from foundation models like DINO (a self-supervised ViT), which can provide scene priors.
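The conditioning idea can be sketched as follows. The exact injection mechanism used in the paper is not detailed on this page, so the concatenation scheme below and the feature dimension (384, as in DINO ViT-S) are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def positional_encoding(x, num_freqs=10):
    """NeRF-style encoding: map each coordinate to sin/cos at octave frequencies."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi      # (num_freqs,)
    angles = x[..., None] * freqs                      # (..., 3, num_freqs)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)              # (..., 3 * 2 * num_freqs)

def conditioned_input(xyz, dino_feat):
    """One common conditioning scheme (assumed here): concatenate a per-point
    semantic feature to the encoded position before feeding the NeRF MLP."""
    return np.concatenate([positional_encoding(xyz), dino_feat], axis=-1)

pts = np.random.rand(4, 3)        # 4 sample points along rays
feats = np.random.rand(4, 384)    # 384 = DINO ViT-S token dimension
x_in = conditioned_input(pts, feats)
print(x_in.shape)                 # (4, 444): 60 encoding dims + 384 feature dims
```

The extra 384 feature dimensions dwarf the 60-dimensional positional encoding, which hints at how a semantic prior could dominate the geometric signal when supervision is scarce.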

This paper asks: does DINO actually help when views are extremely scarce?

Models Evaluated

- Baseline NeRF: a standard NeRF trained directly on the input views, with no external features.
- DINO-NeRF (frozen): NeRF conditioned on features from a frozen pre-trained DINO encoder.
- LoRA-NeRF (fine-tuned): DINO features adapted to the scene via LoRA fine-tuning.
- Multi-Scale LoRA-NeRF: LoRA-adapted DINO features fused across multiple feature scales.

Results

| Model                  | PSNR ↑            | SSIM ↑   | LPIPS ↓  |
|------------------------|-------------------|----------|----------|
| Baseline NeRF          | **14.71**         | **0.46** | **0.53** |
| DINO-NeRF (frozen)     | 12.99 (−1.72)     | 0.46     | 0.54     |
| LoRA-NeRF (fine-tuned) | 12.97 (−1.74)     | 0.45     | 0.54     |
| Multi-Scale LoRA-NeRF  | 12.94 (−1.77)     | 0.44     | 0.54     |

Evaluated on the Blender Lego scene in an extreme few-shot setup. The baseline achieves the best value on every metric; parenthesized numbers are PSNR deltas relative to the baseline.
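Because PSNR is logarithmic in mean squared error, the roughly 1.7 dB gap is larger than it may look. A quick sanity check (assuming pixel values normalized to [0, 1], so PSNR = −10·log₁₀(MSE)):

```python
import math

def psnr(mse):
    """PSNR in dB for images with pixel values in [0, 1]."""
    return -10.0 * math.log10(mse)

def mse_from_psnr(p):
    """Invert PSNR back to mean squared error."""
    return 10.0 ** (-p / 10.0)

# Translate the gap between baseline (14.71 dB) and the DINO
# variants (~12.97 dB) into a reconstruction-error ratio.
ratio = mse_from_psnr(12.97) / mse_from_psnr(14.71)
print(f"{ratio:.2f}x higher MSE")   # prints "1.49x higher MSE"
```

In other words, the DINO variants incur roughly 49% more per-pixel squared error than the plain baseline.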

Why Does DINO Hurt?

We hypothesize several causes for the degradation; most notably, pre-trained semantic features may encode appearance priors that are misaligned with the geometry of the specific scene, actively biasing the reconstruction rather than regularizing it.

Implications

Simpler, geometry-focused architectures may be more effective than feature-rich models for extreme few-shot 3D reconstruction. Future work should focus on geometric consistency losses, depth priors, and view-consistent regularization rather than semantic feature injection.
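As one concrete illustration of the geometry-focused direction, here is a sketch of an edge-aware depth-smoothness regularizer, a common geometric prior in few-shot view synthesis. This is not a loss from the paper; the function and its weighting scheme are illustrative:

```python
import numpy as np

def edge_aware_depth_smoothness(depth, image):
    """Illustrative geometric prior: penalize gradients in the rendered
    depth map, except where the RGB image itself has strong edges
    (where genuine depth discontinuities are plausible).

    depth: (H, W) rendered depth;  image: (H, W, 3) rendered RGB.
    """
    dz_x = np.abs(np.diff(depth, axis=1))                    # (H, W-1)
    dz_y = np.abs(np.diff(depth, axis=0))                    # (H-1, W)
    gi_x = np.mean(np.abs(np.diff(image, axis=1)), axis=-1)  # image edges, x
    gi_y = np.mean(np.abs(np.diff(image, axis=0)), axis=-1)  # image edges, y
    # Down-weight the depth penalty where image gradients are large.
    return float(np.mean(dz_x * np.exp(-gi_x)) + np.mean(dz_y * np.exp(-gi_y)))
```

A term like this constrains geometry directly from the rendered outputs, rather than injecting external semantic features into the radiance field.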

Citation

@article{sanjyal2025nerflimitations,
  title     = {Limitations of {NeRF} with Pre-trained Vision Features for Few-Shot 3D Reconstruction},
  author    = {Sanjyal, Ankit},
  journal   = {arXiv preprint arXiv:2506.18208},
  year      = {2025}
}