Unlock Next-Gen 3D Reconstruction
VGGT empowers developers and researchers to predict camera poses, depth maps, point clouds, and more in a single forward pass, with no external bundle adjustment required.

VGGT is a Transformer-based model for end-to-end 3D reconstruction, consolidating multiple stages into a single forward pass to deliver camera poses, depth maps, and point clouds.
Single forward pass produces camera poses, depth maps, and point clouds without external bundle adjustment
Multi-head attention mechanism fuses geometric and appearance cues across multiple views
Generate dense, per-pixel depth predictions for each input view
Automatically predict camera extrinsics from multi-view images
Direct extraction of high-fidelity 3D point clouds from latent representations
Three model sizes (Base, Large, and XLarge, at roughly 100M, 200M, and 500M parameters) to balance quality against compute and memory
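
To make the single-pass claim concrete, here is a minimal inference sketch assuming the PyTorch API of the open-source vggt repository; the checkpoint name facebook/VGGT-1B comes from that public release and may not match the hosted Base/Large/XLarge sizes above.

```python
import torch
from vggt.models.vggt import VGGT
from vggt.utils.load_fn import load_and_preprocess_images

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a pretrained checkpoint (name from the public release; swap in
# whichever size you are actually running).
model = VGGT.from_pretrained("facebook/VGGT-1B").to(device)

# Preprocess a handful of overlapping views of the same scene.
images = load_and_preprocess_images(
    ["view_01.jpg", "view_02.jpg", "view_03.jpg"]
).to(device)

# A single forward pass returns every prediction head at once:
# encoded camera poses, per-pixel depth with confidence, and a dense
# 3D point map, so no bundle adjustment step follows.
with torch.no_grad():
    predictions = model(images)
print(predictions.keys())
```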
Follow these steps to reconstruct 3D models from your multi-view images using VGGT; a scripted equivalent is sketched after the steps.
Upload 5-20 images of your scene or object captured from different viewpoints, ensuring good overlap between adjacent views.
Provide approximate camera intrinsic parameters. If unknown, you can use default values or let the system estimate them.
Choose Base (fastest, 8GB GPU), Large (higher quality, 16GB+ GPU), or XLarge (best quality, 32GB GPU) based on your needs.
Click 'Generate 3D Model' and wait for VGGT to process your images. Processing time varies from 30 seconds to 5 minutes depending on model size.
Download your reconstructed point cloud (PLY format), depth maps (PNG), camera poses (JSON), and preview the 3D model in the interactive viewer.
VGGT processes your images end-to-end without requiring manual camera calibration or bundle adjustment, making 3D reconstruction accessible to everyone.
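
For scripted use, the generate-and-download steps can be reproduced along these lines. This sketch continues from the inference example above and assumes the pose-decoding utility of the open-source vggt release plus the open3d library; the exact output keys and shapes are assumptions, not this site's backend.

```python
import json
import numpy as np
import open3d as o3d  # pip install open3d
from vggt.utils.pose_enc import pose_encoding_to_extri_intri

# Decode the pose head into 3x4 extrinsics and 3x3 intrinsics
# (assumed key name and utility from the open-source release).
extrinsic, intrinsic = pose_encoding_to_extri_intri(
    predictions["pose_enc"], images.shape[-2:]
)

# Flatten the dense point map to N x 3 and write a PLY point cloud.
points = predictions["world_points"].cpu().numpy().reshape(-1, 3)
cloud = o3d.geometry.PointCloud()
cloud.points = o3d.utility.Vector3dVector(points.astype(np.float64))
o3d.io.write_point_cloud("reconstruction.ply", cloud)

# Store the camera parameters as JSON, mirroring the download step.
with open("cameras.json", "w") as f:
    json.dump({"extrinsics": extrinsic.cpu().numpy().tolist(),
               "intrinsics": intrinsic.cpu().numpy().tolist()}, f)
```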
Explore how VGGT can transform your 3D reconstruction workflows across various industries and applications
Real-time environment mapping and localization for robots and autonomous vehicles with rapid pose and depth estimation
Build immersive virtual environments by reconstructing real-world scenes in high fidelity for dynamic interaction
Digitally preserve historic architecture and archaeological sites with accurate 3D models built from photo collections
Create detailed 3D terrain and building models from drone imagery for surveying and planning
Automate defect detection and quality control by reconstructing 3D surfaces for precise measurement
Generate 3D product models from multiple product photos for interactive online shopping experiences
Common questions about using VGGT for 3D reconstruction
VGGT accepts JPEG and PNG images. You need 5-20 multi-view images of the same scene captured from different angles. Video frames can also be extracted and used.
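
For the video case, frames can be sampled at a fixed stride before upload. A minimal sketch with OpenCV, one option among many; the stride and limit values are illustrative:

```python
import cv2  # pip install opencv-python

def extract_frames(video_path: str, stride: int = 15, limit: int = 20):
    """Save up to `limit` frames, keeping every `stride`-th one."""
    capture = cv2.VideoCapture(video_path)
    saved, index = [], 0
    while len(saved) < limit:
        ok, frame = capture.read()
        if not ok:
            break
        if index % stride == 0:
            path = f"frame_{len(saved):03d}.png"
            cv2.imwrite(path, frame)
            saved.append(path)
        index += 1
    capture.release()
    return saved

# Turn a short orbit video into 5-20 multi-view inputs.
print(extract_frames("scene.mp4"))
```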
While camera intrinsics improve accuracy, VGGT can work with approximate or estimated values. For smartphone cameras, default values often work well.
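
As an illustration of what a reasonable default looks like, a common pinhole heuristic sets the focal length to roughly the larger image dimension and the principal point to the image center. This is an assumption for illustration, not VGGT's documented default:

```python
import numpy as np

def default_intrinsics(width: int, height: int) -> np.ndarray:
    """Rough pinhole default: focal length ~ max image dimension,
    principal point at the center. Illustrative heuristic only."""
    f = float(max(width, height))
    return np.array([[f, 0.0, width / 2.0],
                     [0.0, f, height / 2.0],
                     [0.0, 0.0, 1.0]])

print(default_intrinsics(1920, 1080))
```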
Processing time depends on the model size and number of images. The Base model typically takes 30-60 seconds, while larger models may take 2-5 minutes for optimal quality.
VGGT outputs point clouds in PLY format, depth maps as PNG images, and camera poses as JSON. You can also export to OBJ or other 3D formats using conversion tools.
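
One way to do that conversion: OBJ is a surface format, so mesh the point cloud first. A minimal sketch with open3d, an assumption on our part; any PLY-aware tool works:

```python
import open3d as o3d  # pip install open3d

# Read the PLY point cloud downloaded from VGGT.
cloud = o3d.io.read_point_cloud("reconstruction.ply")

# Poisson reconstruction needs normals, so estimate them first.
cloud.estimate_normals()
mesh, _ = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
    cloud, depth=9
)
o3d.io.write_triangle_mesh("reconstruction.obj", mesh)
```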
Yes, VGGT works well with outdoor scenes including buildings, landscapes, and monuments. Drone imagery is also supported for aerial reconstruction.
VGGT may struggle with highly reflective surfaces, transparent objects, or scenes with very poor lighting. Textureless surfaces may also produce less accurate results.
The Base model can achieve near real-time performance on modern GPUs (1-2 FPS), making it suitable for applications like robotics and AR where speed is critical.
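
To check throughput on your own hardware, here is a rough timing sketch reusing `model` and `images` from the earlier example; actual numbers depend on GPU, model size, and image count:

```python
import time
import torch

def sync():
    # Wait for queued GPU work so the timer measures real compute.
    if torch.cuda.is_available():
        torch.cuda.synchronize()

with torch.no_grad():
    model(images)  # warm-up pass
    sync()
    start = time.perf_counter()
    model(images)
    sync()
elapsed = time.perf_counter() - start
print(f"{1.0 / elapsed:.2f} forward passes per second")
```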
Transform your multi-view images into high-quality 3D reconstructions in minutes
No coding required. Simply upload your images and let VGGT handle the rest.