Sidhu masterfully explains how NeRFs and Gaussian Splatting turn fragmented internet photos into a seamless digital twin of the planet. This marks a significant leap from mapping mere landmarks to reconstructing the entire physical world in high fidelity.
Deep Dive
Building a 3D Model of the World from Internet Photos
Today, someone can geolocate you from a single photo you posted online, just from what's reflected in your sunglasses, what's outside the window, the angle of the sunlight. But here's a question that nobody's really thinking about. What happens when you take every photo that's been uploaded in a given area and fuse it all together? I'm talking every Instagram post, every Snapchat story, every WhatsApp share, every random tourist shot uploaded on Flickr. Stack them all together to build, dare I say, a 3D god's eye view of the world. Up until now, the answer was "kind of." But last month, one paper made this a real possibility.
All right, so how does this work? On the left, you see the input, just photos, random tourist shots of an iconic landmark off the internet. And on the right, you see a full 3D reconstruction of the place from those photos. That's it. And it's not just landmarks. You can do it for an entire city. Now, to get why this is such a big deal, you got to see what people have been trying to do for the last two decades. Let's go back.
The year was 2009, and a team at the University of Washington led by Sameer Agarwal downloads thousands of tourist photos of Rome off of Flickr. They use a technique called structure from motion, basically reverse engineering where every single camera was in 3D space, and then they use this information to stitch the photos into a single cohesive 3D model. The Eternal City, reconstructed from random tourist snapshots. They called it Building Rome in a Day. And by the way, they say a day because that's how long it took to process this 3D reconstruction. This was a seminal paper, not just because you could do this with the Colosseum, but because the techniques in this paper, the skeletal sets, bundle adjustment at scale, posing photos against the global 3D model, rippled outwards. This is the toolkit that Street View style systems are still built on today. Sameer ended up at Google, where I had the chance to work on photogrammetry myself along with some of the folks that built these very foundations.
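To make the structure-from-motion idea above a bit more concrete, here is a minimal sketch of its smallest building block: triangulating one 3D point from two cameras whose poses are already known, then measuring the reprojection error that bundle adjustment would minimize. This is not the pipeline from the paper. The intrinsics `K`, the two camera matrices, and the test point are all made-up values for illustration, and a real system also has to estimate the camera poses themselves.

```python
import numpy as np

def project(P, X):
    """Project a 3D point X (shape (3,)) to pixel coordinates with a 3x4 camera matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point observed at x1 in camera 1 and x2 in camera 2."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The homogeneous solution is the right singular vector
    # belonging to the smallest singular value of A.
    _, _, Vt = np.linalg.svd(A)
    X_h = Vt[-1]
    return X_h[:3] / X_h[3]

def reprojection_error(P, X, x_observed):
    """Pixel distance between where X lands in the image and where it was observed."""
    return float(np.linalg.norm(project(P, X) - x_observed))

# Hypothetical setup: both cameras share intrinsics K and look down +z;
# the second camera sits 1 unit to the right of the first.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X_true = np.array([0.5, -0.2, 4.0])           # made-up 3D point
x1, x2 = project(P1, X_true), project(P2, X_true)
X_est = triangulate(P1, P2, x1, x2)
err = reprojection_error(P1, X_est, x1)
```

With noise-free observations the triangulated point matches the original. With real photos the matches are noisy, which is why the poses and points get refined jointly afterwards.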
Now, for the next 9 years, the computer vision field scaled this approach hard.
By 2015, a UNC team reconstructed the entire planet from Flickr in 6 days.
Same technique, now running at planetary scale. But of course, it always hits the same wall. Think about it. Photos online aren't evenly distributed. There's a small group of places everyone loves to take photos of. The Eiffel Tower, the Colosseum, Times Square. We've got tons of these photos. That's the head. But when you talk about the torso and the tail, that's everywhere else. I'm talking about that local coastal fort, a random monument, your neighborhood. In those cases, you have a handful of photos, if any. That's the long-tail problem. These methods that we've been discussing only work on the head. Famous landmarks get beautiful reconstructions. The long tail ends up being a hollow shell. The internet just doesn't have enough photos for this stuff to work. Which is why Google goes through the effort of flying airplanes and driving cars to image the world.
And that's the wall that the computer vision field is going to spend the next decade banging its head against. The long-tail problem. Remember that term because we'll come back to it.
Now, what if you flip the question?
You've got all these reconstructions, famous landmarks captured from thousands of different angles. So a Cornell PhD student thinks: instead of using internet photos to reconstruct 3D scenes, what if we use the reconstructions as training data? Essentially teaching a neural network to predict the depth from any single new photo. Show it that random tourist shot and suddenly get back a per-pixel depth map. They called it MegaDepth. So that gives you the structure of the world. What about the people in it? The same year, the same lab pulls a stunt I still think about.
They go about scraping 2,000 Mannequin Challenge videos off of YouTube. This is the 2016 meme where everyone freezes in place and one person walks around with a phone recording a video. Now, because the people are static and the camera is moving, every video is a free structure-from-motion problem, conveniently with humans embedded in it. So, these researchers use it to train the first depth network that actually works on moving people. And by the way, this makes you wonder. Every single TikTok trend, Instagram challenge, every viral pose-and-hold thing, how many of those things are quietly data collection plays for ML labs? You think you're participating in a meme, meanwhile, somebody's training a frontier model on it. Anyway, back to the topic, but that's something I think about a lot lately. Point is, by 2019, we could go both ways. Lots of photos in, full 3D scene out, or single photo in, geometry out. Two sides of the same coin. Then along came neural radiance fields, or NeRFs.
Now, I've talked about these things a lot on this channel. I've got a bunch of other videos, in fact, an entire playlist that you can go check out. This is the stuff I had a chance to work on at Google. Really cool tech. Instead of representing a scene as a point cloud or a mesh, you encode the entire complexity of a physical space, place, or person into the weights of a neural network: the geometry, the colors, the way light bounces around, all of it baked into the parameters of a multi-layer perceptron.
So, in 2021, a Google research team takes that NeRF idea and adapts it to the messy reality of internet-scale photos. They call it NeRF in the Wild.
And the killer demo is this. You give it 800 photos of the Brandenburg Gate, for example. And these are images taken across years, different lighting conditions, different cameras, with tourists in every single frame. It can not only pull out the building, but it can disentangle the lighting in each photo. So you can literally change the time of day on the same building with the same model. Pretty freaking clever.
But of course, there's a catch with NeRFs. They encode the whole scene into this massive neural network. So to draw a single frame, you have to query that network at millions of points. Yes, you get beautiful results, but it is extremely painful to render.
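To see where "millions of points" comes from, here is a rough sketch of the standard volume-rendering quadrature that NeRF-style methods use to turn samples along one camera ray into one pixel color. In a real NeRF the densities and colors come from querying the network at each sample; here they are hand-written arrays, and the resolution and samples-per-ray figures are illustrative assumptions, not numbers from the video.

```python
import numpy as np

def composite_ray(sigmas, colors, deltas):
    """Blend samples along one ray into a pixel color.

    sigmas: (N,) volume densities at each sample (would come from the network)
    colors: (N, 3) RGB at each sample (would come from the network)
    deltas: (N,) distance between consecutive samples
    """
    # Opacity contributed by each sample.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance: how much light survives to reach sample i.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas
    return weights @ colors, weights

def queries_per_frame(width, height, samples_per_ray):
    """Network evaluations needed for one frame at one ray per pixel."""
    return width * height * samples_per_ray

# Toy ray: empty space (blue, zero density) followed by a dense red surface.
sigmas = np.array([0.0, 0.0, 50.0, 50.0])
colors = np.array([[0.0, 0.0, 1.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0]])
deltas = np.full(4, 0.1)
rgb, weights = composite_ray(sigmas, colors, deltas)

# Assumed figures: a 1920x1080 frame with 64 samples per ray.
n_queries = queries_per_frame(1920, 1080, 64)
```

The toy ray comes out almost pure red because the empty samples carry no weight. Under the assumed figures, a single frame needs 132,710,400 network queries, which is the rendering cost described above.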
So, in 2023, this paper drops and it has the same goal as NeRFs. You fit a model to your photos and you can render new views, but they swap out the implicit neural network for millions of fuzzy ellipsoidal splats called Gaussians. The key point being, instead of querying a network for every pixel, you rasterize these primitives directly.
And with modern GPUs, you can do this stuff at 100 FPS, even in a browser.
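As a contrast with querying a network, here is a minimal sketch of what "rasterizing the primitives directly" can look like for a single pixel: evaluate each already-projected 2D Gaussian at that pixel, sort by depth, and alpha-blend front to back. It leaves out the parts a real renderer needs (projecting 3D Gaussians to screen space, tile-based sorting, the GPU). The function name, the 0.99 opacity clamp, and the early-exit threshold are choices made for this sketch.

```python
import numpy as np

def splat_pixel(pixel, means, covs, opacities, colors, depths):
    """Color of one pixel from a set of 2D (screen-space) Gaussians.

    pixel: (2,) pixel position
    means: (N, 2) Gaussian centers on screen
    covs: (N, 2, 2) screen-space covariances
    opacities: (N,) peak opacity of each Gaussian
    colors: (N, 3) RGB per Gaussian
    depths: (N,) distance from the camera, used only for ordering
    """
    rgb = np.zeros(3)
    transmittance = 1.0
    for i in np.argsort(depths):              # nearest Gaussian first
        d = pixel - means[i]
        # Gaussian falloff: 1 at the center, fading with distance.
        falloff = np.exp(-0.5 * d @ np.linalg.inv(covs[i]) @ d)
        alpha = min(0.99, opacities[i] * falloff)
        rgb += transmittance * alpha * colors[i]
        transmittance *= 1.0 - alpha
        if transmittance < 1e-4:              # pixel is effectively opaque
            break
    return rgb

# Two made-up Gaussians centered on the pixel: a half-transparent red one
# in front of a fully opaque blue one.
pixel = np.array([10.0, 10.0])
means = np.array([[10.0, 10.0], [10.0, 10.0]])
covs = np.array([np.eye(2) * 4.0, np.eye(2) * 4.0])
opacities = np.array([0.5, 1.0])
colors = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
depths = np.array([1.0, 2.0])
pixel_rgb = splat_pixel(pixel, means, covs, opacities, colors, depths)
```

The work per pixel is a short loop over the Gaussians that overlap it, with no network evaluation, which is the property that makes real-time rendering feasible.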
Now, the reason this matters for our story is that once your scene representation is explicit and editable, every downstream trick gets easier, which is exactly what happens next.
Along comes 2024, and researchers do exactly what they did with NeRFs just a few years ago. WildGaussians was the name of the paper, and it takes the in-the-wild trick, separating the building from the lighting it happened to be captured in, and runs it on this new, more efficient substrate. The killer demo is this interactive slider. You can now be in a browser and track from sunrise to sunset to night. Same building, same model, running in real time in your freaking browser. Now at this point, the dataset side also starts catching up. Noah Snavely's lab teams up with Stanford and Adobe to build this dataset called MegaScenes.
Basically 430,000 different scenes, 2 million images, 100,000 structure-from-motion reconstructions, all scraped from Wikimedia Commons. These are real internet photos, right? Organized by what they're photos of, from all over the globe. And it becomes the infrastructure for everything that comes next. But think about it. Scraping more photos creates new problems. Try to reconstruct the Belvedere Palace in Vienna from random tourist shots and your software keeps folding the building in half. Why, you ask? Because the front and back of this building look identical. This is called bilateral symmetry. You get the same problem with cathedrals, capitols, anything with repeated structures.
Essentially, two surfaces that look the same in pixel space but are physically apart. The field actually has a term for this, appropriately named: doppelgangers.
So Noah's team trains a transformer to detect them. They call it Doppelgangers++, and their project page has this beautiful interactive viewer where you can actually fly around any monument and see exactly which photos contributed to the 3D reconstruction. Each camera frustum lit up around the building like a swarm. So many beautiful monuments for you to check out. But think about it.
These are the vacation photos of hundreds of strangers who don't know each other, suddenly stitched together into a coherent 3D model. Heck, at a smaller scale, you can even do this with your Amazon deliveries, with all the photos that your Amazon delivery driver takes when they drop off that package at your front door. Now, it's cool to see this multi-decade arc that started with Building Rome in a Day in 2009. And here we are in 2024, one paper away from the punchline.
So while all that's happening on the Gaussian side, a parallel thread is killing per-scene optimization entirely: this new class of feed-forward models.
Basically, you take any pile of photos and the model predicts cameras and geometry in a single pass, in seconds. You don't need to worry about structure from motion anymore. The 2025 names are VGGT, which won the best paper award at CVPR 2025 out of like 13,000 submissions, and π³ (Pi-cubed), which fixed VGGT's main weakness: VGGT secretly picks one photo as the reference and predicts everything relative to it. So, if you have a bad anchor photo, you get a broken reconstruction. π³ throws out the anchor entirely.
Now we're in April 2026, one month ago as of me recording this video. We're talking about the same Cornell lab, Noah Snavely's, where another PhD student, Juan Lee, drops MegaDepth-X. This is the chicken-and-egg problem the entire field has been stuck on. You can't train a model to reconstruct sparse internet photos because nobody has ground truth for those scenes, because nothing can reconstruct them. If you don't have answers or training data, that means you're stuck. So, the researchers came up with a clever trick. Why don't we take the well-photographed famous landmarks where you do have the ground truth, meaning the final 3D reconstructions, but then throw away most of the photos on purpose, essentially to simulate what that messy long tail looks like, framing it as this hard problem with a stolen answer key, and then train the model on that? Then they fine-tune VGGT and π³ on this data. Both improved dramatically. The same architecture, just trained on the long tail of the world. Now, on the hardest sparse scenes, off-the-shelf π³ gets 75% rotation accuracy. After fine-tuning on MegaDepth-X, we're talking 86%. The long tail of the world, the parts that none of this used to work for, is now unlocked. This is the wall.
The wall the field has been hitting since 2015, and we are punching right through it. Which means we can take random photos of basically anywhere on the internet and turn them into coherent 3D models.
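The "throw away most of the photos but keep the answer key" trick, and the kind of metric behind a number like "rotation accuracy", can be sketched roughly as follows. This is an illustration of the idea as described in the talk, not the paper's actual sampling procedure or evaluation protocol. The function names, the random sampling strategy, and the 15 degree threshold are all assumptions, and real benchmarks often score relative rotations between image pairs rather than per-image absolute rotations.

```python
import numpy as np

def sparsify_scene(image_ids, poses, keep, seed=0):
    """Simulate a long-tail scene: keep only a few images of a well-covered
    landmark, along with the ground-truth poses recovered from the full set."""
    rng = np.random.default_rng(seed)
    chosen = np.sort(rng.choice(len(image_ids), size=keep, replace=False))
    return [image_ids[i] for i in chosen], poses[chosen]

def rotation_error_deg(R_pred, R_gt):
    """Angle in degrees of the rotation taking R_pred to R_gt."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def rotation_accuracy(R_preds, R_gts, threshold_deg=15.0):
    """Fraction of predictions whose rotation error is under the threshold."""
    errors = np.array([rotation_error_deg(p, g) for p, g in zip(R_preds, R_gts)])
    return float(np.mean(errors < threshold_deg))

def rot_z(deg):
    """Rotation about the z axis, used here only to fabricate test poses."""
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a), np.cos(a), 0.0],
                     [0.0, 0.0, 1.0]])

# A made-up "famous landmark" with 1000 posed images, cut down to 5.
ids = [f"img_{i:04d}" for i in range(1000)]
poses = np.stack([rot_z(i * 0.36) for i in range(1000)])
sparse_ids, sparse_poses = sparsify_scene(ids, poses, keep=5)

# Made-up predictions that are off by 5, 10, 20 and 40 degrees.
gts = [np.eye(3)] * 4
preds = [rot_z(5), rot_z(10), rot_z(20), rot_z(40)]
acc = rotation_accuracy(preds, gts)
```

In this toy case two of the four predictions fall under the assumed 15 degree threshold, so the accuracy is 0.5. The 75% and 86% figures quoted above would come from the same kind of counting under whatever threshold the paper uses.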
All right, so why does this matter?
Well, if you have this goal to create a 3D reconstruction of the entire planet, you could do it like Google does it. You fly around these planes, you buy satellite imagery, you've got Street View cars. But there's another technology that's been emerging that actually fills in the holes beautifully: all these generative image diffusion models. And even Google needs this, because they're not actually allowed to drive Street View cars or fly airplanes everywhere, for example in Dubai. In a previous video I covered Skyfall-GS, which uses 3D Gaussian splatting plus these creative diffusion models to fill in the gaps where the satellite data was sparse, solving Google Earth's biggest problem from above. It turns out there's a ground-level equivalent too. Last April, SRI International dropped a paper called Diffusion-Guided Gaussian Splatting. Same idea, just the opposite direction of attack. Why don't we take ground-level photos, some drone shots, some satellite data, and fuse it all into one 3D model, and let these image diffusion models fill in the gaps wherever any single source runs out? So, you might be wondering, why the hell does SRI care? Well, because they're contractors for the Intelligence Advanced Research Projects Activity, or IARPA, basically the intel community's DARPA. And they've got a project called Walk-through Rendering from Images of Varying Altitude, or WRIVA. Let me tell you, the intelligence community loves their acronyms. But they've been running this 42-month effort since 2023 to do exactly this: build photorealistic 3D walkthroughs of places agents can't physically go. That's their long-tail problem. They've got a handful of photos on the ground, maybe some satellite imagery above, and no aerial imagery, and they need to know exactly what it looks like before they send operators in. Places you need to know before you arrive. And by the way, if you start pulling the threads, it's not just the US. MegaDepth-X itself was funded by Korea's National AI Research Lab Project. So, we're talking about the same race, just a different country footing the bill. And by the way, this is the pattern that I see with situational awareness every single time.
The same tech shows up in a Cornell research paper, a Netflix VFX tool, and an intelligence program, usually within months of each other. Because these use cases for mapping and understanding the world have both commercial and military applications. They always have and they always will. Don't forget that.
Now, there's still a missing piece here.
The world, of course, isn't static. It is four-dimensional. People move in it.
Time passes. And so far, everything we're talking about is the static structure of the real world. And researchers are now starting to do the same trick for casual handheld video.
You've got papers like MoSca and Shape of Motion that let you pull 4D out of a single phone video clip. Not just the geometry, but the motion of the entities within it. Now, we're nowhere near fusing every concertgoer's iPhone into a free-viewpoint playback of a Taylor Swift show, but you can see where this is going.
So, let's take a step back. We constantly capture photos and videos all the time. Every iPhone, every dash cam, every photo posted online on whichever platform you care about. And now we have the means to figure out not just where any of them were taken, but to extract the 3D structure of the world from it.
Put it all together and the sensorium has come to life. A real God's eye view built from everyone's vacation photos or status updates fused together. And conveniently, we've been uploading the photos for years, but until last month, it was really hard to pull it all together. Now, all those viewpoints can be molded into a 3D view of the world.
Now, if you're curious about the future of 4D Gaussian splatting, not using user photos but really awesome capture rigs, check out these two videos over here, which show you the entertainment and sports applications of this technology.
Bilawal signing off, and I'll see y'all in the next one. Cheers.