From pixels to physical insights: Geospatial image understanding with Gemini 3.1

From satellite constellations to rich street-level imagery, organizations have unprecedented visual access to the physical world. Yet, for GIS professionals, infrastructure managers, and urban planners, a massive operational bottleneck remains: turning raw, unstructured pixels into structured spatial databases.

Historically, extracting automated insights from street-level imagery—such as auditing infrastructure integrity, reading localized parking signs, or detecting specific hardware attachments—required training highly specialized, bespoke computer vision models. This approach demands months of manual bounding-box labeling, continuous model retraining, and complex ML pipelines.

With the advanced visual reasoning capabilities of Gemini 3.1 on the Gemini Enterprise Agent Platform, the paradigm has fundamentally shifted. We are moving from an era of training custom models for individual spatial tasks to prompting multimodal foundation models to reason directly about the physical world.

While these capabilities apply to any geospatial dataset, they are especially powerful when applied to Street View Insights, where they let organizations derive deeper value from large-scale street-level imagery.

1. Bypassing the cold start: Few-shot visual discovery

Traditional machine learning requires thousands of labeled examples to detect a new class of localized asset, creating a massive "cold start" problem for GIS teams expanding into new municipalities.

Gemini 3.1's expansive context window enables seamless in-context visual learning. When targeting highly specialized or localized assets (such as a unique regional road sign or custom municipal infrastructure), you can pass a few visual reference images directly into the prompt. The model adapts to the distinct visual pattern instantly without requiring a single underlying weight update.

import vertexai
from vertexai.generative_models import GenerativeModel, Part

# Initialize the SDK; the project ID and region below are placeholders.
vertexai.init(project="your-project-id", location="us-central1")
model = GenerativeModel("gemini-3.1-pro")

# reference_image_1..4 and target_image are assumed to be Part objects, e.g.
# Part.from_uri("gs://your-bucket/reference_1.jpg", mime_type="image/jpeg")
prompt = """
You are a municipal infrastructure inspector.
Images 1 and 2 show examples of 'Type A' asset configurations.
Images 3 and 4 show examples of 'Type B' configurations.
Analyze the final target image and classify the asset configuration, providing a brief visual justification based on the mounting style.
"""
# Pass the reference images, the target image, and the prompt natively to the model
response = model.generate_content([
    reference_image_1, reference_image_2,
    reference_image_3, reference_image_4,
    target_image, prompt
])
print(response.text)

Example few-shot implementation for asset discovery
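
When the number of reference classes grows, it can help to assemble the content list programmatically rather than by hand. The following is a minimal sketch, not part of the published samples: the helper name `build_few_shot_contents` and the idea of placing a short text caption before each image are assumptions for illustration. Plain strings stand in for image parts so the sketch runs without credentials.

```python
from typing import Any, Dict, List, Sequence


def build_few_shot_contents(
    examples: Dict[str, Sequence[Any]],
    target_image: Any,
    instruction: str,
) -> List[Any]:
    """Interleave text captions with reference images, then append the target.

    `examples` maps a class label to its reference images. The images can be
    any objects the SDK accepts as content parts; this helper never inspects them.
    """
    contents: List[Any] = [instruction]
    for label, images in examples.items():
        for index, image in enumerate(images, start=1):
            # A short caption before each image states which class it shows.
            contents.append(f"Reference {index} for {label}:")
            contents.append(image)
    contents.append("Target image to classify:")
    contents.append(target_image)
    return contents


# Hypothetical file names used as stand-ins for real image parts.
contents = build_few_shot_contents(
    examples={"Type A": ["a1.jpg", "a2.jpg"], "Type B": ["b1.jpg", "b2.jpg"]},
    target_image="target.jpg",
    instruction="Classify the asset configuration in the target image.",
)
```

The resulting list could be passed to `model.generate_content(contents)` in place of the hand-written list. Explicit captions remove any dependence on the model counting images in order, though whether they improve accuracy for a given asset type would need to be tested.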

2. Native spatial awareness: Bounding box extraction

Many developers view large language models as text-in, text-out engines. However, Gemini 3.1 possesses deep native spatial awareness. When analyzing a street-level image, Gemini can locate physical infrastructure and return 2D bounding boxes whose coordinates are normalized to a [0, 1000] coordinate space. These values are independent of image resolution, so they must be rescaled by the image's width and height to obtain pixel coordinates.

Instead of building a bespoke object detector, workflows can prompt the model to isolate specific sub-components or attachments on a primary infrastructure asset—such as locating equipment components or streetlights on a target asset.

prompt = """
Analyze this street-level view of a target asset. 
Find all attached equipment components and streetlights. 
Return the 2D bounding boxes in [ymin, xmin, ymax, xmax] format for each identified component so they can be mapped to our asset inventory.
"""
response = model.generate_content([image_part, prompt])
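
Because the boxes come back on a [0, 1000] scale rather than in pixels, a small conversion step is usually needed before drawing or cropping. The sketch below assumes the box has already been parsed out of the response into four numbers in [ymin, xmin, ymax, xmax] order; the function name `to_pixel_box` and the (left, top, right, bottom) output order are illustrative choices, not something the API prescribes.

```python
from typing import Sequence, Tuple


def to_pixel_box(
    box: Sequence[float], image_width: int, image_height: int
) -> Tuple[int, int, int, int]:
    """Convert [ymin, xmin, ymax, xmax] on a 0-1000 scale to pixel coordinates.

    Returns (left, top, right, bottom), the order most drawing and cropping
    utilities expect.
    """
    ymin, xmin, ymax, xmax = box
    if not (0 <= ymin <= ymax <= 1000 and 0 <= xmin <= xmax <= 1000):
        raise ValueError(f"box out of range or inverted: {list(box)}")
    # x values scale with the width, y values with the height.
    left = round(xmin * image_width / 1000)
    top = round(ymin * image_height / 1000)
    right = round(xmax * image_width / 1000)
    bottom = round(ymax * image_height / 1000)
    return left, top, right, bottom


# A box covering the middle half of the height on a 2000x1000 image.
pixel_box = to_pixel_box([250, 100, 750, 400], image_width=2000, image_height=1000)
print(pixel_box)  # (200, 250, 800, 750)
```

Validating the range up front catches the most common parsing mistake, which is swapping the x and y values or treating the numbers as pixels.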

3. Physical & geometric reasoning via code execution

Perhaps the most interesting capability of Gemini 3.1 for geospatial applications is its ability to combine visual extraction with pixel-level mathematical reasoning.

While Gemini natively understands imagery, many civil engineering tasks require rigorous mathematical thresholds. By leveraging the Gemini Enterprise Agent Platform's native Code Execution capabilities, Gemini can dynamically write and execute Python scripts using standard computer vision libraries like cv2 (OpenCV) and numpy.

This allows the model to go beyond "looking" at an image to actively understanding the pixel data. For example, Gemini can identify a primary asset, generate an OpenCV script to extract its coordinate geometry, and use trigonometry to estimate its lean angle to assess structural stability. It can also use cv2 to calculate the variance of the Laplacian to compute a mathematical blur metric for image triage.
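
To make the blur metric concrete, here is a dependency-free sketch of the variance of the Laplacian. It is an illustration of the idea, not the script Gemini would generate. With OpenCV the usual one-liner is `cv2.Laplacian(gray, cv2.CV_64F).var()`; the pure-Python version below applies the same 4-neighbour kernel but skips the border pixels, so its values will differ slightly from OpenCV's. The function name and the tiny test images are assumptions for the example.

```python
from statistics import pvariance
from typing import List, Sequence


def laplacian_variance(gray: Sequence[Sequence[float]]) -> float:
    """Variance of the 4-neighbour Laplacian over the interior of a grayscale image.

    Sharp images have strong edges, so the Laplacian response varies widely;
    blurred images give a response close to constant and a low variance.
    """
    height, width = len(gray), len(gray[0])
    if height < 3 or width < 3:
        raise ValueError("image must be at least 3x3")
    responses: List[float] = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Kernel [[0, 1, 0], [1, -4, 1], [0, 1, 0]] applied at (y, x).
            responses.append(
                gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    return float(pvariance(responses))


# A featureless image versus a high-contrast checkerboard.
flat = [[128] * 8 for _ in range(8)]
checkerboard = [[255 * ((x + y) % 2) for x in range(8)] for y in range(8)]
flat_score = laplacian_variance(flat)
sharp_score = laplacian_variance(checkerboard)
```

The featureless image scores zero and the checkerboard scores far higher. Any cutoff between "sharp" and "blurred" depends on the camera and scene, so it would need to be calibrated on real imagery.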

from vertexai.generative_models import Tool
# Equip Gemini with Code Execution capabilities
code_execution_tool = Tool.from_code_execution()
model = GenerativeModel("gemini-3.1-pro", tools=[code_execution_tool])
prompt = """
1. Identify the primary infrastructure asset in the foreground.
2. Write and execute a Python script using the `cv2` (OpenCV) and `numpy` libraries to manipulate the pixel data.
3. Extract the pixel coordinates of the absolute top and bottom of the asset.
4. Calculate the asset's lean angle in degrees from the vertical axis.
5. Return the calculated angle and assess if it exceeds the standard 5-degree safety threshold.
"""
response = model.generate_content([image_part, prompt])

Example implementation using Gemini Enterprise Agent Platform Code Execution
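
For reference, the trigonometry behind steps 3 to 5 of the prompt is short enough to check by hand. The sketch below is one plausible form of the calculation, assuming the top and bottom of the asset are already available as pixel coordinates; the function name and the sample points are hypothetical, and the 5-degree threshold is the one quoted in the prompt. It measures lean in the image plane only and does not correct for camera roll or perspective.

```python
import math
from typing import Tuple

Point = Tuple[float, float]  # (x, y) in pixels, with y increasing downward


def lean_angle_degrees(top: Point, bottom: Point) -> float:
    """Angle between the bottom-to-top line and the image's vertical axis."""
    dx = top[0] - bottom[0]
    dy = bottom[1] - top[1]
    if dx == 0 and dy == 0:
        raise ValueError("top and bottom must be distinct points")
    # atan2 of horizontal offset over vertical extent gives the deviation
    # from vertical; abs() makes the result independent of lean direction.
    return math.degrees(math.atan2(abs(dx), abs(dy)))


SAFETY_THRESHOLD_DEGREES = 5.0  # threshold taken from the prompt above

# Hypothetical asset whose top sits 100 px to the right of its base over 1000 px of height.
angle = lean_angle_degrees(top=(600.0, 100.0), bottom=(500.0, 1100.0))
exceeds_threshold = angle > SAFETY_THRESHOLD_DEGREES
```

A horizontal offset of 100 pixels over a height of 1000 pixels gives roughly 5.7 degrees, which would exceed the threshold. Recomputing the angle locally from the coordinates the model reports is a cheap way to sanity-check its answer.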

4. Automated image quality analysis & visual triage

Extracting deep visual insights is only effective if the underlying imagery is clear. In real-world street-level feeds, camera blur, severe sun glare, or dense foliage often obscure critical infrastructure. Running expensive geometry calculations or database ingestion on heavily obscured images leads to corrupted GIS data.

Gemini excels at visual understanding and triage. Before initiating complex geospatial pipelines, you can prompt Gemini to act as a visual gatekeeper that analyzes the physical constraints of the scene itself. It can detect whether an asset is blocked by a tree, whether the lighting creates unreadable silhouettes, or whether motion blur prevents hardware identification.

prompt = """
Perform a detailed visual quality analysis of this street-level image for civil inspection.
Assess the following conditions and return a structured analysis:
1. Occlusion: Is the primary target object fully visible, or is it obscured by foliage or structures?
2. Lighting: Are there severe backlighting or shadow issues preventing hardware identification?
3. Clarity: Is there significant motion blur preventing us from reading the attached signage?
Determine if this image should be routed to the GIS database or discarded.
"""
response = model.generate_content([image_part, prompt])

Example visual triage prompt

By programmatically filtering imagery based on Gemini’s visual understanding, organizations can save immense compute resources and ensure only high-fidelity data enters their spatial databases.
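
The filtering step itself can be a few lines of ordinary code once the model's answer is machine-readable. The sketch below assumes the triage prompt has been extended to request a JSON object with three boolean fields, `occluded`, `lighting_ok`, and `blurred`. That schema, the function name `route_image`, and the three routing labels are all hypothetical and would need to match whatever your prompt actually asks for.

```python
import json
from typing import Tuple

# Hypothetical schema: the prompt must request exactly these boolean fields.
REQUIRED_FIELDS = ("occluded", "lighting_ok", "blurred")


def route_image(response_text: str) -> Tuple[str, str]:
    """Return ("ingest" | "discard" | "review", reason) for one triage response."""
    try:
        analysis = json.loads(response_text)
    except json.JSONDecodeError:
        # Unparseable output goes to manual review rather than being dropped.
        return "review", "response was not valid JSON"
    if not isinstance(analysis, dict):
        return "review", "response was not a JSON object"
    missing = [
        name for name in REQUIRED_FIELDS if not isinstance(analysis.get(name), bool)
    ]
    if missing:
        return "review", f"missing or non-boolean fields: {', '.join(missing)}"
    if analysis["occluded"]:
        return "discard", "target is occluded"
    if not analysis["lighting_ok"]:
        return "discard", "lighting prevents identification"
    if analysis["blurred"]:
        return "discard", "motion blur prevents reading signage"
    return "ingest", "passed all checks"


decision, reason = route_image(
    '{"occluded": false, "lighting_ok": true, "blurred": false}'
)
```

Sending malformed or incomplete answers to a review queue, instead of treating them as a pass or a fail, keeps a formatting problem in the model output from silently corrupting or discarding data.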

The road ahead

The distance between raw, unstructured street-level imagery and highly structured spatial intelligence has never been shorter. By combining the broad coverage of geospatial imagery with the multimodal reasoning, bounding box extraction, and code execution capabilities of frontier models, spatial analysis is entering a new era grounded in real-world data.

With recent capabilities such as Agentic Vision in Gemini 3 Flash, which uses a dynamic 'Think, Act, Observe' loop to actively zoom, crop, and analyze imagery via Python code execution, this type of active inspection is now a reality.

There are also exciting developments from Google DeepMind research, which show that generative pre-training unlocks powerful generalist vision learners for advanced 2D and 3D understanding. Gemini is expanding past basic extraction to provide the concrete tools needed to transform raw geospatial imagery into precise, actionable intelligence.

Reference implementations

To help enterprise developers and data scientists operationalize these concepts, we have published a comprehensive suite of open-source notebooks. These resources showcase how to extract advanced geospatial insights using Gemini and Vertex AI.

Explore the complete reference implementations, test out the multimodal prompts, and run the end-to-end pipelines by visiting our Google Maps Platform Insights Samples repository on GitHub.

Specific Geospatial Use Case Notebooks:

Representative header image generated with Nano Banana Pro
