GNSS-Denied Visual Localization for UAVs using Satellite Imagery

Deep learning-based visual localization system for autonomous UAV navigation in GPS-denied environments

Overview

The Visual Localization project is a real-time visual localization system designed for Unmanned Aerial Vehicles (UAVs) operating in GNSS-denied environments. The system leverages deep learning-based feature matching between onboard camera images and geo-referenced satellite imagery to provide accurate position estimates without relying on GPS signals.

The core innovation lies in dynamically generating perspective-transformed map regions from satellite tiles based on the UAV’s estimated attitude and position, enabling robust visual feature matching even under challenging conditions such as illumination changes, viewpoint variations, and texture-poor environments.

Problem Statement

Traditional GNSS-based navigation systems face critical limitations in:

  • Urban canyons: High-rise buildings block satellite signals
  • Indoor environments: No GPS coverage
  • Jamming scenarios: Intentional interference disrupts navigation
  • Accuracy limitations: Consumer-grade GPS provides 5-10 meter accuracy

Visual odometry and SLAM approaches offer alternatives but suffer from drift accumulation over time. Our system addresses these challenges by providing absolute position estimates through visual matching with geo-referenced satellite imagery.

Technical Approach

System Architecture

The system operates through five integrated modules:

  1. Dynamic Map Generation: Satellite tiles are dynamically retrieved and assembled based on the UAV’s estimated position and attitude
  2. Perspective Transformation: Camera images are transformed to match the satellite map’s coordinate system using homography
  3. Deep Feature Extraction: State-of-the-art deep learning models extract robust visual features
  4. Feature Matching: Learned matchers establish correspondences between camera and map features
  5. State Estimation: Multiple position estimation methods with Kalman filtering provide robust pose estimates
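The perspective-transformation step (module 2) reduces, in the rotation-only case, to the classic homography H = K R K⁻¹. The sketch below is a minimal numpy illustration of that special case, with hypothetical intrinsics and rotation; the full transform for a translating camera over a ground plane also involves the translation and plane normal.

```python
import numpy as np

def attitude_homography(K, R):
    """Homography relating two views of a distant scene under a pure
    camera rotation R, with intrinsics K: H = K @ R @ K^-1."""
    return K @ R @ np.linalg.inv(K)

def warp_points(H, pts):
    """Apply homography H to an Nx2 array of pixel coordinates."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # homogeneous coords
    out = (H @ pts_h.T).T
    return out[:, :2] / out[:, 2:3]                   # dehomogenize

# Hypothetical intrinsics (fx = fy = 800, 640x480 image) and a 5-degree roll
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
theta = np.deg2rad(5.0)
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])

H = attitude_homography(K, R)
corners = np.array([[0.0, 0.0], [640.0, 0.0], [640.0, 480.0], [0.0, 480.0]])
warped = warp_points(H, corners)  # where the image corners land in map space
```

With an identity rotation the homography is the identity and the corners map to themselves, which is a quick sanity check for the sign conventions.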

Deep Learning Pipeline

Feature Extraction

The system supports multiple deep learning-based feature extractors:

  • SuperPoint: Fast, general-purpose feature detector with 256-dimensional descriptors
  • ALIKED: A lighter keypoint and descriptor extraction network using deformable transformations, optimized for low-contrast images
  • DISK: Learned keypoints and descriptors (DIScrete Keypoints), trained with a reinforcement-learning objective
  • SIFT: Classical SIFT algorithm (baseline comparison)
  • ORB: Oriented FAST and Rotated BRIEF

Each extractor is optimized for real-time inference on embedded platforms, with TensorRT support for NVIDIA Jetson devices.
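The multi-extractor design can be sketched as a small registry mapping extractor names to callables. Everything below (the registry, the toy detector) is hypothetical and only illustrates the plug-in structure, not the project's actual interface.

```python
from typing import Callable, Dict, Tuple
import numpy as np

# Hypothetical plug-in registry; each extractor maps an image to
# (keypoints, descriptors).
EXTRACTORS: Dict[str, Callable[[np.ndarray], Tuple[np.ndarray, np.ndarray]]] = {}

def register(name: str):
    def decorator(fn):
        EXTRACTORS[name] = fn
        return fn
    return decorator

@register("toy")
def toy_extractor(image: np.ndarray):
    """Stand-in extractor: 'detects' above-average pixels and attaches
    zero descriptors. Real extractors (SuperPoint, ALIKED, ...) return
    learned keypoints with 256-dimensional descriptors."""
    ys, xs = np.nonzero(image > image.mean())
    keypoints = np.stack([xs, ys], axis=1)[:1024]      # cap the keypoint budget
    descriptors = np.zeros((len(keypoints), 256), dtype=np.float32)
    return keypoints, descriptors

image = np.random.default_rng(0).random((32, 32))
kpts, desc = EXTRACTORS["toy"](image)
```

Registering extractors behind a common signature is what lets the matcher stay agnostic to which backbone produced the features.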

Feature Matching

LightGlue serves as the core matching engine, utilizing graph neural networks for robust feature correspondence:

  • Bidirectional matching with outlier rejection
  • Confidence-aware matching scores
  • Real-time performance on embedded hardware
  • Configurable keypoint budget: 1024 keypoints per image by default, up to 2048

The matching pipeline includes:

  • Image pre-processing and contrast enhancement
  • Homography-based geometric verification
  • RANSAC-based outlier rejection
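The verification steps above can be illustrated with a toy RANSAC loop. A full implementation fits a homography (e.g. via OpenCV's findHomography), but the sketch below uses a 2-D translation model so the hypothesize/score/refit logic stays visible; the synthetic data and thresholds are illustrative.

```python
import numpy as np

def ransac_translation(src, dst, thresh=3.0, iters=200, seed=0):
    """Toy RANSAC over a 2-D translation model: sample one match,
    hypothesize the shift, count inliers, keep the best hypothesis.
    The real pipeline fits a homography instead of a translation."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        i = rng.integers(len(src))
        t = dst[i] - src[i]                           # 1-point hypothesis
        err = np.linalg.norm(src + t - dst, axis=1)   # residual per match
        inliers = err < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit the model on the winning inlier set
    t = (dst[best_inliers] - src[best_inliers]).mean(axis=0)
    return t, best_inliers

# Synthetic matches: a (5, -3) px shift, with the first 10 matches corrupted
rng = np.random.default_rng(1)
src = rng.uniform(0, 640, size=(100, 2))
dst = src + np.array([5.0, -3.0])
dst[:10] += rng.uniform(50, 100, size=(10, 2))        # gross outliers

shift, inliers = ransac_translation(src, dst)
```

Because a single clean sample pins down the translation exactly, the 10 corrupted matches never enter the final fit, which is the same inlier/outlier separation the homography-based verification performs on real matches.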

Position Estimation Methods

The system implements three complementary position estimation approaches:

1. ….

2. ….

3. Extended Kalman Filter (EKF)

A filtering approach that augments the navigation state with camera parameters:

  • State vector: [x, vx, y, vy, z, vz, qw, qx, qy, qz, fx, fy] (12-dimensional)
    • Position (x, y, z) and velocity (vx, vy, vz) in NED coordinates
    • Quaternion orientation (qw, qx, qy, qz)
    • Camera focal length parameters (fx, fy)
  • Minimum requirement: …
  • Output: Filtered position estimate with uncertainty quantification
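A minimal sketch of one predict/update cycle, using only a 4-state constant-velocity slice [x, vx, y, vy] rather than the full 12-state EKF described above; the noise variances q and r are illustrative, not the project's tuning.

```python
import numpy as np

def kf_step(x, P, z, dt, q=1.0, r=4.0):
    """One predict/update cycle of a constant-velocity Kalman filter on
    the state [x, vx, y, vy], driven by a visual position fix z = [px, py]."""
    F = np.eye(4)
    F[0, 1] = F[2, 3] = dt                  # position integrates velocity
    H = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])    # only position is observed
    Q = q * np.eye(4)                       # process noise
    R = r * np.eye(2)                       # measurement noise
    # Predict
    x = F @ x
    P = F @ P @ F.T + Q
    # Update
    y = z - H @ x                           # innovation
    S = H @ P @ H.T + R                     # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)          # Kalman gain
    x = x + K @ y
    P = (np.eye(4) - K @ H) @ P
    return x, P

# Feed a constant position fix; the estimate settles onto it
x, P = np.zeros(4), 10.0 * np.eye(4)
for _ in range(50):
    x, P = kf_step(x, P, np.array([10.0, 5.0]), dt=0.1)
```

The full filter replaces the linear measurement model H with the linearized camera projection, which is what makes it an *extended* Kalman filter and lets fx and fy be refined in flight.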

Post-Processing Pipeline

….

Coordinate Transformations

The system handles multiple coordinate systems:

  • Camera Frame: Image pixel coordinates
  • NED Frame: North-East-Down local navigation frame
  • WGS84: Geodetic coordinates (latitude, longitude, altitude)
  • UTM: Universal Transverse Mercator projection coordinates

Transformations include:

  • Homography-based perspective correction
  • Camera-to-IMU frame alignment
  • Geodetic coordinate conversions
  • Elevation integration from DEM (Digital Elevation Model) data
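The WGS84 → NED leg of these conversions can be approximated over a small flight area with a flat-earth model; the project uses PyMap3D for the exact transform, so the function below is only a back-of-the-envelope sketch.

```python
import numpy as np

R_EARTH = 6378137.0  # WGS84 equatorial radius in meters

def geodetic_to_ned(lat, lon, alt, lat0, lon0, alt0):
    """Flat-earth approximation of WGS84 (degrees, meters) -> local NED
    (meters), valid near the reference point (lat0, lon0, alt0)."""
    north = np.deg2rad(lat - lat0) * R_EARTH
    east = np.deg2rad(lon - lon0) * R_EARTH * np.cos(np.deg2rad(lat0))
    down = alt0 - alt                      # NED's z-axis points down
    return north, east, down

# One millidegree of latitude is roughly 111 m of northing
n, e, d = geodetic_to_ned(47.001, 8.0, 150.0, 47.0, 8.0, 100.0)
```

Note the sign flip on the vertical axis: climbing 50 m above the reference altitude yields a negative "down" coordinate.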

Implementation Details

Real-Time Processing

The system is optimized for real-time operation on embedded platforms:

  • Frame rate: 10-30 Hz depending on hardware
  • Latency: <100ms end-to-end processing time
  • Memory: Efficient caching of satellite tiles
  • GPU acceleration: TensorRT support for Jetson platforms
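The tile-caching bullet above can be sketched as a bounded LRU store keyed by (zoom, x, y); the class below is a hypothetical illustration of that idea, not the project's cache.

```python
from collections import OrderedDict

class TileCache:
    """Hypothetical bounded LRU cache for satellite tiles, keyed by
    (zoom, x, y). Evicting the least-recently-used tile keeps memory
    flat during long flights."""

    def __init__(self, capacity: int = 256):
        self.capacity = capacity
        self._store: OrderedDict = OrderedDict()

    def get(self, key, fetch):
        if key in self._store:
            self._store.move_to_end(key)        # mark as recently used
            return self._store[key]
        tile = fetch(key)                       # e.g. download or read from disk
        self._store[key] = tile
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)     # evict least-recently-used
        return tile
```

Passing a fetch callback keeps the cache agnostic to where tiles come from (network, disk, or a pre-downloaded bundle), which also eases a later move to fully offline operation.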

Experimental Results

Real-World Flight Tests

The system has been extensively tested in various conditions:

  • Daytime flights: Clear visibility, good contrast
  • Night flights: Low-light conditions with artificial illumination
  • Different altitudes: 50m to 500m above ground level
  • Various terrains: Urban, rural, and mixed environments

Performance Metrics

Position Accuracy:

  • Mean position error: 4.8 meters (comparable to consumer GPS)
  • RMS error: 6.2 meters
  • 95th percentile error: 12.5 meters

Robustness:

  • Matching success rate: >85% in favorable conditions
  • Optic flow fallback activation: <5% of frames
  • System uptime: >95% during test flights

Computational Performance:

  • Processing time: 30-80ms per frame (NVIDIA Jetson AGX)
  • Memory usage: <2GB RAM
  • Power consumption: <15W on Jetson platforms

Comparison with Baselines

Method            Mean Error (m)   RMS Error (m)   Drift
NAVWOGPS          4.8              6.2             None
Visual Odometry   12.3             18.5            High
GPS (Consumer)    5.2              8.1             Low
GPS (RTK)         0.1              0.2             None

Key Features

Advantages

  1. Absolute Positioning: No drift accumulation unlike visual odometry
  2. GNSS Independence: Operates without GPS signals
  3. Real-Time Performance: Optimized for embedded platforms
  4. Robust Matching: Deep learning features handle challenging conditions
  5. Modular Design: Easy to integrate with existing navigation stacks

Limitations

  1. Satellite Imagery Dependency: Requires internet connectivity for tile retrieval
  2. Altitude Constraints: Optimal performance between 50 and 1000 m AGL (satellite tile zoom levels were tested up to level 11, corresponding to roughly 1000 m)
  3. Texture Requirements: Performance degrades in texture-poor environments
  4. Computational Resources: Requires GPU for real-time operation

Applications

The system is suitable for:

  • Search and Rescue: Navigation in GPS-denied areas
  • Infrastructure Inspection: Precise positioning for autonomous inspection
  • Military Operations: Operations in GPS-jammed environments
  • Indoor-Outdoor Transitions: Seamless navigation across environments
  • Urban Air Mobility: Navigation in dense urban environments

Future Work

Potential improvements and extensions:

  1. Offline Map Storage: Pre-downloaded satellite tiles for offline operation
  2. Multi-Modal Fusion: Integration with IMU, barometer, and other sensors
  3. Seasonal Adaptation: Handling seasonal changes in satellite imagery
  4. Semantic Understanding: Integration with semantic segmentation for improved matching
  5. Distributed Processing: Multi-UAV collaborative localization

Technologies Used

  • Deep Learning: PyTorch, SuperPoint, ALIKED, LightGlue
  • Computer Vision: OpenCV, RANSAC
  • State Estimation: Extended Kalman Filter, Kalman Filter variants
  • Geospatial: PyMap3D, Bing Maps API, SRTM elevation data
  • Embedded Computing: TensorRT, NVIDIA Jetson platforms
  • Robotics: ROS/ROS2 integration, MAVLink protocol

Visualizations

Feature Matching Results

Deep learning-based feature matching between camera image and satellite map. Green lines indicate matched features with high confidence scores.

Position Estimation Results

Comparison of estimated trajectory (blue) with ground truth GPS trajectory (red) during a test flight. Mean position error: 4.8 meters.

Estimation of UAV Position in NED frame

Real-time visualization showing UAV position on satellite map with matched features and confidence indicators.

Conclusion

Visual Localization demonstrates that deep learning-based visual localization can provide reliable, absolute position estimates for UAVs in GNSS-denied environments. By combining state-of-the-art feature matching with robust state estimation, the system achieves GPS-comparable accuracy while operating independently of satellite navigation systems.

The modular architecture and extensive configuration options make the system adaptable to various mission requirements and hardware platforms, from small consumer drones to larger autonomous systems.

References

  1. DeTone, D., Malisiewicz, T., & Rabinovich, A. (2018). SuperPoint: Self-Supervised Interest Point Detection and Description. CVPR.
  2. Sarlin, P. E., et al. (2020). SuperGlue: Learning Feature Matching with Graph Neural Networks. CVPR.
  3. Zhao, X., et al. (2023). ALIKED: A Lighter Keypoint and Descriptor Extraction Network via Deformable Transformation. ICCV.
  4. Lindenberger, P., et al. (2023). LightGlue: Local Feature Matching at Light Speed. ICCV.

This project was developed as part of research on GNSS-denied navigation for autonomous systems. For more details, please contact me.