The Problem
Multi-object tracking (MOT) from drone footage is one of the hardest problems in aerial computer vision. Unlike ground-level cameras with a fixed viewpoint, UAVs introduce three compounding challenges that break standard tracking pipelines:
- Altitude-dependent scale variation: Objects appear at vastly different sizes depending on flight height. A pedestrian at 30 m occupies a very different pixel footprint than the same pedestrian at 120 m, which confuses graph matchers with fixed parameters.
- Dense, small objects: Aerial imagery often contains dozens of closely spaced objects, such as vehicles in a parking lot or pedestrians at a crossing, whose bounding boxes overlap and make associations ambiguous.
- Occlusion-induced identity switches: When an object passes behind another, trackers often lose the correct association and assign a new identity on reappearance. In aerial footage this happens frequently and cascades into large identity-switch (IDSW) counts.
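To put a rough number on the scale variation in the first challenge above, the sketch below applies the standard pinhole relation (pixel extent = focal length × object size / distance) to a camera looking straight down. The focal length and object size are hypothetical values chosen for illustration; they do not come from the paper.

```python
def pixel_extent(object_size_m: float, altitude_m: float, focal_length_px: float) -> float:
    """Approximate on-image extent in pixels under a pinhole camera looking straight down."""
    return focal_length_px * object_size_m / altitude_m


FOCAL_PX = 1000.0  # hypothetical focal length, in pixels
SIZE_M = 0.5       # hypothetical object extent seen from above, in metres

low = pixel_extent(SIZE_M, 30.0, FOCAL_PX)
high = pixel_extent(SIZE_M, 120.0, FOCAL_PX)
linear_ratio = low / high
area_ratio = linear_ratio ** 2

print(f"{low:.1f} px at 30 m, {high:.1f} px at 120 m")          # 16.7 px at 30 m, 4.2 px at 120 m
print(f"linear ratio {linear_ratio:.0f}x, area ratio {area_ratio:.0f}x")  # linear ratio 4x, area ratio 16x
```

Whatever the actual camera parameters, the ratio depends only on the two altitudes: a fourfold change in height gives a fourfold change in linear size and a sixteenfold change in box area.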
Existing MOT methods such as SORT, DeepSORT, and ByteTrack were designed primarily for ground-level footage and treat all detections as homogeneous nodes in their association graph. They have no mechanism to adapt to altitude, distinguish tracklet states, or suppress the influence of occluded objects on their neighbors.
Our Approach: HDST-GNN
HDST-GNN models the tracking problem as a heterogeneous dynamic graph. Nodes have distinct types (detections, confirmed tracklets, lost tracklets), and edges are built adaptively from the estimated camera altitude, so the network can reason about the scene at the right spatial scale for the current flight height.
The model maintains a spatiotemporal graph that is updated as each new frame arrives. Spatial edges connect objects within the current frame; temporal edges connect objects across frames. The graph transformer layers aggregate information along both edge types at once, producing association scores that reflect both appearance similarity and motion consistency.
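The graph structure described above could be represented roughly as follows. This is a minimal sketch, not the paper's implementation: the class names, the distance-based edge rule, and the sliding-window method are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from math import hypot


class NodeType(Enum):
    DETECTION = "detection"
    CONFIRMED = "confirmed_tracklet"
    LOST = "lost_tracklet"


@dataclass(frozen=True)
class Node:
    node_id: int
    frame: int
    node_type: NodeType
    cx: float  # box centre, pixels
    cy: float


@dataclass
class SpatioTemporalGraph:
    nodes: dict = field(default_factory=dict)         # node_id -> Node
    spatial_edges: set = field(default_factory=set)   # pairs within one frame
    temporal_edges: set = field(default_factory=set)  # pairs across frames

    def add_node(self, node: Node, radius: float) -> None:
        """Link the new node to every existing node within `radius` pixels."""
        for other in self.nodes.values():
            if hypot(node.cx - other.cx, node.cy - other.cy) > radius:
                continue
            edge = (min(node.node_id, other.node_id), max(node.node_id, other.node_id))
            if other.frame == node.frame:
                self.spatial_edges.add(edge)
            else:
                self.temporal_edges.add(edge)
        self.nodes[node.node_id] = node

    def drop_frames_before(self, frame: int) -> None:
        """Hypothetical sliding window: forget nodes older than `frame`."""
        keep = {i for i, n in self.nodes.items() if n.frame >= frame}
        self.nodes = {i: self.nodes[i] for i in keep}
        self.spatial_edges = {e for e in self.spatial_edges if e[0] in keep and e[1] in keep}
        self.temporal_edges = {e for e in self.temporal_edges if e[0] in keep and e[1] in keep}


g = SpatioTemporalGraph()
g.add_node(Node(0, 0, NodeType.CONFIRMED, 10.0, 10.0), radius=5.0)
g.add_node(Node(1, 0, NodeType.CONFIRMED, 12.0, 11.0), radius=5.0)
g.add_node(Node(2, 1, NodeType.DETECTION, 11.0, 10.0), radius=5.0)
g.add_node(Node(3, 1, NodeType.DETECTION, 100.0, 100.0), radius=5.0)
print(sorted(g.spatial_edges))   # [(0, 1)]
print(sorted(g.temporal_edges))  # [(0, 2), (1, 2)]
```

A learned model would attach feature vectors to nodes and edges; the sketch only shows how the two edge types differ and how the graph can change from frame to frame.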
Three Key Innovations
Altitude-Adaptive Edge Construction
HDST-GNN estimates camera altitude from the distribution of object sizes in each frame, then adjusts the graph connectivity radius and edge feature normalization accordingly. At high altitude, objects are small and densely packed — the model widens its spatial neighborhood to capture relevant context. At low altitude, the model tightens the graph to avoid connecting unrelated objects.
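One way to picture this step is the sketch below. It assumes the altitude estimate is a simple ratio between a reference box size and the median box size in the frame, and that the connectivity radius is measured in multiples of that median box size. Both choices, along with every constant and function name, are illustrative assumptions rather than the paper's formulation.

```python
from math import log2
from statistics import median

REFERENCE_SIZE_PX = 64.0  # hypothetical box size at a reference (low) altitude
BASE_RADIUS = 3.0         # hypothetical radius, in multiples of the median box size


def altitude_proxy(box_sizes_px):
    """Relative altitude estimate: grows as the median box size shrinks."""
    return REFERENCE_SIZE_PX / median(box_sizes_px)


def connectivity_radius(box_sizes_px, min_scale=0.5, max_scale=3.0):
    """Radius in multiples of the median box size; wider at higher altitude."""
    scale = 1.0 + 0.5 * log2(altitude_proxy(box_sizes_px))
    scale = max(min_scale, min(max_scale, scale))
    return BASE_RADIUS * scale


def normalized_offset(center_a, center_b, box_sizes_px):
    """Edge feature: centre offset divided by the frame's median box size."""
    m = median(box_sizes_px)
    return ((center_b[0] - center_a[0]) / m, (center_b[1] - center_a[1]) / m)


low_altitude_boxes = [64, 60, 70]   # large boxes, camera close to the ground
high_altitude_boxes = [16, 15, 18]  # small boxes, camera far from the ground

print(connectivity_radius(low_altitude_boxes))   # 3.0
print(connectivity_radius(high_altitude_boxes))  # 6.0
print(normalized_offset((0, 0), (32, 16), high_altitude_boxes))  # (2.0, 1.0)
```

Dividing offsets by the median box size keeps edge features comparable across flight heights, which is the intent of the normalization mentioned above.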
Heterogeneous Node Representation
Rather than treating all graph nodes as equivalent, HDST-GNN distinguishes three node types: new detections (no trajectory history), confirmed tracklets (stable, multi-frame trajectories), and lost tracklets (temporarily invisible, being held for re-identification). Each type has dedicated embedding pathways and type-specific message-passing weights, letting the network reason differently about uncertain vs. established identities.
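Type-specific message passing can be illustrated with a toy example in which each node type owns its own weight matrix. The matrices below are hand-picked stand-ins for learned parameters, and the mean aggregation is an assumption; the paper's layers are graph transformer layers, which this sketch does not reproduce.

```python
# Hypothetical per-type weights: one 2x2 matrix per source node type.
# In a trained model these would be learned parameters.
TYPE_WEIGHTS = {
    "detection": [[1.0, 0.0], [0.0, 1.0]],
    "confirmed": [[2.0, 0.0], [0.0, 2.0]],
    "lost":      [[0.5, 0.0], [0.0, 0.5]],
}


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def aggregate(neighbors):
    """Mean of neighbour features, each transformed by its own type's weights.

    neighbors: list of (node_type, feature_vector) pairs.
    """
    messages = [matvec(TYPE_WEIGHTS[node_type], feat) for node_type, feat in neighbors]
    n = len(messages)
    return [sum(m[i] for m in messages) / n for i in range(len(messages[0]))]


result = aggregate([("confirmed", [1.0, 2.0]), ("lost", [4.0, 2.0])])
print(result)  # [2.0, 2.5]
```

Here the confirmed tracklet's features are amplified and the lost tracklet's are damped before averaging, which mirrors the idea of weighting established identities differently from uncertain ones.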
Occlusion-Gated Temporal Aggregation
When an object is likely occluded — inferred from overlap with neighboring bounding boxes and a sudden drop in detection confidence — HDST-GNN gates its temporal aggregation signal. This prevents the uncertain state of an occluded object from corrupting the trajectory representations of its neighbors, dramatically reducing the cascade of identity switches that occlusion events typically trigger.
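The gating idea can be sketched as follows. The occlusion score here is simply the largest IoU with a neighboring box plus the drop in detection confidence, clipped to 1, and the gate is one minus that score. These formulas and function names are assumptions for illustration; the paper's gate is presumably learned.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def occlusion_score(box, neighbor_boxes, conf_now, conf_prev):
    """Hypothetical score in [0, 1]: max overlap plus confidence drop."""
    overlap = max((iou(box, nb) for nb in neighbor_boxes), default=0.0)
    conf_drop = max(0.0, conf_prev - conf_now)
    return min(1.0, overlap + conf_drop)


def gated_aggregate(messages, scores):
    """Weighted mean of neighbour messages; occluded sources contribute less."""
    gates = [1.0 - s for s in scores]
    total = sum(gates)
    if total == 0.0:
        return [0.0] * len(messages[0])
    dim = len(messages[0])
    return [sum(g * m[i] for g, m in zip(gates, messages)) / total for i in range(dim)]


clear = occlusion_score((0, 0, 10, 10), [(20, 20, 30, 30)], conf_now=0.9, conf_prev=0.9)
hidden = occlusion_score((0, 0, 10, 10), [(5, 0, 15, 10)], conf_now=0.4, conf_prev=0.9)
print(clear)             # 0.0
print(round(hidden, 3))  # 0.833

# A fully occluded source (score 1.0) is ignored by its neighbours.
print(gated_aggregate([[1.0, 1.0], [9.0, 9.0]], [0.0, 1.0]))  # [1.0, 1.0]
```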
Results
We evaluated HDST-GNN on standard UAV MOT benchmarks using perfect detections (an oracle detector) to isolate the association component. The results show substantial improvements over SORT, a widely used tracking baseline.
The 81% reduction in identity switches (IDSW) relative to SORT is the standout result. IDSW is arguably the most operationally damaging MOT failure mode: each switch breaks a trajectory and requires human review in downstream applications. The occlusion gate and heterogeneous node types together drive this improvement. The model holds lost tracklets in memory without corrupting their neighbors and re-identifies them when they reappear.
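For readers unfamiliar with the metric, the sketch below shows a simplified way to count identity switches: a switch is recorded whenever a ground-truth object is matched to a different track ID than the one it was last matched to. The input format and the example data are hypothetical, and benchmark toolkits handle matching and edge cases in more detail.

```python
def count_id_switches(assignments):
    """Count identity switches.

    assignments: dict mapping each ground-truth id to the list of track ids
    it was matched to, one entry per frame (None means unmatched).
    """
    switches = 0
    for track_ids in assignments.values():
        last = None
        for tid in track_ids:
            if tid is None:
                continue  # unmatched frames do not count as a switch
            if last is not None and tid != last:
                switches += 1
            last = tid
    return switches


# Hypothetical example: the car is lost for one frame and comes back as track 2.
example = {
    "car": [1, 1, None, 2, 2],
    "pedestrian": [7, 7, 7, 7, 7],
}
print(count_id_switches(example))  # 1
```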
Why Aerial MOT Matters
Federal Reconnaissance
Persistent tracking of vehicles and personnel across wide-area aerial surveillance footage, maintaining identity continuity through occlusion and altitude changes.
Border & Infrastructure Security
Automatic detection and tracking of unauthorized activity across large perimeters without requiring dense fixed camera networks.
Autonomous Delivery
Urban drone delivery requires tracking pedestrians and vehicles in dense scenes to plan safe descent and landing approach paths.
Traffic Analytics
City-scale vehicle flow analysis from UAV footage, enabling intersection timing optimization and incident detection without ground sensors.
What's Next
HDST-GNN establishes a strong baseline for graph-based aerial MOT. Upcoming research directions include:
- End-to-end joint detection and tracking — eliminating the oracle detector assumption with a learned detector in the loop
- Swarm-cooperative tracking — fusing observations from multiple UAVs into a shared heterogeneous graph for wider coverage and cross-view re-identification
- Deployment on NVIDIA Jetson platforms for real-time on-board inference at the edge
- Integration with our Drone AI simulation environment for large-scale data generation and evaluation
Read the Full Paper
HDST-GNN: Heterogeneous Dynamic Spatiotemporal Graph Neural Networks for Multi-Object Tracking in UAV Aerial Imagery
Phillip Jiang · arXiv 2606.05587 · 2026