Embodied AI for aerial search and rescue
A Benchmark for Agentic UAV Embodied Search and Rescue
A high-fidelity ESAR simulator and evaluation suite for UAV agents that must explore open 3D terrain, discover rescue clues, reason over mission context, and report victim locations.
Harbin Institute of Technology, Shenzhen
Project video
The simulator reconstructs representative wilderness rescue environments and exposes long-horizon UAV search episodes with dynamic terrain, visibility, weather, clues, and victim locations.
Abstract
ESARBench introduces Embodied Search and Rescue (ESAR), a task in which aerial agents autonomously explore complex 3D environments, identify mission-critical clues, and reason about likely victim locations before reporting precise spatial coordinates.
The benchmark uses Unreal Engine 5 and AirSim to construct four large-scale, photorealistic environments mapped from real-world GIS data. Dynamic variables such as weather, time of day, and stochastic clue placement make each mission closer to practical rescue operations.
ESARBench contains 12 rescue events, 60 temporal snapshots, and 600 tasks. Baseline experiments show that current methods remain far from solving ESAR, with core bottlenecks in spatial memory, aerial adaptation, safe long-horizon flight, and semantic reasoning over clues.
Defines agentic UAV ESAR beyond passive instruction following.
Builds high-fidelity wilderness worlds from real GIS terrain.
Evaluates victim search, clue discovery, time, and safety.
Task formulation
The agent receives conditions, coordinates, and a high-level prompt describing the missing target.
The UAV searches large outdoor 3D terrain with multi-view RGB-D observations and flight state.
Tents, backpacks, campfires, flares, and other objects must be detected and interpreted as evidence.
The agent locates victims and reports predicted 3D coordinates under time and safety constraints.
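The four stages above can be read as an input/output contract for an agent. The following is a minimal sketch of that contract; every class and field name here (`MissionBrief`, `ClueReport`, `RescueReport`, `within_budget`) is hypothetical and is not taken from the ESARBench codebase.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vec3 = Tuple[float, float, float]  # (x, y, z) in the world frame; frame convention is an assumption


@dataclass
class MissionBrief:
    """What the agent is told at the start of an episode (hypothetical schema)."""
    prompt: str            # high-level description of the missing target
    start_position: Vec3   # UAV start coordinates
    conditions: str        # free-form environmental conditions, e.g. weather and time of day


@dataclass
class ClueReport:
    """A detected object that the agent interprets as rescue evidence."""
    label: str             # e.g. "tent", "backpack", "campfire", "flare"
    position: Vec3


@dataclass
class RescueReport:
    """What the agent hands back at the end of an episode (hypothetical schema)."""
    victims: List[Vec3] = field(default_factory=list)
    clues: List[ClueReport] = field(default_factory=list)
    elapsed_s: float = 0.0
    collided: bool = False


def within_budget(report: RescueReport, time_limit_s: float) -> bool:
    """True if the episode finished without a collision and inside the time limit."""
    return (not report.collided) and report.elapsed_s <= time_limit_s
```

Keeping clues and victims as separate lists mirrors the split between clue discovery and victim localization in the task description.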
UAV-ESAR simulator
Alpine
Mountain ridges, forests, and alpine meadows for long-horizon wilderness search.
Desert
Open desert and Gobi terrain with harsh visibility and sparse semantic cues.
Snowy Peak
High-altitude snow fields and steep terrain challenge navigation and perception.
Coast
Hills, forested areas, and coastal cliffs for mixed visual and topological conditions.
Real-world GIS and DEM data are mapped into UE5 landscapes, then coupled with AirSim-Colosseum flight dynamics.
Tasks vary weather, illumination, clue placement, start position, and target configuration.
The simulator provides GPS, IMU, LiDAR, multi-view RGB, and depth observations for embodied agents.
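Because victims must be reported as 3D coordinates, an agent typically has to lift a depth pixel into the world frame using the UAV pose. The sketch below shows one common way to do this under stated assumptions: a pinhole camera with square pixels, a forward-facing camera aligned with the body axis, planar depth along the optical axis, a north-east-down (NED) world frame, and a UAV that is level (yaw only). The function name and its parameters are illustrative, not part of the simulator's API.

```python
import math


def pixel_to_world_ned(u, v, depth_m, width, height, hfov_deg, uav_ned, yaw_rad):
    """Back-project one depth pixel to NED world coordinates (illustrative sketch).

    Assumes a pinhole camera, square pixels, a principal point at the image
    centre, and a level UAV whose only rotation is yaw about the down axis.
    """
    # Focal length in pixels from the horizontal field of view.
    fx = (width / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)
    fy = fx
    cx, cy = width / 2.0, height / 2.0

    # Camera frame: x right, y down, z forward along the optical axis.
    x_cam = (u - cx) * depth_m / fx
    y_cam = (v - cy) * depth_m / fy
    z_cam = depth_m

    # Body frame (forward, right, down) for a forward-facing camera.
    bx, by, bz = z_cam, x_cam, y_cam

    # Rotate by yaw (clockwise from north) and translate by the UAV position.
    north = math.cos(yaw_rad) * bx - math.sin(yaw_rad) * by
    east = math.sin(yaw_rad) * bx + math.cos(yaw_rad) * by
    return (uav_ned[0] + north, uav_ned[1] + east, uav_ned[2] + bz)
```

A full implementation would also apply roll, pitch, and the camera mounting offset; those are omitted here to keep the geometry readable.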
Benchmark design
Real-world rescue cases are abstracted into longitudinal mission narratives.
Each event is discretized into reproducible static temporal states.
Tasks sample weather, time, UAV starts, clue layouts, and victim positions.
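The event, snapshot, and task hierarchy can be sketched as a seeded enumeration. The reported totals (12 events, 60 snapshots, 600 tasks) are consistent with 5 snapshots per event and 10 tasks per snapshot, but that even split is an assumption here, as are the weather and time-of-day vocabularies, the start-position range, and all names in the code.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

EVENTS = 12
SNAPSHOTS_PER_EVENT = 5   # assumption: 60 snapshots spread evenly over 12 events
TASKS_PER_SNAPSHOT = 10   # assumption: 600 tasks spread evenly over 60 snapshots

WEATHER = ["clear", "fog", "rain", "snow"]    # hypothetical vocabulary
TIMES_OF_DAY = ["dawn", "noon", "dusk", "night"]  # hypothetical vocabulary


@dataclass(frozen=True)
class TaskSpec:
    event: int
    snapshot: int
    task: int
    weather: str
    time_of_day: str
    start_xy: Tuple[float, float]
    seed: int


def build_tasks(base_seed: int = 0) -> List[TaskSpec]:
    """Enumerate every task, drawing its variables from a per-task seeded RNG."""
    tasks = []
    for e in range(EVENTS):
        for s in range(SNAPSHOTS_PER_EVENT):
            for t in range(TASKS_PER_SNAPSHOT):
                # A unique seed per task keeps each task reproducible on its own.
                seed = base_seed + (e * SNAPSHOTS_PER_EVENT + s) * TASKS_PER_SNAPSHOT + t
                rng = random.Random(seed)
                tasks.append(TaskSpec(
                    event=e, snapshot=s, task=t,
                    weather=rng.choice(WEATHER),
                    time_of_day=rng.choice(TIMES_OF_DAY),
                    start_xy=(rng.uniform(-500.0, 500.0), rng.uniform(-500.0, 500.0)),
                    seed=seed,
                ))
    return tasks
```

Seeding per task rather than per run means a single task can be regenerated without replaying the whole benchmark.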
Success Rate (SR) measures the share of victims correctly localized, using one-to-one matching between predictions and ground-truth victims.
Time-weighted Success Rate (TSR) rewards finding victims while completing missions efficiently.
Clue Discovery Score (CDS) combines spatial clue localization with strict semantic matching.
Rescue Score (RS) balances safety, victim localization, clue discovery, and temporal efficiency.
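To make the one-to-one matching idea concrete, the sketch below counts a prediction as correct only if it can be paired with a distinct ground-truth item within a distance radius, and optionally only if the labels match exactly (as strict semantic matching for clues would require). It uses a simple augmenting-path bipartite matching. The radius, the function names, and the choice of matching algorithm are assumptions; the benchmark's exact thresholds and formulas are not given here.

```python
import math


def _max_matching(adj, n_right):
    """Size of a maximum bipartite matching (augmenting paths, Kuhn's algorithm)."""
    match_right = [-1] * n_right  # match_right[j] = index of the prediction paired with truth j

    def try_assign(u, seen):
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            # Take a free truth, or re-route the prediction currently holding it.
            if match_right[v] == -1 or try_assign(match_right[v], seen):
                match_right[v] = u
                return True
        return False

    return sum(try_assign(u, set()) for u in range(len(adj)))


def matched_count(preds, truths, radius, same_label=False):
    """Count one-to-one matches between (label, position) predictions and truths."""
    adj = []
    for p_label, p_pos in preds:
        adj.append([
            j for j, (t_label, t_pos) in enumerate(truths)
            if math.dist(p_pos, t_pos) <= radius
            and (not same_label or p_label == t_label)
        ])
    return _max_matching(adj, len(truths))


def success_rate(pred_victims, true_victims, radius):
    """Fraction of ground-truth victims matched by a distinct prediction."""
    if not true_victims:
        return 0.0
    preds = [("victim", p) for p in pred_victims]
    truths = [("victim", t) for t in true_victims]
    return matched_count(preds, truths, radius) / len(truths)
```

The one-to-one constraint matters when an agent reports several guesses near the same victim: only one of them earns credit, so duplicate reports cannot inflate the score.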
Baselines
| Method | Type | SR | TSR | CDS | RS |
|---|---|---|---|---|---|
| Random | Basic | 2.65 | 2.47 | 1.51 | 9.81 |
| FBE | Basic exploration | 8.19 | 2.05 | 3.40 | 9.97 |
| Pure-MLLM | Direct MLLM control | 3.45 | 1.80 | 2.39 | 8.26 |
| SemExp | Ground ObjectNav | 6.83 | 1.21 | 2.47 | 9.05 |
| VLFM | Ground ObjectNav | 9.12 | 3.17 | 2.92 | 10.50 |
| NavGPT | Ground VLN | 5.92 | 1.36 | 3.30 | 10.89 |
| UniGoal | Ground ObjectNav | 6.47 | 0.74 | 2.94 | 9.27 |
| SPF | Aerial VLN | 8.84 | 0.94 | 3.53 | 13.12 |
| APEX | Aerial ObjectNav | 13.89 | 0.87 | 4.14 | 13.45 |
The aerial baselines (SPF and APEX) achieve the highest CDS and RS, and APEX the highest SR, ahead of methods transferred directly from ground navigation, because ESAR requires 3D motion, outdoor viewpoints, and search-specific exploration.
MLLM reasoning helps clue discovery most when it is paired with spatial memory, mapping, and UAV action structure.
Strong search policies still trade time for coverage: APEX has the highest SR (13.89) but a TSR of only 0.87, below even Random (2.47). Long-horizon flight safety remains a major bottleneck.
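The observations above can be checked directly against the baseline table. The short script below transcribes the table and reports the leading method per metric and the mean score per method family; grouping methods by the first word of their Type column is a convenience chosen here, not a grouping defined by the benchmark.

```python
from collections import defaultdict

METRICS = ("SR", "TSR", "CDS", "RS")

# (type, SR, TSR, CDS, RS), transcribed from the baseline table.
RESULTS = {
    "Random":    ("Basic", 2.65, 2.47, 1.51, 9.81),
    "FBE":       ("Basic exploration", 8.19, 2.05, 3.40, 9.97),
    "Pure-MLLM": ("Direct MLLM control", 3.45, 1.80, 2.39, 8.26),
    "SemExp":    ("Ground ObjectNav", 6.83, 1.21, 2.47, 9.05),
    "VLFM":      ("Ground ObjectNav", 9.12, 3.17, 2.92, 10.50),
    "NavGPT":    ("Ground VLN", 5.92, 1.36, 3.30, 10.89),
    "UniGoal":   ("Ground ObjectNav", 6.47, 0.74, 2.94, 9.27),
    "SPF":       ("Aerial VLN", 8.84, 0.94, 3.53, 13.12),
    "APEX":      ("Aerial ObjectNav", 13.89, 0.87, 4.14, 13.45),
}


def best_per_metric(results):
    """Name of the top-scoring method for each metric."""
    return {
        m: max(results, key=lambda name: results[name][i + 1])
        for i, m in enumerate(METRICS)
    }


def family_means(results):
    """Mean of each metric per family, where family is the first word of the type."""
    groups = defaultdict(list)
    for row in results.values():
        groups[row[0].split()[0]].append(row[1:])
    return {
        family: {m: sum(r[i] for r in rows) / len(rows) for i, m in enumerate(METRICS)}
        for family, rows in groups.items()
    }


if __name__ == "__main__":
    print(best_per_metric(RESULTS))
    for family, means in family_means(RESULTS).items():
        print(family, {m: round(x, 2) for m, x in means.items()})
```

On this table APEX leads SR, CDS, and RS while VLFM leads TSR, and the aerial family averages higher than the ground family on SR but lower on TSR.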
Citation
@misc{zhang2026esarbenchbenchmarkagenticuav,
  title={ESARBench: A Benchmark for Agentic UAV Embodied Search and Rescue},
  author={Daoxuan Zhang and Ping Chen and Jianyi Zhou and Shuo Yang},
  year={2026},
  eprint={2605.01371},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2605.01371},
}