ESARBench: A Benchmark for Agentic UAV Embodied Search and Rescue

Zhang, Daoxuan; Chen, Ping; Zhou, Jianyi; Yang, Shuo

Embodied AI for aerial search and rescue

A high-fidelity ESAR simulator and evaluation suite for UAV agents that must explore open 3D terrain, discover rescue clues, reason over mission context, and report victim locations.

Daoxuan Zhang, Ping Chen, Jianyi Zhou, Shuo Yang

Harbin Institute of Technology, Shenzhen

4 GIS-based worlds

600 ESAR tasks

13 weather types

UE5 + AirSim for rendering and flight

High-fidelity UAV Rescue Simulation

The simulator reconstructs representative wilderness rescue environments and exposes long-horizon UAV search episodes with dynamic terrain, visibility, weather, clues, and victim locations.

Abstract

From UAV Navigation to Embodied Rescue Reasoning

ESARBench introduces Embodied Search and Rescue (ESAR), a task in which aerial agents autonomously explore complex 3D environments, identify mission-critical clues, and reason about likely victim locations before reporting precise spatial coordinates.

The benchmark uses Unreal Engine 5 and AirSim to construct four large-scale, photorealistic environments mapped from real-world GIS data. Dynamic variables such as weather, time of day, and stochastic clue placement make each mission closer to practical rescue operations.

ESARBench contains 12 rescue events, 60 temporal snapshots, and 600 tasks. Baseline experiments show that current methods remain far from solving ESAR, with core bottlenecks in spatial memory, aerial adaptation, safe long-horizon flight, and semantic reasoning over clues.

New Task

Defines agentic UAV ESAR beyond passive instruction following.

New Simulator

Builds high-fidelity wilderness worlds from real GIS terrain.

New Protocol

Evaluates victim search, clue discovery, time, and safety.

Task formulation

ESAR: Embodied Search and Rescue

Mission Start

The agent receives conditions, coordinates, and a high-level prompt describing the missing target.

Exploration

The UAV searches large outdoor 3D terrain with multi-view RGB-D observations and flight state.

Clue Discovery

Tents, backpacks, campfires, flares, and other objects must be detected and interpreted as evidence.

Life Search

The agent locates victims and reports predicted 3D coordinates under time and safety constraints.
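
The four stages above imply a simple input/output contract: a mission brief goes in, and a report of clues and victim coordinates comes out. The following is a minimal sketch of that contract, not the benchmark's actual interface. Every class and field name here (`MissionBrief`, `MissionReport`, `add_clue`, and so on) is hypothetical, and the coordinate frame and units are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vec3 = Tuple[float, float, float]  # (x, y, z); metres and frame are assumptions


@dataclass
class MissionBrief:
    """What the agent is given at mission start (hypothetical field names)."""
    prompt: str            # high-level description of the missing target
    start_position: Vec3   # UAV start coordinates
    weather: str           # environmental condition label
    time_of_day: str


@dataclass
class MissionReport:
    """What the agent hands back at the end of an episode (hypothetical)."""
    victim_positions: List[Vec3] = field(default_factory=list)
    clues: List[Tuple[str, Vec3]] = field(default_factory=list)  # (label, position)

    def add_clue(self, label: str, position: Vec3) -> None:
        self.clues.append((label, position))

    def add_victim(self, position: Vec3) -> None:
        self.victim_positions.append(position)


# Example episode: one clue interpreted as evidence, one victim reported.
brief = MissionBrief("Hiker missing near the ridge", (0.0, 0.0, 50.0), "fog", "dusk")
report = MissionReport()
report.add_clue("tent", (120.0, -35.0, 12.0))
report.add_victim((140.0, -20.0, 10.0))
```

Keeping clues and victims as separate lists mirrors the way the benchmark scores clue discovery and victim localization separately.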

UAV-ESAR simulator

Four Real-world Rescue Terrains

Alpine

Aotai Trail

Mountain ridges, forests, and alpine meadows for long-horizon wilderness search.

Desert

Lop Nur

Open desert and Gobi terrain with harsh visibility and sparse semantic cues.

Snowy Peak

K2

High-altitude snow fields and steep terrain challenge navigation and perception.

Coast

Dapeng Peninsula

Hills, forested areas, and coastal cliffs for mixed visual and topological conditions.

Realistic terrain construction

Real-world GIS and DEM data are mapped into UE5 landscapes, then coupled with AirSim-Colosseum flight dynamics.

Dynamic environmental state

Tasks vary weather, illumination, clue placement, start position, and target configuration.

Multi-modal UAV sensors

The simulator provides GPS, IMU, LiDAR, multi-view RGB, and depth observations for embodied agents.
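
To illustrate the terrain-construction step, the sketch below rescales a small grid of DEM elevations to 16-bit values, the kind of grayscale heightmap that Unreal Engine landscapes are commonly imported from. The page does not describe the authors' actual GIS pipeline, so the function name `dem_to_heightmap16` and the linear min-max scaling are assumptions made for illustration.

```python
from typing import List


def dem_to_heightmap16(dem_m: List[List[float]]) -> List[List[int]]:
    """Linearly rescale elevations (metres) to the 0..65535 range."""
    lo = min(min(row) for row in dem_m)
    hi = max(max(row) for row in dem_m)
    span = hi - lo
    if span == 0:
        # Flat terrain: place every sample at mid-range.
        return [[32768 for _ in row] for row in dem_m]
    return [[round((z - lo) / span * 65535) for z in row] for row in dem_m]


# Toy 2x2 DEM tile with elevations in metres.
dem = [[1200.0, 1250.0], [1300.0, 1400.0]]
heightmap = dem_to_heightmap16(dem)
```

The lowest sample maps to 0 and the highest to 65535, so the real elevation span has to be restored through the landscape's vertical scale at import time.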

Benchmark design

Task Data Generation

12 Events

Real-world rescue cases are abstracted into longitudinal mission narratives.

60 Snapshots

Each event is discretized into reproducible static temporal states.

600 Tasks

Tasks sample weather, time, UAV starts, clue layouts, and victim positions.
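
The counts above are consistent with 12 events × 5 snapshots × 10 tasks = 600, if snapshots and tasks are split evenly. The page does not confirm an even split, so treat it as an assumption. The sketch below enumerates tasks in that way with a fixed seed; the weather names, time-of-day options, start-position range, and task ID format are all placeholders.

```python
import random
from itertools import product

# Even split across events and snapshots is an assumption.
N_EVENTS, SNAPSHOTS_PER_EVENT, TASKS_PER_SNAPSHOT = 12, 5, 10

WEATHER = [f"weather_{i}" for i in range(13)]  # placeholder names for the 13 weather types
TIMES = ["dawn", "noon", "dusk", "night"]      # hypothetical time-of-day options


def generate_tasks(seed: int = 0):
    rng = random.Random(seed)  # fixed seed keeps the task list reproducible
    tasks = []
    for event, snap, k in product(
        range(N_EVENTS), range(SNAPSHOTS_PER_EVENT), range(TASKS_PER_SNAPSHOT)
    ):
        tasks.append({
            "task_id": f"e{event:02d}_s{snap}_t{k}",
            "event": event,
            "snapshot": snap,
            "weather": rng.choice(WEATHER),
            "time_of_day": rng.choice(TIMES),
            # Hypothetical start position in metres around the map origin.
            "uav_start": (rng.uniform(-500, 500), rng.uniform(-500, 500), 50.0),
        })
    return tasks


tasks = generate_tasks()
```

Seeding the generator once per run means the same 600 task configurations can be regenerated for every baseline.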

SR

Success Rate measures how many victims are correctly localized using one-to-one matching.

TSR

Time-weighted Success Rate rewards finding victims while completing missions efficiently.

CDS

Clue Discovery Score combines spatial clue localization with strict semantic matching.

RS

Rescue Score balances safety, victim localization, clue discovery, and temporal efficiency.
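
The one-to-one matching behind SR and the strict semantic matching behind CDS can be sketched as follows. This is an illustration, not the official scoring code: the greedy nearest-first matching, the distance radii, and the normalization by ground-truth count are all assumptions, and the benchmark may use a different assignment rule or weighting. TSR and RS are left out because their formulas are not given here.

```python
import math
from typing import List, Tuple

Item = Tuple[str, Tuple[float, float, float]]  # (label, (x, y, z))


def count_matches(preds: List[Item], truths: List[Item], radius: float) -> int:
    """Greedy one-to-one matching: nearest valid pairs first, each item used once."""
    candidates = sorted(
        (math.dist(p_pos, t_pos), i, j)
        for i, (p_label, p_pos) in enumerate(preds)
        for j, (t_label, t_pos) in enumerate(truths)
        if p_label == t_label  # strict semantic match: labels must be identical
    )
    used_p, used_t = set(), set()
    for dist, i, j in candidates:
        if dist > radius:
            break  # sorted ascending, so all remaining pairs are too far apart
        if i not in used_p and j not in used_t:
            used_p.add(i)
            used_t.add(j)
    return len(used_t)


def rate(preds: List[Item], truths: List[Item], radius: float) -> float:
    """Fraction of ground-truth items matched by some prediction."""
    return count_matches(preds, truths, radius) / len(truths) if truths else 0.0


# SR-style check: two reports near the same victim count only once.
victims_true = [("victim", (0.0, 0.0, 0.0)), ("victim", (100.0, 0.0, 0.0))]
victims_pred = [("victim", (1.0, 0.0, 0.0)), ("victim", (2.0, 0.0, 0.0))]
sr = rate(victims_pred, victims_true, radius=10.0)   # 0.5

# CDS-style check: a well-placed clue with the wrong label does not count.
clues_true = [("tent", (0.0, 0.0, 0.0)), ("backpack", (50.0, 0.0, 0.0))]
clues_pred = [("tent", (3.0, 0.0, 0.0)), ("campfire", (50.0, 0.0, 0.0))]
cds = rate(clues_pred, clues_true, radius=15.0)      # 0.5
```

One-to-one matching prevents an agent from inflating its score by reporting many coordinates around a single victim, and the label check keeps a correctly located but misidentified clue from being credited.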

Baselines

Current Agents Still Struggle with ESAR

Method	Type	SR	TSR	CDS	RS
Random	Basic	2.65	2.47	1.51	9.81
FBE	Basic exploration	8.19	2.05	3.40	9.97
Pure-MLLM	Direct MLLM control	3.45	1.80	2.39	8.26
SemExp	Ground ObjectNav	6.83	1.21	2.47	9.05
VLFM	Ground ObjectNav	9.12	3.17	2.92	10.50
NavGPT	Ground VLN	5.92	1.36	3.30	10.89
UniGoal	Ground ObjectNav	6.47	0.74	2.94	9.27
SPF	Aerial VLN	8.84	0.94	3.53	13.12
APEX	Aerial ObjectNav	13.89	0.87	4.14	13.45

Aerial adaptation matters

The aerial baselines, SPF and APEX, reach the highest CDS and RS in the table, and APEX the highest SR. Direct transfers from ground navigation fall behind because ESAR requires 3D motion, outdoor viewpoints, and search-specific exploration.

Reasoning needs embodiment

MLLM reasoning helps clue discovery most when it is paired with spatial memory, mapping, and UAV action structure.

Safety and efficiency remain hard

Strong search policies still trade time for coverage: APEX has the highest SR (13.89) but a TSR of only 0.87. Long-horizon flight safety also remains a major bottleneck.

Resources

Use ESARBench

GitHub

Code, benchmark runners, baseline adapters, and evaluation scripts.

Dataset

ESARBench map packages.

AirSim-Colosseum

UE5-based AirSim backend used for UAV physics and simulator control.

Citation

BibTeX

@misc{zhang2026esarbenchbenchmarkagenticuav,
      title={ESARBench: A Benchmark for Agentic UAV Embodied Search and Rescue}, 
      author={Daoxuan Zhang and Ping Chen and Jianyi Zhou and Shuo Yang},
      year={2026},
      eprint={2605.01371},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2605.01371}, 
}