Embodied AI for aerial search and rescue

ESARBench

A Benchmark for Agentic UAV Embodied Search and Rescue

A high-fidelity ESAR simulator and evaluation suite for UAV agents that must explore open 3D terrain, discover rescue clues, reason over mission context, and report victim locations.

Daoxuan Zhang, Ping Chen, Jianyi Zhou, Shuo Yang

Harbin Institute of Technology, Shenzhen

4 GIS-based worlds
600 ESAR tasks
13 weather types
UE5 + AirSim: rendering and flight

Project video

High-fidelity UAV Rescue Simulation

The simulator reconstructs representative wilderness rescue environments and exposes long-horizon UAV search episodes with dynamic terrain, visibility, weather, clues, and victim locations.

Abstract

From UAV Navigation to Embodied Rescue Reasoning

ESARBench introduces Embodied Search and Rescue (ESAR), a task in which aerial agents autonomously explore complex 3D environments, identify mission-critical clues, and reason about likely victim locations before reporting precise spatial coordinates.

The benchmark uses Unreal Engine 5 and AirSim to construct four large-scale, photorealistic environments mapped from real-world GIS data. Dynamic variables such as weather, time of day, and stochastic clue placement make each mission closer to practical rescue operations.

ESARBench contains 12 rescue events, 60 temporal snapshots, and 600 tasks. Baseline experiments show that current methods remain far from solving ESAR, with core bottlenecks in spatial memory, aerial adaptation, safe long-horizon flight, and semantic reasoning over clues.

New Task

Defines agentic UAV ESAR beyond passive instruction following.

New Simulator

Builds high-fidelity wilderness worlds from real GIS terrain.

New Protocol

Evaluates victim search, clue discovery, time, and safety.

Task formulation

ESAR: Embodied Search and Rescue

Figure: ESAR workflow from mission start through exploration and clue discovery to life search.

ESAR requires a UAV agent to combine perception, memory, semantic reasoning, and 3D motion planning while continuously updating its search strategy from discovered clues.

01

Mission Start

The agent receives conditions, coordinates, and a high-level prompt describing the missing target.

02

Exploration

The UAV searches large outdoor 3D terrain with multi-view RGB-D observations and flight state.

03

Clue Discovery

Tents, backpacks, campfires, flares, and other objects must be detected and interpreted as evidence.

04

Life Search

The agent locates victims and reports predicted 3D coordinates under time and safety constraints.
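
The four stages above can be read as one agent loop. The following is a minimal sketch under stated assumptions, not the ESARBench interface: `Observation`, `Report`, `ToyEnv`, and `run_episode` are hypothetical names, the toy environment stands in for the simulator, and labelled objects stand in for RGB-D perception. It leaves out the mission prompt and any reasoning over clues.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Observation:
    # Hypothetical container; the simulator's real observation format may differ.
    position: Vec3                           # UAV position taken from flight state
    visible_objects: List[Tuple[str, Vec3]]  # (label, position) pairs, a stand-in for perception


@dataclass
class Report:
    clues: List[Tuple[str, Vec3]] = field(default_factory=list)
    victims: List[Vec3] = field(default_factory=list)


class ToyEnv:
    """Stand-in for the simulator: a UAV flying along the x axis past two objects."""

    def __init__(self) -> None:
        self.x = 0.0
        self.objects = [("tent", (20.0, 0.0, 0.0)), ("victim", (40.0, 0.0, 0.0))]

    def observe(self) -> Observation:
        # Objects within 5 m along x count as visible.
        seen = [(lbl, p) for lbl, p in self.objects if abs(p[0] - self.x) <= 5.0]
        return Observation((self.x, 0.0, 30.0), seen)

    def step(self, dx: float) -> None:
        self.x += dx


def run_episode(env: ToyEnv, max_steps: int) -> Report:
    """Explore, record each clue once, and report each victim position once."""
    report = Report()
    for _ in range(max_steps):
        obs = env.observe()
        for label, pos in obs.visible_objects:
            if label == "victim":
                if pos not in report.victims:
                    report.victims.append(pos)
            elif (label, pos) not in report.clues:
                report.clues.append((label, pos))
        env.step(5.0)  # fixed forward motion; a real agent would plan in 3D
    return report


report = run_episode(ToyEnv(), max_steps=10)
```

A real agent would replace the fixed forward step with a planner that uses the recorded clues to decide where to search next.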

UAV-ESAR simulator

Four Real-world Rescue Terrains

Alpine

Aotai Trail

Mountain ridges, forests, and alpine meadows for long-horizon wilderness search.

Desert

Lop Nur

Open desert and Gobi terrain with harsh visibility and sparse semantic cues.

Snowy Peak

K2

High-altitude snow fields and steep terrain challenge navigation and perception.

Coast

Dapeng Peninsula

Hills, forested areas, and coastal cliffs for mixed visual and topological conditions.

Realistic terrain construction

Real-world GIS and DEM data are mapped into UE5 landscapes, then coupled with AirSim-Colosseum flight dynamics.

Dynamic environmental state

Tasks vary weather, illumination, clue placement, start position, and target configuration.

Multi-modal UAV sensors

The simulator provides GPS, IMU, LiDAR, multi-view RGB, and depth observations for embodied agents.

Benchmark design

Task Data Generation

12 Events

Real-world rescue cases are abstracted into longitudinal mission narratives.

60 Snapshots

Each event is discretized into reproducible static temporal states.

600 Tasks

Tasks sample weather, time, UAV starts, clue layouts, and victim positions.
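
The counts above work out to 5 snapshots per event and 10 tasks per snapshot if the split is even, which is an assumption here rather than something the page states. The sketch below enumerates tasks under that assumption with seeded sampling. `TaskSpec` and `generate_tasks` are hypothetical names, and weather is represented by an index because the 13 weather types are not named in this overview.

```python
import random
from dataclasses import dataclass
from typing import List

NUM_EVENTS = 12
SNAPSHOTS_PER_EVENT = 5   # assumption: 60 snapshots / 12 events, split evenly
TASKS_PER_SNAPSHOT = 10   # assumption: 600 tasks / 60 snapshots, split evenly
NUM_WEATHER_TYPES = 13    # count from the benchmark summary; type names are not listed


@dataclass(frozen=True)
class TaskSpec:
    event: int
    snapshot: int
    task: int
    weather_id: int  # index into the (unlisted) weather types
    hour: int        # time of day, 0-23


def generate_tasks(seed: int = 0) -> List[TaskSpec]:
    """Enumerate every (event, snapshot, task) triple with seeded random conditions."""
    rng = random.Random(seed)  # a fixed seed keeps the task list reproducible
    tasks = []
    for e in range(NUM_EVENTS):
        for s in range(SNAPSHOTS_PER_EVENT):
            for t in range(TASKS_PER_SNAPSHOT):
                tasks.append(
                    TaskSpec(e, s, t, rng.randrange(NUM_WEATHER_TYPES), rng.randrange(24))
                )
    return tasks


tasks = generate_tasks()
print(len(tasks))  # 600
```

UAV start positions, clue layouts, and victim positions would be sampled the same way, but they depend on terrain data that this sketch does not model.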

SR

Success Rate measures how many victims are correctly localized using one-to-one matching.

TSR

Time-weighted Success Rate rewards finding victims while completing missions efficiently.

CDS

Clue Discovery Score combines spatial clue localization with strict semantic matching.

RS

Rescue Score balances safety, victim localization, clue discovery, and temporal efficiency.
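
The metric descriptions above mention one-to-one matching for SR and strict semantic matching for CDS, but this overview gives no thresholds or formulas. The sketch below illustrates one way such matching could work: a maximum bipartite matching (augmenting paths) in which a prediction can match a ground-truth item only if it lies within `radius` and, optionally, carries the same label. The function names, the `radius` parameter, and the matched-fraction score are assumptions for illustration, not the benchmark's definitions.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Item = Tuple[str, Vec3]  # (label, position)


def max_one_to_one_matches(preds: Sequence[Item], truths: Sequence[Item],
                           radius: float, match_labels: bool) -> int:
    """Size of the largest one-to-one matching between predictions and ground truth."""

    def compatible(p: Item, g: Item) -> bool:
        if match_labels and p[0] != g[0]:
            return False  # strict semantic matching: labels must be identical
        return math.dist(p[1], g[1]) <= radius

    # owner[j] is the index of the prediction currently matched to truth j.
    owner: List[Optional[int]] = [None] * len(truths)

    def try_assign(i: int, visited: List[bool]) -> bool:
        for j, g in enumerate(truths):
            if visited[j] or not compatible(preds[i], g):
                continue
            visited[j] = True
            # Take truth j if it is free, or if its current owner can move elsewhere.
            if owner[j] is None or try_assign(owner[j], visited):
                owner[j] = i
                return True
        return False

    return sum(try_assign(i, [False] * len(truths)) for i in range(len(preds)))


def matched_fraction(preds: Sequence[Item], truths: Sequence[Item],
                     radius: float, match_labels: bool) -> float:
    """Fraction of ground-truth items that receive a distinct matching prediction."""
    if not truths:
        return 0.0
    return max_one_to_one_matches(preds, truths, radius, match_labels) / len(truths)


victims = [("victim", (0.0, 0.0, 0.0)), ("victim", (10.0, 0.0, 0.0))]
guesses = [("victim", (1.0, 0.0, 0.0)), ("victim", (2.0, 0.0, 0.0)), ("victim", (50.0, 0.0, 0.0))]
print(matched_fraction(guesses, victims, radius=5.0, match_labels=False))  # 0.5

clues = [("tent", (0.0, 0.0, 0.0))]
print(matched_fraction([("backpack", (0.0, 0.0, 0.0))], clues, 5.0, True))  # 0.0
print(matched_fraction([("tent", (0.0, 0.0, 0.0))], clues, 5.0, True))      # 1.0
```

In the first example two guesses fall near the same victim, yet only one counts, which is the point of one-to-one matching: duplicate reports cannot inflate the score.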

Baselines

Current Agents Still Struggle with ESAR

Method      Type                  SR      TSR    CDS    RS
Random      Basic                 2.65    2.47   1.51   9.81
FBE         Basic exploration     8.19    2.05   3.40   9.97
Pure-MLLM   Direct MLLM control   3.45    1.80   2.39   8.26
SemExp      Ground ObjectNav      6.83    1.21   2.47   9.05
VLFM        Ground ObjectNav      9.12    3.17   2.92   10.50
NavGPT      Ground VLN            5.92    1.36   3.30   10.89
UniGoal     Ground ObjectNav      6.47    0.74   2.94   9.27
SPF         Aerial VLN            8.84    0.94   3.53   13.12
APEX        Aerial ObjectNav      13.89   0.87   4.14   13.45
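
To make the comparison easier to read, the short script below picks the top method for each metric. The numbers are copied from the table above; the variable names are illustrative only.

```python
# (SR, TSR, CDS, RS) for each baseline, copied from the results table.
RESULTS = {
    "Random":    (2.65, 2.47, 1.51, 9.81),
    "FBE":       (8.19, 2.05, 3.40, 9.97),
    "Pure-MLLM": (3.45, 1.80, 2.39, 8.26),
    "SemExp":    (6.83, 1.21, 2.47, 9.05),
    "VLFM":      (9.12, 3.17, 2.92, 10.50),
    "NavGPT":    (5.92, 1.36, 3.30, 10.89),
    "UniGoal":   (6.47, 0.74, 2.94, 9.27),
    "SPF":       (8.84, 0.94, 3.53, 13.12),
    "APEX":      (13.89, 0.87, 4.14, 13.45),
}
METRICS = ("SR", "TSR", "CDS", "RS")

# For each metric, pick the method with the highest score in that column.
best = {
    metric: max(RESULTS, key=lambda name: RESULTS[name][i])
    for i, metric in enumerate(METRICS)
}
print(best)  # {'SR': 'APEX', 'TSR': 'VLFM', 'CDS': 'APEX', 'RS': 'APEX'}
```

APEX leads on SR, CDS, and RS, while VLFM has the highest TSR.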

Aerial adaptation matters

The aerial baselines (SPF and APEX) post the highest CDS and RS, and APEX the highest SR, ahead of methods transferred directly from ground navigation. ESAR requires 3D motion, outdoor viewpoints, and search-specific exploration.

Reasoning needs embodiment

MLLM reasoning helps clue discovery most when it is paired with spatial memory, mapping, and UAV action structure.

Safety and efficiency remain hard

Strong search policies still trade time for coverage: APEX reaches the highest SR (13.89) but a TSR of only 0.87. Long-horizon flight safety remains a major bottleneck.

Resources

Use ESARBench

Citation

BibTeX

@misc{zhang2026esarbenchbenchmarkagenticuav,
      title={ESARBench: A Benchmark for Agentic UAV Embodied Search and Rescue}, 
      author={Daoxuan Zhang and Ping Chen and Jianyi Zhou and Shuo Yang},
      year={2026},
      eprint={2605.01371},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2605.01371}, 
}