Sari Sandbox: A Virtual Retail Store Environment for Embodied AI Agents

Accepted at the ICCV 2025 Workshop on RetailVision

To be presented at Best of ICCV

Janika Deborah Gajo,¹ Gerarld Paul Merales,¹ Jerome Escarcha,¹ Brenden Ashley Molina,¹ Gian Nartea,¹ Emmanuel Maminta,² Juan Carlos Roldan,² Rowel Atienza¹,²

¹EEEI, University of the Philippines, Diliman, Quezon City ²AI Graduate Program, University of the Philippines, Diliman, Quezon City

Sari Sandbox: A high-fidelity virtual retail environment for benchmarking embodied AI agents and humans. Features include randomized product placement, multiple store layouts, interactive elements, VR support, and a Python API for task control. Both human participants and agents perform comparable shopping tasks for evaluation.

Abstract

We present Sari Sandbox, a high-fidelity, photorealistic 3D retail store simulation for benchmarking embodied agents against human performance in shopping tasks. Addressing a gap in retail-specific simulation environments for embodied agent training, Sari Sandbox features over 250 interactive grocery items across three store configurations, controlled via an API. It supports both virtual reality (VR) for human interaction and a vision language model (VLM)-powered embodied agent. We also introduce SariBench, a dataset of annotated human demonstrations across varied task difficulties. Our sandbox enables embodied agents to navigate, inspect, and manipulate retail items, providing baselines against human performance. We conclude with benchmarks, performance analysis, and recommendations for enhancing realism and scalability. The source code is available at https://github.com/upeee/sari-sandbox-env.

Motivation

Real-world retail environments are complex, variable, and costly to replicate at scale. Sari Sandbox addresses this by providing a high-fidelity, controllable testbed for embodied agents in retail shopping tasks. It enables controlled experimentation on perception, navigation, and manipulation under realistic conditions, with support for both human participants and AI agents.

Environment Design

The Sari Sandbox environment features three store layouts with randomized item placement. It includes over 250 interactive, high-fidelity grocery products across shelves, freezers, and checkout counters. This design enables training and evaluation of agents under diverse conditions.

SariBench

SariBench is a dataset and benchmark suite designed to evaluate both human and agent performance on shopping-related tasks. It spans three difficulty levels (Easy, Average, and Difficult), with the harder tasks requiring manipulation, reasoning, and product comparison.

Difficulty	Skills Involved	Example Task
Easy	Perception, Navigation, Manipulation	Find and pick up a box of cereal.
Average	Perception, Navigation, Manipulation, Memory, Task Execution	Pick up a bottle of soda and scan at checkout.
Difficult	Perception, Navigation, Manipulation, Memory, Task Execution, Decision Making, Comprehension	Which of these two products has lower sugar content: strawberry-flavored biscuit or chocolate-flavored biscuit? Scan the answer.
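The skill lists in the table grow from one difficulty level to the next. As a minimal sketch, the table can be encoded as a mapping and checked for that cumulative structure. The skill names are copied from the table; the dictionary layout and the helper names `added_skills` and `is_cumulative` are illustrative assumptions and are not part of SariBench.

```python
from __future__ import annotations

# Skill sets copied from the difficulty table above. The dictionary layout is
# an illustrative assumption, not the SariBench file format.
SKILLS_BY_DIFFICULTY = {
    "Easy": {"Perception", "Navigation", "Manipulation"},
    "Average": {
        "Perception", "Navigation", "Manipulation",
        "Memory", "Task Execution",
    },
    "Difficult": {
        "Perception", "Navigation", "Manipulation",
        "Memory", "Task Execution",
        "Decision Making", "Comprehension",
    },
}


def added_skills(easier: str, harder: str) -> set[str]:
    """Return the skills the harder level lists beyond the easier level."""
    return SKILLS_BY_DIFFICULTY[harder] - SKILLS_BY_DIFFICULTY[easier]


def is_cumulative(order: tuple[str, ...] = ("Easy", "Average", "Difficult")) -> bool:
    """Check that each level's skill set contains the previous level's set."""
    return all(
        SKILLS_BY_DIFFICULTY[a] <= SKILLS_BY_DIFFICULTY[b]
        for a, b in zip(order, order[1:])
    )


if __name__ == "__main__":
    print(sorted(added_skills("Easy", "Average")))
    print(sorted(added_skills("Average", "Difficult")))
    print(is_cumulative())
```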

VR Playground

Human participants interact with the environment using VR controllers to grab, inspect, and scan products. This enables fine-grained benchmarking of human behavior in embodied tasks.

Agent Architecture

Our embodied agent is composed of three main modules: a Vision Language Model (VLM)-based captioner, a planner that converts high-level goals into structured steps, and a decider that maps these steps to low-level actions. This architecture supports semantic navigation, object interaction, and episodic memory.
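The three-module structure can be sketched as a small pipeline. This is only an illustration under stated assumptions: the class name `Agent`, the callable signatures, the action strings, and the choice to store captions as episodic memory are all hypothetical, and the stub functions stand in for the VLM-backed components rather than reproducing the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Agent:
    # All three signatures are assumptions made for this sketch.
    captioner: Callable[[bytes], str]         # camera frame -> text caption
    planner: Callable[[str, str], List[str]]  # (goal, caption) -> structured steps
    decider: Callable[[str, str], str]        # (step, caption) -> low-level action
    memory: List[str] = field(default_factory=list)  # assumed episodic memory

    def act(self, goal: str, frame: bytes) -> List[str]:
        """Run one caption -> plan -> decide pass and return the actions."""
        caption = self.captioner(frame)
        self.memory.append(caption)  # remember what was observed
        steps = self.planner(goal, caption)
        return [self.decider(step, caption) for step in steps]


# Stub components standing in for the VLM-backed modules (hypothetical).
def stub_captioner(frame: bytes) -> str:
    return "a shelf with cereal boxes ahead"


def stub_planner(goal: str, caption: str) -> List[str]:
    return ["go to shelf", "grab item"]


def stub_decider(step: str, caption: str) -> str:
    return {"go to shelf": "move_forward", "grab item": "grasp"}[step]


if __name__ == "__main__":
    agent = Agent(stub_captioner, stub_planner, stub_decider)
    print(agent.act("Find and pick up a box of cereal.", b""))
```

Keeping each module behind a plain callable makes it possible to swap a stub for a real model without touching the control flow, which is the main point of the sketch.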

Human vs. Agent Performance

We compare human performance to an embodied agent baseline across task difficulty levels. The metrics are HAT (Human Average Time), HCR% (Human Completion Rate), AAT (Agent Average Time), and ACR% (Agent Completion Rate). Arrows indicate whether lower (↓) or higher (↑) values are better; a dash marks entries with no reported agent result.

Difficulty	HAT ↓	HCR% ↑	AAT ↓	ACR% ↑
Easy-L1	47	88.88	780	68.63
Easy-L2	73	100.00	660	45.10
Easy-L3	61	93.33	420	33.33
Average-L1	158	87.50	-	-
Average-L2	106	100.00	-	-
Average-L3	84	100.00	-	-
Difficult-L1	76	100.00	-	-
Difficult-L2	136	100.00	-	-
Difficult-L3	113	100.00	-	-
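Average time and completion rate of the kind reported above can be computed from per-attempt records, as in the following minimal sketch. The `Episode` record and both function names are hypothetical. Averaging time over all recorded attempts, rather than only completed ones, is an assumption, since the page does not say which convention is used. The example data are made up and do not come from SariBench.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class Episode:
    duration: float  # elapsed time of one attempt, in whatever unit was logged
    completed: bool  # whether the task was finished successfully


def average_time(episodes: List[Episode]) -> Optional[float]:
    """Mean duration over all attempts; None when there is nothing to average."""
    if not episodes:
        return None
    return mean(e.duration for e in episodes)


def completion_rate(episodes: List[Episode]) -> Optional[float]:
    """Percentage of attempts that were completed; None for an empty list."""
    if not episodes:
        return None
    return 100.0 * sum(e.completed for e in episodes) / len(episodes)


if __name__ == "__main__":
    # Made-up attempts for illustration only.
    attempts = [
        Episode(40.0, True),
        Episode(50.0, True),
        Episode(60.0, False),
        Episode(50.0, True),
    ]
    print(average_time(attempts), completion_rate(attempts))
```

Returning `None` for an empty list mirrors the dashes in the table, where no value is available for a given setting.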

BibTeX

@inproceedings{gajo2025sari,
  author    = {Gajo, Janika Deborah and Merales, Gerarld Paul and Escarcha, Jerome and Molina, Brenden Ashley and Nartea, Gian and Maminta, Emmanuel and Roldan, Juan Carlos and Atienza, Rowel},
  title     = {Sari Sandbox: A Virtual Retail Store Environment for Embodied AI Agents},
  booktitle = {ICCV 2025 Workshop RetailVision},
  year      = {2025},
}