Sari Sandbox: A Virtual Retail Store Environment for Embodied AI Agents

Accepted at the ICCV 2025 Workshop on RetailVision

To be presented at Best of ICCV

1 EEEI, University of the Philippines, Diliman, Quezon City
2 AI Graduate Program, University of the Philippines, Diliman, Quezon City
Sari-sari Sandbox banner

Sari Sandbox: A high-fidelity virtual retail environment for benchmarking embodied AI agents and humans. Features include randomized product placement, multiple store layouts, interactive elements, VR support, and a Python API for task control. Both human participants and agents perform comparable shopping tasks for evaluation.

Abstract

We present Sari Sandbox, a high-fidelity, photorealistic 3D retail store simulation for benchmarking embodied agents against human performance in shopping tasks. Addressing a gap in retail-specific simulation environments for embodied agent training, Sari Sandbox features over 250 interactive grocery items across three store configurations, controlled via an API. It supports both virtual reality (VR) for human interaction and a vision language model (VLM)-powered embodied agent. We also introduce SariBench, a dataset of annotated human demonstrations across varied task difficulties. Our sandbox enables embodied agents to navigate, inspect, and manipulate retail items, providing baselines against human performance. We conclude with benchmarks, performance analysis, and recommendations for enhancing realism and scalability. The source code is available at https://github.com/upeee/sari-sandbox-env.

Motivation

Real-world retail environments are complex, variable, and costly to replicate at scale. Sari Sandbox addresses this by providing a high-fidelity, controllable testbed for embodied agents in retail shopping tasks. It enables controlled experimentation on perception, navigation, and manipulation under realistic conditions, with support for both human participants and AI agents.

Environment Design

Sari Sandbox provides three store layouts, each with randomized item placement, and over 250 interactive, high-fidelity grocery products distributed across shelves, freezers, and checkout counters. Varying the layout and the placement lets agents be trained and evaluated under diverse conditions.
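Randomized item placement can be pictured as a seeded random assignment of products to shelf slots. The snippet below is only a minimal sketch of that idea: the function name, the slot labels, and the product names are hypothetical and are not taken from the Sari Sandbox API.

```python
import random


def randomize_placement(products, slots, seed=None):
    """Assign each product to a distinct, randomly chosen slot.

    Hypothetical helper for illustration; not part of the Sari Sandbox API.
    """
    if len(products) > len(slots):
        raise ValueError("not enough slots for all products")
    rng = random.Random(seed)  # a seeded generator makes a placement reproducible
    chosen = rng.sample(slots, k=len(products))  # distinct slots in random order
    return dict(zip(products, chosen))


# Made-up product and slot names.
products = ["cereal", "soda", "biscuit"]
slots = ["shelf_A1", "shelf_A2", "freezer_1", "shelf_B1"]
placement = randomize_placement(products, slots, seed=0)
```

Fixing the seed gives humans and agents the same arrangement for a given trial, while changing it produces a new arrangement.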

Environment layout 1
SariBench task flowchart 1
SariBench task flowchart 2

SariBench

SariBench is a dataset and benchmark suite designed to evaluate both human and agent performance on shopping-related tasks. It spans three difficulty levels (Easy, Average, and Difficult); the harder tasks require manipulation, reasoning, and product comparison.

Difficulty | Skills Involved | Example Task
Easy | Perception, Navigation, Manipulation | Find and pick up a box of cereal.
Average | Perception, Navigation, Manipulation, Memory, Task Execution | Pick up a bottle of soda and scan at checkout.
Difficult | Perception, Navigation, Manipulation, Memory, Task Execution, Decision Making, Comprehension | Which of these two products has lower sugar content: strawberry-flavored biscuit or chocolate-flavored biscuit? Scan the answer.
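The skill lists in the table nest: each difficulty level keeps the skills of the level below it and adds new ones. The short sketch below transcribes the table into Python sets to make that explicit; the variable and function names are illustrative only.

```python
# Skills per difficulty level, transcribed from the table above.
SKILLS = {
    "Easy": {"Perception", "Navigation", "Manipulation"},
    "Average": {"Perception", "Navigation", "Manipulation",
                "Memory", "Task Execution"},
    "Difficult": {"Perception", "Navigation", "Manipulation",
                  "Memory", "Task Execution",
                  "Decision Making", "Comprehension"},
}


def added_skills(easier, harder):
    """Return the skills that `harder` requires beyond `easier`."""
    return SKILLS[harder] - SKILLS[easier]


# Each level is a strict superset of the one before it.
assert SKILLS["Easy"] < SKILLS["Average"] < SKILLS["Difficult"]

print(sorted(added_skills("Easy", "Average")))       # ['Memory', 'Task Execution']
print(sorted(added_skills("Average", "Difficult")))  # ['Comprehension', 'Decision Making']
```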

VR Playground

Human participants interact with the environment using VR controllers to grab, inspect, and scan products. This enables fine-grained benchmarking of human behavior in embodied tasks.

VR Participants

Agent Architecture

Our embodied agent is composed of three main modules: a Vision Language Model (VLM)-based captioner, a planner that converts high-level goals into structured steps, and a decider that maps these into low-level actions. This architecture supports semantic navigation, object interaction, and episodic memory.
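The captioner, planner, and decider can be pictured as three functions chained in a loop. The sketch below is a minimal illustration under stated assumptions: the page does not specify how the modules exchange data, so the class, the function signatures, the action names, and the toy stand-ins for the VLM are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Agent:
    """Hypothetical captioner -> planner -> decider loop (not the authors' code)."""
    captioner: Callable[[str], str]                      # observation -> caption
    planner: Callable[[str, str, List[str]], List[str]]  # (goal, caption, memory) -> steps
    decider: Callable[[str, str], str]                   # (step, caption) -> low-level action
    memory: List[str] = field(default_factory=list)      # episodic memory of past captions

    def step(self, goal: str, observation: str) -> str:
        caption = self.captioner(observation)
        self.memory.append(caption)
        plan = self.planner(goal, caption, self.memory)
        if not plan:
            return "stop"  # nothing left to do
        return self.decider(plan[0], caption)  # act on the first pending step


# Toy stand-ins; a real system would call a VLM here.
def toy_captioner(observation):
    return f"I see {observation}"


def toy_planner(goal, caption, memory):
    if "cereal" in caption:
        return ["pick up cereal"]
    return ["explore aisle", "pick up cereal"]


def toy_decider(step, caption):
    return "grab" if step.startswith("pick up") else "move_forward"


agent = Agent(toy_captioner, toy_planner, toy_decider)
goal = "Find and pick up a box of cereal."
actions = [agent.step(goal, obs) for obs in ["an empty aisle", "a cereal box"]]
print(actions)  # ['move_forward', 'grab']
```

Keeping the three modules behind plain callables means any one of them can be swapped (for example, a different captioning model) without touching the loop.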

Agent loop architecture

Human vs. Agent Performance

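We compare human performance with an embodied agent baseline across task difficulty levels. Four metrics are reported: HAT (Human Average Time), HCR% (Human Completion Rate), AAT (Agent Average Time), and ACR% (Agent Completion Rate). In the table, ↓ marks metrics where lower is better and ↑ marks metrics where higher is better; a dash marks an entry with no reported agent result.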
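An average time and a completion rate of this kind can be computed from per-trial records as sketched below. This is an assumption-laden illustration: the record format and function name are hypothetical, the trial data are made up, and the page does not say whether average time is taken over all trials or only completed ones (this sketch averages over all trials).

```python
from statistics import mean


def summarize(trials):
    """Return (average_time, completion_rate_percent) for a list of trials.

    Each trial is a (duration, completed) pair. Hypothetical format;
    the averaging convention here is an assumption, not the authors' definition.
    """
    if not trials:
        raise ValueError("no trials to summarize")
    average_time = mean([duration for duration, _ in trials])
    completed = sum(1 for _, done in trials if done)
    completion_rate = 100.0 * completed / len(trials)
    return average_time, completion_rate


# Made-up trial records, for illustration only.
trials = [(40.0, True), (50.0, True), (60.0, False), (50.0, True)]
average_time, completion_rate = summarize(trials)
print(average_time, completion_rate)  # 50.0 75.0
```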

Difficulty   | HAT ↓ | HCR% ↑ | AAT ↓ | ACR% ↑
Easy-L1      | 47    | 88.88  | 780   | 68.63
Easy-L2      | 73    | 100.00 | 660   | 45.10
Easy-L3      | 61    | 93.33  | 420   | 33.33
Average-L1   | 158   | 87.50  | -     | -
Average-L2   | 106   | 100.00 | -     | -
Average-L3   | 84    | 100.00 | -     | -
Difficult-L1 | 76    | 100.00 | -     | -
Difficult-L2 | 136   | 100.00 | -     | -
Difficult-L3 | 113   | 100.00 | -     | -
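For the Easy rows, where both human and agent results are reported, the gap can be made concrete with a little arithmetic. The sketch below uses values transcribed from the table above; the variable names are illustrative.

```python
# (HAT, HCR%, AAT, ACR%) for the Easy rows, transcribed from the table above.
easy = {
    "Easy-L1": (47, 88.88, 780, 68.63),
    "Easy-L2": (73, 100.00, 660, 45.10),
    "Easy-L3": (61, 93.33, 420, 33.33),
}

# How many times longer the agent takes than the human average.
slowdown = {name: aat / hat for name, (hat, _, aat, _) in easy.items()}
# Human completion rate minus agent completion rate, in percentage points.
gap = {name: hcr - acr for name, (_, hcr, _, acr) in easy.items()}

for name in easy:
    print(f"{name}: agent time is {slowdown[name]:.1f}x human, "
          f"completion gap {gap[name]:.2f} points")
# Easy-L1: agent time is 16.6x human, completion gap 20.25 points
# Easy-L2: agent time is 9.0x human, completion gap 54.90 points
# Easy-L3: agent time is 6.9x human, completion gap 60.00 points
```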

BibTeX

@inproceedings{gajo2025sari,
  author    = {Gajo, Janika Deborah and Merales, Gerarld Paul and Escarcha, Jerome and Molina, Brenden Ashley and Nartea, Gian and Maminta, Emmanuel and Roldan, Juan Carlos and Atienza, Rowel},
  title     = {Sari Sandbox: A Virtual Retail Store Environment for Embodied AI Agents},
  booktitle = {ICCV 2025 Workshop RetailVision},
  year      = {2025},
}