Relationship-Aware Hierarchical 3D Scene Graph

Autonomous Robots Lab, Norwegian University of Science and Technology (NTNU)
IEEE ICRA 2026

ReasoningGraph incrementally builds an open-vocabulary, relationship-aware hierarchical scene graph of the environment during autonomous exploration. Leveraging open-vocabulary and object-relational embeddings, ReasoningGraph identifies task-relevant objects and reasons about their interactions. In this example, it identifies all the objects (chairs, a table, and a trash can) that are blocking the exits.

Abstract

Representing and understanding 3D environments in a structured manner is crucial for autonomous agents to navigate and reason about their surroundings. While traditional Simultaneous Localization and Mapping (SLAM) methods generate metric reconstructions and can be extended to metric-semantic mapping, they lack a higher level of abstraction and relational reasoning. To address this gap, 3D scene graphs have emerged as a powerful representation for capturing hierarchical structures and object relationships. In this work, we propose an enhanced hierarchical 3D scene graph that integrates open-vocabulary features across multiple abstraction levels and supports object-relational reasoning. Our approach leverages a Vision Language Model (VLM) to infer semantic relationships. Notably, we introduce a task reasoning module that combines Large Language Models (LLMs) and a VLM to interpret the scene graph’s semantic and relational information, enabling agents to reason about tasks and interact with their environment more intelligently. We validate our method by deploying it on a quadruped robot in multiple environments and tasks, highlighting its ability to reason about both the tasks and its surroundings.
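To make the representation concrete, the following is a minimal data-structure sketch of such a graph, assuming that every node stores an open-vocabulary (CLIP) embedding, that nodes are organized into hierarchy layers, and that VLM-inferred relations are kept as labeled edges between object nodes. The class names, field names, and layer labels are illustrative, not the actual implementation.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class SceneGraphNode:
    node_id: int
    layer: str                   # hierarchy level, e.g. "object", "place", "room" (illustrative labels)
    centroid: np.ndarray         # 3D position of the node in the map frame
    clip_embedding: np.ndarray   # open-vocabulary feature vector (e.g. a CLIP embedding)
    label: str = ""              # open-vocabulary detector label, if available

@dataclass
class RelationEdge:
    source: int                  # id of the first object node
    target: int                  # id of the second object node
    relation: str                # semantic relation inferred by a VLM, e.g. "on top of"

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)       # node_id -> SceneGraphNode
    relations: list = field(default_factory=list)   # list of RelationEdge

    def add_node(self, node: SceneGraphNode) -> None:
        self.nodes[node.node_id] = node

    def add_relation(self, edge: RelationEdge) -> None:
        self.relations.append(edge)
```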

ReasoningGraph Overview

a) ReasoningGraph incrementally builds a hierarchical 3D scene graph (c) from RGB-D frames and poses, using an open-vocabulary detector and CLIP embeddings for object representation. Object relations are derived from a VLM visual encoder, while Hydra reconstructs the semantic mesh (L1), clusters objects (L2), and detects places and rooms (L3, L4). Open-vocabulary features and relations are then assigned to the graph. b) The task reasoning module leverages two LLMs and a VLM. Given a task, the LLM identifies relevant objects and formulates the subtasks that require evaluation. These subtasks are evaluated for feasibility by the VLM, with CLIP similarity used for object retrieval.
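As a rough illustration of the object-retrieval step in b), the sketch below ranks object nodes of the graph structure outlined earlier by cosine similarity between their stored embeddings and a CLIP text embedding of the query. The `retrieve_objects` name, the "a photo of a ..." prompt template, the ViT-B/32 checkpoint, and the `top_k` cutoff are all assumptions made for illustration; the exact model and prompting used in the paper may differ.

```python
import clip          # OpenAI CLIP (https://github.com/openai/CLIP)
import numpy as np
import torch

# Public CLIP checkpoint, used here purely for illustration.
model, _ = clip.load("ViT-B/32", device="cpu")

def retrieve_objects(graph, query: str, top_k: int = 3):
    """Rank object nodes by cosine similarity between the text query and node embeddings."""
    with torch.no_grad():
        tokens = clip.tokenize([f"a photo of a {query}"])          # prompt template is an assumption
        text_feat = model.encode_text(tokens).squeeze(0).numpy()
    text_feat = text_feat / np.linalg.norm(text_feat)

    object_nodes = [n for n in graph.nodes.values() if n.layer == "object"]
    scores = []
    for node in object_nodes:
        emb = node.clip_embedding / np.linalg.norm(node.clip_embedding)
        scores.append(float(emb @ text_feat))                      # cosine similarity
    ranked = sorted(zip(object_nodes, scores), key=lambda p: p[1], reverse=True)
    return ranked[:top_k]
```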

Results

We design a series of evaluation tasks that require identifying objects within the graph and, in some cases, reasoning about their relations.

Object search task

After the scene graph is constructed during autonomous exploration, we ask ReasoningGraph to find a backpack, a fan, plants, and trash cans. Our method successfully locates all of these objects.
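Using the retrieval sketch from the overview, such a search amounts to ranking node embeddings for each target phrase. The snippet below is only a hypothetical usage example on a previously built `graph`.

```python
# Illustrative usage of the retrieve_objects sketch on the object-search queries.
for query in ["backpack", "fan", "plant", "trash can"]:
    for node, score in retrieve_objects(graph, query, top_k=1):
        print(f"{query}: node {node.node_id} ({node.label}) at {node.centroid}, score {score:.3f}")
```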

Trash disposal task

The goal is to identify filled trash bags near trash cans and determine if they can be thrown away.

Prepare bedroom task

The goal is to verify whether the pillows and blankets are placed appropriately on the bed.

The scene graph is built during exploration, after which the LLM reasons about the task and the VLM evaluates subtasks using object relations. In both of these experiments, a 100% success rate (SR) is achieved. We present one VLM reasoning example per task, although in practice the VLM reasons about each subtask.
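The control flow of this reasoning step could look roughly like the sketch below, where `propose_subtasks` stands in for the LLM that decomposes the task and `evaluate_subtask` for the VLM feasibility check. Both callables, their assumed dictionary format, and the function name `reason_about_task` are hypothetical, not the paper's actual interface.

```python
from typing import Callable

def reason_about_task(graph,
                      task: str,
                      propose_subtasks: Callable[[str], list],
                      evaluate_subtask: Callable[[dict, list, list], bool]) -> list:
    """Illustrative control flow only; prompts and data formats are assumptions."""
    results = []
    # 1) The LLM identifies task-relevant objects and formulates subtasks to evaluate,
    #    e.g. {"object": "trash bag", "description": "check if the bag can be thrown away"}.
    for subtask in propose_subtasks(task):
        # 2) Retrieve candidate object nodes via CLIP similarity (see the retrieval sketch above).
        candidates = retrieve_objects(graph, subtask["object"], top_k=3)
        candidate_ids = {node.node_id for node, _ in candidates}
        # 3) Collect the relation edges touching the candidates as context for the VLM.
        relations = [r for r in graph.relations
                     if r.source in candidate_ids or r.target in candidate_ids]
        # 4) The VLM judges whether the subtask is feasible given the objects and relations.
        feasible = evaluate_subtask(subtask, [node for node, _ in candidates], relations)
        results.append({"subtask": subtask, "feasible": feasible})
    return results
```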

Video

BibTeX


@inproceedings{puigjaner2026reasoninggraph,
    title={Relationship-Aware Hierarchical 3D Scene Graph},
    author={Gassol Puigjaner, Albert and Zacharia, Angelos and Alexis, Kostas},
    booktitle={2026 IEEE International Conference on Robotics and Automation (ICRA)}, 
    year={2026}
}