
Visual Language Model (VLM) Reasoning

The system currently integrates two complementary vision-language model (VLM) modalities:

  • Open-vocabulary object detection with 3D spatial grounding.
  • Binary visual question-answering (Yes/No) with reasoning.

Together, these capabilities enable semantic scene understanding and reasoning from visual data.

Source Code

System Overview

The VLM stack is designed to run alongside the core perception and planning modules, providing semantic annotations and reasoning.

The following diagram shows the entities in the VLM system and the data exchanged within and outside the system.

```mermaid
graph TB

    FC["Front Camera"]
    SLAM["Multi-modal SLAM"]

    QAVLM["Q&A VLM"]

    subgraph Detection VLM
        VoxelGrid["Voxel Grid Map"]
        VLM["VLM"]
        3D["3D projection & clustering"]
    end

    UI["User Interface"]

    2DDets["2D Detections"]
    3DDets["3D Detections"]
    ReasonOutput["Reasoning Output"]

    SLAM -->|"LiDAR Cloud"| VoxelGrid
    SLAM -->|"Odometry"| VoxelGrid
    SLAM -->|"Odometry"| 3D
    FC -->|"Images"| VLM
    FC -->|"Images"| Q&AVLM
    VoxelGrid -->|"Map"| 3D
    VLM -->|"2D detections"| 3D
    UI -->|"Question (prompt)"| Q&AVLM
    VLM --> 2DDets
    3D --> 3DDets
    QAVLM -->|"Yes/No + reasoning"| ReasonOutput


    classDef input fill:#e0e7ff30,stroke:#6366f1,stroke-width:2px,color:#000
    classDef frontend fill:#fef3c730,stroke:#eab308,stroke-width:2px,color:#000
    classDef backend fill:#d1fae530,stroke:#10b981,stroke-width:2px,color:#000
    classDef output fill:#cffafe30,stroke:#06b6d4,stroke-width:2px,color:#000

    class SLAM,FC,UI input
    class VLM,VoxelGrid,QAVLM,3D backend
    class 2DDets,3DDets,ReasonOutput output
```

Detection VLM

Object detection is performed using either an open-vocabulary object detector (YOLOe) or a VLM-based detector (GPT-4V via API call), initialized with a set of labels or a description of the objects to detect. These models operate on the front-camera image and produce 2D bounding boxes. In parallel, a downsampled voxel grid map is maintained from the LiDAR point cloud and the SLAM odometry estimate. LiDAR points are projected into the camera frame using the current pose estimate and the camera projection matrix, and the points that fall within each 2D detection (or mask) are clustered. This produces aligned 2D detections and corresponding 3D bounding volumes.
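
The projection-and-clustering step can be summarized with the sketch below. It assumes a pinhole camera model and DBSCAN clustering (matching the eps_dbscan and min_points_per_cluster parameters described later); the function and variable names are illustrative, not the module's actual API:

```python
import numpy as np
from sklearn.cluster import DBSCAN  # assumed clustering backend

def points_in_detection(points_world, T_cam_world, K, bbox_2d,
                        eps=0.3, min_points=10):
    """Project world-frame LiDAR points into the image and cluster the
    ones that fall inside a 2D detection box (illustrative sketch).

    points_world: (N, 3) points in the target/world frame
    T_cam_world:  (4, 4) world -> camera transform from SLAM odometry + extrinsics
    K:            (3, 3) camera intrinsics (fx, fy, cx, cy)
    bbox_2d:      (u_min, v_min, u_max, v_max) pixel box from the detector
    """
    # Transform points into the camera frame.
    pts_h = np.hstack([points_world, np.ones((len(points_world), 1))])
    pts_cam = (T_cam_world @ pts_h.T).T[:, :3]

    # Keep only points in front of the camera.
    pts_cam = pts_cam[pts_cam[:, 2] > 0.0]

    # Pinhole projection into pixel coordinates.
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]

    # Select points whose projection falls inside the 2D box.
    u_min, v_min, u_max, v_max = bbox_2d
    in_box = ((uv[:, 0] >= u_min) & (uv[:, 0] <= u_max) &
              (uv[:, 1] >= v_min) & (uv[:, 1] <= v_max))
    candidates = pts_cam[in_box]
    if len(candidates) < min_points:
        return None

    # DBSCAN rejects background points caught inside the box;
    # keep the largest cluster as the object's 3D extent.
    labels = DBSCAN(eps=eps, min_samples=min_points).fit_predict(candidates)
    valid = labels[labels >= 0]
    if len(valid) == 0:
        return None
    cluster = candidates[labels == np.bincount(valid).argmax()]

    # Axis-aligned 3D bounding volume of the selected cluster.
    return cluster.min(axis=0), cluster.max(axis=0)
```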

Q&A VLM

For high-level semantic assessment, a VLM (GPT-4V via API call) processes the front-camera image together with a binary Yes/No question, for example a query about safety- or navigation-related properties of the scene (e.g., "Is the exit of this environment blocked?"). The model returns the binary answer, a color-coded confidence overlay on the input image, and a brief explanation of its reasoning.
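
As a rough sketch, such a query can be issued with the OpenAI Python client as follows. The prompt wrapping and model name are illustrative, and the confidence estimation and overlay rendering performed by the node are not shown:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_yes_no(image_path: str, question: str, model: str = "gpt-4o"):
    """Send a camera image and a Yes/No question to the VLM and
    return its answer plus a short justification (illustrative sketch)."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"{question} Answer 'Yes' or 'No' first, "
                         "then give a one-sentence explanation."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Example: a navigation-related query on the latest camera frame.
print(ask_yes_no("front_camera.jpg", "Is the exit of this environment blocked?"))
```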


Topics & Interfaces

Topics are remapped in the launch file. Note that all topics are in the /detection_vlm or /reasoning_vlm namespace.

Input Topics

Detection VLM

| Topic | Type | Description |
|---|---|---|
| input_image | sensor_msgs/Image or sensor_msgs/CompressedImage | Input image |
| input_pointcloud | sensor_msgs/PointCloud2 | Input point cloud |

Q&A VLM

| Topic | Type | Description |
|---|---|---|
| input_image | sensor_msgs/Image or sensor_msgs/CompressedImage | Input image |

Input Frames (for TF querying)

Detection VLM

| Frame name | Description |
|---|---|
| target_frame | Target world frame |
| camera_frame | Camera frame |
| body_frame | Body/IMU frame |

Output Topics

Detection VLM

| Topic | Type | Description |
|---|---|---|
| detections_image | sensor_msgs/Image | 2D detections overlaid on the input image |
| accumulated_pointcloud | sensor_msgs/PointCloud2 | Accumulated voxel grid map |
| detected_bboxes_3d | vision_msgs/BoundingBox3DArray | 3D detections |
| visualization_3d_bboxes | visualization_msgs/MarkerArray | 3D detections for visualization |
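
A minimal consumer of the 3D detection output might look like the following rclpy sketch, assuming a ROS 2 setup and the default /detection_vlm namespace mentioned above:

```python
import rclpy
from rclpy.node import Node
from vision_msgs.msg import BoundingBox3DArray

class DetectionListener(Node):
    """Log the center and size of each 3D detection (illustrative sketch)."""

    def __init__(self):
        super().__init__("detection_listener")
        self.create_subscription(
            BoundingBox3DArray,
            "/detection_vlm/detected_bboxes_3d",  # remap to match your launch file
            self.on_detections,
            10)

    def on_detections(self, msg: BoundingBox3DArray):
        for box in msg.boxes:
            c, s = box.center.position, box.size
            self.get_logger().info(
                f"object at ({c.x:.2f}, {c.y:.2f}, {c.z:.2f}) "
                f"size ({s.x:.2f} x {s.y:.2f} x {s.z:.2f}) m")

def main():
    rclpy.init()
    rclpy.spin(DetectionListener())

if __name__ == "__main__":
    main()
```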

Q&A VLM

| Topic | Type | Description |
|---|---|---|
| output_image | sensor_msgs/Image | Color-coded confidence overlay on the input image with a brief explanation of the VLM reasoning |

Services (Input)

Q&A VLM

| Service | Type | Description |
|---|---|---|
| set_prompt | detection_vlm_msgs/SetPrompt | Prompt used by the VLM; it should be a Yes/No question |
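
The question can be updated at runtime through this service. A hedged rclpy sketch, assuming a ROS 2 setup, the default /reasoning_vlm namespace, and that the SetPrompt request carries a single string field named prompt (check the actual .srv definition):

```python
import rclpy
from rclpy.node import Node
from detection_vlm_msgs.srv import SetPrompt  # service type from the table above

def set_question(question: str):
    """Send a new Yes/No question to the Q&A VLM (illustrative sketch)."""
    rclpy.init()
    node = Node("prompt_setter")
    client = node.create_client(SetPrompt, "/reasoning_vlm/set_prompt")
    client.wait_for_service()
    request = SetPrompt.Request()
    request.prompt = question  # field name assumed, not confirmed by the docs
    future = client.call_async(request)
    rclpy.spin_until_future_complete(node, future)
    result = future.result()
    node.destroy_node()
    rclpy.shutdown()
    return result

if __name__ == "__main__":
    set_question("Is the exit of this environment blocked?")
```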

Configuration

Detection VLM

Parameters are set in:

  • YOLOe: detection_vlm/detection_vlm_ros/config/detection_yoloe.yaml
  • GPT-4V: detection_vlm/detection_vlm_ros/config/detection_vlm.yaml

Topics and frame remappings are set in:

  • detection_vlm/detection_vlm_ros/launch/detection_vlm.launch.yaml

VLM

| Parameter | Description |
|---|---|
| vlm/type | Which VLM to use: yoloe or openai |
| vlm/model | (YOLOe) Model to use (see here) |
| vlm/confidence_threshold | (YOLOe) Detection confidence threshold |
| vlm/verbose | (YOLOe) If True, prints inference statistics |
| vlm/cuda | (YOLOe) Whether to use CUDA |
| vlm/client_config/model | (OpenAI) Model to use |
| prompt | (OpenAI) Detection prompt. Specify/describe the objects to detect |

Image worker

| Parameter | Description |
|---|---|
| worker/min_separation_s | Process at most one image every min_separation_s seconds |
| compressed_image | Whether the input image topic is compressed |

Voxel map and clustering

| Parameter | Description |
|---|---|
| voxel_size | Map voxel size (m) |
| min_points_per_cluster | Minimum number of points a cluster must have |
| eps_dbscan | DBSCAN epsilon parameter (m) |
| min_point_r | Filter out points closer to the body frame than this value (m) |
| max_point_r | Filter out points farther from the body frame than this value (m) |

Camera intrinsics

| Parameter | Description |
|---|---|
| camera_intrinsics/fx | Focal length in x (pixels) |
| camera_intrinsics/fy | Focal length in y (pixels) |
| camera_intrinsics/cx | Principal point x (pixels) |
| camera_intrinsics/cy | Principal point y (pixels) |
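
For reference, these are the standard pinhole-camera intrinsics: a point (X, Y, Z) expressed in the camera frame projects to pixel coordinates

$$ u = f_x \frac{X}{Z} + c_x, \qquad v = f_y \frac{Y}{Z} + c_y $$

which is the projection used when associating LiDAR points with 2D detections.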

Others

| Parameter | Description |
|---|---|
| use_tf_current_time | False: use the message stamp to query TF. True: use the current time to query TF. |
| use_masks_for_projection | Whether to use masks when projecting to 3D; otherwise bounding boxes are used |
| verbose | Whether to print timings and info |

Q&A VLM

Parameters are set in detection_vlm/detection_vlm_ros/config/reasoning_vlm.yaml.

Topics and frame remappings are set in detection_vlm/detection_vlm_ros/launch/reasoning_vlm.launch.yaml.

VLM

| Parameter | Description |
|---|---|
| vlm/type | Which VLM to use: only openai is supported for now |
| vlm/client_config/model | (OpenAI) Model to use |
| prompt | (OpenAI) Default question prompt; it should be a Yes/No question (can be updated via the set_prompt service) |

Image worker

| Parameter | Description |
|---|---|
| worker/min_separation_s | Process at most one image every min_separation_s seconds |
| compressed_image | Whether the input image topic is compressed |

Visualization

| Parameter | Description |
|---|---|
| overlay_alpha | Alpha used for the color-coded overlay |
| footer_height | Footer height for the reasoning text on the output image |

Others

| Parameter | Description |
|---|---|
| verbose | Whether to print timings and info |