Skip to content

Visual Language Model (VLM) Reasoning

The VLM stack integrates three complementary vision-language capabilities:

  • Open-vocabulary object detection with 3D spatial grounding.
  • Persistent semantic mapping from image detections and LiDAR geometry.
  • Binary visual question-answering with reasoning.

Together, these modules provide object-level detections, a semantic voxel map, clustered semantic objects, and high-level scene reasoning from robot camera and LiDAR data.

Source Code

System Overview

The VLM stack runs alongside the core perception and planning modules. It consumes camera images, LiDAR point clouds, calibration, and TF, then publishes semantic annotations for downstream autonomy and user-facing visualization.

graph TB

    FC["Front Camera"]
    SLAM["Multi-modal SLAM"]
    UI["User Interface"]

    subgraph Detection["Detection VLM"]
        Detector["YOLOE / OpenAI detector"]
        GeometryMap["Accumulated voxel geometry"]
        Projection3D["2D-to-3D projection and association"]
    end

    subgraph SemanticMap["Semantic Map"]
        SemanticUpdater["Semantic Voxel Grid"]
        ObjectClusters["Semantic object clustering"]
    end

    QAVLM["Q&A VLM"]

    Dets2D["2D Detection Image"]
    Dets3D["3D Bounding Boxes"]
    SemanticCloud["Colored Semantic Cloud"]
    SemanticObjects["Semantic Object Markers"]
    ReasonOutput["Reasoning Image"]

    FC -->|"Images"| Detector
    FC -->|"Images"| QAVLM
    SLAM -->|"LiDAR cloud + TF"| GeometryMap
    SLAM -->|"TF"| Projection3D
    SLAM -->|"TF"| SemanticUpdater
    UI -->|"Detection prompt"| Detector
    UI -->|"Question prompt"| QAVLM

    GeometryMap --> Projection3D
    Detector -->|"2D boxes / masks"| Projection3D
    Detector -->|"2D boxes / masks"| SemanticUpdater
    GeometryMap --> SemanticUpdater
    SemanticUpdater --> ObjectClusters

    Detector --> Dets2D
    Projection3D --> Dets3D
    SemanticUpdater --> SemanticCloud
    ObjectClusters --> SemanticObjects
    QAVLM --> ReasonOutput

    classDef input fill:#e0e7ff30,stroke:#6366f1,stroke-width:2px,color:#000
    classDef backend fill:#d1fae530,stroke:#10b981,stroke-width:2px,color:#000
    classDef output fill:#cffafe30,stroke:#06b6d4,stroke-width:2px,color:#000

    class SLAM,FC,UI input
    class Detector,GeometryMap,Projection3D,SemanticUpdater,ObjectClusters,QAVLM backend
    class Dets2D,Dets3D,SemanticCloud,SemanticObjects,ReasonOutput output

Detection VLM

Object detection is performed using either an open-vocabulary detector, currently YOLOE, or an OpenAI VLM detector initialized with a text prompt. The detector operates on the front camera image and produces 2D bounding boxes and, for segmentation-capable YOLOE models, masks.

In parallel, a downsampled voxel map is accumulated from LiDAR points. The node projects map points into the camera frame, associates projected points with each 2D detection, and publishes 3D object bounding volumes. Association can use DBSCAN clustering or a front-surface selection method.

Semantic Map

The semantic map node uses the same camera, LiDAR, camera info, and TF inputs as the detection node, but accumulates semantic evidence over time. Each image detection updates per-voxel semantic scores in the map. The node publishes:

  • a colored semantic point cloud, where each active voxel is colored by its most likely label;
  • clustered semantic objects as RViz markers with boxes, connectors, nodes, and labels;
  • the annotated 2D detections image used for map updates.

Semantic scores can be updated using additive counters or Bayesian updates. Active-window parameters limit projection work to a local region around the robot for runtime control.

Q&A VLM

For high-level semantic assessment, an OpenAI VLM processes the front-camera image together with a binary question. Example questions include whether an exit is blocked or whether an area appears safe. The node publishes a color-coded overlay image with the answer probability and a short textual explanation.


Launch Files

ROS 1 Noetic

Module Launch file Default config
Detection VLM detection_vlm/detection_vlm_ros/launch/detection_vlm.launch detection_vlm/detection_vlm_ros/config/detection_yoloe.yaml
Semantic Map detection_vlm/detection_vlm_ros/launch/semantic_map.launch detection_vlm/detection_vlm_ros/config/semantic_mapping_yoloe.yaml
Semantic Map, site-specific NENE setup detection_vlm/detection_vlm_ros/launch/semantic_map_nene.launch detection_vlm/detection_vlm_ros/config/semantic_mapping_yoloe.yaml
Q&A VLM detection_vlm/detection_vlm_ros/launch/reasoning_vlm.launch detection_vlm/detection_vlm_ros/config/reasoning_vlm.yaml

ROS 2 Humble

Module Launch file Default config
Detection VLM detection_vlm/detection_vlm_ros/launch/detection_vlm.launch.yaml detection_vlm/detection_vlm_ros/config/detection_yoloe.yaml
Semantic Map detection_vlm/detection_vlm_ros/launch/semantic_map.launch.yaml detection_vlm/detection_vlm_ros/config/semantic_mapping_yoloe.yaml
Q&A VLM detection_vlm/detection_vlm_ros/launch/reasoning_vlm.launch.yaml detection_vlm/detection_vlm_ros/config/reasoning_vlm.yaml

OpenAI-backed modes require:

export OPENAI_API_KEY=<Your OpenAI API key>

Topics & Interfaces

Topics are remapped in the launch files. The default namespaces are /detection_vlm, /semantic_map, and /reasoning_vlm.

Detection VLM

Input Topics

Topic ROS 1 type ROS 2 type Description
/detection_vlm/input_image sensor_msgs/Image or sensor_msgs/CompressedImage sensor_msgs/msg/Image or sensor_msgs/msg/CompressedImage Front camera image
/detection_vlm/input_pointcloud sensor_msgs/PointCloud2 sensor_msgs/msg/PointCloud2 LiDAR point cloud
/detection_vlm/input_camera_info sensor_msgs/CameraInfo sensor_msgs/msg/CameraInfo Camera intrinsics and distortion

Output Topics

Topic ROS 1 type ROS 2 type Description
/detection_vlm/detections_image sensor_msgs/Image sensor_msgs/msg/Image 2D detections overlaid on the input image
/detection_vlm/accumulated_pointcloud sensor_msgs/PointCloud2 sensor_msgs/msg/PointCloud2 Accumulated voxel geometry map
/detection_vlm/detected_bboxes_3d vision_msgs/BoundingBox3DArray vision_msgs/msg/BoundingBox3DArray 3D object bounding boxes
/detection_vlm/visualization_3d_bboxes visualization_msgs/MarkerArray visualization_msgs/msg/MarkerArray RViz markers for 3D boxes and labels
/detection_vlm/debug_clusters_pointcloud sensor_msgs/PointCloud2 sensor_msgs/msg/PointCloud2 Optional colored point clusters used for 3D association

Service

Service ROS 1 type ROS 2 type Description
/detection_vlm/set_prompt detection_vlm_msgs/SetPrompt detection_vlm_msgs/srv/SetPrompt Update the detection prompt at runtime

Semantic Map

Input Topics

Topic ROS 1 type ROS 2 type Description
/semantic_map/input_image sensor_msgs/Image or sensor_msgs/CompressedImage sensor_msgs/msg/Image or sensor_msgs/msg/CompressedImage Front camera image
/semantic_map/input_pointcloud sensor_msgs/PointCloud2 sensor_msgs/msg/PointCloud2 LiDAR point cloud used to build map geometry
/semantic_map/input_camera_info sensor_msgs/CameraInfo sensor_msgs/msg/CameraInfo Camera intrinsics and distortion

Output Topics

Topic ROS 1 type ROS 2 type Description
/semantic_map/detections_image sensor_msgs/Image sensor_msgs/msg/Image 2D detections used to update the semantic map
/semantic_map/semantic_map sensor_msgs/PointCloud2 sensor_msgs/msg/PointCloud2 Colored semantic voxel cloud, latched/transient-local
/semantic_map/semantic_objects visualization_msgs/MarkerArray visualization_msgs/msg/MarkerArray Clustered semantic object markers, latched/transient-local

Service

Service ROS 1 type ROS 2 type Description
/semantic_map/set_prompt detection_vlm_msgs/SetPrompt detection_vlm_msgs/srv/SetPrompt Update the semantic detection prompt at runtime

Q&A VLM

Input Topics

Topic ROS 1 type ROS 2 type Description
/reasoning_vlm/input_image sensor_msgs/Image or sensor_msgs/CompressedImage sensor_msgs/msg/Image or sensor_msgs/msg/CompressedImage Front camera image

Output Topics

Topic ROS 1 type ROS 2 type Description
/reasoning_vlm/output_image sensor_msgs/Image - Color-coded answer/confidence image in ROS 1
/reasoning_vlm/reasoning_image - sensor_msgs/msg/Image Color-coded answer probability image in ROS 2

Service

Service ROS 1 type ROS 2 type Description
/reasoning_vlm/set_prompt detection_vlm_msgs/SetPrompt detection_vlm_msgs/srv/SetPrompt Update the yes/no question at runtime

TF Frames

Parameter Description
target_frame World or navigation frame where the geometry and semantic maps are accumulated
camera_frame Camera optical frame used for projecting map points into the image
body_frame Robot body or IMU frame used for range filtering and active-window localization

ROS 1 expects TF to provide the point cloud frame to body_frame, the point cloud frame to target_frame, and target_frame to camera_frame.

ROS 2 keeps the LiDAR-to-body extrinsic in YAML as lidar_to_body_transform, then queries TF for body_frame to target_frame and target_frame to camera_frame.


Configuration

Common Model Parameters

Parameter Description
vlm/type Detector or reasoner backend. Detection supports yoloe and openai; reasoning uses openai.
vlm/model YOLOE model name or local weight path.
vlm/confidence_threshold YOLOE detection threshold.
vlm/output_size YOLOE inference image size.
vlm/cuda Whether YOLOE inference should use CUDA.
vlm/client_config/model OpenAI model name.
vlm/system_prompt_path Path to the system prompt loaded by the launch file.
prompt Detection prompt or yes/no reasoning question.
classes_file Optional class list for YOLOE models that support setting custom classes.

Image Worker

Parameter Description
worker/min_separation_s Minimum time between processed images.
worker/queue_size Image worker queue size.
compressed_image If True, subscribe to compressed images.

Detection VLM

Parameters are set in:

  • ROS 1 YOLOE: detection_vlm/detection_vlm_ros/config/detection_yoloe.yaml
  • ROS 1 OpenAI: detection_vlm/detection_vlm_ros/config/detection_vlm.yaml
  • ROS 2 YOLOE: detection_vlm/detection_vlm_ros/config/detection_yoloe.yaml
  • ROS 2 OpenAI: detection_vlm/detection_vlm_ros/config/detection_vlm.yaml
Parameter Description
voxel_size Geometry map voxel size in meters.
min_points_per_cluster Minimum number of projected points required for an association or cluster.
eps_dbscan DBSCAN epsilon in meters.
association_method dbscan or front_surface.
front_surface_cell_size_px Image grid size used by front-surface association.
front_surface_depth_band Absolute depth band for front-surface selection.
front_surface_depth_band_scale Relative depth band scale for front-surface selection.
front_surface_depth_percentile Percentile used to estimate front depth per image cell.
front_surface_outlier_std_ratio Statistical outlier threshold for front-surface points.
min_point_r Filter out LiDAR points closer than this distance to body_frame.
max_point_r Filter out LiDAR points farther than this distance to body_frame.
use_masks_for_projection Use segmentation masks when available; otherwise use 2D boxes.
publish_debug_clusters Publish colored association clusters for debugging.
keep_boxes Keep old RViz box markers instead of expiring them.
use_tf_current_time If False, query TF at the message stamp. If True, query latest TF.
verbose Print timings and additional information.

ROS 2 also exposes confidence_threshold as a runtime node parameter.

Semantic Map

Parameters are set in:

  • ROS 1: detection_vlm/detection_vlm_ros/config/semantic_mapping_yoloe.yaml
  • ROS 2: detection_vlm/detection_vlm_ros/config/semantic_mapping_yoloe.yaml
Parameter Description
voxel_size Semantic and geometry map voxel size in meters.
association_method dbscan, front_surface, or none.
semantic_update_method counter for additive voting or bayes for Bayesian updates.
default_detection_confidence Confidence used when a detector output has no confidence.
semantic_publish_min_observations Minimum observations before a voxel can be published.
semantic_publish_min_confidence Minimum winning-label confidence before a voxel can be published.
semantic_publish_period_s Semantic cloud publish period.
publish_semantic_objects Enable clustered semantic object markers.
semantic_objects_period_s Semantic object marker publish period.
semantic_cluster_eps Euclidean clustering radius for semantic objects.
semantic_cluster_min_points Minimum points for semantic object clustering.
semantic_object_min_voxels Minimum voxels in a semantic object.
semantic_object_min_confidence Minimum mean object confidence used by ROS 2 semantic object filtering.
semantic_object_node_height Height of semantic object label nodes in RViz.
semantic_object_node_radius Radius of semantic object label nodes.
semantic_object_text_offset Vertical text offset above object label nodes.
show_confidence Include confidence in semantic object label text.
active_window_enabled Enable local-window updates around body_frame.
active_window_radius_xy Active-window radius in the horizontal plane.
active_window_min_z Minimum active-window height relative to body_frame.
active_window_max_z Maximum active-window height relative to body_frame.
active_window_apply_to_updates Limit semantic updates to the active window.
active_window_apply_to_visualization Limit published semantic cloud to the active window.
semantic_visualization_voxel_size Extra downsampling size for the published semantic cloud.
semantic_visualization_max_points Maximum number of semantic cloud points to publish.
semantic_object_visualization_voxel_size Optional downsampling size before object clustering in ROS 2.
max_active_update_points Maximum number of map points projected into each image update.
runtime_logging_enabled Enable runtime CSV logging.
runtime_log_file Runtime CSV path.

Camera Calibration

Parameter Description
camera_intrinsics/fx X focal length.
camera_intrinsics/fy Y focal length.
camera_intrinsics/cx X principal point.
camera_intrinsics/cy Y principal point.
distortion_coeffs/k1 Radial distortion coefficient.
distortion_coeffs/k2 Radial distortion coefficient.
distortion_coeffs/p1 Tangential distortion coefficient.
distortion_coeffs/p2 Tangential distortion coefficient.

Camera info messages override these values after the first received message.

ROS 2 LiDAR Extrinsics

The ROS 2 branch keeps LiDAR-to-body calibration in the config files:

Parameter Description
lidar_to_body_transform/translation/x LiDAR translation in body_frame.
lidar_to_body_transform/translation/y LiDAR translation in body_frame.
lidar_to_body_transform/translation/z LiDAR translation in body_frame.
lidar_to_body_transform/quaternion/x LiDAR orientation quaternion.
lidar_to_body_transform/quaternion/y LiDAR orientation quaternion.
lidar_to_body_transform/quaternion/z LiDAR orientation quaternion.
lidar_to_body_transform/quaternion/w LiDAR orientation quaternion.

Q&A VLM

Parameters are set in:

  • ROS 1: detection_vlm/detection_vlm_ros/config/reasoning_vlm.yaml
  • ROS 2: detection_vlm/detection_vlm_ros/config/reasoning_vlm.yaml
Parameter Description
overlay Enable or disable the colored answer overlay.
overlay_alpha Overlay opacity.
footer_height Base footer height for text rendering.
runtime_logging_enabled Enable runtime CSV logging.
runtime_log_file Runtime CSV path.
verbose Print timings and additional information.