Semantic Object-Goal Navigation on a Quadruped Robot in Known Environments
Built on Boston Dynamics Spot with a simple map-then-act workflow: first record a clean 2D map plus trusted object instances, then come back and navigate to objects by name.
Stage 1: Pre-run mapping
A short teleop pass builds a 2D occupancy map and records only confirmed object instances from RGB-D detections into a small semantic database.
Stage 2: Object-goal navigation
The robot localizes on the saved map and navigates to a selected object with a safe standoff goal using Nav2. Goals can be selected from a CLI or by voice.
Thesis work at the Dynamic Legged Systems lab, Istituto Italiano di Tecnologia. Everything runs onboard on an Intel NUC with no discrete GPU.
System overview
Hardware and sensing
The robot carries a deliberately small sensor set: a 2D LiDAR for mapping, localization, and costmaps; a RealSense T265 for visual-inertial odometry; and a RealSense D435 for RGB-D detections. The stack is designed for repeatable runs, not a one-time demo.
ROS 2 pipeline
In mapping mode, SLAM Toolbox builds the 2D map while the semantic layer runs in parallel. In navigation mode, SLAM is switched off, the saved map is loaded, AMCL localizes, and Nav2 plans and drives Spot through the ROS 2 bridge.
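As a rough illustration of how the two modes can be wired, here is a minimal ROS 2 launch sketch. The SLAM Toolbox, map server, and AMCL package and executable names are the standard ones; the file name, map path, and mode argument are assumptions, not the actual project files.

```python
# mode_launch.py - minimal sketch of a two-mode launch file (assumed names).
# "mapping" runs SLAM Toolbox; "navigation" loads the saved map and runs AMCL.
from launch import LaunchDescription
from launch.actions import DeclareLaunchArgument
from launch.conditions import IfCondition, UnlessCondition
from launch.substitutions import LaunchConfiguration, PythonExpression
from launch_ros.actions import Node


def generate_launch_description():
    mode = LaunchConfiguration('mode')
    is_mapping = PythonExpression(["'", mode, "' == 'mapping'"])

    return LaunchDescription([
        DeclareLaunchArgument('mode', default_value='navigation',
                              description='mapping | navigation'),

        # Mapping mode: online SLAM builds the 2D occupancy grid.
        Node(package='slam_toolbox', executable='async_slam_toolbox_node',
             name='slam_toolbox', condition=IfCondition(is_mapping)),

        # Navigation mode: serve the saved map and localize against it.
        Node(package='nav2_map_server', executable='map_server',
             name='map_server', condition=UnlessCondition(is_mapping),
             parameters=[{'yaml_filename': '/data/maps/lab.yaml'}]),
        Node(package='nav2_amcl', executable='amcl',
             name='amcl', condition=UnlessCondition(is_mapping)),
    ])
```

In the real stack the Nav2 servers, the lifecycle manager, and the Spot bridge are launched alongside; the point here is only that SLAM and localization never run at the same time.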
Why two stages
Two stages keep the runtime stable and lightweight. The robot does not try to rebuild the world during navigation. Instead, it localizes on a fixed map and uses a compact object database for object goals.
What “confirmed” means
Raw detections are noisy. Objects are promoted to the database only after repeated support across time, with gating and de-duplication, so the final list stays small, stable, and usable for navigation. A sketch of this promotion logic follows below.
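The exact gating rules are specific to the thesis; the sketch below is a minimal, assumed version of the idea: detections accumulate support in a proposal buffer and are promoted to the confirmed set, with distance-based de-duplication, only after enough repeated hits. Thresholds and the promotion rule are illustrative, not the project's parameters.

```python
# semantic_memory.py - minimal sketch of confirmed-only object memory.
import math
from dataclasses import dataclass


@dataclass
class Track:
    label: str
    x: float
    y: float
    hits: int = 1


class SemanticMemory:
    def __init__(self, merge_radius=0.5, min_hits=5):
        self.merge_radius = merge_radius  # nearby same-class tracks are merged
        self.min_hits = min_hits          # support needed before promotion
        self.proposals: list[Track] = []  # unstable detections live (and die) here
        self.confirmed: list[Track] = []  # the semantic database

    def _nearest(self, tracks, label, x, y):
        best, best_d = None, self.merge_radius
        for t in tracks:
            if t.label != label:
                continue
            d = math.hypot(t.x - x, t.y - y)
            if d < best_d:
                best, best_d = t, d
        return best

    def observe(self, label: str, x: float, y: float):
        # If a confirmed instance already explains this detection,
        # drop it: that is the de-duplication step.
        if self._nearest(self.confirmed, label, x, y):
            return
        track = self._nearest(self.proposals, label, x, y)
        if track is None:
            self.proposals.append(Track(label, x, y))
            return
        # Running average keeps the estimate stable under noisy detections.
        track.x = (track.x * track.hits + x) / (track.hits + 1)
        track.y = (track.y * track.hits + y) / (track.hits + 1)
        track.hits += 1
        if track.hits >= self.min_hits:
            self.proposals.remove(track)
            self.confirmed.append(track)  # promotion: now part of the database
```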
Demo 1: Semantic mapping
This video shows the full semantic layer in the pre-run phase. The detector runs on the D435 RGB stream, depth is used to back-project detections into 3D, and TF is then used to express those points in the map frame. Instead of writing every frame into the map, the system keeps a small memory:
- Proposal memory: new observations enter here first. This is where unstable detections die out.
- Static memory: only detections that are repeatedly supported get promoted. Nearby duplicates are merged so you do not end up with five entries for the same chair.
- Recorder output: confirmed objects are periodically saved to a human-readable YAML file. That file becomes the “semantic database” used in stage 2 (see the example after this list).
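The recorder schema itself is not reproduced here; a file of this kind might look like the following (field names and values are illustrative, not the project's actual format):

```yaml
# semantic_database.yaml - hypothetical recorder output (assumed schema)
objects:
  - id: chair_0
    class: chair
    frame: map
    position: {x: 3.42, y: -1.07, z: 0.0}
    hits: 17            # how many detections supported this instance
  - id: door_0
    class: door
    frame: map
    position: {x: 6.10, y: 0.85, z: 0.0}
    hits: 9
```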
What I built
Confirmed-only semantic layer
A ROS 2 node that turns RGB-D detections into a compact set of stable object instances in the map frame, with filtering, confirmation, and de-duplication.
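The placement step behind this node can be summarized as: take a detection's pixel and depth, back-project through the camera intrinsics, and let TF2 express the point in the map frame. A minimal sketch, assuming depth aligned to RGB and standard ROS 2 TF2 APIs (the frame names and function signature are placeholders):

```python
# back_project.py - sketch of lifting a 2D detection into the map frame.
import numpy as np
import tf2_geometry_msgs  # registers PointStamped support for tf_buffer.transform
from geometry_msgs.msg import PointStamped


def pixel_to_map(u, v, depth_m, K, tf_buffer, stamp,
                 camera_frame='d435_color_optical_frame'):
    """Back-project pixel (u, v) at depth_m meters into the map frame."""
    fx, fy = K[0, 0], K[1, 1]  # K is the 3x3 camera intrinsics matrix
    cx, cy = K[0, 2], K[1, 2]
    # Pinhole model: ray through the pixel, scaled by metric depth.
    p = PointStamped()
    p.header.frame_id = camera_frame
    p.header.stamp = stamp
    p.point.x = (u - cx) * depth_m / fx
    p.point.y = (v - cy) * depth_m / fy
    p.point.z = depth_m
    # TF2 resolves camera -> map using the live transform tree.
    return tf_buffer.transform(p, 'map')
```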
Object-goal interface for Nav2
A thin interface that reads the recorded objects and converts a chosen instance into a PoseStamped goal with standoff and facing constraints, then sends it to Nav2.
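A sketch of what such an interface might look like: load the recorded objects, place the goal a fixed standoff distance short of the object along the robot-to-object line, face the object, and send the pose through the standard NavigateToPose action. The standoff value, YAML schema, and node wiring are assumptions:

```python
# object_goal.py - sketch of turning a recorded object into a Nav2 goal.
import math
import yaml
import rclpy
from rclpy.node import Node
from rclpy.action import ActionClient
from geometry_msgs.msg import PoseStamped
from nav2_msgs.action import NavigateToPose


def standoff_pose(obj_x, obj_y, robot_x, robot_y, standoff=0.8):
    """Goal a fixed distance short of the object, facing it."""
    yaw = math.atan2(obj_y - robot_y, obj_x - robot_x)  # face the object
    goal = PoseStamped()
    goal.header.frame_id = 'map'
    goal.pose.position.x = obj_x - standoff * math.cos(yaw)
    goal.pose.position.y = obj_y - standoff * math.sin(yaw)
    goal.pose.orientation.z = math.sin(yaw / 2.0)  # planar yaw as quaternion
    goal.pose.orientation.w = math.cos(yaw / 2.0)
    return goal


class ObjectGoalClient(Node):
    def __init__(self):
        super().__init__('object_goal_client')
        self.nav = ActionClient(self, NavigateToPose, 'navigate_to_pose')

    def go_to(self, name, db_path, robot_x, robot_y):
        with open(db_path) as f:
            objects = {o['id']: o for o in yaml.safe_load(f)['objects']}
        obj = objects[name]['position']
        goal = NavigateToPose.Goal()
        goal.pose = standoff_pose(obj['x'], obj['y'], robot_x, robot_y)
        self.nav.wait_for_server()
        return self.nav.send_goal_async(goal)


if __name__ == '__main__':
    rclpy.init()
    node = ObjectGoalClient()
    future = node.go_to('chair_0', 'semantic_database.yaml', 0.0, 0.0)
    rclpy.spin_until_future_complete(node, future)  # wait for goal acceptance
```

A CLI or voice front end then only has to map a spoken or typed name to an object id; everything after that is ordinary Nav2.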
End-to-end Spot integration
Deployment on Spot with onboard compute, launch and configuration files, RViz views, rosbag logging, and a small supervision layer for repeated runs.
Real world evaluation
Tests in multiple indoor environments, including failure-case analysis. The focus is on integration quality and predictable behavior under real constraints.
Experiments and what I learned
The pipeline was evaluated on real runs in three environments: a church, the DLS lab, and a large IIT test room. The mapping runs validate that the same setup can produce usable 2D grids across very different layouts. Semantic mapping focuses on a small closed set of classes to keep the database compact and reliable. Navigation trials in the IIT room cover multiple object layouts and multi-object scenarios.
Two practical limitations show up clearly in real runs: depth alignment errors can slightly bias object placement, and localization inconsistencies can cause early goal acceptance in edge cases. Both are visible in logs and RViz, so they are debuggable and fixable rather than hidden failures.
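For the early-goal-acceptance case, one concrete knob is the Nav2 goal checker tolerance. The excerpt below uses the standard SimpleGoalChecker plugin with illustrative values (not the thesis configuration; parameter names vary slightly across Nav2 releases):

```yaml
# nav2_params.yaml excerpt - illustrative goal checker tolerances (assumed values)
controller_server:
  ros__parameters:
    goal_checker_plugins: ["general_goal_checker"]
    general_goal_checker:
      plugin: "nav2_controller::SimpleGoalChecker"
      xy_goal_tolerance: 0.15   # meters; smaller = stricter goal acceptance
      yaw_goal_tolerance: 0.20  # radians
      stateful: true
```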