Letting Gemini Drive My Rover
LLMs have had visual capabilities for a while now, with a general ability to understand what they’re looking at. But while this semantic understanding has been impressive, their spatial understanding has been poor, with earlier models struggling at basic tasks like drawing bounding boxes or giving accurate pixel coordinates for what they see.
This is starting to change, and both the Qwen3 and Gemini 3 releases last fall highlighted new “Spatial Reasoning” capabilities - not just understanding what’s in a scene, but how objects relate to each other, how they’re placed, and so on. Gemini 3 in particular shows an impressive ability to draw trajectories describing movement from one place in a scene to another. How well does this really work in practice? Can Gemini drive a rover?
This is interesting because LLMs can be a great complement to existing ROS2 robotics tools. These tools have a strong understanding of geometry - they can see the world in 3D (thanks to LIDAR and depth cameras), they’re aware of how big the robot’s body is, they can see obstacles and plan paths around them, etc. But they’re also kinda blind - their scenes are purely geometric, and existing perception models offer only a basic, brittle understanding of the world. LLMs, on the other hand, lack any sense of scale, size or geometry, but have great semantic understanding and strong reasoning capabilities. It’s a match made in heaven!
I have a small tracked Waveshare robot powered by a Jetson Orin Nano 8GB and fitted with an OAK-D Pro depth camera. The camera captures RGB-D images, where each pixel has a colour and a depth, letting the robot see the world in 3D. To test Gemini’s capabilities, I wrote the following program:
1. The user enters an arbitrary target that’s in the robot’s field of view. This can be as simple as “Blue Ball” or as complicated as “orange lego bus on the left side of the red chair”.
2. A request is sent to Gemini to generate a trajectory from the robot’s current position to the user’s target. The prompt includes instructions on what a good trajectory looks like, what to avoid, and so on (a sketch of this call is shown after the list).
3. Gemini returns the trajectory as a set of (x, y) pixel coordinates in JSON format. Because the robot has a depth camera, these (x, y) coordinates can be projected into (x, y, z) coordinates using the depth information and the camera’s intrinsic parameters (see the projection sketch below).
4. These (x, y, z) coordinates are sent to Nav2 (ROS2’s navigation stack) as a set of waypoints, including calculating the correct robot orientation for each waypoint (see the Nav2 sketch below). Nav2 can reject the trajectory if it isn’t traversable based on its own understanding of the 3D scene, obstacles, etc. Rejected trajectories are potentially valuable fine-tuning data.
5. Nav2 executes the trajectory, but its local planner keeps an eye out for obstacles and stops the rover if a new one appears, so if a person or pet moves in front of the rover while it’s driving, it stops safely.
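To make step 2 concrete, here’s roughly what the Gemini request looks like. This is a minimal sketch using the google-genai Python SDK; the model name, prompt wording and waypoint count are illustrative assumptions, not the exact values running on the rover.

```python
import json

from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

PROMPT = (
    "You are driving a small tracked rover. The attached image is from its "
    "front-facing camera, mounted roughly a foot off the ground. "
    "Return a JSON array of 5-10 waypoints, each an object with integer "
    "pixel fields 'x' and 'y', tracing a drivable path along the floor "
    "from the bottom-centre of the image to this target: "
)

def request_trajectory(jpeg_bytes: bytes, target: str) -> list[dict]:
    """Ask Gemini for a pixel-space trajectory towards the named target."""
    response = client.models.generate_content(
        model="gemini-3-pro-preview",  # assumption: substitute whichever Gemini 3 model you have
        contents=[
            types.Part.from_bytes(data=jpeg_bytes, mime_type="image/jpeg"),
            PROMPT + target,
        ],
        config=types.GenerateContentConfig(
            response_mime_type="application/json",  # ask for raw JSON, no prose
        ),
    )
    return json.loads(response.text)  # e.g. [{"x": 412, "y": 363}, ...]
```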
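Projecting the returned pixels into 3D (step 3) is standard pinhole back-projection using the aligned depth image and the intrinsics (fx, fy, cx, cy) reported by the camera. A minimal sketch, with illustrative names:

```python
import numpy as np

def pixel_to_point(u: int, v: int, depth_m: np.ndarray,
                   fx: float, fy: float, cx: float, cy: float):
    """Back-project pixel (u, v) into a camera-frame (x, y, z) point in metres."""
    z = float(depth_m[v, u])  # depth image is indexed [row, column] = [v, u]
    if z <= 0.0:              # zero depth means "no measurement" at this pixel
        return None
    x = (u - cx) * z / fx     # standard pinhole back-projection
    y = (v - cy) * z / fy
    return (x, y, z)

def project_trajectory(waypoints_px, depth_m, fx, fy, cx, cy):
    """Convert Gemini's pixel waypoints to 3D, dropping pixels with no depth."""
    points = [pixel_to_point(w["x"], w["y"], depth_m, fx, fy, cx, cy)
              for w in waypoints_px]
    return [p for p in points if p is not None]
```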
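The hand-off to Nav2 (steps 4-5) can go through nav2_simple_commander. The sketch below assumes the 3D points have already been transformed into the map frame, and derives each waypoint’s orientation by pointing it at the next waypoint; it’s one way to do it, not necessarily the exact code on the rover.

```python
import math

import rclpy
from geometry_msgs.msg import PoseStamped
from nav2_simple_commander.robot_navigator import BasicNavigator

def make_pose(nav: BasicNavigator, x: float, y: float, yaw: float) -> PoseStamped:
    """Build a map-frame pose with a yaw-only quaternion."""
    pose = PoseStamped()
    pose.header.frame_id = "map"
    pose.header.stamp = nav.get_clock().now().to_msg()
    pose.pose.position.x = x
    pose.pose.position.y = y
    pose.pose.orientation.z = math.sin(yaw / 2.0)
    pose.pose.orientation.w = math.cos(yaw / 2.0)
    return pose

def follow_waypoints(points_xy):
    """Send the waypoints to Nav2 and block until it finishes (or rejects them)."""
    rclpy.init()
    nav = BasicNavigator()
    nav.waitUntilNav2Active()

    poses = []
    for i, (x, y) in enumerate(points_xy):
        # Orient each waypoint towards the next one so the rover drives forwards.
        nx, ny = points_xy[min(i + 1, len(points_xy) - 1)]
        yaw = math.atan2(ny - y, nx - x)
        poses.append(make_pose(nav, x, y, yaw))

    nav.followWaypoints(poses)       # Nav2's planner can still refuse untraversable legs
    while not nav.isTaskComplete():  # isTaskComplete() polls the action with its own timeout
        pass
    return nav.getResult()
```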
So how well does this work in practice? Not great, but not terrible. The whole setup works as intended: the trajectories Gemini generates do place the robot near the target, but the waypoints aren’t sensibly spaced, and it struggles with far-away objects. I suspect this type of robot, and its low perspective about a foot off the ground, might be uncommon in the model’s training data. In addition, lag is a major problem - each API call takes about 30-60 seconds, so running this continuously would be both slow and expensive.
There are a few obvious next steps:

- Put this in a loop and have Gemini try to achieve an objective, like mapping out a room by itself.
- Fine-tune Gemini 3 Pro on a collection of the good trajectories we’ve seen, improving its performance. We can also use Nav2’s geometric understanding to fix and clean up trajectories - for example, enforcing 50 cm spacing between waypoints (a resampling sketch follows this list). This effectively creates quality semi-synthetic training data.
- Distill Gemini 3 Pro’s capabilities into Gemma 3 4B or Qwen3 4B. These vision models are small enough to run locally on the robot’s Jetson at several hertz, getting rid of the lag and the need for an internet connection. This may only work well for scoped tasks like indoor navigation.
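As an example of that geometric clean-up, here’s a small sketch (plain NumPy, names are illustrative) that resamples a trajectory so consecutive waypoints sit roughly 50 cm apart:

```python
import numpy as np

def resample_waypoints(points_xy: np.ndarray, spacing_m: float = 0.5) -> np.ndarray:
    """Resample an (N, 2) polyline so waypoints sit evenly along its length."""
    deltas = np.diff(points_xy, axis=0)
    seg_len = np.linalg.norm(deltas, axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg_len)])  # cumulative distance at each waypoint
    total = arc[-1]
    n_out = max(int(total // spacing_m) + 1, 2)        # enough points for ~spacing_m gaps
    targets = np.linspace(0.0, total, n_out)
    x = np.interp(targets, arc, points_xy[:, 0])
    y = np.interp(targets, arc, points_xy[:, 1])
    return np.stack([x, y], axis=1)
```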
As we’ve seen with Claude Code and other coding agents over the last year, giving LLM agents robust tools (3D scene understanding, obstacle detection) and sensible harnesses can give them impressive new capabilities. One reason to use LLMs directly instead of Vision-Language-Action (VLA) models is that, when paired with a map, tools like ROS2 can implement existing navigation algorithms (like covering an area) and can effectively keep track of state. VLAs are largely reactive, keeping only a few seconds of history, so with them, state has to be tracked in an external memory system anyway.

