Sunday, February 8, 2026

Letting Gemini Drive My Rover

LLMs have had visual capabilities for a while now, with the general ability to understand what they’re looking at. But while this semantic understanding has always been impressive, their spatial understanding has been very poor, with earlier models struggling with basic tasks like drawing bounding boxes or giving accurate pixel coordinates for what they see.


This is starting to change: both Qwen3’s and Gemini 3’s releases last fall highlighted new “Spatial Reasoning” capabilities - not just understanding what’s in a scene, but how objects relate to each other, how they are placed, etc. Gemini 3 in particular shows an impressive ability to draw trajectories describing movement from one place in a scene to another. How well does this really work in practice? Can Gemini drive a rover?

This is interesting because LLMs can be a great complement to existing ROS2 robotics tools. These tools have a strong understanding of geometry - they can see the world in 3D (thanks to LIDAR and depth cameras), they’re aware of how big the robot’s body is, they can see obstacles and plan paths around them, etc. But they’re also kinda blind - scenes are purely geometric, and existing models only offer a basic and brittle understanding of the world. LLMs, on the other hand, lack any sense of scale, size or geometry, but have great semantic understanding and strong reasoning capabilities. It’s a match made in heaven!


I have a small tracked Waveshare robot powered by a Jetson Orin Nano 8GB and using an OAK-D Pro depth camera. The camera captures RGB-D images, where each pixel has a colour and depth, allowing the robot to see the world in 3D. To test Gemini’s capabilities, I wrote the following program:

  1. The user enters an arbitrary target that’s in the robot’s field of view. This can be as simple as “Blue Ball” or as complicated as “orange lego bus on the left side of the red chair”.

  2. A request is sent to Gemini to generate a trajectory from the robot’s current position to the user’s target. The prompt includes instructions on what a good trajectory looks like, what to avoid etc.

  3. Gemini returns the trajectory as a set of (x,y) coordinates in JSON format. Because the robot has a depth camera, it is able to project these (x,y) coordinates into (x,y,z) coordinates using the depth info and the camera’s intrinsic parameters.

  4. These (x,y,z) coordinates are sent to Nav2 (ROS2’s navigation tool) as a set of waypoints, including calculating the correct robot orientation for each waypoint. Nav2 can reject this trajectory if it is not traversable based on it’s own understanding of the 3D scene, obstacles, etc. Rejected trajectories are potentially valuable fine-tuning data.

  5. Nav2 executes the trajectory, but its local planner keeps an eye out for obstacles and stops the rover if a new one is detected - so if a person or pet moves in front of the rover while it’s driving, it stops safely.
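The pixel-to-3D projection in step 3 is standard pinhole-camera back-projection. Here’s a minimal sketch of the idea (the function name and intrinsic values are made up for illustration - real intrinsics come from the camera’s calibration):

```python
def pixel_to_3d(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with measured depth into camera-frame
    (x, y, z) metres using the pinhole camera model."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

# Hypothetical intrinsics - real values come from the camera's calibration.
fx, fy, cx, cy = 600.0, 600.0, 320.0, 240.0

# A pixel at the principal point sits straight ahead of the camera.
print(pixel_to_3d(320.0, 240.0, 1.5, fx, fy, cx, cy))  # (0.0, 0.0, 1.5)
```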

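The per-waypoint orientation in step 4 can be computed by pointing each waypoint at the next one. A sketch of one way to do this (yaw-only, assuming flat ground; the function name and the simplification are mine, not Nav2’s API):

```python
import math

def waypoint_orientations(points):
    """For each (x, y) waypoint, face the next waypoint and encode the
    heading as a yaw-only quaternion (z, w). The last waypoint keeps
    the previous heading."""
    yaw = 0.0  # default heading if there is only one waypoint
    quats = []
    for i, (x, y) in enumerate(points):
        if i < len(points) - 1:
            nx, ny = points[i + 1]
            yaw = math.atan2(ny - y, nx - x)
        quats.append((math.sin(yaw / 2.0), math.cos(yaw / 2.0)))
    return quats

# A straight line along +x gives yaw 0 everywhere, i.e. the identity
# quaternion (z=0, w=1) at each waypoint.
print(waypoint_orientations([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]))
```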


So how well does this work in practice? Not great, but not terrible. The whole setup works as intended: the trajectories Gemini generates place the robot near the target, but the waypoints aren’t sensibly spaced out, and it doesn’t do well with far-away objects. I suspect this kind of robot, with its low perspective about a foot off the ground, is uncommon in the training data. In addition, lag is a major problem - an API call takes about 30-60s, so running this continuously would be both slow and expensive.




There are three interesting follow-ups from here:

  1. Put this in a loop and have Gemini try to achieve an objective, like map out a room by itself.

  2. Fine-tune Gemini 3 Pro with a collection of good trajectories we’ve seen, improving its performance. We can use Nav2’s geometric understanding to fix and clean up trajectories as well - for example, enforcing 50cm spacing between waypoints. This effectively creates high-quality semi-synthetic training data.

  3. Distill Gemini 3 Pro’s capabilities into Gemma3 4B or Qwen3 4B. These vision models are small enough to run locally on the robot’s Jetson at several hertz, getting rid of the lag and the need for an internet connection. This may only work well for scoped tasks like indoor navigation.
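The 50cm spacing cleanup mentioned above is simple to enforce geometrically. A hypothetical sketch (my own helper, not a Nav2 function) that resamples a trajectory at a fixed arc-length interval:

```python
import math

def resample(points, spacing=0.5):
    """Walk the polyline of (x, y) waypoints and emit a new waypoint
    every `spacing` metres, keeping the original start and end."""
    out = [points[0]]
    dist_since = 0.0  # distance walked since the last emitted waypoint
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        seg = math.hypot(x1 - x0, y1 - y0)
        if seg == 0.0:
            continue
        pos = 0.0  # distance consumed along this segment
        while dist_since + (seg - pos) >= spacing:
            pos += spacing - dist_since
            t = pos / seg
            out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            dist_since = 0.0
        dist_since += seg - pos
    if out[-1] != points[-1]:
        out.append(points[-1])
    return out

# A 2m straight segment resampled at 0.5m spacing yields 5 waypoints.
print(resample([(0.0, 0.0), (2.0, 0.0)], spacing=0.5))
```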



As we’ve seen with Claude Code and other coding agents over the last year, giving LLM agents robust tools (3D scene understanding, obstacle detection) and sensible harnesses can give them impressive new capabilities. One reason to use LLMs directly instead of Vision-Language-Action (VLA) models is that, when paired with a map, tools like ROS2 can implement existing navigation algorithms like covering an area, and can effectively keep track of state. VLAs are largely reactive, keeping only a few seconds of history, so state tracking would have to be done in an external memory system regardless.

Thursday, February 5, 2026

Keep Your Voice

If you care about your voice, don't let LLMs write your words. But that doesn't mean you can't use AI to think, critique and draft lots of words for you. It depends on what you're writing for. If you're writing an impersonal document - a design document, a briefing, etc. - then who cares. In many cases (scientific papers, legal documents) you already have to write in a voice that is not your own. Go ahead and write those with AI. But if you're trying to say something more personal, the words should be your own. AI will always try to 'smooth' out your voice, and if you care about it, you gotta write it yourself.

Now, how do you use AI effectively and still retain your voice? Here's one technique that works well. Start with a voice memo: record yourself, maybe during a walk, and talk about the subject free-form - skip around, jump between sentences, just get it all out of your brain. Then open up a chat, add the recording or transcript, clearly state your intent in one sentence, and ask the AI to consider your thoughts and your intent and ask clarifying questions - what does the AI not understand about how your thoughts support your stated intent?

That'll produce a first draft, which will be bad. Tell the AI all the things that don't make sense to you, that you don't like - comment on the whole doc - and get a second draft. Ask the AI if it has more questions for you; live voice chat makes this conversation go smoother, since you can answer its questions by talking freely. Repeat this one or two more times, and a much finer draft will take shape, closer to what you want to say. During this drafting stage the AI will always try to smooth or average out your ideas, so it is important to keep pointing out all the ways in which it is wrong.

This process front-loads all the thinking. Once you've read and critiqued several drafts, all your ideas will be much clearer. Then, sit down and write your own words from scratch - they will come much easier after all your thoughts have been exercised during the drafting process.