Sunday, February 8, 2026

Letting Gemini Drive My Rover

LLMs have had visual capabilities for a while now, with the general ability to understand what they’re looking at. But while this semantic understanding has been impressive, their spatial understanding has been poor, with earlier models struggling with basic tasks like drawing bounding boxes or giving accurate pixel coordinates for what they see.


This is starting to change: both Qwen3’s and Gemini 3’s releases last fall highlighted new “Spatial Reasoning” capabilities - not just understanding what’s in a scene, but how the objects relate to each other, how they are placed, and so on. Gemini 3 in particular shows an impressive ability to draw trajectories describing movement from one place in a scene to another. How well does this really work in practice? Can Gemini drive a rover?

This is interesting because LLMs can be a great complement to existing ROS2 robotics tools. These tools have a strong understanding of geometry - they can see the world in 3D (thanks to LIDAR and depth cameras), they’re aware of how big the robot’s body is, they can see obstacles and plan paths around them etc. But they’re also kinda blind - scenes are geometric and existing models only offer basic and brittle understanding of the world. LLMs on the other hand, lack any sense of scale, size or geometry, but have great semantic understanding and strong reasoning capabilities. It’s a match made in heaven!


I have a small tracked Waveshare robot powered by a Jetson Orin Nano 8GB and using an OAK-D Pro depth camera. The camera captures RGB-D images, where each pixel has a colour and depth, allowing the robot to see the world in 3D. To test Gemini’s capabilities, I wrote the following program:

  1. The user enters an arbitrary target that’s in the robot’s field of view. This can be as simple as “Blue Ball” or as complicated as “orange lego bus on the left side of the red chair”.

  2. A request is sent to Gemini to generate a trajectory from the robot’s current position to the user’s target. The prompt includes instructions on what a good trajectory looks like, what to avoid etc.

  3. Gemini returns the trajectory as a set of (x,y) coordinates in JSON format. Because the robot has a depth camera, it can project these (x,y) pixel coordinates into (x,y,z) coordinates using the depth info and the camera’s intrinsic parameters (see the sketch after this list).

  4. These (x,y,z) coordinates are sent to Nav2 (ROS2’s navigation tool) as a set of waypoints, including calculating the correct robot orientation for each waypoint. Nav2 can reject this trajectory if it is not traversable based on its own understanding of the 3D scene, obstacles, etc. Rejected trajectories are potentially valuable fine-tuning data.

  5. Nav2 executes the trajectory, but its local planner keeps an eye out for obstacles and stops the rover if a new one is detected - so if a person or pet moves in front of the rover while it’s driving, it stops safely.
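
To make steps 3 and 4 concrete, here’s a minimal sketch of the maths involved, assuming a pinhole camera model with known intrinsics (fx, fy, cx, cy), a depth image aligned to the RGB frame, and Gemini’s JSON reply already parsed into pixel coordinates. The names are illustrative, not the actual code running on the rover:

  import math

  def pixel_to_point(u, v, depth_m, fx, fy, cx, cy):
      # Back-project a pixel (u, v) with depth in metres into a camera-frame (x, y, z) point.
      x = (u - cx) * depth_m / fx
      y = (v - cy) * depth_m / fy
      return (x, y, depth_m)

  def yaw_to_quaternion(yaw):
      # Planar heading as an (x, y, z, w) quaternion, the form Nav2 pose messages expect.
      return (0.0, 0.0, math.sin(yaw / 2.0), math.cos(yaw / 2.0))

  def waypoints_with_orientation(points_xy):
      # Give each waypoint a heading that points at the next one;
      # the final waypoint keeps the heading of its last segment.
      poses = []
      for i, (x, y) in enumerate(points_xy):
          if i + 1 < len(points_xy):
              tx, ty = points_xy[i + 1]
              yaw = math.atan2(ty - y, tx - x)
          else:
              px, py = points_xy[i - 1]
              yaw = math.atan2(y - py, x - px)
          poses.append((x, y, yaw_to_quaternion(yaw)))
      return poses

In practice the camera-frame points still need to be transformed into the map frame (via tf2) before being handed to Nav2, and invalid depth readings at the chosen pixel have to be handled, but the core of steps 3 and 4 is just this projection and heading calculation.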



So how well does this work in practice? Not great, but not terrible. The whole setup works as intended: the trajectories Gemini generates place the robot near the target, but the waypoints aren’t sensibly spaced out, and it doesn’t do well with far-away objects. I suspect the robot’s form factor and low perspective, about a foot off the ground, might be underrepresented in its training data. In addition, lag is a major problem - an API call takes about 30-60s, so running this continuously would be both slow and expensive.




There are a few interesting follow-ups from here:

  1. Put this in a loop and have Gemini try to achieve an objective, like map out a room by itself.

  2. Fine-tune Gemini 3 Pro with a collection of good trajectories we’ve seen, improving its performance. We can use Nav2’s geometric understanding to fix and clean up trajectories as well - for example, enforcing 50cm spacing between waypoints (see the sketch after this list). This effectively creates quality semi-synthetic training data.

  3. Distill Gemini 3 Pro’s capabilities into Gemma3 4B or Qwen3 4B. These vision models are small enough to run locally on the robot’s Jetson at several hertz, getting rid of the lag and the need for an internet connection. This may only work well for scoped tasks like indoor navigation.
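
As an illustration of the clean-up mentioned in the second point, here’s one way to resample a trajectory to roughly 50cm spacing by walking along the polyline. This is a sketch of the idea, not the actual data pipeline:

  import math

  def resample_waypoints(points_xy, spacing_m=0.5):
      # Walk along the polyline and emit a point every spacing_m metres.
      if len(points_xy) < 2:
          return list(points_xy)
      resampled = [points_xy[0]]
      carried = 0.0  # distance covered since the last emitted point
      for (x0, y0), (x1, y1) in zip(points_xy, points_xy[1:]):
          seg = math.hypot(x1 - x0, y1 - y0)
          while seg > 0 and carried + seg >= spacing_m:
              t = (spacing_m - carried) / seg
              x0, y0 = x0 + t * (x1 - x0), y0 + t * (y1 - y0)
              seg = carried + seg - spacing_m
              carried = 0.0
              resampled.append((x0, y0))
          carried += seg
      if resampled[-1] != points_xy[-1]:
          resampled.append(points_xy[-1])  # always keep the goal point
      return resampled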



As we’ve seen with Claude Code and other coding agents over the last year, giving LLM agents robust tools (3D scene understanding, obstacle detection) and sensible harnesses can give them impressive new capabilities. One reason you might want to use LLMs directly instead of Vision-Language-Action (VLA) models is that, when paired with a map, tools like ROS2 can implement existing navigation algorithms (like covering an area) and can effectively keep track of state. VLAs are largely reactive, keeping only a few seconds of history, so state has to be tracked in an external memory system regardless.

Thursday, February 5, 2026

Keep Your Voice

If you care about your voice, don't let LLMs write your words. But that doesn't mean you can't use AI to think, critique and draft lots of words for you. It depends on what purpose you're writing for. If you're writing an impersonal document, like a design document, briefing, etc, then who cares. In many cases (scientific papers, legal documents) you already have to write in a voice that is not your own. Go ahead and write these with AI. But if you're trying to say something more personal, then the words should be your own. AI will always try to 'smooth out' your voice, and if you care about it, you gotta write it yourself.

Now, how do you use AI effectively and still retain your voice? Here's one technique that works well: start with a voice memo. Just record yourself, maybe during a walk, and talk about the subject you want to cover - free form, skip around, jump between sentences, just get it all out of your brain. Then open up a chat, add the recording or transcript, clearly state your intent in one sentence, and ask the AI to consider your thoughts and your intent and to ask clarifying questions. Like, what does the AI not understand about how your thoughts support the clearly stated intent of what you want to say? That'll produce a first draft, which will be bad. Then tell the AI all the things that don't make sense to you, that you don't like, just comment on the whole doc, and get a second draft. Ask the AI if it has more questions for you; you can use live chat to make this conversation go smoother as well, since when the AI is asking you questions you can answer freely by voice. Repeat this one or two more times, and a much finer draft will take shape that is closer to what you want to say. During this drafting stage, the AI will always try to smooth or average out your ideas, so it is important to keep pointing out all the ways in which it is wrong.

This process helps because all the thinking happens up-front. Once you've read and critiqued several drafts, all your ideas will be much clearer. Then sit down and write your own words from scratch - they will come much more easily after all your thoughts have been exercised during the drafting process.

Sunday, March 26, 2023

A Stadium Full Of Ancestors

You are sitting in the centre field-level front-row seat of a large football stadium. On your right is your mom, and beside her, your grandfather. You know them well, you say hi. But beside him is his mom, and her father, and his father, and his mother, and her mother, and her mother, and her father, and his father, and on, and on, and on - an uninterrupted line of your ancestors snaking their way all around the first row, then the second row, then the third row, and on and on until they fill the entire stadium.

You don't know these people, but at first they look very familiar, except they wear funny clothes. They're obviously conscious and intelligent and deserving of legal personhood just like you. But you look a few rows up and they start to look off, kinda weird. Whether Neanderthal or Denisovan, they're not quite like you any more. The further up you look, the less humanlike they look, until you look at the top row and see your great^100,000-grandmother Lucy. She's bipedal, but not human anymore. If you saw her in a zoo rather than at the top of the stadium, you would never think that she is conscious, or deserves the same legal rights as you.

So where in this stadium did intelligence and consciousness arise? Is there a single ancestor you could plausibly point to and say, "This person deserves legal personhood, she's conscious, but her mom, no, she is not a person"? It's impossible; the boundaries of intelligence are too fuzzy. The best you can do is point to some rather large group and say that somewhere in there, the rate of change added up enough to make some kind of difference.

As we wonder whether an AI is "alive", or when AIs will become conscious or intelligent enough to deserve legal protection, it's useful to remember that the answer will probably be at least as hard to pin down. A definitive answer may well be impossible; it will be fuzzy, and we'll reach broad consensus slowly.

Monday, September 28, 2015

How Cheap Can Autonomous Cars Get?

Autonomous cars are coming and everybody thinks they'll be a pretty big deal, but it's impossible to predict exactly what their impact will be. Today they are billed simultaneously as the saviour of our congested, car-dependent cities, and a job-killing, life-destroying tool of the global technocapitalist class.

To help us think about their impact, let's consider how cheap self driving cars could be. We'll use the simple metric of money per kilometre and make the following assumptions:

  • The year is 2039 and full Autonomous Cars (ACs) have been shipping for over a decade.
  • All new ACs are electric, with an efficiency of 5 km/kWh (roughly 105 MPGe).
  • The autonomous drive systems have been around for a while and are commoditized, like ABS or Traction Control systems are today. They add $ to a vehicle.
  • Electricity costs $/kWh.
  • Our car costs $, comes with a 500 km range, and the 100 kWh battery pack has 1000 charge/discharge cycles.
  • Insurance costs $/year and maintenance is $/year. The life of the car is 5 years.
  • The residual value after end of life is $.


Given the above, our battery usable life is:

1000 cycles * 500 km = 500,000 km

The running costs of the car are:

5 years * (insurance + maintenance) + 500,000 km / (5 km/kWh) * (electricity price per kWh) = running costs

So the total cost of this car is:

running costs + car price + drive system cost - residual value = total cost

Which gives us a total of:

total cost * 100 / 500,000 km = cents per kilometre
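
The interactive inputs from the original post don't carry over here, so the sketch below just re-runs the arithmetic with illustrative placeholder values - the car price, drive system cost, electricity price, insurance, maintenance and residual value are my guesses, not the post's defaults - and with these particular guesses it happens to land on the 14 cents per kilometre quoted below.

  # Illustrative placeholder assumptions - only the battery, range, efficiency
  # and car life come from the post; the dollar figures are guesses.
  battery_cycles = 1000          # charge/discharge cycles
  range_km = 500                 # km per full charge
  efficiency_km_per_kwh = 5
  life_years = 5
  car_price = 50_000             # $ (guess)
  drive_system_cost = 5_000      # $ (guess)
  electricity_per_kwh = 0.10     # $ (guess)
  insurance_per_year = 2_000     # $ (guess)
  maintenance_per_year = 1_000   # $ (guess)
  residual_value = 10_000        # $ (guess)

  lifetime_km = battery_cycles * range_km                                   # 500,000 km
  energy_cost = lifetime_km / efficiency_km_per_kwh * electricity_per_kwh
  running_costs = life_years * (insurance_per_year + maintenance_per_year) + energy_cost
  total_cost = running_costs + car_price + drive_system_cost - residual_value
  print(round(total_cost * 100 / lifetime_km, 1), "cents per km")           # 14.0 with these guesses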


Feel free to play around with the assumptions. After playing around a bit, we can see that the biggest impact, apart from the price of the car itself, comes from the quality of the battery - a larger battery with a longer life (more recharge cycles) matters more than cheap electricity or a marginal improvement in efficiency (mileage). This makes sense, as a car with 50 more recharge cycles will give you significantly more mileage for the same buck.

Using the default assumptions, we get a price of 14 cents per km if you ride in this non-luxury, mid-size car 100,000 km/year. Assuming this car is part of a taxi service and we add some profit, we can expect to pay about $2 for 10 km, which is quite significant, as it is even cheaper than most public transit systems today.

Thursday, February 7, 2013

What Comes After Services? Interpretation

I've been reading the book Regenesis by George M. Church, which provides a great overview of where the biotech industry is heading, written by an author with ample academic and entrepreneurial experience.

In one section, Church describes the evolution of genetic research over the last several decades - how scientists went from doing everything manually to using machines and how much more productive they became. He ends with:
So in summary, the descent of man (the devolution of research persons) went like this: (1) DIY. (2) Buy parts. (3) Buy kits. (4) Buy machines. (5) Buy services. (6) Buy interpretation. 
This immediately struck me, not only because of its resemblance to the evolution of the computer and many other technologies, but also to a very important aspect of startups: how to make money by offering customers value.

Generally speaking, your profits will be proportional to the value you offer your customers and you'll pull ahead of your competitors by offering more value. At the same time the commoditizing nature of technology means that what's valuable and profitable today will become ordinary and cheap tomorrow. The challenge for startups then is to move up this value chain, disrupt competitors stuck on the lower rungs of the ladder and reap the profits.

Here at Kytephone, we make an app that turns an ordinary Android phone into a kid's phone with parental controls. This lets us give parents peace of mind by offering them a service that lets them locate their child, see who they've been talking to, which apps they've been using, etc. While parents certainly appreciate our service, what they really want is someone to tell them specific, important information - did my child get to school safely? Are they being harassed by someone? Are they spending too much time on Facebook? In other words, parents want an interpretation of their child's data to assuage their fears and worries.

Our challenge then is to give parents timely, important information about their children without them having to do anything. And bonus points for not asking for anything back, like "Where does your child go to school?" While it is very hard for computers to answer such questions, we are surely heading that way. We can see this not only in the Machine Learning boom, but also in high-profile efforts like Google Now or IBM Watson.

Tuesday, November 27, 2012

The Efficiency Index

I've always loved the idea of indexes - a collection of securities that make it on and off a list based on a well-defined set of rules. As I tried to imagine what a hypothetical "Drashkov Index" would look like, I quickly realized that all the companies I would put on there had one thing in common - if successful, they would make the world a far more efficient place.

I believe that for the foreseeable future (the next two decades at least), the world will not see any form of cheap energy. Consequently, we won't experience anything resembling the cheap-oil fueled growth of the post WWII era. Today Americans make up about 5% of the world's population, yet consume about 25% of its resources. With billions of people striving for a Western standard of living, demand for energy will be insatiable. At the same time, the increasingly obvious effects of climate change will make more people receptive to treating and pricing carbon as the pollutant it is.

If we accept a world of high energy prices, the only way to grow and develop is to make our world a far more efficient place and in a way that is much different from merely optimizing our existing products and processes. To illustrate the difference, consider a few examples:

  • Marc Andreessen famously said that "Software is Eating the World". By its very nature, doing tasks in software is far more efficient than doing them in hardware. Writing an article on a laptop and publishing it online is far more efficient than using a typewriter, the post office and getting it printed on a pile of dead trees.
  • The mechanical parts and motors in electric cars are efficient in a way ICE engines and drive trains can never be. Tesla's Model S - a large, heavy luxury sedan - is considerably more fuel efficient than any econobox on the market today.
  • 3D printing / Additive Manufacturing is inherently a much more efficient way of building objects than the wasteful processes of today's manufacturing, which mostly involves starting with large blocks of matter and getting rid of lots of material.
  • Lab grown and artificial meat requires far less biomass and energy to make a pound of meat protein. While many people choose a vegetarian or vegan lifestyle, the majority of people in the world would like to enjoy the same meat and protein-heavy diet as westerners enjoy today. The only plausible way this will happen is through something as radically new and efficient as lab grown and artificial meat.

I hope that one day we'll see a cheap green-energy fueled economic boom, but for the foreseeable future, I think we'll be living in a world of expensive energy. The Efficiency Index - a collection of companies whose raison d'ĂȘtre is to make the world more efficient - seems like a great investment. So, what companies would you put on the Efficiency Index?

Saturday, February 18, 2012

How To Make the 23" Android MegaPad

A few months ago I published a video showing me using my home-made 23" Android tablet, which got a bit of attention and made people wonder how it was made. A lot of people made plausible guesses - that it was a staged video, that it ran android-x86, etc - but few made the right guess, so I wanted to publish a how-to so anyone that's interested can make their own.

The core of the MegaPad is the TI PandaBoard - a $200 ARM development board which contains what are essentially the guts of any modern smartphone: a dual-core 1GHz CPU, 1GB RAM, GPU, WiFi and a host of connectors. The great folks at TI provide both Ubuntu and Android releases for the PandaBoard and it's fairly straightforward to get one of the releases up and running. For the touch input and video output, I went with the Acer T230H, which was made for Windows 7 and uses optical touch to provide two touch points with acceptable performance. This monitor has been discontinued by Acer, so I found mine on Kijiji, but as we'll see, you can substitute another suitable touch monitor.

The bit that makes it all work is this: pandaboard releases are made to work with a keyboard and mouse and will not recognize the Acer touchscreen if you merely plug it in. Luckily, Linux developers have written a touchscreen driver that works with the Acer monitor, so all we have to do is recompile the kernel with the right drivers and voila! We've got ourselves a MegaPad. There are a few more details: the touch driver that comes with the Gingerbread pandaboard release supports only one touch point, so in order to get dual-touch working as in the video, we'll have to patch the driver.

There are a number of improvements one could make to the MegaPad. The lagginess you see in the video is due to the fact that Gingerbread was never made to drive 1080p at reasonable frame rates; however, the folks at Linaro have been busy making an ICS release for the pandaboard, which you can check out here. I suspect ICS will run far smoother than Gingerbread. Moreover, one can use any size touchscreen, as long as Linux drivers are available. I went with the T230H simply because it didn't require the big investment larger screens do. Now, without further ado:


Instructions

  1. First, get a PandaBoard from DigiKey, Mouser or any other retailer. I also recommend a serial-to-USB adapter, since you'll need to use the PandaBoard's serial connection, as well as a very fast SD card.
  2. Get an Acer T230H or any other touch monitor which has working linux drivers.
  3. Install minicom or a similar serial terminal on your laptop. The pandaboard does not know how to boot itself, so you'll need to connect to the serial port and paste some bootargs to tell the bootloader where to find the kernel.
  4. Get everything working with a keyboard and mouse: You can follow the instructions here on how to download and put the binaries onto the sdcard. Essentially, the SDCard will be formatted with 3 partitions: bootfs where you put the boot loader and linux kernel, rootfs where you put Android and data where you can store media files. To boot up the board, put the card in, connect the power and start the serial console. When you see the fastboot countdown on the console, interrupt it by pressing enter so you can get a prompt, then paste in the bootargs found here.
  5. Once you have Android up and running on the pandaboard using a keyboard and mouse, we can recompile the kernel to get our touchscreen working.
  6. Set up your computer and get the tools you'll need to download and compile the kernel by following these instructions. Next, follow the instructions here to download and compile the kernel without any modifications.
  7. Once you have everything set up (and patched, as per the instructions), try compiling the kernel and loading Android on the Pandaboard with it. The compilation, which should take 10 mins or less, will produce a file called "uImage", which is the entire linux kernel. Take that file and copy it into the bootfs partition on your sdcard and start the pandaboard as before.
  8. If everything is working fine and you can compile and load your own kernel, we're ready for the modifications to get the touchscreen working.
  9. In your kernel dir, go to arch/arm/configs/panda_defconfig and search for "#CONFIG_HID_QUANTA is not set", which should be commented out. Change the line to "CONFIG_HID_QUANTA=y".
  10. Recompile the kernel, load it on the SD card and connect the USB touch port from the touchscreen into the pandaboard and disconnect the keyboard and mouse. Once the boot is complete you should have a fully operable, albeit single-touch, megapad!
  11. To get dual-touch working, patch the quanta driver using this patch. You can find the driver under drivers/hid/hid-quanta.c. Once again, recompile and reload. You should now have dual-touch fully working.
Tips
  • The pandaboard may appear slow, which may be due to Android's inability to write to the /data directory. Put the SDCard in your computer, go to the rootfs (Android) partition and do "chmod -R 777 /data". I found that this made performance on my pandaboard acceptable.
  • The pandaboard needs a weird 5V @ 20A power supply and most 5V supplies won't work. I suggest merely plugging the miniUSB port into your computer and powering it the way you would recharge any phone or tablet.
  • Getting adb working is much the same as any other device. Add the following (SUBSYSTEM=="usb", ATTR{idVendor}=="0451", ATTR{idProduct}=="d102", MODE="0666") to your udev rules, restart udev and adb, and you're good to go.

If you have any questions or comments, you can check me out on Google+ here.

Happy Hacking