Two Tiers, Not One Big Model
There is a recurring temptation to ask a single end-to-end model to take 'clean the sticky spill' and produce motor commands. We've concluded that this is the wrong abstraction for service work. Operators give instructions at one level and controllers consume commands at another; the gap between them is exactly what a small, inspectable plan should bridge.
Our architecture is a high-level planner (an LLM with tools) that decomposes the instruction into a sequence of cabin-cleaning stages, and a low-level VLA++ policy that executes each stage closed-loop on perception and force.
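The split between the two tiers can be sketched as a pair of narrow interfaces with a plain list of stage commands in between. This is a minimal illustration, not our production code: the names `StageCommand`, `Planner`, `Policy`, `StubPlanner`, and `StubPolicy` are hypothetical, and the stubs stand in for the LLM planner and the VLA++ policy.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass(frozen=True)
class StageCommand:
    """One step of the inspectable plan: which stage to run, and where."""
    stage: str   # e.g. "wipe"
    region: str  # e.g. "passenger_seat"


class Planner(Protocol):
    def plan(self, instruction: str) -> List[StageCommand]: ...


class Policy(Protocol):
    def execute(self, command: StageCommand) -> bool: ...


class StubPlanner:
    """Stand-in for the LLM planner; returns a fixed plan for illustration."""

    def plan(self, instruction: str) -> List[StageCommand]:
        return [
            StageCommand("inspect", "cabin"),
            StageCommand("wipe", "passenger_seat"),
            StageCommand("verify", "passenger_seat"),
        ]


class StubPolicy:
    """Stand-in for the low-level policy; records what it was asked to do."""

    def __init__(self) -> None:
        self.log: List[StageCommand] = []

    def execute(self, command: StageCommand) -> bool:
        self.log.append(command)
        return True


def run(instruction: str, planner: Planner, policy: Policy) -> bool:
    """Plan once at the high level, then execute stage by stage."""
    plan = planner.plan(instruction)  # the plan is a plain list, easy to inspect
    for command in plan:
        if not policy.execute(command):
            return False  # stop at the first stage that fails
    return True


if __name__ == "__main__":
    policy = StubPolicy()
    print(run("clean the sticky spill", StubPlanner(), policy))
    print([c.stage for c in policy.log])
```

The point of the sketch is that the plan is ordinary data: it can be logged, reviewed, or edited before any motor command is issued.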
Where the LLM Stops, the Tree Starts
Inside each stage, we use behavior trees as a fallback when the policy reports low confidence. The tree's nodes are deliberately boring: 'retreat 5 cm', 'request operator confirmation', 'switch tool to wiper', 'rerun perception'. This gives us deterministic recovery behavior without asking the LLM to be a real-time controller.
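A fallback tree of this kind can be sketched in a few lines. The node names come from the list above; everything else is an assumption made for illustration, including the node ordering, the 0.6 confidence threshold, and the boolean stand-ins for real perception and hardware outcomes.

```python
from typing import Callable, List

SUCCESS, FAILURE = "success", "failure"


class Action:
    """Leaf node: runs one fixed recovery step and reports whether it worked."""

    def __init__(self, name: str, fn: Callable[[], bool]) -> None:
        self.name = name
        self.fn = fn

    def tick(self, log: List[str]) -> str:
        log.append(self.name)
        return SUCCESS if self.fn() else FAILURE


class Sequence:
    """Ticks children in order; fails as soon as one child fails."""

    def __init__(self, children: list) -> None:
        self.children = children

    def tick(self, log: List[str]) -> str:
        for child in self.children:
            if child.tick(log) == FAILURE:
                return FAILURE
        return SUCCESS


class Fallback:
    """Ticks children in order; succeeds as soon as one child succeeds."""

    def __init__(self, children: list) -> None:
        self.children = children

    def tick(self, log: List[str]) -> str:
        for child in self.children:
            if child.tick(log) == SUCCESS:
                return SUCCESS
        return FAILURE


def build_recovery_tree(perception_ok: bool, tool_switch_ok: bool) -> Sequence:
    # The booleans are hypothetical stand-ins for real step outcomes.
    return Sequence([
        Action("retreat 5 cm", lambda: True),
        Fallback([
            Action("rerun perception", lambda: perception_ok),
            Action("switch tool to wiper", lambda: tool_switch_ok),
            Action("request operator confirmation", lambda: True),
        ]),
    ])


def step(confidence: float, tree: Sequence, log: List[str],
         threshold: float = 0.6) -> str:
    """Let the policy act when confident; otherwise tick the recovery tree."""
    if confidence >= threshold:  # threshold value is an assumed placeholder
        log.append("policy")
        return SUCCESS
    return tree.tick(log)


if __name__ == "__main__":
    trace: List[str] = []
    print(step(0.3, build_recovery_tree(False, True), trace), trace)
```

Because every node is a fixed step, the same low-confidence situation always produces the same recovery trace, which is what makes the behavior auditable.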
Tying Language to the Scene Graph
When an operator says 'the spill on the passenger seat', the planner needs to know which seat that is in the live scene graph. We maintain a labeled scene graph from perception (driver seat, passenger seat, rear bench, floor mats, console) and ground language references through a small classifier rather than relying on the LLM to reason about cabin layout from pixels.
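As a rough sketch of the grounding step, the example below maps a phrase to a scene-graph node. It assumes a scene graph stored as a dictionary of node ids to labels, and it uses simple token overlap as a stand-in for the small classifier; the node ids, the `ground_reference` function, and the 0.5 score cutoff are all hypothetical.

```python
import re
from typing import Dict, Optional

# Hypothetical node ids; the labels are the ones perception maintains.
SCENE_GRAPH: Dict[str, str] = {
    "node_0": "driver seat",
    "node_1": "passenger seat",
    "node_2": "rear bench",
    "node_3": "floor mats",
    "node_4": "console",
}


def ground_reference(phrase: str, scene_graph: Dict[str, str],
                     min_score: float = 0.5) -> Optional[str]:
    """Return the node id the phrase refers to, or None if unclear.

    Token overlap is a stand-in scorer for illustration only; a trained
    classifier would replace the scoring line below.
    """
    phrase_tokens = set(re.findall(r"[a-z]+", phrase.lower()))
    scores = {}
    for node_id, label in scene_graph.items():
        label_tokens = set(label.split())
        scores[node_id] = len(label_tokens & phrase_tokens) / len(label_tokens)

    best_score = max(scores.values())
    best_nodes = [n for n, s in scores.items() if s == best_score]
    # Refuse to guess when nothing matches well or several nodes tie.
    if best_score < min_score or len(best_nodes) != 1:
        return None
    return best_nodes[0]


if __name__ == "__main__":
    print(ground_reference("the spill on the passenger seat", SCENE_GRAPH))
    print(ground_reference("the seat", SCENE_GRAPH))
```

Returning `None` on a tie (for example, 'the seat' matches both front seats equally) leaves room for the planner to ask the operator instead of picking one.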
Did We Actually Clean It?
Every stage ends with a verification step: a before/after visual comparison of the target region. The simplest version is a learned 'cleanliness' classifier on the cropped patch. If the verification fails, the plan re-enters the relevant stage with an escalated tool selection (e.g., from microfiber wiper to spray-and-wipe).
The honest limit: visual verification cannot detect everything, such as residual stickiness or smell. We complement it with operator spot-check workflows in early deployments and treat that gap as part of the spec, not a failure.
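The verify-then-escalate loop might look like the sketch below. The tool ladder uses the two tools named above; the callables `capture`, `run_stage`, and `cleanliness`, the `Patch` type, and the 0.8 threshold are hypothetical placeholders for the camera crop, the stage executor, and the learned classifier.

```python
from typing import Callable, List, Tuple

Patch = List[float]  # stand-in for a cropped image patch of the target region

TOOL_LADDER = ["microfiber wiper", "spray-and-wipe"]


def verify_and_escalate(
    region: str,
    capture: Callable[[str], Patch],
    run_stage: Callable[[str, str], None],
    cleanliness: Callable[[Patch, Patch], float],
    tools: List[str] = TOOL_LADDER,
    threshold: float = 0.8,  # assumed cutoff, not a measured value
) -> Tuple[bool, List[str]]:
    """Run the stage, compare before and after, escalate the tool on failure."""
    before = capture(region)
    tried: List[str] = []
    for tool in tools:
        run_stage(region, tool)
        tried.append(tool)
        after = capture(region)
        if cleanliness(before, after) >= threshold:
            return True, tried
    # Ladder exhausted: leave the region for an operator spot-check.
    return False, tried


if __name__ == "__main__":
    # Toy world: a single dirt level per region, reduced by each tool.
    dirt = {"passenger_seat": 1.0}
    factor = {"microfiber wiper": 0.5, "spray-and-wipe": 0.1}

    def capture(region: str) -> Patch:
        return [dirt[region]]

    def run_stage(region: str, tool: str) -> None:
        dirt[region] *= factor[tool]

    def cleanliness(before: Patch, after: Patch) -> float:
        return 1.0 - after[0]

    print(verify_and_escalate("passenger_seat", capture, run_stage, cleanliness))
```

When the ladder runs out, the function returns the list of tools it tried, which gives the operator spot-check a concrete starting point.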
The Six Cabin Stages
Inspect (build scene graph).
Pick (remove large debris by hand-equivalent grasps).
Sort (separate trash from belongings; flag belongings for operator).
Vacuum (loose debris on mats and seats).
Wipe (hard surfaces, glass, high-touch points).
Verify (per-region cleanliness check, regenerate plan if needed).
This sequence is a strong default; it is not a religion. The planner can skip stages that the inspection pass shows unnecessary.
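The default order and the skipping rule can be expressed as a small filter over the stage list. In this sketch the `findings` dictionary is a hypothetical summary of the inspection pass, and two choices are assumptions rather than stated policy: Inspect and Verify always run, and a stage with no finding is kept rather than skipped.

```python
from typing import Dict, List

DEFAULT_STAGES = ("inspect", "pick", "sort", "vacuum", "wipe", "verify")

# Assumption: these two stages are never skipped.
ALWAYS_RUN = {"inspect", "verify"}


def select_stages(findings: Dict[str, bool]) -> List[str]:
    """Keep the default order, dropping stages inspection marked unnecessary.

    findings maps a stage name to True (needed) or False (not needed).
    A stage missing from findings is kept, which is the conservative choice.
    """
    return [
        stage
        for stage in DEFAULT_STAGES
        if stage in ALWAYS_RUN or findings.get(stage, True)
    ]


if __name__ == "__main__":
    # Inspection found no large debris and no belongings in the cabin.
    print(select_stages({"pick": False, "sort": False}))
```

Filtering a fixed tuple keeps the relative order of the remaining stages intact, so skipping never reorders the work.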