Where Pure Vision Policies Run Out of Information
Vision-Language-Action (VLA) models — Google's RT-2, Physical Intelligence's π0, OpenVLA — have made imitation policies dramatically more general. They map an image and a natural-language instruction directly to a robot action. They are also, in our experience, exactly as good as the pixels they are looking at.
Cabin cleaning runs into that limit in two specific ways. First, a sticky spill and a wet spill look identical to a camera; the difference is in the force required to release the wiper. Second, debris on a carpeted floor mat — sand, salt, broken glass, cereal — is visually ambiguous under cabin lighting; the difference is in the sound the vacuum makes when it ingests it.
We are not the first to notice this. The argument behind our internal architecture is that bolting force and audio onto a VLA backbone — what we call VLA++ — is the smallest change that recovers the missing information without throwing out the language conditioning that makes these models flexible.
Acoustic Tokens From a Directional Array
The mic array is four MEMS microphones on the end-effector housing, sampled at 48 kHz. We compute log-mel spectrograms with a 25 ms window and 10 ms hop, then pass them through a small CNN encoder pretrained on AudioSet. The output is a sequence of 64-dim tokens at roughly 100 Hz.
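As a rough sketch of that tokenization path, assuming a torchaudio front end (the real encoder is a CNN pretrained on AudioSet; the class name, layer sizes, and activation choices below are illustrative, not our production code):

```python
import torch
import torchaudio

SAMPLE_RATE = 48_000
WIN = int(0.025 * SAMPLE_RATE)   # 25 ms window -> 1200 samples
HOP = int(0.010 * SAMPLE_RATE)   # 10 ms hop   -> 480 samples, ~100 frames/s


class AudioTokenizer(torch.nn.Module):
    """Maps raw 4-mic audio to one 64-dim token per 10 ms frame."""

    def __init__(self, n_mics: int = 4, n_mels: int = 64, dim: int = 64):
        super().__init__()
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=SAMPLE_RATE, n_fft=WIN, win_length=WIN,
            hop_length=HOP, n_mels=n_mels,
        )
        # Stand-in for the AudioSet-pretrained CNN encoder.
        self.net = torch.nn.Sequential(
            torch.nn.Conv1d(n_mics * n_mels, 128, kernel_size=3, padding=1),
            torch.nn.GELU(),
            torch.nn.Conv1d(128, dim, kernel_size=3, padding=1),
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (mics, samples) from the end-effector array
        spec = torch.log(self.melspec(wav) + 1e-6)   # (mics, n_mels, frames)
        spec = spec.flatten(0, 1).unsqueeze(0)       # (1, mics*n_mels, frames)
        return self.net(spec).transpose(1, 2)        # (1, frames, 64) tokens
```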
What we care about is not the audio itself but its change when the tool contacts material. Glass shards through a vacuum nozzle have a distinctive high-frequency click that sand does not. We encode a short rolling window so the policy sees onset events, not steady-state noise.
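One way to make onsets explicit, sketched below under the assumption of simple mean-subtraction over a short causal window (the 100 ms length and the helper name onset_tokens are hypothetical choices, not our exact preprocessor):

```python
import torch


def onset_tokens(logmel: torch.Tensor, window: int = 10) -> torch.Tensor:
    """Emphasize contact onsets in a (frames, n_mels) log-mel sequence.

    Subtracts a causal rolling mean over the most recent `window` frames
    (~100 ms at a 10 ms hop) so steady-state vacuum noise cancels out and
    transients such as glass clicks stand out.
    """
    padded = torch.nn.functional.pad(logmel, (0, 0, window, 0))  # zero-pad past
    csum = padded.cumsum(dim=0)
    rolling_mean = (csum[window:] - csum[:-window]) / window
    return logmel - rolling_mean
```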
Encoding 6-Axis F/T as Tokens
The wrist sensor returns 6 channels (Fx Fy Fz, Tx Ty Tz) at 1 kHz. Feeding raw samples into a transformer is wasteful — most of the signal is in the temporal envelope, not individual samples. We downsample to 100 Hz and encode each channel with a small 1D conv that produces a 32-dim token per timestep.
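A minimal sketch of that tokenizer, assuming average pooling for the 1 kHz to 100 Hz downsample (the hidden widths and kernel sizes are illustrative):

```python
import torch


class ForceTorqueTokenizer(torch.nn.Module):
    """Average-pool 1 kHz wrist F/T samples down to 100 Hz, then map the
    6 channels (Fx Fy Fz, Tx Ty Tz) to a 32-dim token per 10 ms step."""

    def __init__(self, in_ch: int = 6, dim: int = 32):
        super().__init__()
        self.pool = torch.nn.AvgPool1d(kernel_size=10)   # 1 kHz -> 100 Hz
        self.conv = torch.nn.Sequential(
            torch.nn.Conv1d(in_ch, 64, kernel_size=5, padding=2),
            torch.nn.GELU(),
            torch.nn.Conv1d(64, dim, kernel_size=5, padding=2),
        )

    def forward(self, ft: torch.Tensor) -> torch.Tensor:
        # ft: (batch, 6, samples) raw wrist readings at 1 kHz
        return self.conv(self.pool(ft)).transpose(1, 2)  # (batch, steps, 32)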
We also include two derived channels that turn out to matter: the norm of the force vector and the angle between the force vector and the commanded motion. A wipe that is being resisted produces a clear opposing component; a wipe that is sliding cleanly does not.
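The derived channels are a few lines of geometry. In this sketch the function name, the epsilon guard, and the use of the commanded velocity direction as the motion reference are our assumptions:

```python
import torch


def derived_channels(force: torch.Tensor, commanded_vel: torch.Tensor) -> torch.Tensor:
    """force, commanded_vel: (steps, 3). Returns (steps, 2): force magnitude
    and the angle between measured force and commanded motion. The angle is
    near pi when the surface is resisting the wipe."""
    eps = 1e-8
    norm = force.norm(dim=-1)
    cos = (force * commanded_vel).sum(-1) / (
        norm.clamp(min=eps) * commanded_vel.norm(dim=-1).clamp(min=eps)
    )
    angle = torch.acos(cos.clamp(-1.0, 1.0))
    return torch.stack([norm, angle], dim=-1)
```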
Late Fusion vs. Cross-Attention
There are two honest options for combining the four streams (vision, language, audio, and force/torque): concatenate the per-modality embeddings before the action head (late fusion), or interleave them as separate token streams that cross-attend inside the transformer.
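For concreteness, here is a toy sketch of both options; the module names, widths, and the mean-pooled readouts are placeholders, and the cross-attention variant is approximated by self-attention over the concatenated token streams:

```python
import torch
import torch.nn as nn

D = 256  # shared model width; all sizes here are illustrative


class LateFusionHead(nn.Module):
    """Option A: pool each modality to one vector, concatenate, decode."""

    def __init__(self, n_streams: int = 4, act_dim: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(n_streams * D, 512), nn.GELU(),
                                 nn.Linear(512, act_dim))

    def forward(self, streams: list[torch.Tensor]) -> torch.Tensor:
        # streams: per-modality token sequences, each (batch, tokens_i, D)
        pooled = torch.cat([s.mean(dim=1) for s in streams], dim=-1)
        return self.mlp(pooled)


class CrossAttentionFusion(nn.Module):
    """Option B: keep the streams as tokens and let a shared transformer
    attend across modalities before the action readout."""

    def __init__(self, act_dim: int = 7):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.readout = nn.Linear(D, act_dim)

    def forward(self, streams: list[torch.Tensor]) -> torch.Tensor:
        tokens = torch.cat(streams, dim=1)   # (batch, sum of tokens_i, D)
        return self.readout(self.encoder(tokens).mean(dim=1))
```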
Late fusion is simpler and trains faster. Cross-attention is more expressive and historically wins on multimodal benchmarks. Our plan is to ablate both on the same dataset and report the gap. Until that data is in, anyone claiming one decisively beats the other in this setting is guessing.
What We Actually Want to Measure
Aggregate success rate is a misleading number for a cleaning robot. A policy that succeeds 95% of the time but smears jam across the headliner on the other 5% is unshippable. Our eval suite is structured around three bounded tasks where we can score outcomes per attempt: spill recovery (visual cleanliness check before/after), material-conditioned force adaptation (peak force stays below a per-material limit), and glass detection (the policy must call out and refuse to vacuum visible shards).
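The harness is mostly bookkeeping. A sketch of the per-attempt record, where the force limits, thresholds, and field names are illustrative stand-ins rather than our published criteria:

```python
from dataclasses import dataclass

# Illustrative per-material peak-force limits in newtons.
FORCE_LIMIT_N = {"glass": 5.0, "leather": 8.0, "plastic_trim": 12.0}


@dataclass
class AttemptResult:
    task: str                  # "spill_recovery" | "force_adaptation" | "glass_detection"
    material: str
    peak_force_n: float
    cleanliness_delta: float   # after-minus-before visual cleanliness score
    refused_glass: bool

    def passed(self) -> bool:
        if self.task == "force_adaptation":
            return self.peak_force_n <= FORCE_LIMIT_N[self.material]
        if self.task == "glass_detection":
            return self.refused_glass
        # Spill recovery: the scene must score measurably cleaner after the attempt.
        return self.cleanliness_delta > 0.0
```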
These are not yet benchmarks we publish leaderboard scores on. They are an internal harness, and we would rather show the harness than fabricate a number for the dek.