
VLA++ in Practice: Fusing Vision, Language, Acoustics & Force into One Action Head

Inside the four-input policy we are building. Why audio cues matter for distinguishing sand from glass shards, and what the architecture looks like before the data is in.

#02 · Dev · 18 min · For: ML / robotics researchers
01 · Motivation

Where Pure Vision Policies Run Out of Information

Vision-Language-Action (VLA) models — Google's RT-2, Physical Intelligence's π0, OpenVLA — have made imitation policies dramatically more general. They map an image and a natural-language instruction directly to a robot action. They are also, in our experience, exactly as good as the pixels they are looking at.

Cabin cleaning breaks that assumption in two specific ways. First, a sticky spill and a wet spill look identical to a camera; the difference is in the force required to release the wiper. Second, debris on a carpeted floor mat — sand, salt, broken glass, cereal — is visually ambiguous under cabin lighting; the difference is in the sound the vacuum makes when it ingests it.

We are not the first to notice this. The argument behind our internal architecture is that bolting force and audio onto a VLA backbone — what we call VLA++ — is the smallest change that recovers the missing information without throwing out the language conditioning that makes these models flexible.

02 · Audio

Acoustic Tokens From a Directional Array

The mic array is four MEMS microphones on the end-effector housing, sampled at 48 kHz. We compute log-mel spectrograms with a 25 ms window and 10 ms hop, then pass them through a small CNN encoder pretrained on AudioSet. The output is a sequence of 64-dim tokens at roughly 100 Hz.
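To make the shapes concrete, here is a minimal sketch of that front end in PyTorch/torchaudio. The CNN is a stand-in for the AudioSet-pretrained encoder (its real architecture and weights are not shown), and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn
import torchaudio

SAMPLE_RATE = 48_000                 # per-channel rate of the 4-mic MEMS array
WIN = int(0.025 * SAMPLE_RATE)       # 25 ms window -> 1200 samples
HOP = int(0.010 * SAMPLE_RATE)       # 10 ms hop    -> ~100 frames / s

class AudioTokenizer(nn.Module):
    """Log-mel front end + small CNN; stands in for the AudioSet-pretrained encoder."""

    def __init__(self, n_mics: int = 4, n_mels: int = 64, token_dim: int = 64):
        super().__init__()
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=SAMPLE_RATE, n_fft=WIN, hop_length=HOP, n_mels=n_mels)
        self.cnn = nn.Sequential(
            nn.Conv2d(n_mics, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),      # pool the mel axis, keep time
        )
        self.proj = nn.Linear(64, token_dim)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, n_mics, samples) raw audio at 48 kHz
        spec = torch.log(self.melspec(waveform) + 1e-6)   # (B, mics, n_mels, T)
        feats = self.cnn(spec).squeeze(2)                 # (B, 64, T)
        return self.proj(feats.transpose(1, 2))           # (B, T, 64) tokens at ~100 Hz

tokens = AudioTokenizer()(torch.randn(1, 4, SAMPLE_RATE))  # 1 s of audio -> ~100 tokens
```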

What we care about is not the audio itself but its change when the tool contacts material. Glass shards through a vacuum nozzle have a distinctive high-frequency click that sand does not. We encode a short rolling window so the policy sees onset events, not steady-state noise.
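One way to bias that rolling window toward onsets rather than steady-state hum is a spectral-flux style feature with a rolling-mean baseline; the sketch below is an illustrative choice, not the exact feature in the pipeline.

```python
import torch
import torch.nn.functional as F

def onset_feature(log_mel: torch.Tensor, window: int = 20) -> torch.Tensor:
    """Onset strength from positive spectral flux over a short rolling window.

    log_mel: (B, n_mels, T) log-mel frames at ~100 Hz; window=20 frames is ~200 ms.
    Steady-state vacuum hum is removed by subtracting a rolling-mean baseline, so
    what survives is contact events (e.g. the high-frequency click of glass shards).
    """
    flux = (log_mel[..., 1:] - log_mel[..., :-1]).clamp(min=0).sum(dim=1)  # (B, T-1)
    flux = F.pad(flux, (1, 0))                                             # align to T
    kernel = torch.ones(1, 1, window, device=flux.device) / window
    baseline = F.conv1d(flux.unsqueeze(1), kernel, padding=window // 2)
    return (flux - baseline.squeeze(1)[..., : flux.shape[-1]]).clamp(min=0)
```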

03 · Force-Torque

Encoding 6-Axis F/T as Tokens

The wrist sensor returns 6 channels (Fx Fy Fz, Tx Ty Tz) at 1 kHz. Feeding raw samples into a transformer is wasteful — most of the signal is in the temporal envelope, not individual samples. We downsample to 100 Hz and encode each channel with a small 1D conv that produces a 32-dim token per timestep.
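A minimal sketch of that tokenizer, assuming average-pool decimation from 1 kHz to 100 Hz and a grouped (per-channel) 1D conv; the final mixing layer and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class FTTokenizer(nn.Module):
    """6-axis wrist F/T at 1 kHz -> 32-dim tokens at 100 Hz via per-channel 1D convs."""

    def __init__(self, n_channels: int = 6, token_dim: int = 32, decimation: int = 10):
        super().__init__()
        self.decimate = nn.AvgPool1d(kernel_size=decimation)  # 1 kHz -> 100 Hz mean
        # groups=n_channels keeps the conv per-channel, as described above
        self.conv = nn.Conv1d(n_channels, n_channels * token_dim,
                              kernel_size=5, padding=2, groups=n_channels)
        self.mix = nn.Linear(n_channels * token_dim, token_dim)

    def forward(self, ft: torch.Tensor) -> torch.Tensor:
        # ft: (B, 6, T_1kHz), e.g. T_1kHz = 1000 for a 1 s window
        x = self.decimate(ft)                    # (B, 6, T_100Hz)
        x = torch.relu(self.conv(x))             # (B, 6*32, T_100Hz)
        return self.mix(x.transpose(1, 2))       # (B, T_100Hz, 32) tokens

tokens = FTTokenizer()(torch.randn(2, 6, 1000))  # -> (2, 100, 32)
```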

We also include two derived channels that turn out to matter: the norm of the force vector and the angle between the force vector and the commanded motion. A wipe that is being resisted produces a clear opposing component; a wipe that is sliding cleanly does not.
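A sketch of those two derived channels, assuming the commanded Cartesian velocity is available in the same frame as the measured force; the function name and frames are illustrative.

```python
import torch

def derived_ft_channels(force: torch.Tensor, commanded_vel: torch.Tensor,
                        eps: float = 1e-8) -> torch.Tensor:
    """Compute |F| and the angle between F and the commanded motion direction.

    force:         (..., 3) measured Fx, Fy, Fz in the tool frame
    commanded_vel: (..., 3) commanded end-effector velocity in the same frame
    Returns (..., 2): [force_norm, angle_rad]. A resisted wipe shows a strong
    component opposing the motion (angle well past 90 degrees); clean sliding does not.
    """
    f_norm = force.norm(dim=-1, keepdim=True)
    cos = (force * commanded_vel).sum(dim=-1, keepdim=True) / (
        f_norm * commanded_vel.norm(dim=-1, keepdim=True) + eps
    )
    angle = torch.acos(cos.clamp(-1.0, 1.0))
    return torch.cat([f_norm, angle], dim=-1)
```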

04 · Fusion

Late Fusion vs. Cross-Attention

There are two honest options for combining four streams: concatenate the per-modality embeddings before the action head (late fusion), or interleave them as separate token streams that cross-attend inside the transformer.
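A schematic of the two options, assuming per-modality projections to a shared width have already been applied. Joint self-attention over the concatenated, modality-tagged streams is used here as one common way to realize cross-modal attention; explicit cross-attention blocks are another. All dimensions are illustrative.

```python
import torch
import torch.nn as nn

D = 256  # shared model width after per-modality projections (illustrative)

class LateFusionHead(nn.Module):
    """Option 1: pool each modality, concatenate, feed a single MLP action head."""

    def __init__(self, n_modalities: int = 4, action_dim: int = 7):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(n_modalities * D, 512), nn.ReLU(),
                                  nn.Linear(512, action_dim))

    def forward(self, streams: list[torch.Tensor]) -> torch.Tensor:
        # streams: per-modality token sequences, each (B, T_i, D)
        pooled = torch.cat([s.mean(dim=1) for s in streams], dim=-1)  # (B, 4*D)
        return self.head(pooled)

class CrossAttentionPolicy(nn.Module):
    """Option 2: tag and interleave all token streams, let attention mix them."""

    def __init__(self, n_layers: int = 4, action_dim: int = 7):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.modality_embed = nn.Embedding(4, D)   # marks which stream a token came from
        self.head = nn.Linear(D, action_dim)

    def forward(self, streams: list[torch.Tensor]) -> torch.Tensor:
        tagged = [s + self.modality_embed.weight[i] for i, s in enumerate(streams)]
        tokens = torch.cat(tagged, dim=1)          # (B, sum(T_i), D)
        return self.head(self.encoder(tokens).mean(dim=1))
```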

Late fusion is simpler and trains faster. Cross-attention is more expressive and historically wins on multimodal benchmarks. Our plan is to ablate both on the same dataset and report the gap. Until that data is in, anyone claiming one decisively beats the other in this setting is guessing.

05 · Evaluation

What We Actually Want to Measure

Aggregate success rate is a misleading number for a cleaning robot. A policy that succeeds 95% of the time but smears jam across the headliner on the other 5% is unshippable. Our eval suite is structured around three bounded tasks where we can score outcomes per attempt: spill recovery (visual cleanliness check before/after), material-conditioned force adaptation (peak force stays below a per-material limit), and glass detection (the policy must call out and refuse to vacuum visible shards).
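As a sketch of what per-attempt scoring means for the force-adaptation task, assuming illustrative per-material limits (not the real spec):

```python
from dataclasses import dataclass

# Illustrative per-material peak-force limits in newtons -- placeholder values.
FORCE_LIMIT_N = {"leather": 15.0, "plastic_trim": 20.0, "carpet": 30.0}

@dataclass
class AttemptResult:
    material: str
    peak_force_n: float
    surface_clean_after: bool

def score_force_adaptation(attempt: AttemptResult) -> bool:
    """Pass only if the surface ends up clean AND peak force stayed under the limit."""
    return (attempt.surface_clean_after
            and attempt.peak_force_n <= FORCE_LIMIT_N[attempt.material])
```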

These are not yet benchmarks we publish leaderboard scores on. They are an internal harness, and we would rather show the harness than fabricate a number for the dek.

Topics: VLA · multimodal · policy learning · imitation learning