Where Pure Vision Policies Run Out of Information
Vision-Language-Action (VLA) models — Google's RT-2, Physical Intelligence's π0, OpenVLA — have made imitation policies dramatically more general. They map an image and a natural-language instruction directly to a robot action. They are also, in our experience, exactly as good as the pixels they are looking at.
Cabin cleaning runs into that limit in two specific ways. First, a sticky spill and a wet spill look identical to a camera; the difference is in the force required to release the wiper. Second, debris on a carpeted floor mat — sand, salt, broken glass, cereal — is visually ambiguous under cabin lighting; the difference is in the sound the vacuum makes when it ingests it.
We are not the first to notice this. The argument behind our internal architecture is that bolting force and audio onto a VLA backbone — what we call VLA++ — is the smallest change that recovers the missing information without throwing out the language conditioning that makes these models flexible.
Acoustic Tokens From a Directional Array
The mic array is four MEMS microphones on the end-effector housing, sampled at 48 kHz. We compute log-mel spectrograms with a 25 ms window and 10 ms hop, then pass them through a small CNN encoder pretrained on AudioSet. The output is a sequence of 64-dim tokens at roughly 100 Hz.
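As a rough sketch of that tokenization path, assuming a torchaudio front end (the real encoder is a CNN pretrained on AudioSet; the class name, layer sizes, and activation choices below are illustrative, not our production code):

```python
import torch
import torchaudio

SAMPLE_RATE = 48_000
WIN = int(0.025 * SAMPLE_RATE)   # 25 ms window -> 1200 samples
HOP = int(0.010 * SAMPLE_RATE)   # 10 ms hop   -> 480 samples, ~100 frames/s


class AudioTokenizer(torch.nn.Module):
    """Maps raw 4-mic audio to one 64-dim token per 10 ms frame."""

    def __init__(self, n_mics: int = 4, n_mels: int = 64, dim: int = 64):
        super().__init__()
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=SAMPLE_RATE, n_fft=WIN, win_length=WIN,
            hop_length=HOP, n_mels=n_mels,
        )
        # Stand-in for the AudioSet-pretrained CNN encoder.
        self.net = torch.nn.Sequential(
            torch.nn.Conv1d(n_mics * n_mels, 128, kernel_size=3, padding=1),
            torch.nn.GELU(),
            torch.nn.Conv1d(128, dim, kernel_size=3, padding=1),
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (mics, samples) from the end-effector array
        spec = torch.log(self.melspec(wav) + 1e-6)   # (mics, n_mels, frames)
        spec = spec.flatten(0, 1).unsqueeze(0)       # (1, mics*n_mels, frames)
        return self.net(spec).transpose(1, 2)        # (1, frames, 64) tokens
```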
What we care about is not the audio itself but its change when the tool contacts material. Glass shards through a vacuum nozzle have a distinctive high-frequency click that sand does not. We encode a short rolling window so the policy sees onset events, not steady-state noise.
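One way to make onsets explicit, sketched below under the assumption of simple mean-subtraction over a short causal window (the 100 ms length and the helper name onset_tokens are hypothetical choices, not our exact preprocessor):

```python
import torch


def onset_tokens(logmel: torch.Tensor, window: int = 10) -> torch.Tensor:
    """Emphasize contact onsets in a (frames, n_mels) log-mel sequence.

    Subtracts a causal rolling mean over the most recent `window` frames
    (~100 ms at a 10 ms hop) so steady-state vacuum noise cancels out and
    transients such as glass clicks stand out.
    """
    padded = torch.nn.functional.pad(logmel, (0, 0, window, 0))  # zero-pad past
    csum = padded.cumsum(dim=0)
    rolling_mean = (csum[window:] - csum[:-window]) / window
    return logmel - rolling_mean
```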
Encoding 6-Axis F/T as Tokens
The wrist sensor returns 6 channels (Fx Fy Fz, Tx Ty Tz) at 1 kHz. Feeding raw samples into a transformer is wasteful — most of the signal is in the temporal envelope, not individual samples. We downsample to 100 Hz and encode each channel with a small 1D conv that produces a 32-dim token per timestep.
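A minimal sketch of that tokenizer, assuming average pooling for the 1 kHz to 100 Hz downsample (the hidden widths and kernel sizes are illustrative):

```python
import torch


class ForceTorqueTokenizer(torch.nn.Module):
    """Average-pool 1 kHz wrist F/T samples down to 100 Hz, then map the
    6 channels (Fx Fy Fz, Tx Ty Tz) to a 32-dim token per 10 ms step."""

    def __init__(self, in_ch: int = 6, dim: int = 32):
        super().__init__()
        self.pool = torch.nn.AvgPool1d(kernel_size=10)   # 1 kHz -> 100 Hz
        self.conv = torch.nn.Sequential(
            torch.nn.Conv1d(in_ch, 64, kernel_size=5, padding=2),
            torch.nn.GELU(),
            torch.nn.Conv1d(64, dim, kernel_size=5, padding=2),
        )

    def forward(self, ft: torch.Tensor) -> torch.Tensor:
        # ft: (batch, 6, samples) raw wrist readings at 1 kHz
        return self.conv(self.pool(ft)).transpose(1, 2)  # (batch, steps, 32)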
We also include two derived channels that turn out to matter: the norm of the force vector and the angle between the force vector and the commanded motion. A wipe that is being resisted produces a clear opposing component; a wipe that is sliding cleanly does not.
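The derived channels are a few lines of geometry. In this sketch the function name, the epsilon guard, and the use of the commanded velocity direction as the motion reference are our assumptions:

```python
import torch


def derived_channels(force: torch.Tensor, commanded_vel: torch.Tensor) -> torch.Tensor:
    """force, commanded_vel: (steps, 3). Returns (steps, 2): force magnitude
    and the angle between measured force and commanded motion. The angle is
    near pi when the surface is resisting the wipe."""
    eps = 1e-8
    norm = force.norm(dim=-1)
    cos = (force * commanded_vel).sum(-1) / (
        norm.clamp(min=eps) * commanded_vel.norm(dim=-1).clamp(min=eps)
    )
    angle = torch.acos(cos.clamp(-1.0, 1.0))
    return torch.stack([norm, angle], dim=-1)
```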
Late Fusion vs. Cross-Attention
There are two honest options for combining the four streams (vision, language, audio, and force/torque): concatenate the per-modality embeddings before the action head (late fusion), or interleave them as separate token streams that cross-attend inside the transformer.
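For concreteness, here is a toy sketch of both options; the module names, widths, and the mean-pooled readouts are placeholders, and the cross-attention variant is approximated by self-attention over the concatenated token streams:

```python
import torch
import torch.nn as nn

D = 256  # shared model width; all sizes here are illustrative


class LateFusionHead(nn.Module):
    """Option A: pool each modality to one vector, concatenate, decode."""

    def __init__(self, n_streams: int = 4, act_dim: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(n_streams * D, 512), nn.GELU(),
                                 nn.Linear(512, act_dim))

    def forward(self, streams: list[torch.Tensor]) -> torch.Tensor:
        # streams: per-modality token sequences, each (batch, tokens_i, D)
        pooled = torch.cat([s.mean(dim=1) for s in streams], dim=-1)
        return self.mlp(pooled)


class CrossAttentionFusion(nn.Module):
    """Option B: keep the streams as tokens and let a shared transformer
    attend across modalities before the action readout."""

    def __init__(self, act_dim: int = 7):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.readout = nn.Linear(D, act_dim)

    def forward(self, streams: list[torch.Tensor]) -> torch.Tensor:
        tokens = torch.cat(streams, dim=1)   # (batch, sum of tokens_i, D)
        return self.readout(self.encoder(tokens).mean(dim=1))
```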
Late fusion is simpler and trains faster. Cross-attention is more expressive and historically wins on multimodal benchmarks. Our plan is to ablate both on the same dataset and report the gap. Until that data is in, anyone claiming one decisively beats the other in this setting is guessing.
What We Actually Want to Measure
Aggregate success rate is a misleading number for a cleaning robot. A policy that succeeds 95% of the time but smears jam across the headliner on the other 5% is unshippable. Our eval suite is structured around three bounded tasks where we can score outcomes per attempt: spill recovery (visual cleanliness check before/after), material-conditioned force adaptation (peak force stays below a per-material limit), and glass detection (the policy must call out and refuse to vacuum visible shards).
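The harness is mostly bookkeeping. A sketch of the per-attempt record, where the force limits, thresholds, and field names are illustrative stand-ins rather than our published criteria:

```python
from dataclasses import dataclass

# Illustrative per-material peak-force limits in newtons.
FORCE_LIMIT_N = {"glass": 5.0, "leather": 8.0, "plastic_trim": 12.0}


@dataclass
class AttemptResult:
    task: str                  # "spill_recovery" | "force_adaptation" | "glass_detection"
    material: str
    peak_force_n: float
    cleanliness_delta: float   # after-minus-before visual cleanliness score
    refused_glass: bool

    def passed(self) -> bool:
        if self.task == "force_adaptation":
            return self.peak_force_n <= FORCE_LIMIT_N[self.material]
        if self.task == "glass_detection":
            return self.refused_glass
        # Spill recovery: the scene must score measurably cleaner after the attempt.
        return self.cleanliness_delta > 0.0
```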
These are not yet benchmarks we publish leaderboard scores on. They are an internal harness, and we would rather show the harness than fabricate a number for the dek.