Why Off-the-Shelf Doesn't Cover Cabins
Public segmentation datasets — ADE20K, COCO-Stuff, Cityscapes — have one or two cabin-relevant classes between them. None distinguish leather from vinyl, or alcantara from cloth, at the granularity a force-controlled wiper needs.
Our taxonomy has 15 classes: leather (real and synthetic), alcantara, woven fabric, vinyl, hard plastic (matte and glossy), wood trim, brushed metal, chrome trim, glass, rubber floor mat, carpet floor mat, headliner fabric, and exposed foam (a damage class).
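For concreteness, the taxonomy above can be written as a simple enum. The names and integer IDs here are illustrative, not the label map we actually ship:

```python
from enum import IntEnum

class CabinMaterial(IntEnum):
    """15-class cabin material taxonomy (illustrative IDs, not the shipped label map)."""
    LEATHER_REAL = 0
    LEATHER_SYNTHETIC = 1
    ALCANTARA = 2
    WOVEN_FABRIC = 3
    VINYL = 4
    PLASTIC_MATTE = 5
    PLASTIC_GLOSSY = 6
    WOOD_TRIM = 7
    BRUSHED_METAL = 8
    CHROME_TRIM = 9
    GLASS = 10
    RUBBER_FLOOR_MAT = 11
    CARPET_FLOOR_MAT = 12
    HEADLINER_FABRIC = 13
    EXPOSED_FOAM = 14  # damage class, not a material we wipe normally
```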
We capture data with a small handheld rig that pairs a polarized RGB camera with a depth sensor. Polarization helps separate dielectric reflections from the underlying material color, which is the single biggest source of label noise on glossy surfaces.
Distilling SAM2 for Edge Inference
SAM2 is a strong teacher: prompt it with points and it returns clean instance masks. It is also too heavy to run at sensor rate on a Jetson Orin alongside everything else the robot needs to compute.
Our approach is standard distillation. We pre-segment cabin imagery with SAM2 to get instance masks, label each instance with a material class via a small classifier head and human review, then train a compact student segmentation network (a MobileViT-class backbone with a lightweight decoder) to predict the 15-class semantic map directly.
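The label-generation half of that pipeline is mechanical enough to sketch. Assuming the teacher's instance masks have already been assigned material classes, they flatten into a dense semantic map that supervises the student with ordinary per-pixel cross-entropy. Function and array names here are ours for illustration, not any real SAM2 API:

```python
import numpy as np

def rasterize_teacher_labels(instance_masks, material_ids, shape, ignore_id=255):
    """Flatten teacher instance masks into a dense semantic label map.

    instance_masks: list of boolean (H, W) arrays from the teacher.
    material_ids:   material class per instance (after classifier head + review).
    Pixels covered by no instance get `ignore_id` and are excluded from the loss.
    """
    labels = np.full(shape, ignore_id, dtype=np.uint8)
    for mask, material in zip(instance_masks, material_ids):
        labels[mask] = material
    return labels

def pixel_cross_entropy(logits, labels, ignore_id=255):
    """Per-pixel cross-entropy between student logits (H, W, C) and teacher labels."""
    valid = labels != ignore_id
    z = logits - logits.max(axis=-1, keepdims=True)       # stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    p_true = probs[valid, labels[valid]]                   # prob of the teacher label
    return float(-np.log(p_true + 1e-9).mean())
```

In practice the loss runs inside the student's training loop; this sketch just shows how teacher masks become the supervision signal.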
The student does not need to match SAM2's mask quality on novel classes; it only needs to match it on the 15 classes we care about. That tradeoff is what makes real-time inference possible.
INT8 Without Killing Glass IoU
Glass is the class that suffers most from naive INT8 quantization, because the model relies on subtle highlight cues that compress poorly into 8-bit activations. The fix is per-channel quantization for the decoder layers and a calibration set that is deliberately over-weighted toward cabins with prominent glass — windshields, sunroofs, and infotainment screens.
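The per-channel half of the fix is easy to demonstrate with a toy symmetric quantizer in plain NumPy (this is not our deployment toolchain, and the weight tensor is fabricated to make the effect visible): when one channel carries small-magnitude values, a single per-tensor scale flattens it, while per-channel scales preserve it.

```python
import numpy as np

def quantize_dequantize(x, scale):
    """Symmetric INT8 fake-quantization: round to int8 grid, then dequantize."""
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

def per_tensor_scale(x):
    return np.abs(x).max() / 127.0

def per_channel_scales(x):
    # One scale per output channel (axis 0), as applied to the decoder layers.
    return np.abs(x).reshape(x.shape[0], -1).max(axis=1) / 127.0

# Toy weight tensor: channel 0 is small-magnitude (like subtle highlight cues),
# channel 1 is large-magnitude and dominates the per-tensor range.
w = np.stack([np.linspace(-0.5, 0.5, 64), np.linspace(-100.0, 100.0, 64)])

deq_tensor = quantize_dequantize(w, per_tensor_scale(w))
deq_channel = quantize_dequantize(w, per_channel_scales(w)[:, None])

# The small channel survives per-channel quantization far better.
assert np.abs(w[0] - deq_channel[0]).max() < np.abs(w[0] - deq_tensor[0]).max()
```

The glass-heavy calibration set addresses the other half of the problem: per-channel scales only help if the calibration statistics actually exercise the activations that glass depends on.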
We accept a small overall mIoU drop in exchange for keeping glass IoU close to FP16. The robot can tolerate confusing two kinds of plastic; it cannot tolerate confusing glass with vinyl.
From Segmentation to Force Setpoint
Segmentation is only useful if downstream control consumes it. The output of the model is sampled at the contact point of the wiper and used as an index into a per-material force lookup table — soft on leather, firm on glass, careful around exposed foam.
When the prediction is uncertain (low max-class probability or rapid class flicker between frames) the controller falls back to a conservative low-force regime and slows the wipe. Better a slow clean than a scratched A-pillar.
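Put together, the contact-point logic is a small amount of code. The class names, force values, and thresholds below are made up for illustration; only the shape of the logic mirrors what the controller does:

```python
# Per-material force setpoints in newtons (illustrative values, not our table).
FORCE_TABLE_N = {
    "leather_real": 2.0,   # soft on leather
    "glass": 6.0,          # firm on glass
    "exposed_foam": 0.5,   # careful around damage
}
FALLBACK_FORCE_N = 1.0     # conservative low-force regime
MIN_CONFIDENCE = 0.8       # max-class probability threshold (assumed value)
MAX_FLICKER = 2            # distinct classes tolerated over recent frames

def force_setpoint(pred_class, max_prob, recent_classes):
    """Pick a wiper force from the segmentation output at the contact point.

    Falls back to the conservative regime when max-class probability is low
    or the class flickers across recent frames.
    """
    flicker = len(set(recent_classes)) > MAX_FLICKER
    if max_prob < MIN_CONFIDENCE or flicker:
        return FALLBACK_FORCE_N
    return FORCE_TABLE_N.get(pred_class, FALLBACK_FORCE_N)
```

The same condition that triggers the fallback force is what slows the wipe; we omit the speed side here for brevity.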
Where the Model Still Gets Fooled
Three failure modes recur. Chrome trim under direct sunlight reads as glass. Sun-bleached leather drifts toward fabric. Transparent floor mats over carpet read as carpet: technically correct for vision, but wrong for the wiper, which now has to negotiate an invisible plastic layer.
We handle these with confidence thresholds and material-pair fallbacks rather than pretending the model is infallible. Honest failure modes are part of the spec.
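One way to implement a material-pair fallback is a table of known confusions that resolves to the lower-force (safer) member of the pair whenever the top-two probability gap is small. The pairs mirror the failure modes above; the force values and margin are illustrative:

```python
# Illustrative per-material forces (N); lower force = safer assumption.
FORCE_N = {"chrome_trim": 3.0, "glass": 6.0, "leather_real": 2.0,
           "woven_fabric": 4.0, "carpet_floor_mat": 5.0, "vinyl": 3.5}

CONFUSION_PAIRS = [              # recurring confusions from the failure modes
    {"chrome_trim", "glass"},            # chrome in sunlight reads as glass
    {"leather_real", "woven_fabric"},    # bleached leather drifts to fabric
    {"carpet_floor_mat", "vinyl"},       # transparent mat over carpet
]
MARGIN = 0.15  # top-2 probability gap below which the argmax is distrusted

def resolve_class(top2_classes, top2_probs):
    """Resolve a known confusion pair to its lower-force member."""
    gap = top2_probs[0] - top2_probs[1]
    if gap < MARGIN and set(top2_classes) in CONFUSION_PAIRS:
        return min(top2_classes, key=FORCE_N.__getitem__)
    return top2_classes[0]
```

So a near-tie between glass and chrome trim resolves to the gentler chrome regime, while a confident glass prediction passes through untouched.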