Self-Supervised Learning for Robot Grasping
Summary. On the DLR–Hand II platform, we fine-tune a Point–JEPA backbone on spatially sequenced point clouds and predict grasp joint angles with a multi-hypothesis head. JEPA pretraining consistently improves top-logit RMSE in low-label regimes and boosts coverage@15°, with parity at 100% labels.
Highlights
- Backbone: Point–JEPA for 3D point clouds; attention-pooled object embedding.
- Head: K=5 multi-hypothesis joints + logits; winner-takes-all (min-over-K) training; top-logit selection at inference.
- Label efficiency: Largest gains at 25% labels; consistent improvements at 1–10%; parity at full supervision.
- Evaluation: Object-level splits; fixed val/test fixtures across label budgets.
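The evaluation protocol in the last highlight can be sketched as follows. This is a minimal illustration, not the project's actual data pipeline: the function names `object_level_split` and `label_budget`, the 10% val/test fractions, and the choice to subsample individual training grasps (rather than whole objects) are all assumptions. It shows two properties: no object appears in more than one split, and the val/test sets stay fixed while only the training labels are subsampled.

```python
import random
from collections import defaultdict


def object_level_split(samples, val_frac=0.1, test_frac=0.1, seed=0):
    """Split (object_id, grasp) pairs so each object lands in exactly one split.

    Hypothetical helper: the fractions and seed are illustrative defaults.
    """
    object_ids = sorted({obj for obj, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(object_ids)

    n_val = max(1, round(val_frac * len(object_ids)))
    n_test = max(1, round(test_frac * len(object_ids)))
    val_ids = set(object_ids[:n_val])
    test_ids = set(object_ids[n_val:n_val + n_test])

    splits = defaultdict(list)
    for obj, grasp in samples:
        if obj in val_ids:
            key = "val"
        elif obj in test_ids:
            key = "test"
        else:
            key = "train"
        splits[key].append((obj, grasp))
    return dict(splits)


def label_budget(train, fraction, seed=0):
    """Keep a fraction of the labelled training samples.

    The same seed yields the same shuffled order, so a smaller budget is
    always a subset of a larger one. Val/test are never touched here.
    """
    rng = random.Random(seed)
    order = list(range(len(train)))
    rng.shuffle(order)
    keep = max(1, round(fraction * len(train)))
    return [train[i] for i in sorted(order[:keep])]
```

Calling `label_budget(splits["train"], 0.01)`, `0.10`, `0.25`, and `1.0` with one shared seed would give nested training sets that are all scored against the same `splits["val"]` and `splits["test"]`.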
Results (validation)
- Top-logit RMSE (↓), Scratch → JEPA (relative gain):
  - 1%: 0.363 → 0.335 (+7.7%)
  - 10%: 0.335 → 0.303 (+9.6%)
  - 25%: 0.332 → 0.246 (+25.9%)
  - 100%: 0.235 → 0.234 (≈ parity)
- Selector fidelity: Smaller selection gap with JEPA at 1–10% (JEPA vs. Scratch: 0.142 vs 0.165 at 1%; 0.157 vs 0.176 at 10%).
- Coverage@15° (↑): Higher with JEPA in low-label settings (JEPA vs. Scratch: 0.955 vs 0.938 at 10%; 0.866 vs 0.861 at 1%).
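The relative gains quoted for top-logit RMSE can be reproduced from the Scratch and JEPA values alone. The short check below assumes the gain is defined as the RMSE reduction relative to the Scratch baseline, which is consistent with the listed percentages; the function name `relative_gain` is illustrative.

```python
def relative_gain(scratch_rmse, jepa_rmse):
    """Percentage reduction in RMSE relative to the Scratch baseline."""
    return 100.0 * (scratch_rmse - jepa_rmse) / scratch_rmse


# Validation top-logit RMSE per label budget: (Scratch, JEPA).
results = {
    "1%": (0.363, 0.335),
    "10%": (0.335, 0.303),
    "25%": (0.332, 0.246),
    "100%": (0.235, 0.234),
}

for budget, (scratch, jepa) in results.items():
    gain = relative_gain(scratch, jepa)
    print(f"{budget}: {scratch:.3f} -> {jepa:.3f} ({gain:+.1f}%)")
```

Under this definition the 100% budget works out to roughly +0.4%, which matches the "≈ parity" reading.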
Method at a Glance
- Tokenize meshes into point-cloud patches with positional encodings.
- Pretrain JEPA (context vs. target encoders; EMA target).
- Pool contextualized patch features → global object embedding.
- Predict K joint-angle hypotheses + logits given object embedding + wrist pose.
- Train with a winner-takes-all (min-over-K) regression loss plus cross-entropy on the logits, using the winning hypothesis as the target; at inference, select the top-logit hypothesis.
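The last two steps can be illustrated for a single training sample. This is a pure-Python sketch under stated assumptions, not the project's training code: it assumes mean squared error as the per-hypothesis regression loss, an unweighted sum of the regression and cross-entropy terms (`ce_weight=1.0`), and the hypothetical names `wta_loss` and `select_top_logit`. A real implementation would operate on batched tensors in a deep-learning framework.

```python
import math


def wta_loss(hypotheses, logits, target, ce_weight=1.0):
    """Winner-takes-all loss for one sample.

    hypotheses: K lists of predicted joint angles
    logits:     K scores, one per hypothesis
    target:     ground-truth joint angles
    Returns (loss, index of the winning hypothesis).
    """
    # Mean squared error of each hypothesis against the ground truth.
    errors = [
        sum((h - t) ** 2 for h, t in zip(hyp, target)) / len(target)
        for hyp in hypotheses
    ]
    # Min-over-K: only the closest hypothesis receives the regression loss.
    winner = min(range(len(errors)), key=errors.__getitem__)

    # Cross-entropy with the winner as the class label,
    # computed as log-sum-exp(logits) - logits[winner] for stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    cross_entropy = log_z - logits[winner]

    return errors[winner] + ce_weight * cross_entropy, winner


def select_top_logit(hypotheses, logits):
    """Inference: return the hypothesis with the highest logit."""
    best = max(range(len(logits)), key=logits.__getitem__)
    return hypotheses[best]
```

Because only the winner is regressed, the remaining hypotheses are free to cover other valid grasps, while the cross-entropy term teaches the logits to point at whichever hypothesis is closest. At test time no ground truth is available, so the top logit stands in for the min-over-K choice.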
Status: Aiming for workshop submission (in progress).
