Self-Supervised Learning for Robot Grasping

Published:

Resources:

Summary. On the DLR Hand II platform, we fine-tune a Point-JEPA backbone on spatially sequenced point clouds and predict grasp joint angles with a multi-hypothesis head. JEPA pretraining consistently improves top-logit RMSE in low-label regimes and boosts coverage@15°, with parity at 100% labels.

Highlights

  • Backbone: Point-JEPA for 3D point clouds; attention-pooled object embedding.
  • Head: K=5 joint-angle hypotheses with per-hypothesis selection logits; winner-takes-all (min-over-K) training; top-logit selection at inference (see the head sketch after this list).
  • Label efficiency: Largest gains at 25% labels; consistent improvements at 1–10%; parity at full supervision.
  • Evaluation: Object-level splits; fixed val/test fixtures across label budgets.
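
A minimal PyTorch sketch of such a multi-hypothesis head. The module name, trunk width, 7-D wrist pose, and 12 joint angles for DLR Hand II are illustrative assumptions, not the project's exact implementation:

```python
import torch
import torch.nn as nn

class MultiHypothesisGraspHead(nn.Module):
    """K joint-angle hypotheses plus one selection logit per hypothesis."""

    def __init__(self, embed_dim=384, wrist_dim=7, num_joints=12, k=5):
        # embed_dim, wrist_dim, num_joints, and the trunk width are assumptions.
        super().__init__()
        self.k, self.num_joints = k, num_joints
        self.trunk = nn.Sequential(
            nn.Linear(embed_dim + wrist_dim, 512), nn.GELU(),
            nn.Linear(512, 512), nn.GELU(),
        )
        self.joint_head = nn.Linear(512, k * num_joints)  # K hypotheses
        self.logit_head = nn.Linear(512, k)               # selector logits

    def forward(self, obj_embedding, wrist_pose):
        # obj_embedding: (B, embed_dim) attention-pooled object feature
        # wrist_pose:    (B, wrist_dim)
        h = self.trunk(torch.cat([obj_embedding, wrist_pose], dim=-1))
        hyps = self.joint_head(h).view(-1, self.k, self.num_joints)
        logits = self.logit_head(h)  # argmax at inference: top-logit pick
        return hyps, logits
```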

Results (validation)

  • Top-logit RMSE (↓), Scratch → JEPA (relative gain):
    • 1%: 0.363 → 0.335 (+7.7%)
    • 10%: 0.335 → 0.303 (+9.6%)
    • 25%: 0.332 → 0.246 (+25.9%)
    • 100%: 0.235 → 0.234 (≈ parity)
  • Selector fidelity: Smaller selection gap with JEPA at 1–10% (e.g., 0.142 vs 0.165 at 1%; 0.157 vs 0.176 at 10%).
  • Coverage@15° (↑): Higher with JEPA in low-label settings (e.g., 0.955 vs 0.938 at 10%; 0.866 vs 0.861 at 1%).
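
The selector-fidelity and coverage metrics above can be computed from model outputs along these lines. The definitions here are assumed for illustration: selection gap as top-logit RMSE minus oracle (best-of-K) RMSE, coverage@15° as the fraction of grasps where at least one hypothesis keeps every joint within 15°, and per-sample RMSE averaged over joints and then over the dataset:

```python
import numpy as np

def grasp_metrics(hyps, logits, target, thresh_deg=15.0):
    """hyps: (N, K, J) joint angles in radians, logits: (N, K),
    target: (N, J). Returns top-logit RMSE, selection gap, coverage."""
    err = hyps - target[:, None, :]                   # (N, K, J)
    rmse = np.sqrt((err ** 2).mean(axis=-1))          # per-hypothesis (N, K)
    top = logits.argmax(axis=-1)                      # selector's pick (N,)
    top_rmse = rmse[np.arange(len(top)), top].mean()
    oracle_rmse = rmse.min(axis=-1).mean()            # best-of-K lower bound
    # Covered if some hypothesis keeps every joint within the threshold.
    covered = (np.abs(err).max(axis=-1) <= np.deg2rad(thresh_deg)).any(axis=-1)
    return {
        "top_logit_rmse": top_rmse,
        "selection_gap": top_rmse - oracle_rmse,      # selector fidelity
        "coverage@15deg": covered.mean(),
    }
```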

Method at a Glance

  1. Tokenize meshes → point-cloud patches; positional encodings.
  2. Pretrain with the JEPA objective: a context encoder predicts target-encoder representations; the target encoder is an EMA copy (see the EMA sketch after this list).
  3. Pool contextualized patch features → global object embedding.
  4. Predict K joint-angle hypotheses + logits given object embedding + wrist pose.
  5. Train with a WTA (min-over-K) regression loss plus cross-entropy on the winning hypothesis; infer with the top-logit hypothesis (loss sketch below).
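
Step 2's EMA target is standard JEPA machinery. A minimal sketch, assuming PyTorch encoders and a momentum of 0.996 (both illustrative):

```python
import copy
import torch

def make_target_encoder(context_encoder):
    """The target encoder starts as a frozen copy of the context encoder."""
    target = copy.deepcopy(context_encoder)
    for p in target.parameters():
        p.requires_grad_(False)  # target receives no gradients
    return target

@torch.no_grad()
def ema_update(context_encoder, target_encoder, momentum=0.996):
    """Call after each optimizer step: target weights track an
    exponential moving average of the context-encoder weights."""
    for p_c, p_t in zip(context_encoder.parameters(),
                        target_encoder.parameters()):
        p_t.mul_(momentum).add_(p_c, alpha=1.0 - momentum)
```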
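
And a sketch of the step-5 objective, assuming per-hypothesis MSE on joint angles and a unit weight on the selector cross-entropy (both assumptions):

```python
import torch
import torch.nn.functional as F

def wta_loss(hyps, logits, target, ce_weight=1.0):
    """hyps: (B, K, J), logits: (B, K), target: (B, J)."""
    per_hyp = ((hyps - target[:, None, :]) ** 2).mean(dim=-1)  # (B, K) MSE
    winner = per_hyp.argmin(dim=-1)                            # min-over-K
    # Regression gradient flows only through the winning hypothesis.
    reg = per_hyp.gather(1, winner[:, None]).squeeze(1).mean()
    # Cross-entropy teaches the selector to pick the winner, so top-logit
    # selection at inference approximates the oracle choice.
    sel = F.cross_entropy(logits, winner)
    return reg + ce_weight * sel
```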

Status: Aiming for workshop submission (in progress).