Vision-Language-Action Models for Robotic Manipulation with SoArm-101
End-to-end imitation learning with the LeRobot framework
Keywords: Python, LeRobot, Vision-Language-Action, Deep Learning, Robot Manipulation, SoArm-101, Teleoperation, Imitation Learning
| Dataset on Hugging Face Hub | Trained Model on Hugging Face Hub |
Description
This project explores the application of state-of-the-art Vision-Language-Action (VLA) models for robotic manipulation tasks using the SoArm-101 robot arm. VLA models represent a new paradigm in robot learning that combines visual perception, natural language understanding, and action prediction to enable robots to perform complex manipulation tasks through learned behaviors.
The project successfully established a complete pipeline from teleoperation and data collection through model training and real-world deployment, providing valuable insights into the practical challenges of implementing VLA systems on affordable hardware.
What are Vision-Language-Action Models?
Vision-Language-Action (VLA) models are a class of neural networks that learn to map visual observations and language instructions directly to robot actions. Unlike traditional robotic systems that rely on hand-engineered perception and control pipelines, VLA models learn end-to-end policies from demonstration data, enabling more flexible and generalizable robot behaviors.
These models typically:
- Process camera images to understand the scene
- Accept natural language commands to specify task goals
- Output low-level robot control commands (joint positions, velocities, etc.)
The key insight is that by learning from human demonstrations, robots can acquire complex manipulation skills without explicit programming of every motion primitive.
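As a concrete illustration, the sketch below shows the observation-to-action mapping a VLA policy implements. The class and field names are hypothetical placeholders, not the interface of any particular library.

```python
# Minimal sketch of the VLA observation -> action interface (names are hypothetical).
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    image: np.ndarray        # H x W x 3 RGB frame from the camera
    joint_state: np.ndarray  # 6D joint positions of the arm
    instruction: str         # optional natural-language task description

class VLAPolicy:
    def predict(self, obs: Observation) -> np.ndarray:
        """Map one observation to a 6D target joint position.

        A real policy (ACT, Diffusion Policy, pi0) replaces this stub with a
        learned neural-network forward pass trained on demonstrations.
        """
        raise NotImplementedError
```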
Hardware Setup
The project uses the SoArm-101 robot arm, a compact and accessible robotic manipulator suitable for tabletop manipulation tasks. The system is integrated with LeRobot, an open-source framework designed to facilitate robot learning research.
```mermaid
graph TB
subgraph Workspace["🏠 Physical Workspace"]
subgraph Arms["Dual Arm Setup"]
LEADER["👤 Leader Arm<br/>Torque Disabled<br/>Human Guidance"]
FOLLOWER["🤖 Follower Arm<br/>Torque Enabled<br/>Mirrors Leader"]
end
subgraph Sensors["Perception"]
CAM["📷 USB Camera<br/>Logitech C920<br/>640×480 @ 30Hz"]
end
subgraph Objects["Task Objects"]
CUBE["🟦 Target Objects<br/>Small Cubes"]
end
end
subgraph Computer["💻 Host Computer"]
subgraph Ports["USB Ports"]
USB1["🔌 /dev/ttyACM0<br/>Leader"]
USB2["🔌 /dev/ttyACM1<br/>Follower"]
USB3["🔌 /dev/video0<br/>Camera"]
end
subgraph Software["Software Stack"]
PYTHON["🐍 Python 3.10+"]
LEROBOT["🤗 LeRobot"]
OPENCV["📹 OpenCV"]
TORCH["🔥 PyTorch"]
end
end
LEADER ---|"Serial 1Mbps"| USB1
FOLLOWER ---|"Serial 1Mbps"| USB2
CAM ---|"USB 2.0"| USB3
USB1 --> LEROBOT
USB2 --> LEROBOT
USB3 --> OPENCV
OPENCV --> LEROBOT
LEROBOT --> TORCH
style LEADER fill:#e8f5e9
style FOLLOWER fill:#e1f5ff
style CAM fill:#fff3cd
style LEROBOT fill:#ffeaa7
style TORCH fill:#ffcdd2
```
Key Components:
- SO101 Leader Arm: Manual control arm for teleoperation demonstrations (torque disabled for free movement)
- SO101 Follower Arm: 5-DOF robotic arm with gripper (Feetech servos, torque enabled)
- Camera Pipeline: Supports both USB webcams and Intel RealSense cameras via MJPEG streaming servers (640×480 @ 30 FPS); a minimal capture sketch follows this list
- LeRobot Framework: Provides standardized interfaces for data collection, training, and deployment
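As an illustration of how a consumer reads such a stream, the snippet below opens an MJPEG endpoint with OpenCV. The URL and port are assumptions for the sketch, not the project's actual camera-server address.

```python
# Read frames from an MJPEG streaming server with OpenCV (URL is a placeholder).
import cv2

cap = cv2.VideoCapture("http://127.0.0.1:8080/stream.mjpg")
while True:
    ok, frame = cap.read()                    # frame is a 480x640x3 BGR array on success
    if not ok:
        break
    cv2.imshow("camera", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):     # press q to stop viewing
        break
cap.release()
cv2.destroyAllWindows()
```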
Robot Joint Configuration
The SO101 arm features 6 controllable joints:
| Joint | Name | Type | Description |
|---|---|---|---|
| 1 | shoulder_pan | Revolute | Base rotation (yaw) |
| 2 | shoulder_lift | Revolute | Shoulder elevation (pitch) |
| 3 | elbow_flex | Revolute | Elbow bend (pitch) |
| 4 | wrist_flex | Revolute | Wrist pitch |
| 5 | wrist_roll | Revolute | Wrist rotation (roll) |
| 6 | gripper | Prismatic | Gripper open/close |
Observation Space: 6D joint positions (radians for revolute joints, meters for the prismatic gripper)
Action Space: 6D target joint positions (same format)
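A minimal sketch of one timestep under this convention is shown below; the dictionary keys are illustrative rather than the exact LeRobot dataset field names.

```python
# One timestep of the 6-DOF observation/action convention described above.
import numpy as np

observation = {
    # shoulder_pan, shoulder_lift, elbow_flex, wrist_flex, wrist_roll in radians,
    # followed by the gripper opening in meters
    "joint_positions": np.array([0.12, -0.45, 0.80, 0.30, -0.10, 0.02]),
}
action = {
    # Target positions in the same order and units as the observation
    "target_positions": np.array([0.15, -0.40, 0.78, 0.32, -0.10, 0.00]),
}
```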
System Architecture
End-to-End Pipeline Overview
The complete VLA system spans data collection, cloud-based training, and real-world deployment. This architecture enables a seamless workflow from human demonstrations to autonomous robot operation.
```mermaid
flowchart TB
subgraph Collection["🎮 Data Collection Phase"]
direction TB
LEADER["👤 Human Operator<br/>Leader Arm"]
FOLLOWER["🤖 Follower Arm<br/>SO101"]
CAMERA["📷 USB Camera<br/>640x480"]
TELNET["💻 Telnet Control<br/>Start/Stop/Abort"]
DATASET["📦 LeRobot Dataset<br/>Parquet + Videos"]
LEADER -->|"Position Commands"| FOLLOWER
CAMERA -->|"RGB Frames"| DATASET
FOLLOWER -->|"Joint States"| DATASET
TELNET -->|"Recording Control"| DATASET
end
subgraph Cloud["☁️ Cloud Services"]
direction TB
HF["🤗 Hugging Face Hub<br/>Datasets & Models"]
WANDB["📊 Weights & Biases<br/>Experiment Tracking"]
end
subgraph Training["🧠 Training Phase"]
direction TB
GPU["⚡ GPU Server<br/>RTX 3080/4090"]
subgraph Models["Model Architectures"]
ACT["🔷 ACT<br/>Action Chunking<br/>Transformer"]
DIFF["🔶 Diffusion Policy<br/>Denoising<br/>Generation"]
PI0["🔴 π0<br/>Vision-Language-Action"]
end
GPU --> Models
end
subgraph Deployment["🚀 Deployment Phase"]
direction TB
JETSON["📟 Edge Device<br/>Jetson Nano"]
ROBOT["🦾 Robot Arm<br/>Autonomous"]
SERVER["🖥️ Inference Server<br/>TCP/IP"]
JETSON <-->|"Observations/Actions"| SERVER
JETSON --> ROBOT
end
Collection -->|"Push Dataset"| HF
HF -->|"Load Dataset"| Training
Training -->|"Push Model"| HF
Training -->|"Log Metrics"| WANDB
HF -->|"Pull Model"| Deployment
style Collection fill:#e8f5e9
style Cloud fill:#fff3cd
style Training fill:#e3f2fd
style Deployment fill:#fce4ec
```
Technology Stack
| Layer | Technology | Purpose |
|---|---|---|
| Framework | LeRobot | Robot learning framework |
| Deep Learning | PyTorch | Neural network training |
| Vision | OpenCV | Image capture & processing |
| Hardware | Feetech STS3215 | Servo motors |
| Dataset | Hugging Face Hub | Dataset & model hosting |
| Tracking | Weights & Biases | Experiment logging |
| Edge | NVIDIA Jetson Nano | Edge deployment |
Telnet-Based Data Collection Control
One of the key architectural decisions was implementing a telnet-based command interface for controlling the data collection process. This design enables operators to control recording sessions remotely without interrupting the teleoperation workflow.
```mermaid
graph LR
subgraph Control["Control Interface"]
TELNET["Telnet Client<br/>localhost:1234"]
end
subgraph Threads["Multi-Threaded System"]
CMD["Command Server<br/>Thread"]
IMG["Image Capture<br/>Thread"]
MAIN["Main Loop<br/>Thread"]
end
subgraph State["Shared State"]
REC["Recording Flag"]
FRAME["Latest Frame"]
end
TELNET -->|Commands| CMD
CMD -->|Toggle| REC
IMG -->|Update| FRAME
REC --> MAIN
FRAME --> MAIN
style TELNET fill:#e1f5ff
style REC fill:#fff3cd
```
Command Interface:
- `s` - Start/stop recording episodes
- `a` - Abort current episode (discard frames without saving)
- `q` - Quit and save dataset

A minimal command-server sketch follows the design notes below.
Why Telnet? The choice of a text-based telnet protocol over custom software was deliberate:
- Cross-platform compatibility: Any device with a terminal can control the system
- Simplicity: No client software installation required
- Reliability: Standard TCP sockets with well-understood behavior
- Debuggability: Human-readable commands for easy troubleshooting
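The sketch below shows a minimal server implementing the `s`/`a`/`q` commands listed above, assuming port 1234 as in the diagram; the project's actual server structure may differ.

```python
# Telnet-style command server sketch (port and state handling are assumptions).
import socket
import threading

recording = False          # toggled by 's'
abort_requested = False    # set by 'a'
state_lock = threading.Lock()

def command_server(host="0.0.0.0", port=1234):
    """Accept a telnet client and update shared flags on s/a/q commands."""
    global recording, abort_requested
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(1)
    conn, _ = srv.accept()
    with conn:
        while True:
            raw = conn.recv(1)
            if not raw:                         # client disconnected
                break
            cmd = raw.decode(errors="ignore").lower()
            if cmd in ("\r", "\n", " "):
                continue                        # ignore telnet line endings
            with state_lock:
                if cmd == "s":                  # start/stop recording
                    recording = not recording
                elif cmd == "a":                # abort current episode
                    abort_requested = True
                elif cmd == "q":                # quit and save dataset
                    break

threading.Thread(target=command_server, daemon=True).start()
```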
Multi-Threaded Architecture
The data collection system employs three concurrent threads to ensure smooth operation:
- Camera Server Thread: MJPEG streaming server (Webcam or RealSense) that serves frames independently at 30 FPS, decoupled from the control loop to prevent frame drops during servo communication
- Command Server Thread: Listens for telnet connections and processes recording commands with thread-safe state updates
- Main Teleoperation Loop: Runs at 30 Hz, reading leader arm positions and transmitting to the follower while recording synchronized observations
Thread synchronization uses Python locks to protect shared state (recording flag and latest image buffer), preventing race conditions between command processing and data capture.
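A sketch of that lock-protected shared state between the capture thread and the main loop, with illustrative names and a placeholder stream URL:

```python
# Lock-protected frame buffer shared between the capture thread and the main loop.
import threading
import time
import cv2

frame_lock = threading.Lock()
latest_frame = None          # most recent frame, written only by the capture thread

def capture_loop(url):
    global latest_frame
    cap = cv2.VideoCapture(url)
    while True:
        ok, frame = cap.read()
        if not ok:
            time.sleep(0.01)
            continue
        with frame_lock:      # short critical section: swap in the newest frame
            latest_frame = frame

def get_latest_frame():
    with frame_lock:          # the main loop copies under the lock, then works outside it
        return None if latest_frame is None else latest_frame.copy()

threading.Thread(target=capture_loop,
                 args=("http://127.0.0.1:8080/stream.mjpg",),
                 daemon=True).start()
```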
Client-Server Inference Architecture
For deployment, the system supports a distributed architecture that separates GPU inference from robot control:
```mermaid
flowchart LR
subgraph Robot["🤖 Robot Client (Jetson Nano)"]
direction TB
ARM["🦾 SO101 Arm<br/>6-DOF Control"]
CAM["📷 Camera<br/>Frame Capture"]
CLIENT["📡 TCP Client<br/>Async I/O"]
subgraph EdgeStack["Edge Software"]
NUMPY["NumPy"]
SERIAL["pyserial"]
end
end
subgraph Network["🌐 Network"]
TCP["TCP/IP<br/>Port 8000<br/>~10ms RTT"]
end
subgraph GPU["⚡ GPU Server (RTX 3080/4090)"]
direction TB
SERVER["🖥️ TCP Server<br/>Multi-threaded"]
subgraph Inference["Inference Pipeline"]
PREPROCESS["📥 Preprocessing<br/>Normalize & Resize"]
MODEL["🧠 VLA Model<br/>ACT / Diffusion / π0"]
POSTPROCESS["📤 Postprocessing<br/>Action Scaling"]
end
subgraph GPUStack["GPU Software"]
TORCH2["🔥 PyTorch"]
CUDA["CUDA 11.8+"]
HF2["🤗 Transformers"]
end
end
CAM -->|"RGB Frame"| CLIENT
ARM -->|"Joint State"| CLIENT
CLIENT <-->|"JSON Protocol"| TCP
TCP <-->|"Observations ➡️<br/>⬅️ Actions"| SERVER
SERVER --> PREPROCESS
PREPROCESS --> MODEL
MODEL --> POSTPROCESS
POSTPROCESS --> SERVER
CLIENT -->|"Target Position"| ARM
style Robot fill:#e8f5e9
style GPU fill:#e3f2fd
style Network fill:#fff3cd
style MODEL fill:#ffcdd2
```
Protocol Specification:
| Message Type | Direction | Format | Example |
|---|---|---|---|
| Observation | Client → Server | `{"image": base64, "state": [6 floats]}` | Joint positions + RGB |
| Action | Server → Client | `{"action": [6 floats]}` | Target joint positions |
| Heartbeat | Bidirectional | `{"ping": timestamp}` | Connection health |
This architecture enables running computationally expensive neural network inference on a powerful GPU server while the robot operates on a resource-constrained edge device like NVIDIA Jetson.
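A minimal sketch of the client side of this exchange is shown below, assuming newline-delimited JSON framing; the field names follow the table above, while the framing and server address are illustrative assumptions.

```python
# Client-side sketch: send one observation, receive one action (newline-delimited JSON).
import base64
import json
import socket

import cv2
import numpy as np

def query_action(sock, image, state):
    """Send {"image": base64 JPEG, "state": [6 floats]} and return the 6-float action."""
    _, jpeg = cv2.imencode(".jpg", image)                 # JPEG keeps the message small
    msg = {
        "image": base64.b64encode(jpeg.tobytes()).decode("ascii"),
        "state": list(map(float, state)),
    }
    sock.sendall(json.dumps(msg).encode() + b"\n")
    reply = b""
    while not reply.endswith(b"\n"):                      # read until the newline terminator
        reply += sock.recv(4096)
    return json.loads(reply)["action"]                    # 6 target joint positions

# Server address is a placeholder for the GPU machine on the local network.
sock = socket.create_connection(("192.168.1.50", 8000))
action = query_action(sock, np.zeros((480, 640, 3), dtype=np.uint8), [0.0] * 6)
```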
Model Architectures
The following diagram illustrates the internal architecture of each VLA model used in this project:
```mermaid
flowchart LR
subgraph Input["📥 Input"]
IMG["🖼️ Camera Image<br/>640×480 RGB"]
STATE["📊 Joint State<br/>6D Position"]
LANG["💬 Language<br/>(Optional)"]
end
subgraph ACT_Model["🔷 ACT Architecture"]
direction TB
RESNET["ResNet-18<br/>Vision Encoder"]
POS_EMB["Positional<br/>Embedding"]
TRANSFORMER["Transformer<br/>Encoder-Decoder"]
VAE["VAE Latent<br/>Space"]
CHUNK["Action Chunk<br/>100 Steps"]
end
subgraph Diffusion_Model["🔶 Diffusion Policy"]
direction TB
UNET["U-Net<br/>Denoiser"]
NOISE["Gaussian<br/>Noise"]
DDPM["DDPM<br/>100 Steps"]
DENOISE["Iterative<br/>Denoising"]
end
subgraph Pi0_Model["🔴 π0 Architecture"]
direction TB
PALIGEMMA["PaliGemma<br/>VLM Backbone"]
EXPERT["Gemma Expert<br/>Action Head"]
FLOW["Flow Matching<br/>Denoising"]
end
subgraph Output["📤 Output"]
ACTIONS["🎯 Action Sequence<br/>Joint Targets"]
end
IMG --> RESNET
STATE --> POS_EMB
RESNET --> TRANSFORMER
POS_EMB --> TRANSFORMER
TRANSFORMER --> VAE
VAE --> CHUNK
CHUNK --> ACTIONS
IMG --> UNET
STATE --> UNET
NOISE --> DDPM
DDPM --> DENOISE
UNET --> DENOISE
DENOISE --> ACTIONS
IMG --> PALIGEMMA
LANG --> PALIGEMMA
PALIGEMMA --> EXPERT
EXPERT --> FLOW
FLOW --> ACTIONS
style ACT_Model fill:#e3f2fd
style Diffusion_Model fill:#fff3cd
style Pi0_Model fill:#ffebee
style Input fill:#e8f5e9
style Output fill:#fce4ec
```
ACT (Action Chunking Transformer)
A transformer-based model that predicts sequences of future actions (action chunks) rather than single actions, enabling more coherent and long-horizon behaviors.
Key Design Choices:
- Chunk Size of 100: Predicts 100 future actions at once, reducing the frequency of policy queries and enabling smoother trajectories
- VAE Training: Uses KL divergence loss for latent space regularization, helping the model learn a compact representation of action distributions
- ResNet-18 Vision Backbone: Efficient visual feature extraction balancing accuracy and inference speed
The action chunking approach proved particularly effective for manipulation tasks where temporal coherence matters. Rather than predicting one action at a time (which can lead to jerky motion), predicting a sequence of actions allows the robot to execute smooth, purposeful movements.
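A sketch of how chunked execution looks at inference time, assuming a hypothetical `policy.predict_chunk` call that returns the next 100 actions (this is not the exact LeRobot API):

```python
# Execute-in-chunks inference loop: query the policy once per chunk, then replay it.
CHUNK_SIZE = 100      # future actions predicted per policy query
MAX_STEPS = 1500      # ~50 s at 30 Hz

def run_episode(policy, robot, camera):
    step = 0
    while step < MAX_STEPS:
        obs = {"image": camera.read(), "state": robot.joint_positions()}
        chunk = policy.predict_chunk(obs)       # shape: (CHUNK_SIZE, 6)
        for target in chunk:                    # replay the chunk open-loop
            robot.move_to(target)
            step += 1
            if step >= MAX_STEPS:
                break
```

In practice ACT can also blend overlapping chunks (temporal ensembling), but the basic pattern of querying once per chunk is what reduces policy-query frequency and smooths the motion.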
Diffusion Policy
A denoising diffusion-based policy that learns to generate robot actions through iterative refinement, enabling smooth and multimodal action distributions.
Key Design Choices:
- Iterative Denoising: Generates actions by progressively refining random noise, allowing the model to represent complex, multimodal action distributions
- Horizon of 16 Steps: Balances prediction accuracy with computational efficiency
- 100 DDPM Inference Steps: Provides high-quality action generation at the cost of increased inference time
The diffusion approach excels at handling ambiguous situations where multiple valid actions exist. For example, when grasping an object, there may be several valid approach angles - diffusion models can represent this uncertainty naturally.
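The heart of the method is a reverse-diffusion loop over the action sequence; the simplified sketch below shows its shape, with the noise schedule, conditioning, and network details of the real Diffusion Policy omitted.

```python
# Simplified reverse-diffusion sampling of an action sequence (schedule details omitted).
import torch

def sample_actions(denoiser, obs_embedding, horizon=16, action_dim=6, num_steps=100):
    # Start from pure Gaussian noise over the whole action horizon.
    actions = torch.randn(1, horizon, action_dim)
    for t in reversed(range(num_steps)):
        # The network predicts the noise present at step t, conditioned on the observation.
        predicted_noise = denoiser(actions, t, obs_embedding)
        # Remove a fraction of that noise; a real DDPM uses schedule-dependent
        # coefficients and re-injects scaled noise for t > 0.
        actions = actions - predicted_noise / num_steps
    return actions
```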
Pi-Zero (π0)
A Vision-Language-Action model that leverages large-scale pre-trained vision-language models for robot control. It uses a PaliGemma-based vision-language backbone combined with a specialized action expert.
Key Design Choices:
- PaliGemma Backbone: Uses pre-trained vision-language features (Gemma-2B variant) to understand complex scenes and natural language instructions.
- Action Expert: A dedicated Gemma-based transformer (Gemma-300M) that specializes in mapping latent features to precise robot actions.
- Flow Matching: Employs flow matching for high-fidelity action generation, providing a more efficient alternative to traditional diffusion.
- Action Chunking: Predicts action sequences (chunk size of 50) to ensure temporal coherence in robot movements.
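Flow matching replaces the stochastic denoising chain with integration of a learned velocity field from noise to actions; a simplified Euler-integration sketch (network and conditioning details omitted) is shown below.

```python
# Simplified flow-matching sampler: integrate a learned velocity field from noise to actions.
import torch

def sample_actions_flow(velocity_net, obs_embedding, chunk_size=50, action_dim=6, num_steps=10):
    actions = torch.randn(1, chunk_size, action_dim)    # start from Gaussian noise at t = 0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((1,), i * dt)
        # The network predicts the instantaneous velocity toward the action distribution.
        velocity = velocity_net(actions, t, obs_embedding)
        actions = actions + velocity * dt               # Euler integration step
    return actions
```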
Development Journey & Challenges
Hardware Integration
Calibration Complexity: The SO101 arm requires careful calibration to establish joint zero positions. Initial attempts without proper calibration led to unpredictable movements and position drift. The solution involved implementing a structured calibration routine that records reference positions and stores them in a standardized format.
Serial Port Management: With multiple USB devices (leader arm, follower arm, camera), consistent device naming became essential. Linux udev rules were implemented to assign predictable device paths (/dev/ttyACM0, /dev/ttyACM1) based on device attributes rather than connection order.
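For illustration, a udev rule of the following shape pins each arm to a stable name; the vendor ID and serial numbers are placeholders that must be read from the real devices (for example with `udevadm info`), and symlinks are one common way to get order-independent paths.

```
# /etc/udev/rules.d/99-soarm.rules (illustrative; IDs and serials are placeholders)
SUBSYSTEM=="tty", ATTRS{idVendor}=="1a86", ATTRS{serial}=="LEADER_SERIAL",   SYMLINK+="soarm_leader"
SUBSYSTEM=="tty", ATTRS{idVendor}=="1a86", ATTRS{serial}=="FOLLOWER_SERIAL", SYMLINK+="soarm_follower"
```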
Timing Synchronization: Achieving smooth leader-follower tracking required careful tuning of the control loop frequency. Too slow (below 20 Hz) resulted in jerky following behavior; too fast (above 30 Hz) overwhelmed the servo communication bandwidth. The sweet spot of 20-30 Hz provided responsive yet stable control.
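The loop rate can be held with a simple sleep-to-deadline pattern, as in the sketch below; `read_leader` and `write_follower` are hypothetical helpers standing in for the actual serial I/O.

```python
# Fixed-rate teleoperation loop (30 Hz), sleeping to the next deadline each iteration.
import time

CONTROL_HZ = 30
PERIOD = 1.0 / CONTROL_HZ

def teleop_loop(read_leader, write_follower, stop_event):
    next_tick = time.perf_counter()
    while not stop_event.is_set():
        positions = read_leader()            # read leader joint positions over serial
        write_follower(positions)            # mirror them on the follower arm
        next_tick += PERIOD
        delay = next_tick - time.perf_counter()
        if delay > 0:
            time.sleep(delay)                # hold the 30 Hz cadence
        else:
            next_tick = time.perf_counter()  # fell behind: resynchronize instead of bursting
```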
Training Observations
Dataset Quality Matters: Early experiments with hastily collected demonstrations yielded poor policy performance. The quality of demonstrations - smooth trajectories, consistent task execution, and varied object positions - proved more important than quantity. A focused dataset of 50 high-quality episodes outperformed 200 rushed demonstrations.
Chunk Size Impact: For ACT, the choice of chunk size significantly affected behavior. Smaller chunks (20-30 actions) resulted in more reactive but less smooth motion. Larger chunks (100+ actions) produced smoother trajectories but reduced adaptability to unexpected situations.
Training Stability: Both ACT and Diffusion policies showed sensitivity to learning rate. ACT required lower learning rates (1e-5) for stable convergence, while Diffusion Policy tolerated higher rates (1e-4) due to its inherent noise injection during training.
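The values that worked in these experiments are summarized in the illustrative configuration below; the dictionary keys are hypothetical, and only the numbers come from the observations above (the KL weight is a commonly used ACT default).

```python
# Illustrative hyperparameters reflecting the observations above (key names are hypothetical).
ACT_CONFIG = {
    "learning_rate": 1e-5,       # lower LR needed for stable ACT convergence
    "chunk_size": 100,           # larger chunks -> smoother but less reactive motion
    "kl_weight": 10.0,           # VAE latent regularization strength (common ACT default)
}
DIFFUSION_CONFIG = {
    "learning_rate": 1e-4,       # tolerates a higher LR thanks to noise injection in training
    "horizon": 16,
    "num_inference_steps": 100,  # DDPM steps at inference time
}
```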
System Design Decisions
Why Multi-Threading?: Initial single-threaded implementations suffered from frame drops during serial communication delays. Separating image capture, command processing, and teleoperation into independent threads with proper synchronization eliminated these issues.
Why Client-Server for Inference?: The SO101 is often deployed on Jetson Nano or similar edge devices with limited GPU memory. By offloading inference to a remote GPU server, the robot can run sophisticated VLA models that would otherwise exceed local memory constraints.
Why LeRobot?: Building on the LeRobot framework provided immediate access to standardized dataset formats, training utilities, and Hugging Face Hub integration. This accelerated development significantly compared to implementing everything from scratch.
Results & Performance
Timing & Latency
| Metric | Value | Notes |
|---|---|---|
| Teleoperation Control Loop | 30 Hz | Stable with servo communication |
| Image Capture Rate | 30 FPS | Decoupled from control loop |
| ACT Inference (Local GPU) | ~50ms | RTX 3080, batch size 1 |
| Diffusion Inference (Local GPU) | ~200ms | 100 DDPM steps |
| Network Round-Trip (Client-Server) | ~10ms | Local network |
Task Performance
The system was evaluated on a pick-and-place task with small cubes:
| Model | Training Episodes | Success Rate | Notes |
|---|---|---|---|
| ACT | 50 | 5% (1/20) | Initial baseline with limited data |
| Diffusion | 50 | TBD | Pending evaluation |
| Pi-Zero (π0) | 50 | TBD | In evaluation |
Key Observations:
- The initial ACT policy achieved a 5% success rate (1 out of 20 attempts), indicating room for improvement
- Failure modes primarily involved grasp positioning errors and timing misalignment
- The policy demonstrated learned approach behaviors but struggled with precise gripper control
- Results suggest the need for more demonstration data and potentially task-specific training refinements
Analysis of Failures:
- Most failures occurred during the grasp phase, with the gripper closing too early or too late
- Some attempts showed correct trajectory planning but missed the target object by small margins
- The low success rate highlights the challenge of learning fine manipulation from limited demonstrations
Lessons Learned
What Worked Well
- Telnet Control Interface: The simple text-based protocol eliminated friction in the data collection workflow. Operators could control recording from any device without installing custom software.
- Multi-Threaded Architecture: Separating concerns into independent threads with explicit synchronization prevented subtle timing bugs and made the system more robust.
- LeRobot Integration: Building on an established framework saved significant development time and ensured compatibility with the broader ecosystem.
- Action Chunking: Predicting sequences of actions rather than single steps produced noticeably smoother robot behavior.
Areas for Improvement
- Calibration Workflow: The current calibration process requires manual positioning. An automated calibration routine would reduce setup time and improve reproducibility.
- Error Recovery: Current policies lack explicit error recovery mechanisms. When a grasp fails, the robot continues with the planned trajectory rather than adapting.
- Real-Time Adaptation: Both ACT and Diffusion policies operate open-loop within their prediction horizons. Incorporating feedback during execution could improve robustness.
- Inference Latency: Diffusion Policy's ~200ms inference time limits control responsiveness. Techniques like DDIM sampling or distillation could reduce this.
Insights on VLA Models
End-to-End Learning is Powerful but Data-Hungry: VLA models can learn complex manipulation behaviors without explicit programming, but they require substantial high-quality demonstration data. The quality-over-quantity principle proved essential.
Architecture Matters for Behavior Characteristics: ACT’s action chunking produces smoother trajectories, while Diffusion’s iterative refinement handles ambiguity better. The choice depends on task requirements.
Deployment Constraints Drive Design: Real-world deployment considerations (edge compute, network latency, reliability) significantly influenced the system architecture. Elegant algorithms are insufficient without practical deployment paths.
Current Status
Completed:
- Full teleoperation and data collection pipeline with telnet control
- ACT, Diffusion, and Pi-Zero (π0) policy training with Weights & Biases integration
- Local and distributed (client-server) inference pipelines for all models
- Hugging Face Hub integration for dataset and model sharing
In Progress:
- Pi-Zero-Point-Five (π0.5) model implementation
- Comprehensive benchmarking across task variations
- Real-world performance evaluation with diverse objects
Future Directions
- Expand Model Coverage: Implement Pi-Zero-Point-Five (π0.5) architectures and further optimize π0 performance
- Multi-Task Learning: Train policies that can handle multiple manipulation tasks with language conditioning
- Dataset Publication: Release demonstration datasets and trained models on Hugging Face Hub for community use
- Safety Systems: Develop monitoring and intervention mechanisms for safer autonomous operation
Conclusion
This project demonstrates that state-of-the-art VLA models can be successfully deployed on affordable robotic hardware with careful system design. The combination of teleoperation for data collection, GPU-accelerated training, and distributed inference enables a complete learning pipeline from demonstration to deployment.
Key takeaways include the importance of data quality over quantity, the impact of architectural choices on robot behavior characteristics, and the necessity of practical deployment considerations in system design. The telnet-based control interface and multi-threaded architecture proved particularly valuable for reliable operation.
The insights gained from implementing and comparing ACT, Diffusion Policy, and Pi-Zero (π0) provide a foundation for future work on more sophisticated VLA architectures such as π0.5 and on multi-task robot learning.
References
- LeRobot: An Open-Source Framework for Robot Learning - GitHub
- Action Chunking Transformer (ACT): Learning Fine-Grained Bimanual Manipulation - Paper
- Diffusion Policy: Visuomotor Policy Learning via Action Diffusion - Paper
- Pi-Zero: A Vision-Language-Action Flow Model for General Robot Control - Physical Intelligence
- Hugging Face Hub: Model and dataset hosting platform - huggingface.co
- Project Dataset: SO101 ACT demonstration data - Hugging Face Hub
- Trained ACT Model: Pre-trained ACT policy for SO101 - Hugging Face Hub