Vision-Language-Action Models for Robotic Manipulation with SoArm-101
Python, LeRobot, Vision-Language-Action, Deep Learning, Robot Manipulation, SoArm-101, Teleoperation, Imitation Learning
| Dataset on Hugging Face Hub | Trained Model on Hugging Face Hub |
Description
This project explores the application of state-of-the-art Vision-Language-Action (VLA) models for robotic manipulation tasks using the SoArm-101 robot arm. VLA models represent a new paradigm in robot learning that combines visual perception, natural language understanding, and action prediction to enable robots to perform complex manipulation tasks through learned behaviors.
The project successfully established a complete pipeline from teleoperation and data collection through model training and real-world deployment, providing valuable insights into the practical challenges of implementing VLA systems on affordable hardware.
What are Vision-Language-Action Models?
Vision-Language-Action (VLA) models are a class of neural networks that learn to map visual observations and language instructions directly to robot actions. Unlike traditional robotic systems that rely on hand-engineered perception and control pipelines, VLA models learn end-to-end policies from demonstration data, enabling more flexible and generalizable robot behaviors.
These models typically:
- Process camera images to understand the scene
- Accept natural language commands to specify task goals
- Output low-level robot control commands (joint positions, velocities, etc.)
The key insight is that by learning from human demonstrations, robots can acquire complex manipulation skills without explicit programming of every motion primitive.
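To make this input-output mapping concrete, here is a deliberately tiny policy sketch in PyTorch. Everything here is illustrative (the real models used in this project are described under Model Architectures below); it only shows the shape of the problem: image, language, and joint state in, a 6-D action out.

```python
import torch
import torch.nn as nn

class ToyVLAPolicy(nn.Module):
    """Illustrative only: maps (image, language, joint state) to a 6-D action."""

    def __init__(self, lang_dim: int = 32, state_dim: int = 6):
        super().__init__()
        # A tiny CNN stands in for a real vision backbone (e.g. ResNet-18).
        self.vision = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(8 + lang_dim + state_dim, 64), nn.ReLU(),
            nn.Linear(64, state_dim),  # 6-D target joint positions
        )

    def forward(self, image, lang, state):
        feats = self.vision(image)  # (B, 8) visual features
        return self.head(torch.cat([feats, lang, state], dim=-1))

policy = ToyVLAPolicy()
action = policy(torch.rand(1, 3, 480, 640),  # RGB frame
                torch.zeros(1, 32),          # language embedding (stub)
                torch.zeros(1, 6))           # current joint positions
print(action.shape)  # torch.Size([1, 6])
```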
Hardware Setup
The project uses the SoArm-101 robot arm, a compact and accessible robotic manipulator suitable for tabletop manipulation tasks. The system is integrated with LeRobot, an open-source framework designed to facilitate robot learning research.
```mermaid
graph TB
    subgraph Workspace["Physical Workspace"]
        subgraph Arms["Dual Arm Setup"]
            LEADER["Leader Arm<br/>Torque Disabled<br/>Human Guidance"]
            FOLLOWER["Follower Arm<br/>Torque Enabled<br/>Mirrors Leader"]
        end
        subgraph Sensors["Perception"]
            CAM["USB Camera<br/>Logitech C920<br/>640×480 @ 30Hz"]
        end
        subgraph Objects["Task Objects"]
            CUBE["Target Objects<br/>Small Cubes"]
        end
    end
    subgraph Computer["Host Computer"]
        subgraph Ports["USB Ports"]
            USB1["/dev/ttyACM0<br/>Leader"]
            USB2["/dev/ttyACM1<br/>Follower"]
            USB3["/dev/video0<br/>Camera"]
        end
        subgraph Software["Software Stack"]
            PYTHON["Python 3.10+"]
            LEROBOT["LeRobot"]
            OPENCV["OpenCV"]
            TORCH["PyTorch"]
        end
    end
    LEADER ---|"Serial 1Mbps"| USB1
    FOLLOWER ---|"Serial 1Mbps"| USB2
    CAM ---|"USB 2.0"| USB3
    USB1 --> LEROBOT
    USB2 --> LEROBOT
    USB3 --> OPENCV
    OPENCV --> LEROBOT
    LEROBOT --> TORCH
    style LEADER fill:#e8f5e9
    style FOLLOWER fill:#e1f5ff
    style CAM fill:#fff3cd
    style LEROBOT fill:#ffeaa7
    style TORCH fill:#ffcdd2
```
Key Components:
- SO101 Leader Arm: Manual control arm for teleoperation demonstrations (torque disabled for free movement)
- SO101 Follower Arm: 5-DOF robotic arm with gripper (Feetech servos, torque enabled)
- USB Camera: OpenCV-based vision capture at 640x480 resolution, 30 FPS
- LeRobot Framework: Provides standardized interfaces for data collection, training, and deployment
Robot Joint Configuration
The SO101 arm features 6 controllable joints:
| Joint | Name | Type | Description |
|---|---|---|---|
| 1 | `shoulder_pan` | Revolute | Base rotation (yaw) |
| 2 | `shoulder_lift` | Revolute | Shoulder elevation (pitch) |
| 3 | `elbow_flex` | Revolute | Elbow bend (pitch) |
| 4 | `wrist_flex` | Revolute | Wrist pitch |
| 5 | `wrist_roll` | Revolute | Wrist rotation (roll) |
| 6 | `gripper` | Prismatic | Gripper open/close |
Observation Space: 6D joint positions (radians for revolute joints, meters for the prismatic gripper)
Action Space: 6D target positions (same format)
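For reference, one observation/action pair might look like this in Python. The field names are illustrative, not the exact LeRobot schema:

```python
import numpy as np

# Illustrative observation/action pair (field names are not the exact
# LeRobot schema, just the 6-D layout described above).
observation = {
    # shoulder_pan, shoulder_lift, elbow_flex, wrist_flex, wrist_roll (rad),
    # gripper (m)
    "state": np.array([0.12, -0.45, 0.80, 0.30, -0.05, 0.015], dtype=np.float32),
    "image": np.zeros((480, 640, 3), dtype=np.uint8),  # RGB frame
}
action = np.array([0.15, -0.40, 0.78, 0.28, -0.05, 0.020], dtype=np.float32)
assert observation["state"].shape == action.shape == (6,)
```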
System Architecture
End-to-End Pipeline Overview
The complete VLA system spans data collection, cloud-based training, and real-world deployment. This architecture enables a seamless workflow from human demonstrations to autonomous robot operation.
```mermaid
flowchart TB
    subgraph Collection["Data Collection Phase"]
        direction TB
        LEADER["Human Operator<br/>Leader Arm"]
        FOLLOWER["Follower Arm<br/>SO101"]
        CAMERA["USB Camera<br/>640×480"]
        TELNET["Telnet Control<br/>Start/Stop/Abort"]
        DATASET["LeRobot Dataset<br/>Parquet + Videos"]
        LEADER -->|"Position Commands"| FOLLOWER
        CAMERA -->|"RGB Frames"| DATASET
        FOLLOWER -->|"Joint States"| DATASET
        TELNET -->|"Recording Control"| DATASET
    end
    subgraph Cloud["Cloud Services"]
        direction TB
        HF["Hugging Face Hub<br/>Datasets & Models"]
        WANDB["Weights & Biases<br/>Experiment Tracking"]
    end
    subgraph Training["Training Phase"]
        direction TB
        GPU["GPU Server<br/>RTX 3080/4090"]
        subgraph Models["Model Architectures"]
            ACT["ACT<br/>Action Chunking<br/>Transformer"]
            DIFF["Diffusion Policy<br/>Denoising<br/>Generation"]
            PI0["π0 (Planned)<br/>Flow Matching"]
        end
        GPU --> Models
    end
    subgraph Deployment["Deployment Phase"]
        direction TB
        JETSON["Edge Device<br/>Jetson Nano"]
        ROBOT["Robot Arm<br/>Autonomous"]
        SERVER["Inference Server<br/>TCP/IP"]
        JETSON <-->|"Observations/Actions"| SERVER
        JETSON --> ROBOT
    end
    Collection -->|"Push Dataset"| HF
    HF -->|"Load Dataset"| Training
    Training -->|"Push Model"| HF
    Training -->|"Log Metrics"| WANDB
    HF -->|"Pull Model"| Deployment
    style Collection fill:#e8f5e9
    style Cloud fill:#fff3cd
    style Training fill:#e3f2fd
    style Deployment fill:#fce4ec
```
Technology Stack
| Layer | Technology | Purpose |
|---|---|---|
| Framework | LeRobot | Robot learning framework |
| Deep Learning | PyTorch | Neural network training |
| Vision | OpenCV | Image capture & processing |
| Hardware | Feetech STS3215 | Servo motors |
| Dataset | Hugging Face Hub | Dataset & model hosting |
| Tracking | Weights & Biases | Experiment logging |
| Edge | NVIDIA Jetson Nano | Edge deployment |
Telnet-Based Data Collection Control
One of the key architectural decisions was implementing a telnet-based command interface for controlling the data collection process. This design enables operators to control recording sessions remotely without interrupting the teleoperation workflow.
```mermaid
graph LR
    subgraph Control["Control Interface"]
        TELNET["Telnet Client<br/>localhost:1234"]
    end
    subgraph Threads["Multi-Threaded System"]
        CMD["Command Server<br/>Thread"]
        IMG["Image Capture<br/>Thread"]
        MAIN["Main Loop<br/>Thread"]
    end
    subgraph State["Shared State"]
        REC["Recording Flag"]
        FRAME["Latest Frame"]
    end
    TELNET -->|Commands| CMD
    CMD -->|Toggle| REC
    IMG -->|Update| FRAME
    REC --> MAIN
    FRAME --> MAIN
    style TELNET fill:#e1f5ff
    style REC fill:#fff3cd
```
Command Interface:
- `s`: Start/stop recording episodes
- `a`: Abort current episode (discard frames without saving)
- `q`: Quit and save dataset
Why Telnet? The choice of a text-based telnet protocol over custom software was deliberate:
- Cross-platform compatibility: Any device with a terminal can control the system
- Simplicity: No client software installation required
- Reliability: Standard TCP sockets with well-understood behavior
- Debuggability: Human-readable commands for easy troubleshooting
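A minimal sketch of such a command server is shown below. This is not the project's exact code, just one plausible way to wire the `s`/`a`/`q` commands to a thread-safe recording flag:

```python
import socket
import threading

class RecorderState:
    """Thread-safe flags shared between the command server and the main loop."""

    def __init__(self):
        self._lock = threading.Lock()
        self.recording = False
        self.abort = False
        self.quit = False

    def handle(self, cmd: str) -> None:
        with self._lock:
            if cmd == "s":        # toggle recording
                self.recording = not self.recording
            elif cmd == "a":      # abort: discard buffered frames
                self.recording = False
                self.abort = True
            elif cmd == "q":      # quit and save dataset
                self.quit = True

def command_server(state: RecorderState, port: int = 1234) -> None:
    """Accept one telnet client and feed its single-letter commands to state."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", port))
    srv.listen(1)
    conn, _ = srv.accept()
    with conn:
        while not state.quit:
            data = conn.recv(64)
            if not data:
                break
            state.handle(data.decode().strip().lower())

state = RecorderState()
threading.Thread(target=command_server, args=(state,), daemon=True).start()
# The teleoperation loop polls state.recording / state.abort / state.quit.
```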
Multi-Threaded Architecture
The data collection system employs three concurrent threads to ensure smooth operation:
- Image Capture Thread: Continuously captures camera frames at 30 FPS, decoupled from the control loop to prevent frame drops during servo communication
- Command Server Thread: Listens for telnet connections and processes recording commands with thread-safe state updates
- Main Teleoperation Loop: Runs at 30 Hz, reading leader arm positions and transmitting to the follower while recording synchronized observations
Thread synchronization uses Python locks to protect shared state (recording flag and latest image buffer), preventing race conditions between command processing and data capture.
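A condensed sketch of the capture thread and its lock-protected frame buffer follows; the camera index and property calls are standard OpenCV, but the structure is a simplification of the actual system:

```python
import threading
import cv2

latest_frame = None
frame_lock = threading.Lock()

def capture_loop(device: int = 0) -> None:
    """Grab frames continuously so the control loop never blocks on camera I/O."""
    global latest_frame
    cap = cv2.VideoCapture(device)
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)
    while True:
        ok, frame = cap.read()
        if ok:
            with frame_lock:        # prevent torn reads in the main loop
                latest_frame = frame

threading.Thread(target=capture_loop, daemon=True).start()

# Inside the main teleoperation loop:
with frame_lock:
    frame = None if latest_frame is None else latest_frame.copy()
```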
Client-Server Inference Architecture
For deployment, the system supports a distributed architecture that separates GPU inference from robot control:
```mermaid
flowchart LR
    subgraph Robot["Robot Client (Jetson Nano)"]
        direction TB
        ARM["SO101 Arm<br/>6-DOF Control"]
        CAM["Camera<br/>Frame Capture"]
        CLIENT["TCP Client<br/>Async I/O"]
        subgraph EdgeStack["Edge Software"]
            NUMPY["NumPy"]
            SERIAL["pyserial"]
        end
    end
    subgraph Network["Network"]
        TCP["TCP/IP<br/>Port 8000<br/>~10ms RTT"]
    end
    subgraph GPU["GPU Server (RTX 3080/4090)"]
        direction TB
        SERVER["TCP Server<br/>Multi-threaded"]
        subgraph Inference["Inference Pipeline"]
            PREPROCESS["Preprocessing<br/>Normalize & Resize"]
            MODEL["VLA Model<br/>ACT / Diffusion"]
            POSTPROCESS["Postprocessing<br/>Action Scaling"]
        end
        subgraph GPUStack["GPU Software"]
            TORCH2["PyTorch"]
            CUDA["CUDA 11.8+"]
            HF2["Transformers"]
        end
    end
    CAM -->|"RGB Frame"| CLIENT
    ARM -->|"Joint State"| CLIENT
    CLIENT <-->|"JSON Protocol"| TCP
    TCP <-->|"Observations / Actions"| SERVER
    SERVER --> PREPROCESS
    PREPROCESS --> MODEL
    MODEL --> POSTPROCESS
    POSTPROCESS --> SERVER
    CLIENT -->|"Target Position"| ARM
    style Robot fill:#e8f5e9
    style GPU fill:#e3f2fd
    style Network fill:#fff3cd
    style MODEL fill:#ffcdd2
```
Protocol Specification:
| Message Type | Direction | Format | Example |
|---|---|---|---|
| Observation | Client → Server | `{"image": base64, "state": [6 floats]}` | Joint positions + RGB |
| Action | Server → Client | `{"action": [6 floats]}` | Target joint positions |
| Heartbeat | Bidirectional | `{"ping": timestamp}` | Connection health |
This architecture enables running computationally expensive neural network inference on a powerful GPU server while the robot operates on a resource-constrained edge device like NVIDIA Jetson.
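A minimal client-side sketch of this exchange is below. The message bodies follow the table above; the newline-delimited framing and the server hostname are assumptions, since the document does not pin them down:

```python
import base64
import json
import socket

import cv2
import numpy as np

def request_action(sock: socket.socket, frame: np.ndarray, state: list) -> list:
    """Send one observation, block for one action (newline-framed JSON, assumed)."""
    ok, jpeg = cv2.imencode(".jpg", frame)   # compress before base64 encoding
    msg = {
        "image": base64.b64encode(jpeg.tobytes()).decode("ascii"),
        "state": state,                      # 6 joint positions
    }
    sock.sendall(json.dumps(msg).encode() + b"\n")
    reply = b""
    while not reply.endswith(b"\n"):         # read one full JSON line
        reply += sock.recv(4096)
    return json.loads(reply)["action"]       # 6 target joint positions

# "gpu-server.local" is a placeholder hostname; port 8000 is from the diagram.
sock = socket.create_connection(("gpu-server.local", 8000))
action = request_action(sock, np.zeros((480, 640, 3), np.uint8), [0.0] * 6)
```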
Model Architectures
The following diagram illustrates the internal architecture of each VLA model used in this project:
```mermaid
flowchart LR
    subgraph Input["Input"]
        IMG["Camera Image<br/>640×480 RGB"]
        STATE["Joint State<br/>6D Position"]
        LANG["Language<br/>(Optional)"]
    end
    subgraph ACT_Model["ACT Architecture"]
        direction TB
        RESNET["ResNet-18<br/>Vision Encoder"]
        POS_EMB["Positional<br/>Embedding"]
        TRANSFORMER["Transformer<br/>Encoder-Decoder"]
        VAE["VAE Latent<br/>Space"]
        CHUNK["Action Chunk<br/>100 Steps"]
    end
    subgraph Diffusion_Model["Diffusion Policy"]
        direction TB
        UNET["U-Net<br/>Denoiser"]
        NOISE["Gaussian<br/>Noise"]
        DDPM["DDPM<br/>100 Steps"]
        DENOISE["Iterative<br/>Denoising"]
    end
    subgraph Output["Output"]
        ACTIONS["Action Sequence<br/>Joint Targets"]
    end
    IMG --> RESNET
    STATE --> POS_EMB
    RESNET --> TRANSFORMER
    POS_EMB --> TRANSFORMER
    TRANSFORMER --> VAE
    VAE --> CHUNK
    CHUNK --> ACTIONS
    IMG --> UNET
    STATE --> UNET
    NOISE --> DDPM
    DDPM --> DENOISE
    UNET --> DENOISE
    DENOISE --> ACTIONS
    style ACT_Model fill:#e3f2fd
    style Diffusion_Model fill:#fff3cd
    style Input fill:#e8f5e9
    style Output fill:#fce4ec
```
ACT (Action Chunking Transformer)
A transformer-based model that predicts sequences of future actions (action chunks) rather than single actions, enabling more coherent and long-horizon behaviors.
Key Design Choices:
- Chunk Size of 100: Predicts 100 future actions at once, reducing the frequency of policy queries and enabling smoother trajectories
- VAE Training: Uses KL divergence loss for latent space regularization, helping the model learn a compact representation of action distributions
- ResNet-18 Vision Backbone: Efficient visual feature extraction balancing accuracy and inference speed
The action chunking approach proved particularly effective for manipulation tasks where temporal coherence matters. Rather than predicting one action at a time (which can lead to jerky motion), predicting a sequence of actions allows the robot to execute smooth, purposeful movements.
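The sketch below shows the simplest way such a chunk could be consumed at the 30 Hz control rate; `policy`, `get_observation`, and `send_action` are placeholders. Note that the original ACT paper additionally blends overlapping chunks (temporal ensembling), which this sketch omits:

```python
import time

CHUNK_SIZE = 100    # actions predicted per policy query
CONTROL_HZ = 30     # control rate from the timing table

def run_chunked(policy, get_observation, send_action):
    """Query the policy once per chunk, then play the chunk back at CONTROL_HZ."""
    while True:
        chunk = policy(get_observation())   # expected shape: (CHUNK_SIZE, 6)
        for action in chunk:
            send_action(action)             # 6-D joint targets
            time.sleep(1.0 / CONTROL_HZ)
```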
Diffusion Policy
A denoising diffusion-based policy that learns to generate robot actions through iterative refinement, enabling smooth and multimodal action distributions.
Key Design Choices:
- Iterative Denoising: Generates actions by progressively refining random noise, allowing the model to represent complex, multimodal action distributions
- Horizon of 16 Steps: Balances prediction accuracy with computational efficiency
- 100 DDPM Inference Steps: Provides high-quality action generation at the cost of increased inference time
The diffusion approach excels at handling ambiguous situations where multiple valid actions exist. For example, when grasping an object, there may be several valid approach angles - diffusion models can represent this uncertainty naturally.
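For intuition, here is a schematic DDPM reverse loop for action sampling. It is not LeRobot's implementation: the linear beta schedule and the `denoiser(x, t, cond)` signature are assumptions, but the structure (start from Gaussian noise, denoise for 100 steps) matches the design choices above:

```python
import torch

@torch.no_grad()
def sample_actions(denoiser, cond, horizon=16, action_dim=6, steps=100):
    """Schematic DDPM ancestral sampling: refine Gaussian noise into a
    (horizon, action_dim) action sequence. denoiser predicts the added noise."""
    betas = torch.linspace(1e-4, 0.02, steps)      # assumed linear schedule
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, horizon, action_dim)        # start from pure noise
    for t in reversed(range(steps)):
        eps = denoiser(x, torch.tensor([t]), cond)  # predicted noise at step t
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bar[t]) * eps) \
               / torch.sqrt(alphas[t])
        # Re-inject noise on all but the final step (sigma_t^2 = beta_t).
        x = mean + torch.sqrt(betas[t]) * torch.randn_like(x) if t > 0 else mean
    return x.squeeze(0)                             # e.g. (16, 6) joint targets
```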
Planned: Pi-Zero Models
Future work includes implementing the Pi-Zero (π0) and Pi-Zero-Point-Five (π0.5) architectures from Physical Intelligence, which combine flow matching with pre-trained vision-language backbones for enhanced generalization.
Development Journey & Challenges
Hardware Integration
Calibration Complexity: The SO101 arm requires careful calibration to establish joint zero positions. Initial attempts without proper calibration led to unpredictable movements and position drift. The solution involved implementing a structured calibration routine that records reference positions and stores them in a standardized format.
Serial Port Management: With multiple USB devices (leader arm, follower arm, camera), consistent device naming became essential. Linux udev rules were implemented to assign predictable device paths (/dev/ttyACM0, /dev/ttyACM1) based on device attributes rather than connection order.
Timing Synchronization: Achieving smooth leader-follower tracking required careful tuning of the control loop frequency. Too slow (below 20 Hz) resulted in jerky following behavior; too fast (above 30 Hz) overwhelmed the servo communication bandwidth. The sweet spot of 20-30 Hz provided responsive yet stable control.
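A fixed-rate loop along these lines keeps the update rate in that window; the read/write helpers are placeholders for the serial I/O:

```python
import time

CONTROL_HZ = 30               # 20-30 Hz was the stable range found above
PERIOD = 1.0 / CONTROL_HZ

def teleop_loop(read_leader, write_follower):
    """Fixed-rate leader-to-follower mirroring: sleep off whatever time the
    serial round-trip left over so the loop rate stays near CONTROL_HZ."""
    while True:
        start = time.perf_counter()
        write_follower(read_leader())   # mirror leader joint positions
        elapsed = time.perf_counter() - start
        time.sleep(max(0.0, PERIOD - elapsed))
```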
Training Observations
Dataset Quality Matters: Early experiments with hastily collected demonstrations yielded poor policy performance. The quality of demonstrations - smooth trajectories, consistent task execution, and varied object positions - proved more important than quantity. A focused dataset of 50 high-quality episodes outperformed 200 rushed demonstrations.
Chunk Size Impact: For ACT, the choice of chunk size significantly affected behavior. Smaller chunks (20-30 actions) resulted in more reactive but less smooth motion. Larger chunks (100+ actions) produced smoother trajectories but reduced adaptability to unexpected situations.
Training Stability: Both ACT and Diffusion policies showed sensitivity to learning rate. ACT required lower learning rates (1e-5) for stable convergence, while Diffusion Policy tolerated higher rates (1e-4) due to its inherent noise injection during training.
System Design Decisions
Why Multi-Threading?: Initial single-threaded implementations suffered from frame drops during serial communication delays. Separating image capture, command processing, and teleoperation into independent threads with proper synchronization eliminated these issues.
Why Client-Server for Inference?: The SO101 is often deployed on Jetson Nano or similar edge devices with limited GPU memory. By offloading inference to a remote GPU server, the robot can run sophisticated VLA models that would otherwise exceed local memory constraints.
Why LeRobot?: Building on the LeRobot framework provided immediate access to standardized dataset formats, training utilities, and Hugging Face Hub integration. This accelerated development significantly compared to implementing everything from scratch.
Results & Performance
Inference Comparison
The following shows side-by-side comparison of the two trained policies performing pick-and-place tasks:
*(Side-by-side demo videos, omitted here: ACT Policy | Diffusion Policy)*
Both models were trained on the same dataset and demonstrate learned manipulation behaviors. The ACT policy (left) uses action chunking for smooth trajectory generation, while the Diffusion policy (right) employs iterative denoising for action prediction.
Timing & Latency
| Metric | Value | Notes |
|---|---|---|
| Teleoperation Control Loop | 30 Hz | Stable with servo communication |
| Image Capture Rate | 30 FPS | Decoupled from control loop |
| ACT Inference (Local GPU) | ~50ms | RTX 3080, batch size 1 |
| Diffusion Inference (Local GPU) | ~200ms | 100 DDPM steps |
| Network Round-Trip (Client-Server) | ~10ms | Local network |
Task Performance
The system was evaluated on a pick-and-place task with small cubes:
| Model | Training Episodes | Success Rate | Notes |
|---|---|---|---|
| ACT | 50 | 5% (1/20) | Initial baseline with limited data |
Key Observations:
- The initial ACT policy achieved a 5% success rate (1 out of 20 attempts), indicating room for improvement
- Failure modes primarily involved grasp positioning errors and timing misalignment
- The policy demonstrated learned approach behaviors but struggled with precise gripper control
- Results suggest the need for more demonstration data and potentially task-specific training refinements
Analysis of Failures:
- Most failures occurred during the grasp phase, with the gripper closing too early or too late
- Some attempts showed correct trajectory planning but missed the target object by small margins
- The low success rate highlights the challenge of learning fine manipulation from limited demonstrations
Lessons Learned
What Worked Well
- Telnet Control Interface: The simple text-based protocol eliminated friction in the data collection workflow. Operators could control recording from any device without installing custom software.
- Multi-Threaded Architecture: Separating concerns into independent threads with explicit synchronization prevented subtle timing bugs and made the system more robust.
- LeRobot Integration: Building on an established framework saved significant development time and ensured compatibility with the broader ecosystem.
- Action Chunking: Predicting sequences of actions rather than single steps produced noticeably smoother robot behavior.
Areas for Improvement
- Calibration Workflow: The current calibration process requires manual positioning. An automated calibration routine would reduce setup time and improve reproducibility.
- Error Recovery: Current policies lack explicit error recovery mechanisms. When a grasp fails, the robot continues with the planned trajectory rather than adapting.
- Real-Time Adaptation: Both ACT and Diffusion policies operate open-loop within their prediction horizons. Incorporating feedback during execution could improve robustness.
- Inference Latency: Diffusion Policy's ~200ms inference time limits control responsiveness. Techniques like DDIM sampling or distillation could reduce this.
Insights on VLA Models
End-to-End Learning is Powerful but Data-Hungry: VLA models can learn complex manipulation behaviors without explicit programming, but they require substantial high-quality demonstration data. The quality-over-quantity principle proved essential.
Architecture Matters for Behavior Characteristics: ACT's action chunking produces smoother trajectories, while Diffusion's iterative refinement handles ambiguity better. The choice depends on task requirements.
Deployment Constraints Drive Design: Real-world deployment considerations (edge compute, network latency, reliability) significantly influenced the system architecture. Elegant algorithms are insufficient without practical deployment paths.
Current Status
Completed:
- Full teleoperation and data collection pipeline with telnet control
- ACT and Diffusion Policy training with Weights & Biases integration
- Local and distributed (client-server) inference pipelines
- Hugging Face Hub integration for dataset and model sharing
In Progress:
- Pi-Zero and Pi-Zero-Point-Five model implementations
- Comprehensive benchmarking across task variations
- Real-world performance evaluation with diverse objects
Future Directions
- Expand Model Coverage: Implement Pi-Zero architectures with flow matching and pre-trained vision-language backbones
- Multi-Task Learning: Train policies that can handle multiple manipulation tasks with language conditioning
- Dataset Publication: Release demonstration datasets and trained models on Hugging Face Hub for community use
- Safety Systems: Develop monitoring and intervention mechanisms for safer autonomous operation
Conclusion
This project demonstrates that state-of-the-art VLA models can be successfully deployed on affordable robotic hardware with careful system design. The combination of teleoperation for data collection, GPU-accelerated training, and distributed inference enables a complete learning pipeline from demonstration to deployment.
Key takeaways include the importance of data quality over quantity, the impact of architectural choices on robot behavior characteristics, and the necessity of practical deployment considerations in system design. The telnet-based control interface and multi-threaded architecture proved particularly valuable for reliable operation.
The insights gained from implementing and comparing ACT and Diffusion policies provide a foundation for future work on more sophisticated VLA architectures and multi-task robot learning.
References
- LeRobot: An Open-Source Framework for Robot Learning - GitHub
- Action Chunking Transformer (ACT): Learning Fine-Grained Bimanual Manipulation - Paper
- Diffusion Policy: Visuomotor Policy Learning via Action Diffusion - Paper
- Pi-Zero: A Vision-Language-Action Flow Model for General Robot Control - Physical Intelligence
- Hugging Face Hub: Model and dataset hosting platform - huggingface.co
- Project Dataset: SO101 ACT demonstration data - Hugging Face Hub
- Trained ACT Model: Pre-trained ACT policy for SO101 - Hugging Face Hub

