Turtlebot3 Navigation Using Deep Reinforcement Learning
Custom Gymnasium environment for hallway navigation training
C++, ROS2, Open Robotics Turtlebot3 Burger, 2D LiDAR, OpenAI Gymnasium, Deep Reinforcement Learning
Description
In this project, I designed and implemented an OpenAI Gymnasium environment for training a Deep Reinforcement Learning model to enable autonomous navigation of a Turtlebot3 Burger through complex hallway environments. The system leverages 2D LiDAR sensors for perception and employs state-of-the-art Deep RL algorithms to learn collision-free navigation policies.
Project Overview
graph TB
subgraph Environment["Simulation Environment"]
TB3[Turtlebot3 Burger]
LIDAR[2D LiDAR Sensor]
WORLD[Hallway Environment]
end
subgraph RL["Reinforcement Learning"]
GYM[OpenAI Gymnasium<br/>Custom Environment]
AGENT[Deep RL Agent]
POLICY[Navigation Policy]
end
subgraph ROS["ROS2 Integration"]
ODOM[Odometry Node]
CTRL[Control Node]
SIM[Gazebo Simulation]
end
TB3 --> LIDAR
LIDAR --> GYM
WORLD --> GYM
GYM --> AGENT
AGENT --> POLICY
POLICY --> CTRL
CTRL --> TB3
ODOM --> GYM
style TB3 fill:#e1f5ff
style GYM fill:#fff4e1
style AGENT fill:#d4edda
Demo Video
The result of this project can be seen in the video below:
Odometry Model
For this project, the focus is on control of the Turtlebot3 rather than localization, so I assume perfect odometry.
To compute the odometry from the wheel angles and velocities, I applied the kinematics model from Modern Robotics to derive the forward and inverse kinematics of the robot.
The geometry of the robot can be represented using the $H$ matrix.
\[\begin{align*} H(\theta) &= \begin{bmatrix} h_1(\theta) \\ h_2(\theta) \end{bmatrix} \\ h_i(\theta) &= \frac{1}{r_i \cos{\gamma_i}} \begin{bmatrix} x_i \sin{\left (\beta_i + \gamma_i \right )} - y_i \cos{\left (\beta_i + \gamma_i \right)} \\ \cos{\left ( \beta_i + \gamma_i + \theta \right )} \\ \sin{\left ( \beta_i + \gamma_i + \theta \right )} \end{bmatrix}^\intercal \end{align*}\]
The Turtlebot3 is a diff-drive mobile robot, so the parameters are \(\beta_i = \gamma_i = 0\), \(x_i = 0\), \(y_1 = d\), \(y_2 = -d\), and \(r_i = r\).
So that \(H(0)\) can be calculated as
\[H(0)=\frac{1}{r}\begin{bmatrix} -d & 1 & 0 \\ d & 1 & 0 \end{bmatrix}\]
where \(d\) is one-half of the distance between the wheels and \(r\) is the radius of each wheel.
Hence \(F\) matrix can be calculated by
\[F=H(0)^\dagger = r \begin{bmatrix} -\frac{1}{2d} & \frac{1}{2d} \\ \frac{1}{2} & \frac{1}{2} \\ 0 & 0 \end{bmatrix}\]
Forward Kinematics
The forward kinematics problem is to find the transform from the current state to the next state, \(T_{bb'}\), given the change in wheel positions \(\left( \Delta\phi_l, \Delta\phi_r\right)\).
- Find the body twist \({\cal V}_b\) from the body frame of the current state to the next state: \({\cal V}_b = \begin{bmatrix} \omega_z \\ v_x \\ v_y \end{bmatrix} = F \begin{bmatrix} \Delta\phi_l \\ \Delta\phi_r \end{bmatrix}\)
- Construct the full 3D twist by \(\mathcal{V}_{b6} = \begin{bmatrix} 0 & 0 & \omega_z & v_x & v_y & 0 \end{bmatrix}^\intercal\)
- Then, the transform from the current state to the next state \(T_{bb'}\) can be calculated by the matrix exponential \(T_{bb'} = e^{\left[\mathcal{V}_{b6}\right]}\)
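To make the derivation concrete, the forward mapping above (together with the wheel-velocity mapping \(H(0)\mathcal{V}_b\) derived in the next subsection) can be sketched in Python. This is a minimal sketch: the wheel radius and half-track values are nominal Turtlebot3 Burger figures, and a closed-form planar integration of the twist stands in for the general matrix exponential.

```python
import numpy as np

# Nominal Turtlebot3 Burger geometry (assumed values)
R = 0.033  # wheel radius r [m]
D = 0.080  # half the distance between the wheels d [m]

# F = H(0)^+ maps wheel increments to the body twist (omega_z, v_x, v_y)
F = R * np.array([[-1 / (2 * D), 1 / (2 * D)],
                  [1 / 2,        1 / 2],
                  [0,            0]])

def forward_kinematics(dphi_l, dphi_r):
    """Body-frame displacement (x, y, theta) for one step of wheel increments."""
    omega_z, v_x, _ = F @ np.array([dphi_l, dphi_r])
    if abs(omega_z) < 1e-12:          # pure translation
        return np.array([v_x, 0.0, 0.0])
    # Closed-form integration of the planar twist (v_y = 0)
    return np.array([v_x * np.sin(omega_z) / omega_z,
                     v_x * (1.0 - np.cos(omega_z)) / omega_z,
                     omega_z])

def inverse_kinematics(omega_z, v_x):
    """Wheel velocities (phi_l_dot, phi_r_dot) from a body twist with v_y = 0."""
    return np.array([(-D * omega_z + v_x) / R,
                     ( D * omega_z + v_x) / R])
```

For example, equal wheel increments produce a straight-line displacement, and the two functions invert each other on the non-holonomic subspace \(v_y = 0\).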
Inverse Kinematics
The inverse kinematics problem is to find the wheel velocities \(\left(\dot{\phi}_l, \dot{\phi}_r \right)\) given the body twist \({\cal V}_b\).
The wheel velocities can be calculated as
\[\begin{bmatrix} \dot{\phi}_l \\ \dot{\phi}_r \end{bmatrix} = H(0){\cal V}_b = \frac{1}{r}\begin{bmatrix} -d & 1 & 0 \\ d & 1 & 0 \end{bmatrix} \begin{bmatrix} \omega_z \\ v_x \\ v_y \end{bmatrix}\]
Since this robot is non-holonomic, it cannot move sideways. Hence \(v_y=0\), and the wheel velocities \(\left(\dot{\phi}_l, \dot{\phi}_r \right)\) are:
\[\begin{align*} \dot{\phi}_l &= \frac{1}{r} \left ( -d \omega_z + v_x\right) \\ \dot{\phi}_r &= \frac{1}{r} \left ( d \omega_z + v_x\right) \end{align*}\]
OpenAI Gymnasium
To start the training pipeline of the turtlebot3, I wrapped the ROS2 turtlebot3 simulation within the OpenAI Gymnasium environment, creating a standardized interface for Deep RL algorithms.
flowchart LR
subgraph Gymnasium["OpenAI Gymnasium Environment"]
OBS[Observation Space<br/>x, y, θ]
ACT[Action Space<br/>v_x, ω_z]
REW[Reward Function]
end
subgraph Robot["Turtlebot3 Simulation"]
STATE[Robot State]
SENSORS[LiDAR Sensors]
ACTUATORS[Wheel Actuators]
end
STATE --> OBS
SENSORS --> OBS
ACT --> ACTUATORS
ACTUATORS --> STATE
STATE --> REW
style OBS fill:#e1f5ff
style ACT fill:#fff4e1
style REW fill:#d4edda
Observation Space
The observation space for the environment can be defined as:
\[\begin{align*} x &= -1 \rightarrow 4 \text{ meters (world x-coordinate)} \\ y &= -0.5 \rightarrow 4.5 \text{ meters (world y-coordinate)} \\ \theta &= -\pi \rightarrow \pi \text{ radians (orientation)} \end{align*}\]
Action Space
The action space can be defined as:
\[\begin{align*} \dot{x} &= 0.0 \rightarrow 0.2 \text{ m/s (linear velocity)} \\ \dot{\theta} &= -0.5 \rightarrow 0.5 \text{ rad/s (angular velocity)} \end{align*}\]
Reward Function
The reward structure encourages collision-free goal-reaching behavior:
- Collision penalty: -1 (when robot hits a wall)
- Goal reward: +1 (when robot reaches the target)
- This sparse reward signal requires the agent to learn exploratory behavior
Training Process
The Deep Reinforcement Learning training follows an episodic approach where the agent learns through trial and error.
stateDiagram-v2
[*] --> Initialize: Episode Start
Initialize --> Observe: Reset to start position
Observe --> SelectAction: Get current state
SelectAction --> Execute: Choose action (RL policy)
Execute --> GetReward: Apply action to robot
GetReward --> CheckTerminal: Evaluate outcome
CheckTerminal --> Observe: Continue (no collision, no goal)
CheckTerminal --> Learn: Terminal state reached
Learn --> UpdatePolicy: Collision (-1) or Goal (+1)
UpdatePolicy --> Initialize: Next episode
note right of CheckTerminal
Terminal conditions:
- Wall collision (-1)
- Goal reached (+1)
- Max steps exceeded
end note
note right of UpdatePolicy
Policy updated using:
- Experience replay
- Neural network backprop
- TD-learning
end note
The training process is shown in the figure below, in which the robot is reset to its starting position whenever it hits a wall.

Training Algorithm
The agent learns through the following cycle:
- Initialization: Robot spawns at the start position
- Observation: Collect state information (position, orientation, LiDAR data)
- Action Selection: Neural network policy outputs velocity commands
- Execution: Robot moves according to the selected action
- Reward Evaluation: System checks for collision or goal achievement
- Learning: Update policy based on reward signal
- Reset: Upon terminal state, reset environment and repeat
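The cycle above can be sketched as a generic interaction loop. The `RandomAgent` placeholder and function names are my own; the real agent is a neural-network policy trained with experience replay and TD-learning, and the loop works with any environment that follows the Gymnasium `reset`/`step` convention.

```python
import numpy as np

class RandomAgent:
    """Stand-in policy; the real agent is a neural network trained with TD-learning."""
    def __init__(self, low, high, rng):
        self.low, self.high, self.rng = low, high, rng
    def act(self, obs):
        # Action Selection: uniform random here instead of a policy network
        return self.rng.uniform(self.low, self.high)
    def learn(self, obs, action, reward, next_obs, done):
        pass  # experience replay and network backprop would go here

def run_episodes(env, agent, n_episodes=10, max_steps=500):
    """Episodic trial-and-error loop: reset, observe, act, step, learn."""
    returns = []
    for _ in range(n_episodes):
        obs, _ = env.reset()          # Initialization
        total = 0.0
        for _ in range(max_steps):
            action = agent.act(obs)   # Action Selection
            next_obs, reward, terminated, truncated, _ = env.step(action)  # Execution
            agent.learn(obs, action, reward, next_obs, terminated)         # Learning
            obs, total = next_obs, total + reward
            if terminated or truncated:  # collision, goal, or step limit
                break
        returns.append(total)
    return returns
```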
Packages
- nuturtle_description: Visualizes the pose of the turtlebot3 burger.
- turtlelib: Customized library functions for kinematics and SLAM calculations.
- nusim: Simulates the turtlebot in the real world.
- nuturtle_control: Performs the turtlebot kinematics calculations and parses the commands sent to the turtlebot3.
- nuslam: Performs the SLAM algorithm for correcting odometry error.
- nuturtle_interfaces: All message interfaces for communication between the nodes.
- nuturtle_slam: Integrates slam_toolbox for setting up the learning environment.
- nuturtle_deeprl: Wraps the simulation with OpenAI Gymnasium as a Deep Reinforcement Learning training environment.