
Tiny Reinforcement Learning (TinyRL) for Robotics

2023-12-18 | By ShawnHymel

License: Attribution | Arduino

Reinforcement learning (RL) is a form of machine learning that involves training agents to interact with an environment in order to maximize cumulative rewards. It often involves trial-and-error learning for the agent. See this article to learn more about reinforcement learning.

In this post, we will discuss how to run reinforcement learning agents on microcontrollers and the benefits of doing so. To see one such example of a TinyRL agent in action, check out the following video:

 

In the video, I use the proximal policy optimization (PPO) algorithm to train an agent to perform the swing-up action on an inverted pendulum. Balancing the pendulum at the top proved too difficult for a number of reasons, which will be discussed later in the article.

Caveat: Reinforcement learning is still an active area of research. Using an RL algorithm to solve a control problem is, in almost all cases, not the most efficient approach. Classic control theory will almost always yield better results with far less frustration. I wanted to tackle a simple control theory problem with RL using real hardware because it’s cool to have a robot “just figure it out on its own.”

Overview of Reinforcement Learning

Reinforcement learning aims to find an optimal policy for general control problems, one that can be applied to a wide variety of tasks. An agent is the AI decision-making process that takes in observations, chooses actions, and learns from rewards.

Reinforcement learning loop

The environment is the world the agent interacts with. This can be a board game, a video game, a virtual environment, or the real world. In our example, we use the STEVAL-EDUKIT01 inverted pendulum kit from STMicroelectronics. Wrapper code senses the environment to generate an observation and performs the actions decided on by the agent. We use an Arduino as the interface between the environment and the agent during the learning process, and a powerful CPU or GPU trains the agent in real time.

The observation consists of four values: encoder angle (θ), encoder angular velocity (dθ/dt), stepper motor angle (φ), and stepper motor angular velocity (dφ/dt). The agent uses these values to make a decision about which action to take next.
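As a rough sketch of how this observation vector might be exposed to the training framework, a custom gymnasium environment could declare it as a 4-element Box space. The names and limits below are illustrative placeholders, not the project's exact values.

```python
import numpy as np
from gymnasium import spaces

# Observation vector: [theta, d_theta/dt, phi, d_phi/dt]
# Angles in radians, angular velocities in rad/s.
# The limits below are illustrative placeholders, not the project's exact values.
obs_high = np.array([np.pi, 8.0, np.pi, 8.0], dtype=np.float32)
observation_space = spaces.Box(low=-obs_high, high=obs_high, dtype=np.float32)

# Example observation: pendulum hanging straight down (theta = pi) and at rest.
example_obs = np.array([np.pi, 0.0, 0.0, 0.0], dtype=np.float32)
```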

Reward

Initially, I tried to have the agent learn to swing the pendulum up and balance it at the top (vertical). This approach used a continuous action space, where the agent could choose to move the stepper by any amount between -30° and +30°.

Reward function for continuous action space

The reward was a function (see image above) of all four observation values that would ideally be 0 if the pendulum pointed straight up (θ = 0) and was not moving (dθ/dt = 0) while the stepper motor stayed at its original position (φ = 0) without moving (dφ/dt = 0). This was based on the reward function from the inverted pendulum problem in the gymnasium framework.

The episode would end, with a 500-point penalty, if the stepper motor moved more than 180° in either direction.
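The exact weights are in the function shown above (and in the project repository), but as a hedged sketch of its general shape, modeled on gymnasium's Pendulum-v1 reward with made-up coefficients for the stepper terms, the per-step reward and termination check might look like this:

```python
import numpy as np

def step_reward(theta, theta_dot, phi, phi_dot):
    """Continuous-action reward sketch: 0 is the best possible value.

    The coefficients here are illustrative, not the project's actual values.
    """
    return -(theta**2 + 0.1 * theta_dot**2 + 0.01 * phi**2 + 0.001 * phi_dot**2)


def check_termination(phi):
    """End the episode with a 500-point penalty if the stepper motor travels
    more than 180 degrees (pi radians) from its starting position."""
    if abs(phi) > np.pi:
        return True, -500.0
    return False, 0.0
```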

After many tests, I discovered two main issues with this approach:

  • Training agents with continuous action spaces is difficult
  • The round trip time to take an observation, update the agent, and perform an action was over 30 ms

The first issue could be overcome with additional training time or moving to a virtual environment (such as NVIDIA Omniverse or Unity) to train an agent initially. A virtual world offers the ability to train multiple agents in multiple environments in parallel with relatively little overhead. It also solves the issue of needing to reset the physical hardware after each episode.

The second issue was a showstopper. Based on my experiments with a PID controller, the round trip time needs to be around 30 ms or less in order to maintain the pendulum in the inverted position. Once again, this could be remedied by moving to a virtual environment. However, you still must perform final agent tweaking using real hardware.

The round trip time resulted from the training framework (Stable Baselines3) taking 20-40 ms to perform updates along with the required serial communication between the wrapper framework (gymnasium) and the Arduino (5-10 ms). While it is possible to perform inference inside the microcontroller, as we will see in a moment, training the agent is nearly impossible due to resource constraints.
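If you want to measure this round trip yourself, one quick way, assuming a gymnasium-style wrapper around the serial link and a trained Stable Baselines3 model (both hypothetical names here), is to time a single observation → prediction → action cycle. Note that this captures only the inference round trip; during training, the learner's update time adds on top of it.

```python
import time

def measure_round_trip(env, model, steps=100):
    """Time the observation -> prediction -> action loop against real hardware.

    `env` is assumed to be a gymnasium-style wrapper that talks to the Arduino
    over serial, and `model` a trained Stable Baselines3 agent (hypothetical).
    """
    obs, _ = env.reset()
    times = []
    for _ in range(steps):
        start = time.perf_counter()
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, _ = env.step(action)
        times.append(time.perf_counter() - start)
        if terminated or truncated:
            obs, _ = env.reset()
    print(f"Mean round trip: {1000 * sum(times) / len(times):.1f} ms")
```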

To prove that TinyRL is still possible, I reduced the scope of the problem to focus on just the swing-up part. To accomplish that, I added two additional conditions to the reward function: end the episode with a 100-point reward if the pendulum is moving slowly enough near the top position, and end the episode with a 200-point penalty if it is moving too quickly near the top.

Reward function with discrete action space

As we no longer care about balancing the pendulum, I changed to a discrete action space with only three possible actions: move the stepper motor by -10°, do not move the stepper motor, and move the stepper motor by +10°. In the end, this made training much easier and faster.
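A hedged sketch of how the discrete action space and the two new terminal conditions might be expressed is shown below; the angle and velocity thresholds are placeholders, not the project's actual values.

```python
from gymnasium import spaces

# Three discrete actions, mapped to stepper moves in degrees.
action_space = spaces.Discrete(3)
ACTION_TO_DEGREES = {0: -10.0, 1: 0.0, 2: +10.0}


def terminal_reward(theta, theta_dot, near_top=0.2, slow=1.0, fast=4.0):
    """Check the swing-up termination conditions.

    Thresholds (radians, rad/s) are illustrative placeholders.
    Returns (done, bonus_or_penalty).
    """
    if abs(theta) < near_top:        # pendulum is near vertical
        if abs(theta_dot) < slow:    # moving slowly enough: success
            return True, 100.0
        if abs(theta_dot) > fast:    # moving too quickly: failure
            return True, -200.0
    return False, 0.0
```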

Deploying the Agent

I used PPO in the Stable Baselines3 framework to train the agent. The agent is a simple 3-layer dense neural network with 256 nodes in each of its hidden layers. Note that PPO is an actor-critic method, which relies on two similar models.
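For reference, setting up PPO with that network shape in Stable Baselines3 looks roughly like the sketch below. The built-in Pendulum-v1 environment stands in for the hardware wrapper, and the hyperparameters are placeholders; the project's actual settings are in the repository.

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Stand-in environment for illustration; the project uses a custom gymnasium
# wrapper that talks to the Arduino over serial instead of Pendulum-v1.
env = gym.make("Pendulum-v1")

# Hidden layers of 256 units for both the actor (pi) and critic (vf) networks;
# the exact depth and width used in the project may differ.
policy_kwargs = dict(net_arch=dict(pi=[256, 256], vf=[256, 256]))

model = PPO("MlpPolicy", env, policy_kwargs=policy_kwargs, verbose=1)
model.learn(total_timesteps=100_000)  # placeholder step count
model.save("ppo_swing_up")
```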

Actor-critic neural network

The actor (right side in the image) is the agent. Its job is to take the observation and make a prediction about which action should be performed. The critic (left side in the image) has the job of estimating the value function based on the current observation. The critic is only used during training, which means we can leave it out once training is complete.

To deploy the agent to our microcontroller, we first strip out the critic side of the model. We then compress and quantize the model to make inference run faster on our microcontroller.
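In Stable Baselines3, the actor and critic share one policy object, so stripping out the critic mostly means re-wrapping the actor layers on their own before handing the model to a compression tool. The sketch below assumes the default MlpPolicy with a flat observation vector and exports to ONNX; the exact export path used in the project may differ.

```python
import torch
from stable_baselines3 import PPO


class ActorOnly(torch.nn.Module):
    """Wrap only the actor half of an SB3 ActorCriticPolicy (critic left out)."""

    def __init__(self, policy):
        super().__init__()
        self.hidden = policy.mlp_extractor.policy_net  # actor hidden layers
        self.action_net = policy.action_net            # final action head

    def forward(self, obs):
        return self.action_net(self.hidden(obs))


model = PPO.load("ppo_swing_up")  # hypothetical saved model from training
actor = ActorOnly(model.policy)

# Export just the actor for downstream compression/quantization.
dummy_obs = torch.zeros(1, model.observation_space.shape[0])
torch.onnx.export(actor, dummy_obs, "actor.onnx",
                  input_names=["observation"], output_names=["action"])
```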

Optimize and compress neural network

I used Edge Impulse to perform the compression and Arduino to run the final inference code. You are welcome to use any such frameworks; these just made my life easier since I know them well. I deployed the model to a Seeed Studio XIAO ESP32S3, as the new ESP32S3 chipset contains powerful neural network acceleration.

Once deployed, the interpreter, agent, and action wrapper run entirely on the microcontroller.

Reinforcement learning loop on a microcontroller

Note that there is no longer a need for a reward function, as training is complete. The agent (the 3-layer neural network in our case) takes in an observation and makes a predicted action. The Arduino then moves the stepper motor as specified by the agent. With some luck, the pendulum will swing up to the top!

Going Further

The code for this project can be found in this repository.

See the following content to learn more about reinforcement learning:
