The Introduction familiarizes students with the basic concepts and terms of Reinforcement Learning (RL). It is based on the slides, which introduce not only common applications but also the general idea of how an AI can learn on its own by performing and evaluating actions.
The slides start by introducing RL using a short video about OpenAI's self-learning hide-and-seek AI. If the students do not understand English well enough, either auto-translated subtitles can be used or the teacher can translate/explain what happens.
The main point to bring across is that in RL the AI learns by itself, without direct guidance from the programmers. The agents in the video were never told how to do things, only what they can do and what their goal is.
After the video, the next example is Leela Chess Zero (Lc0), a modern chess AI trained using reinforcement learning. While traditional chess AIs use static functions to evaluate board positions (like awarding fixed points for material differences and piece positions), Lc0 learned what good positions are by playing millions of games against itself and other players, and can therefore make better decisions about which actions to take. It is important to state that modern chess engines are far superior to human players: even the best grandmasters can be beaten using a smartphone.
The reason why games are so popular in RL research is that they have well-defined rules, which makes it easier to define the concrete actions and goals RL needs to work. This will be explained in more detail after the example slides.
As chess engines still mostly calculate moves the way they did in the late 90s, when they started to beat human players consistently, they are sometimes downplayed because of the 'simplicity' of the game. Still, RL algorithms have evolved and can nowadays go toe to toe with human players in much more complex scenarios. Therefore, the third example is AlphaStar, an AI from Google's DeepMind research department, which has learned to play a complex modern computer game at a level comparable to the best human players. Given that there are roughly 10^26 legal actions at every point in time (compared to chess with an average of fewer than 40), it is astounding that the AI can choose the best ones while adhering to the same limitations a human player has (limited vision, reaction time).
The final example is a more practical one about customized advertisement (ads). It demonstrates that RL can be used in many different ways, in this case for revenue optimization: the system learns which ads are clicked (or hovered over) most and adapts the ads displayed in the future to maximize the chance that the user will click on one.
There are many more use cases for RL, such as training self-driving cars in simulations, but these examples are a good starting point for tackling the question of how an algorithm can actually learn by itself.
To understand the basics of RL, one must know at least some definitions, which are introduced in the first few slides. Then the game TicTacToe is used as an example.
Finally, the students have to identify the agents, states, actions and rewards in the following examples. This can be approached in multiple ways, ranging from a simple group discussion to presentations in small groups; the methods page contains more ideas.
The final slides introduce the concept of Q-learning as well as the RL interaction loop. The loop is built around action and reaction: the agent interacts with the environment by choosing an action to perform. This action changes the environment, and as a result the agent receives a representation of the new state as well as a reward indicating how good the action was. Given this new information, the agent may update its behavior and then chooses a new action to perform. This cycle continues until a goal is reached.
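The interaction loop described above can be sketched in a few lines of Python. The environment and agent here are invented minimal stand-ins (a one-dimensional corridor with a goal at the end), not part of the module's material:

```python
import random

class CorridorEnv:
    """Hypothetical toy environment: positions 0..4, the goal sits at 4."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # the action changes the environment ...
        self.state = max(0, min(4, self.state + action))
        done = self.state == 4
        # ... and the agent receives the new state plus a reward back
        reward = 1 if done else 0
        return self.state, reward, done

env = CorridorEnv()
state, done = env.state, False
while not done:                              # the RL interaction loop
    action = random.choice([-1, +1])         # agent chooses an action (here: randomly)
    state, reward, done = env.step(action)   # environment reacts with new state + reward
    # a learning agent would update its behavior here based on the reward
print("goal reached at state", state)
```

A learning agent would replace the random choice with something like the Q-value lookup described on the slides.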
Q-learning is based on the RL interaction loop and stores a quality value (Q-value) for each pair of state and action. This value indicates how 'good' the action is given the current state. Therefore, whenever a state is reached, the action with the highest Q-value can be chosen for optimal behavior. As the Q-values might not be correct at the beginning (in fact, they are quite often initialized randomly for each action), they can be adapted by increasing/decreasing them according to the reward received for performing the action. When this process is repeated often enough, the Q-values reach a point where the correct action is always chosen.
As long as there are not too many state-action pairs, these values can be stored in a table where each row consists of a state and the Q-values of all available actions.
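As a sketch, such a Q-table can be a simple dictionary mapping each state to the Q-values of its actions. The update formula below is the standard Q-learning rule; the states, actions and parameter values are made up for illustration and do not come from the slides:

```python
import random

states, actions = ["A", "B"], ["left", "right"]

# initial Q-values are random, as mentioned above
q_table = {s: {a: random.random() for a in actions} for s in states}

alpha, gamma = 0.5, 0.9        # learning rate and discount factor (assumed values)

def update(state, action, reward, next_state):
    """Nudge Q(state, action) toward the reward plus the discounted future value."""
    best_next = max(q_table[next_state].values())
    target = reward + gamma * best_next
    q_table[state][action] += alpha * (target - q_table[state][action])

def best_action(state):
    """For optimal behavior, pick the action with the highest Q-value."""
    return max(q_table[state], key=q_table[state].get)

for _ in range(20):                  # repeatedly reward 'right' in state A
    update("A", "right", 1.0, "A")
print(best_action("A"))              # prints 'right'
```

Even though the table starts with random values, repeated updates make the rewarded action's Q-value the largest, exactly the convergence behavior the text describes.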
The table on slide 29 shows a small section of such a Q-table for an exemplary jump-'n'-run game. Reaching a diamond is rewarded with +1 and falling into the water with -1. The table on slide 30 shows the same situation, but with a more advanced table that includes future rewards in the Q-values. This means an action that might result in falling into the water one step later also receives a small penalty.
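The idea behind the more advanced table, that a penalty one step in the future leaks back into earlier values via a discount factor, can be illustrated with a tiny made-up example (the chain of states and the discount factor 0.9 are assumptions, not values from the slides):

```python
gamma = 0.9                    # discount factor for future rewards (assumed)

# four states in a row; moving on from the last one means falling into the water
rewards = [0, 0, 0, -1]        # immediate reward for moving forward in each state

values = [0.0] * len(rewards)
for i in reversed(range(len(rewards))):
    future = values[i + 1] if i + 1 < len(rewards) else 0.0
    values[i] = rewards[i] + gamma * future

print(values)   # the -1 penalty appears, discounted, in every earlier state
```

The earlier a state is, the smaller its share of the penalty, which matches the "small penalty" for actions that lead toward the water only indirectly.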
This process of adapting the Q-values is called training. While it is sometimes feasible to train an AI during normal use, the training process is quite often separate from the final use, as it can take millions of iterations to produce meaningful values.
After all this theory it is time to test the knowledge on real examples, which leads to the next three exercises, each building on the previous one.
This exercise provides an introduction to RL using the coin game. To help the students understand the basic concept, the slides can be used as an introduction. The game requires a large quantity of tokens of some kind (coins, cones, treats, ...), so that every pair of students has access to at least 5 of them. The game starts with 5 coins on the table; the players alternately take one or two of them away, and the player who takes the last token loses.
Introduce the game using slides 1-8 and let the students play a few matches in pairs.
Introduce the AI using slides 9-35. It is recommended to provide the material (board and actions) beforehand, so the students can see what everything looks like in reality and can actively play along with the examples.
Then let the pairs of students play, with one taking the role of the AI and the other playing normally. As the games continue, more and more actions will be removed from the AI until only the winning actions are left.
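The removal mechanic can be sketched in Python. The following assumptions are not taken from the slides: the AI moves first, the opponent plays randomly (standing in for the student), and after a loss the AI's last freely chosen action is removed:

```python
import random

# allowed actions (take 1 or 2 coins) for each possible coin count
allowed = {n: [a for a in (1, 2) if a <= n] for n in range(1, 6)}

def play_game():
    """AI moves first vs. a random opponent; returns the AI's move history on a loss, else None."""
    coins, history = 5, []
    while True:
        move = random.choice(allowed[coins])   # AI takes 1 or 2 coins
        history.append((coins, move))
        coins -= move
        if coins == 0:
            return history                     # AI took the last coin: AI lost
        coins -= random.choice([a for a in (1, 2) if a <= coins])
        if coins == 0:
            return None                        # opponent took the last coin: AI won

for _ in range(2000):                          # training: punish moves that led to a loss
    history = play_game()
    if history:
        for state, move in reversed(history):
            if len(allowed[state]) > 1:        # never remove a state's last remaining action
                allowed[state].remove(move)
                break

print(allowed[5], allowed[3], allowed[2])      # only winning actions survive training
```

After enough games the surviving actions (take 1 of 5, take 2 of 3, take 1 of 2) always leave the opponent with the last coin, which is exactly the end state the students should observe.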
Finally, talk about the following key takeaways:
This exercise builds on the knowledge of the previous one (coin game) and is based on the website https://www.stefanseegerer.de/schlag-das-krokodil/ and the game Hexapawn. In Hexapawn, two players face each other on a 3-by-3 board using only pawns from chess, which can either move one field straight forward or capture an enemy pawn diagonally. A player wins either by reaching the opposite side with one pawn or by preventing the enemy from making a move (blocking all pawns).
First, demonstrate the capabilities of the website and explain the rules of the game. The settings Response Time and Only possible moves in particular should be explained, as they help with the following tasks. Furthermore, it is vital that students understand the meaning of the coloured points (each corresponds to the same-coloured action) and see that these points can be removed or added.
Now it is time to let the students play on their own. The goal is to win as often as possible before the AI becomes unbeatable. This seems like a simple task, but students will soon realize that they need a good understanding of the inner workings to achieve more than 10 or even 20 wins. Be careful: when the site is reloaded, the win count also resets!
The key takeaways are similar to the coin game, but more pronounced, as students explicitly have to take actions they have not used before in order to force the AI into unknown territory. It can also be seen that the number of possible states increases very quickly with the number of available actions. It is easy to imagine that on a bigger board (like a chessboard) there are so many possible states (roughly 10^44) that it is not feasible to train an AI by hand, or even to include all possible states.
This exercise again builds on the previous one (Hexapawn) and introduces a game with even more states: TicTacToe. As an introduction, the first part of this video by Matt Parker about MENACE can be used. It explains the basic idea of MENACE, which is similar to the Hexapawn example, but instead of working digitally, it stores coloured beads inside physical matchboxes.
While it might be interesting to build such a system with the class, it would probably take quite some time, not only to build but also to train until it produces meaningful results. Therefore, this exercise uses an online simulator of the system.
First, introduce the MENACE system, an RL TicTacToe machine, either by showing the first ~2:20 min (or more) of this video by Matt Parker or by discussing (or working out in groups) how the previous example could be modified to fit the game of TicTacToe.
Then demonstrate how to use the simulator on the website https://www.mscroggs.co.uk/menace/. In particular, explain the meaning of the numbers and symbols on the right (states and Q-values of actions; higher = more likely to be chosen; mirrored states are excluded) and how to play games yourself or let the AI play against other AIs (like a randomly playing AI or a perfectly playing AI).
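The bead mechanic behind MENACE can be sketched as follows. The moves, bead counts and reinforcement amounts here are invented for illustration; the real MENACE numbers differ:

```python
import random

# one matchbox = one state; it holds beads for each of its (hypothetical) moves
box = {"move_a": 4, "move_b": 4, "move_c": 4}

def pick(box):
    """Draw a bead at random: moves with more beads are more likely to be chosen."""
    beads = [m for m, n in box.items() for _ in range(n)]
    return random.choice(beads)

def reinforce(box, move, won):
    """Add beads after a win, take one away after a loss (at least one bead stays)."""
    box[move] = max(1, box[move] + (3 if won else -1))

for _ in range(200):
    m = pick(box)
    reinforce(box, m, won=(m == "move_a"))   # pretend move_a is the only winning move
print(max(box, key=box.get))                  # prints 'move_a'
```

The bead counts play exactly the role of the Q-values from the earlier slides, which is why the simulator can display them as numbers.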
Given the knowledge of how to use the website, the students now have to train their own AI, making it play as well as possible.
The students will soon realize that, even if this sounds like an easy exercise, probably none of their approaches will lead to a perfectly playing AI. The main reasons are:
The technical problem behind this is exploration vs. exploitation, which will be explored further in the next exercise.
By using a maze in which a robot is rewarded for finding a battery, this exercise demonstrates that RL is not as straightforward as one might think. The exercise starts with these slides, explaining the scenario of a robot inside a maze that has to find a battery. In the beginning the robot takes random turns, and over time it learns the correct path to its goal. In the second exercise, the maze contains multiple batteries with different values, which can lead to sub-optimal results, as explained in the last few slides.
First, introduce the students to the game using slides 1-18. Then provide each student (or pair of students) with a copy of the RL Maze Basic, a die and a pen.
Then the students play multiple rounds until the path to the battery is clearly defined. To update a value, simply cross it out and write the new value next to it. Some students will be 'lucky' and the path will form very quickly, while for others the robot will enter every dead end before reaching its goal. If some students finish far too quickly, they can try again with a new sheet of the same maze and see how it behaves differently the second time.
Provide each student (or pair of students) with a copy of the RL Maze Advanced and again let them train their robot. Even though there are multiple batteries this time, the robot still returns to the start whenever it reaches a battery. The game is again over once the robot has a clear path to one of the batteries, even if it is not the one with the highest value.
When everyone is finished, different robots will have reached different scores. Now create groups of around 4 students and let them discuss why some robots scored better than others and work out a possible solution. It might help to group students with different robot scores together, so they have a better view of the problem. Each group then presents its findings and solutions.
Finally, use slides 19-26 to discuss the exploration vs. exploitation trade-off and possible solutions for this exercise.
Provide new copies of the RL Maze Advanced so that the groups can test various new approaches (like different exploration rates).
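One standard solution the groups can compare their ideas against is epsilon-greedy action selection: with probability epsilon the robot explores (random action), otherwise it exploits (best-known action). A minimal sketch with made-up Q-values:

```python
import random

def choose_action(q_values, epsilon):
    """q_values: action -> learned value; epsilon: exploration rate in [0, 1]."""
    if random.random() < epsilon:
        return random.choice(list(q_values))    # explore: try any action
    return max(q_values, key=q_values.get)      # exploit: best-known action

q = {"left": 0.2, "right": 0.8}
print(choose_action(q, epsilon=0.0))    # exploration rate 0: always 'right'
```

Different exploration rates correspond directly to different values of epsilon: a high rate keeps the robot searching for better batteries, a low rate makes it settle on the first path it finds.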
This last chapter acts as a recapitulation of the whole module. It is based on a quiz with ten true or false questions and encourages students to think about all the aspects they have learned.
The quiz itself consists of slides showing one question after another before revealing the correct answers. Students can thus write down their responses and calculate their score at the end. It is highly encouraged to pause after revealing the answer to each question to briefly discuss why it is correct or wrong. Alternatively, this exercise can also be done in groups to further encourage discussion between students.