If supervised learning is target practice, and unsupervised learning is art, then what is reinforcement learning? Well, drawing upon our dart throwing example once more, it’s like throwing darts blindfolded.
In supervised learning, all the data points have target values ( and sometimes labels ). In unsupervised learning, none of the data have any target values ( nor labels ). Reinforcement learning is sorta somewhere in between. Now, I don’t mean that some data points have target values while others don’t. That would actually be semi-supervised learning. The approach you take with these kinds of problems would be a hybrid of the techniques you normally use in supervised learning and in unsupervised learning.
No, in reinforcement learning, the data points do have target values ( and sometimes labels ), but you’re not shown what they are. Instead, you’re given a summary. Let’s go back to the dart throwing example to see what I mean.
The Blind Leading The Blind
You can feel and smell the wet grass around you. You realize you’re lying in a prairie somewhere. You open your eyes and are treated to a wondrous sight of a cloudless, azure sky — which unfortunately is partially blocked by a short, dumpy, balding, elderly gentleman wearing a magician’s robe and peering down at you. Do you:
- punch him in the face and try to get as far away as possible as fast as you can
- say, “Whazzappppp!”
- shut your eyes and play ‘possum, hoping he’ll think you’re dead and walk away
- cast the all-powerful Spell of Evedor—
“We don’t have time for Zork!” the elderly gentleman rudely interrupts. “Welcome to the realm of Darts & Dartboards! A magical world where mystery and adventure await! I am the Dartmaster, and I shall be your guide.”
You sit up. “Oh, I get it! I’m in a massively multiplayer online role-playing game!”
Dartmaster chortles, “Ha! I wish! We don’t have the budget for that kind of thing. In this realm, we rely on a millennia-old technology called imagination.”
“Ehh. I’d rather have huge data centers do the imagining for me.”
“Tough. You’re stuck with your own imagination,” he retorts. “The reason you’re here is because we need your help.”
“Oh?”
“Yes. The High Council of Exalted Dart Throwers were deeply impressed with your participation trophy. They teleported you here because they believe you are the only one who can save our world.”
You lean forward, intrigued. “Go on.”
“You see, this place can only exist when the darts and the dartboards are in perfect balance. If they are even off slightly, then we tip inevitably to our doom,” Dartmaster reveals. “Unfortunately, some dark energy recently passed through our lands, causing the dartboards to grow like crazy. With the balance completely out of whack, we are on the brink of world-wide destruction. To restore that balance, a noble warrior — we believe it’s you — must complete a difficult challenge.”
“What sort of challenge?”
“Dart throwing. But, it’s not like any that you’ve done before. For, in this world, one can only throw darts while blindfolded.”
As if on cue, a dart whizzes by your nose. Dartmaster turns and yells, “HEY, DIMWAD! WATCH WHERE YOU THROW THAT THING!”
The dart thrower off in the distance yells back, “I CAN’T! I’M BLINDFOLDED!”
Dartmaster turns back to you. “So, will you accept the challenge?”
You think about it for a moment. “Oh, what the hell. YOLO.” You get up. “My mad dart-throwing skillz can handle whatever’s in store. Bring it on, yo!”
Dartmaster is ecstatic. He grabs your hand and pulls you towards an opening at the foot of a grassy hill. You begin to have second thoughts and try to balk, but Dartmaster is surprisingly strong, and he drags you into the opening. You’re taken down a flight of stairs into a lit passageway. He leads you past door after nondescript door, until finally he stops in front of one. “Here we are! You’re going to tackle the challenge in this room,” Dartmaster utters.
“Uh, can we do this somewhere else, instead of a dungeon?” you ask.
“I don’t make the rules, kiddo,” he replies.
“But, you’re the Dartmaster … .”
“You’re thinking of the other short, dumpy fella. Dartmasters aren’t nearly as powerful. We’re just a step above fourth-level goblins.” He whips out a blindfold and wraps it around your head, over your eyes. You can hear him open the creaky door.
As you feel your way into the room, Dartmaster tells you the rules.
“You shall have a total of a hundred darts to throw. I will give them to you in batches of ten. After you exhaust each set, I will tell you the following:
- the number of darts that hit the bullseye
- the number of darts that hit the dartboard but not the bullseye
- the number of darts that missed the dartboard entirely
“Your objective is to hit the bullseye as often as possible. Hitting the dartboard but missing the bullseye isn’t as good but still okay.
“Please keep in mind that there may be more than one dartboard. There’s a wall in front of you, another to the left of you, and a third to the right of you. There are also the floor and the ceiling. Each one of these surfaces can have none, one, or multiple dartboards.
“Some dartboards may be bigger than others. Some may have larger bullseyes. It’s up to you to figure all this out.”
He guides you to the oche and hands you the first batch of darts. “You can begin whenever you’re ready. Good luck — the fate of Darts & Dartboards rests on your exponentiated shoulder angles.”
You take a deep breath … .
- If you’re ready to begin the challenge, turn to page 134.
- If you’re not ready to begin the challenge, then take another deep breath before turning to page 134.
Page 134
You decide to fling all of your darts at the wall in front of you. Dartmaster disappointedly declares that none of your darts hit anything, and then hands you the next batch.
TM( ) = ?
BULLSEYE → ( predicted ) target variable is less than or equal to ¾ inch
DARTBOARD → ( predicted ) target variable is greater than ¾ inch, but less than or equal to 13¼ inches
NOT_DARTBOARD → ( predicted ) target variable is greater than 13¼ inches
Practice Throw | Horizontal Shoulder Angle | Vertical Shoulder Angle | TM( ) Used | Predicted Landing (Predicted Target Variable) | Predicted Label | Where Dart Lands (Target Variable) | Label | Error |
---|---|---|---|---|---|---|---|---|
pt1 | 14 | 3 | front wall | ? | ? | ? | NOT_DARTBOARD | ? |
pt2 | 12 | 25 | front wall | ? | ? | ? | NOT_DARTBOARD | ? |
pt3 | -20 | 14 | front wall | ? | ? | ? | NOT_DARTBOARD | ? |
pt4 | -16 | -2 | front wall | ? | ? | ? | NOT_DARTBOARD | ? |
pt5 | -2 | 3 | front wall | ? | ? | ? | NOT_DARTBOARD | ? |
pt6 | 9 | -30 | front wall | ? | ? | ? | NOT_DARTBOARD | ? |
pt7 | -17 | 2 | front wall | ? | ? | ? | NOT_DARTBOARD | ? |
pt8 | 1 | 0 | front wall | ? | ? | ? | NOT_DARTBOARD | ? |
pt9 | 1 | -1 | front wall | ? | ? | ? | NOT_DARTBOARD | ? |
pt10 | -2 | 1 | front wall | ? | ? | ? | NOT_DARTBOARD | ? |
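To make the Dartmaster's bookkeeping concrete, here is a minimal sketch of how the three labels and the batch summary could be computed. It assumes the target variable is a distance in inches (the thresholds above suggest as much), and the function names and the sample distances are made up for illustration. The point is that you, the blindfolded thrower, only ever see the counts, never the individual distances.

```python
from collections import Counter

BULLSEYE_MAX_IN = 0.75     # 3/4 inch, from the thresholds above
DARTBOARD_MAX_IN = 13.25   # 13 1/4 inches, from the thresholds above
LABELS = ("BULLSEYE", "DARTBOARD", "NOT_DARTBOARD")


def label_throw(distance_in: float) -> str:
    """Map a landing distance (assumed to be in inches) to one of the three labels."""
    if distance_in <= BULLSEYE_MAX_IN:
        return "BULLSEYE"
    if distance_in <= DARTBOARD_MAX_IN:
        return "DARTBOARD"
    return "NOT_DARTBOARD"


def summarize_batch(distances_in):
    """Return only the per-label counts, hiding the individual throws."""
    counts = Counter(label_throw(d) for d in distances_in)
    return {name: counts[name] for name in LABELS}


# Hypothetical distances for one batch of ten; the story never reveals these.
batch = [20.0, 31.5, 18.2, 40.0, 14.0, 55.1, 27.3, 16.8, 22.9, 19.4]
print(summarize_batch(batch))
# {'BULLSEYE': 0, 'DARTBOARD': 0, 'NOT_DARTBOARD': 10}
```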
You contemplate your next move. Could it be that you didn’t hit anything because your aim was terrible? Or could it be that there isn’t any dartboard there at all? If you believe your aim was at fault, then you can continue throwing at the front wall. If, on the other hand, you suspect that there is no dartboard, then you can try a different wall, or the floor, or the ceiling.
You decide to continue throwing at the front wall. After you’ve thrown the entire second batch, Dartmaster indicates that one of your darts did indeed hit a dartboard. In fact, it hit the bullseye!
TM( ) = ?
BULLSEYE → ( predicted ) target variable is less than or equal to ¾ inch
DARTBOARD → ( predicted ) target variable is greater than ¾ inch, but less than or equal to 13¼ inches
NOT_DARTBOARD → ( predicted ) target variable is greater than 13¼ inches
Practice Throw | Horizontal Shoulder Angle | Vertical Shoulder Angle | TM( ) Used | BULLSEYE | DARTBOARD | NOT_DARTBOARD |
---|---|---|---|---|---|---|
pts[1 to 10] | [-20 to 14] | [-13 to 25] | front wall | 0 | 0 | 10 |
pts[11 to 20] | [-17 to -9] | [-16 to -8] | front wall | 1 | 0 | 9 |
So now, knowing that there is a dartboard in front of you, do you concentrate all your remaining darts there and try to hit the bullseye as many times as you can? Or, do you explore and try a different location?
Just because the front wall has a dartboard does not mean it will give you the best outcome. What if the other surfaces have better arrangements that would enable you to garner many more hits? What if the ceiling has three dartboards? Or, a giant dartboard covers the entire floor, which would mean you would never miss? Or, a single bullseye covers the entire right wall?
You choose to explore. In order to gain more information, you will need to sacrifice some darts. But you need to be careful not to use too many, or you will end up wasting a lot.
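If you'd like to play with this trade-off yourself, here is a rough simulation sketch. Everything in it is an assumption: the hit chances in `ROOM`, the scoring in `REWARD`, and the simple strategy of trying surfaces in turn for a few batches and then always picking whichever one has the best average so far. It is one naive way to split darts between exploring and exploiting, not the "right" one.

```python
import random

# Hypothetical per-dart chances of (bullseye, dartboard-only hit) for each surface.
# The story never reveals these; they exist only to make the sketch runnable.
ROOM = {
    "front wall": (0.05, 0.05),
    "left wall": (0.25, 0.35),
    "right wall": (0.00, 0.00),
    "floor": (0.00, 0.00),
    "ceiling": (0.10, 0.10),
}
BATCH_SIZE = 10
REWARD = {"bullseye": 2, "dartboard": 1}  # assumed scoring, not from the story


def throw_batch(surface, rng):
    """Simulate ten blindfolded throws and return only the summary counts."""
    p_bull, p_board = ROOM[surface]
    bull = board = 0
    for _ in range(BATCH_SIZE):
        r = rng.random()
        if r < p_bull:
            bull += 1
        elif r < p_bull + p_board:
            board += 1
    return bull, board


def explore_then_exploit(explore_batches, total_batches=10, seed=0):
    """Cycle through surfaces while exploring, then pick the best average so far."""
    rng = random.Random(seed)
    surfaces = list(ROOM)
    score = {s: 0 for s in surfaces}
    thrown = {s: 0 for s in surfaces}
    total = 0
    for batch in range(total_batches):
        tried = [s for s in surfaces if thrown[s] > 0]
        if batch < explore_batches or not tried:
            surface = surfaces[batch % len(surfaces)]  # sacrifice a batch to learn
        else:
            surface = max(tried, key=lambda s: score[s] / thrown[s])  # cash in
        bull, board = throw_batch(surface, rng)
        gained = REWARD["bullseye"] * bull + REWARD["dartboard"] * board
        score[surface] += gained
        thrown[surface] += BATCH_SIZE
        total += gained
    return total, thrown


# Compare spending one, three, or five batches on exploration.
for n in (1, 3, 5):
    total, thrown = explore_then_exploit(n)
    print(n, total, thrown)
```

Explore too little and you may never find the good wall; explore too much and you burn darts on empty surfaces.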
You throw your next batch at the left wall. To your surprise, Dartmaster excitedly announces that you’ve hit a dartboard seven times, and two of them were on the bullseye! What a fortunate turn of events! Had you continued throwing at the front wall, you probably would not have gotten as good a haul.
TM( ) = ?
BULLSEYE → ( predicted ) target variable is less than or equal to ¾ inch
DARTBOARD → ( predicted ) target variable is greater than ¾ inch, but less than or equal to 13¼ inches
NOT_DARTBOARD → ( predicted ) target variable is greater than 13¼ inches
Practice Throw | Horizontal Shoulder Angle | Vertical Shoulder Angle | TM( ) Used | BULLSEYE | DARTBOARD | NOT_DARTBOARD |
---|---|---|---|---|---|---|
pts[1 to 10] | [-20 to 14] | [-13 to 25] | front wall | 0 | 0 | 10 |
pts[11 to 20] | [-17 to -9] | [-16 to -8] | front wall | 1 | 0 | 9 |
pts[21 to 30] | [80 to 90] | [15 to -11] | left wall | 2 | 5 | 3 |
Okay, you decide to continue exploring and try the floor next. This time, he dispiritedly tells you that none of your throws hit anything.
TM( ) = ?
BULLSEYE → ( predicted ) target variable is less than or equal to ¾ inch
DARTBOARD → ( predicted ) target variable is greater than ¾ inch, but less than or equal to 13¼ inches
NOT_DARTBOARD → ( predicted ) target variable is greater than 13¼ inches
Practice Throw | Horizontal Shoulder Angle | Vertical Shoulder Angle | TM( ) Used | BULLSEYE | DARTBOARD | NOT_DARTBOARD |
---|---|---|---|---|---|---|
pts[1 to 10] | [-20 to 14] | [-13 to 25] | front wall | 0 | 0 | 10 |
pts[11 to 20] | [-17 to -9] | [-16 to -8] | front wall | 1 | 0 | 9 |
pts[21 to 30] | [80 to 90] | [15 to -11] | left wall | 2 | 5 | 3 |
pts[31 to 40] | [-17 to 16] | [-75 to -90] | floor | 0 | 0 | 10 |
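With only batch summaries to go on, about the best you can do is keep a running average for each surface you've tried. Here is a small sketch using the four rows above. The point values are my own assumption, since Dartmaster only said that a bullseye beats a plain dartboard hit.

```python
# Batch summaries after forty darts, copied from the table above:
# (surface, bullseyes, dartboard-only hits, misses)
batches = [
    ("front wall", 0, 0, 10),
    ("front wall", 1, 0, 9),
    ("left wall", 2, 5, 3),
    ("floor", 0, 0, 10),
]

# Assumed scoring; the story only says a bullseye beats a plain dartboard hit.
BULLSEYE_POINTS = 2
DARTBOARD_POINTS = 1


def average_points_per_dart(batches):
    """Estimate each surface's value from the summaries seen so far."""
    points = {}
    darts = {}
    for surface, bull, board, miss in batches:
        gained = BULLSEYE_POINTS * bull + DARTBOARD_POINTS * board
        points[surface] = points.get(surface, 0) + gained
        darts[surface] = darts.get(surface, 0) + bull + board + miss
    return {s: points[s] / darts[s] for s in points}


estimates = average_points_per_dart(batches)
best = max(estimates, key=estimates.get)
print(estimates)  # {'front wall': 0.1, 'left wall': 0.9, 'floor': 0.0}
print(best)       # left wall
```

Notice what the estimates don't say: the right wall and the ceiling aren't in the dictionary at all, because you haven't spent a single dart on them.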
At this point, you’ve already used forty darts. It’s getting late early. Should you continue exploring, or do you play it safe?
Because the left wall offers a potentially high yield, you decide to stop exploring and concentrate your remaining darts there. Here are the final results of all your throws.
TM( ) = ?
BULLSEYE → ( predicted ) target variable is less than or equal to ¾ inch
DARTBOARD → ( predicted ) target variable is greater than ¾ inch, but less than or equal to 13¼ inches
NOT_DARTBOARD → ( predicted ) target variable is greater than 13¼ inches
Practice Throw | Horizontal Shoulder Angle | Vertical Shoulder Angle | TM( ) Used | BULLSEYE | DARTBOARD | NOT_DARTBOARD |
---|---|---|---|---|---|---|
pts[1 to 10] | [-20 to 14] | [-13 to 25] | front wall | 0 | 0 | 10 |
pts[11 to 20] | [-17 to -9] | [-16 to -8] | front wall | 1 | 0 | 9 |
pts[21 to 30] | [80 to 90] | [15 to -11] | left wall | 2 | 5 | 3 |
pts[31 to 40] | [-17 to 16] | [-75 to -90] | floor | 0 | 0 | 10 |
pts[41 to 50] | [80 to 90] | [15 to -11] | left wall | 3 | 1 | 6 |
pts[51 to 60] | [80 to 90] | [15 to -11] | left wall | 2 | 3 | 5 |
pts[61 to 70] | [80 to 90] | [15 to -11] | left wall | 1 | 4 | 5 |
pts[71 to 80] | [80 to 90] | [15 to -11] | left wall | 4 | 2 | 4 |
pts[81 to 90] | [80 to 90] | [15 to -11] | left wall | 2 | 7 | 1 |
pts[91 to 100] | [80 to 90] | [15 to -11] | left wall | 3 | 4 | 3 |
You’ve hit the bullseye a total of 18 times, and the dartboard-sans-bullseye an additional 26 times. Pretty impressive!
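If you don't trust my arithmetic, the totals can be checked by summing the columns of the final table. A quick sketch (the variable names are mine):

```python
# Final batch summaries from the table above:
# (surface, bullseyes, dartboard-only hits, misses)
results = [
    ("front wall", 0, 0, 10),
    ("front wall", 1, 0, 9),
    ("left wall", 2, 5, 3),
    ("floor", 0, 0, 10),
    ("left wall", 3, 1, 6),
    ("left wall", 2, 3, 5),
    ("left wall", 1, 4, 5),
    ("left wall", 4, 2, 4),
    ("left wall", 2, 7, 1),
    ("left wall", 3, 4, 3),
]

total_bullseyes = sum(bull for _, bull, _, _ in results)
total_dartboard = sum(board for _, _, board, _ in results)
total_misses = sum(miss for _, _, _, miss in results)

print(total_bullseyes, total_dartboard, total_misses)  # 18 26 56
```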
But the question now is, could you have done better? At each point where you had a decision to make, if you had chosen differently, could you have potentially gained more?
In supervised learning, since the information about each data point is transparent and complete, you can be reasonably certain that you’ve found the best solution. In reinforcement learning, however, the information you get is always opaque and incomplete, so you’re never certain that what you’ve done is correct. You’ll always wonder whether you could have done better.
With supervised learning, the idea of finite resources is implied but never explicitly stated. After all, if you have an infinite training dataset, what’s the point of training? It will take you forever just to get through the set. With reinforcement learning, however, this finiteness is front and center. From the limited number of throws you can make, to the limited information you’re given. Everything you do comes with a price. Each additional piece of information requires you to pay in some way. Every action you take has a consequence that can positively or negatively impact you down the road. Scarcity, resource allocation, exploration, and regret are huge issues.
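Regret can even be given a rough number. The sketch below asks: what if you had thrown all hundred darts at the best surface you found? It leans on two loud assumptions: the point values are invented, and the left wall's observed average stands in for its true (unknown) quality. A real regret figure would compare against the truly best surface, which could be the right wall or the ceiling that you never tried.

```python
# Totals per surface, summed from the final results table:
# (darts thrown, bullseyes, dartboard-only hits)
observed = {
    "front wall": (20, 1, 0),
    "left wall": (70, 17, 26),
    "floor": (10, 0, 0),
}

BULLSEYE_POINTS = 2   # assumed scoring, not from the story
DARTBOARD_POINTS = 1  # assumed scoring, not from the story


def points(bull, board):
    """Score a tally of bullseyes and dartboard-only hits."""
    return BULLSEYE_POINTS * bull + DARTBOARD_POINTS * board


actual = sum(points(bull, board) for _, bull, board in observed.values())
rate = {s: points(bull, board) / n for s, (n, bull, board) in observed.items()}
best_surface = max(rate, key=rate.get)
total_darts = sum(n for n, _, _ in observed.values())

# Hindsight score if every dart had gone to the best surface seen,
# assuming its observed average would have held up.
hindsight = rate[best_surface] * total_darts
estimated_regret = hindsight - actual

print(best_surface, actual, round(estimated_regret, 1))  # left wall 62 23.7
```

Those thirty darts spent on the front wall and the floor are the price you paid for information.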
There's No Place Like Home
Dartmaster takes your blindfold off. He’s grinning from ear-to-ear. “You’ve succeeded! You did just enough to completely restore the balance to our world! We’re saved! You have the eternal gratitude of all the non-playable characters here.”
A beautiful butterfly starts fluttering around you. Out of nowhere, an adorable bunny rabbit hops into your arms and gives you a big hug.
Dartmaster waves his hand. “And now, it’s time for you to return home.” You start to feel dizzy. Everything fades to black ….
You wake up to the smell of your own vomit. You look around and realize that you’re lying on the floor of your local bar. Your friends are nowhere to be seen. Your head is pounding like crazy. “It was all just a dream!” you mutter to yourself.
You then notice a beautiful butterfly fluttering around you. Was it really a dream? Or did it actually happen?
“No, it was definitely a dream,” you respond. “Look, there’s Bobby Ewing taking a shower in one of the bathroom stalls.”
Strategery!
Alright, I’ll admit, the dart throwing example above is a bit contrived. That’s because dart throwing is inherently a supervised learning problem that I’ve tried to morph into a reinforcement learning one.
A less contrived example would be the game Battleship. Using dart throwing as an analogy, this game would be the equivalent of you still being blindfolded, but your opponent gives you the labels ( “HIT!”, or “MISS!” ) after each throw, instead of after every ten.
In addition, if you’ve ever played 4X strategy games, then you’ve dealt with reinforcement learning. In most 4X games, there are generally four phases:
- The first is when you build as many scout ships as you can to explore the region around you.
- Once you have sufficient information about where the best resources are, you build as many colony ships as quickly as you can to claim those resources.
- Next, you build resource-extracting facilities to mine as much of those resources as you’re able.
- You then spend the rest of the game defending your assets from your opponents.
For more details on DeepMind’s A.I. system, AlphaStar, and how it was able to challenge top human players in the online real-time strategy game StarCraft II, check out this fascinating article in Nature.
One of a Thousand Regrets
The fact is, most of the problems that you deal with in life are reinforcement learning problems. If you’ve ever had regrets, or used phrases like, “had I known then what I know now”, or “woulda, coulda, shoulda”, or “I didn’t realize the unintended consequences,” then you went and done reinforcement learning.
If you are stuck in a rut and wish you were doing something else, or you’re trying to decide between going back to school to acquire new skills versus sticking with your current job that you absolutely hate, then you are engaged in reinforcement learning right now.
You can even say that reinforcement learning is the embodiment of the human condition. For it is rare to live a life without any regrets, especially if you’ve lived for centuries.
This is why machine learning systems adept at tackling reinforcement learning problems can potentially have a much greater impact on society than systems designed to deal with other types of learning tasks.
Footnote
For more information on how the human brain is a reinforcement learning engine, check out the following: