Sessions is temporarily moving to YouTube, check out all our new videos here.

Playing Pong with Crappy Q Learning

Dirichi Ike-Njoku speaking at Ember London in June, 2017
Great talks, fired to your inbox 👌
No junk, no spam, just great talks. Unsubscribe any time.

About this talk

Teach a browser to play Pong.


- I'm Dirichi Ike-Njoku and I'll be presenting, well teaching, a few pixels on your screen to play Pong with crappy Q Learning. Does anyone want to hazard, like, a guess at what reinforcement learning is? - I'm sure you didn't get it off the slides. - Right. - Yeah, because they associates the food with the sound of the bell. - What was the name of that-- - Pavlov, yes, exactly. So, yeah, reinforcement learning borrows from that kind of thinking. That you can, the way to teach something to, the way that organisms or, like, I don't know, apparently non-living things learn too, is by trying things out and then failing or succeeding and then associating, you know, the success or the failure of the outcome with the action that he took. So reinforcement learning is essentially just like I'm running around in a room, I'm a rat or something, and I hit a lever and then the human being, who is like playing this stupid prank on me, throws in cheese. I'm like oh, okay, cool, I just found cheese when I hit the lever, that's cool. Eat the cheese, run around in the room again, look at the lever. Do I want cheese? I do. Hit it again and like that. Or, if I'm a dog, I drool when I, you know, hear bells ringing which is kind of messy. But, so yeah, that's what reinforcement learning is. It's as simple as that. Put a cat in a box, hit the lever, you know, the box open, the cat escapes. So it learns to associate the positive reward from that action with, from, well, it learns to associate the positive outcomes with the actions that it did, that we see that the outcome. Behavioural psychology gave us reinforcement learning and essentially models, specifically, machine learning models reinforcement learning as an agent, a rat, or a cat in an environment, a box with a lever or something, and a bunch of rewards which amount to like the outcomes that come from each action that he take. So if I run around for a bit in my tiny box and I don't hit the lever, there's no real outcome because I'm not, you know, escaping from the box. There's no reward. There's no cheese. Whereas if I hit the lever then there's cheese. So, that's essentially how reinforcement learning models the action of learning. States, actions, and rewards. And I guess a fancy sounding name, I told you, I suck with names but this guy's name was, like, he tacked his name right on so I wouldn't forget, so it's a Markhov Decision Process. Markhov Decision Process is essentially just like show a sequence of states and actions and rewards. So, if you see here, that's a state, S one, S zero, S two are all states. And your agent, your tiny, little rat in the box, or your tiny, little cat in the box, has a probability of transitioning from one state to another, based on the actions it takes at each state. So if I'm in S one and I take action, A one, then I have a 0.05 probability of transitioning to S two. It's not, it's not even fancy math, it's just like a common, like, a statement. If I'm in this state, how do I get to the next state with some probabilities? But yeah, but obviously there are some problems. Like it sounds really fancy but like if I was the rat, actually, rats are a lot smarter, let's use the, let's bring the example home to Pong. If I'm playing Pong, if I'm a paddle and I'm just going up and down the screen, if I hit the ball and the balls goes past my, wait, everybody knows how to play Pong, right? It's okay, you don't have to be shy. I can explain. Okay, cool. If I hit the, if I'm the paddle and I hit the ball, it could take like, I don't know, 20 refreshes of the screen or 20, like, I don't know, iterations of the game engine renderer for me to get that reward. And, in that interim, I would have taken like 20 other actions, so how do I learn to associate the positive reward from actually hitting the Pong with the action of hitting the Pong? You know what I'm saying? So there's like a temporal, like, difference. There's a time difference between the point at which you got your reward and the time at which you made the action, the critical event that gave you that reward. So that's a common problem in reinforcement learning and it's called Credit Assignment, trying to assign the rewards to the actual action. Another, you know, problem is like local minima. In general machine learning or, in this specific case, exploration versus exploitation. So like I could, as a tiny, little rat, enjoy the feel of the wood on my skin in the box. Which, you know, that's nice but I'm not getting any cheese from that so I'm probably going to starve to death. So, if I was a really, really, really, really dumb rat, I wouldn't eventually go and touch the lever and get cheese, I mean, I guess I'm not dumb, I'm just like not very willing to try new things. I'm not talking about myself, I'm talking about the rat. But yeah, I just really like the feel of the wood against my skin so I'm just going to stay there and just keep rubbing my, like that, and then I don't get cheese and I die. So, but then in the short-term, I'm getting rewards, right, in the short-term I'm getting like, you know, pleasure from having the wood against my skin. So I'm like yes, living life, living large, balling, hashtag, I don't know, living my best life or something. But then I'm not getting the cheese so I'm dying, right. So that's a common problem in reinforcement learning, how do I make sure that I explore all available, you know, possible states so that I can get the rewards. But yet, how do I not tyre myself out because the other scenario could be that the rat could just be trying every single thing, just touching every single point in the box. You have to die from that kind of thing, from that kind of, like, I don't know, concentration, like exploring every single possible state and then making sure that like there's no reward there, right. So you have to actually find a balance between exploring and exploiting what you know. And then there's a curse of dimensionality, which is, you know, a general machine learning problem where you have like massive data and like it takes, I don't know, too long to iterate all of your details that making sense of it and it's a common problem, you have like, I don't know, a table with like billions, or maybe not billions of rows. If you're Google and you're trying to index the entire internet, you know, that's a massive problem, right. It's not just limited to like the speed of the algorithm, it's even like how much detail you can even house in the first place. So that's like a problem you're also facing in machine, in reinforcement learning because, if you think about it, if you have that tiny, little rat again in the box, they are, I don't know, space is infinite, right, but you can always divide space into, like, a smaller, smaller, like half, half, half unit or something. And so, is the rat going to actually explore every single atom of space in the box just so he can make sure, like, you know, he gets cheese? That's not very expedient, right. But yeah, it's a curse of dimensionality in reinforcement learning, like how do you make sure that you explore all possible states. Actually no, not that how do you make sure that you explore all possible states, it's just states, the space of states might be so continuous that you cannot, like, you know, that it will be really difficult to explore like the entire space cause it's infinite. Enter Q Learning. The best, well the only kind of reinforcement learning that I know right now. As I said, I'm not an expert but I'm getting there. So it's a technique of reinforcement learning and it's goal is essentially to maximise long-term future rewards. I'm going to break down what future rewards means but it's most easily represented as a table. So, say I have a table of states. Now I'm bringing it back to Pong, no more rats in the box. Is it below the ball or is it above the ball? Those are two states. And then possible actions. Does it move up, does it move down? True story. This is how I hard-coded the part of the game that competes against my machine learning algorithm. In retrospect, instead of like subjecting to myself to the curse of dimensionality and exploring, like, the location of, like, the paddle in pixels or something like that and then trying to make sense of that, I could have just gone with this and this talk would have been two minutes long. But unfortunately, I'm not an expert. So but then this is how I literally hard-coded my algorithm, like, in real life, I was like oh, okay, if the ball is up, I probably want to, like, if the ball is higher than my paddle, I probably want to move up to meet it. If the ball is lower than my paddle then I probably want to move down to meet it so that it connects with the paddle and doesn't go past me and I don't lose points and I don't die because I didn't eat cheese. So but then Q, a Q Learning algorithm, I realise I've not even explained it, essentially models the action of learning as a bunch of states and a bunch of actions. You put in a reward for each possible combination of states and actions. And here's the thing, some states and actions might never even coincide. Like if I'm at the edge of the box as a rat again, I can't move, I can't move behind that, right. I can't move behind the edge of the box so that doesn't even make sense, that's not even a remote possibility. But being, you know, how do you say, the non-expert that I am, I just like, in my code, make all states and all actions like possible because you can be whatever you want to be. So here's the intuition, like, you know, broader intuition for Q Learning. If I'm in this state, I'm in the below ball state, I observe all my actions and I choose, well, this is when the algorithm is mature. I choose the action that will get me the most future rewards. By that I mean if I'm in this state, I want to look into the future and see the action I will take that will maximise my rewards in the future. So this is what I was referring to by the exploration versus exploitation problem. So I don't want to just take an action that will give me a short-term reward. So these rewards don't represent the immediate reward you get by taking this action but they represent the long-term reward that, at the end of the game, this is how much you're going to end up with. Does that make sense? Are there any questions? None? You guys are really smart, it took me a really long time to absorb this but, you know, good on you. So, I was saying Q state, Q Learning is presentable as a table, which is states and actions, and within each cell is like the future reward you're going to get at the end of the game. And so yeah, I've explained that. Then let's see, I'm surprised, I don't know what's going to come next in this. Mo' problems. How can you predict the future? Like how can I, in state zero, know what is going to happen in state N, the terminal state of the game? There's no, I mean we're not wizards, we can't, I mean we're programmers, that's all we are. Let's take a moment and let that sink in. So how can you predict the future? You can't but you can approximate the future. Which is what is really, really cool about Q Learning. So let's just take it for granted that, like, at state zero, you know that your rewards at state, well, at state zero, all the rewards you are going to get in the future are reward at zero plus reward at one plus reward at two, all the way to reward at N. And same for, well actually no, I've not explained that yet. So at each state you're in, the future rewards you're going to get is the sum of all the rewards you get at each successive state. Of course the future rewards, the future rewards that you get in like, say, state N, will probably have less and less to do with, with what you did at state zero. For example, if I read for my exams three months ago, because I'm kind of dumb, I'm going to like naturally forget stuff, right, and then at the end of my exam, I'm not going to, like, when the exams come, I'm not going to do very well. Whereas, okay this was probably a bad example, where if I read for my exam one day before the exam, I will remember some stuff, it will be fresh in my short-term memory, right. So, by that line of thinking, the rewards that you get from each successive state have less and less to do with what you did in the past. Which makes sense, right. Right? Can I get an amen? So the way Q Learning actually models the future rewards is not the sum of your rewards but is like an exponential sum of your rewards. Where this y, which is supposed to be gamma, but you know, I'm not an expert, is less than one so at r one, you probably, if your gamma is 0.8, you'd have 0.8 plus, 0.8 times r one. At r two, you'd have 0.8 squared, 0.64, 0.512, etc, all the way to the end. Oh my, okay, wow, sorry. So yeah, that's the, that's how Q Learning models the rewards that you get in the future. So now let's actually get down to the down and dirty with Q Learning. This is essentially, there's a guy called Bellman. Don't know what he looks like. But he did this. So the idea is, you remember I told you that your Q is representable as a table, right? With states and action? So what you, to learn, and this is how I guess I wrote it, to learn at each state transition from state one to state two, what you do is you get your reward, you make, actually you take a step, I'm rat, I take a step, or actually no I touch the lever and then I check for my immediate rewards. The immediate reward I get is one slice of delicious cheese, right, makes sense. But then I also want to look to the next state. The next state is me having the cheese which is, you know, a good state. And add, for Q s a, which is like an, s being the index, the state index, and a being the action index in the Q table. I will add a reward to that value that is commensurate with the best action to take from the next state. I think I need to break this down further. Okay, let's try again. Let's try again, stay with me. Let's break down, let's break both of these down. S, here, s prime represents the next state. A prime represents the next action. Max a prime represents the action that you will take in the next state that gives you the highest reward. So if I have two actions in the next state, up and down, and I'm at state a. Having taken the first action that will take me to s prime, what I'll do is I'll check all those actions that are available in s prime, in state s prime, and I'll choose the best action. Because naturally I want to choose the action that takes me, like, that gives me the highest rewards in the future. So obviously in the next state, the action that will give me the highest rewards in the future is the action that has the highest rewards in the future. So let's say I'm here. I'm at Q s a. My s is below ball, my a is, my a is move up, right. So I was in this below ball state and I choose an action called move up, right. So that means my immediate reward is 0.5. So I'm going to look to the next, and then I guess that transitions me to this state, above ball, right. So I'm going to look at the state of space, the space of actions in this above ball state and choose the one that gives me the best future reward. Which is to move down. So that the next action I'll take after moving up is moving down because it gives me the highest future rewards. So I guess I just went from here to here. And now, 0.5 plus 0.5? - One. I was thinking about you when I was making these slides. So, yeah, there we go, so at each state you take an action then you look into the future and see all the possible states, all the possible actions you can take at the next state and choose the highest one and assign the rewards from that state to your own future rewards in S zero because, oh, okay. - You never stop. I mean, if the game has a terminal state then you could stop but then like real life is real life, right, like things run forever. Oh, actually, crap, people die. I'm sorry. You can, if you stop at a point, since the goal was to have maximised all your future rewards, right, then you sum up the rewards you've gotten from each state transition, right, and then you check, like, how much could I have possibly gotten if the highest score you get in the game is 100 and at the end you get 86, then your regret is like 100 minus 86, 14. So it makes sense if it's like, if you have a terminal state. Whereas if it's, like, if your game runs forever like mine does cause you can, you know, be whatever you want to be, you can just, like, the way to validate your algorithm is working is to observe that, like, from each successive state transition, your rewards are increasing. Or at least the rate of your accumulation of your rewards is also increasing so it's like acceleration. So yeah, we've kind of gotten the intuition for how Q Learning picks its next states. So, of course, I just, what I just said was, everything I just said for the last three minutes was wrong. Bear with me because that's like the intuition, like, you want to be able to be, like, from in this state transition from state zero to state one, just choose the action that gives me the maximum future rewards. Can anyone guess why that might not, or in what instances that might not work? - Exactly. So the problem becomes this stops working when you don't, as they said, when you don't have any experience or when you're, you've not, like, visited that state, like, one million times and known for sure that, like, and tried out all the actions in that state and known for sure that this action is always the best or maybe 86% of the time it's the best, right. So, if you're not experienced and, obviously, when you start the game, you're not going to be experienced, you don't have any knowledge whatsoever. So all your states and actions, they are not visited. You probably have like zero zero zero zero zero in your table or like in my case, random random random random random because you can do whatever you want to do. So you can't always depend on this to work. Although it kind of works, in the long-term it converges. Somebody, whose name I forgot, did a math, I don't know loads of people, to prove that, like, in the long-term it eventually converges. Your approximations of future rewards eventually converge with what the actual future rewards are, at some point in infinity. But, as James said, the problem is exploration versus exploitation. It's a cold start because you don't know anything once you're introduced as an agent into the environment. So you don't have any idea what your actions will give you. At best you have random guesses which don't really amount to anything. So one policy you could actually take is, at the beginning of the game, with a certain probability, maybe 70% of the time, try something random, you know, do something exciting. Go out with the friends when you have an exams, when you have exams tomorrow, and see if that, you know, gives you any, like, higher scores at the end of the day. Or like, I don't know, break something that your mom told you not to break and see if that gives you any rewards. Just try random stuff. These are bad examples by the way. But, in case anyone was, you know, thinking about breaking stuff. Try random stuff, see what happens with, like, say 70% probability and then, the rest of the time, actually follow your intuition, you know, follow your gut feeling. You know, if your gut tells you to study for your exams, maybe you should do that. Or if you don't want to study for your exams, that's cool too. So, now I've given you guys the intuition for how Q Learning works and I don't know if you have any questions about it you can feel free to ask me questions at the end of the talk. But one of the other challenges that we've not solved is state space, is the fact that, like, you have, I don't know how many 1084 by, I don't know, maybe 720 pixels on your screen. And to take that as a representation of space and feed that into the algorithm, I mean that's not the most, and then populate each, you know, 1084 times 720 by two, because right now we only have two actions, up and down, and populate each of them. Visit each state and then populate based on your experience. It's probably going to take you one million years to converge, don't ask me, I'm not an expert. But, yeah, so, and this is actually anecdotal, like I was doing just what I said you guys should not do right now and trying to use, well not pixels but like large aggregations of pixels to determine where my paddles are located and where the ball is located. And feeding that off to a database so that whenever I had to restart, I won't have to restart from scratch. I would have, you know, I will pull in learning that I already have from the past and, like, try again. I just found out today, I think one hour ago, when I was about to give this talk, that my, I was using CouchDB. CouchDB can only take in, at least my CouchDB on my computer, maybe it's not an expert, can only take in about 34,000 rows and I wanted to get 37,000 rows, which represent all the states so, I really burst my bubble but no worries. So maybe, ways that people mediate that is using neural networks or any other kind of algorithm that you could use to, like, approximate. Because let's, you know, try the intuition again. If I'm using Q Learning, back in the box, tiny, little rat looking for cheese, we have four, you know, four, four grids inside the box. If I'm in grid one and that's where the lever is and I touch the lever, then I know that that state is a good state to be at and all actions that lead me to that state are good. But if I'm in grid two which is close enough to grid one that I can touch the lever from there, I'm not going to, having known that grid one is close to grid two, I'm not going to know enough to associate grid two with a good, you know, with good rewards. Like I'm going to have to actually physically go visit that state, visit that grid, and then I can tell oh yeah, maybe this is a good state to be, maybe I can reach the lever from here. So, in a way, Q Learning is smart but Q Learning is also dumb because if you don't use, if you use it well, Q Learning with a table, if you used a table, you're going to have to visit every single possible combination of state and action to be able to make sense of the rewards that you get. Whereas if you used a neural network, you could, rather than visiting each state, you could guess at the rewards you get from each state and each action. One thing that was really cool was when I made, when I did this test, I know you guys are getting tired, the demo is coming, just give me two minutes. I had, I have, as I set out 37,000 rows and two columns. Well, actually no, 37,000 columns and two rows. What's going on? Yes, yes, 37,000 columns and two rows. So that's like massive data, right, that you can train, like, an algorithm on, like simple linear regression algorithm that can give you some intuition that can tell you what to do in states that you've not visited before. You know what I'm saying? I mean, if you guys are familiar with gradient descent with the way you do function approximation with linear regression, you just, like, even if I've not, I wish I had a slide that showed a graph with like dots on it. So there'll be some states that I've not visited or they are, if the graph was this way, it might be a state who I initially populated with, like, a random action but because I can, like, draw a straight line across that state and that action, then I can guess that my rewards assigned randomly initially to that state were wrong and I can better use a linear regression algorithm to choose, to make the right decision. So it would just like simplify and make me not have to visit each state physically to actually know the right rewards. I hope that makes sense. Even logistic regression, man, you don't even have to move very, well actually, logistic regression, you don't have to go that far. At least I don't think, I've not tried it yet, but I don't think, I think you can still get pretty good results with logistic regression as against a full on neural network. But, like, I think that will be the next iteration of this game, if I work on it. So, yeah, that's it, resources, I depended heavily on this article. It was an amazing article, I had to read it over and over again to actually get, even though it makes Q Learning, like, sound very simple, I had to read it over and over again to get what it meant. And then Andrew Ng on with his Machine Learning course. Really cool guy, makes things sound very simple even if they are complex, which is the kind of stuff I like. So some of that helped me to be less scared when I was doing my research for Q Learning with that article. If anybody wants to sue me, I have the image credits here, so don't sue me. So, that's it and I will just show you, like, a really quick demo. This is where it will start from. This is a fairly dumb Q Learning algorithm. And see I set the brain.Epilson to zero. This is, now it's no longer acting as randomly. It's a little better. But if we stop that and try and open something I did before. Here we go. And this is not even the best I've had, to be honest, like, but this was the only one I had the sense to record, thank God I did. But it looks like it has something of a coherent policy that follows the ball, even if the ball is far away from it and this is not hard-coded, this is just telling it hey, if the ball is behind you, take minus five from your rewards, or if you hit the ball, add plus two to your rewards. And then, as you can see, it had to get it wrong about 4852 times for it to actually be like oh okay, there might be, you know, there might be some, you know, science to this madness, to this world, this cruel world that I've been introduced to. To be punished. But, yeah, so that's pretty much it. Thank you guys for, you know, listening to me, taking your time to listen to a non-expert. Feel free to fire questions away. I think it starts now. Yes, Mr Willis? - Oh, okay, cool. The code is very messy cause, as I've said, non-expert. But see, this is like what ideally I would be able to, like, grab from my code and put into, like, another application that requires, like, a learning algorithm. I tried to stay true to the math but, like, you know, you know how life is. So Q s a would be the Q at state a and action a so old states, so this is a transition. I'm transitioning from this old state to this new state and I want to reward my current states with the future rewards from the next state. So this is where I've multiplied by gamma which is like a help discount the future of the rewards in the future. And then alpha, which I didn't show you in the equation, is what essentially mediates how randomly I act. Right, I think? Let me check it again. Alpha, no, no, alpha is the moving average, sorry. Let me explain that. So, I don't, I want to, I want, I don't want to just, like, absorb all the rewards from the future states, I want to also preserve some of the rewards that I have observed. Because, like, say if I'm, having reached this stage, I have made 1000 recordings of what should be in Q at state s and a. If I just get one more, what's it called? One more sampling and I see into the future for Q s prime and a prime, I don't want this one more sampling to outweigh the 1000 samplings I've had in the past, right. So I have to do a moving average where I say if you move this Q s a here, you have Q s a times one minus this.alpha. And I have Q prime s prime a times this.alpha. Sorry, I know that doesn't make a lot of sense. I will, wish I could do more for you guys. I will probably post, I don't know in Slack, like, the lengths of the article where he actually breaks stuff down. So this would be like, this is what happens in a state transition. And then what actually, how you actually learn is that... Oh yeah, here we go. So you, action, you select an action. Selecting an action is either random or the best action based on some variable. Then you perform the action, then you get the new states, you get the reward, and then you pass them into the function I just showed you, which is your Bellman updates. And then you set your states as the new states. Sorry. And then you decay Epilson. Epilson, by the way, is the rates, is the exploration versus exploitation mediation variable. So I think that's pretty much, that's the code, that's the main, you know, that's the main thing. The rest of it, the game is messy but I think I've tried to came very, stay very true to the way the equation is presented. Thank you very much.