And this happens all the time. There are always a few images in every batch that are corrupted for whatever reason; Google Image Search told us that this URL had an image, but actually it doesn't any more. So we've got this thing in the library called verify_images(), which will check all of the images in a path and tell you if there's a problem, and if you say delete=True, it will actually delete them for you. So that's a really nice, easy way to end up with a clean data set. At this point I now have a bears folder containing a /grizzly folder, a /teddys folder and a /black folder. In other words, I have the basic structure we need to create an ImageDataBunch and start doing some deep learning, so let's go ahead and do that. Very often when you download a data set from, say, Kaggle or from some academic source, there will be a folder called /train, a folder called /valid and a folder called /test containing the different data sets. In this case, we don't have a separate validation set, because we just grabbed these images from Google Search. But you still need a validation set; otherwise you don't know how well your model is going, and we'll talk more about this in a moment. So whenever you create a data bunch, if you don't have a separate training and validation set, you can just say: the training set is in the current folder (because by default it looks in a folder called /train), and please set aside 20 percent of the data. This will create a validation set for you automatically and randomly. You'll see that whenever I create a validation set randomly, I always set my random seed to something fixed beforehand. This means that every time I run this code, I'll get the same validation set. In general, I'm not a fan of making my machine learning experiments reproducible, i.e. ensuring I get exactly the same result every time.
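The seed trick described above can be illustrated with a minimal plain-Python sketch (the function name and the split logic here are hypothetical stand-ins for what the library does when you ask for a random 20% validation split after fixing the seed):

```python
import random

def split_indices(n, valid_pct=0.2, seed=42):
    """Randomly set aside valid_pct of n items as a validation set.
    Fixing the seed beforehand makes the split identical on every run."""
    rng = random.Random(seed)                 # seeded RNG, fixed before splitting
    idxs = list(range(n))
    rng.shuffle(idxs)
    n_valid = int(n * valid_pct)
    return idxs[n_valid:], idxs[:n_valid]     # train indices, valid indices

train1, valid1 = split_indices(100)
train2, valid2 = split_indices(100)
assert valid1 == valid2                       # same seed -> same validation set
assert len(valid1) == 20 and len(train1) == 80
```

The point is not reproducibility for its own sake: only the split is pinned down, so every hyperparameter experiment is scored against the same held-out images.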
The randomness is, to me, a really important part of finding out: is your solution stable? Is it going to work each time you run it? What is important is that you always have the same validation set. Otherwise, when you're trying to decide whether a hyperparameter change has improved your model, if you've got a different set of data you're testing it on, you don't know; maybe that set of data just happens to be a bit easier. So that's why I always set the random seed here. So we've now got (let's run that cell) a data bunch, and you can look inside at data.classes and you'll see these are the folders that we created. So it knows the classes; by "classes" we mean all the possible labels: black bear, grizzly bear or teddy bear. We can run show_batch() and take a little look, and it tells us straight away that some of these are going to be a little bit tricky. This is not a photo, for instance; some of them are cropped funny; some might be genuinely tricky, like if you ended up with a black bear standing on top of a grizzly bear, that might be tough. Anyway, you can double-check here: data.classes, and here they are. Remember, .c is the attribute which, for classifiers, tells us how many possible labels there are; we'll learn about some other, more specific meanings of .c later. We can see how many things are in our training set and how many are in our validation set: 473 in the training set, 141 in the validation set. At that point we can go ahead; you'll see all these commands are identical to the pet classifier from last week. We can create our CNN (convolutional neural network) using that data. I tend to default to a resnet34; let's print out the error rate each time, run fit_one_cycle() for four epochs, and see how we go. And we have a 2% error rate, so that's pretty good.
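The bookkeeping behind data.classes and .c can be sketched in a few lines of plain Python (the variable names here are hypothetical; the idea is just that the folder names become the labels and .c counts the distinct ones):

```python
# One label per image, taken from the folder each file sits in.
labels = ["black", "grizzly", "teddys", "grizzly", "black"]

classes = sorted(set(labels))   # the distinct possible labels, like data.classes
c = len(classes)                # how many possible labels, like data.c

print(classes, c)               # ['black', 'grizzly', 'teddys'] 3
```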
Personally, sometimes it's easy for me to recognize a black bear from a grizzly bear, but sometimes it's a bit tricky; this one seems to be doing pretty well. After I've made some progress with my model and things are looking good, I always like to save where I'm up to, to save me the 54 seconds of going back and doing it again. As usual, we unfreeze() the rest of our model (we're going to be learning more about what that means during the course), and then we run the learning rate finder and plot it (it tells you exactly what to type), and we take a look. Now, we're going to be learning about learning rates today, actually, but for now here's what you need to know: on the learning rate finder, what you're looking for is the strongest downward slope that sticks around for quite a while. So this one here looks more like a bump, but this looks like an actual downward slope to me. It's something you're going to have to practice with and get a feel for; if you're not sure whether it's this bit or this bit, try both learning rates and see which one works better. But I've been doing this for a while, and I'm pretty sure this looks like where it's really learning properly. Here it's not so steep, so I would probably pick something back here for my learning rate. You can see I picked 3e-5, so somewhere around here; that sounds pretty good. That's my bottom learning rate. For my top learning rate I normally pick 1e-4 or 3e-4; I don't really think about it too much. That's a rule of thumb, and it always works pretty well. One of the things you'll realize is that most of these parameters don't actually matter that much in detail. If you just copy the numbers that I use each time, the vast majority of the time it'll just work fine, and we'll see places where it doesn't today.
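The "strongest downward slope" heuristic can be sketched numerically. This is a hypothetical plain-Python illustration, not the library's lr_find() implementation: given sampled learning rates and the losses recorded at each, it finds where the loss is falling fastest.

```python
def steepest_lr(lrs, losses):
    """Return the learning rate at the most negative loss slope."""
    slopes = [(losses[i + 1] - losses[i], lrs[i]) for i in range(len(lrs) - 1)]
    return min(slopes)[1]                 # most negative change in loss

# A made-up lr-finder curve: flat at first, then falling, then diverging.
lrs    = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2]
losses = [1.00, 0.98, 0.90, 0.60, 2.50]

print(steepest_lr(lrs, losses))           # 0.0001: where the curve drops fastest
```

In practice you'd then pick a bottom learning rate a little before that steepest point (to stay on the safe side of the cliff) and a rule-of-thumb top like 3e-4.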
Okay, so we've got a 1.4% error rate after doing another couple of epochs, so that's looking great. We've downloaded some images from Google Image Search and created a classifier, and we've got a 1.4% error rate. Let's save it. And then, as per usual, we can use the ClassificationInterpretation class to have a look at what's going on. In this case, we made one mistake: one black bear was classified as a grizzly bear. So that's a really good step; we've come a long way. But possibly you could do even better if your data set were less noisy; maybe Google Image Search didn't give you exactly the right images all the time. So how do we fix that? We want to clean it up. Combining a human expert with a computer learner is a really good idea. Almost nobody publishes on this; very, very few people teach it. But to me it's the most useful skill, particularly for you: most of the people watching this are domain experts, not computer science experts, and this is where you can use your knowledge of point mutations in genomics, or Panamanian buses, or whatever. So, let's see how that would work. Do you remember plot_top_losses() from last time, where we saw the images which it was either the most wrong about or the least confident about? We're going to look at those and decide which of them are noisy. If you think about it, it's very unlikely that mislabeled data is going to be predicted correctly and with high confidence; that's really unlikely to happen. So we're going to focus on the ones which the model is either not confident about, or was confident about but got wrong. Those are the things which might be mislabeled. So a big shout-out to the San Francisco Fast.ai study group, who created this new widget this week called the FileDeleter.
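What ClassificationInterpretation's confusion matrix reports can be re-implemented as a short, hypothetical plain-Python sketch: rows are the actual classes, columns the predicted ones, so the single black-bear-called-grizzly mistake shows up off the diagonal.

```python
def confusion_matrix(actual, predicted, classes):
    """m[i][j] counts examples of actual class i predicted as class j."""
    idx = {c: i for i, c in enumerate(classes)}
    m = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        m[idx[a]][idx[p]] += 1
    return m

classes   = ["black", "grizzly", "teddys"]
actual    = ["black", "black", "grizzly", "teddys"]
predicted = ["black", "grizzly", "grizzly", "teddys"]  # one black bear called a grizzly

print(confusion_matrix(actual, predicted, classes))
# [[1, 1, 0], [0, 1, 0], [0, 0, 1]]
```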
Zach, Jason and Francisco built this thing where we can basically take the top losses from that interpretation object we just created. There's not just plot_top_losses(); there's also top_losses(), and top_losses() returns two things: the losses of the things that were the worst, and the indexes into the data set of the things that were the worst. If you don't pass anything at all, it's actually going to return the entire data set, sorted so that the first things are the highest losses. As we'll keep seeing during the course, every data set in Fast.ai has an x and a y, and the x contains the things that are used, in this case, to get the images: so this is the image file names, and the y's are the labels. So if we grab the indexes and pass them into the data set's x, this is going to give us the file names of the data set, ordered by which ones had the highest loss: which ones it was either confident and wrong about, or not confident about. And we can pass that to this new widget they've created, the FileDeleter widget. Just to clarify, this top_loss_paths contains all of the file names in our data set, and when I say "in our data set", this particular one is our validation data set. So what this is going to do is clean up mislabeled images, or images that shouldn't be there, and remove them from the validation set so that our metrics will be more correct. You then need to rerun these two steps, replacing .valid_ds with .train_ds, to clean up your training set and get the noise out of that as well; so it's good practice to do both. We'll talk about test sets later as well; if you also have a test set, you would repeat the same thing there. So we run FileDeleter(), passing in that sorted list of paths, and what pops up is basically the same thing as plot_top_losses().
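The sort-and-index step described above can be sketched in plain Python (the loss values and file names here are made-up stand-ins for the validation set's per-image losses and its x):

```python
losses     = [0.05, 2.30, 0.10, 1.70]               # per-image loss on the validation set
file_names = ["a.jpg", "b.jpg", "c.jpg", "d.jpg"]   # stand-in for the data set's x

# Like top_losses() with no arguments: the whole set, worst first.
order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
top_losses     = [losses[i] for i in order]
top_loss_paths = [file_names[i] for i in order]     # what we'd hand to the widget

print(top_losses)       # [2.3, 1.7, 0.1, 0.05]
print(top_loss_paths)   # ['b.jpg', 'd.jpg', 'c.jpg', 'a.jpg']
```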
So in other words, these are the ones it was either wrong about or least confident about. And, not surprisingly, this one here does not appear to be a teddy bear, or a black bear, or a brown bear; so it shouldn't be in our data set. What I do is whack the Delete button. All the rest do look, indeed, like bears, so I can click Confirm and it'll bring up another five. What's that? That's not a bear, is it? Anybody know what that is? I'm going to say that's not a bear. Delete. Confirm. Oh, not there. Well, that's a teddy bear, so I'll leave that. That's not really... I'll get rid of that one.
Does that make sense? In fact, let's try it. We know that the function to predict something is called learn.predict(), so we can check: put two question marks before it or after it to get the source code, and here it is: pred equals res (the result) .argmax(), and then, what is the class? You just pass that into the classes array. So the source code in the fastai library can both strengthen your understanding of the concepts and make sure you know what's going on, and really help you here. You've got a question? Come on over. Q: "Can we have a definition of the error rate being discussed, and how it is calculated? I assume it's cross-validation error." Sure. One way to answer the question of how error rate is calculated would be to type error_rate?? and look at the source code. And it is "1 - accuracy". Fair enough. So then a question might be, "What is accuracy?" accuracy?? It is argmax. So we now know that means "find out which particular thing it is", and then look at how often that equals the target (in other words, the actual value), and take the mean. That's basically what it is. So then the question is: what is that being applied to? In fastai, metrics (these things that we pass in, we call them metrics) are always applied to the validation set. Any time you put a metric here, it'll be applied to the validation set, because that's best practice: what you always want to do is make sure you're checking your performance on data that your model hasn't seen, and we'll be learning more about the validation set shortly. Remember, you can also type doc(thing-to-look-up) if the source code is not what you want, which it might well not be; you might actually want the documentation.
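The two definitions read off from the source above can be re-stated as a minimal plain-Python sketch (the function bodies here are an illustration, not the library's actual tensor code): accuracy is "argmax, compare to the target, take the mean", and error rate is one minus that.

```python
def accuracy(preds, targets):
    """preds: per-example class scores; targets: true class indices."""
    arg_max = [max(range(len(p)), key=p.__getitem__) for p in preds]  # predicted class
    return sum(a == t for a, t in zip(arg_max, targets)) / len(targets)

def error_rate(preds, targets):
    return 1 - accuracy(preds, targets)

preds   = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.3, 0.3, 0.4], [0.8, 0.1, 0.1]]
targets = [0, 1, 2, 1]      # last prediction is wrong: argmax 0, actual 1

print(accuracy(preds, targets), error_rate(preds, targets))   # 0.75 0.25
```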
doc() will give you both a summary of the types in and out of the function and a link to the full documentation, where you can find out all about how metrics work, what other metrics there are, and so forth. Generally speaking, you'll also find links to more information where, for example, you'll find complete run-throughs and sample code showing you how to use all these things. So don't forget that the doc() function is your friend. Also, both in the doc() function and in the documentation, you'll see a source link. This is like ??, but what the source link does is take you to the exact line of code on GitHub, so you can see exactly how it's implemented and what else is around it; lots of good stuff there. Q: Why were you using 3s for your learning rates earlier, with 3e-5 and 3e-4? A: We found that 3e-3 is just a really good default learning rate; it works most of the time for your initial fine-tuning, before you unfreeze. And then I tend to just multiply from there. At the next stage, I generally pick ten times lower than that for the second part of the slice, and whatever lr_find() found for the first part of the slice. The second part of the slice doesn't come from lr_find(); it's just a rule of thumb: ten times less than your first stage's learning rate, which defaults to 3e-3. The first part of the slice is what comes out of lr_find(). We'll be learning a lot more about these learning rate details both today and in the coming lessons. But for now, all you need to remember is that your basic approach looks like this: learn.fit_one_cycle(), some number of epochs (I often pick four), and some learning rate (which defaults to 3e-3). I'll just type it out fully so you can see, and then we do that for a bit, and then we unfreeze.
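The arithmetic of that rule of thumb can be written out. As an additional assumption (not stated in the transcript), the sketch below also spreads the rates geometrically between the two ends of the slice across layer groups, which is one plausible way a slice(lo, hi) gets applied; the function name is hypothetical.

```python
def slice_lrs(lo, hi, n_groups=3):
    """Geometrically spaced learning rates from lo (early layers) to hi (head)."""
    ratio = (hi / lo) ** (1 / (n_groups - 1))
    return [lo * ratio ** i for i in range(n_groups)]

first_stage_lr = 3e-3            # the good default, before unfreezing
hi = first_stage_lr / 10         # top of the slice: ten times lower, by rule of thumb
lo = 3e-5                        # bottom of the slice: read off the lr finder

print(slice_lrs(lo, hi))         # roughly [3e-05, 9.5e-05, 3e-04]
```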
Then we learn some more. This is the bit where I just take whatever I did last time and divide it by 10, and then I also... right? Like that? And then I have to put one more number in here, and that's the number I get from the learning rate finder: the bit where it's got the strongest slope. So that's the "don't have to think about it, don't really have to know what's going on" rule of thumb that works most of the time. But let's now dig in and actually understand it more completely. So we're going to create this mathematical function that takes the numbers that represent the pixels and spits out probabilities for each possible category. By the way, a lot of the stuff we're using here we've borrowed from other people who are awesome, so we're putting their details here. Please check out their work, because they've got great work that we're highlighting in our course. I really like this little animated GIF of the numbers, so thank you to Adam Geitgey for creating that. I guess that was probably on Quora, by the looks of it... Medium, oh yes, it was that terrific Medium post, I remember; he's had a whole series of Medium posts. So, let's look and see how we create one of these functions, and let's start with the simplest function I know: y = ax + b. That's a line. The gradient of the line is a, and the intercept is b. Hopefully, when we said you need to know high school math to do this course, these are the things we were assuming you remember. If we do mention some math thing which I'm assuming you remember and you don't, don't freak out; it happens to all of us. Khan Academy is actually terrific, and it's not just for school kids. Go to Khan Academy, find the concept you need a refresher on, and he explains things really well, so I strongly recommend checking that out.
You know, remember, I'm just a philosophy student, so all the time I'm either trying to remind myself about something or learning something I never learnt, and we have the whole Internet to teach us these things. So I'm going to rewrite this slightly: y = a1x + a2. Let's just replace b with a2; give it a different name. That's another way of saying the same thing. Yet another way of saying it would be to multiply a2 by the number 1; that's still the same thing. And now, at this point, I'm actually going to say: let's not put the number 1 there, but let's put an x1 here and an x2 here, and I'll say x2 = 1. So far this is pretty early high school math; it's multiplying by 1, which I think we can handle. So these two are equivalent, with a bit of renaming. Now, in machine learning, we don't just have one equation; we've got lots. So if we've got some data that represents temperature versus the number of ice creams sold, then we have lots of dots, and each one of those dots, we might hypothesize, is based on this formula: y = a1x1 + a2x2. So basically there are lots of values of y (this is our y), so we can stick a little i on it, and there are lots of values of x (this is our x), so we can stick a little i on that too. The way we do that is a lot like numpy indexing, but rather than putting things in square brackets, in math notation we put them down in the subscript of the equation. So this is now saying there are actually lots of different y_i's, based on lots of different x_i1's and x_i2's. But notice there's still only one of each of these: a1 and a2. These things are called the "coefficients", or the "parameters". So this is our linear equation, and we're still going to say that every x_i2 is equal to 1. Why did I do it that way?
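The renaming above is easy to check numerically. This sketch just confirms that y = ax + b and y = a1·x1 + a2·x2 (with a1 = a, a2 = b, and x2 fixed at 1) are the same line:

```python
a, b = 3.0, 2.0        # slope and intercept of the original line
a1, a2 = a, b          # the renamed coefficients

for x in [0.0, 1.5, -2.0]:
    x1, x2 = x, 1.0    # the trick: a constant second input of 1
    assert a * x + b == a1 * x1 + a2 * x2

print("same line, two spellings")
```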
Because I want to do linear algebra. Why do I want to do linear algebra? Well, one reason is that Rachel teaches the world's best linear algebra course, so if you're interested, check out "Computational Linear Algebra for Coders"; it's a good opportunity for me to throw in a pitch for this free course, from which we make no money, but never mind. More to the point right now, it's going to make life much easier, because I hate writing loops. I hate writing code; I just want the computer to do everything for me. Any time you see these little i subscripts, it sounds like you're going to have to write loops and all kinds of stuff. But what you might remember from school is that when you've got pairs of things being multiplied together and then added up, that's called a "dot product", and if you do that for lots and lots of different i's, that's called a "matrix product". So, in fact, this whole thing can be written like this: rather than lots of different y_i's, we can say there's one vector called y, which is equal to one matrix called X times one vector called a. Now, at this point, I know a lot of you don't remember that, and that's fine: we have a picture to show you. I didn't know who created this, but now I do: somebody called André Staltz created this fantastic thing called matrixmultiplication.xyz, and here we have a matrix by a vector, and we're going to do a matrix-vector product. Go! That times that, plus that times that... and so on for each row. Finished! That is what matrix-vector multiplication does. In other words, it's just that, except his version is much less messy. So, this is actually an excellent spot to have a little break and find out what questions are coming through from our students. What are they asking, Rachel? Q:
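The jump from subscripts to linear algebra can be spelled out as a plain-Python sketch: a dot product sums pairwise products, and a matrix-vector product is just one dot product per row, which is exactly what y = Xa computes.

```python
def dot(u, v):
    """Sum of pairwise products: u . v."""
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(X, a):
    """One dot product per row of X: the vector y in y = Xa."""
    return [dot(row, a) for row in X]

X = [[2.0, 1.0],       # each row is [x_i1, x_i2], with x_i2 = 1
     [5.0, 1.0]]
a = [3.0, 2.0]         # [a1, a2]

print(matvec(X, a))    # [8.0, 17.0]: 3*2 + 2*1, then 3*5 + 2*1
```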
When generating a new image data set, how do you know how many images are enough? What are ways to measure "enough"? A: Yeah, that's a great question. So another possible problem is that you don't have enough data. How do you know if you don't have enough data? You've found a good learning rate (because if you make it higher, it goes off into massive losses, and if you make it lower, it goes really slowly), and then you train for such a long time that your error starts getting worse. So you know that you've trained for long enough, and you're still not happy with the accuracy; it's not good enough for the teddy-bear-cuddling level of safety you want. If that happens, there are a number of things you can do, and we'll learn about pretty much all of them during this course, but one of the easiest is: get more data. If you get more data, you can train for longer, get a higher accuracy and a lower error rate, without overfitting. Unfortunately, there's no shortcut. I wish there was some way to know ahead of time how much data you need. But I will say this: most of the time, you need less data than you think. Organizations very commonly spend too much time gathering data, getting more data than it turned out they actually needed. So get a small amount first and see how you go. Q: What do you do if you have unbalanced classes, such as 200 grizzlies and 50 teddies? A: Nothing. Try it. It works. A lot of people ask how to deal with unbalanced data. I've done lots of analysis with unbalanced data over the last couple of years, and I just can't make it not work; it always works. There's actually a paper that said that if you want to get it slightly better, the best thing to do is to take that uncommon class and just make a few copies of it; that's called "oversampling".
But I haven't found a situation in practice where I needed to do that; I've found it always just works fine for me. Q: Once you unfreeze and retrain with one cycle again, if your training loss is still lower than your validation loss (likely underfitting), do you retrain it unfrozen again (which will technically be more than one cycle), or do you redo everything with a longer cycle? A: You guys asked me that last week! My answer's still the same: I don't know. If you do another cycle, it'll maybe generalize a little bit better; if you start again and do twice as long, it's kind of annoying, and it depends how patient you are. It won't make much difference, you know? For me personally, I normally just train a few more cycles. But, yeah, it doesn't make much difference.
Most of the time, anyway. Q: Going back to the code sample where you were creating a CNN with resnet34 for the grizzly/teddy classifier: it says this requires resnet34, which I find surprising. I had assumed that the model created by .save(), which is about 85 megabytes on disk, would be able to run without also needing a copy of resnet34. A: Yeah, I understand. We're going to be learning all about this shortly. There's no "copy" of resnet34; resnet34 is actually what we call an "architecture" (we're going to be learning a lot about this). It's a functional form, just like a line is a linear functional form. It doesn't take up any room, it doesn't contain anything; it's just a function. resnet34 is just a function; it doesn't contain anything, it doesn't store anything. I think the confusion here is that we often use a pre-trained neural net that's been learned on ImageNet. In this case, we don't need a pre-trained neural net, and actually, to avoid it even getting created, you can pass pretrained=False, which will ensure that nothing even gets loaded, which will save you another 0.2 seconds, I guess. We'll be learning a lot more about this, so don't worry if it's a bit unclear. The basic idea is that this thing here is basically the equivalent of asking "is it a line?" or "is it a quadratic?" or "is it a reciprocal?". This is just a function, the resnet34 function; it's a mathematical function. It doesn't take any storage, it doesn't have any numbers, it doesn't have to be loaded, as opposed to a pre-trained model. And that's why, when we did it at inference time, the thing that took space was this bit, which is where we load our parameters; which is basically saying, as we're about to find out, what are the values of a and b? We have to store those numbers. But for resnet34, you don't store just two numbers; you store a few million, or a few tens of millions.
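The architecture-versus-weights distinction in that answer can be made concrete with a toy sketch (everything here is hypothetical and tiny, but the shape of the idea is the same): the function below is the "architecture" and stores no numbers, while the parameter list is the part that .save() would actually have to write to disk.

```python
def linear(params, x):
    """The functional form y = a*x + b; all the stored state lives in params."""
    a, b = params
    return a * x + b

# The saved numbers. For this toy "architecture" there are two of them;
# for resnet34 there would be tens of millions, but the function itself
# would still weigh nothing.
params = [3.0, 2.0]

print(linear(params, 10.0))   # 32.0
```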
So, why did we do all this? Because I wanted to be able to write it out like this. And the nice thing about writing it out like this is that we can now do it in PyTorch with no loops, in a single line of code, and it's also going to run faster. PyTorch really doesn't like loops; it really wants you to send it a whole equation to do all at once, which means you really want to specify things in these linear algebra terms. So let's go and take a look, because what we're going to do is take this (we're going to call it an "architecture"; it's the world's tiniest neural network, with two parameters, a1 and a2) and try to fit this architecture to some data. Let's jump into a notebook, generate some dots, and see if we can get it to fit a line somehow. And the "somehow" is going to be something called SGD. What is SGD? Well, there are two types of SGD. The first one is where I said in Lesson 1, "Hey, you should all try building these models and try to come up with something cool", and you all experimented and found really good stuff; there the S would be Student, so that's Student Gradient Descent. That's version one of SGD. Version two, which is what I'm going to talk about today, is where we have a computer try lots of things to come up with a really good function; that's called Stochastic Gradient Descent. The other one that you hear a lot on Twitter is Stochastic Grad-student Descent. So, we're going to jump into "Lesson 2: SGD", and we're going to go bottom-up rather than top-down. We're going to create the simplest possible model we can, which is going to be a linear model, and the first thing we need is some data. So we're going to generate some data.
The data we're going to generate looks like this. This axis might represent temperature, and this one might represent the number of ice creams we sell, or something like that; but we're just going to create some synthetic data that we know follows a line. And as we build this, we're actually going to learn a little bit about PyTorch as well. Basically, the way we're going to generate this data is by creating some coefficients: a1 will be 3 and a2 will be 2. And we're going to create, like we've looked at before, basically a column of numbers for the x's, and a whole bunch of ones.
And then we're going to do this: x@a. What is x@a? x@a, in Python, means a matrix product between x and a. It's actually even more general than that: it can be a vector-vector product, a matrix-vector product, a vector-matrix product or a matrix-matrix product. And in PyTorch specifically, it can mean even more general things, where we get into higher-rank tensors, which we'll learn all about very soon. But this is basically the key thing that's going to go on in all of our deep learning: the vast majority of the time, our computers are going to be multiplying numbers together and adding them up, which is a surprisingly useful thing to do. Okay, so we're basically going to generate some data by creating a line and then adding some random numbers to it. But let's go back and see how we created x and a. I mentioned that we've basically got these two coefficients, 3 and 2, and you'll see that we've wrapped them in this function called tensor(). You might have heard this word "tensor" before. Who's heard the word "tensor" before? About two-thirds of you. It's one of those words that sounds scary, and apparently if you're a physicist it actually is scary, but in the world of deep learning it's not scary at all. Tensor means "array". Specifically, it's an array of a regular shape. So it's not an array where row 1 has two things, row 3 has three things and row 4 has one thing; that's called a "jagged" array, and that's not a tensor. A tensor is any array with a rectangular or cube-like shape, where every row is the same length and every column is the same length. A 4×3 matrix would be a tensor; a vector of length 4 would be a tensor; a 3D array of size 3×4×6 would be a tensor. That's all a tensor is. And we have these all the time.
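Here is x@a at a small scale, assuming PyTorch is installed; the matrix holds [x_i1, 1] rows and the vector holds the coefficients [3, 2], so each output element is one dot product, 3·x_i1 + 2·1:

```python
import torch

a = torch.tensor([3.0, 2.0])       # the coefficient vector [a1, a2]
x = torch.tensor([[1.0, 1.0],      # each row: [x_i1, x_i2], with x_i2 = 1
                  [2.0, 1.0],
                  [0.5, 1.0]])

y = x @ a                          # one dot product per row
print(y.tolist())                  # [5.0, 8.0, 3.5]
```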
For example, an image is a three-dimensional tensor: number of rows by number of columns by number of channels, normally red, green, blue. So, for example, a VGA image would be 640 by 480 by 3, or actually... we do things backwards: when people talk about images they normally go width by height, but when we talk mathematically we always go number of rows by number of columns, so it would actually be 480 by 640 by 3. That will catch you out. We don't say "dimensions", though, with tensors; we use one of two words: "rank" or "axes". Rank specifically means: how many axes are there? How many dimensions are there? So an image is generally a rank 3 tensor, and what we've created here is a rank 1 tensor, also known as a "vector". In math, people come up with very different words for slightly different concepts: why is a one-dimensional array a "vector" and a two-dimensional array a "matrix", and then a three-dimensional array... does that even have a name? Not really. It doesn't make any sense. With computers, we try to have simple, consistent naming conventions: they're all called tensors. Rank 1 tensor, rank 2 tensor, rank 3 tensor. You can certainly have a rank 4 tensor: if you've got 64 images, that would be a rank 4 tensor of 64 x 480 x 640 x 3, for example. So tensors are very simple; they just mean arrays. And so, in PyTorch, you say tensor() and pass in some numbers, and you get back, in this case, just a list: I got back a vector. This, then, represents our coefficients: the slope and the intercept of our line. And remember, we're not actually going to have a special case of ax + b; instead, we're going to say there's always this second x value, which is always 1 (you can see it here, always 1), which allows us to do a simple matrix-vector product.
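The rank vocabulary can be checked directly, assuming PyTorch is installed; rank is just the number of axes, which PyTorch reports via .dim() (the batch here is kept tiny rather than 64 full images):

```python
import torch

vec   = torch.tensor([3.0, 2.0])      # rank 1 tensor: a vector
mat   = torch.ones(4, 3)              # rank 2 tensor: a 4x3 matrix
batch = torch.zeros(2, 4, 6, 3)       # rank 4: a tiny batch, n x rows x cols x channels

print(vec.dim(), mat.dim(), batch.dim())   # 1 2 4
```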
Okay, so that's a. Then we want to generate this x array of data, which is going to have random numbers in the first column and a whole bunch of ones in the second column. To do that, we basically say to PyTorch: create a tensor of n by 2. Since we passed in a total of two sizes, we get a rank 2 tensor: the number of rows will be n and the number of columns will be 2. And every single thing in it will be a 1.
That's what torch.ones() means. And then, this is really important: you can index into that just like you can index into a list in Python, but you can put a colon (:) anywhere, and a colon means "every single value on that axis", or "every single value on that dimension". So this here means every single row, and this here means column 0. So, for every row of column 0, I want you to grab a uniform random number. And here's another very important concept: in PyTorch, any time you've got a function that ends in an underscore, it means "don't return the result to me; instead, replace whatever this is being called on with the result of this function". So this takes column 0 and replaces it with a uniform random number between -1 and 1. There's a lot to unpack there, but the good news is that those two lines of code, plus this one (which we're coming to), cover 95% of what you need to know about PyTorch: how to create an array, how to change things in an array, and how to do matrix operations on an array. So there's a lot to unpack, but this small number of concepts is incredibly powerful. So I can now print out the first 5 rows; [:5] is standard Python slicing syntax for "the first five rows". So here are the first five rows of my two columns: my random numbers, and my ones. Now I can do a matrix product of that x by my a, add some random numbers to add a bit of noise, and then do a scatter plot. I'm not really interested in plotting the column of ones; it's just there to make my linear function more convenient. So I'm just going to plot my column 0 against my y's, and there it is. plt is what we universally use to refer to the plotting library matplotlib, and that's what most people use for most of their plotting in Python; in scientific Python, we use matplotlib.
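Pulling the tensor-creation lines described above together, assuming PyTorch is installed: an n x 2 tensor of ones, with column 0 replaced in place (note the trailing underscore) by uniform random numbers between -1 and 1.

```python
import torch

n = 100
x = torch.ones(n, 2)           # rank 2 tensor, n rows, 2 columns, all ones
x[:, 0].uniform_(-1, 1)        # every row, column 0; "_" means modify in place

print(x[:5])                   # first five rows: random numbers next to ones
assert x.shape == (n, 2)
assert (x[:, 1] == 1).all()    # the column of ones is untouched
```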
It’s certainly a library you’ll want to get familiar with, because being able to plot things is really important. There are lots of other plotting packages, and lots of them are better at certain things than matplotlib, but matplotlib can do everything reasonably well. Sometimes it’s a little awkward, but for me, I do pretty much everything in matplotlib, because there’s really nothing it can’t do (even though some libraries can do other things a little bit better or a little bit prettier). It’s really powerful, so once you know matplotlib, you can do everything. So here I’m asking matplotlib to give me a scatterplot with my x’s against my y’s, and there it is. So this is my dummy data representing, say, temperature and ice cream sales. So now what we’re going to do is pretend we were given this data and we don’t know that the values of our coefficients are 3 and 2. So we’re going to pretend that we never knew, and we have to figure them out. So how would we figure them out? How would we draw a line that fits this data? And why would that even be interesting? Well, we’re going to look more at why it’s interesting in just a moment, but the basic idea is this (and it’s going to be, perhaps, really surprising): if we can find a way to find those two parameters to fit that line to those points (how many points were there? n was 100), then we can also fit these arbitrary functions that convert from pixel values to probabilities. It’ll turn out that the technique we’re going to learn to find these two numbers works equally well for the 50 million numbers in resnet34. So we’re actually going to use an almost identical approach.
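Putting that together, the dummy-data setup and scatter plot look roughly like this. (The coefficients 3 and 2 and the bit of uniform noise are as described above; the Agg backend line and the savefig filename are my additions so the sketch runs headless, whereas in a notebook you’d just let the plot display inline.)

```python
import torch
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs anywhere
import matplotlib.pyplot as plt

n = 100
x = torch.ones(n, 2)
x[:, 0].uniform_(-1., 1.)

a = torch.tensor([3., 2.])   # the "true" coefficients we'll pretend not to know
y = x @ a + torch.rand(n)    # matrix product, plus a bit of noise

# Plot only the 0-index column against y; the column of ones is just
# there to make the linear function convenient.
plt.scatter(x[:, 0], y)
plt.savefig("ice_cream.png")
```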
So that’s the bit that, I’ve found in previous classes, people have the most trouble digesting. I often find, even after week 4 or week 5, people will come up to me and say “I don’t get it, how do we actually train these models?”, and I’ll say “It’s SGD. It’s that thing we saw in the notebook with the 2 numbers”. It’s like, “Yeah, but we’re fitting a neural network”, and I say “I know, and we can’t print the 50 million numbers anymore, but it is literally, identically, doing the same thing”. And the reason this is hard to digest is that the human brain has a lot of trouble conceptualizing what an equation with 50 million numbers looks like and can do.
So you just kind of, for now, will have to take my word for it. It can do things like recognize teddy bears. And all these functions turn out to be very powerful. We’re going to learn a little bit more in just a moment about how to make them extra powerful, but for now, the thing we’re going to use to fit these two numbers is the same thing that we’ve just been using to fit 50 million numbers. Okay, so we want to find what PyTorch calls ‘parameters’, or what in statistics you’ll often hear called ‘coefficients’: these values a1 and a2. We want to find these parameters such that the line that they create minimizes the error between that line and the points. So in other words, if the a1 and a2 we came up with resulted in this line, then we’d look and see how far away that line is from each point, and we’d say “Oh, that’s quite a long way”. And maybe there was some other a1 and a2 which resulted in this other line, and we’d say, “oh, how far away is each of those points?”. And then eventually we come up with this blue line, and it’s like, “Oh, in this case, each of those is actually very close”. So you can see how, in each case, we can say how far away the line at each spot is from its point, and then we can take the average of all those, and that’s called the ‘loss’. That is the value of our loss. So you need some mathematical function that can basically say how far away this line is from those points. For this kind of problem, which is called a ‘regression’ problem (a problem where your dependent variable is continuous, so rather than being “grizzlies” or “teddies”, it’s some number between -1 and 6), the most common loss function is called ‘mean squared error’, which pretty much everybody calls ‘MSE’. You may also see RMSE, which is just ‘root mean squared error’.
And so the mean squared error is a loss: it’s the difference between some prediction that you’ve made (which is the value of the line) and the actual number of ice cream sales. In the mathematics of this, people normally call the actual “y” and the prediction “y hat”, written like that. And when we’re writing something like the mean squared error equation, there’s no point writing “ice cream” here and “temperature” here, because we want to apply it to anything, so we tend to use these mathematical placeholders. So the value of mean squared error is simply the difference between those two, squared. And then we can take the mean. Because, remember, that is actually a ‘vector’, or what we now call a “rank 1 tensor”: it’s the value of the number of ice cream sales at each place. And so when we subtract one vector from another vector (and we’re going to be learning a lot more about this), it does something called element-wise arithmetic; in other words, it subtracts each one from each other, and so we end up with a vector of differences. And then if we take the square of that, it squares everything in that vector, and so we can take the mean of that to find the average square of the differences between the actuals and the predictions. So, if you’re more comfortable with mathematical notation, what we just wrote there was “the sum of (y hat minus y) squared, over n”. So that equation is the same as that code. One of the things I’ll note here is, I don’t think the code is any more complicated or unwieldy than the math, but the benefit of the code is that you can experiment with it: once you’ve defined it, you can use it, you can send things into it and get stuff out of it, and see how it works.
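The loss function being described is only one line of code. A sketch, matching the definition just given:

```python
import torch

def mse(y_hat, y):
    # Element-wise subtraction gives a vector of differences;
    # squaring squares every element; .mean() averages them.
    # In math notation: sum((y_hat - y)^2) / n
    return ((y_hat - y) ** 2).mean()
```

So, for example, `mse(torch.tensor([2.]), torch.tensor([0.]))` gives 4.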
So, for me, most of the time I prefer to explain things with code rather than with math, because they’re the same; in this case, at least in all the cases we’ll look at, they’re exactly the same, just different notations for the same thing. But one of the notations is executable, something you can experiment with, and one of them is abstract. So that’s why I’m generally going to show code. The good news is, if you’re a coder with not much of a math background, actually you do have a math background, because code is math. And if you’ve got more of a math background and less of a code background, then a lot of the stuff that you learned from math is going to translate very directly into code, and now you can really start to experiment with your math. Okay, so this is a ‘loss function’. This is something that tells us how good our line is. So now we have to come up with: what is the line that fits through here? Remember, we’re going to pretend we don’t know, so what you actually have to do is guess. You have to come up with a guess: what are the values of a1 and a2? So let’s say we guess -1 and 1. So this is our tensor, and here is how we create it. And I wanted to write it this way because you’ll see this all the time. Written out fully it would be “-1.0, 1.0”. We can’t write it without the point, because that would be an ‘int’, not a floating point, and that’s going to “spit the dummy” if you try to do calculations with it in neural nets. I’m far too lazy to type “.0” every time, and Python knows perfectly well that if you add a dot next to any of these numbers, then the whole thing is now floats.
So that’s why you’ll often see it written this way, particularly by lazy people like me. Okay, so ‘a’ is a tensor. You can see it’s floating point; even PyTorch is lazy, they just put a “.” and don’t bother with the 0. But if you want to see exactly what it is, you can write “.type()” and you can see it’s a FloatTensor. And so now we can calculate our predictions with this random guess, x@a (the matrix product of x and a), and we can calculate the mean squared error of our predictions and the actuals, and that’s our loss. So for this regression, our loss is 8.9.
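So the guess-and-measure step looks something like this (reusing the x and y setup from earlier; the 8.9 quoted in the lesson will vary with the random data):

```python
import torch

n = 100
x = torch.ones(n, 2); x[:, 0].uniform_(-1., 1.)
y = x @ torch.tensor([3., 2.]) + torch.rand(n)

def mse(y_hat, y): return ((y_hat - y) ** 2).mean()

a = torch.tensor([-1., 1.])   # our initial (bad) guess at the parameters
print(a.type())               # torch.FloatTensor: the dot made it a float

y_hat = x @ a                 # predictions from the guess
loss = mse(y_hat, y)          # a big number, since the guess is poor
print(loss)
```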
And so we can now plot a scatter plot of x against y, and we can plot the scatter plot of x against y hat (our predictions), and there they are. So this is the (-1, 1) line, and here are the actuals. So that’s not great, but not surprising: it’s just a guess. So SGD, or “gradient descent” more generally (and anybody who’s done any engineering or probably computer science at school will have done plenty of this, like Newton’s method and all the stuff that you did at university; if you didn’t, don’t worry, we’re going to learn it now), is basically about taking this guess and trying to make it a little bit better. So, how do we make it a little bit better? Well, there are only two numbers, and the two numbers are the intercept of that orange line and the gradient of that orange line. So what we’re going to do with gradient descent is simply say: what if we changed those two numbers a little bit? What if we made the intercept a little bit higher, or a little bit lower? What if we made the gradient a little bit more positive, or a little bit more negative? So there are four possibilities, and we can calculate the loss for each of those four possibilities and see what worked. Did lifting it up or down make it better? Did tilting it more positive or more negative make it better? And then all we do is say, okay, whichever one of those made it better, that’s what we’re going to do. And that’s it. But here’s the cool thing for those of you that remember calculus: you don’t actually have to move it up and down and round about; you can actually calculate the ‘derivative’.
The derivative is the thing that tells you: would moving it up or down make it better, or would rotating it this way or that way make it better? So the good news is, if you didn’t do calculus or you don’t remember calculus, I just told you everything you need to know about it, which is that it tells you how changing one thing changes the function. That’s what the derivative is (not quite strictly speaking, but close enough). It’s also called the ‘gradient’. So the gradient, or the derivative, tells you how changing a1 up or down would change our MSE, and how changing a2 up or down would change our MSE, and it does this more quickly than actually moving them up and down. In school, unfortunately, they forced us to sit there and calculate these derivatives by hand. We have computers! Computers can do that for us. We are NOT going to calculate them by hand. Instead, PyTorch will calculate the gradient for us and put it in an attribute called “.grad”. So here’s what we’re going to do: we’re going to create a loop, we’re going to loop 100 times, and we’re going to call a function called update(). That function is going to calculate y hat (our prediction). It is going to calculate the loss (our mean squared error). From time to time it will print that out so we can see how we’re going. It will then calculate the gradient, and in PyTorch, calculating the gradient is done by calling a method called .backward(). So you’ll see something really interesting, which is: mean squared error was just a simple, standard mathematical function, but PyTorch keeps track of how it was calculated for us and lets us calculate the derivatives. So if you do a mathematical operation on a tensor in PyTorch, you can call .backward() to calculate the derivative. And what happens to that derivative? It gets stuck inside an attribute called .grad. So I’m going to take my coefficients ‘a’ and I am going to subtract my gradient from them.
And this underscore here: why? Because that’s going to do it in place. So it’s going to actually update those coefficients ‘a’ to subtract the gradients from them. So, why do we subtract? Well, because the gradient tells us that if I move the whole thing downwards, the loss goes up, and if I move the whole thing upwards, the loss goes down. So I want to do the opposite of the thing that makes the loss go up, because we want the loss to be small. That’s why we have to subtract. And then there’s something here called “lr”. “lr” is our learning rate, and literally all it is is the thing that we multiply by the gradient. Why is there an ‘lr’ at all? Let me show you why. Let’s take a really simple example: a quadratic. And let’s say your algorithm’s job was to find where that quadratic was at its lowest point. How could it do this? Well, just like what we’re doing now, the starting point would be to pick some x value at random, and then pop up here to find out what the value of y is. That’s its starting point. And then it can calculate the gradient, and the gradient is simply the slope: it tells you which direction to move in to go down. So the gradient tells you that you have to go this way. Now, if the gradient was really big, you might jump a very long way: you might jump all the way over to here, maybe even here. And if you jumped over to there, that’s actually not going to be very helpful, because, you see, where does that take us? Oh! It’s now worse. We jumped too far. So we don’t want to jump too far; maybe we should just jump a little bit, maybe to here. And the good news is that is actually a little bit closer. And so then we’ll just do another little jump, see what the gradient is, and do another little jump, and that takes us to here, and here, and here.
So in other words, we find our gradient to tell us what direction to go, and whether we have to go a long way or not too far, but then we multiply it by some number less than one, so we don’t jump too far. And, hopefully, at this point, this might be reminding you of something: what happened when our learning rate was too high? So do you see why that happened now? A learning rate that was too high meant that we jumped all the way past the right answer, further than we started, and it got worse and worse and worse. That’s what a learning rate that’s too high does. On the other hand, if our learning rate is too low, then you just take tiny little steps, and so eventually you’re going to get there, but you’re doing lots and lots of calculations along the way. So you really want to find something that gets in there quickly (maybe bouncing backwards and forwards a little bit), but not so quickly that it jumps out and diverges, and not so slowly that it takes lots of steps. So that’s why we need a good learning rate, and that’s all it does. So if you look inside the source code of any deep learning library, you will find this: you will find something that says coefficients minus learning rate times gradient. And we’ll learn about some easy but important optimizations we can do to make this go faster. But that’s basically it. There are a couple of other little issues that we don’t need to talk about now: one involving zeroing out the gradients, and another involving making sure that you turn gradient calculation off when you do the SGD update. If you’re interested, we can discuss them on the forum, or you can do our “Introduction to Machine Learning” course, which covers all the mechanics of this in more detail. But this is the basic idea.
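Here’s a sketch of that update() loop, including the two housekeeping details just mentioned (zeroing out the gradients, and turning gradient calculation off during the update itself):

```python
import torch

n, lr = 100, 1e-1
x = torch.ones(n, 2); x[:, 0].uniform_(-1., 1.)
y = x @ torch.tensor([3., 2.]) + torch.rand(n)

# requires_grad=True asks PyTorch to track how 'a' is used,
# so .backward() can compute derivatives with respect to it
a = torch.tensor([-1., 1.], requires_grad=True)

def update():
    y_hat = x @ a                     # predictions
    loss = ((y_hat - y) ** 2).mean()  # mean squared error
    loss.backward()                   # derivatives land in a.grad
    with torch.no_grad():             # don't track the update itself
        a.sub_(lr * a.grad)           # in place: a = a - lr * gradient
        a.grad.zero_()                # zero out, so gradients don't accumulate
    return loss.item()

losses = [update() for _ in range(100)]
print(losses[0], losses[-1])          # the loss goes down, down, down
```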
So if we run update() 100 times, printing out the loss from time to time, you can see it starts at 8.9 and goes down, down, down. And so we can then print out the scatter plots, and there it is. That’s it. Believe it or not, that’s gradient descent. We just need to start with a function that’s a bit more complex than x@a, but as long as we have a function that can represent things like ‘is this a teddy bear?’, we now have a way to fit it. So let’s now take a look at this as an animation. This is one of the nice things that you can do with matplotlib: you can take any plot and turn it into an animation, so you can actually see it updating at each step. So let’s see what we did here. We simply said, as before, create a scatter plot, but then rather than having a loop, we used matplotlib’s FuncAnimation() to call this function 100 times. And this function just called that update() we created earlier and then updated the y data in our line. And it did that 100 times, waiting 20 milliseconds after each one, and there it is. So you might think that visualizing your algorithms with animations is some amazing and complex thing to do, but actually, now you know, it’s 1, 2, 3… 11 lines of code. So I think that is pretty damn cool. So that is SGD, visualized. We can’t visualize as conveniently what updating 50 million parameters in a resnet34 looks like, but it’s basically doing the same thing. And studying these simple versions is actually a great way to get an intuition. So you should try running this notebook with a really big learning rate and a really small learning rate, and see what this animation looks like, and try to get a feel for it. Maybe you can even try a 3D plot. I haven’t tried that yet, but I’m sure it would work fine too.
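One possible version of those few lines (update() is defined inline so the sketch is self-contained; the Agg backend is my addition so it runs headless, whereas in a notebook you’d render the animation inline, e.g. as HTML):

```python
import torch
import matplotlib
matplotlib.use("Agg")  # headless backend; notebooks display inline instead
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

n, lr = 100, 1e-1
x = torch.ones(n, 2); x[:, 0].uniform_(-1., 1.)
y = x @ torch.tensor([3., 2.]) + torch.rand(n)
a = torch.tensor([-1., 1.], requires_grad=True)

def update():
    loss = ((x @ a - y) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        a.sub_(lr * a.grad); a.grad.zero_()

fig = plt.figure()
plt.scatter(x[:, 0], y, c='orange')
line, = plt.plot(x[:, 0], (x @ a).detach(), '.')

def animate(i):
    update()                          # one gradient-descent step per frame
    line.set_ydata((x @ a).detach())  # redraw the fitted line's y data
    return line,

# Call animate 100 times, waiting 20 milliseconds after each one
anim = FuncAnimation(fig, animate, frames=100, interval=20)
```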
So the only difference between stochastic gradient descent and this is something called ‘minibatches’. You’ll see that what we did here was calculate the value of the loss on the whole data set on every iteration. But if your data set is the one and a half million images in ImageNet, that’s going to be really slow: just to do a single update of your parameters, you’ve got to calculate the loss on one and a half million images. You wouldn’t want to do that. So what we do instead is grab 64 images or so at a time, at random, calculate the loss on those 64 images, and update our weights. Then we grab another 64 random images and update the weights again. So in other words, the loop looks exactly the same, but at this point here we’d put some random indexes on our x and the same random indexes on our y, to do a minibatch at a time, and that would be the basic difference. And once you add that, grabbing a random few points each time, those random few points are called your minibatch, and that approach is called SGD, or stochastic gradient descent. Okay, so there’s quite a bit of vocab we’ve just covered, so let’s remind ourselves. The ‘learning rate’ is the thing that we multiply our gradient by to decide how much to update the weights. An ‘epoch’ is one complete run through all of our data points (all of our images). So for the non-stochastic gradient descent we just did, every single loop we used the entire data set; but if you’ve got a data set with a thousand images and your minibatch size is 100, then it would take you ten iterations to see every image once, and that would be one epoch.
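So a minibatch version of the same loop might look like this. (The 64 is the minibatch size mentioned above; using torch.randperm to pick the random indexes is an assumption on my part, since the lesson describes the change verbally rather than showing the code.)

```python
import torch

n, lr, bs = 100, 1e-1, 64    # bs: minibatch size
x = torch.ones(n, 2); x[:, 0].uniform_(-1., 1.)
y = x @ torch.tensor([3., 2.]) + torch.rand(n)
a = torch.tensor([-1., 1.], requires_grad=True)

def update():
    idx = torch.randperm(n)[:bs]   # bs random indexes
    # Same loop as before, but the loss is computed on x[idx] and y[idx]:
    # a random few points at a time - the minibatch
    loss = ((x[idx] @ a - y[idx]) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        a.sub_(lr * a.grad); a.grad.zero_()
    return loss.item()

losses = [update() for _ in range(100)]
```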
Epochs are important because if you do lots of epochs, then you’re looking at your images lots of times, and every time you see an image, there’s a bigger chance of overfitting. So we generally don’t want to do too many epochs. A ‘minibatch’ is just a random bunch of points that you use to update your weights. SGD is just gradient descent using minibatches. ‘Architecture’ and ‘model’ kind of mean the same thing. In this case, our architecture is y = Xa. The architecture is the mathematical function that you’re fitting the parameters to, and we’re going to learn, either today or next week, what the mathematical function of things like resnet34 actually is. But it’s basically pretty much what you’ve just seen: a bunch of matrix products. ‘Parameters’, also known as coefficients, also known as weights, are the numbers that you’re updating. And the ‘loss function’ is the thing that’s telling you how far away, or how close, you are to the correct answer. Any questions? All right. So, these models, these predictors, these teddy bear classifiers, are functions that take pixel values and return probabilities. They start with some functional form, like y = Xa, and they fit the parameters, ‘a’, using SGD, to try and do the best job of calculating your predictions. So far, we’ve learned how to do regression, which is predicting a single number. Next week, we’ll learn how to do the same thing for classification, where we have multiple numbers. But it’s basically the same. In the process, we had to do some math: some linear algebra and some calculus. And a lot of people get a bit scared at that point and tell us “I am NOT a math person”. If that is you, that’s totally okay, but you’re wrong. You are a math person. In fact, it turns out that, in the actual academic research around this, there are no “math people” and “non-math people”. It turns out to be entirely a result of culture and expectations.
So you should check out Rachel’s talk, “There’s No Such Thing As Not a Math Person”, where she will introduce you to some of that academic research. If you think of yourself as not a math person, you should watch this so that you learn that you’re wrong: your thoughts are actually there because somebody told you you’re not a math person, but there’s no academic research to suggest that there is such a thing. In fact, there are some cultures, like Romania and China, where the ‘not a math person’ concept never even appeared. It’s almost unheard of in some cultures for somebody to say “I’m not a math person”, because it just never entered that cultural identity. So, don’t freak out if words like ‘derivative’ and ‘gradient’ and ‘matrix product’ are things that you’re kind of scared of. It’s something you can learn, something you’ll be okay with. Okay, so the last thing that we’re going to close with today… Oh, I just got a message from Simon Willison. Ah! Simon’s telling me he’s actually not that special; lots of people won medals. So, that’s the worst part about Simon: not only is he really smart, he’s also really modest, which I think is just awful. I mean, if you’re going to be that smart, at least be a horrible human being and, you know, make it okay. Okay, so the last thing I want to close with is the idea of underfitting and overfitting (and we’re going to look at this more next week). We just fit a line to our data. But imagine that our data wasn’t actually line-shaped. If we try to fit something like “constant + constant * x”, a line, to it, then it’s never going to fit very well; no matter how much we change these two coefficients, it’s never going to get really close. On the other hand, we could fit some much bigger equation, in this case a higher-degree polynomial, with lots of wiggly bits, like so.
But if we did that, it’s very unlikely that if we went and looked at some other place, to find out what the temperature was and how much ice cream they were selling, we’d get a good result, because the wiggles are far too wiggly. So this is called ‘overfitting’. We’re looking for some mathematical function that fits “just right”, to stay with the teddy bear analogy. So you might think, if you have a statistics background, that the way to make things fit “just right” is to have exactly the right number of parameters: to use a mathematical function that doesn’t have too many parameters in it. It turns out that’s actually completely the wrong way to think about it. There are other ways to make sure that we don’t overfit, and in general, these are called ‘regularization’. Regularization covers all the techniques to make sure that when we train our model, it’s going to work well not only on the data it’s seen, but on the data it hasn’t seen yet. So, the most important thing to know when you’ve trained a model is: how well does it work on data that it hasn’t been trained with? And, as we’re going to learn a lot more about next week, that’s why we have this thing called a ‘validation set’. So what happens with a validation set is that we do our minibatch SGD training loop with one set of data (with one set of teddy bears, grizzlies,
black bears), and then when we’re done, we check the loss function and the accuracy to see how good the model is on a bunch of images which were not included in the training. And if we do that, then if we have something which is too wiggly, it’ll tell us: “Oh, your loss function and your error are really bad”, because on the bears that it hasn’t been trained with, the wiggly bits are in the wrong spots. Whereas if it was underfitting, the validation set would also tell us that it’s really bad. So, even for people that don’t go through this course and don’t learn about the details of deep learning: if you’ve got managers or colleagues at work who are wanting to learn about AI, the only thing that you really need to teach them is the idea of a validation set. Because that’s the thing they can use to figure out whether somebody’s selling them snake oil or not: they hold back some data, and then when they get told “oh, here’s a model that we’re going to roll out”, they say “okay, fine, I’m just going to check it on this held-out data to see whether it generalizes”. There are a lot of details to get right when you design your validation set. We will talk about them briefly next week, but a fuller version is in Rachel’s piece on the fast.ai blog called “How (and why) to create a good validation set”. And this is also one of the things we go into in a lot of detail in the ‘Intro to Machine Learning’ course. So we’re going to try and give you enough to get by for this course, but it is certainly something that’s worth deeper study as well. Any questions or comments before we wrap up? Okay, good. All right, well, thanks everybody. I hope you have a great time building your web applications. See you next week.