Lesson 2: Deep Learning 2019 – Data cleaning and production; SGD from scratch


Welcome to Lesson 2, where we’re going to take a deeper dive into computer vision applications, taking some of the amazing stuff that you’ve all been doing during the week and going even further. So let’s take a look.

Before we do, a reminder that we have two really important topics on the forums. They’re pinned at the top of the forum category. One is ‘FAQ, Resources and Official Course Updates’. This is where, if there’s something useful for you to know during the course, we will post it. Nobody else can reply to that thread, so if you set that thread to ‘Watching’ with notifications, you’re not going to be bugged by anything except the stuff that we think you need to know for the course. It’s also got all the official information about how to get set up on each platform. Please note that a lot of people post all kinds of other tidbits about how they’ve set things up on previous solutions or previous courses or other places. I don’t recommend you use those, because these are the ones that we’re testing every day, and that the folks involved in these platforms are testing every day, and they definitely work. So I would strongly suggest you follow those tips. And if you do have a question about using one of these platforms, please use these discussions, not some other topic that you create, because this way the people who are involved in these platforms will be able to see it and things won’t get messy.

Secondly, for every lesson there will be an ‘official updates’ thread for that lesson.
So there’s ‘Lesson One Official Updates’, and the same thing applies: only Fast.ai people will be posting to it, so you can watch it safely. It’ll have all the things like the videos, the notebooks, and so forth. They’re all wiki threads, so you can help us to make them better as well.

I mentioned the idea of watching a thread. This is a really good idea: you can go to a thread, particularly those official update ones, and click ‘Watching’ at the bottom. If you do that, it’s going to enable notifications for any updates to that thread. Secondly, if you click on your user name in the top right, go to Preferences, and turn this on, that’ll give you an email as well. So any of you who have missed some of the updates so far, go back and have a look through, because we’re really trying to make sure that we keep you updated with anything that we think is important.

One thing which can be more than a little overwhelming is that even now, after just one week, the most popular thread has 1.1 thousand (1,100) replies. That’s an intimidatingly large number. I’ve actually read every single one of them, and I know Rachel has, and I know ( ) has, and I think Francisco has, but you shouldn’t need to. What you should do is click ‘Summarize This Topic’, and it’ll appear like this: all of the most liked replies will appear, and then there’ll be ‘view 31 hidden replies’ or whatever in between. So that’s how you navigate these giant topics. That’s also why it’s important that you click the like button, because that’s the thing that’s going to cause people to see a post in this recommended view.

So when you come back to work, hopefully you’ve realized by now that on the official course website, ‘course.Fast.ai’ v3, you click ‘Returning to work’, you click the name of the platform you’re using, and you then follow the two steps.
Step one will be how to make sure that you’ve got the latest notebooks, and step two will be how to make sure you’ve got the latest Python library software. They all look pretty much like this, but they’re slightly different from platform to platform. So please don’t use some different set of commands you read somewhere else; only use the commands that you read about here, and that will make everything very smooth. If things aren’t working for you, if you get into some kind of messy situation, which we all do, just delete your instance and start again. Unless you’ve got mission-critical stuff there, it’s the easiest way to get out of a sticky situation. And if you follow the instructions here, you really should find it works fine.

So this is what I really wanted to talk about most of all: what people have been doing this week. If you’ve noticed, and a lot of you have, there have been a hundred and sixty-seven people sharing their work. This is really cool, because it’s pretty intimidating to put yourself out there and say, ‘I’m new to all this, but here’s what I’ve done.’ So, some examples of things I thought were really interesting. One was figuring out who’s talking: is it Ben Affleck or Joe Rogan? I thought this next one was really interesting, and actually very practical: ‘I wanted to clean up my WhatsApp downloaded images to get rid of memes, so I built a little neural network.’ I mean, how cool is that, to say, ‘Oh yeah, I’ve got something that cleans up my WhatsApp. It’s a deep learning application I wrote last week.’ Why not? It’s so easy now; you can do stuff like this.

And then there have been some really interesting projects. One was looking at the sounds data that was used in this paper.
In this paper, they were trying to figure out what kind of sound things were, and, as you would expect since they published the paper, they got a state of the art of nearly 80% accuracy. Ethan Sutan then tried using the Lesson 1 techniques and got 80.5% accuracy. So I think this is pretty awesome. As best we know, it’s a new state of the art for this problem. Now, maybe somebody since then has published something and we haven’t found it yet, so take all of these with a slight grain of salt. But I’ve mentioned them on Twitter, and lots of people on Twitter follow me, so if anybody knew that there was a much better approach, I’m sure somebody would have said so.

This one is pretty cool: Suvash has a new state-of-the-art accuracy for Devanagari text recognition. I think he’s got it even higher than this now. And this was actually confirmed by the person on Twitter who created the data set. I don’t think he had any idea; he just posted ‘Here’s a nice thing I did’, and this guy on Twitter was like, “Oh, I made that data set. Congratulations, you’ve got a new record.” So that was pretty cool.

I really liked this post from Alena Harley. She describes in quite a bit of detail the issue of metastasizing cancers and the use of point mutations, and why that’s a challenging, important problem. She’s got some nice pictures describing what she wants to do with this and how she can go about turning it into pictures. See, this is the cool trick, right? It’s the same with the sound one: turning sounds into pictures and then using the Lesson 1 approach. Here it’s turning point mutations into pictures and then using the Lesson 1 approach. And what did she find?
It seems that she’s got a new state-of-the-art result, beating the previous best by more than 30%. Somebody on Twitter who’s a VP at a genomics analysis company looked at this as well, and thought it looked to be a state of the art for this particular point mutation problem. So that’s pretty exciting. So you can see, when we talked last week about this idea that this simple process is something which can take you a long way, it really can. I will mention that something like this one in particular is using a lot of domain expertise, like figuring out what picture to create. I wouldn’t know how to do that, because I don’t even really know what a point mutation is, let alone how to create something visually meaningful that a CNN could recognize. But the actual deep learning side is pretty straightforward.

Another very cool result is from Simon Willison and Natalie Downe. They created a ‘cougar or not’ web application over the weekend and won the ‘Science Hack Day’ award in San Francisco. I think that’s pretty fantastic. So, lots of examples of people doing really interesting work. Hopefully this will be inspiring to you, to think, ‘This is cool, that I can do this with what I’ve learned.’ It can also be intimidating, to think, ‘Wow, these people are doing amazing things.’ But it’s important to realize that, out of the thousands of people doing this course, I’m just picking out a few of the really amazing ones. And in fact Simon is one of these very annoying people, like Christine Payne, who we talked about last week, who seems to be good at everything he does. He created Django, one of the world’s most popular web frameworks, he founded a very successful startup, and blah blah blah. So he’s one of these really annoying people who tends to keep being good at things, and now it turns out he’s good at deep learning as well. So, you know, that’s fine.
Simon can go and win a hackathon in his first week of playing with deep learning; maybe it’ll take you two weeks to win your first hackathon. That’s okay. I think it’s important to mention this because there was a really inspiring blog post this week from James Dellinger, who talked about how he created a bird classifier using the techniques from Lesson 1. But what I really found interesting was that, at the end, he said he nearly didn’t start on deep learning at all, because he went through the scikit-learn website, which is one of the most important libraries for Python, and he saw this. He described in this blog post how he was just like, ‘That’s not something I can do, it’s not something I understand’, and then came this kind of realization of, ‘Oh, I can do useful things without reading the Greek.’ So I thought that was a really cool message.

I also really want to highlight Daniel Armstrong on the forum, who I think is a great role model here. He was saying, ‘I want to contribute to the library, and I looked at the docs and I just felt overwhelmed.’ His next message, one day later, was, ‘I don’t know what any of this is, I didn’t know how much there was to it, it caught me off guard, my brain shut down, but I love the way it forces me to learn so much.’ And then one day later: ‘I just submitted my first pull request.’ So I think that’s also right: it’s okay to feel intimidated. There’s a lot, right? But just pick one piece and dig into it. Try to push a piece of code or a documentation update, or create a classifier, or whatever.

So here are lots of cool classifiers people have built. It’s been really, really inspiring: a Trinidad and Tobago islander versus masquerader classifier, a zucchini versus cucumber classifier. This one was really nice.
This was taking the dog and cat breeds thing from last week and actually doing some exploratory work to see what the main features were, and they discovered that they could build a ‘hairiness classifier’. So here we have the hairiest dogs and the baldest cats. So there are interesting things you can do with interpretation. Somebody else on the forum took that and did the same thing for anime, and found that they had accidentally discovered an anime hair color classifier. We can now detect the new versus the old Panamanian buses correctly. Apparently these are the new ones. I much prefer the old ones, but maybe that’s just me. This was a really interesting one: Henri Pallacci discovered that he can recognize, with 85% accuracy, which of 110 countries a satellite image is of, which has definitely got to be beyond human performance for just about anybody. I can’t imagine anybody who can do that in practice. So that was fascinating. There’s batik cloth classification with a hundred percent accuracy. Another interesting one went a little bit further, using some techniques we’ll be discussing in the next couple of courses, to build something that can recognize ‘complete or incomplete foundation buildings’ and plot them on an aerial satellite view.

So, lots and lots of fascinating projects. Don’t worry, it’s only been one week. It doesn’t mean everybody has to have had a project out yet. A lot of the folks who already have a project out have done a previous course, so they’ve got a bit of a head start. But we’ll see today how you can definitely create your own classifier this week.

So from today, after we dig a bit deeper into how to really make these computer vision classifiers in particular work well, we’re then going to look at the same thing for text.
We’re then going to look at the same thing for tabular data, which is more like spreadsheets and databases. Then we’re going to look at ‘collaborative filtering’, so recommendation systems. That’s going to take us into a topic called embeddings, which is basically a key underlying platform behind these applications. That will take us back into more computer vision, and then back into more NLP. The idea here is that it turns out to be much better for learning if you see things multiple times. So rather than saying, ‘Okay, that’s computer vision, you won’t see it again for the rest of the course’, we’re actually going to come back to the two key applications, NLP and computer vision, a few weeks apart. That’s going to force your brain to realize, ‘Oh, I have to remember this. It’s not just something I can throw away.’

For people who have more of a hard sciences background in particular, a lot of folks find this ‘here’s some code, type it in, start running it’ approach, rather than the ‘here’s lots of theory’ approach, confusing and surprising and odd at first. So for those of you, I just wanted to remind you of this basic tip, which is: keep going. You’re not expected to remember everything yet. You’re not expected to understand everything yet. You’re not expected to know why everything works yet. You just want to be in a situation where you can enter the code, run it, and get something happening, and then you can start to experiment and get a feel for what’s going on. And then push on. Most of the people who have done the course and have gone on to be really successful watch the videos at least three times.
So they go through the whole lot, then go through it slowly the second time, then go through it really slowly the third time, and I consistently hear them say, ‘I get a lot more out of it each time I go through.’ So don’t pause at Lesson 1 and stop until you feel ready to continue.

This approach is based on a lot of academic research into learning theory. One guy in particular, David Perkins from Harvard, has this really great analogy. He’s a researcher into learning theory, and he describes this approach as the ‘whole game’, which is basically: if you’re teaching a kid to play soccer, you don’t first of all teach them about how the friction between a ball and grass works, and then teach them how to sew a soccer ball with their bare hands, and then teach them the mathematics of parabolas when you kick something in the air. No. You say, “Here’s a ball. Let’s watch some people playing soccer. Okay, now we’ll play soccer.” And then gradually, over the following years, they learn more and more so that they can get better and better at it. So this is what we’re trying to get you to do: to play soccer, which in our case is to type code and look at the inputs and look at the outputs.

So let’s dig into our first notebook, which is called “lesson2-download”. What we’re going to do is see how to create your own classifier with your own images. It’s going to be a lot like last week’s pet detector, but it’ll detect whatever you like, so it’ll be like some of those examples we just saw. How would you create your own “Panama Bus Detector” from scratch? The approach is inspired by Adrian Rosebrock, who has a terrific website called “pyimagesearch”, and he has this nice explanation of how to create a data set using Google Images. That was definitely an inspiration for some of the techniques we use here. So thank you to Adrian, and you should definitely check out his site.
It’s full of lots of good resources. So here we are. We’re going to try to create a “teddy bear” detector, and we’re going to try to separate teddy bears from black bears from grizzly bears. Now, this is very important. I have a three-year-old daughter and she needs to know what she’s dealing with. In our house, you would be surprised at the number of monsters, lions, and other terrifying threats that are around, particularly around Halloween, so we always need to be on the lookout to make sure that the thing we’re about to cuddle is in fact a genuine teddy bear. So let’s deal with that situation as best we can.

Our starting point is to find some pictures of teddy bears so we can learn what they look like. So I go to images.google.com, I type in “teddy bear”, and I just scroll through until I find a goodly bunch of them. Okay, that looks like plenty of teddy bears to me. Then I’ll go back to here. You can see it says “search and scroll: go to Google Images and search”. The next thing we need to do is to get a list of all of the URLs there. To do that, back in Google Images you hit ‘ctrl-shift-J’ or ‘command-option-J’ and paste this [highlighted text] into the window that appears. I’ve got Windows, so I hit ‘ctrl-shift-J’ and paste in that code. This is the JavaScript console, for those of you who haven’t done any JavaScript before. I hit Enter and it downloads my file for me. I would call this “teddies.txt” and press Save. So I now have a file of teddies, or rather URLs of teddies. Then I would repeat that process for black bears and for brown bears, since that’s the classifier I want, and I would put each one in a file with an appropriate name. So that’s step one. Step two is that we now need to download those URLs to our server.
Because remember, when we’re using Jupyter notebooks, it’s not running on our computer. It’s running on SageMaker or Crestle or Google Cloud or whatever. So to do that, we start running some Jupyter cells. Let’s grab the Fast.ai library, and let’s start with black bears. I’ve already got my black bears URLs, so I click on this cell for black bears and run it. See here how I’ve got three different cells doing the same thing but with different information? This is one way I like to work with Jupyter notebooks, and it’s something that a lot of people with a stricter scientific background are horrified by: this is not reproducible research. I actually click here and run this cell to set the folder name “black” and the file name “urls_black” for my black bears. I skip the next two cells, and then I run this cell to create that folder. Then I go down to the next section and run the next cell, which is ‘download_images()’ for the black bears. That’s just going to download my black bears to that folder. Then I’ll go back, click on ‘teddys’, run that cell, scroll back down, and run this cell again. That way I’m just going backwards and forwards to download each of the classes that I want.

It’s very manual, but I’m very iterative and very experimental, so that works well for me. If you’re better at planning ahead than I am, you can write a proper loop or whatever and do it that way. But when you see my notebooks and see things like configuration cells doing the same thing in different places, this is a strong sign that I didn’t run this in order. I clicked one place, went to another, ran that, went back, and went back again. I’m an experimentalist. I really like to experiment in my notebook. I treat it like a lab journal.
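The ‘proper loop’ mentioned above might look something like the following sketch. It only prepares one folder per class and pairs it with its URL file. The class names, the ‘.txt’ file names, and the helper ‘prepare_folders’ are assumptions made for illustration, and the notebook’s actual download call is left as a comment rather than executed.

```python
import tempfile
from pathlib import Path

# Hypothetical mapping of class folder -> URL file, mirroring the three
# near-identical configuration cells described in the lesson.
CLASSES = {
    "black": "urls_black.txt",
    "teddys": "urls_teddys.txt",
    "grizzly": "urls_grizzly.txt",
}

def prepare_folders(root, classes):
    """Create one sub-folder per class and return (urls_file, dest) pairs."""
    jobs = []
    for folder, urls_name in classes.items():
        dest = Path(root) / folder
        dest.mkdir(parents=True, exist_ok=True)
        jobs.append((Path(root) / urls_name, dest))
    return jobs

# A temporary directory stands in for the real data folder on the server.
root = Path(tempfile.mkdtemp()) / "bears"
jobs = prepare_folders(root, CLASSES)

for urls_file, dest in jobs:
    # In the notebook, the download_images() call for this class would go here.
    print(f"{urls_file.name} -> {dest.name}/")
```

Whether a loop like this is better than clicking back and forth between cells is a matter of taste; the loop is easier to rerun from top to bottom, while the manual approach keeps each step visible.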
I try things out and I see what happens, and so this is how my notebooks end up looking. It’s a really controversial topic. A lot of people feel this is wrong, that you should only ever run things top to bottom, and that everything you do should be reproducible. For me, I don’t think that’s the best way of using human creativity. I think human creativity is best inspired by trying things out, seeing what happens, and fiddling around. So you can see how you go; see what works for you.

So that will download the images to your server. It’s going to use multiple processes to do so, and one problem there is that if something goes wrong, it’s a bit hard to see what went wrong. So you can see in the next section there’s a commented-out version that says “max_workers=0”. That’ll do it without spinning up a bunch of processes, and will tell you the errors better. So if things aren’t downloading, try using the second version. I grabbed a small number of each, and then the next thing I found I needed to do was to remove the images that aren’t actually images at all.
This happens all the time. There are always a few images in every batch that are corrupted for whatever reason: Google Images told us that this URL had an image, but actually it doesn’t anymore. So we’ve got this thing in the library called “verify_images()”, which will check all of the images in a path and tell you if there’s a problem. If you say ‘delete=True’, it will actually delete them for you. So that’s a really nice, easy way to end up with a clean data set.

At this point I now have a bears folder containing a /grizzly folder, a /teddys folder, and a /black folder. In other words, I have the basic structure we need to create an image data bunch and start doing some deep learning. So let’s go ahead and do that. Very often, when you download a data set from Kaggle or from some academic source, there will be a folder called /train, a folder called /valid, and a folder called /test, containing the different data sets. In this case we don’t have a separate validation set, because we just grabbed these images from Google search. But you still need a validation set, otherwise you don’t know how well your model is going. We’ll talk more about this in a moment. So whenever you create a data bunch, if you don’t have a separate training and validation set, you can just say: the training set is in the ‘current’ folder (because by default it looks in a folder called /train), and I want you to set aside 20 percent of the data, please. This is going to create a validation set for you automatically and randomly.

You’ll see that whenever I create a validation set randomly, I always set my random seed to something fixed beforehand. This means that every time I run this code, I’ll get the same validation set. In general I’m not a fan of making my machine learning experiments reproducible, i.e. ensuring I get exactly the same result every time.
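To make the earlier clean-up step concrete, here is a minimal sketch of the idea behind checking a folder and deleting files that are not really images. It is not how “verify_images()” is implemented; it only looks at the first few bytes of each file, whereas a real check would try to decode the image. The function names and the list of signatures are assumptions for illustration.

```python
import tempfile
from pathlib import Path

# Leading bytes ("magic numbers") of a few common image formats.
SIGNATURES = (b"\xff\xd8\xff", b"\x89PNG\r\n\x1a\n", b"GIF87a", b"GIF89a")

def looks_like_image(path):
    """Return True if the file starts with a known image signature."""
    with open(path, "rb") as f:
        head = f.read(8)
    return head.startswith(SIGNATURES)

def remove_non_images(folder, delete=False):
    """List files that fail the check; delete them only if delete=True."""
    bad = [p for p in sorted(Path(folder).iterdir())
           if p.is_file() and not looks_like_image(p)]
    if delete:
        for p in bad:
            p.unlink()
    return bad

# Tiny demonstration with fake files in a temporary folder.
folder = Path(tempfile.mkdtemp())
(folder / "ok.png").write_bytes(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16)
(folder / "broken.jpg").write_bytes(b"<html>404 Not Found</html>")

bad = remove_non_images(folder, delete=True)
remaining = sorted(p.name for p in folder.iterdir())
```

The second fake file mimics the situation described above: the URL once pointed at an image, but the download produced an error page saved under an image file name.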
The randomness is, to me, a really important part of finding out whether your solution is stable, whether it’s going to work each time you run it. But what is important is that you always have the same validation set. Otherwise, when you’re trying to decide whether this hyperparameter change has improved your model, but you’ve got a different set of data you’re testing it on, you don’t know; maybe that set of data just happens to be a bit easier. So that’s why I always set the random seed here.

So (let’s run that cell) we’ve now got a data bunch. You can look inside at ‘data.classes’ and you’ll see these are the folders that we created. So it knows the classes; by classes we mean all the possible labels: black bear, grizzly bear, or teddy bear. We can run show_batch and take a little look, and it tells us straight away that some of these are going to be a little bit tricky. This one is not a photo, for instance. Some of them are cropped funny. Some might be tricky: if you ended up with a black bear standing on top of a grizzly bear, that might be tough. Anyway, you can double-check here: ‘data.classes’, here they are. (Remember, ‘.c’ is the attribute which, for classifiers, tells us how many possible labels there are. We’ll learn about some other, more specific meanings of ‘.c’ later.) We can see how many things are in our training set and how many are in our validation set: we’ve got 473 in the training set and 141 in the validation set.

At that point we can go ahead. You’ll see all these commands are identical to the pet classifier from last week. We can create our CNN (our convolutional neural network) using that data. I tend to default to using a ‘resnet34’. Let’s print out the error rate each time, run ‘fit_one_cycle()’ for four epochs, and see how we go. And we have a 2% error rate, so that’s pretty good.
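The point about fixing the random seed before a random validation split can be illustrated with a small standalone sketch. This is not the library’s implementation; the helper name ‘split_train_valid’ and the seed value are assumptions. It only shows that a fixed seed yields the same held-out 20 percent on every run.

```python
import random

def split_train_valid(items, valid_pct=0.2, seed=42):
    """Randomly hold out valid_pct of items, identically on every call."""
    rng = random.Random(seed)      # private generator with a fixed seed
    shuffled = sorted(items)       # sort first so input order does not matter
    rng.shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_pct)
    return shuffled[n_valid:], shuffled[:n_valid]

# Hypothetical file names standing in for the downloaded images.
files = [f"img_{i:03d}.jpg" for i in range(100)]

train_a, valid_a = split_train_valid(files)
train_b, valid_b = split_train_valid(list(reversed(files)))

# Same seed, same validation set, so two runs are measured on the same data.
print(valid_a == valid_b, len(train_a), len(valid_a))
```

With the validation set pinned down like this, a change in error rate after a hyperparameter tweak can be attributed to the tweak rather than to an easier or harder draw of validation images.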
Personally, sometimes it’s easy for me to tell a black bear from a grizzly bear, but sometimes it’s a bit tricky. This one seems to be doing pretty well. After I make some progress with my model and things are looking good, I always like to save where I’m up to, to save me the 54 seconds of going back and doing it again. And, as usual, we unfreeze() the rest of our model. We’re going to be learning more about what that means during the course. Then we run the ‘learning rate finder’ and plot it (it tells you exactly what to type), and we take a look.

Now, we’re going to be learning about learning rates today, actually. But for now, here’s what you need to know: on the learning rate finder, what you’re looking for is the strongest downward slope that’s sticking around for quite a while. So this one here looks more like a bump, but this looks like an actual downward slope to me. It’s something you’re going to have to practice with to get a feel for which bit works. If you’re not sure whether it’s this bit or this bit, try both learning rates and see which one works better. But I’ve been doing this for a while, and I’m pretty sure this looks like where it’s really learning properly. Here it’s not so steep, so I would probably pick something back here for my learning rate. You can see I picked 3×10^-5, so somewhere around here. That sounds pretty good. That’s for my bottom learning rate. For my top learning rate I normally pick 1×10^-4 or 3×10^-4. I don’t really think about it too much; that’s a rule of thumb, and it always works pretty well. One of the things you’ll realize is that most of these parameters don’t actually matter that much in detail. If you just copy the numbers that I use each time, the vast majority of the time it’ll just work fine. And we’ll see places where it doesn’t today.
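The ‘strongest downward slope that’s sticking around for quite a while’ is judged by eye in the lesson. As a rough illustration of what the eye is doing, the sketch below scans a made-up list of (learning rate, loss) points and reports where the average slope over a small window is most negative. The numbers, the window size, and the function name are all assumptions; this is a heuristic for intuition, not the learning rate finder itself.

```python
import math

def steepest_descent_lr(lrs, losses, window=2):
    """Return the learning rate at the start of the steepest sustained drop.

    The slope is measured as change in loss per decade of learning rate,
    averaged over `window` consecutive steps so that short bumps count less.
    """
    best_lr, best_slope = None, 0.0
    for i in range(len(lrs) - window):
        decades = math.log10(lrs[i + window]) - math.log10(lrs[i])
        slope = (losses[i + window] - losses[i]) / decades
        if slope < best_slope:
            best_lr, best_slope = lrs[i], slope
    return best_lr

# Made-up recording: flat at first, a sustained drop, then the loss blows up.
lrs = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
losses = [1.00, 0.98, 0.90, 0.50, 0.45, 2.00]

print(steepest_descent_lr(lrs, losses))
```

If two candidate regions look similar, the advice above still applies: try both learning rates and see which one works better.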
Okay, so we’ve got a 1.4% error rate after doing another couple of epochs, so that’s looking great. We’ve downloaded some images from Google Image Search and created a classifier, and we’ve got a 1.4% error rate. Let’s save it. Then, as per usual, we can use the ClassificationInterpretation class to have a look at what’s going on. In this case, we made one mistake: there was one black bear classified as a grizzly bear. So that’s a really good step. We’ve come a long way, but possibly you could do even better if your data set was less noisy. Maybe Google Image Search didn’t give you exactly the right images all the time. So how do we fix that? We want to clean it up.

Combining a human expert with a computer learner is a really good idea. Very, very few people publish on this, and very, very few people teach this, but to me it’s the most useful skill, particularly for you. Most of the people watching this are domain experts, not computer science experts, so this is where you can use your knowledge of ‘point mutations’ in genomics or ‘Panamanian buses’ or whatever. So let’s see how that would work. Do you remember plot_top_losses() from last time, where we saw the images which the model was either the most wrong about or the least confident about? We’re going to look at those and decide which of them are noisy. If you think about it, it’s very unlikely that mislabeled data is going to be predicted correctly and with high confidence. So we’re going to focus on the ones which the model is saying either it’s not confident about, or it was confident about but wrong. They are the things which might be mislabeled. A big shout-out to the San Francisco Fast.ai Study Group, who created this new widget this week called the FileDeleter.
That’s Zach and Jason and Francisco, who built this thing where we can basically take the top_losses() from that interpretation object we just created. There’s not just plot_top_losses(), there’s also top_losses(), and top_losses() returns two things: the losses of the things that were the worst, and the indexes into the data set of the things that were the worst. If you don’t pass anything at all, it’s going to return the entire data set, but sorted, so the first things will be the highest losses. As we’ll keep seeing during the course, every data set in Fast.ai has an x and a y. The x contains the things that are used to, in this case, get the images, so this is the image file names, and the y’s will be the labels. So if we grab the indexes and pass them into the data set’s x, this is going to give us the file names of the data set ordered by which ones had the highest loss, so which ones it was either confident and wrong about, or not confident about. We can pass that to this new FileDeleter widget that they’ve created.

Just to clarify, this top_loss_paths contains all of the file names in our data set, and when I say ‘in our data set’, this particular one is our validation data set. So what this is going to do is clean up mislabeled images, or images that shouldn’t be there, and we’re going to remove them from the validation set so that our metrics will be more correct. You then need to rerun these two steps, replacing .valid_ds with .train_ds, to clean up your training set and get the noise out of that as well. It’s good practice to do both. We’ll talk about test sets later as well; if you also have a test set, you would then repeat the same thing. So we run FileDeleter(), passing in that sorted list of paths, and what pops up is basically the same thing as plot_top_losses().
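The ‘losses plus indexes, sorted worst first’ idea described above can be mimicked in a few lines of plain Python. This is a toy stand-in rather than the library’s method: the function below, the file names, and the loss values are all invented for illustration.

```python
def top_losses(losses, k=None):
    """Return (losses, indexes) sorted from highest loss to lowest.

    With k=None the whole data set is returned, just reordered.
    """
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    if k is not None:
        order = order[:k]
    return [losses[i] for i in order], order

# Hypothetical validation items (the "x") and their per-item losses.
file_names = ["teddys/a.jpg", "black/b.jpg", "grizzly/c.jpg", "black/d.jpg"]
losses = [0.05, 2.31, 0.40, 1.10]

sorted_losses, idxs = top_losses(losses)

# Indexing the file names with the returned indexes gives the paths ordered
# by loss, which is the kind of list a review widget would step through.
top_loss_paths = [file_names[i] for i in idxs]
print(top_loss_paths)
```

The items at the front of this list are the ones worth inspecting by hand first, since confidently correct predictions on mislabeled data are unlikely.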
In other words, these are the ones which it’s either wrong about or the least confident about. And, not surprisingly, this one here does not appear to be a teddy bear, a black bear, or a brown bear. So this shouldn’t be in our data set, and what I do is whack on the Delete button. All the rest do indeed look like bears, so I can click Confirm and it’ll bring up another five. What’s that? That’s not a bear, is it? Does anybody know what that is? I’m going to say that’s not a bear. Delete. Confirm. Oh, not there. Well, that’s a teddy bear, I’ll leave that. That’s not really one, so I’ll get rid of that one.
Confirm. Okay. So what I tend to do when I do this is keep going, Confirm, Confirm, until I get to a couple of screens full of things that all look okay. That suggests to me that I’ve got past the worst bits of the data. And that’s it. Now, once you’ve done it for the training set as well, you can go back and retrain your model.

I’ll just note here that what our San Francisco study group did was actually build a little app inside a Jupyter notebook, which you might not have realized is possible. Not only is it possible, it’s actually surprisingly straightforward. And just like everything else, you can hit double question mark to find out their secrets. So here is the source code. Really, if you’ve done any GUI programming before, it’ll look incredibly normal. There are basically callbacks for what happens when you click on a button, where you just do standard Python things, and to actually render it you just use widgets, and you can lay it out using standard boxes and whatever. This idea of creating applications inside notebooks is really underused, but it’s super neat, because it lets you create tools for your fellow practitioners, your fellow experimenters. You could definitely envisage taking this a lot further. In fact, by the time you’re watching this on the MOOC, you will probably find that there are a whole lot more buttons here, because we’ve already got a long list of to-dos that we’re going to add to this particular thing. So I’d love for you to think about this: now that you know it’s possible to write applications in your notebook, what are you going to write?
And if you google for ‘ipywidgets’, you can learn about the little GUI framework, to find out what kind of widgets you can create, what they look like, how they work and so forth. And you’ll find it’s actually a pretty complete GUI programming environment you can play with, and this will all work nicely with your models and so forth. It’s not a great way to productionize an application, because it is sitting inside a notebook. This is really for things which are going to help other practitioners, other experimentalists and so forth. For productionizing things, you need to actually build a production web app, which we’ll look at next. Okay, so after you have cleaned up your noisy images, you can then retrain your model, and hopefully you’ll find it’s a little bit more accurate. One thing you might be interested to discover when you do this is that it actually doesn’t matter very much, most of the time. On the whole, these models are pretty good at dealing with moderate amounts of noisy data. The problem would occur if your data was not randomly noisy, but biased noisy. So I guess the main thing I’m saying is, if you go through this process of cleaning up your data and then rerun your model and it’s like 0.001% better, that’s normal. Okay, it’s fine. But it’s still a good idea just to make sure that you don’t have too much noise in your data, in case it is biased. So at this point we’re ready to put our model in production. And this is where I hear a lot of people ask me about, you know, which mega Google Facebook highly distributed serving system they should use, and how do they use a thousand GPUs at the same time, and whatever else. For the vast, vast, vast majority of things that you all do, you will want to actually run in production on a CPU, not a GPU. Why is that? Because the GPU is good at doing lots of things at the same time. But unless you have a very busy website.
It’s pretty unlikely that you’re going to have 64 images to classify at the same time to put into a batch into a GPU. And if you did, you’ve got to deal with all that queuing and running it all together, and all of your users have to wait until that batch has got filled up and run. It’s a whole lot of hassle, right? And then if you want to scale that, there’s another whole lot of hassle. It’s much easier if you just wrap one thing, throw it at a CPU to get it done, and it comes back again. So yes, it’s going to take, you know, maybe 10 or 20 times longer, right, so maybe it’ll take 0.2 seconds rather than 0.01 seconds, that’s about the kind of times we’re talking about. But it’s so easy to scale. All right, you can chuck it on any standard serving infrastructure. It’s going to be cheap as hell. You can horizontally scale it really easily. Okay? So most people I know who are running apps based on deep learning that aren’t kind of at Google scale are using CPUs. And the term we use is inference, right? So when you’re not training a model, but you’ve got a trained model and you’re getting it to predict things, we call that inference. So that’s why we say here ‘You probably want to use CPUs for inference’. So at inference time, you’ve got your pre-trained model, you saved those weights, and how are you going to use them to create something like Simon Willison’s ‘Cougar detector’? Well, the first thing you’re going to need to know is what were the classes that you trained with? Right? You need to know not just what they are, but what order they were in. Okay, so you will actually need to serialize that, or just type them in, or in some way make sure you’ve got exactly the same classes that you trained with. If you don’t have a GPU on your server, it will use the CPU automatically. If you have a GPU machine and you want to test using a CPU, you can just uncomment this line, and that tells fastai that you want to use the CPU, by passing it on to PyTorch.
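To make the point about class order concrete, here is a minimal sketch using only the Python standard library. It is not the fastai API: the file name `classes.json`, the helper names, and the three class labels are assumptions chosen for illustration. The idea is simply that the list you save at training time is the same list, in the same order, that you use to turn a predicted index back into a name.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical class list, in the exact order used at training time.
classes = ["black", "grizzly", "teddys"]

def save_classes(class_names, path):
    """Write the class names to disk, preserving their order."""
    Path(path).write_text(json.dumps(class_names))

def load_classes(path):
    """Read the class names back; a JSON list keeps its order."""
    return json.loads(Path(path).read_text())

def label_for(index, class_names):
    """Map a predicted class index back to its name."""
    return class_names[index]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "classes.json"   # hypothetical file name
    save_classes(classes, path)
    restored = load_classes(path)

print(restored == classes)      # same names, same order
print(label_for(0, restored))   # name for predicted index 0
```

If the order changed between training and serving, index 0 would silently map to the wrong label, which is why the list is stored rather than rebuilt from, say, a folder listing.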
So here’s an example. We don’t have a cougar detector, we have a ‘teddy bear detector’, and my daughter Claire is about to decide whether to cuddle this friend. Okay, so what she does is she takes Daddy’s deep learning model and she gets a picture of this, and here’s a picture that she’s uploaded to the web app, okay, a picture of the potentially cuddlesome object. And so we’re going to store that in a variable called ‘img’. So open_image() is how you open an image in fastai, oddly enough. Here is that list of classes that we saved earlier. And so as per usual we create a DataBunch, but this time we’re not going to create a DataBunch from a folder full of images, we’re going to create a special kind of DataBunch, which is one that’s going to grab one single image at a time. So we’re not actually passing it any data. The only reason we pass it a path is so that it knows where to load our model from, right? That’s just the folder that the model is going to be in. But what we do need to do is pass it the same information that we trained with. So the same transforms, the same size, the same normalization. This is all stuff we’ll learn more about, but just make sure it’s the same stuff that you used before. And so now you’ve got a DataBunch that actually doesn’t have any data in it at all. It’s just something that knows how to transform a new image in the same way that you trained with, so that you can now do inference. So you can now create a CNN with this kind of fake DataBunch. And again, you would use exactly the same model that you trained with, and you can now load in those saved weights, okay? And so this is the stuff that you do once, just once, when your web app’s starting up, okay, and it takes, you know, 0.1 of a second to run this code. And then you just go learn.predict(img), and it’s lucky we did that, because it is not a teddy bear.
This is actually a black bear, so, thankfully, due to this excellent deep learning model, my daughter will avoid having a very embarrassing black bear cuddle incident. So, what does this look like in production? Well, I took Simon Willison’s code and shamelessly stole it, made it probably a little bit worse, but basically it’s going to look something like this. So Simon used a really cool web app toolkit called Starlette. If you’ve ever used Flask, this will look extremely similar, but it’s kind of a more modern approach. By ‘modern’ what I really mean is that you can use ‘await’. It basically means that you can wait for something that takes a while, such as grabbing some data, without using up a process. So for things like ‘I want to get a prediction’ or ‘I want to load up some data’ or whatever, it’s really great to be able to use this modern Python 3 asynchronous stuff. So Starlette would come highly recommended for creating your web app. And so yeah, you just create a route as per usual in a web app, and in that you say this is ‘async’ to ensure that it doesn’t steal the process while it’s waiting for things. You open your image, you call .predict() and you return that response, and then you can use, you know, a JavaScript client or whatever to show it. And that’s it. That’s basically the main contents of your web app. So give it a go, right? You know, this week, even if you’ve never created a web application before, there’s a lot of nice little tutorials online and kind of starter code. If in doubt, why don’t you try Starlette? There’s free hosting that you can use; there’s one called ‘PythonAnywhere’, for example. With the one that Simon’s used (we’ll mention that on the forum), you can basically package it up as a Docker thing and shoot it off and it’ll serve it up for you. So it doesn’t even need to cost you any money. And so all these classifiers that you’re creating, you can turn them into web applications.
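The remark about ‘async’ and ‘await’ not stealing the process can be illustrated without any web framework. The sketch below uses only Python’s standard `asyncio` module, not Starlette; `fake_fetch` is a made-up stand-in for slow I/O such as downloading an uploaded image, and the delays are arbitrary. Two simulated requests wait at the same time inside one process instead of one after the other.

```python
import asyncio
import time

async def fake_fetch(name, delay):
    """Hypothetical stand-in for slow I/O, e.g. fetching an uploaded image."""
    await asyncio.sleep(delay)   # yields control instead of blocking the process
    return f"{name} done"

async def main():
    start = time.perf_counter()
    # Both "requests" are waiting concurrently inside a single process.
    results = await asyncio.gather(
        fake_fetch("request-1", 0.2),
        fake_fetch("request-2", 0.2),
    )
    elapsed = time.perf_counter() - start
    return results, elapsed

results, elapsed = asyncio.run(main())
print(results)
print(f"elapsed: {elapsed:.2f}s")   # roughly one delay, not the sum of both
```

An async route in a web toolkit relies on the same mechanism: while one request is awaiting data, the process is free to make progress on another.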
So I’ll be really interested to see what you’re able to make of that. That will be really fun. Okay, so let’s take a break we’ll come back at 7:35. See you then. Okay. So let’s move on. So I mentioned that most of the time, the kind of rules of thumb I’ve shown you will probably work and if you look at the ‘Share Your Work’ thread you’ll find most of the time people are posting things saying “I downloaded these images, I tried this thing..” “…They worked much better than expected.” Well, that’s cool. And then like 1 out of 20 says, like, “Ah,…” “I had a problem.” So let’s have a talk about what happens when you have a problem. And this is where we’re gonna start getting into a little bit of theory because in order to understand why we have these problems and how we fix them, it really helps to know a little bit about what’s going on. So first of all, let’s look at examples of some problems. The problems basically will be either “your learning rate is too high or low” or “your number of epochs is too high or low”. So we’re going to learn about what those mean and why they matter but first of all, because we’re experimentalists, let’s try them. All right? So let’s go with our teddy bear detector and let’s make our learning rate really high. The default learning rate is 0.003 that works most of the time, so what if we try a learning rate of 0.5? That’s huge! What happens? Our validation loss gets pretty damn high. Remember, this is normally something that’s underneath 1, right? So if you see your validation loss do that, right? Before we even learn what validation loss is, just know this: if it does that, your learning rate’s too high. That’s all you need to know. Okay? Make it lower. Doesn’t matter how many epochs you do. And if this happens, there’s no way to undo this you have to go back and create your neural net again and .fit() from scratch with a lower learning rate. So that’s “Learning rate (LR) too high”. 
“Learning rate too low”… What if we use a learning rate not of 0.003 but 1e-5, so 0.00001, right? So I’ve just copied and pasted what happened when we trained before with our default learning rate, and within one epoch we were down to a 2 or 3% error rate. With this really low learning rate, our error rate does get better, but very, very slowly. Right? And you can plot it. learn.recorder is an object which is going to keep track of lots of things happening while you train. You can call .plot_losses() to plot out the validation and training loss, and you can just see them gradually going down, so slow, right? So if you see that happening, then you have a learning rate which is too small. Okay? So bump it up by 10 or bump it up by 100 and try again. The other thing you’ll see if your learning rate is too small is that your training loss will be higher than your validation loss. You never want a model where your training loss is higher than your validation loss. That always means you haven’t fitted enough, which means either your learning rate is too low or your number of epochs is too low. So if you have a model like that, train it some more or train it with a higher learning rate. Okay? “Too few epochs”. So what if we train for just one epoch? Our error rate is certainly better than random: 5%. But look at this, the difference between training loss and validation loss. The training loss is much higher than the validation loss. So too few epochs and too low a learning rate look very similar, right? And so you can just try running more epochs, and if it’s taking forever you can try a higher learning rate. If you try a higher learning rate and the loss goes off to 100,000 million, then put it back to where it was and try a few more epochs. That’s the balance, right? That’s basically all you care about 99% of the time. And this is only the one in 20 times that the defaults don’t work for you.
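The behaviour of a learning rate that is too high or too low can be reproduced on a toy problem. The following is a sketch under stated assumptions, not the bear classifier: it runs plain gradient descent on the one-parameter loss w², and the three learning rates (1.5, 1e-5, 0.3) are picked only to show the three regimes for this particular function. The numbers that work for a real network will differ.

```python
def descend(lr, steps=20, start=1.0):
    """Run plain gradient descent on loss(w) = w**2 and return the loss per step."""
    w = start
    losses = []
    for _ in range(steps):
        grad = 2 * w           # derivative of w**2 with respect to w
        w = w - lr * grad      # the gradient descent update
        losses.append(w * w)
    return losses

too_high = descend(lr=1.5)     # each step overshoots: the loss blows up
too_low = descend(lr=1e-5)     # each step barely moves: the loss hardly changes
reasonable = descend(lr=0.3)   # the loss shrinks quickly towards zero

print(f"too high:   {too_high[0]:.3g} -> {too_high[-1]:.3g}")
print(f"too low:    {too_low[0]:.6f} -> {too_low[-1]:.6f}")
print(f"reasonable: {reasonable[0]:.3g} -> {reasonable[-1]:.3g}")
```

With the high rate the loss grows at every step, so training for longer cannot help; with the low rate the loss does go down, just far too slowly to be useful.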
Okay, “Too many epochs” (we’re going to be talking more about this) creates something called “overfitting”. If you train for too long, as we’re going to learn about, it will learn to recognize your particular teddy bears, but not teddy bears in general. Here’s the thing: despite what you may have heard, it’s very hard to overfit with deep learning. So we were trying today to show you an example of overfitting, and I turned off everything. (We’re going to learn all about these terms soon.) I turned off all the data augmentation, I turned off dropout, I turned off weight decay, I tried to make it overfit as much as I can, I trained it on a small-ish learning rate, I trained it for a really long time, and like maybe I started to get it to overfit, maybe. So the only thing that tells you that you’re overfitting is that the error rate improves for a while and then starts getting worse again. You will see a lot of people, even people that claim to understand machine learning, tell you that if your training loss is lower than your validation loss then you are overfitting. As you will learn today in more detail, and during the rest of the course, that is absolutely not true. Any model that is trained correctly will always have training loss lower than validation loss. That is not a sign of overfitting. That is not a sign you’ve done something wrong. That is a sign you have done something right. Okay. The sign that you are overfitting is that your error starts getting worse, because that’s what you care about, right? You want your model to have a low error. So as long as you’re training and your model error is improving, you are not overfitting. How could you be? Okay? So they’re the main four things that can go wrong. There are some other details that we will learn about during the rest of this course, but honestly, if you stopped listening now (please don’t, that would be embarrassing) and you just, like “Okay.
I’m going to go and download images…” “…I’m going to create CNNs with resnet34 or resnet50…” “…I’m going to make sure that my learning rate and number of epochs is okay,…” “and then I’m going to chuck them up in a Starlette Web API.” Most of the time, you’re done. Okay? At least for computer vision. Hopefully you’ll stick around because you want to learn about NLP and collaborative filtering and tabular data and segmentation and stuff like that as well. Let’s now understand what’s actually going on. What does ‘loss’ mean? What does an ‘epoch’ mean? What does ‘learning rate’ mean? Because for you to really understand these ideas you need to know what’s going on, and so we’re going to go all the way to the other side: rather than creating a state-of-the-art ‘cougar detector’, we’re going to go back and create the simplest possible linear model. Okay? So we’re actually going to start seeing a little bit of math. Okay? But don’t be turned off. It’s okay, right? We’re going to do a little bit of math, but it’s going to be totally fine, even if math’s not your thing. Because the first thing we’re going to realize is that when we see a picture like this number eight, it’s actually just a bunch of numbers. It’s a matrix of numbers. For this grayscale one, it’s a matrix of numbers; if it was a color image, it would have a third dimension. So when you add an extra dimension, we call it a ‘tensor’ rather than a matrix. It would be a 3D tensor of numbers: red, green, and blue. So when we created that teddy bear detector, what we actually did was we created a mathematical function that took the numbers from the images of the teddy bears, and the mathematical function converted those numbers into, in our case, three numbers.
A number for the probability that it’s a teddy, a probability that it’s a grizzly, and the probability that it’s a black bear. In this case, there’s some hypothetical function that’s taking the pixels representing a handwritten digit and returning ten numbers: the probability for each possible outcome, the numbers from 0 to 9. And so what you’ll often see in our code and other deep learning code is that you’ll find a bunch of probabilities, and then you’ll find a function called max or argmax attached to it. And so what that function is doing is it’s saying: find the highest number (the highest probability) and tell me what the index is. So np.argmax or torch.argmax of this array would return this number here. Okay, it returns index 8.
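What argmax does can be written out in a few lines of plain Python. This is an illustrative sketch rather than the library implementation, and the ten probabilities are invented numbers, chosen so that the entry for digit 8 is the largest. In real code you would normally call the library versions mentioned above, `np.argmax` or `torch.argmax`.

```python
def argmax(values):
    """Return the index of the largest element (the first one if there is a tie)."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical probabilities for the digits 0 to 9; position 8 holds the largest.
probs = [0.01, 0.01, 0.02, 0.05, 0.01, 0.03, 0.02, 0.01, 0.80, 0.04]

predicted = argmax(probs)
print(predicted)            # the index of the most probable digit
print(probs[predicted])     # the probability at that index
```

Note that the result is a position, not a probability: it only becomes a label once you look it up in the list of classes.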
That makes sense? In fact, Let’s try it. So we know that the function to predict something is called learn.predict() Okay, so we can check: two question marks before it or after it to get the source code, And here it is, right? pred equals res (result).argmax() and then, what is the class? Well you just pass that into the classes array. So like you should find that the source code in the fastai library can both kind of strengthen your understanding of the concepts and make sure that you know, you know what’s going on and and really help you here. You’ve got a question. Come on over. Q: “Can we have a definition of the error rate being discussed and how it is calculated? I assume it’s cross validation error.” Sure So one way to answer the question of ‘How is error rate calculated?’ would be to type ‘error_rate??’ (question mark) and look at the source code. And it is “1 – accuracy”. Fair enough. And so then a question might be ‘What is accuracy?” accuracy?? (question mark) It is argmax. So we now know that means ‘find out which particular thing it is’ and then look at how often that equals the target. So in other words the actual value and take the mean. So that’s basically what it is. And so then the question is, okay, well, what does that being applied to? and always in fastai, metrics (so these things that we pass in, we call them metrics) are always going to be applied to the validation set. Okay So anytime you put a metric here, it’ll be applied to the validation set because that’s your best practice, right? That’s like, that’s what you always want to do, is make sure that you’re checking your performance on data that your model hasn’t seen and we’ll be learning more about the validation set shortly. Remember, you can also type doc(term-to-look-up) If the source code is not what you want which it might not well be, you actually want the documentation. 
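The answer about error rate, “1 – accuracy” where accuracy is the mean of how often the argmax equals the target, can be sketched in plain Python. This mirrors the description given here rather than reproducing the fastai source, and the four validation predictions and targets are made up for illustration.

```python
def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(predictions, targets):
    """Fraction of items whose most probable class equals the target class."""
    hits = [int(argmax(p) == t) for p, t in zip(predictions, targets)]
    return sum(hits) / len(hits)

def error_rate(predictions, targets):
    """Error rate as described in the lesson: 1 - accuracy."""
    return 1 - accuracy(predictions, targets)

# Hypothetical validation-set outputs: one row of class probabilities per image.
val_predictions = [
    [0.7, 0.2, 0.1],   # predicts class 0
    [0.1, 0.8, 0.1],   # predicts class 1
    [0.3, 0.3, 0.4],   # predicts class 2
    [0.6, 0.3, 0.1],   # predicts class 0, but the target below is 1
]
val_targets = [0, 1, 2, 1]

print(accuracy(val_predictions, val_targets))
print(error_rate(val_predictions, val_targets))
```

Three of the four made-up predictions match their targets, so the two printed values are the accuracy and error rate for that tiny validation set.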
The doc() function will both give you a summary of the types in and out of the function and a link to the full documentation, where you can find out all about how metrics work, and what other metrics there are and so forth. And generally speaking you’ll also find links to more information, where, for example, you will find complete run-throughs and sample code and so forth, showing you how to use all these things. So don’t forget that the doc() function is your friend. Okay? And also, both in the doc function and in the documentation, you’ll see a source link. This is like ??, but what the source link does is it takes you to the exact line of code in GitHub. So you can see exactly how that’s implemented, and what else is around it, so lots of good stuff there. Q. Why were you using 3s for your learning rates earlier, with 3e-5 and 3e-4? We found that 3e-3 is just a really good default learning rate. It works most of the time for your initial fine-tuning, before you unfreeze. And then I tend to kind of just multiply from there. So I generally find that for the next stage I will pick ten times lower than that for the second part of the slice, and whatever the learning rate finder found for the first part of the slice. The second part of the slice doesn’t come from the learning rate finder, it’s just a rule of thumb, which is 10 times less than your first stage, which defaults to 3e-3, and then the first part of the slice is what comes out of the learning rate finder. And we’ll be learning a lot more about these learning rate details both today and in the coming lessons. But yeah, for now all you need to remember is that your basic approach looked like this: it was learn.fit_one_cycle(), some number of epochs (I often pick four) and some learning rate, which defaults to 3e-3. I’ll just type it up fully so you can see. And then we do that for a bit and then we unfreeze it, right?
And then we learn some more, and so this is the bit where I just take whatever I did last time and divide it by 10, and then I also write ‘slice’, like that. And then I have to put one more number in here, and that’s the number that I get from the learning rate finder, the bit where it’s got the strongest slope. So that’s kind of the “don’t have to think about it, don’t really have to know what’s going on” rule of thumb that works most of the time. But let’s now dig in and actually understand it more completely. So we’re going to create this mathematical function that takes the numbers that represent the pixels and spits out probabilities for each possible [?] And by the way, a lot of the stuff that we’re using here, we are stealing from other people who are awesome, and so we are putting their details here. So, please check out their work, because they’ve got great work that we are highlighting in our course. I really like this idea of this little animated gif of the numbers. So thank you to Adam Geitgey for creating that. And I guess that was probably on Quora by the looks of this… Medium, oh yes, it was that terrific Medium post, I remember. He’s got a whole series of Medium posts. So, let’s look and see how we create one of these functions. And let’s start with the simplest function I know, “y = ax + b”. Okay. That’s a line, right? And the gradient of the line is here, and the intercept of the line is here. Okay, so hopefully, when we said that you need to know high school math to do this course, these are the things we’re assuming that you remember. If we do kind of mention some math thing which I’m assuming you remember and you don’t remember it, don’t freak out, right? Happens to all of us. Khan Academy is actually terrific. It’s not just for school kids. Go to Khan Academy, find the concept you need a refresher on, and he explains things really well, so I strongly recommend checking that out.
You know, remember I’m just a philosophy student, right? So all the time I’m either trying to remind myself about something, or I never learnt something, and so we have the whole Internet to teach us these things. So I’m going to rewrite this slightly: “y = a1x + a2”. So let’s just replace b with a2, just give it a different name. Okay, so that’s another way of saying the same thing. Another way of saying that would be if I multiply a2 by the number 1. Okay, this is still the same thing. And so now at this point I’m actually going to say, let’s not put the number 1 there, but let’s put an x1 here and an x2 here, and I’ll say x2 equals 1, okay? So far, this is pretty early high school math. This is multiplying by 1, which I think we can handle, okay? So these two are equivalent, with a bit of renaming. Now, in machine learning, we don’t just have one equation, we’ve got lots, right? So if we’ve got some data that represents the temperature versus the number of ice creams sold, then we kind of have lots of dots. And so each one of those dots, we might hypothesize, is based on this formula, “y = a1x1 + a2x2”, all right? And so basically (this is our y, this is our x) there’s lots of values of y, so we can stick a little “i” here, and there’s lots of values of x, so we can stick a little “i” here, okay? So the way we kind of do that is a lot like numpy indexing, or PyTorch indexing, but rather than putting things in square brackets, we put them down here in the subscript of our equation. Ok? So this is now saying there’s actually lots of these different y(i)s, based on lots of different x(i1) and x(i2), ok? But notice there’s still only one of each of these. So these things here are called the “coefficients”, or the “parameters”. So this is our linear equation, and we’re still going to say that every x(i2) is equal to 1, ok? Why did I do it that way?
Because I want to do linear algebra. Why do I want to do linear algebra? Well, one reason is because Rachel teaches the world’s best linear algebra course. So if you’re interested, check out ‘Computational Linear Algebra for Coders’. So it’s a good opportunity for me to throw in a pitch for this free course, from which we make no money, but never mind. But more to the point right now, it’s going to make life much easier, right? Because I hate writing loops. I hate writing code, right? I just want the computer to do everything for me. Anytime you see these little “i” subscripts, that sounds like you’re going to have to do loops and all kinds of stuff. But what you might remember from school is that when you’ve got two things being multiplied together, and another two things being multiplied together, and then they get added up, that’s called a “dot product”, and then if you do that for lots and lots of different numbers “i”, then that’s called a “matrix product”. So, in fact, this whole thing can be written like this. Rather than lots of different y(i)s, we can say there’s one vector, called “y”, which is equal to one matrix called “X” times one vector called “a”. Now at this point, I know a lot of you don’t remember that. So that’s fine, we have a picture to show you. I didn’t know who created this. So now I do: somebody called André Staltz created this fantastic thing called “matrixmultiplication.xyz”, and here we have a matrix by a vector, and we’re going to do a “matrix vector product”. Go! Pshoo… That times that times that, plus plus plus. That times that times that, plus plus plus. That times that times that, plus plus plus. Finished! That is what matrix vector multiplication does. In other words, it’s just that. Except his version is much less messy. Okay. So this is actually an excellent spot to have a little break and find out what questions we have coming through from our students. What are they asking, Rachel? Q.
When generating new image data set, how do you know how many images are enough? What are ways to measure “enough”? Yeah, that’s a great question. So, another possible problem you have is you don’t have enough data. How do you know if you don’t have enough data? Because you found a good learning rate, (because if you make it higher, then it goes off into massive losses, if you make it lower it goes really slowly)… so you’ve got a good learning rate, and then you train for such a long time that your error starts getting worse, Okay? So, you know that you’ve trained for long enough. And you’re still not happy with the accuracy. It’s not good enough for the, you know, the ‘Teddy-bear cuddling level’ of safety you want. So, if that happens there’s a number of things you can do and we’ll learn about some of them during, er, pretty much all of them, during this course, but one of the easiest ones is: Get more data. If you get more data, then you can train for longer, get a higher accuracy, lower error rate – without overfitting. Unfortunately, there’s no shortcut. I wish there was. I wish there was some way to know ahead of time, how much data you need. But I will say this; most of the time you need less data than you think. So organizations very commonly spend too much time gathering data getting more data than it turned out they actually needed. So get a small amount first and see how you go. Q. What do you do if you have unbalanced classes such as 200 Grizzlies and 50 Teddies? A. Uh, nothing. Try it. It works. A lot of people ask this question about how do I deal with unbalanced data? I’ve done lots of analysis with unbalanced data over the last couple of years and I just can’t make it not work. It always works. So there’s a there’s actually a paper, that said, like, if you want to get it slightly better then the best thing to do is to take that uncommon class and just make a few copies of it (that’s called over sampling). 
But, like, I haven’t found a situation in practice where I needed to do that. I’ve found it always just works fine, for me. Q. Once you unfreeze and retrain with one cycle again, if your training loss is still lower than your validation loss (likely underfitting), do you retrain it unfrozen again (which will technically be more than one cycle) or do you redo everything with a longer epoch for the cycle? Hey, you guys asked me that last week! My answer’s still the same: I don’t know. I would find, if you do another cycle, then it’ll kind of maybe generalize a little bit better if you start again, do twice as long, it’s kind of annoying; Depends how patient you are. It won’t make much difference, you know? For me personally, I normally just train a few more cycles. But, yeah, it doesn’t make much difference.
Most of the time. Q. So showing the code sample where you were creating a CNN with resnet34 for the ‘Grizzly-Teddy’ classifier, it says this requires resnet34, which I find surprising. I had assumed that the model created by .save(), which is about 85 megabytes on disk, would be able to run without also needing a copy of resnet34. A. Yeah, I understand. We’re going to be learning all about this shortly. There’s no ‘copy’ of resnet34. resnet34 is actually what we call an ‘architecture’, and we’re going to be learning a lot about this. It’s a functional form. Just like this is a ‘linear functional form’: it doesn’t take up any room, it doesn’t contain anything, it’s just a function. resnet34 is just a function. It doesn’t contain anything, it doesn’t store anything. I think the confusion here is that we often use a ‘pre-trained’ neural net that’s been learned on ImageNet. In this case, we don’t need to use a pre-trained neural net. And actually, to entirely avoid that even getting created, you can pass “pretrained=False”, and that’ll ensure that nothing even gets loaded, which will save you another 0.2 seconds, I guess. So, yeah. But we’ll be learning a lot more about this, so don’t worry if this is a bit unclear. But the basic idea is that this thing here is basically the equivalent of saying “is it a line?” or “is it a quadratic?” or “is it a reciprocal?” This is just a function, this is the “resnet34 function”. It’s a mathematical function. It doesn’t take any storage, it doesn’t have any numbers, it doesn’t have to be loaded, as opposed to a pre-trained model. And so that’s why, when we did it at inference time, the thing that took space is this bit, which is where we load our parameters, which is basically saying, as we’re about to find out, what are the values of “a” and “b”? We have to store those numbers. But for resnet34, you don’t just store two numbers, you store a few million. Or a few tens of millions of numbers.
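The distinction between an architecture and its stored parameters can be sketched with the two-parameter line from earlier. This is only an analogy in plain Python: the function `linear`, the parameter values 3.0 and 2.0, and the use of JSON as the storage format are assumptions for illustration, and it says nothing about how fastai or PyTorch actually serialize a model.

```python
import json

def linear(x, params):
    """The 'architecture': a functional form that holds no numbers of its own."""
    a, b = params
    return a * x + b

# Hypothetical trained parameters; these numbers are the part that needs storing.
trained = [3.0, 2.0]

saved = json.dumps(trained)      # what would be written to disk after training
restored = json.loads(saved)     # what gets loaded once at inference time

# The same function plus the restored numbers gives back the trained model.
print(linear(1.5, restored))
```

The function definition takes no storage worth mentioning; the saved file grows with the number of parameters, which is two here and tens of millions for a large network.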
So, why did we do all this? Well, it’s because I wanted to be able to write it out like this. And the nice thing, if we can write it out like this, is that we can now do that in PyTorch with no loops, in a single line of code, and it’s also going to run faster. PyTorch really doesn’t like loops, right? It really wants you to send it a whole equation to do all at once, which means you really want to try and specify things in these kind of linear algebra ways. So let’s go and take a look, because what we’re going to try and do then is we’re going to try and take this (we’re going to call it an ‘architecture’, it’s like the world’s tiniest neural network, it’s got two parameters, you know, a1 and a2) and we’re going to try and fit this architecture to some data. So let’s jump into a notebook and generate some dots, right, and see if we can get it to fit a line somehow. And the ‘somehow’ is going to be using something called SGD. What is SGD? Well, there’s two types of SGD. The first one is where I said, in Lesson 1, “Hey, you should all try building these models and try and come up with something cool.” And you guys all experimented and found really good stuff. So that’s where the ‘S’ would be Student. That would be Student Gradient Descent. So that’s version one of SGD. Version two of SGD, which is what I’m going to talk about today, is where we’re going to have a computer try lots of things and try and come up with a really good function, and that will be called ‘Stochastic Gradient Descent’. The other one that you hear a lot on Twitter is ‘Stochastic Grad-student Descent’. So, we’re going to jump into “Lesson 2: SGD”. And so we’re going to kind of go bottom up rather than top down. We’re going to create the simplest possible model we can, which is going to be a linear model, and the first thing that we need is some data. And so we’re going to generate some data.
The data we’re going to generate looks like this. So this might represent temperature and this might represent the number of ice creams we sell, or something like that, but we’re just going to create some synthetic data that we know is following a line. And so, as we build this, we’re actually going to learn a little bit about PyTorch as well. So basically the way we’re going to generate this data is by creating some coefficients: a1 will be 3 and a2 will be 2. And we’re going to create, like we’ve looked at before, basically a column of numbers for our x’s, and a whole bunch of ones.
And then we’re going to do this: x@a. What is “x@a”? x@a, in Python, means a matrix product between x and a. It actually is even more general than that. It can be a vector-vector product, a matrix-vector product, a vector-matrix product or a matrix-matrix product. And then actually in PyTorch, specifically, it can mean even more general things where we get into higher rank tensors, which we will learn all about very soon. Right? But this is basically the key thing that’s going to go on in all of our deep learning. The vast majority of the time our computers are going to be basically doing this: multiplying numbers together and adding them up, which is a surprisingly useful thing to do. Ok, so we basically are going to generate some data by creating a line and then we’re going to add some random numbers to it. But let’s go back and see how we created “x” and “a”. So I mentioned that, you know, we’ve basically got these two coefficients, 3 and 2, and you’ll see that we’ve wrapped it in this function called “tensor()”. You might have heard this word ‘tensor’ before. Who’s heard the word tensor before? About 2/3 of you. Okay, so it’s one of these words that sounds scary and apparently, if you’re a physicist, it actually is scary, but in the world of deep learning it’s actually not scary at all. Tensor means ‘array’. Okay? It means array. So specifically it’s an array of a regular shape, right? So it’s not an array where row 1 has two things and row 3 has three things and row 4 has one thing, what you call a ‘jagged’ array. That’s not a tensor. A tensor is any array which has a ‘rectangular’ or ‘cube’ or whatever shape, you know, a shape where every row is the same length, and then every column is the same length. So a 4×3 matrix would be a tensor. A vector of length 4 would be a tensor. A 3D array of size 3 x 4 x 6 would be a tensor. That’s all a tensor is. Okay? And so we have these all the time.
For example, an image is a three dimensional tensor. It’s got number of rows by number of columns by number of channels; normally red green blue. So for example, a kind of a VGA texture would be 640 by 480 by 3 or actually… we do things backwards, so when people talk about images they normally go width by height but when we talk mathematically we always go a number of rows by number of columns So it’d actually be 480 by 640 by 3 That will catch you out We don’t say ‘dimensions’ though, with tensors, we use one of two words: We either say ‘rank’ or or ‘axes’. ‘Rank’ specifically means how many axes are there? How many dimensions are there? So an image is generally a “rank 3 tensor”. So what we’ve created here is a “rank 1 tensor” or also known as a ‘vector’, right? But like, in math people come up with slightly different words or actually no; they come up with very different words for slightly different concepts. Why is a one dimensional array a ‘vector’ and a two dimensional array’s a ‘matrix’ and then a three dimensional array… Does that even have a name? Not really. It doesn’t have a name. Like, it doesn’t make any sense. We also you know with computers we try to have some simple consistent naming conventions. They’re all called ‘tensors’. Rank 1 tensor, rank 2 tensor, rank 3 tensor. You can certainly have a rank 4 tensor If you’ve got 64 images then that would be a rank 4 tensor of 64 x 480 x 640 x 3, for example. So tensors are very simple. They just mean arrays. And so, in PyTorch, you say tensor and you pass in some numbers and you get back, in this case, just a list. I got back a ‘vector’, okay? So this, then, represents our coefficients: the slope and the intercept of our line. And so, because remember, we’re not actually going to have a special case of “ax + b” instead, we’re going to say there’s always this second x value which is always 1 (you can see it here, always 1), which allows us just to do a simple ‘matrix vector product’. 
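As a rough sketch of the ideas above (rank, rows-by-columns ordering, and the matrix-vector product with a column of ones), something like the following should work in plain PyTorch. The specific numbers in `x` are made up for illustration, and the batch holds 4 images rather than 64 only to keep memory use small.

```python
import torch

a = torch.tensor([3., 2.])            # rank 1 tensor (a vector): slope and intercept
x = torch.tensor([[0.5, 1.],
                  [-0.2, 1.]])        # rank 2 tensor (a matrix); second column is all ones
image = torch.zeros(480, 640, 3)      # rows x columns x channels
batch = torch.zeros(4, 480, 640, 3)   # a small batch of such images

# .dim() reports the rank, i.e. the number of axes.
print(a.dim(), x.dim(), image.dim(), batch.dim())

# Matrix-vector product: each row gives 3 * x_value + 2 * 1.
y = x @ a
print(y)
```

Because the second column of `x` is always 1, the intercept needs no special case: it is just another coefficient in the product.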
Ok, so that’s ‘a’. And then we wanted to generate this ‘x’ array of data, where we’re going to put random numbers in the first column and a whole bunch of ones in the second column. So to do that, we basically say to PyTorch: “create a rank 2 tensor…” Actually no, sorry, let’s say that again. We say to PyTorch: “we want to create a tensor of ‘n x 2’”. So since we passed in a total of 2 things, we get a rank 2 tensor. The number of rows will be ‘n’ and the number of columns will be 2. And in there, every single thing in it will be a 1.
That’s what torch.ones() means. And then, this is really important, you can index into that, just like you can index into a list in Python, but you can put a colon (:) anywhere. And a colon means – “every single value on that axis”. Or “every single value on that dimension”. So this here means every single row. And then this here means column 0. So this is every row of column 0, I want you to grab a uniform, random number. And here’s another very important concept: in PyTorch, anytime you’ve got a function that ends in an underscore, it means “don’t return to me that uniform random number but replace whatever this is being called on, with the result of this function”. So this takes column 0 and replaces it with a uniform random number between -1 and 1. So there’s a lot to unpack there, right? But the good news is those two lines of code, plus this one (which we’re coming to), cover 95% of what you need to know about PyTorch. How to create an array, how to change things in an array, and how to do matrix operations on an array, okay? So there’s a lot to unpack but these small number of concepts are incredibly powerful. So I can now print out the first 5 rows, okay? So “:5” is standard python ‘slicing’ syntax, to say ‘the first five rows’. So here are the first five rows, two columns looking like my random numbers, and my ones. So now I can do a matrix product of that x by my a, add in some random numbers to add a bit of noise, and then I can do a scatter plot. And I’m not really interested in my scatter plot in this column of ones, right? There just there to make my linear function more convenient, so I’m just going to plot my 0-index column against my “y”s and there it is. “plt” is what we universally use to refer to the plotting library ‘matplotlib’. And that’s what most people use for most of their plotting in python. In scientific python we use matplotlib. 
It’s certainly a library, you’ll want to get familiar with because being able to plot things is really important. There are lots of other plotting packages. Lots of them, the other packages, are better at certain things than matplotlib, but like matplotlib can do everything reasonably well. Sometimes it’s a little awkward, but you know, for me, I do pretty much everything in matplotlib because there’s really nothing it can’t do (even though some libraries can do other things a little bit better or a little bit prettier). But it’s really powerful, so once you know matplotlib, you can do everything. So here I’m asking matplotlib to give me a scatterplot with my x’s against my y’s and there it is, okay? So this is my my dummy data representing like, you know, of temperature and ice cream sales So ,now what we’re going to do is we’re going to pretend we were given this data and we don’t know that the values of our coefficients are 3 and 2. So we’re going to pretend that we never knew that we have to figure them out, okay? So how would we figure them out? How would we draw a line to fit to this data? And why would that even be interesting? Well, we’re going to look at more about why it’s interesting in just a moment, but the basic idea is this: if we can find (this is going to be kind of perhaps, really surprising) but if we can find a way to find those two parameters to fit that line to those (how many points were there? – ‘n’ was 100) if we can find a way to fit that line to those 100 points, we can also fit these arbitrary functions that convert from pixel values to probabilities. It’ll turn out that there’s techniques that we that we’re going to learn to find these two numbers, works equally well for the 50 million numbers in resnet34. So we’re actually going to use an almost identical approach. 
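The data generation walked through above can be sketched in a few lines. This is a reconstruction from the description rather than a copy of the notebook: the seed is added here for repeatability, and using `torch.rand(n)` (uniform noise in [0, 1)) for the "random numbers" is an assumption.

```python
import torch

torch.manual_seed(0)          # assumption: seed added so reruns give the same data

n = 100
x = torch.ones(n, 2)          # n x 2 tensor where every element is 1
x[:, 0].uniform_(-1., 1.)     # trailing underscore: fill column 0 in place with uniform randoms
a = torch.tensor([3., 2.])    # the "true" slope and intercept
y = x @ a + torch.rand(n)     # points on the line, plus some noise (assumed uniform)

print(x[:5])                  # first five rows: random column, then the column of ones
```

With matplotlib imported as `plt`, `plt.scatter(x[:, 0], y)` would then draw the scatter plot of the first column against `y`.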
So this is the bit that I’ve found, in previous classes, people have the most trouble digesting. I often find even after week 4 or week 5, people will come up to me and say “I don’t get it, how do we actually train these models?” – and I’ll say “It’s SGD. It’s that thing we saw in the notebook with the 2 numbers”. It’s like “Yeah, but we’re fitting a neural network”. So I say “I know, and we can’t print the 50 million numbers anymore, but it is literally, identically, doing the same thing”. And the reason this is hard to digest is that the human brain has a lot of trouble conceptualizing what an equation with 50 million numbers looks like and can do.
So you just kind of, for now, will have to take my word for it. It can do things like recognize Teddy Bears. And all these functions turn out to be very powerful. Now we’re going to learn a little bit more in just a moment, about how to make them extra powerful, but for now, the thing we’re going to learn to fit these two numbers is the same thing that we’ve just been using to fit 50 million numbers. Okay, so we want to find what PyTorch calls ‘parameters’. Or in statistics, you’ll often hear called ‘coefficients’. These values a1 and a2. We want to find these parameters such that the line that they create minimizes the error between that line and the points. So in other words, you know, if we created, you know, if the a1 and a2 we came up with resulted in this line, then we’d look and we’d see like how far away is that line from each point? I would say “Oh, that’s quite a long way”. And so maybe there was some other a1 or a2 which resulted in this line and they would say, like, “oh, how far away is each of those points”? And then eventually we come up with Blue We come up with this line and it’s like, “Oh, in this case, each of those is actually very close”. All right? So you can see how in each case we can say how far away is the line at each spot away from its point and then we can take the average of all those and that’s called the ‘loss’. And that is the value of our loss, right? So you need some mathematical function that can basically say how far away is this line from those points? For this kind of problem, which is called a ‘regression’ problem ,a problem where your dependent variable Is ‘continuous’, so rather than being “Grizzlies” or “Teddies”, it’s like some number between -1 and 6, this is called a regression problem. And for regression the most common loss function is called ‘mean squared error’, which pretty much everybody calls ‘MSE’. You may also see RMSE just ‘Root Mean Squared Error’. 
And so the mean squared error is a loss, it’s the difference between some prediction that you’ve made, okay, which you know is like the value of the line, and the actual number of ice cream sales. And so, in the mathematics of this, people normally refer to the actual, they normally call it “y” and the prediction, they normally call it “y hat”, as in they they write it like that. And so what I try to do like when we’re writing something like, you know, mean squared error equation, there’s no point writing ice cream here and temperature here because we wanted to apply it to anything. So we tend to use these like mathematical placeholders. So the value of mean squared error is simply the difference between those two, squared! All right? And then we can take the mean. Because, remember, that is actually a ‘vector’ or what we now call it, a “rank 1 tensor” and that is actually a rank 1 tensor, so it’s the value of the number of ice cream sales at each place. And so when we subtract one vector from another vector, (and we’re going to be learning a lot more about this), but it does something called element-wise arithmetic in other words It subtracts each each one from each other, and so we end up with a vector of differences, and then if we take the square of that, it squares everything in that vector. And so then we can take the mean of that to find the average square of the differences between the actuals and the predictions. So, if you’re more comfortable with mathematical notation what we just wrote there was the “sum of…” (which way round did we do it?) y hat minus… y… squared, over… n”, right? So that equation is the same as that equation. So one of the things I’ll note here is, I don’t think this is, you know, more complicated or unwieldy than this, right? But the benefit of this is you can experiment with it like once you’ve defined it, you can use it you can send things into it and get stuff out of it and see how it works, alright? 
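Written as code, the loss described above (the mean over all points of (y hat minus y) squared) is a one-line function. This is a minimal sketch; the three-element tensors are made-up values chosen so the arithmetic is easy to check by hand.

```python
import torch

def mse(y_hat, y):
    # Element-wise difference, element-wise square, then the mean of the result.
    return ((y_hat - y) ** 2).mean()

y = torch.tensor([1., 2., 3.])       # actuals
y_hat = torch.tensor([1., 2., 5.])   # predictions: only the last one is off, by 2
loss = mse(y_hat, y)
print(loss)                          # squared differences are 0, 0, 4, so the mean is 4/3
```

Because it is an ordinary function, you can send different tensors into it and see how the loss responds.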
So, for me, most of the time I prefer to explain things with code rather than with math. Right? Because I can actually…they’re the same, they’re doing, in this case at least, in all the cases we’ll look at, they’re exactly the same, they’re just different notations for the same thing. But one of the notations is executable, it’s something that you can experiment with, and one of them is abstract, so that’s why I’m generally going to show code. So the good news is, if you’re a coder, with not much of a math background, actually, you do have a math background because code is math. Right? Now if you’ve got more of a math background and less of a code background, then actually a lot of the stuff that you learned from math is going to translate very directly into code, and now you can start to experiment really with your math. Okay, so this is a ‘loss function’. This is something that tells us how good our line is. So now, we have to kind of come up with: “What is the line that fits through here?” Remember, we don’t know (we’re going to pretend we don’t know) so what you actually have to do is you have to guess. You actually have to come up with a guess: what are the values of a1 and a2? So let’s say we guess that a1 and a2 are both 1. So this is our tensor. ‘a’ is (1.0, 1.0), right? So here is how we create that tensor. And I wanted to write it this way because you’ll see this all the time. Like, written out it should be “1.0…” (sorry…it should be -1)… Written out fully it would be “-1.0… 1.0”. Like that’s written out fully. We can’t write it without the point, because that’s now an ‘int’, not a floating point. So that’s going to “spit the dummy” if you try to do calculations with that in neural nets, all right? I’m lazy, I’m far too lazy to type “.0” every time. python knows perfectly well that if you add a dot next to any of these numbers, then the whole thing is now floats, right? 
So that’s why you’ll often see it written this way, particularly by lazy people like me. Okay, so ‘a’ is a tensor. You can see it’s floating-point – you see, even PyTorch is lazy, they just put a “.”, they don’t bother with a 0, right? But if you want to actually see exactly what it is, you can write “.type()” and you can see it’s a ‘float’ tensor, okay? And so now we can calculate our predictions with this, like, random guess: x@a (the matrix product of x and a). And we can now calculate the mean squared error of our predictions and their actuals, and that’s our loss. Okay, so for this regression, our loss is 8.9.
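Here is a small sketch of that first guess. The data is regenerated under the same assumptions as before (seeded, uniform noise), so the loss it prints will not necessarily match the 8.9 quoted in the lecture.

```python
import torch

torch.manual_seed(0)                       # assumption: seeded stand-in for the lesson's data
n = 100
x = torch.ones(n, 2)
x[:, 0].uniform_(-1., 1.)
y = x @ torch.tensor([3., 2.]) + torch.rand(n)

a = torch.tensor([-1., 1])                 # one float literal is enough to make both floats
kind = a.type()
print(kind)                                # 'torch.FloatTensor'

y_hat = x @ a                              # predictions from the guess
loss = ((y_hat - y) ** 2).mean()           # mean squared error against the actuals
print(loss)
```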
And so we can now plot a scatter plot of x against y and we can plot the scatter plot of x against y-hat (our predictions), and there they are. Okay, so this is the (1, -1) line… sorry, the (-1, 1) line, and here’s the actuals. So that’s not great, which is not surprising, it’s just a guess. So SGD, or “gradient descent” more generally (and anybody who’s done any engineering or probably computer science at school will have done plenty of this, like Newton’s method, all the stuff that you did at university; if you didn’t, don’t worry, we’re going to learn it now)… It’s basically about taking this guess and trying to make it a little bit better. So, how do we make it a little bit better? Well, there’s only two numbers, right, and the two numbers are the intercept of that orange line and the gradient of that orange line. So what we’re going to do with gradient descent is we’re going to simply say: “What if we change those two numbers a little bit? What if we made the intercept a little bit higher or a little bit lower? What if we made the gradient a little bit more positive or a little bit more negative?” So there’s like four possibilities. And then we can just calculate the loss for each of those four possibilities and see what worked. Did lifting it up or down make it better? Did tilting it more positive or more negative make it better? And then all we do is we say, okay, well, whichever one of those made it better, that’s what we’re going to do. And that’s it. Right? But here’s the cool thing for those of you that remember calculus – you don’t actually have to move it up and down and round about, you can actually calculate the ‘derivative’.
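The "try the four possibilities" idea can be sketched directly, before any calculus comes in. The three data points, the nudge size `eps`, and the candidate names are all made up for illustration; the points lie exactly on the line with slope 3 and intercept 2.

```python
import torch

def mse(y_hat, y):
    return ((y_hat - y) ** 2).mean()

x = torch.tensor([[-1., 1.], [0., 1.], [1., 1.]])
y = x @ torch.tensor([3., 2.])        # noise-free line: slope 3, intercept 2
a = torch.tensor([-1., 1.])           # current guess: slope -1, intercept 1
eps = 0.1                             # hypothetical nudge size

# Nudge each of the two numbers up and down, one at a time.
candidates = {
    "slope up": a + torch.tensor([eps, 0.]),
    "slope down": a - torch.tensor([eps, 0.]),
    "intercept up": a + torch.tensor([0., eps]),
    "intercept down": a - torch.tensor([0., eps]),
}

base = mse(x @ a, y).item()
losses = {name: mse(x @ c, y).item() for name, c in candidates.items()}
best = min(losses, key=losses.get)    # whichever nudge lowered the loss the most
print(base, losses, best)
```

Here both "up" nudges reduce the loss, since the guess is below the true slope and the true intercept. The derivative gives the same information without evaluating the loss four times.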
The derivative is the thing that tells you… Would moving it up or down make it better or would rotating it this way or that way make it better? Okay, so the good news is if you didn’t do calculus or you don’t remember calculus, I just told you everything you need to know about it, right? Which is that it tells you how changing one thing changes the function, right? That’s what the derivative is. Kind of, not quite strictly speaking right, close enough, also called the ‘gradient’. Okay, so the gradient or the derivative, tells you how changing a1, up or down, would change our MSE, how changing a2 up or down will change your MSE and this does it more quickly. Does it more quickly than actually moving it up and down? Okay? So, in school, unfortunately, they forced us to sit there and calculate these derivatives by hand. We have computers! Computers can do that for us. We are NOT going to calculate them by hand. Instead, we’re going to call “.grad”. On our computer that will calculate the gradient for us. So here’s what we’re going to do – we’re going to create a loop, we’re going to loop through 100 times and we’re going to call a function called .update(). That function is going to calculate y-hat (our prediction), It is going to calculate loss (our mean squared error). From time to time it will print that out so we can see how we’re going. It will then calculate the gradient and in PyTorch calculating the gradient is done by using a method called .backward(). So you’ll see something really interesting which is, mean squared error was just a simple standard mathematical function, PyTorch, for us, keeps track of how it was calculated and lets us calculate the derivatives. So if you do a mathematical operation on a tensor in PyTorch, you can call .backward() to calculate the derivative. What happens to that derivative? That gets stuck inside an attribute called .grad. So I’m going to take my coefficients ‘a’ and I am going to subtract from them my gradient. 
And this underscore here… Why? Because that’s going to do it in place. So it’s going to actually update those coefficients a to subtract the gradients from them, right? So, why do we subtract? Well because the gradient tells us if I move the whole thing downwards, the loss goes up. If I move the whole thing upwards, the loss goes down. So I want to like do the opposite of the thing that makes it go up, right? So because our loss, we want to loss to be small. So that’s why we have to subtract. And then there’s something here called “lr”. “lr” is our learning rate. And so literally all it is is the thing that we multiply by the gradient. Why is there any ‘lr’ at all? Let me show you why. Let’s take a really simple example. A quadratic. All right, and let’s say your algorithm’s job was to find where that quadratic was at its lowest point. And so, well, how could it do this? Well, just like what we’re doing now, the starting point would just be to pick some x value at random. And then, pop up here to find out what the value of y is. Okay? That’s its starting point. And so then it can calculate the gradient and the gradient is simply the slope. Right? It tells you moving in which direction is going to make you go down. And so the gradient tells you you have to go this way. So, if the gradient was really big, you might jump this way a very long way. So you might jump all the way over to… Here. Maybe even here. Right? And so if you jumped over to there… Then that’s actually not going to be very helpful because then, you see, well where does that take us to? Oh! It’s now worse. Right? We jumped too far. So we don’t want to jump too far, so maybe we should just jump a little bit. Maybe to here. And the good news is that is actually a little bit closer. And so then we’ll just do another little jump; see what the gradient is and do another little jump. That takes us to here. And another little jump. That takes us to here. Here. Yeah, right. 
So in other words, we find our gradient to tell us kind of what direction to go and, like, do we have to go a long way or not too far? But then we multiply it by some number less than one, so we don’t jump too far. And so, hopefully at this point, this might be reminding you of something, which is what happened when our learning rate was too high. So do you see why that happened now? Our learning rate being too high meant that we jumped all the way past the right answer, further than we started with, and it got worse and worse and worse. So that’s what a learning rate that’s too high does. On the other hand, if our learning rate is too low then you just take tiny little steps, and so eventually you’re going to get there, but you’re doing lots and lots of calculations along the way. So you really want to find something where it’s kind of either like this, or maybe it’s kind of a little bit backwards and forwards, maybe it’s kind of like this… Something like that, you know, you want something that kind of gets in there quickly, but not so quickly it jumps out and diverges, and not so slowly that it takes lots of steps. So that’s why we need a good learning rate. And so that’s all it does. So if you look inside the source code of any deep learning library, you will find this. You will find something that says coefficients.subtract(learning rate * gradient). And we’ll learn about some minor…not minor… We’ll learn about some easy but important optimizations we can do to make this go faster. But that’s basically it. There’s a couple of other little minor issues that we don’t need to talk about now, one involving zeroing out the gradients and another involving making sure that you turn gradient calculation off when you do the SGD update. If you’re interested we can discuss them on the forum, or you can do our “Introduction to Machine Learning” course, which covers all the mechanics of this in more detail. But this is the basic idea.
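The quadratic example above is easy to try in plain Python. This sketch assumes the specific quadratic f(x) = x², whose gradient is 2x and whose lowest point is at 0; the function name `descend`, the starting point, and the three learning rates are illustrative choices.

```python
def descend(lr, steps=20, x=3.0):
    """Gradient descent on f(x) = x**2, whose gradient is 2*x."""
    for _ in range(steps):
        x = x - lr * 2 * x      # step against the gradient, scaled by the learning rate
    return x

good = descend(lr=0.1)          # each step shrinks x by a factor of 0.8
slow = descend(lr=0.001)        # each step shrinks x by only 0.2%
diverged = descend(lr=1.1)      # each step overshoots: x flips sign and grows by 20%

print(good, slow, diverged)
```

After 20 steps the first run is close to the minimum, the second has barely moved from 3, and the third is further away than where it started.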
So if we run update() 100 times, printing out the loss from time to time, you can see it starts at 8.9, and it goes down down down down down down down. And so we can then print out scatter plots and there it is. That’s it. Believe it or not, that’s gradient descent. So we just need to start with a function that’s a bit more complex than x@a. But as long as we have a function that can represent things like ‘is this a teddy bear?’, we now have a way to fit it. Okay? So let’s now take a look at this as an animation. This is one of the nice things that you can do with matplotlib: you can take any plot and turn it into an animation. And so you can now actually see it updating each step. So let’s see what we did here. We simply said, as before, create a scatter plot, but then rather than having a loop, we used matplotlib’s FuncAnimation() to call this function 100 times. And this function just called that update() that we created earlier and then updated the ‘y’ data in our line. And so it did that 100 times, waiting 20 milliseconds after each one, and there it is. Right? So you might think that, like, visualizing your algorithms with animations is some amazing and complex thing to do, but actually now, you know, it’s 1 2 3 4 5 6 7 8 9 10 11 lines of code. Okay? So I think that is pretty damn cool. So that is SGD visualized. And so we can’t visualize as conveniently what updating 50 million parameters in a resnet34 looks like, but it’s basically doing the same thing, okay? And so studying these simple versions is actually a great way to get an intuition. So you should try running this notebook with a really big learning rate, with a really small learning rate, and see what this animation looks like, right, and try to get a feel for it. Maybe you can even try a 3d plot. I haven’t tried that yet, but I’m sure it would work fine too.
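Putting the pieces together, the loop described above might look like the following. It is a sketch reconstructed from the description, not the notebook itself: the data is a seeded stand-in with assumed uniform noise, `requires_grad=True` is used here to ask PyTorch to track gradients for `a`, and the learning rate of 0.1 and the `losses` list are illustrative choices.

```python
import torch

torch.manual_seed(0)                        # assumption: seeded stand-in data
n = 100
x = torch.ones(n, 2)
x[:, 0].uniform_(-1., 1.)
y = x @ torch.tensor([3., 2.]) + torch.rand(n)

def mse(y_hat, y):
    return ((y_hat - y) ** 2).mean()

a = torch.tensor([-1., 1.], requires_grad=True)   # our guess; PyTorch tracks its gradient
lr = 1e-1
losses = []

def update(t):
    y_hat = x @ a                           # prediction
    loss = mse(y_hat, y)                    # mean squared error
    losses.append(loss.item())
    if t % 10 == 0:
        print(loss.item())                  # print from time to time
    loss.backward()                         # fills a.grad with d(loss)/d(a)
    with torch.no_grad():                   # the update itself should not be tracked
        a.sub_(lr * a.grad)                 # in place: a = a - lr * gradient
        a.grad.zero_()                      # reset, since backward() accumulates

for t in range(100):
    update(t)

print(a)
```

With this particular noise the fitted intercept should land nearer 2.5 than 2, because `torch.rand` has a mean of 0.5 that the line absorbs.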
So the only difference between Stochastic Gradient Descent and this is something called ‘minibatches’. You’ll see what we did here was we calculated the value of the loss on the whole data set on every iteration. But if your data set is one and a half million images in ImageNet, that’s going to be really slow, right? Just to do a single update of your parameters you’ve got to calculate the loss on one and a half million images. You wouldn’t want to do that. So what we do is we grab 64 images or so at a time, at random, and we calculate the loss on those 64 images, and we update our weights. And then we grab another 64 random images, and we update the weights. So in other words, the loop basically looks exactly the same, but at this point here it’d basically be y, square bracket, and some random indexes here… well actually, sorry, it would be there, right, so some random indexes on our x and some random indexes on our y, to do a minibatch at a time, and that would be the basic difference. And so, once you add that, you know, grab a random few points each time, those random few points are called your minibatch, and that approach is called SGD, or Stochastic Gradient Descent. Okay, so there’s quite a bit of vocab we’ve just covered, right? So let’s just remind ourselves: the ‘learning rate’ is a thing that we multiply our gradient by, to decide how much to update the weights by. An ‘epoch’ is one complete run through all of our data points (all of our images). So for the non-stochastic gradient descent we just did, every single loop, we did the entire data set. But if you’ve got a data set with a thousand images and your mini-batch size is 100, then it would take you ten iterations to see every image once, so that would be one ‘epoch’.
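A minibatch version of the same loop can be sketched as follows, using the thousand-points, batch-of-100 arithmetic from the epoch example. Shuffling once per epoch with `torch.randperm` and walking through the shuffled indexes in chunks is one common way to pick "random indexes"; it is an assumption here, as are the seed, the noise, and the number of epochs.

```python
import torch

torch.manual_seed(0)                        # assumption: seeded stand-in data
n, bs = 1000, 100                           # 1000 points, minibatch size 100
x = torch.ones(n, 2)
x[:, 0].uniform_(-1., 1.)
y = x @ torch.tensor([3., 2.]) + torch.rand(n)

a = torch.tensor([-1., 1.], requires_grad=True)
lr = 1e-1
iterations_per_epoch = n // bs              # ten iterations see every point once

for epoch in range(20):
    perm = torch.randperm(n)                # a fresh random order each epoch
    for i in range(iterations_per_epoch):
        idx = perm[i * bs:(i + 1) * bs]     # random indexes for this minibatch
        loss = ((x[idx] @ a - y[idx]) ** 2).mean()   # loss on the minibatch only
        loss.backward()
        with torch.no_grad():
            a.sub_(lr * a.grad)
            a.grad.zero_()

print(iterations_per_epoch, a)
```

Each update now looks at 100 points instead of all 1000, so it is cheaper, at the cost of a slightly noisier gradient.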
Epochs are important because if you do lots of epochs, then you’re looking at your images lots of times, and so every time you see an image there’s a bigger chance of overfitting. So we generally don’t want to do too many epochs. A ‘minibatch’ is just a random bunch of points that you use to update your weights. SGD is just gradient descent using minibatches. Architecture and model kind of mean the same thing. In this case, our architecture is y=Xa All right? The architecture is the mathematical function that you’re fitting the parameters to. And we’re going to learn either today or next week, what the mathematical function of things like resnet34, actually is. But it’s basically pretty much what you’ve just seen. It’s a bunch of matrix products. ‘Parameters’, also known as coefficients, also known as weights, are the numbers that you’re updating. And then ‘loss function’ is the thing that’s telling you how far away or how close you are to the correct answer. Any questions? All right. So, these models, these predictors, these Teddy Bear Classifiers, are functions that take pixel values and return probability. They start with some functional form, like y=Xa, and they fit the parameters, ‘a’, using SGD, to try and do the best to calculate your predictions. So far we’ve learned how to do regression, which is a single number. Next week, we’ll learn how to do the same thing for classification where we have multiple numbers. But it’s basically the same. In the process, we had to do some math. We had to do some linear algebra and we had to do some calculus. And a lot of people get a bit scared at that point and tell us “I am NOT a math person”. If that is you, that’s totally okay, but you’re wrong. You are a math person. In fact, it turns out that when in the actual academic research around this, there are not math people and non-math people. It turns out to be entirely a result of culture and expectations. 
So you should check out Rachel’s talk “There’s No Such Thing As Not a Math Person”, where she will introduce you to some of that academic research. And so if you think of yourself as not a math person you should watch this so that you learn that you’re wrong, that your thoughts are actually there because somebody has told you ‘you’re not a math person’, but there’s actually no academic research to suggest that there is such a thing. In fact, there are some cultures, like Romania and China, where the ‘not a math person’ concept never even appeared. It’s almost unheard of in some cultures for somebody to say “I’m not a math person” because that just never entered that cultural identity. So, don’t freak out if words like ‘derivative’ and ‘gradient’ and ‘matrix product’ are things that you’re kind of scared of, it’s something you can learn. It’s something you’ll be okay with. Okay? So the last thing that we’re going to close with today… Oh, I just got a message from Simon Willison. Ah! Simon’s telling me he’s actually not that special, lots of people won medals. So, That’s the worst part about Simon. Not only is he really smart he’s also really modest which I think it’s just awful. I mean if you’re going to be that smart, at least be a horrible human being and, you know, make it okay. Okay, so, the last thing I want to close with is the idea of (and we’re going to look at this more next week) underfitting and overfitting. We just fit a line to our data. But imagine that our data wasn’t actually line ‘shaped’, right? And so if we try to fit something which was, like “constant + constant * x”, a line to it, then it’s never going to fit very well. Right? No matter how much we change these two coefficients, it’s never going to get really close. On the other hand, we could fit some much bigger equation, so in this case, it’s a higher degree polynomial, with lots of lots of wiggly bits like so. Right? 
But if we did that, it’s very unlikely that, if we go and look at some other place to find out what the temperature is and how much ice cream they’re selling, we’ll get a good result, because, like, the wiggles are far too wiggly. So this is called ‘overfitting’. We’re looking for some mathematical function that fits “just right”, to stay with the teddy bear analogy. So you might think, if you have a statistics background, the way to make things fit “just right” is to have exactly the right number of parameters. To use a mathematical function that doesn’t have too many parameters in it. It turns out that’s actually completely not the right way to think about it. There are other ways to make sure that we don’t overfit, and in general, this is called ‘regularization’. Regularization is all the techniques to make sure that when we train our model, it’s going to work well not only on the data it’s seen but also on the data it hasn’t seen yet.
So, the most important thing to know when you’ve trained a model is actually ‘how well does it work on data that it hasn’t been trained with’? And so, as we’re going to learn a lot about next week, that’s why we have this thing called a ‘validation set’. So what happens with a validation set is that we do our minibatch SGD training loop with one set of data (with one set of teddy bears, grizzlies, black bears) and then when we’re done, we check the loss function and the accuracy to see how good it is on a bunch of images which were not included in the training. And so, if we do that, then if we have something which is too wiggly, it’ll tell us. “Oh, your loss function and your error are really bad”, because on the bears that it hasn’t been trained with, the wiggly bits are in the wrong spot. Whereas if it was underfitting, it would also tell us that your validation set’s really bad.
So, like, even for people that don’t go through this course and don’t learn about the details of deep learning, like if you’ve got managers or colleagues or whatever, at work, who are kind of wanting to, like, learn about AI, the only thing that you really need to be teaching them is the idea of a validation set. Because that’s the thing they can then use to figure out, you know, if somebody’s selling them snake oil or not. You know, they hold back some data, and then they get told, like, “oh, here’s a model that we’re going to roll out”, and then they say “okay, fine… I’m just going to check it on this held-out data to see whether it generalizes.” There are a lot of details to get right when you design your validation set. We will talk about them, briefly, next week, but a fuller version would be in Rachel’s piece on the fast.ai blog called “How (and why) to create a good validation set”. And this is also one of the things we go into in a lot of detail in the ‘Intro to Machine Learning’ course. So we’re going to try and give you enough to get by for this course, but it is certainly something that’s worth deeper study as well. Any questions or comments before we wrap up? Okay, good. All right, well, thanks everybody. I hope you have a great time building your web applications. See you next week.
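
As a rough sketch of the underfitting, overfitting and validation-set ideas from the close of the lesson, the snippet below fits polynomials of different degrees to some made-up data and compares the loss on the training points with the loss on held-out points. Everything specific here is an assumption for illustration rather than something from the lesson: the quadratic "true" curve, the noise level, the degrees 1, 2 and 10, the 20/40 train/validation split, and the use of NumPy's `polyfit` instead of an SGD loop.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data that is NOT line-shaped: a quadratic curve plus noise.
x = rng.uniform(-1, 1, size=60)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.1, size=60)

# Hold back some data: train on the first 20 points, validate on the rest.
x_train, y_train = x[:20], y[:20]
x_valid, y_valid = x[20:], y[20:]


def mse(coeffs, xs, ys):
    """Mean squared error of the polynomial `coeffs` on the points (xs, ys)."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))


# degree -> (training loss, validation loss)
results = {}
for degree in (1, 2, 10):
    # Least-squares fit using the training points only.
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (
        mse(coeffs, x_train, y_train),
        mse(coeffs, x_valid, y_valid),
    )

for degree, (train_loss, valid_loss) in results.items():
    print(f"degree {degree:2d}: train loss {train_loss:.4f}, valid loss {valid_loss:.4f}")
```

The line (degree 1) cannot get close no matter how its two coefficients are chosen, so both of its losses stay high. The degree-10 polynomial always does at least as well as the others on the training points, because it has more freedom to wiggle, but the number to watch is its validation loss: if that is clearly worse than its training loss, the wiggles are in the wrong spot for data it has not seen.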

25 Comments

  1. Excellent introduction and many practical technical tips. One thing, I suspect at 1:55:51, those 3 charts may come from Andrew Ng Machine Learning MOOC at Coursera. Unsure if the Quora author properly credited things.

  2. This python library can be used to download Google images.

    https://github.com/hardikvasa/google-images-download

  3. 48:31 In what universe do you think the error rate gets better?
    With the default lr the error rate was 2 percent, and with a 1e-5 lr the error rate is 40 percent.

  4. I think your analogy at 16:00 about soccer is far off. First, you don't give the kid a ball to watch soccer, but you do give them some initial ideas on how the game works. Second, when you suggest someone might go academic on soccer, you're exaggerating to the absurd. Of course we're not suggesting you teach us basic arithmetic 1 + 1 = 2 so that we can learn deep learning, but it is certainly handy to have an overview of a neural network. After all, if we're going to exaggerate to the absurd, your videos are basically magic shows without Houdini…

    EDIT: and need I say that most people don't expect to spend years learning this stuff before they write production-level code (which your analogy to learning soccer for years suggests)?

  5. Hi Jeremy. The javascript doesn't seem to generate the text file for me. I am using google chrome.

    "urls = Array.from(document.querySelectorAll('.rg_di .rg_meta')).map(el=>JSON.parse(el.textContent).ou);
    window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));"

    This is the script that I am running. Am I doing anything wrong?

  6. Thank you very much!! I just have one question.
    How did loss get the backward function? It is a torch tensor and does not have a grad function

    Thanks

  7. Thanks! What do I need to know to create my own Python deep learning framework? What are the books and courses to get knowledge for this?

  8. How did Jeremy infer the learning_rate to be 3e-5 by looking at the graph? @28:40
    I'm surely missing something. Any help is appreciated. Thanks!

  9. I did different types of cucumbers (English, Field, and Lemon) by getting the images from Google. So fun! Sadly the GCP web deploy didn't work due to some fastai library changes I think. Would be nice if it worked out of the box so it should probably be updated.

  10. I made a YouTube video of a YouTube video, specifically this video, lol. I made an image classifier with a custom dataset in Google Colab, if anyone is interested.
    https://youtu.be/ubY_x2MMPuQ

  11. Hi, the JavaScript code is not working in my Chrome. I even disabled my ad blocker. It is downloading a file with no URLs in it. It is literally a 0 KB file. Please help me, someone.

  12. I love this course.

    However, I think that human creativity should play a role in experimentation while you experiment. When you are done experimenting and you want to show your results, you should make the notebook so that it runs top to bottom, or at least make it clear how to run the code.

    Second, setting random.seed so that anyone can reproduce the same result as you did may be very important. What if you didn't set it, you got good results, and someone else tries to reproduce them but is not able to? He now has a hard time telling whether he has some bug in his code or the worse results are just because of the randomness. First you should make it reproducible. Then you can see if it is robust. I think.

  13. I have a question; any help is much appreciated.
    I built a model to classify memes vs. not memes. It is classifying with a 0% error rate. But I'm not sure how to use this model on a whole new test folder (full of new images). Any ideas?
