ml5.js: Train a Neural Network with Pixels as Input

And you thought we were done with the ml5 neural network tutorials. But no, there is one more, because I am leading to something: you will soon see in this playlist a section on convolutional neural networks. Before I get to convolutional neural networks, though, I want to look at why a convolutional layer exists in the first place. I'll have to answer the question, "what is a convolution?" I'll get to that. But first, I want to start with another scenario for training your own neural network: an image classifier.

Now, you might rightfully be sitting there saying to yourself, "You've done videos on image classifiers before." And in fact, I have. The very beginning of this whole series was about using a pre-trained model for an image classifier. And guess what? That pre-trained model had convolutional layers in it. So I want to take the time to unpack what that means and look at how you could train your own convolutional neural network.

First, though, let's just think about how we would make an image classifier with what we have so far. We have an image, and that image is being sent into an ml5 neural network. Out of that neural network comes either a classification or a regression. In fact, we could do an image regression, and I would love to do that, but let me start with a classifier, because I think it's a lot simpler to think about. So maybe it comes out with one of two things, either "cat" or "dog," along with some kind of confidence score.

I previously zoomed in on the ml5 neural network and looked at what's inside: we have a hidden layer with some number of units and an output layer, which, in this case, would have just two units if there are two classes. Everything is connected, and then there are the inputs. With PoseNet, you might recall, there were 34 inputs, because there were 17 points on my body, each with an x,y position. So what are the inputs here? Let's say, for the sake of argument, that this image is 10 by 10 pixels. I could consider every single pixel to be an individual input into this ml5 neural network. But each pixel has three channels: R, G, and B. So that would make 100 times 3, or 300, inputs. That's reasonable.
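As a quick sanity check on that arithmetic (`videoSize` is the name I use for the 10-pixel width and height of the downscaled video):

```javascript
// Number of inputs for a square image with one input per color channel.
const videoSize = 10; // width and height of the downscaled video
const channels = 3;   // R, G, B (alpha is ignored)
const numInputs = videoSize * videoSize * channels;
console.log(numInputs); // 300
```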
So this is actually what I want to implement: take the idea of a two-layer neural network performing classification, the same thing I've done in previous videos, but this time use the actual raw pixels as the input. Can we get meaningful results from just doing that? After that, I want to come back and talk about how this approach can be improved on by adding another layer. The inputs will still be there; we're always going to have the inputs. The hidden layer will still be there, and the output layer will still be there. But I want to insert, right in between, something called a convolutional layer, specifically a two-dimensional convolutional layer. If you want to skip ahead to that next video, if and when it exists, that's where I'll start talking about it. But let's get this version working as a frame of reference.

I'm going to start with some prewritten code. All it does is run a simple p5.js sketch that opens a connection to the webcam, resizes the video to 10 by 10 pixels, and then draws a rectangle in the canvas for each and every pixel. This could be unfamiliar to you: how do you look at an image in JavaScript with p5.js and address every single pixel individually? If so, I would refer you to my video on that topic, which is appearing next to me right now; go take a look at that and then come back here. But really, this is just looking at every x,y position, getting the R, G, and B values, filling a rectangle, and drawing it.
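The one formula worth calling out from that starter sketch is how an (x, y) position maps into p5's flat `pixels` array, which stores four values (R, G, B, A) per pixel. A minimal sketch of that mapping, assuming a pixel density of 1 (the function name `pixelIndex` is mine, not part of p5):

```javascript
// Index of the R value for the pixel at (x, y) in a p5-style pixels
// array, where each pixel occupies four consecutive slots: R, G, B, A.
function pixelIndex(x, y, width) {
  return (x + y * width) * 4;
}

// In a 10x10 image: pixel (0, 0) starts at index 0,
// pixel (1, 0) at index 4, and pixel (0, 1) at index 40.
console.log(pixelIndex(0, 0, 10), pixelIndex(1, 0, 10), pixelIndex(0, 1, 10));
```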
So what I want to do next is think about how to configure this ml5 neural network, which expects that 10-by-10 image as its input. I'm going to make a variable called pixelBrain, and pixelBrain will be a new ml5 neural network. I should mention that you can find the link to the code I'm starting with, in case you want to code along with me; both the finished code and the starting code will be in this video's description.

To create a neural network, I call the neuralNetwork function and give it a set of options. One thing I should mention: while in all the videos I've done so far I've said that you need to specify the number of inputs and the number of outputs to configure your neural network, the truth is that ml5 is set up to infer the number of inputs and outputs from the data you're training it with. But to be really explicit about things and make the tutorial as clear as possible, I'm going to write those into the options.

So how many inputs? Think about that for a second: the number of columns times the number of rows times R, G, B. Maybe I could use a grayscale image, so I wouldn't need separate inputs for R, G, and B, but let's keep all three channels. Why not? I have the 10-by-10 size in a variable called videoSize, so let's make the number of inputs videoSize times videoSize times three. And let's make a really simple classifier, just "I'm here" or "I'm not here," so the number of outputs is two. The task is classification, and I want to see debugging output when I train the model. Now I have my pixel brain, my neural network. (Oops, that should be three.)
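Putting those options together, the setup looks something like this. The actual ml5.neuralNetwork() call is left as a comment so the snippet stands alone; `pixelBrain` and `videoSize` are the names used in the video:

```javascript
const videoSize = 10;

// Options for the ml5 neural network: one input per color channel of
// every pixel, two output classes, a classification task, and the
// debug visor shown during training.
const options = {
  inputs: videoSize * videoSize * 3, // 300 raw pixel values
  outputs: 2,                        // e.g. "here" vs. "not here"
  task: 'classification',
  debug: true,
};

// In the p5.js sketch this would be:
// pixelBrain = ml5.neuralNetwork(options);
console.log(options.inputs); // 300
```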
Let's go with my usual, typical, terrible interface, meaning no interface: I'm just going to train the model based on key presses. So I'll add a keyPressed() function, and, to be a little goofy here, when I press a key I'll call addExample(key), so the key that I press becomes the label. That means I need a new function called addExample(label). Basically, I'm going to press one key a bunch of times while I'm standing in front of the camera, and then press a different key when I'm not standing in front of the camera.

Now comes the harder work: I need to figure out how to make an array of inputs out of all of the pixels. Luckily for me, this is something I have done before. In fact, I have some code I could pull from right in here, the code that goes through all the pixels to draw them. But here's the thing: I am going to flatten the data. I am not going to keep the data in its original columns-and-rows orientation; I'm going to take the pixels and flatten them out into one single array. And guess what? This is exactly the problem that convolutional neural networks will address: it's bad to flatten the data, because its spatial arrangement is meaningful.

I'll start by creating an empty array called inputs. Then I'll loop through all of the pixels. To be safe, I should probably call video.loadPixels(). The pixels may already be loaded, because I'm doing that for the drawing down here, and I could build the input data in the same loop that draws the pixels, but I'm going to be redundant about it.

Ah, but here's the weird thing. I thought I wasn't going to talk about the pixel array in this video and just refer you to the previous one, but I can't escape it right now. For every single pixel in an image in p5.js, there are four spots in the array: a red value, a green value, a blue value, and an alpha value, alpha being transparency. The alpha value I can ignore, because it's going to be 255 for everything; there's no transparency. If I wanted to learn transparency, I could make that an input and have 10 by 10 times 4 inputs, but I don't need that here.

In other words, pixel zero starts here: 0, 1, 2, 3. And the second pixel starts at index four. So as I'm iterating over all of the pixels, I want to move through the array four spaces at a time. There are a variety of ways I could approach this, but that's going to make things easiest for me. So that means, right over here, this should be i += 4. Then I can say the red value is video.pixels[i], the green value is at i + 1, and the blue value is at i + 2. Just to be consistent, I'm going to put a + 0 in there so everything lines up nicely. So those are the R, G, and B values, and I want those R, G, and B values for this particular pixel to go in the inputs array.
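The loop I just described, sketched as a standalone function (the name `flattenPixels` is mine; in the sketch this logic lives inside addExample and reads from video.pixels):

```javascript
// Flatten a p5-style RGBA pixel array into an RGB-only inputs array,
// stepping four slots at a time and skipping every alpha value.
function flattenPixels(pixels) {
  const inputs = [];
  for (let i = 0; i < pixels.length; i += 4) {
    const r = pixels[i + 0];
    const g = pixels[i + 1];
    const b = pixels[i + 2];
    // pixels[i + 3] is alpha: always 255 here, so it is ignored.
    inputs.push(r, g, b);
  }
  return inputs;
}

// Two pixels' worth of RGBA data becomes six RGB values:
console.log(flattenPixels([10, 20, 30, 255, 40, 50, 60, 255]));
// [ 10, 20, 30, 40, 50, 60 ]
```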
The chat is making a very good point, which is that I already have all of this data in an array, and all I'm really doing is building a slightly smaller array that removes every fourth element. I could do that with the filter() function or some other higher-order function, or maybe just use the original array directly. I'm not entirely sure why I'm doing it this way, but I want to emphasize this data-preparation step. So I look forward to hearing your comments about, and maybe reimplementations of, this code that use the pixel array directly.
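For the curious, the chat's suggestion could look something like this one-liner, dropping every fourth (alpha) element with filter():

```javascript
// Keep only R, G, and B by filtering out every fourth (alpha) slot.
function flattenWithFilter(pixels) {
  return pixels.filter((_, index) => index % 4 !== 3);
}

console.log(flattenWithFilter([10, 20, 30, 255, 40, 50, 60, 255]));
// [ 10, 20, 30, 40, 50, 60 ]
```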
But I'm going to keep it this way for right now. So I'm taking the R, G, and B values and putting them all into my new inputs array. Then the target is just the label, a single label in an array. And I can now add this as training data: pixelBrain.addData(inputs, target). Let's console.log() the inputs and the target, just to see that something is coming out. So: "a"... yeah, we can see there's an array there, and there's the "a". And now if I press "b", I'm getting a different array with "b" there. So I'm going to assume this is working. I can also check inputs.length to make sure that's the right idea. Yeah, it's got 300 things in it.

OK, the next step is to train the model. So I'm going to say: if the key pressed is "t", don't add an example but rather train the model. Let's train it over 50 epochs and add a callback for when it's finished training. Let's also add an option to save the data, in case I want to stop and start a bunch of times without collecting the data all over again.

And I'm ready to go, except I missed something important. I have emphasized before that when working with neural networks, it's important to normalize your data: to take the data you're using as inputs or outputs, look at its range, and standardize it to some specific range, typically between 0 and 1, or maybe between -1 and 1. It is true that ml5 will do this for you; I could just call normalizeData(). But this is a nice opportunity to show that I can do the normalization myself. (This is another reason to build a separate array, sort of.) I know that the range of any given pixel color value is between 0 and 255, so let me take the opportunity to divide every R, G, and B value by 255 to squash it, to normalize it between 0 and 1.
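That change, folded into the flattening loop (again sketched as a standalone function with a name of my choosing):

```javascript
// Flatten RGBA pixels to RGB and normalize each channel from the
// 0-255 range down to 0-1 by dividing by 255.
function normalizedInputs(pixels) {
  const inputs = [];
  for (let i = 0; i < pixels.length; i += 4) {
    inputs.push(pixels[i + 0] / 255); // R
    inputs.push(pixels[i + 1] / 255); // G
    inputs.push(pixels[i + 2] / 255); // B
  }
  return inputs;
}

console.log(normalizedInputs([255, 0, 51, 255]));
// [ 1, 0, 0.2 ]
```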
Let's see if this works. I'm going to collect the data. This is a little bit silly, but I'm going to press "h" for me being here in front of the camera. Then I'm going to move off to the side and use "n" for not being in front of the camera. I'll just collect a little bit right now and then hit "t" for train. And... the loss function is going crazy, but eventually it gets down. It's a very small amount of data that I gave it to train with, but we can see that I'm getting a low loss.

If I had built the inference stage into the code, it would start to guess "Dan" or "no Dan." So let's add that in. When the model is finished training, I'll start classifying. The first thing I need to do, if I'm going to classify the video, is pack all of those pixels into an input array again. Then I can call classify() on pixelBrain and add a function to receive the results. Let's do something fun and have it say hi to me. I'm going to make label a global variable with nothing in it, and then in the results callback I'll say label = results[0].label. After I draw the pixels, I'll either write "hi" or not write "hi." Just to see that this works, let's set the label to "h" to start: it says hi. Now let's clear it and go through the whole process. Train the model... and it says hi. Oh, but I forgot to classify the video again after I get the results, so it classified only once. I want to continue recursively: after I get the results, classify the video again.
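That classify-then-classify-again loop has the shape of a callback recursion. Here's a minimal sketch of just that control flow, with a stubbed classify() standing in for pixelBrain.classify() (in the real sketch, ml5 calls the callback with an error and an array of { label, confidence } results; the cap on iterations is only for this demo, since the real loop runs as long as the sketch does):

```javascript
let label = '';

// Stand-in for pixelBrain.classify(inputs, gotResults): it calls back
// immediately with fake results shaped like ml5's output.
function classify(inputs, callback) {
  callback(undefined, [{ label: 'h', confidence: 0.9 }]);
}

let calls = 0;
function gotResults(error, results) {
  if (error) {
    console.error(error);
    return;
  }
  label = results[0].label; // most confident class
  calls++;
  // Classify again to keep the loop going (capped here for the demo).
  if (calls < 3) {
    classify([], gotResults);
  }
}

classify([], gotResults);
console.log(label, calls); // h 3
```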
Just so we can finish this out: I actually saved all of the data I collected to a file called data.json. So now I can say pixelBrain.loadData('data.json'), and when the data is loaded, train the model. Now I've eliminated the need to collect the data every single time. Let's run the sketch. It trains the model (I don't really even need to watch this), and when it gets to the end: hi. Hooray! I'm pleased that that worked.

I probably shouldn't, but I just want to try having three outputs. Let's try something similar to what I did in my previous videos using Teachable Machine to train an image classifier, with this ukulele, a Coding Train notebook, and a Rubik's cube. So let me collect a whole lot of data: I'll press "u" for ukulele, "r" for Rubik's cube, and "n" for notebook, save the data in case I need it later, and train the model. All right: ukulele, "u"; "n" for notebook; and can we get an "r"? I stood to the side when I was doing the Rubik's cube, so that's pretty important. It's not working so well, and that's not a surprise; I don't expect it to work that well. This is why I want to make another video covering how to take this very simplistic approach and improve upon it by adding something called a convolutional layer. What is a convolution? What are the elements of a convolutional layer? How do I add one with the ml5 library? That's what I'm going to start looking at in the next section of videos.

But before I go, I can't resist doing one more thing, because I really want to demonstrate what happens if you change from using pixel input to perform a classification to a regression. I took code from my previous examples demonstrating how regression works in ml5, and I changed the task to regression. I had to lower the learning rate to get this to work; thank you to the live chat, who helped me figure that out after over an hour of debugging. I trained the model with me standing in different positions, each associated with a different frequency played by the p5.js sound library. You can see some examples of me training it over here. Now I'm going to run it and see if it works, and that'll be the end of this video. I had saved the data, so now it's training the model, and as soon as it finishes training, you'll be able to hear it.

All right, so I will leave that to you as an exercise. I'll include the link to the code in the description on the web page for this particular video, and you can find the link to a livestream where I spend over an hour implementing it. If you followed this video and have image classification working, can you change it to a regression and have it control something with a continuous output?

OK, if you made it this far, [KISSING NOISE] thank you. And I will be back in the next video to start talking about convolutional neural networks and what they mean.

[MUSIC PLAYING]

