Data Analysis 7: Clustering – Computerphile

Today we’re going to talk about clustering Do you ever find when you’re on YouTube you’ll watch a video on something and then suddenly you’re being recommended a load of other videos That you hadn’t even heard of that are actually kind of similar. This happens to me I watched some video on some new type of saw trying to learn it because you know don’t know what I’m doing and suddenly I’m Being recommended videos on turning verses on wooden lathes and all kinds of weird stuff And what’s happening is I’m being clustered into groups of people Who are liking those and watching those kind of videos or these kind of videos are all being clustered together as similar, right? So clustering is it’s one of the sort of core technologies at the heart of this kind of stuff in fairness I did end up watching a bunch of those woodturning videos We’ve talked about the different kinds of datasets you might have right and until up to now we’ve been talking about things like cleaning data transforming data and reducing data Now what we want to do is start trying to derive some knowledge now sort of a typical way to do This would be something like a classification algorithm or maybe a regression ad with them But today we’re going to talk about how do you separate out data and group data together? When we don’t have any labels for that data, so this is an example of unsupervised learning Different types of machine learning that we can use one is unsupervised. This is where we don’t have any labels Then we have supervised learning where we have labels so for example a supervised learning task might be one where you have some labels for you for your products or your videos and you’re Trying to classify them like that. So maybe you’re trying to classify videos into a genre right or Unsupervised learning we don’t have any labels Maybe we’ve just got a load of products and we’re trying to group them into into similar categories So these are all the tools and these will be electronic products And these are all the toys right and maybe we want to do that without having to have a person go through and click on them all so why wouldn’t we just Label everything and then use that to perform nice powerful machine learning algorithms like classification Well, sometimes it’s just too expensive to produce labels if you’re running a music recommendation system I like the wonder power Spotify maybe going through and defining what genre everything is by hand is going to take Absolutely ages and one waste of time right way if we can do this automatically so sometimes you aren’t gonna have labels for data And it’s too expensive to obtain. It’s too time-consuming. It’s too difficult Maybe you don’t people disagree over what John or two pieces of music are so in that case you are going to have labels and so Clustering is a good option right Let’s group things together with things that are similar in terms of their attributes without actually knowing what they are What we’re going to try and do then is take all the attributes and all the instances objects and use them to group them into similar objects, but the question is what is similar like well Let’s think back to the way that we structure our data we’re going to have rows as our instances and we’re going to have columns As our attributes and another way We remember to think about that is that these actually are data points in some space Where the position of each point or the position of each instance depends on the attributes? So, for example, we could have a three attribute data set So maybe we have Row 1 2 3 4 and we have attribute a B and C, right? So, I don’t know maybe one has a value for a in a value for B and a value for C So this means this is a sort of three dimensional data set with our axes a B and C So we’re going to have something like this and then see you guys office So this is a this is B, and this is C So maybe in this data set, you know instance 1 appears here and 2 over here and 3 over here and 4 over here Right you’ve just I’m imagine These are in some sort of 3d space, you know, perhaps intuitively we can say well ok One is maybe closer to 4 than 2 is because this is shorter distance But of course, this is a three dimensional space is hard to visualize but doesn’t matter how many attributes we have We can still say well ok, in this space. These instances are closer to these instances and then we can start grouping things together So maybe I mean actually 2 is very far away from free So maybe we sort of group these two up and group these two up or something like this So typically we’re going to use you’re clearly in distance for this Right, which is going to be this best distance here between points 1 & 2 in this 3 dimensional space There’s obviously going to be questions about how many groups are we grouping them into? It’s 3 too far away to be in any of these groups These are things we have to think about but this applies to any number of dimensions as just simply the only thing holding you back is just how fast your computer is and how fast Go food is we’re gonna look at two different Clustering algorithms, right? The first is going to be k-means and then we’re going to look at pam Alright, which is slightly different now We talked about k-means in computer file before but we’re going to talk about it here for this course So just you know to keep things simple for my arm and drawing i’m gonna think the two dimensions here and we’ve got some sort Of data, right which is sort of looks like this and there may be some more data over here And you know to us we can sort of say well maybe there’s two groups But we want to sort of formalize this process and you’ve got to consider that in two dimensions Maybe it’s quite clear that there’s two groups if you’ve got a sort of n-dimensional space maybe a thousand dimensions or ten thousand dimensions Picking out where the groups are is not something you want to be doing by hand So what k-means does is it splits some data into K groups, right? So I’m going to specify them Oh strike straight away but K is 2 in this case because I think there are two classes here Now if I get that wrong, obviously, that’s a problem. We’ll talk about that. Later But what we’re gonna do is we’re gonna pick two random points in this space. So let’s say this one here and This one here So we’ve got two classes and we’re going to start to assign each of these points based on whichever of these means is closer So these are the center points for our new groups generally speaking. Obviously, this is going to be clearly in distance So essentially a circle in this case So we’re going to sort of look sort of like this and the blue one’s going to come around So kind of like this kind of like this and these will probably be red because they’re slightly closer So now but all these are red. What we’re going to do is we’re going to label these all red I can only do one iteration of this because now painted all over my picture we start by assigning all of them now We might be finished. But let’s imagine. We’re not what we want to try and do is Reevaluate where the positions of our clusters are based on this information So we take the mean or the average position of this group here the red group and we can say well, okay It’s sort of bang in the middle here. So we get rid of this one. I’m gonna this above our pen. Oh it worked Here’s my new center position here Right, the blue one, which I’m going to have to scribble out is going to move to our about there something like this So that’s iteration one right now. We’ve we calculated these center points so this blue region of what’s going to be classified as blue and what’s going to be classified as red it’s kind of going to Move this way a little bit. So I guess we’re going to maybe reevaluate and this is going to become blue Ooh, this is going to be an iterative process we’re going to keep recalculating these means based on the points that have moved back and forth between these two groups and Eventually, these means should begin to converge and stop moving around as things settle down And usually this actually happens pretty quickly. I even in a large dimensional space k-means is a very popular algorithm. It’s got a few drawbacks One is that let’s imagine. We had a single point way over here an outlier right now Hopefully you’ve got rid of most of our lives from the previous video But if you haven’t and you’ve got an outlier here that you weren’t expecting Then what’s going to happen is this is going to be assigned it in the first iteration to be blue It’s going to pull the mean of this group this way which means that more of them are going to be assigned red and Red is going to go this way as well and it’s just going to move the means around and cause a bit of a problem We might get away of it in this case But you can imagine if you’ve got a large high dimensional space and you’re trying to cluster lots and lots of clusters Getting the means in the wrong position could cause a bit of instability cause the wrong plate things to be classified and clustered together There’s a couple more issues one is that you know Where you start your means on the first iteration is obviously quite important if you place it at random There’s a charge you’re going to put it right up here and things could take a lot longer to converge or could settle on some Clustering that you’re not happy with so this outlaw is going to be a problem, right? It’s going to make K means struggle slightly So as an alternative we can use which is called Pam or partitioning around meds by or Kay meds Whatever you want to call it instead of calculating a mean for our cluster and moving those means around what we’re going to do is Use actual points from our cluster So what we do is we start off exactly the same as k-means but instead of picking two random positions we pick two random points So for example, what we’ll do is we’ll pick this red one here and we’ll pick this blue one here now These are treated exactly like the means in kami So we’ve in cluster our data around these two points and then we calculate an error for each cluster That is the distance from all the other points. We assign to it into that cluster so you can imagine hopefully if this point has been chosen in the middle of a cluster then the distance will be Quite small because everything will be tightly bound together if it’s we’re over here as an outlier It’s going to be a huge error because the distance to all of these points is massive So then what we do is we pick a group at random and we move the center to another point So we okay we were here let’s move to here and we Repartition our data and we calculate a new error per distance to all our new clusters based on this new position that we just picked And if it’s better, we permanently move our center point there if it’s not we go back to where we were before we pick a new cluster at random and a new point at random and We repeat this process. So in k-means you move both means in fact, however, many Group clusters, you’ve got you’re going to move all the means at the same time, right? Because you repartition the data all the means are going to move around and then you reposition the data and you repeat like this in Pam you just move one mean or one Exemplar or meadow at a time? So let’s say you pick the red one first You move that and maybe pick the red one again and you move that and then it’s blues turn you move that And obviously this is gonna take a little while to do over time Hopefully what will happen is you find that? More and more of a time you try and move and it doesn’t work because you just increase the error because you settled on something really helpful and Also eventually if you take long enough doing this you’re gonna visit it all your points and then you might as well stop as well so typically What you would do is stop after you Fail a number of times to move somewhere better Because really you actually found somewhere pretty good this neatly avoids our problem of outliers because this one here won’t affect the position of this cluster because if we ever chose it to Be a center it will be immediately discarded because the error is so large As opposed to it actually affecting the mean and pulling this cluster this direction So there’s one last problem and that is the problem of how did we get? This – I Said that I thought there were two clusters in this data and happily there were and that worked out really nicely But if you’ve got, you know a huge data set There’s no way to guess how many clusters this is going to be And or if you do maybe that’s not the optimal number of clusters So for example, if you’re trying to cluster up songs and Spotify, I mean how many clusters is that? I have no idea like lots so you put 80 in and it’s okay But is that should you go up should you do 100 or should you do 60? I don’t know so there are approaches like DB scan which will try and bring in the concept of a neighborhood and have the ability to increase or Decrease the number of clusters as appropriate for your data. All right So what’s going to happen is they’ll say this looks good But if we split this in two and had two clusters here instead That will be a better fit right so these are very useful Technique so you can use if you want something a little bit more powerful Now it wouldn’t be a date of an artist course if we didn’t look at the iris dataset at least once this is a classic Data set everyone uses and it’s good for clustering nice and small and we can have a look and this data set We’ve got three different species of flower. We’ve got so Tosa versicolor and Virginica, we’ve got four attributes. We’ve got several length sepal width petal length petal width just for this occasion I looked up what a sepal is and it’s the green bit that covers the flower when it’s folded up right now I don’t know much about these flowers, but they are subtly different. One of them is a little bit more different than the others So it makes for a good clustering problem because we’re hoping for three distinct clusters the iris dataset is one of the ones that’s built into our you can literally call data iris and It’ll load it up for you now Let’s have a quick look at what we’ve got because they’re lovely function in are called pairs Which just shows us a load of scatter plots of different attributes so if I run this This is only going to work for a few attributes before the whole thing becomes very difficult to look at So we’ve got things like sepal length sepal width and the correlations of these and these are colored by the different class of flower so you can see if the three class is one of them is actually quite different a lot of the time and then some of Them bees this red and green class. They’ve got quite a lot of overlap So clustering nose is going to be a little bit more difficult bearing in mind. We’re using four dimensions to do it Not the two you’re seeing in any individual scatter plot. Okay. So let’s just start off with standard k-means so we’re going to call km3 k-means with three clusters is K-means, there’s a function for this in R on the iris data set all of the rows 1 to 4 So not the species of plant We’re not going to custom on that three clusters and we’re going to allow it to go 400 iterations K-means will stop early if it doesn’t improve itself, but if it keeps going maybe it’s just going back and forth a little bit It’s time to stop that did not take very long This object returned by the k-means function is going to have an integer determining which of our instances have been assigned to which cluster so all of these first ones have been assigned to cluster two and The Centers for all of our clusters as well So remember that in our we only have a data frame like this iris we can add other columns to it So we’re going to just add our k-means result back into our it’s data frame so we can keep track of it So we’re going to say iris km3 is equal to km3 dollar Cluster that’s gonna be in there. Okay, so let’s put it in a table We’ll have a look at how our clusters match up to our actual number of flowers We’ve got so it’s going to be a table of the irf species versus the iris clusters from k-means Alright, so if we have a look at that the first thing we’ll see is that it doesn’t make absolutely much sense because for example say Tozer which is our class 1 in some Sense has been assigned to cluster 3. So what we’re going to do is we’re going to reorder these columns so that the Correct. Classifications are down the diagonal much like a confusion matrix. So we have a function to do that that we’re going to call and If we look at this result, we can see that the results are an 89% Classification accuracy, there were 50 of each plant in this dataset 48 of these plants have been correctly assigned to cluster two together But two of them were in cluster 1 along with the other virginities and finally the virginica has been 36 of 50 Correctly assigned to cluster 1 and 14 have been incorrectly clustered into cluster 2, right so it worked pretty well. It’s not perfect Bearing in mind if you really want to separate out plants. Maybe you need more than 4 dimensions Maybe you can’t absolutely tell what a plant is just based on 4 dimensions All right, some of these plants are similar enough, but the clustering isn’t very well defined So perhaps we can make our life a little bit easier by using principal component analysis to do dimensionality reduction Or just to reframe our data onto some different axes to get better clustering result. So we’re going to do a very similar thing We’re going to run PCA on The iris dataset and we’re going to project our points into that new Principal component space and then we’re going to take only the first two dimensions So this is principal component 1 and principal component 2 as we covered in the principal component video Then what we’re going to do is we’re going to apply caming stavos results rather than the original data So what we’ve done is we’ve transformed our 4 dimensions of sepal width sepal length Petal length and petal width onto our principal component axes and then we’ve discarded the last two and kept just two axes So I’m going to run that that didn’t take very long Ok We’re going to sign that back to our iris data set just like we did with the results of k-means and then we can bring Up another table and see how the results compare table to and then we’ll order that again by the diagonal Results were almost exactly the same. I think it was 88% 89% something like this You can see that one extra Versicolor was put into cluster 2 when it shouldn’t have been I but this is with only 2 dimensions instead of 4 dimensions So we’ve Harbor number of dimensions but by using PCA we’ve got almost the exact same result for datasets that You don’t have labels for maybe the labels are too hard to get or you don’t know what they would be I Think clustering is a good way to group up data and start to derive some knowledge the knowledge we can derive or what what items are similar to each other by which products in our database are similar to each other so that we can start using them for a Recommender system, you know, what movies are like each other what songs are like each other like what flowers are like each other? So the ideas that were clustering data up and by doing that we can look at these clusters and start to gain some knowledge Don’t forget also that each of these is going to have a prediction as well so this one here attribute one is going to have let’s say like a label if we did play tennis or this person is Healthy or this person has this disease. It depends on you


  1. Just submitted my last paper for my masters degree that HEAVILY relies on clustering algorithms. I really wish this video was released 2 years ago.

  2. where can i get this browserizor software?
    btw, centeric clustering fails in clustering two long objects that are near. or when one object is surrounded by another.

  3. Is there a degree to which the dimensions are weighted when you cluster? Or would you apply the weighting to your data before clustering them?

  4. Only at 11 min in, but erm confused. You're clustering, but you don't have any idea of what you're clustering on? You aren't clustering on a dimension? I.e., music genre
    What gives you a red/blue division in the first place – "it just looks like here's a cluster… And that, there, is a cluster. Now, let's make 'em fit…"? I'm not understanding.

  5. To hopefully make it clearer to everyone, the iris labels were hidden from the clustering algorithms and then only used after the fact to see how well the clusters recovered the true labels. So the task was still unsupervised because the supervision was hidden. This is a standard technique for evaluating unsupervised machine learning algorithms.

  6. 13:58 "It's worked pretty well; it's not perfect."

    I feel like that should be the slogan for this course and for data science in general.

Leave a Reply

Your email address will not be published. Required fields are marked *