Training Set vs Test Set
Machine learning algorithms adapt a model based on a set of training data. Training data is a data set that contains all of the variables we have available as well as the correct classification. Training sets can be developed in a variety of ways, but in this tutorial we’ll be using a training set that was classified by a human expert. It’s important to remember that machine learning models are only as good as their training data. The more accurate your training data and the more of it you have, the better. In other words: garbage in, garbage out.
A test set is typically a portion of the same labeled data that is held back from training; like the training set, it contains all of the variables and the correct classifications. The difference is in how we use it. While the training set helps to develop the model, the test set tries the model out in a realistic scenario, on data it has not seen before, and shows how well it fares. There are lots of complicated ways to measure error and test models, but as long as you get the basic idea we can keep going.
Classification Models
A Classification Model is simply a mathematical tool to determine what category or class of something you’re dealing with based on a set of variables or inputs. For example, if I wanted to classify whether an animal was a cat or a fish, I might use variables such as whether or not the animal swims, whether or not it has fur, and whether or not it eats to determine which class it falls under. You’ll notice two things. Firstly, the more variables you have, the better. With more information, you can be more confident that your classification is correct. Secondly, some variables are more useful or predictive than others. Take the last example, whether or not the animal eats. The casual observer knows that both fish and cats eat, so having this piece of data isn’t useful in determining the class of the animal. The goal of machine learning in this context is to create the most useful classification model given the available data and to weed out the inputs that don’t improve the effectiveness of the model.
K Nearest Neighbors
K-Nearest Neighbors (KNN) is a specific type of Classification Model. The intuition is simple to understand. The model takes all of the data available about an unknown data point and compares it to a training set of data to determine which points in that training set the unknown point is most similar, or closest, to. The idea is that the unknown data point will most likely fall under the same class as the known data points it is most similar to. KNN is simply a mathematical way to determine the similarity between two data points.
The Iris Data Set
For this tutorial, we’ll be using a classic data set used to teach machine learning called the Iris data set. It covers three species of the Iris flower and records four measurements for each flower: sepal length, sepal width, petal length, and petal width. The data set has already been prepared to make it easy for beginners to jump right in. You can download the data in a compatible Excel format by clicking “download zip” in the top right and opening the contents in Excel.
=RAND()
=RANK(F2, $F$2:$F$151)
=IF(G2<=105, "Training", "Test")
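For readers who want to sanity-check the split outside Excel, here is a minimal Python sketch of what the three formulas above appear to do together: draw a random number per row, rank the draws, and label the top 105 ranks as training data. The function name `split_rows` and the fixed seed are assumptions made for illustration; they are not part of the workbook.

```python
import random

def split_rows(n_rows, n_train, seed=0):
    """Label each row 'Training' or 'Test' the way the RAND / RANK / IF columns do."""
    rng = random.Random(seed)  # seeded for repeatability; Excel's RAND() re-draws on every recalculation
    draws = [rng.random() for _ in range(n_rows)]  # =RAND()

    # =RANK(...): with the default order, the largest draw gets rank 1
    order = sorted(range(n_rows), key=lambda i: draws[i], reverse=True)
    ranks = [0] * n_rows
    for rank, row_index in enumerate(order, start=1):
        ranks[row_index] = rank

    # =IF(rank<=105, "Training", "Test")
    return ["Training" if r <= n_train else "Test" for r in ranks]

labels = split_rows(n_rows=150, n_train=105)
print(labels.count("Training"), labels.count("Test"))  # 105 45
```

Because every row gets a distinct rank, the split is always exactly 105 to 45, whichever rows happen to land on each side.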
The Concept of Distance
Distance is the way mathematicians determine which points are most similar in an n-dimensional space. The intuition is that the smaller the distance between the points, the more similar they are. Most of us are used to calculating distance in a 2-dimensional space, such as an x,y coordinate system or using longitude and latitude. There are several ways to calculate distance, but to keep it simple we’re going to use the Euclidean distance. Below is a visualization of the Euclidean distance formula in a 2-dimensional space. As you can see, the formula works by creating a right triangle between two points and determining the length of the hypotenuse, the longest side of the triangle, as identified by the arrow.
Calculating The Distance
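For reference, the Euclidean distance can be written out as follows. The symbols p and q (the two points) and the subscripts are notation chosen here for illustration; the second form is the same right-triangle idea extended to n dimensions, with n = 4 for the four Iris measurements.

```latex
% Two dimensions: the length of the hypotenuse between p and q
d(p, q) = \sqrt{(p_x - q_x)^2 + (p_y - q_y)^2}

% n dimensions (n = 4 for the Iris measurements)
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
```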
In our workbook, create a new worksheet called “Distance.” Our goal for this sheet is to create a 45×105 matrix of the distances between each data point in the test set and each data point in the training set. In our case, each row will correspond to one data point in the test set and each column will correspond to one data point in the training set. Starting in A2 and working down line by line until you hit A46, fill the cells with the numbers 1–45. Again, the fill handle is useful here so you don’t have to type the numbers one by one. Now, working from B1 and then column by column horizontally across until you hit DB1, fill the header cells with the numbers 1–105. Your matrix should look something like the screenshot below, which shows a small portion of it.
=SQRT(((VLOOKUP(NUMBERVALUE(Distance_Table[[#Headers],[1]]), 'Training Set'!$A$1:$F$106, 2, FALSE)-VLOOKUP(Distance_Table[@[Test ID]:[Test ID]], 'Test Set'!$A$1:$F$46, 2, FALSE)) ^ 2+(VLOOKUP(NUMBERVALUE(Distance_Table[[#Headers],[1]]), 'Training Set'!$A$1:$F$106, 3, FALSE)-VLOOKUP(Distance_Table[@[Test ID]:[Test ID]], 'Test Set'!$A$1:$F$46, 3, FALSE)) ^ 2+(VLOOKUP(NUMBERVALUE(Distance_Table[[#Headers],[1]]), 'Training Set'!$A$1:$F$106, 4, FALSE)-VLOOKUP(Distance_Table[@[Test ID]:[Test ID]], 'Test Set'!$A$1:$F$46, 4, FALSE)) ^ 2+(VLOOKUP(NUMBERVALUE(Distance_Table[[#Headers],[1]]), 'Training Set'!$A$1:$F$106, 5, FALSE)-VLOOKUP(Distance_Table[@[Test ID]:[Test ID]], 'Test Set'!$A$1:$F$46, 5, FALSE)) ^ 2))
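The distance formula above is long mostly because of the lookups. The calculation itself is short, as this Python sketch suggests. It builds the same kind of test-by-training distance matrix from a handful of made-up measurements. The values and the names `training`, `test` and `distance` are placeholders for illustration, not rows from the actual workbook.

```python
import math

def euclidean(p, q):
    """Square root of the summed squared differences, one term per measurement."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Made-up rows: ID -> (sepal length, sepal width, petal length, petal width)
training = {
    1: (5.0, 3.4, 1.5, 0.2),
    2: (6.4, 3.2, 4.5, 1.5),
    3: (6.3, 3.3, 6.0, 2.5),
}
test = {
    1: (5.1, 3.5, 1.4, 0.2),
    2: (6.0, 3.0, 4.8, 1.8),
}

# One row per test point, one column per training point (45 x 105 in the workbook)
distance = {
    test_id: {train_id: euclidean(t, tr) for train_id, tr in training.items()}
    for test_id, t in test.items()
}

for test_id, row in distance.items():
    print(test_id, {train_id: round(d, 3) for train_id, d in row.items()})
```

Each VLOOKUP pair in the Excel formula corresponds to one `(a - b)` term here: columns 2 through 5 of the two sheets are the four measurements.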
Finding Nearest Neighbors
At this stage we have calculated the distance between every point in our test set and every point in our training set. Now we need to identify the closest neighbors to each point in our test set. Create a new worksheet called “Nearest Neighbors” and, starting at A2, work down line by line to fill the cells with the numbers 1–45 to correspond with the points in our Test Set. Our columns are not going to represent the Training Set like they have on previous sheets. Instead, they are going to represent the 6 closest neighbors, starting with the 1st closest, then the 2nd closest, and so on. The 1st closest neighbor has the smallest distance, the 2nd closest neighbor has the second smallest distance, and so on. Your sheet should look like this:
=INDEX(Distance_Table[#Headers], MATCH(SMALL(Distance!$B2:$DB2, 1), Distance!2:2, FALSE))
=INDEX(Distance_Table[#Headers], MATCH(SMALL(Distance!$B2:$DB2, 2), Distance!2:2, FALSE))
=VLOOKUP(NUMBERVALUE(INDEX(Distance_Table[#Headers], MATCH(SMALL(Distance!$B2:$DB2, 1), Distance!2:2, FALSE))), 'Training Set'!$A$1:$F$106, 6, FALSE)
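The SMALL / MATCH / INDEX / VLOOKUP chain above boils down to "sort one row of distances and read off the classes of the first few training IDs." Here is a rough Python equivalent for a single test point. The distances, the labels and the function name `nearest_neighbors` are invented for illustration.

```python
def nearest_neighbors(distance_row, training_labels, k):
    """Return the classes of the k closest training points, nearest first."""
    # Sorting IDs by distance plays the role of SMALL(..., 1), SMALL(..., 2), ...
    ranked_ids = sorted(distance_row, key=distance_row.get)
    # Looking up each ID's class plays the role of the outer VLOOKUP
    return [training_labels[train_id] for train_id in ranked_ids[:k]]

# Made-up distances from one test point to five training points (ID -> distance)
distance_row = {1: 0.17, 2: 3.62, 3: 0.45, 4: 4.10, 5: 0.30}
training_labels = {
    1: "setosa",
    2: "versicolor",
    3: "versicolor",
    4: "virginica",
    5: "setosa",
}

print(nearest_neighbors(distance_row, training_labels, 3))
# ['setosa', 'setosa', 'versicolor']
```

One difference is worth knowing about. If two training points sit at exactly the same distance, MATCH returns the first matching column both times, so the spreadsheet may list the same neighbor twice. The sort above keeps the two IDs separate.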
Taking A Step Back
Let’s take a step back and look at what we’ve accomplished. You’ve now identified, for each point in your test set, the classification of its 6 nearest neighbors. You will likely notice that for all or almost all of your data points the 6 nearest neighbors fall into the same classification. This means that our data set is highly clustered. In our case, the data is highly clustered for two reasons. Firstly, as we discussed at the start of the tutorial, the data set is designed to be easy to work with. Secondly, this is a low-dimensional data set, since we are only working with 4 dimensions. As you deal with real-world data, you will typically find that it is far less clustered, especially as the number of dimensions increases. The less clustered your data, the larger the training set will need to be to build a useful model.
Too Few or Too Many Neighbors
Intuitively, it’s important to understand why this problem is tricky. It is possible to look at too few neighbors and also at too many. Especially as the number of dimensions increases, the single nearest neighbor will not always belong to the correct class. Looking at too few neighbors limits the amount of information your model has available to make its determination. Considering too many neighbors will actually degrade the quality of the information your model uses as an input, because each additional neighbor is farther away and introduces more noise. Just think about it: it wouldn’t make sense to consider all 105 neighbors in our example! See a visual representation of this concept below.
Using Your Test Set
For this tutorial, we’ll use a very simple process of trial & error to determine the optimal K value. Before we move on, I recommend looking at your Nearest Neighbors worksheet and making a guess as to what the best K value might be, just for fun. We’ll find out soon enough if you’re right!
Setting Up The Algorithm
An algorithm is just a set of steps for a computer to repeat over and over again according to a defined set of rules. In this case, we will tell the computer to try different K values, calculate the error rate for each one using our test set, and then return the value that produces the lowest error rate. To do this we’ll need to create a new worksheet called “KNN Model.” We’ll set it up as follows, labeling cells A4 through A48 with 1–45 for each of our test data points.
='Nearest Neighbors'!B2
=IFERROR(INDEX('Nearest Neighbors'!B2:C2,MODE(MATCH('Nearest Neighbors'!B2:C2,'Nearest Neighbors'!B2:C2,0))), 'Nearest Neighbors'!B2)
=IFS($B$1=1, 'Nearest Neighbors'!B2, $B$1=2, IFERROR(INDEX('Nearest Neighbors'!$B$2:$C$2,MODE(MATCH('Nearest Neighbors'!$B$2:$C$2,'Nearest Neighbors'!$B$2:$C$2,0))), 'Nearest Neighbors'!B2), $B$1=3, IFERROR(INDEX('Nearest Neighbors'!$B$2:$D$2,MODE(MATCH('Nearest Neighbors'!$B$2:$D$2,'Nearest Neighbors'!$B$2:$D$2,0))), 'Nearest Neighbors'!B2), $B$1=4, IFERROR(INDEX('Nearest Neighbors'!$B$2:$E$2,MODE(MATCH('Nearest Neighbors'!$B$2:$E$2,'Nearest Neighbors'!$B$2:$E$2,0))), 'Nearest Neighbors'!B2), $B$1=5, IFERROR(INDEX('Nearest Neighbors'!$B$2:$F$2,MODE(MATCH('Nearest Neighbors'!$B$2:$F$2,'Nearest Neighbors'!$B$2:$F$2,0))), 'Nearest Neighbors'!B2), $B$1=6, IFERROR(INDEX('Nearest Neighbors'!$B$2:$G$2,MODE(MATCH('Nearest Neighbors'!$B$2:$G$2,'Nearest Neighbors'!$B$2:$G$2,0))), 'Nearest Neighbors'!B2))
=VLOOKUP(A4, 'Test Set'!$A$1:$F$46, 6, FALSE)
=IF(B4=C4, 0, 1)
=SUM(D4:D48)/COUNT(D4:D48)
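To make the logic of these columns easier to follow, here is a small Python sketch of the same idea: take a majority vote among the K nearest classes, fall back on the closest neighbor when there is a tie, then count how many predictions miss. The neighbor lists are made up, and the tie-break rule is an assumption of this sketch; Excel's MODE may resolve ties differently.

```python
from collections import Counter

def predict(neighbor_classes, k):
    """Majority vote over the k nearest classes (list is ordered nearest first)."""
    top_k = neighbor_classes[:k]
    counts = Counter(top_k)
    best = max(counts.values())
    # Assumed tie-break: the first class, in nearest-first order, that reaches the top count
    for cls in top_k:
        if counts[cls] == best:
            return cls

def error_rate(predictions, actuals):
    """Share of predictions that differ from the true class."""
    misses = [0 if p == a else 1 for p, a in zip(predictions, actuals)]  # =IF(B4=C4, 0, 1)
    return sum(misses) / len(misses)                                     # =SUM(...)/COUNT(...)

# Made-up nearest-neighbor classes for three test points, nearest first
neighbors = [
    ["setosa", "setosa", "setosa"],
    ["versicolor", "virginica", "versicolor"],
    ["virginica", "versicolor", "versicolor"],
]
actuals = ["setosa", "versicolor", "virginica"]

predictions = [predict(row, k=3) for row in neighbors]
print(predictions, error_rate(predictions, actuals))
```

With K set to 3, the third point is outvoted by its two versicolor neighbors and is misclassified, so the error rate comes out at one in three. With K set to 1, the same point would be classified correctly.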
Running the Algorithm
We’re now ready to run our algorithm for different K values. Because we’re only testing 6 values, we could do it by hand. But that would be no fun and, more importantly, doesn’t scale. You’ll need to enable the Solver Add-In for Excel.
Now, navigate to the Data ribbon and click the Solver button. The Solver does the trial and error for us automatically according to our instructions. You’ll have a dialogue box of parameters, or instructions, which you’ll want to set up as shown below. We’re setting it up so that it seeks to minimize the error rate while testing values between 1 and 6, and only integer values.
Interpreting The Error Rate and Solver Solution
Many optimization problems have multiple solutions because the data has multiple minima or maxima. This happened in my case. In fact, in my particular case, all integer values 1 through 6 represent minima with an error rate of approximately 2%. So what do we do now?
A few things run through my head. First, this test set isn’t very good. The model didn’t gain any optimization benefits from the test set and as such, I would probably re-do the test set and try again to see if I get different results. I’d also consider using more
At an error rate this low in my test set, I also start to worry about over-fitting. Over-fitting is a problem that occurs in machine learning when a model is too tailored to the nuances of a particular training or test data set. When a model is over-fit, it is not as predictive or effective when encountering new data in the wild. Of course, with an academic data set like this we’d expect our error rate to be fairly low.
The next consideration is which value to choose if I have identified several minima. While the test wasn’t effective in this particular example, generally I would pick the lowest number of neighbors that is at a minimum to conserve computing resources. My model will run faster if it has to consider fewer neighbors. It won’t make a difference with a small data set, but decisions like this conserve substantial resources at scale.
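The rule of thumb above, trying each K and keeping the smallest one that reaches the lowest error rate, can be sketched in a few lines of Python. This is a plain exhaustive loop, not a description of how Solver works internally. The neighbor table is made up, and the names `majority` and `best_k` are hypothetical.

```python
from collections import Counter

def majority(classes):
    """Most common class; ties go to whichever tied class appears first (nearest)."""
    counts = Counter(classes)
    best = max(counts.values())
    return next(c for c in classes if counts[c] == best)

def best_k(neighbor_table, actuals, k_values):
    """Return (smallest K with the lowest error rate, error rate for every K tried)."""
    errors = {}
    for k in k_values:
        predictions = [majority(row[:k]) for row in neighbor_table]
        errors[k] = sum(p != a for p, a in zip(predictions, actuals)) / len(actuals)
    lowest = min(errors.values())
    return min(k for k, e in errors.items() if e == lowest), errors

# Made-up classes of the 6 nearest neighbors for three test points, nearest first
neighbor_table = [
    ["setosa"] * 6,
    ["virginica", "versicolor", "versicolor", "versicolor", "virginica", "virginica"],
    ["virginica", "virginica", "versicolor", "virginica", "versicolor", "versicolor"],
]
actuals = ["setosa", "versicolor", "virginica"]

k, errors = best_k(neighbor_table, actuals, range(1, 7))
print(k, errors)
```

In this toy table K = 3, 4 and 5 all reach zero error, and the loop returns 3, the cheapest of the tied options.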