k-NN

The k-NN algorithm is among the simplest of all machine learning algorithms. In pattern recognition, the k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression.

  • In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
  • In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.
[https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm]
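
Both output types can be sketched in a few lines of Python. The following is a minimal illustration of the classification and regression modes described above, not a reference implementation; the function names and the toy data are invented for the example.

# Minimal k-NN sketch covering both output types described above.
# The helper names and toy data are illustrative, not from the cited article.
import math
from collections import Counter

def k_nearest(query, samples, k):
    """Return the k (point, target) pairs closest to the query point (Euclidean)."""
    return sorted(samples, key=lambda s: math.dist(query, s[0]))[:k]

def knn_classify(query, labeled_points, k):
    """Classification: majority vote among the k nearest neighbors."""
    votes = Counter(label for _, label in k_nearest(query, labeled_points, k))
    return votes.most_common(1)[0][0]

def knn_regress(query, valued_points, k):
    """Regression: average of the target values of the k nearest neighbors."""
    neighbors = k_nearest(query, valued_points, k)
    return sum(value for _, value in neighbors) / k

train_cls = [((1.0, 1.0), '+'), ((1.2, 0.8), '+'), ((3.0, 3.1), '-'), ((2.9, 3.3), '-')]
print(knn_classify((1.1, 1.0), train_cls, k=3))   # '+'

train_reg = [((1.0, 1.0), 10.0), ((1.2, 0.8), 12.0), ((3.0, 3.1), 30.0)]
print(knn_regress((1.1, 1.0), train_reg, k=2))    # 11.0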

Classification

K-nearest-neighbor (kNN) classification is one of the most fundamental and simple classification methods and should be one of the first choices for a classification study when there is little or no prior knowledge about the distribution of the data. K-nearest-neighbor classification was developed from the need to perform discriminant analysis when reliable parametric estimates of probability densities are unknown or difficult to determine. To demonstrate a k-nearest neighbor analysis, let’s consider the task of classifying a new object (query point) among a number of known examples. This is shown in the figure below, which depicts the examples (instances) with the plus and minus signs and the query point with a red circle. Our task is to estimate (classify) the outcome of the query point based on a selected number of its nearest neighbors. In other words, we want to know whether the query point can be classified as a plus or a minus sign.

To proceed, let’s consider the outcome of KNN based on the single nearest neighbor (1-nearest neighbor). In this case KNN will predict the outcome of the query point to be a plus, since the closest point carries a plus sign. Now let’s increase the number of nearest neighbors to 2 (2-nearest neighbors). This time KNN will not be able to classify the query point, since the second closest point is a minus, so the plus and minus signs win the same number of votes. For the next step, let’s increase the number of nearest neighbors to 5 (5-nearest neighbors). This defines a nearest-neighbor region, indicated by the circle shown in the figure above. Since this circle contains 2 plus signs and 3 minus signs, KNN will assign a minus sign to the query point.

[http://www.statsoft.com/textbook/k-nearest-neighbors]
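
The vote counts in this walkthrough can be reproduced with a few lines of Python. The labels below are the five nearest neighbors in order of distance; the first two follow the description above, while the exact ordering of the remaining three is an assumption, since the figure itself is not reproduced here.

# Majority vote for k = 1, 2, 5, mirroring the walkthrough above.
from collections import Counter

neighbors_by_distance = ['+', '-', '-', '+', '-']   # nearest first (assumed ordering)

for k in (1, 2, 5):
    votes = Counter(neighbors_by_distance[:k])
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        print(f"k={k}: tie {dict(votes)}, no decision")
    else:
        print(f"k={k}: predict '{ranked[0][0]}' with votes {dict(votes)}")
# k=1: predict '+'
# k=2: tie, no decision
# k=5: predict '-'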

Between-sample geometric distance

The k-nearest-neighbor classifier is commonly based on the Euclidean distance between a test sample and the specified training samples. Let x_i be an input sample with p features (x_{i1}, x_{i2}, …, x_{ip}), n be the total number of input samples (i = 1, 2, …, n), and p the total number of features (j = 1, 2, …, p). The Euclidean distance between sample x_i and sample x_l (l = 1, 2, …, n) is defined as

d(x_i, x_l) = \sqrt{(x_{i1} - x_{l1})^2 + (x_{i2} - x_{l2})^2 + \cdots + (x_{ip} - x_{lp})^2}
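
In code, the formula is a one-liner; a sketch, assuming the two samples are equal-length sequences of numeric features:

# Euclidean distance between two samples with p features each.
import math

def euclidean_distance(x_i, x_l):
    # d(x_i, x_l) = sqrt(sum over j of (x_ij - x_lj)^2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_l)))

print(euclidean_distance((1.0, 2.0, 3.0), (4.0, 6.0, 3.0)))   # 5.0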

The tessellation shows 19 samples marked with a “+”, and the Voronoi cell, R, surrounding each sample. A Voronoi cell encapsulates all neighboring points that are nearest to each sample and is defined as

R_i = \{ x \in \mathbb{R}^p : d(x, x_i) \le d(x, x_m), \; \forall\, i \ne m \},

where R_i is the Voronoi cell for sample x_i, and x represents all possible points within Voronoi cell R_i. Voronoi tessellations primarily reflect two characteristics of a coordinate system: i) all possible points within a sample’s Voronoi cell are the nearest neighboring points for that sample, and ii) for any sample, the nearest sample is determined by the closest Voronoi cell edge. Using the latter characteristic, the k-nearest-neighbor classification rule is to assign to a test sample the majority category label of its k nearest training samples. In practice, k is usually chosen to be odd, so as to avoid ties. The k = 1 rule is generally called the nearest-neighbor classification rule.
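
One way to see the link between Voronoi cells and the nearest-neighbor rule: assigning a query point to the training sample whose Voronoi cell contains it is the same as assigning it to the training sample at minimum distance. A small sketch with invented training data:

# 1-NN rule: the sample whose Voronoi cell contains the query point is simply
# the sample at minimum Euclidean distance. Training data here is illustrative.
import math

training = {
    "x1": ((1.0, 1.0), "+"),
    "x2": ((2.0, 1.5), "+"),
    "x3": ((5.0, 4.0), "-"),
}

def nearest_neighbor(query):
    name, (point, label) = min(training.items(),
                               key=lambda kv: math.dist(query, kv[1][0]))
    return name, label

print(nearest_neighbor((1.2, 1.1)))   # ('x1', '+')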

It warrants noting that kNN is a “supervised” classification method in that it uses the class labels of the training data. Unsupervised classification methods, or “clustering” methods, on the other hand, do not employ the class labels of the training data.


Example

Consider a machine learning study to classify 19 samples with 2 features, X and Y. Table 1 lists the pairwise Euclidean distances between the 19 samples. Assume that sample x11 is being used as a test sample while all remaining samples are used for training. For k = 4, the four samples closest to sample x11 are sample x10 (blue class label), sample x12 (red class label), sample x13 (red class label), and sample x14 (red class label).

x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18
x2 1.5
x3 1.4 1.6
x4 1.6 1.4 1.3
x5 1.7 1.4 1.5 1.5
x6 1.3 1.4 1.4 1.5 1.4
x7 1.6 1.3 1.4 1.4 1.5 1.8
x8 1.5 1.4 1.6 1.3 1.7 1.6 1.4
x9 1.4 1.3 1.4 1.5 1.2 1.4 1.3 1.5
x10 2.3 2.4 2.5 2.3 2.6 2.7 2.8 2.7 3.1
x11 2.9 2.8 2.9 3.0 2.9 3.1 2.9 3.1 3.0 1.5
x12 3.2 3.3 3.2 3.1 3.3 3.4 3.3 3.4 3.5 3.3 1.6
x13 3.3 3.4 3.2 3.2 3.3 3.4 3.2 3.3 3.5 3.6 1.4 1.7
x14 3.4 3.2 3.5 3.4 3.7 3.5 3.6 3.3 3.5 3.6 1.5 1.8 0.5
x15 4.2 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 1.7 1.6 0.3 0.5
x16 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 4.1 1.6 1.5 0.4 0.5 0.4
x17 5.9 6.2 6.2 5.8 6.1 6.0 6.1 5.9 5.8 6.0 2.3 2.3 2.5 2.3 2.4 2.5
x18 6.1 6.3 6.2 5.8 6.1 6.0 6.1 5.9 5.8 6.0 3.1 2.7 2.6 2.3 2.5 2.6 3.0
x19 6.0 6.1 6.2 5.8 6.1 6.0 6.1 5.9 5.8 6.0 3.0 2.9 2.7 2.4 2.5 2.8 3.1 0.4
Table 1. Euclidean distance matrix D listing all pairwise Euclidean distances between the 19 samples.

[http://scholarpedia.org/article/K-nearest_neighbor]
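
The selection of the four nearest samples can be reproduced directly from Table 1. The distances below are the x11 row and column of the table; the class labels of the four winners are taken from the text above, and the other samples’ labels are not listed in the source.

# Distances d(x11, x) read from Table 1.
d_from_x11 = {
    "x1": 2.9, "x2": 2.8, "x3": 2.9, "x4": 3.0, "x5": 2.9, "x6": 3.1,
    "x7": 2.9, "x8": 3.1, "x9": 3.0, "x10": 1.5, "x12": 1.6, "x13": 1.4,
    "x14": 1.5, "x15": 1.7, "x16": 1.6, "x17": 2.3, "x18": 3.1, "x19": 3.0,
}

k = 4
nearest = sorted(d_from_x11, key=d_from_x11.get)[:k]
print(nearest)   # ['x13', 'x10', 'x14', 'x12']
# Note: x16 also sits at distance 1.6; the text keeps x12, which is what the
# index-order tie-break above produces. Per the text, x10 is blue and
# x12, x13, x14 are red, so the majority label among the 4 neighbors is red.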


Problems and Solutions

Problem 1. Given a point P and N other points in two-dimensional space, find the K points out of the N that are nearest to P.

Solution 1: Build a heap of size K and collect the K points with the minimal distances; O(N log K) complexity.
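
A sketch of Solution 1 in Python. heapq is a min-heap, so negated squared distances make it behave as a bounded max-heap of the K closest points seen so far (squared distances avoid the sqrt, as the note further below suggests).

# Solution 1 sketch: bounded max-heap of size K, O(N log K).
import heapq

def k_nearest_heap(p, points, k):
    px, py = p
    heap = []                                   # entries: (-squared_distance, point)
    for q in points:
        d2 = (q[0] - px) ** 2 + (q[1] - py) ** 2
        if len(heap) < k:
            heapq.heappush(heap, (-d2, q))
        elif -heap[0][0] > d2:                  # current farthest of the K is worse
            heapq.heapreplace(heap, (-d2, q))
    return [q for _, q in heap]

print(k_nearest_heap((0, 0), [(1, 1), (5, 5), (2, 0), (0, 3), (1, 0)], k=3))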

Solution 2: Put the N points into an array and sort it by distance, using quicksort (Hoare partitioning), then take the first K points as the answer. This is O(N log N), but it can be optimized to roughly O(N) by skipping the sorting of unnecessary subarrays: when a partition splits the array into two subarrays, recurse only into the one that contains the K-th index. The work is then N + N/2 + N/4 + … = O(N).

Solution 3: Find the K-th smallest element by distance and take all points whose distance is less than or equal to it. An O(N) selection algorithm exists, similar to median finding.

Note: it is better to compare squared distances to avoid sqrt operations; this is considerably faster, especially if the points have integer coordinates.

As an interview answer, Solution 2 or 3 is more appropriate.
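
A sketch of the selection idea behind Solutions 2 and 3, again comparing squared distances. Partitioning (quickselect) around the K-th smallest distance avoids fully sorting the array and gives expected O(N).

# Solutions 2/3 sketch: quickselect around the K-th smallest squared distance.
import random

def k_nearest_select(p, points, k):
    px, py = p
    items = [((q[0] - px) ** 2 + (q[1] - py) ** 2, q) for q in points]

    def select(lo, hi, k_idx):                  # move the k_idx-th smallest into place
        while lo < hi:
            pivot = items[random.randint(lo, hi)][0]
            i, j = lo, hi
            while i <= j:                       # Hoare-style partition
                while items[i][0] < pivot: i += 1
                while items[j][0] > pivot: j -= 1
                if i <= j:
                    items[i], items[j] = items[j], items[i]
                    i += 1; j -= 1
            if k_idx <= j: hi = j               # recurse only into the half holding k_idx
            elif k_idx >= i: lo = i
            else: return

    select(0, len(items) - 1, k - 1)
    return [q for _, q in items[:k]]            # first K entries are the K nearest

print(k_nearest_select((0, 0), [(1, 1), (5, 5), (2, 0), (0, 3), (1, 0)], k=3))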

[http://stackoverflow.com/questions/20398799/find-k-nearest-points-to-point-p-in-2-dimensional-plane]

                                                   ————————– X ————————-

Problem 2. Given m stations and n houses, output the nearest k stations for each house.

Solution 1. There are m stations and n houses; the (x, y) coordinates of each station and house are given; output the nearest station for each house.

Later, the question was generalised to finding the k nearest stations for each house.

For every house, build a heap of the distances to the stations (bottom-up) and then pop k elements. Doing this for all houses gives O(n*(m + k log m)).

Alternatively, for every house, build an order-statistic tree over the distances to the stations, find the node with rank k, and traverse the tree below that node. Doing this for all houses gives O(n*(m log m + log m + k)).
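
A sketch of the heap-based variant of Solution 1, assuming stations and houses are plain (x, y) tuples. heapq.heapify performs the bottom-up build in O(m), and popping k elements costs O(k log m), so the whole loop is O(n*(m + k log m)).

# Solution 1 sketch: per house, heapify the m squared distances and pop k.
import heapq

def nearest_k_stations(houses, stations, k):
    result = {}
    for house in houses:
        hx, hy = house
        heap = [((sx - hx) ** 2 + (sy - hy) ** 2, (sx, sy)) for sx, sy in stations]
        heapq.heapify(heap)                      # bottom-up build, O(m)
        result[house] = [heapq.heappop(heap)[1] for _ in range(min(k, len(heap)))]
    return result

stations = [(0, 0), (5, 5), (2, 1), (9, 9)]
houses = [(1, 1), (8, 8)]
print(nearest_k_stations(houses, stations, k=2))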

Solution 2: This sounds like an excellent spot to use a k-d tree, quadtree, or other space-partitioning tree. The problem of “find the k objects nearest some test point” is called the k-nearest-neighbors problem, and these two data structures solve it remarkably efficiently. They’re also reasonably simple to implement.

Specifically: build a k-d tree or quadtree out of the stations. Then, for each house, do a k-nearest-neighbors query on that house in the data structure to find the nearest stations.

[http://stackoverflow.com/questions/23190421/interview-q-given-m-station-and-n-house-output-nearest-k-stations-for-each-hou]
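
A sketch of the k-d tree approach from Solution 2, using SciPy’s cKDTree (assuming SciPy and NumPy are available; a hand-written k-d tree or quadtree would serve the same purpose):

# Solution 2 sketch: build a k-d tree over the stations once, then run one
# k-nearest-neighbors query per house (here as a single batched query).
import numpy as np
from scipy.spatial import cKDTree

stations = np.array([(0, 0), (5, 5), (2, 1), (9, 9)], dtype=float)
houses = np.array([(1, 1), (8, 8)], dtype=float)

tree = cKDTree(stations)                 # build once
dists, idx = tree.query(houses, k=2)     # k nearest stations for every house

for house, neighbors in zip(houses, idx):
    print(house, "->", stations[neighbors])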

                                                 —————————— X ——————————–

Problem 3: Print all nodes at distance K from a given node

Solution 1: Given a binary tree, a target node in the binary tree, and an integer value k, print all the nodes that are at distance k from the given target node. No parent pointers are available.

Consider the tree shown in the diagram.
Input: target = pointer to the node with data 8, root = pointer to the node with data 20, k = 2.
Output: 10 14 22
If the target is 14 and k is 3, then the output should be “4 20”.

There are two types of nodes to consider: 1) nodes in the subtree rooted at the target node; for example, if the target node is 8 and k is 2, such nodes are 10 and 14. 2) Other nodes, which may be ancestors of the target or nodes in some other subtree; for target node 8 and k = 2, node 22 falls into this category.

Finding the first type of nodes is easy to implement: just traverse the subtree rooted at the target node and decrement k in each recursive call. When k becomes 0, print the node currently being visited. Here we call this function printkdistanceNodeDown().

How do we find nodes of the second type? For output nodes not lying in the subtree rooted at the target node, we must go through all ancestors. For every ancestor, we find its distance from the target node; let this distance be d. We then go to the ancestor's other subtree (if the target was found in the left subtree, we go to the right subtree, and vice versa) and find all nodes at distance k - d from the ancestor.

[http://www.geeksforgeeks.org/print-nodes-distance-k-given-node-binary-tree/]
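
A Python sketch of the two-part approach described above. The cited article gives a C/C++ version; this rendering keeps the same structure, with printkdistanceNodeDown() written here as print_k_distance_down(). The example tree is reconstructed from the input/output pairs quoted earlier (root 20 with children 8 and 22, node 8 with children 4 and 12, node 12 with children 10 and 14); the value 12 of the intermediate node is taken from the cited diagram.

# Nodes at distance k from a target node in a binary tree (no parent pointers).
class Node:
    def __init__(self, data, left=None, right=None):
        self.data, self.left, self.right = data, left, right

def print_k_distance_down(node, k):
    # First type of nodes: those k levels below `node`.
    if node is None or k < 0:
        return
    if k == 0:
        print(node.data, end=" ")
        return
    print_k_distance_down(node.left, k - 1)
    print_k_distance_down(node.right, k - 1)

def print_k_distance(node, target, k):
    # Returns the distance of `node` from the target, or -1 if the target is
    # not in this subtree, printing every node at distance k along the way.
    if node is None:
        return -1
    if node is target:
        print_k_distance_down(node, k)
        return 0
    dl = print_k_distance(node.left, target, k)
    if dl != -1:                                # target found in the left subtree
        if dl + 1 == k:
            print(node.data, end=" ")           # this ancestor is at distance k
        else:
            print_k_distance_down(node.right, k - dl - 2)
        return dl + 1
    dr = print_k_distance(node.right, target, k)
    if dr != -1:                                # target found in the right subtree
        if dr + 1 == k:
            print(node.data, end=" ")
        else:
            print_k_distance_down(node.left, k - dr - 2)
        return dr + 1
    return -1

# Example from the text: target = node 8, root = node 20, k = 2 -> 10 14 22
n12 = Node(12, Node(10), Node(14))
n8 = Node(8, Node(4), n12)
root = Node(20, n8, Node(22))
print_k_distance(root, n8, 2)                   # prints: 10 14 22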
