knnsearch

Find k-nearest neighbors using data

Syntax

IDX = knnsearch(X,Y)
[IDX,D] = knnsearch(X,Y)
[IDX,D] = knnsearch(X,Y,'Name',Value)

Description

IDX = knnsearch(X,Y) finds the nearest neighbor in X for each point in Y. IDX is a column vector with my rows. Each row in IDX contains the index of nearest neighbor in X for the corresponding row in Y.

[IDX,D] = knnsearch(X,Y) returns an my-by-1 vector D containing the distances between each observation in Y and the corresponding closest observation in X. That is, D(i) is the distance between X(IDX(i),:) and Y(i,:).

[IDX,D] = knnsearch(X,Y,'Name',Value) accepts one or more optional comma-separated name-value pair arguments. Specify Name inside single quotes.

knnsearch does not save a search object. To create a search object, use createns.

Input Arguments

X

An mx-by-n numeric matrix. Rows of X correspond to observations and columns correspond to variables.

Y

An my-by-n numeric matrix of query points. Rows of Y correspond to observations and columns correspond to variables.

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

'K'

Positive integer specifying the number of nearest neighbors in X for each point in Y. Default is 1. IDX and D are my-by-K matrices. D sorts the distances in each row in ascending order. Each row in IDX contains the indices of the K closest neighbors in X corresponding to the K smallest distances in D.

'IncludeTies'

A logical value indicating whether knnsearch includes all the neighbors whose distance values are equal to the Kth smallest distance. If IncludeTies is true, knnsearch includes all these neighbors. In this case, IDX and D are my-by-1 cell arrays. Each row in IDX and D contains a vector with at least K numeric numbers. D sorts the distances in each vector in ascending order. Each row in IDX contains the indices of the closest neighbors corresponding to these smallest distances in D.

Default: false

'NSMethod'

Nearest neighbors search method. Value is either:

  • 'kdtree' — Creates and uses a Kd-tree to find nearest neighbors. This is the default value when the number of columns of X is less than 10, X is not sparse, and the distance measure is one of the following measures. 'kdtree' is only valid when the distance measure is one of the following:

    • 'euclidean'

    • 'cityblock'

    • 'minkowski'

    • 'chebychev'

  • 'exhaustive' — Uses the exhaustive search algorithm by computing the distance values from all the points in X to each point in Y to find nearest neighbors.

'Distance'

A string or a function handle specifying the distance metric. The value can be one of the following:

  • 'euclidean' — Euclidean distance (default).

  • 'seuclidean' — Standardized Euclidean distance. Each coordinate difference between rows in X and the query matrix is scaled by dividing by the corresponding element of the standard deviation computed from X, S=nanstd(X). To specify another value for S, use the Scale argument.

  • 'cityblock' — City block distance.

  • 'chebychev' — Chebychev distance (maximum coordinate difference).

  • 'minkowski' — Minkowski distance. The default exponent is 2. To specify a different exponent, use the 'P' argument.

  • 'mahalanobis' — Mahalanobis distance, computed using a positive definite covariance matrix C. The default value of C is nancov(X). To change the value of C, use the Cov parameter.

  • 'cosine' — 1 minus the cosine of the included angle between observations (treated as vectors).

  • 'correlation' — One minus the sample linear correlation between observations (treated as sequences of values).

  • 'spearman' — One minus the sample Spearman's rank correlation between observations (treated as sequences of values).

  • 'hamming' — Hamming distance, which is the percentage of coordinates that differ.

  • 'jaccard' — One minus the Jaccard coefficient, which is the percentage of nonzero coordinates that differ.

  • custom distance function — A distance function specified using @ (for example, @distfun). A custom distance function must

    • Have the form function D2 = distfun(ZI, ZJ)

    • Take as arguments:

      • A 1-by-n vector ZI containing a single row from X or from the query points Y

      • An m2-by-n matrix ZJ containing multiple rows of X or Y

    • Return an m2-by-1 vector of distances D2, whose jth element is the distance between the observations ZI and ZJ(j,:)

For more information on these distance metrics, see Distance Metrics.

'P'

A positive scalar, p, indicating the exponent of the Minkowski distance. This parameter is only valid if the Distance is 'minkowski'. Default is 2.

'Cov'

A positive definite matrix indicating the covariance matrix when computing the Mahalanobis distance. This parameter is only valid when Distance is 'mahalanobis'. Default is nancov(X).

'Scale'

A vector S containing nonnegative values, with length equal to the number of columns in X. Each coordinate of X and each query point is scaled by the corresponding element of S when computing the standardized Euclidean distance. This argument is only valid when Distance is 'seuclidean'. Default is nanstd(X).

'BucketSize'

The maximum number of data points in the leaf node of the kd-tree. This argument is only meaningful when using the kd-tree search method. Default is 50.

Examples

expand all

Classify Using k-Nearest Neighbors

Find the 10 nearest neighbors in x to each point in y using first the 'minkowski' distance metric with a p value of 5, and then using the 'chebychev' distance metric.

Load Fisher's iris data set

load fisheriris
x = meas(:,3:4);
y = [5 1.45;6 2;2.75 .75];

Perform a knnsearch between x and the query points in y, using first Minkowski then Chebychev distance metrics.

[n,d]=knnsearch(x,y,'k',10,'distance','minkowski','p',5);
[ncb,dcb] = knnsearch(x,y,'k',10,...
   'distance','chebychev');

Visualize the results of the two different nearest neighbors searches. Plot the training data. Plot an X for the query points. Use circles to denote the Minkowski nearest neighbors. Use pentagrams to denote the Chebychev nearest neighbors.

gscatter(x(:,1),x(:,2),species)
line(y(:,1),y(:,2),'marker','x','color','k',...
   'markersize',10,'linewidth',2,'linestyle','none')
line(x(n,1),x(n,2),'color',[.5 .5 .5],'marker','o',...
   'linestyle','none','markersize',10)
line(x(ncb,1),x(ncb,2),'color',[.5 .5 .5],'marker','p',...
   'linestyle','none','markersize',10)
legend('setosa','versicolor','virginica','query point',...
'minkowski','chebychev','Location','best')

More About

expand all

Tips

  • For a fixed positive integer K, knnsearch finds the K points in X that are nearest each point in Y. In contrast, for a fixed positive real value r, rangesearch finds all the points in X that are within a distance r of each point in Y.

Algorithms

For information on a specific search algorithm, see Distance Metrics.

References

[1] Friedman, J. H., Bentely, J., and Finkel, R. A. (1977) An Algorithm for Finding Best Matches in Logarithmic Expected Time, ACM Transactions on Mathematical Software 3, 209.

Was this topic helpful?