TY - JOUR

T1 - A study of the neighborhood counting similarity

AU - Wang, Hui

AU - Murtagh, Fionn

PY - 2008/4/1

Y1 - 2008/4/1

N2 - The neighborhood counting measure is the number of all common neighborhoods between a pair of data points. It can be used as a similarity measure for different types of data through the notion of neighborhood: multivariate, sequence, and tree data. It has been shown that this measure is closely related to a secondary probability G, which is defined in terms of a primary probability P of interest to a problem. It has also been shown that the G probability can be estimated by aggregating neighborhood counts. The following questions can be asked: What Is the relationship between this similarity measure and the primary probability P, especially for the task of classification? How does this similarity measure compare with the euclidean distance, since they are directly comparable? How does the G probability estimation compare with the popular kernel density estimation for the task of classification? These questions are answered in this paper, some theoretically and some experimentally. It is shown that G is a linear function of P and, therefore, a G-based Bayes classifier is equivalent to a P-based Bayes classifier. It is also shown that a weighted k-nearest neighbor classifier equipped with the neighborhood counting measure is, in fact, an approximation of the G-based Bayes classifier. It is further shown that the G probability leads to a probability estimator similar in spirit to the kernel density estimator. New experimental results are presented in this paper, which show that this measure compares favorably with the euclidean distance not only on multivariate data but also on time-series data. New experimental results are also presented regarding probability/density estimation. It was found that the G probability estimation can outperform the kernel density estimation in classification tasks.

AB - The neighborhood counting measure is the number of all common neighborhoods between a pair of data points. It can be used as a similarity measure for different types of data through the notion of neighborhood: multivariate, sequence, and tree data. It has been shown that this measure is closely related to a secondary probability G, which is defined in terms of a primary probability P of interest to a problem. It has also been shown that the G probability can be estimated by aggregating neighborhood counts. The following questions can be asked: What Is the relationship between this similarity measure and the primary probability P, especially for the task of classification? How does this similarity measure compare with the euclidean distance, since they are directly comparable? How does the G probability estimation compare with the popular kernel density estimation for the task of classification? These questions are answered in this paper, some theoretically and some experimentally. It is shown that G is a linear function of P and, therefore, a G-based Bayes classifier is equivalent to a P-based Bayes classifier. It is also shown that a weighted k-nearest neighbor classifier equipped with the neighborhood counting measure is, in fact, an approximation of the G-based Bayes classifier. It is further shown that the G probability leads to a probability estimator similar in spirit to the kernel density estimator. New experimental results are presented in this paper, which show that this measure compares favorably with the euclidean distance not only on multivariate data but also on time-series data. New experimental results are also presented regarding probability/density estimation. It was found that the G probability estimation can outperform the kernel density estimation in classification tasks.

KW - Artificial intelligence

KW - Classification

KW - Density estimation

KW - Neighborhood counting similarity

KW - Probability

UR - http://www.scopus.com/inward/record.url?scp=40349093217&partnerID=8YFLogxK

U2 - 10.1109/TKDE.2007.190721

DO - 10.1109/TKDE.2007.190721

M3 - Article

AN - SCOPUS:40349093217

VL - 20

SP - 449

EP - 461

JO - IEEE Transactions on Knowledge and Data Engineering

JF - IEEE Transactions on Knowledge and Data Engineering

SN - 1041-4347

IS - 4

M1 - 4384486

ER -