Hierarchical clustering of massive, high dimensional data sets by exploiting ultrametric embedding

Fionn Murtagh, Geoff Downs, Pedro Contreras

Research output: Contribution to journalArticle

23 Citations (Scopus)

Abstract

Coding of data, usually upstream of data analysis, has crucial implications for the data analysis results. By modifying the data coding-through use of less than full precision in data values-we can aid appreciably the effectiveness and efficiency of the hierarchical clustering. In our first application, this is used to lessen the quantity of data to be hierarchically clustered. The approach is a hybrid one, based on hashing and on the Ward minimum variance agglomerative criterion. In our second application, we derive a hierarchical clustering from relationships between sets of observations, rather than the traditional use of relationships between the observations themselves. This second application uses embedding in a Baire space, or longest common prefix ultrametric space. We compare this second approach, which is of O(n log n) complexity, to k-means.

LanguageEnglish
Pages707-730
Number of pages24
JournalSIAM Journal on Scientific Computing
Volume30
Issue number2
DOIs
Publication statusPublished - 14 Feb 2008
Externally publishedYes

Fingerprint

Hierarchical Clustering
High-dimensional Data
Data analysis
Coding
Ultrametric Space
Baire Space
Minimum Variance
Hashing
K-means
Prefix
Observation
Relationships

Cite this

@article{fbf4697656654bf7b92bb02ebd3f0823,
title = "Hierarchical clustering of massive, high dimensional data sets by exploiting ultrametric embedding",
abstract = "Coding of data, usually upstream of data analysis, has crucial implications for the data analysis results. By modifying the data coding-through use of less than full precision in data values-we can aid appreciably the effectiveness and efficiency of the hierarchical clustering. In our first application, this is used to lessen the quantity of data to be hierarchically clustered. The approach is a hybrid one, based on hashing and on the Ward minimum variance agglomerative criterion. In our second application, we derive a hierarchical clustering from relationships between sets of observations, rather than the traditional use of relationships between the observations themselves. This second application uses embedding in a Baire space, or longest common prefix ultrametric space. We compare this second approach, which is of O(n log n) complexity, to k-means.",
keywords = "Hashing, Hierarchical clustering, Partitioning, Tree distance, Ultrametric",
author = "Fionn Murtagh and Geoff Downs and Pedro Contreras",
year = "2008",
month = "2",
day = "14",
doi = "10.1137/060676532",
language = "English",
volume = "30",
pages = "707--730",
journal = "SIAM Journal of Scientific Computing",
issn = "1064-8275",
publisher = "Society for Industrial and Applied Mathematics Publications",
number = "2",

}

Hierarchical clustering of massive, high dimensional data sets by exploiting ultrametric embedding. / Murtagh, Fionn; Downs, Geoff; Contreras, Pedro.

In: SIAM Journal on Scientific Computing, Vol. 30, No. 2, 14.02.2008, p. 707-730.

Research output: Contribution to journalArticle

TY - JOUR

T1 - Hierarchical clustering of massive, high dimensional data sets by exploiting ultrametric embedding

AU - Murtagh, Fionn

AU - Downs, Geoff

AU - Contreras, Pedro

PY - 2008/2/14

Y1 - 2008/2/14

N2 - Coding of data, usually upstream of data analysis, has crucial implications for the data analysis results. By modifying the data coding-through use of less than full precision in data values-we can aid appreciably the effectiveness and efficiency of the hierarchical clustering. In our first application, this is used to lessen the quantity of data to be hierarchically clustered. The approach is a hybrid one, based on hashing and on the Ward minimum variance agglomerative criterion. In our second application, we derive a hierarchical clustering from relationships between sets of observations, rather than the traditional use of relationships between the observations themselves. This second application uses embedding in a Baire space, or longest common prefix ultrametric space. We compare this second approach, which is of O(n log n) complexity, to k-means.

AB - Coding of data, usually upstream of data analysis, has crucial implications for the data analysis results. By modifying the data coding-through use of less than full precision in data values-we can aid appreciably the effectiveness and efficiency of the hierarchical clustering. In our first application, this is used to lessen the quantity of data to be hierarchically clustered. The approach is a hybrid one, based on hashing and on the Ward minimum variance agglomerative criterion. In our second application, we derive a hierarchical clustering from relationships between sets of observations, rather than the traditional use of relationships between the observations themselves. This second application uses embedding in a Baire space, or longest common prefix ultrametric space. We compare this second approach, which is of O(n log n) complexity, to k-means.

KW - Hashing

KW - Hierarchical clustering

KW - Partitioning

KW - Tree distance

KW - Ultrametric

UR - http://www.scopus.com/inward/record.url?scp=55549116015&partnerID=8YFLogxK

U2 - 10.1137/060676532

DO - 10.1137/060676532

M3 - Article

VL - 30

SP - 707

EP - 730

JO - SIAM Journal of Scientific Computing

T2 - SIAM Journal of Scientific Computing

JF - SIAM Journal of Scientific Computing

SN - 1064-8275

IS - 2

ER -