The remarkable simplicity of very high dimensional data

Application of model-based clustering

Research output: Contribution to journal › Article

24 Citations (Scopus)

Abstract

An ultrametric topology formalizes the notion of hierarchical structure. An ultrametric embedding, referred to here as ultrametricity, is implied by a hierarchical embedding. Such hierarchical structure can be global in the data set, or local. By quantifying extent or degree of ultrametricity in a data set, we show that ultrametricity becomes pervasive as dimensionality and/or spatial sparsity increases. This leads us to assert that very high dimensional data are of simple structure. We exemplify this finding through a range of simulated data cases. We discuss also application to very high frequency time series segmentation and modeling.
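The abstract's central claim, that ultrametricity becomes pervasive as dimensionality grows, can be explored numerically. The following is a minimal sketch, not the coefficient of ultrametricity used in the article: it assumes a simple proxy in which a triangle counts as near-ultrametric when its two longest sides differ by at most a relative tolerance `tol`. The function names, the tolerance value, and the choice of uniformly distributed points are all illustrative assumptions.

```python
import math
import random


def uniform_cloud(n_points, dim, seed=0):
    """Draw n_points uniformly at random from the unit hypercube [0, 1]^dim."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(dim)] for _ in range(n_points)]


def ultrametric_fraction(points, n_triplets=2000, tol=0.1, seed=0):
    """Estimate the share of sampled triangles that are near-ultrametric.

    In an ultrametric space every triangle is isosceles with a small base
    (equilateral triangles are a special case), so the two longest sides
    are equal. Assumed proxy: with sorted sides a <= b <= c, accept the
    triangle when (c - b) / c <= tol.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_triplets):
        x, y, z = rng.sample(points, 3)
        a, b, c = sorted((math.dist(x, y), math.dist(x, z), math.dist(y, z)))
        if c > 0 and (c - b) / c <= tol:
            hits += 1
    return hits / n_triplets


if __name__ == "__main__":
    # Same number of points, increasing dimensionality.
    for dim in (2, 20, 200, 2000):
        pts = uniform_cloud(100, dim, seed=dim)
        print(dim, ultrametric_fraction(pts))
```

Under these assumptions the printed fraction should rise towards 1 as `dim` increases, because pairwise distances between uniformly scattered points concentrate around a common value. The exact figures depend on the tolerance and the random seed.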

Original language: English
Pages (from-to): 249-277
Number of pages: 29
Journal: Journal of Classification
Volume: 26
Issue number: 3
DOI: 10.1007/s00357-009-9037-9
Publication status: Published - 1 Dec 2009
Externally published: Yes

Fingerprint

Model-based Clustering
High-dimensional Data
Cluster Analysis
Simplicity
Hierarchical Structure
Sparsity
Dimensionality
Segmentation
Time series
Topology
Modeling
Range of data
Datasets
Clustering

Cite this

@article{7d9366d3b82b4546a3724630f558a545,
title = "The remarkable simplicity of very high dimensional data: Application of model-based clustering",
abstract = "An ultrametric topology formalizes the notion of hierarchical structure. An ultrametric embedding, referred to here as ultrametricity, is implied by a hierarchical embedding. Such hierarchical structure can be global in the data set, or local. By quantifying extent or degree of ultrametricity in a data set, we show that ultrametricity becomes pervasive as dimensionality and/or spatial sparsity increases. This leads us to assert that very high dimensional data are of simple structure. We exemplify this finding through a range of simulated data cases. We discuss also application to very high frequency time series segmentation and modeling.",
keywords = "Cluster analysis, Dimensionality, Hierarchy, Multivariate data analysis, p-Adic, Ultrametric",
author = "Fionn Murtagh",
year = "2009",
month = "12",
day = "1",
doi = "10.1007/s00357-009-9037-9",
language = "English",
volume = "26",
pages = "249--277",
journal = "Journal of Classification",
issn = "0176-4268",
publisher = "Springer New York",
number = "3",

}

The remarkable simplicity of very high dimensional data: Application of model-based clustering. / Murtagh, Fionn.

In: Journal of Classification, Vol. 26, No. 3, 01.12.2009, p. 249-277.


TY - JOUR

T1 - The remarkable simplicity of very high dimensional data

T2 - Application of model-based clustering

AU - Murtagh, Fionn

PY - 2009/12/1

Y1 - 2009/12/1

AB - An ultrametric topology formalizes the notion of hierarchical structure. An ultrametric embedding, referred to here as ultrametricity, is implied by a hierarchical embedding. Such hierarchical structure can be global in the data set, or local. By quantifying extent or degree of ultrametricity in a data set, we show that ultrametricity becomes pervasive as dimensionality and/or spatial sparsity increases. This leads us to assert that very high dimensional data are of simple structure. We exemplify this finding through a range of simulated data cases. We discuss also application to very high frequency time series segmentation and modeling.

KW - Cluster analysis

KW - Dimensionality

KW - Hierarchy

KW - Multivariate data analysis

KW - p-Adic

KW - Ultrametric

UR - http://www.scopus.com/inward/record.url?scp=75549090468&partnerID=8YFLogxK

U2 - 10.1007/s00357-009-9037-9

DO - 10.1007/s00357-009-9037-9

M3 - Article

VL - 26

SP - 249

EP - 277

JO - Journal of Classification

JF - Journal of Classification

SN - 0176-4268

IS - 3

ER -