Clustering data is the practice of grouping or partitioning a set of points based on how similar they are to one another. One of the most basic forms of this would be by eye when looking at a plot of data. For example, if we see two clumps on the *x* and *y* axes, then we say these data are grouped into two clusters.

Here, I artificially created two clusters in the R project tool grouped around (1,1), and (2,2). The job of clustering tools is to figure out that these are indeed clusters. This becomes harder when dealing with more than two dimensions, different types of data, and the question of what a cluster actually is.

Clustering tools are often used as a form of exploratory data analysis and fit within the class of unsupervised learning techniques. It’s not hard to see their practical relevance when they can help form groups from data elements having many different properties.

For some time now, I’ve heard the phrase “clustering is an art not a science.” This becomes even more evident when faced with the frustrating quantity of clustering algorithms for specialized tasks and intimate interaction with the data to realize what’s actually happening. Moreover, I can’t help sneaking suspicion on how certain techniques seem to overlap with one another. Now I’m not trying to knock specialized techniques. In fact, quite the opposite. For the thoughtful ones that seem to tackle a particular class of data, *at least they do that and can be used*. That being said, without being able to formally lay out what’s actually happening, it’s hard to have a sound basis on which to compare methods.

What I’d like to (quickly) highlight in this post is a couple of interesting notions that tickle the possibility of a cleaner way to think about clustering. I’m not going to focus on the usual classifications of algorithmic approaches (e.g. hierarchical vs. partitioning). While those distinctions are useful in choosing techniques and tools to cluster with, I think it might be valuable to look at the problem without concerning ourselves too much with traditional algorithms.

**A data point can be quite flexible.** It’ s just a tuple. It can even be of mixed types (e.g. the first dimension can be some point in the real numbers, the second can be one of *n* nominal elements of a set, …).

A**ll that matters is that we can define the distance between any two tuples.** Even with mixed types, there are reasonable ways to approach this [1]. With this, we have a metric space.

**We can stick this metric space in a Euclidean space. **In the mid-80s, it was shown that any metric space with *n* elements (in our case tuples) can be embedded in a O(log* n*)-dimensional Euclidean space with some reasonable constraints on *distortion *[2]*.* Distortion is essentially the discrepancy between the true distances in the *n *element metric space and the distances in the Euclidean space. A decade later, randomized algorithms were developed to actually do this (e.g [3]).

The curse of dimensionality is a problem where the higher we go in dimension, the more likely it is for points to hover around the edges of some abstract hypersphere of our cluster. This is because as the dimension increases, most of the volume of the hypersphere is contained in some small distance from the edge of the sphere. Say our tuple was just a vector in the real number space, then it may be possible to reduce the dimensionality considerably and not bias our cluster shapes too much due to high dimension.

These facts (the first reference being quite practical and the last two more foundational) are quite relevant to clustering and I’d argue make the whole thing a little more of “a science” (eh hem, “a math”).

There are, I feel, two big issues I’ve ignored ’till yet, and these seem to really make up the “art” part.

1. We still have to take these points in some lower-dimensional Euclidean space and partition them into groups. This requires some assumption of what a cluster actually is. We can assume they’re spherical, but that’s likely not always the case. What about different types of shapes? While clustering shape sounds minor and in some cases may not be practically that relevant, to my knowledge, most ways of defining distances between clusters tend to restrict all the clusters to relatively homogeneous shapes. This seems like a problem. Maybe it can be dealt with mathematically in some way?

2. Let’s go back to the start where I talk about defining distances between points. This can be a potentially arbitrary process depending on the methods by which the data was gotten in the first place. But even aside from that, the structure one is clustering on may suggest a variety of alternate ways to define distance between two points. For example, we can define the distance between two nodes on a graph in many ways (not just the shortest path).

While for the moment I’m certainly more at home as a “user/developer” of tools in this regard, here’s to theory.

——

[1] Kaufman L and Rousseeuw PJ. *Finding Groups in Data: An Introduction to Cluster Analysis*. John Wiley & Sons (1990).

[2] Bourgain J. “On Lipschitz embedding of finite metric spaces in Hilbert space”. *Israel Journal of Math* 52, 46-52 (1985).

[3] Linial N, London E, and Rabinovich U. “The geometry of graphs and some of its algorithmic applications”. *Combinatorica *15, 215-245 (1995).

To be a little more precise with what I was saying above, a metric space must obey the triangle inequality: . Thus we can’t choose any arbitrary number for distances.

The example at the end of the post where the distance between two nodes on a graph is the shortest path (often called the “shortest path metric”) requires that the graph be connected. It should be straightforward to see how triangle inequality is satisfied in this case.

Interestingly, for graph clustering problems, Andreas Noack has research on some equivalences between graph layout and graph clustering. In what I feel is a well-written recent paper, he shows how force directed methods subsume modularity [1].

A couple of things interest me for further exploration:

(1) Compare and contrast shortest-path spring models using multidimensional scaling vs. force-directed layouts. Seems like his papers might have some good lit pointers for that.

(2) The determination of what a cluster is and the implications on the resulting shape as mentioned originally still interests me.

—–

[1] Noak A. “Modularity clustering is force-directed layout“. Phys. Rev. E 79, 026102 (2009). The purposeful avoidance of shortest-path metric seemed necessary for this size of paper, but I feel there were places where comparisons needed to be addressed. Related to the original post, he did invoke a Johnson–Lindenstrauss lemma justification for dimensionality reduction later in the paper: “the dimensionality of layouts can be somewhat reduced without large changes of the pairwise vertex distances”.