## Dec 15: Non-Dyadic Distance Measure

Category: Methodology

Posted by: Aaron

Distance, as it is usually thought of, is between two points (aka dyadic). This can be discreet as in the minimal number of grid spaces an agent needs to traverse to travel from location A to location B (the Chebyshev distance). For standard real-valued spaces it is the length of the shortest line connecting the two points. We can general this into higher dimensions by measuring the distances of manifolds (like strings, curves, solids, etc.), but to be a metric they always have to satisfy the same criteria: 1) positive or 0, 2) symmetric, and 3) the triangle inequality. Now, let's say I want to measure how far apart the elements of a set are, say three points in a 2D plane. Can we come up with a formula for the distance among these three points?

Yes we can! First let me describe my method for calculating the distance among three points. Let A, B, and C be three points in 2D space: (xA,yA), (xB,yB), and (xC,yC). Let M({…}) be the centroid (a multipoint version of the midpoint) of all the points in {…} and d2(.) be the standard dyadic distance function:

d2(P,Q) = sqrt( (xP – xQ)^2 + (yP – yQ)^2 )

My non-dyadic distance function is just the sum of the distances from each point to the centroid. For the three-point example this comes to:

d3(A,B,C) = d2(A,M) + d2(B,M) + d2(C,M)

Naturally we can extend this to any number of points, and in any number of dimensions. This measure is always greater or equal to 0 because it is the sum of three measures that each is greater than or equal to 0, thus satisfying condition (1). The order of the three terms clearly doesn't affect the result, so it is symmetric as well.

The triangle inequality is that the sum of the distances between P and Q through the point R is greater than or equal to the distance directly from P and Q; i.e. the direct path is the shortest path; i.e. there are no shorter paths. This feature is also preserved by my measure because the centroid is identified such that it is indeed the minimum distance point from all the other points. One can work out the proofs for all these features quite simply…that just leads up to the interesting part.

Given those features, this generalized distance function over sets of more than two points is a metric. And if it's a metric then it defines a topological space, and if that's correct, is the topological space different from the one that the standard distance function generates? If so, what do open and closed mean on that space? How can we define Cauchy sequences that converge? There are lots of mathematical things to explore. Probably somebody has done all this already, but I couldn't find it because it's probably called something different.

A distance measure like this can work as a spatial measure of clustering within communities, and even help identify community structure in spatial agent-based models...this is called "k-means clustering". It can also be used to analyze and compare multidimensional data across data sets in a simple nonparametric way. For this statistical application, it's actually similar to variance of least-squared error measures, and is certainly in the class of dispersion measures. For me, the next step would be to prove the mathematical properties...whether it's a metric and, if so, what are the properties of the space it defines? Has anybody seen this done already?

Yes we can! First let me describe my method for calculating the distance among three points. Let A, B, and C be three points in 2D space: (xA,yA), (xB,yB), and (xC,yC). Let M({…}) be the centroid (a multipoint version of the midpoint) of all the points in {…} and d2(.) be the standard dyadic distance function:

My non-dyadic distance function is just the sum of the distances from each point to the centroid. For the three-point example this comes to:

Naturally we can extend this to any number of points, and in any number of dimensions. This measure is always greater or equal to 0 because it is the sum of three measures that each is greater than or equal to 0, thus satisfying condition (1). The order of the three terms clearly doesn't affect the result, so it is symmetric as well.

The triangle inequality is that the sum of the distances between P and Q through the point R is greater than or equal to the distance directly from P and Q; i.e. the direct path is the shortest path; i.e. there are no shorter paths. This feature is also preserved by my measure because the centroid is identified such that it is indeed the minimum distance point from all the other points. One can work out the proofs for all these features quite simply…that just leads up to the interesting part.

Given those features, this generalized distance function over sets of more than two points is a metric. And if it's a metric then it defines a topological space, and if that's correct, is the topological space different from the one that the standard distance function generates? If so, what do open and closed mean on that space? How can we define Cauchy sequences that converge? There are lots of mathematical things to explore. Probably somebody has done all this already, but I couldn't find it because it's probably called something different.

A distance measure like this can work as a spatial measure of clustering within communities, and even help identify community structure in spatial agent-based models...this is called "k-means clustering". It can also be used to analyze and compare multidimensional data across data sets in a simple nonparametric way. For this statistical application, it's actually similar to variance of least-squared error measures, and is certainly in the class of dispersion measures. For me, the next step would be to prove the mathematical properties...whether it's a metric and, if so, what are the properties of the space it defines? Has anybody seen this done already?