Clustering of addresses with Variance controlling K-means

In routing problems, a lot of information depends on the place we are in. For example, the demand in deliveries is much higher in a capital city rather than in the countryside. Data is very sensitive and very dependant on location. For example, if a city has more inhabitants than another one, it will probably have more demands and more customers which is a crucial information to estimate and anticipate the size of one’s fleet.

Instead of using notions such as cities, departments or regions because they are not accurate enough for most applications, we want to create our own way to separate all the territory into small geographical areas.

In this blog post, we will see how to create our own clustering for a given work location.

Why do we want to do this ?

As we said, data is very dependant on the location and one of its explanation is the population density.

Indeed, the majority of the data depends on people’s habits and how they live. For example, route traffic or supplies and demands for services like delivery or shared transport will be directly influenced by the number of people of the location(s).

However, storing all the data for every single location is not realistic, especially if we want to store data related to all pairs of locations (like traffic congestion over time). So we want to aggregate points into a cluster and only store data that summarize the entire group, or a tuple of groups, to make computations easier.

Desirable properties for our groups

Without heading too deep into formalism, there are some properties we would like to obtain in each of our groups, as well as for the grouping as a whole:

  • The description of a “group” should make it very easy to check wether a specific location is in that group or not
  • The description of a “group” should be very simple to understand and to use
  • As groups of location will serve as an approximation of these locations, it is crucial that all groups are representing the same “level” of approximation (which is yet to be defined).

 

For the last property, we will rely on the hypothesis that the only relevant approximation criteria for a given group is the number of addresses in it. It implies that two groups may cover areas with very different sizes, while containing the same amount of addresses. As it would deserve a blog post on its own, we will not discuss that hypothese here so that we can focus on the algorithmic side of things.

Problem and theory

Clustering ? K-Means ?

Clustering is a very popular topic in machine learning or optimization. It’s a type of unsupervised learning method that involves the grouping of data points. Points in the same cluster should be close to each other according to a metric, usually an euclidean distance.

One simple and efficient algorithm to construct good solutions is the K-Means algorithm.

The K-Means algorithm [1] is one of the simplest unsupervised machine learning method for clustering. The main idea of this algorithm is to define k centroids, one for each cluster. Then, until we decide to stop, assign each point to its nearest centroid and compute the new centroid of each cluster. This algorithm works very well, it gives great solutions for the clustering problem of data points and converges very quickly with convergence guarantees.

K-Means :

  • Initialization : choose K random points which are called “centers of the clusters”
  • Repeat until stop conditon :
    1. compute the closest center for every point, they belong to the cluster of their closest center.
    2. compute the new center of each cluster (the mean point of each cluster)
  • At the end : for every point, we compute its closest center and the associated cluster is the final cluster of this point then we can compute the size of each clusters.

Our problem

A very well known property of K-means is that it tends to produce groups with the same density, which is a very good thing in most cases. But when the data points are scattered in regions of variable density, it means that K-means can produce groups of very variable size.

But as we stated in our introduction, our goal is to build K clusters of approximatively same size from N location data points. It implies that if we favor cluster size over cluster density as a common objective for all groups, we will obtain groups with roughly the same number of addresses, but with very variable densities ! It means that K-means as it is do not fully answer our requirements.

But in the spirit of K-means, after creating the clusters, when we compute the closest center to any point, we should find the center corresponding to the associated cluster of the point. The idea is to create a certain number of significant points (centers) which will represent all the other points. We will store only centers’ data and if we need information on a point, we have to compute the closest center from it and use the data associated to the cluster. That answers or first two desirable properties as our groups description will be very scarce and easy to use.

So our first axis of research will be to adapt the existing K-means to make it work on group size instead of group density.

Location data

Our main goal is to make clusters for France and then, the method used has to be compatible with other countries.

For France, we can use IRIS data from Insee which is division of the France territory into clusters of uniform size. Each IRIS has between 1800 and 5000 inhabitants and there are a total of 16 100 IRIS in France. This is almost the clustering we wanted because they are created based on the number of people living in. However the issues from IRIS are that they are specific to France and those clusters aren’t flexible in terms of size and cannot be changed.

Another available database is the OpenAddresses database. The only issue is that it doesn’t contain the number of people that live in each address so we have to consider that each location represents the same amount of people which is obviously false because, for example, in an appartment from a capital city, there can be hundreds of people and in a house from the countryside there can be only a few people. The advantages of this database is that it contains addresses from different countries too and we can choose our own size of the clusters.

(Example of addresses in Paris from OpenAddresses)

Application and results

Our solution

One of the issues of the K-Means algorithm in our problem is that clusters created don’t have necessarily the same size and usually the statistical dispersion of the clusters’ size can be very important.

So we modified the K-means algorithm in order to have clusters with closer sizes.

Here’s the proposed algorithm :

modified K-Means (the expected size of each cluster should be s = N/K where N is the total number of points):

  • Initialization : choose K random points which are called “centers of the clusters”
  • Repeat until stop conditon:
    1. compute all the distances between centers and points
    2. for each pair (point, center) from the smallest distance to the biggest one, if the point has not already been assigned and if the closest cluster’s size is less than s, the point belong to this cluster.
    3. compute the new center of each cluster (the mean point of each cluster)
  • At the end : for every point, we compute its closest center and the associated cluster is the final cluster of this point then we can compute the size of each clusters.

 

The K-Means algorithm uses originally the Euclidian distance but here we have to use a distance between two geolocations. Computing the distances between addresses can be done using the haversine formula [2] or the approximation of spherical earth projected to a plane for example. The first one gives more accuracy but its computation is slower than the second one.

Note that this algorithm does not have the two following properties of K-means :

  • Asymptotic convergence
  • Convex clusters are produced at each step of the algorithm

 

However, we argue that convexity is guaranteed at the end of the algorithm, at the cost of not having groups of equal sizes (which is acceptable in practice). And for convergence, we observed that in practice it is slower than with K-means and most importantly, it does not happen all the time as the algorithm ends up having centroids oscillating between positions. However, for practical use, it is more than fine to use it, especially because what we need in the end is a set of centroids for the final allocation.

Results

We’re testing the classical K-Means algorithm and the modified one on the data from the departments of France.

In the graph below, we tested both algorithms on the data of Paris. It contains 151431 addresses and we made 101 clusters, expecting to have clusters of approximatively 1500 points .

The modified algorithm gives clusters that have their sizes more centered around 1500. In order to compare the two algorithms, we will use the standard deviation which is a classical metric to mesure the statistical dispersion. The clustering with less dispersion is better, because it means that the clusters are more centered around the expected size.

We ran the two algorithms on each department of France to compare properly the standard deviation of their clusters’ size. For each department, we computed and compared their results.

For example, for Paris, we found :

which means a standard deviation decrease of 46.1%.

The proposed algorithm gives smaller standard deviation with a decrease of 55% in median. As we hoped, our algorithm creates clusters which sizes are closer to each other than the classical K-Means.

Finally, here’s what a clustering of Paris looks like. Each point is a center of a cluster and the color of the point represents the number of locations that are associated to the cluster.

Conclusion and extension

Our proposed method gives clusters that have closer sizes than the classical one.

Instead of using the K-Means algorithm we can try other methods of clustering that can handle arbitrary distances like Hierarchical clustering, PAM, HDBSCAN …

References

  • [1] MacQueen, J. B. (1967). Some Methods for classification and Analysis of Multivariate Observations. Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability. 1. University of California Press. pp. 281–297. MR 0214227. Zbl 0214.46201. Retrieved 2009-04-07.
  • [2] R.W. Sinnott, “Virtues of the Haversine”, Sky and Telescope, vol. 68, no. 2, 1984, p. 159