Some basic thoughts on the choice of the clustering algorithm

1) So far the CLARANS algorithm (Clustering for Large Applications RANdomised Search) seems to be the most promising because of the implementation that specifically targets geographical applications. Additionally, two variants of this algorithm seems to allow to give more or less prevalence to the semantic component of my specific case:

SD(CLARANS) are the spatial-dominant variant and NSD(CLARANS) is the non-spatial-dominant.

2) While cluster analysis sometimes uses the data matrix, some clustering algorithm uses the similarity matrix, that is often called proximity matrix.

In my case I have two proximity matrices, the first is generated by the two-dimensional geographic distance of the messages. The second is the semantic distance as returned from a comparative method based on Latent Semantic Analysis, for instance.

One of the idea is to use these two matrices alternatively on the algorithm or to combine them before passing the average to the cluster engine.

3)  An interesting technique called Sparsification allows to reduce the connectivity of nodes by breaking some of the links of the connectivity graph.  This is really interesting and can be used to take into account links from one message to the other and an external data enrichment to weight the incidence of specific nodes.

Tags: , , ,

Leave a Reply