Although a human has an intuitive sense of what a cluster is, an exact definition is difficult to pin down, and it varies according to the application. Hence many clustering algorithms have been created for different applications. Generally speaking, different algorithms will work better in different scenarios.
MPCluster supports two different clustering algorithms: K-Means and Hierarchical. These are also probably the most common algorithms in general usage. MPCluster's algorithm is set using the Algorithm box in the upper left corner of the main panel.
K-Means is an iterative algorithm that assigns each point to its nearest cluster center, and then recalculates the cluster centers. This process is repeated until the clusters are stable or a timeout occurs. The algorithm is stochastic, i.e. the initial placement of the cluster centers is partially random, so multiple runs can produce different results. MPCluster helps to minimize this effect by running K-Means multiple times and choosing the best result.
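The loop described above can be sketched in Python. This is a minimal illustration with hypothetical function names, not MPCluster's actual implementation; the "best of several runs" step is modeled here by keeping the restart with the lowest total squared distance (inertia).

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def mean(group):
    """Mean (centroid) of a non-empty group of 2-D points."""
    n = len(group)
    return (sum(p[0] for p in group) / n, sum(p[1] for p in group) / n)

def kmeans(points, k, iterations=100, restarts=5, seed=0):
    """Lloyd's K-Means with several random restarts, returning the
    centers from the restart with the lowest inertia."""
    rng = random.Random(seed)
    best, best_inertia = None, float("inf")
    for _ in range(restarts):
        # Stochastic step: pick k distinct points as initial centers.
        centers = rng.sample(points, k)
        for _ in range(iterations):
            # Assign each point to its nearest center.
            groups = [[] for _ in range(k)]
            for p in points:
                i = min(range(k), key=lambda c: dist2(p, centers[c]))
                groups[i].append(p)
            # Recalculate each center as the mean of its group.
            new_centers = [mean(g) if g else centers[i]
                           for i, g in enumerate(groups)]
            if new_centers == centers:
                break  # stable: no center moved this iteration
            centers = new_centers
        # Score this restart by total squared distance to nearest center.
        inertia = sum(min(dist2(p, c) for c in centers) for p in points)
        if inertia < best_inertia:
            best_inertia, best = inertia, centers
    return best
```

Because each restart begins from a different random initialization, individual runs may settle into different local optima; keeping the lowest-inertia run is one common way to reduce that variability.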
K-Means clusters tend to be circular in shape, and are usually convex.
The K-Means algorithm also has the ability to pre-define a number of 'fixed' cluster positions. These positions are always reported as clusters, and a series of constraints can be applied to them. These are similar to the regular cluster constraints except that they are limited to maximums. For example, you can set the maximum radius or diameter of a fixed cluster, but not the minimum size. This is because a fixed cluster is always reported, regardless of whether any data points are nearby; the maximum constraints simply limit the size of the final fixed clusters.
Be careful when applying constraints to the K-Means algorithm: too many constraints can make it difficult or impossible for the algorithm to find a stable solution.
The Hierarchical algorithm is deterministic with no random element. Clusters are created from points which are close together. Additional neighboring points are added until the clusters reach a specified maximum. Because it is a deterministic algorithm, multiple runs with identical data and parameters will produce identical results.
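The process can be sketched as a simple agglomerative loop. This is an illustrative sketch only: it assumes the "specified maximum" is a maximum cluster diameter and uses that as the merge criterion, which is a hypothetical simplification of MPCluster's actual rules. Note there is no random element, so identical input always yields identical output.

```python
def dist(a, b):
    """Euclidean distance between two 2-D points."""
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def diameter(cluster):
    """Largest pairwise distance within a cluster (0 for a single point)."""
    return max(dist(a, b) for a in cluster for b in cluster)

def hierarchical(points, max_diameter):
    """Agglomerative clustering sketch: start with one cluster per point
    and repeatedly merge the pair whose combined cluster is smallest,
    stopping once every possible merge would exceed max_diameter."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        # Find the pair whose merged cluster has the smallest diameter.
        best_pair, best_d = None, float("inf")
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = diameter(clusters[i] + clusters[j])
                if d < best_d:
                    best_pair, best_d = (i, j), d
        if best_d > max_diameter:
            break  # no merge stays within the specified maximum
        i, j = best_pair
        clusters[i] += clusters.pop(j)
    return clusters
```

Since the closest pair is chosen deterministically at every step, re-running the loop on the same data and the same `max_diameter` always reproduces the same clusters.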
Hierarchical clusters are not always as circular as K-Means clusters, and they are less likely to be convex. MPCluster is capable of drawing concave boundaries. There is a cut-off on how concave a boundary can be, in order to avoid spiky, star-like cluster shapes. As a result, there can still appear to be small overlaps with adjacent clusters.
The Hierarchical algorithm can determine a cluster's center using the mean ("centroid") or median. The latter restricts possible cluster centers to the set of input data points, and can be used with an external distance table.
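The distinction might be illustrated as follows. Here `medoid` is a hypothetical name for the median rule: because it only ever compares pairwise distances between input points, it works with a precomputed (external) distance table, whereas the centroid requires actual coordinates and may fall anywhere in the plane.

```python
def centroid(points):
    """Mean center: the average of the coordinates, which need not
    coincide with any input point."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def medoid(points, dist):
    """Median center: the input point minimizing total distance to all
    other points. Only pairwise distances are needed, so `dist` could
    be a lookup into an external distance table."""
    return min(points, key=lambda c: sum(dist(c, p) for p in points))
```

For example, with the points (0,0), (0,2), and (0,100), the centroid is (0,34), a location that is not in the data, while the medoid is the actual input point (0,2).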
When set to use Median centers, the Hierarchical algorithm is currently limited to fewer than 10,000 data points. A warning is also given for datasets with more than 2,500 data points. MPCluster can compute these datasets, but they might take a long time. As a guide, a dataset of 9,000 data points took over 7 hours to compute on a Core i7.
The Maximum number of clusters setting is also optional for the Hierarchical algorithm, whereas it must be specified for K-Means.