The clustering algorithm is controlled using the Cluster Options on the main MPCluster panel. The exact controls listed will vary according to the selected algorithm. Here it is set for the K-Means algorithm:
And here it is set for the Hierarchical algorithm:
The rest of this section describes these settings.
For the K-Means algorithm, use the Number of Clusters value to set the number of clusters to find. This is a maximum value. MPCluster will find this many clusters if it can, but there is no guarantee that it will: the input data and the specified parameters often make the requested number of clusters impossible to find. It is possible to estimate the number of clusters present in the dataset using the current options. To do this, set the required Input Data settings and Cluster Options, and then press Estimate to open the Estimate the Number of Clusters dialog box. Note: This process can take a few minutes, so it may be quicker to try a few cluster counts by hand. The K-Means algorithm can also include pre-defined fixed cluster positions. To enable this, set the Include Fixed Clusters check box, and then press the Set Fixed Clusters button to set the fixed cluster options.
In contrast, the Hierarchical algorithm replaces these controls with the optional Maximum number of clusters setting. Use this to set the maximum number of clusters to find. The setting is optional because the Hierarchical algorithm does not require a maximum limit.
The Hierarchical algorithm also has a Cluster Center setting. Use this to specify how the cluster center should be calculated. This can be Mean or Median. The Mean option places the center at the centroid, i.e. the arithmetic mean of the coordinates of the cluster's data points. The Median option chooses one of the input data points as the center: the point that minimizes the total distance between itself and the other data points in the cluster. Technically this is known as an L1 minimization.
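The difference between the two center options can be illustrated with a minimal sketch. This is not MPCluster's own code: the function names `mean_center` and `median_center` are hypothetical, and the sketch assumes planar coordinates with straight-line (Euclidean) distances rather than geographic distances.

```python
import math

def mean_center(points):
    """Centroid: the arithmetic mean of the x and y coordinates."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def median_center(points):
    """Pick the input point with the smallest total distance to all points."""
    def total_distance(candidate):
        return sum(math.dist(candidate, p) for p in points)
    return min(points, key=total_distance)

# Hypothetical sample data: four nearby points and one outlier.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]

print(mean_center(points))    # pulled toward the outlier, not a data point
print(median_center(points))  # always one of the input points
```

With this sample data the mean center is drawn toward the outlier and does not coincide with any data point, whereas the median center is always an actual input location.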
All of the remaining parameters are switched on with check boxes. With all of the remaining cluster options switched off (as above), the clusters will tend to cover the entire map and incorporate all available data points. This is true for both algorithms. Use the options with numeric parameters to restrict the cluster definitions to useful groups and subsets of data locations.
Constraints on the maximum and minimum number of data points to allocate per cluster can be set with the Maximum number of points per cluster and Minimum number of points per cluster options. Enter the limits in the boxes to the right of these check boxes. Points are allocated to a cluster closest-first. For example, if a cluster would naturally contain 60 data points and you have set a maximum of 50, then the 50 points used are those closest to the cluster's center. Note that MPCluster always applies a minimum cluster size of three data points. This applies even if the Minimum number of points per cluster option has not been set.
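The closest-first rule can be sketched as follows. This is an illustration only, not MPCluster's implementation: `closest_first` is a hypothetical name, distances are assumed to be planar Euclidean, and the center is assumed to stay fixed while the points are ranked.

```python
import math

def closest_first(center, points, max_points):
    """Keep the max_points nearest to the center; return (kept, unused)."""
    ranked = sorted(points, key=lambda p: math.dist(center, p))
    return ranked[:max_points], ranked[max_points:]

# Hypothetical example: a maximum of 2 points for a cluster of 4 candidates.
center = (0.0, 0.0)
candidates = [(5.0, 0.0), (1.0, 0.0), (3.0, 0.0), (2.0, 0.0)]
kept, unused = closest_first(center, candidates, max_points=2)

print(kept)    # the two candidates nearest the center
print(unused)  # points left over once the maximum is reached
```

The leftover points are not discarded from the dataset; in this sketch they are simply returned so that they can be considered for another cluster or remain unallocated.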
The maximum cluster size can also be restricted by geographic extent defined as a linear distance (i.e. radius or diameter). The choice of diameter or radius is set with the Cluster dimensions defined as drop-down box. The maximum geographic extent is set by using the Maximum cluster diameter / Maximum cluster radius setting. The distance units (listed on the panel) are the same as the current Maptitude application setting.
The K-Means algorithm defines the diameter as double the radius. The Hierarchical algorithm defines it as the longest distance between two points in the cluster. The radius is defined the same for both algorithms (maximum distance from the center to a point in the cluster). The Hierarchical algorithm also lets you set a Minimum cluster diameter / Minimum cluster radius.
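The two diameter definitions can give different values for the same cluster. The sketch below illustrates this under stated assumptions: the function names are hypothetical, coordinates are planar, and distances are Euclidean. It is not taken from MPCluster itself.

```python
import math
from itertools import combinations

def radius(center, points):
    """Maximum distance from the center to any point in the cluster."""
    return max(math.dist(center, p) for p in points)

def kmeans_diameter(center, points):
    """K-Means definition: double the radius."""
    return 2 * radius(center, points)

def hierarchical_diameter(points):
    """Hierarchical definition: longest distance between two cluster points."""
    return max((math.dist(a, b) for a, b in combinations(points, 2)), default=0.0)

# Hypothetical cluster: three points forming a 3-4-5 right triangle.
cluster = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
center = (4.0 / 3.0, 1.0)  # centroid of the three points

print(kmeans_diameter(center, cluster))
print(hierarchical_diameter(cluster))
```

For this triangle the longest point-to-point distance is 5, while double the radius is larger, so the same numeric limit constrains the two algorithms differently.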
The Minimum distance between centers setting can be used to set a minimum distance between the centers of neighboring clusters. MPCluster will remove (or re-allocate) the smaller of two clusters that are found to be closer than this distance. Here, the 'smaller' cluster is the one with the fewest constituent data locations.
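One way to picture this rule is a greedy pass over the clusters from largest to smallest. The sketch below is an assumption about how such a filter could work, not a description of MPCluster's internals: `enforce_min_separation` and the dictionary layout are hypothetical, distances are planar Euclidean, and the sketch only removes clusters (it does not re-allocate their points).

```python
import math

def enforce_min_separation(clusters, min_distance):
    """Drop the smaller cluster of any pair whose centers are too close.

    Each cluster is a dict with a 'center' tuple and a 'points' list.
    Clusters are visited largest-first, so a larger cluster always wins.
    """
    by_size = sorted(clusters, key=lambda c: len(c["points"]), reverse=True)
    kept = []
    for cluster in by_size:
        far_enough = all(
            math.dist(cluster["center"], other["center"]) >= min_distance
            for other in kept
        )
        if far_enough:
            kept.append(cluster)
    return kept

# Hypothetical clusters: B is small and sits too close to A.
clusters = [
    {"name": "A", "center": (0.0, 0.0), "points": [0, 1, 2, 3, 4]},
    {"name": "B", "center": (1.0, 0.0), "points": [5, 6]},
    {"name": "C", "center": (10.0, 0.0), "points": [7, 8, 9]},
]
survivors = enforce_min_separation(clusters, min_distance=2.0)
print([c["name"] for c in survivors])
```

In this example cluster B is dropped because its center lies within the minimum distance of the larger cluster A, while C is far enough away to be kept.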
Finally, the K-Means algorithm's Allocate unused points on completion setting is used to allocate unused data points. After processing, there may be "orphan" data points that have not been allocated to a cluster. Set this check box if you wish to allocate these points to the nearest cluster center. This allocation process ignores the earlier constraints, so it can produce clusters that break them (e.g. larger clusters and clusters with more data points), and it can also produce clusters that overlap.
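The orphan allocation step amounts to a nearest-center assignment with no constraint checks. A minimal sketch, assuming planar Euclidean distances and using the hypothetical function name `allocate_orphans`, might look like this:

```python
import math

def allocate_orphans(centers, clusters, orphans):
    """Append each orphan to the cluster whose center is nearest.

    No size or extent limits are checked, so the result may break them.
    Returns new lists and leaves the input clusters unchanged.
    """
    updated = [list(members) for members in clusters]
    for point in orphans:
        nearest = min(range(len(centers)),
                      key=lambda i: math.dist(centers[i], point))
        updated[nearest].append(point)
    return updated

# Hypothetical data: two clusters and two unallocated points.
centers = [(0.0, 0.0), (10.0, 0.0)]
clusters = [[(0.0, 1.0), (1.0, 0.0)], [(10.0, 1.0), (9.0, 0.0)]]
orphans = [(4.0, 0.0), (7.0, 0.0)]

result = allocate_orphans(centers, clusters, orphans)
print(result)
```

Each cluster gains one point here, and the new members lie well outside the original extent of their clusters, which is how a maximum radius or maximum point count can end up being exceeded.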