This example shows you how to use MPCluster to find clusters in a simple point dataset. This sample data consists of centroid points for zipcodes in an area that includes most of Texas and Louisiana but also extends into Oklahoma, Arkansas, and parts of Mississippi and Tennessee. This is a good demonstration dataset because zipcode centers (centroids) tend to form clusters in urban and metropolitan areas. Note, however, that this is an artificial example. A more practical example would find clusters in customer data, with a data point representing each customer location.
The source data file is called tx_zipcodes.xlsx and contains one worksheet called zip_pushpins. It can be found in the MPCluster examples file:
It is assumed that you are familiar with basic Maptitude operations. Import this worksheet into Maptitude as a data view. This can be performed using the Create-a-Map Wizard or File->Add on the main menu. The resulting map will look something like this:
Some large-scale clustering is apparent in this zoomed-out image. This is clearest in South Texas, with large clusters around Austin, San Antonio, and Houston.
Start MPCluster by selecting MPCluster on Maptitude's Tools->Add-ins menu. This will display MPCluster's main panel. Set the parameters as follows:
This tells MPCluster to search for up to 20 clusters that fit the data points in the zip_pushpin dataset. Cluster sizes are also constrained: each cluster must contain at least 55 data points, and be no more than 30 miles in radius. That is, a cluster can only include data points within 30 miles of its center, and it must have at least 55 data points.
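The meaning of these two constraints can be sketched in a few lines of Python. This is only an illustration of the rules, not MPCluster's actual implementation; the function names are invented for this sketch:

```python
from math import radians, sin, cos, asin, sqrt

def miles_between(p, q):
    """Great-circle (haversine) distance in miles between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 3958.8 * 2 * asin(sqrt(a))  # 3958.8 = mean Earth radius in miles

def cluster_is_valid(center, members, max_radius=30.0, min_points=55):
    """Both constraints from this example: at least min_points members,
    and every member within max_radius miles of the cluster center."""
    return (len(members) >= min_points and
            all(miles_between(center, m) <= max_radius for m in members))
```

A candidate cluster failing either test (too few members, or any member more than 30 miles from the center) would be rejected under these settings.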
The K-Means algorithm has been selected. This is a stochastic algorithm, i.e. it contains a random element. Repeated runs with the same data and settings will usually produce slightly different results, so repeated runs may be required to get the best results. The Hierarchical algorithm, demonstrated at the end of this example, will always produce the same result for the same data with the same settings.
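Where the random element comes from can be seen in a minimal K-Means sketch (an illustration only, not MPCluster's implementation): the initial centers are drawn at random, so different starting draws can converge to different clusterings, while the same draw always reproduces the same result.

```python
import random

def kmeans(points, k, seed, iters=20):
    """Plain 2-D k-means; the seed controls only the random choice of
    initial centers, which is the algorithm's stochastic element."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest current center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2)
            groups[nearest].append(p)
        # Move each center to the mean of its assigned points.
        centers = [(sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
                   if g else centers[i] for i, g in enumerate(groups)]
    return sorted(centers)
```

Running this twice with the same seed gives identical centers; changing the seed changes the starting centers and can therefore change the final clustering, which is why repeated runs of a stochastic method are worth trying.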
The display options are set so that all clusters are drawn with boundary outlines and central coordinates. This is set using the Write clusters to layers check box, which creates two layers called cluster_CTR and cluster_BDR (the 'cluster' prefix is set in the options). These mark the cluster centers with solid triangles and the cluster boundaries with solid lines, both in the same color. These layers are overwritten if they already exist, so you must make sure they are writable. You can add labels to the boundary layer (cluster_BDR) by setting the Draw Labels check box. Make sure the Draw Cluster Circles check box is clear; if set, this option draws a circle around each cluster at the requested maximum radius (30 miles in this example).
The Write allocations to a data view check box creates a data view that lists all of the data points and their cluster allocations as a number. Typically this view is then joined to your input data, and a theme is applied to the joined data. The theme colors each data point according to its cluster allocation. Set the Apply a Join to the input data points and Apply a Theme check boxes to do this automatically.
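Conceptually, the join matches each allocation record back to its input row on a shared key. A minimal sketch of that idea, using hypothetical field names (ZIP, CLUSTER_ID) and made-up cluster numbers; the real allocation view uses whatever fields your input data provides:

```python
# Two example input rows and their (hypothetical) cluster allocations.
zip_points = [
    {"ZIP": "78701", "CITY": "Austin"},
    {"ZIP": "78205", "CITY": "San Antonio"},
]
allocations = [
    {"ZIP": "78701", "CLUSTER_ID": 3},
    {"ZIP": "78205", "CLUSTER_ID": 7},
]

# The join: look up each input row's cluster number on the shared key,
# mirroring what the automatic join option does for you inside Maptitude.
cluster_of = {row["ZIP"]: row["CLUSTER_ID"] for row in allocations}
joined = [dict(p, CLUSTER_ID=cluster_of[p["ZIP"]]) for p in zip_points]
```

After the join, each data point carries its cluster number, which is what the theme uses to assign a color per cluster.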
MPCluster can also write the allocations to an Excel workbook, but we have not selected this option here.
Next press Start on the main panel to start processing. MPCluster will prompt you for the FFA file for the new data view. You can use any name you want. Here we use txzip:
Then it will ask you for the output file prefix for the new data layer (i.e. cluster centers and boundaries) DBD files. Here we use txzip_clusters:
Pressing Save on this final dialog box will initiate the actual processing. MPCluster will then display a dialog box indicating clustering progress:
This dialog box will disappear when processing has completed and the clusters have been drawn on the map. Here are the results, zoomed out:
As expected, MPCluster has identified clusters around many of the larger cities: San Antonio, Dallas-Fort Worth, Oklahoma City, and Memphis. Some other large cities are missing (notably Houston and Austin). This is partly because the K-Means algorithm is stochastic in nature; multiple runs may produce better results. An unusual-looking cluster is located on the Louisiana delta between Baton Rouge and New Orleans. Although you might expect this cluster to include both New Orleans and Baton Rouge, the 30 mile maximum radius prevents it from growing to include either city. Instead it incorporates the western edge of New Orleans and many of the zipcodes southeast of Baton Rouge.
Here are the results from the same parameters but using the Hierarchical algorithm:
The Hierarchical algorithm is deterministic, i.e. it does not contain a random component. In this example it does a much better job of finding the clusters around all of the larger cities and many of the intermediate-sized cities. The largest cities (the DFW Metroplex and Houston) have each been split into two adjacent clusters. This is due to the maximum radius constraint of 30 miles.
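A minimal agglomerative sketch shows why a hierarchical approach is deterministic: it repeatedly merges the two clusters with the nearest centers, a rule with no random element, so every run on the same data returns the same grouping. Again, this is an illustration of the general technique, not MPCluster's implementation:

```python
def centroid(cluster):
    """Mean (x, y) of a list of 2-D points."""
    return (sum(p[0] for p in cluster) / len(cluster),
            sum(p[1] for p in cluster) / len(cluster))

def hierarchical(points, k):
    """Agglomerative clustering: start with one cluster per point and
    repeatedly merge the pair with the nearest centroids until k remain.
    There is no random step, so the result is the same on every run."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci, cj = centroid(clusters[i]), centroid(clusters[j])
                d = (ci[0] - cj[0]) ** 2 + (ci[1] - cj[1]) ** 2
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the nearest pair
    return clusters
```

Because every step is a deterministic choice, re-running this (or MPCluster's Hierarchical option) on unchanged data and settings reproduces the same clusters.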
The best choice of algorithm will typically depend on your exact application; sometimes one algorithm will simply suit the data better. For example, K-Means tends to prefer compact, circular clusters, whilst the Hierarchical algorithm will often produce flattened, oval-shaped, or slightly concave clusters.
Further details on how to control the clustering parameters can be found on the Changing Clustering Behavior and Setting Options page.