EJPAU 2006. Galant K. DATA CLASSIFICATION FROM CARTOGRAPHIC POINT OF VIEW

Electronic Journal of Polish Agricultural Universities (EJPAU), founded by all Polish agricultural universities, presents original papers and review articles relevant to all aspects of agricultural sciences. It is aimed at persons working in science, industry and regulatory agencies, or teaching in the agricultural sector. Covered by IFIS Publishing (Food Science and Technology Abstracts), ELSEVIER Science - Food Science and Technology Program, CAS USA (Chemical Abstracts), CABI Publishing UK and ALPSP (Association of Learned and Professional Society Publishers - full membership). Presented in the Master List of Thomson ISI.

2006
Volume 9
Issue 4

Topic:

Geodesy and Cartography

Galant K. 2006. DATA CLASSIFICATION FROM CARTOGRAPHIC POINT OF VIEW, EJPAU 9(4), #36.
Available Online: http://www.ejpau.media.pl/volume9/issue4/art-36.html

DATA CLASSIFICATION FROM CARTOGRAPHIC POINT OF VIEW

Katarzyna Galant
Institute of Geodesy and Geoinformatics, Wrocław University of Environmental and Life Sciences, Poland

ABSTRACT

Classification plays a significant role in simplifying data visualisation. Instead of individual values, which may number in the thousands, only a few classes are obtained, each with a known range and a known number of values. The result is a clear and easily interpretable thematic map. Hence the quality of the data classification carried out automatically in GIS software packages is of great importance. The purpose of the paper is to compare the tools that the examined programs provide for data classification and to analyse the quality of the classification performed.

Key words: data classification, methods of classification, GVF.

INTRODUCTION

One may risk the assertion that computer technology and the Internet have reduced the creation of thematic maps to a few mouse clicks. Everyone has access to enormous data sets, can retrieve large amounts of data and can then visualize them, using commercial software packages such as MapInfo, ArcInfo and GeoMedia or the non-commercial CommonGIS. Visualization is only one of the possibilities, as GIS software is primarily a tool for solving problems and making decisions in a wide range of disciplines. Presenting statistical data in the form of thematic maps, such as choropleth or diagram maps, is very common and beneficial, as it conveys information not only about the size of a phenomenon but also about its geographical distribution. Cartographers know that one of the conditions for a map to achieve its goal and to be correct is an analysis of the data: their utility, reliability and accuracy; the nature of the objects the data refer to; the character of the data (absolute, relative); and the measurement scale [1]. Data processing also includes classification, understood as a generalization that results in a simplified representation of the data. Instead of individual values, which may number in the thousands, a few classes are obtained, each with a known range and a known number of observed values; this is known as cartographic generalization [3]. Pasławski has pointed out two approaches to class determination: as a stage in the creation of a thematic map or as a separate problem of cartography. Moreover, he has defined class delineation as a process corresponding to the construction of stemplots in statistics, although restricted by additional conditions: the number of classes and their boundaries.

CLASSIFICATION METHODS AND STATISTICAL CHARACTER OF DATA SET

Techniques of classification

Data classification performed for the purpose of visualizing data on a choropleth map involves choosing the number of classes as well as a technique for determining them. The first factor depends on how many graphic variables one can discern: in the cartographic literature [4], 7 (at most 9) is considered the optimal number of classes. The second factor is widely described by many authors, who group the classification techniques differently. According to Pasławski [3], the classification methods can be categorized as follows:

Graphic methods

graphs of values distribution
graphs of values and areas of reference units

Mathematical and statistical methods based on:

number of data set
values of data set
values of data set and additional characteristic of data

Non-formalized methods

normative ranges
traditional ranges

In this paper the fully automated data classification performed in ArcInfo v.9.0, MapInfo v.8.0, GeoMedia v.5.2 and CommonGIS v.2.2.2 is of particular interest. Dividing data into classes is an important stage in the process of creating thematic maps. An attempt has been made to indicate the best classification scheme using the GVF (goodness of variance fit) factor as an indicator. A further purpose of the paper is to compare the tools that the above-mentioned software packages provide for data classification.

Researched data set

The analysis of the classification methods available in the examined programs was performed on data on forestry in the Lower Silesia area, grouped by counties (Tab. 1).

Table 1. Sorted set of the researched data

Fig. 1. Graph of the values

In order to compare the classification methods and indicate the most suitable one for this particular data set, the same number of classes was assumed for each classification in each program. The number of classes was specified on the basis of the graph (Fig. 1) created in MapInfo. We now turn to the creation of graphs in the examined software.

In order to construct appropriate classes, the character of the data set, i.e. the statistical distribution of the given variable, has to be known. It is best depicted on a histogram or a graph of values, as noted by Pasławski [3]. The graphic methods are considered the first stage in the determination of class boundaries.

Fig. 2. ArcInfo classification dialog box

Data may be presented on histograms in ArcInfo, MapInfo and CommonGIS, while GeoMedia has no option for creating graphs. The programs differ substantially in how much user input the creation of a histogram requires: in ArcInfo the diagram is created automatically and displayed in the data classification module (Fig. 2), whereas in MapInfo and CommonGIS creating a chart is a separate process.

As regards the usability of graphs within the classification process, the histogram in ArcInfo stands out: the number of columns (i.e. the number of classes that the data range is divided into) may be controlled by the user, the mean and the standard deviation of the data set can be displayed, and the class boundaries may be moved, inserted or deleted.

CommonGIS allows the user to control, by means of sliders, both the range of values (i.e. to hide outliers) and the range of counts. Moreover, the user can define the interval, i.e. the number of classes, as in ArcInfo, and when the mouse is moved over a bar in the histogram, the exact interval borders and the objects contained in it are listed [6] (Fig. 3a). An important advantage is the interactive character of the histogram: clicking on a bar highlights all objects included in it on the map, so the user can locate them [6]. Another view of the data classification is provided by the Ranged distribution diagram, which is created automatically with a choropleth map. Here the number of bars equals the number of classes (the horizontal axis has no values) and the vertical axis represents the number of observations in each class. The bars are drawn in the same colours as the classes in the legend and on the map and can be sorted by size (by enabling the Ranged option).

Fig. 3. Histogram in CommonGIS: default one (a) and hiding outliers (b)

The histogram in MapInfo requires modification by the user (format options in the General and Grid and Scale tabs), as the default one (Fig. 4a) is useless for the classification process: it shows the statistical distribution of the data too generally (too few classes).

Fig. 4. Histogram in MapInfo: default one (a) and after the user’s modification (b)

A graph of values is the second type of graph that can be used for graphical data classification. It may be created for a data set sorted in ascending or descending order as a Scatter chart (Fig. 1) in MapInfo and in ArcInfo. According to Pasławski [3], classes based on such graphs should group similar values and separate dissimilar ones, which is regarded as optimization and will be discussed further.

CommonGIS offers a useful graph that is displayed automatically during map creation, namely a cumulative curve. The horizontal axis represents the value range of an attribute and the vertical axis shows the proportional frequency of objects. Peculiarities of the value distribution can be perceived from the shape of the curve: steep segments correspond to clusters of close values [6]. The user can adjust class boundaries by moving them on the classification bar (Fig. 5a), as the classification bar, graphs, map and legend are continuously linked. Furthermore, the cumulative curve makes it possible to divide the data set on the basis of more than one criterion (Fig. 5b). Hence, in terms of the categories listed earlier, CommonGIS also offers the group of methods based on the values of the data set and an additional characteristic of the data. This also corresponds to the Quantile method in MapInfo, which will be explained further.
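The idea behind the cumulative curve can be illustrated with a short Python sketch. It is not CommonGIS code: the function names and the sample values are hypothetical, and the "widest gap" heuristic is only one simple way of reading flat segments of the curve as candidate class breaks.

```python
def cumulative_curve(values):
    """Return (value, cumulative share of objects) points for the sorted data."""
    data = sorted(values)
    n = len(data)
    return [(v, (i + 1) / n) for i, v in enumerate(data)]


def widest_gaps(values, count):
    """Return the `count` widest gaps between consecutive sorted values.

    A wide gap shows up as a flat segment of the cumulative curve and is a
    natural candidate for a class break; clusters appear as steep segments.
    """
    data = sorted(values)
    pairs = sorted(zip(data, data[1:]),
                   key=lambda pair: pair[1] - pair[0],
                   reverse=True)
    return sorted(pairs[:count])


sample = [1, 2, 3, 10, 11, 12, 30]  # made-up values, not the forestry data
for value, share in cumulative_curve(sample):
    print(value, round(share, 2))
print(widest_gaps(sample, 2))
```

Placing a break inside each of the reported gaps separates the clusters that the steep parts of the curve reveal.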

Fig. 5. Data classification in CommonGIS based on cumulative curve for one (a) and two variables (b)

An additional method for the graphical representation of the statistical distribution of a variable is the dispersion graph (Fig. 6), which differs from the frequency histogram in that the data range is not divided into classes. Dots representing attribute values with only small deviations are drawn on top of each other in “stacks” [6]. The stacked representation allows the user to differentiate between objects with equal attribute values and to get an impression of the dispersion of values.

Fig. 6. Dispersion graph (CommonGIS)

All the above types of graphs give information about the data set and help to determine a number and boundaries of classes that suit the statistical distribution of the data. Any user modification, such as inserting or deleting class boundaries, results in the automatic selection of the Custom/Manual classification method. Such graphs are important in decision making: they help in choosing the classification scheme that suits the particular data set and in assessing whether the automatically created class boundaries are correct. They also prevent the creation of an unwarranted number of classes, as choropleth maps elaborated with the maximum number of classes available in the programs (MapInfo: 16, GeoMedia: 20, ArcInfo: 32) are unreadable and prone to misinterpretation. In the author’s opinion, only the producers of CommonGIS take into account the fact that the number of classes depends on the ability to discern graphic variables, so the number of classes in this program is limited to nine.

ANALYSIS OF CLASSIFICATION SCHEMES

Classification schemes

Once the number of classes has been established, e.g. on the basis of a graph, the classification can be carried out, and in the next step its quality should be analysed. The examined software packages offer the following classification methods:

MapInfo
– Equal Count
– Equal Interval
– Natural Breaks
– Standard Deviation
– Quantile

ArcInfo
– Equal Interval
– Quantile
– Natural Breaks
– Standard Deviation

GeoMedia
– Equal Count
– Equal Interval
– Standard Deviation

CommonGIS:
– Equal Size
– Equal Interval
– Nested Means
– Optimal (mean/median/entropy)

The Quantile method offered by MapInfo does not conform to the universal definition of quantiles found in the cartographic or statistical literature, whereas the one in ArcInfo does. Therefore the Quantile classification in ArcInfo is treated here as the Equal Count (or Equal Size) method, and the one in MapInfo will be discussed further. Additionally, each program offers the Custom/Manual classification scheme mentioned before.
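For orientation, the two simplest schemes in the lists above, Equal Interval and Equal Count, can be sketched in Python. This is a generic illustration, not the implementation of any of the four programs; the function names, the rank rule used for Equal Count and the sample values are assumptions.

```python
def equal_interval_breaks(values, k):
    """Upper bounds of k classes of equal width spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + width * i for i in range(1, k + 1)]


def equal_count_breaks(values, k):
    """Upper bounds of k classes holding (almost) equal numbers of values.

    Class i ends at the value of rank ceil(i * n / k); real packages may
    resolve ties and remainders differently.
    """
    data = sorted(values)
    n = len(data)
    return [data[(i * n + k - 1) // k - 1] for i in range(1, k + 1)]


# Made-up sample, not the forestry data from Table 1
sample = [5.0, 8.0, 9.0, 12.0, 15.0, 21.0, 22.0, 30.0, 41.0, 45.0]
print(equal_interval_breaks(sample, 4))
print(equal_count_breaks(sample, 4))
```

Equal Interval depends only on the extreme values, while Equal Count depends only on the ranks, which is why the two schemes can produce very different class ranges for a skewed data set.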

The results of the classification are presented in Table 2. In order to analyse the quality of data classification, the four most popular classification schemes were taken into consideration. It should be stressed that the programs differ in how values are assigned to particular ranges. In MapInfo the values in each class are greater than or equal to the minimum and less than the maximum; in ArcInfo they are greater than the minimum and less than or equal to the maximum; GeoMedia follows the same rule as ArcInfo. Moreover, CommonGIS is the only program that defines a class boundary as the mean of two values, and ArcInfo is the only one in which the lower boundary of the next class differs from the upper boundary of the previous one. A disadvantage of CommonGIS in determining class boundaries is the default rounding to two decimal places. All the above remarks can be observed in the table below.
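The effect of the two interval conventions can be shown with a small Python sketch. The function names and break values are hypothetical; only the two rules themselves (lower bound closed versus upper bound closed) are taken from the text, and the handling of the extreme values at the outer edges is left out.

```python
from bisect import bisect_left, bisect_right


def class_index_lower_closed(value, inner_breaks):
    """[min, max) rule: a value equal to a break goes to the upper class."""
    return bisect_right(inner_breaks, value)


def class_index_upper_closed(value, inner_breaks):
    """(min, max] rule: a value equal to a break stays in the lower class."""
    return bisect_left(inner_breaks, value)


inner_breaks = [10.0, 20.0, 30.0]  # hypothetical breaks separating four classes
for v in (9.9, 10.0, 20.0, 25.0):
    print(v,
          class_index_lower_closed(v, inner_breaks),
          class_index_upper_closed(v, inner_breaks))
```

Only values that coincide with a break change class, so two programs can report the same breaks and still count a different number of units in neighbouring classes.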

Table 2. Results of data classification carried out in the researched software

GVF analysis

The analysis is based on the assumption that the most suitable classification scheme is the one that minimizes the differences between the observed data values and the mean of the data values in their class. The data assigned to a class are then alike and each class is well represented by its mean value. Methods fulfilling this condition are called optimization methods and belong to the mathematical and statistical category. The creation of optimal classes is based on a statistical criterion. In Jenks’ optimization it is the GVF (goodness of variance fit) factor, which rises towards one for the best classifications as the sum of the variances within the classes is minimized. Hence the GVF factor is used both to determine class boundaries in the optimization method and as an indicator of classification quality. The GVF is calculated as follows [5]:

$$\mathrm{GVF} = \frac{\mathrm{SDAM} - \mathrm{SDCM}}{\mathrm{SDAM}}$$

where
SDAM – sum of squared deviations from the array mean
SDCM – sum of squared deviations from the class means
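A minimal Python sketch of the GVF computation, assuming the usual form GVF = (SDAM - SDCM) / SDAM, with classes described by their inclusive upper bounds. The function name, the way classes are passed in and the sample values are assumptions made for illustration.

```python
def gvf(values, upper_bounds):
    """Goodness of variance fit for classes given by inclusive upper bounds.

    Assumes upper_bounds is ascending and its last element is >= max(values).
    """
    array_mean = sum(values) / len(values)
    sdam = sum((v - array_mean) ** 2 for v in values)

    # Assign each value to the first class whose upper bound is not below it.
    classes = [[] for _ in upper_bounds]
    for v in values:
        idx = next(i for i, ub in enumerate(upper_bounds) if v <= ub)
        classes[idx].append(v)

    sdcm = 0.0
    for members in classes:
        if members:  # empty classes contribute nothing
            class_mean = sum(members) / len(members)
            sdcm += sum((v - class_mean) ** 2 for v in members)
    return (sdam - sdcm) / sdam


sample = [1, 2, 3, 10, 11, 12]   # made-up values with two obvious clusters
print(gvf(sample, [3, 12]))      # break between the clusters
print(gvf(sample, [2, 12]))      # break inside the first cluster
```

The first call scores higher than the second, because splitting between the two clusters leaves much less variance inside the classes.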

Table 3. GVF factor for each classification scheme

The analysis (Tab. 3) shows that the best classification scheme is Natural Breaks, as its GVF factor is the closest to one. The outcome confirms the statement that Natural Breaks is the most appropriate classification method, as it groups similar data values and separates dissimilar ones. It also shows that the Natural Breaks classification scheme available in MapInfo uses the optimization algorithm. In ArcInfo and CommonGIS the user is informed about this fact. Additionally, CommonGIS offers three optimization algorithms based on different criteria: mean, median and entropy. In the analysis the first one was taken into account, as it corresponds to Jenks’ optimization. The median can be used to construct another factor, GADF (goodness of absolute deviation fit), which is calculated analogously to GVF but with reference to the median [5]. Like GVF, GADF is both a criterion for performing optimal data classification and an indicator of classification quality. In the Optimal classification the user can define either the number of classes or the maximal error of classification expressed in percent, in which case the program calculates the number of classes.
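By analogy with GVF, the GADF can be sketched in Python as one minus the ratio of absolute deviations from the class medians to absolute deviations from the median of the whole array. The function name, the labels ADAM and ADCM in the comments and the sample values are assumptions; the sketch does not reproduce the CommonGIS implementation.

```python
from statistics import median


def gadf(values, upper_bounds):
    """Goodness of absolute deviation fit for classes with inclusive upper bounds."""
    array_median = median(values)
    adam = sum(abs(v - array_median) for v in values)  # deviations from array median

    adcm = 0.0  # deviations from class medians
    lower = float("-inf")
    for upper in upper_bounds:
        members = [v for v in values if lower < v <= upper]
        if members:
            class_median = median(members)
            adcm += sum(abs(v - class_median) for v in members)
        lower = upper
    return 1 - adcm / adam


sample = [1, 2, 3, 10, 11, 12]  # made-up values
print(gadf(sample, [3, 12]))
```

Because it relies on medians and absolute deviations, this measure is less sensitive to outliers than the variance-based GVF.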

One can notice that the GVF values for all classification methods except Standard Deviation are similar (the differences range from 0.01 to 0.02), which may result from the character of the data set (its statistical distribution). Another conclusion is that the outcomes of the analysis for MapInfo and CommonGIS are the same although the class boundaries are different. This results from the different rules for assigning values to classes described above.

Reception of aggregated data

Another analysis of the quality of data classification proposed in the paper is essential when a map is used as a source of data, e.g. for further analysis. The reception of aggregated data relies on assigning the mean of the class boundaries to the reference units. In this analysis the squared deviations of the data belonging to a class from the mean of its class boundaries were calculated (Tab. 4). One can notice that only in CommonGIS is Natural Breaks the best classification scheme; for the other programs it is Equal Interval (which differs only slightly from Natural Breaks).

Table 4. Outcomes of the quality analysis

SDCM* – squared deviations from the class mean, where the class mean is calculated as the mean of the class boundaries
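The SDCM* measure can be sketched in Python as follows. The sketch assumes classes of the form (lower, upper], with the lowest boundary included in the first class; the function name and the sample numbers are hypothetical.

```python
def sdcm_star(values, boundaries):
    """Sum of squared deviations of the values from their class midpoints.

    boundaries is an ascending list [b0, b1, ..., bk]; class i covers
    (b_i, b_(i+1)], and the first class also includes b0 itself.
    """
    total = 0.0
    for lower, upper in zip(boundaries, boundaries[1:]):
        midpoint = (lower + upper) / 2  # value a map reader assigns to the unit
        members = [v for v in values
                   if lower < v <= upper or (v == lower == boundaries[0])]
        total += sum((v - midpoint) ** 2 for v in members)
    return total


sample = [1, 2, 3, 10, 11, 12]            # made-up values
print(sdcm_star(sample, [1, 6.5, 12]))    # two classes with midpoints 3.75 and 9.25
```

Unlike SDCM, this measure penalizes classes whose members sit far from the middle of the class range, even if they are close to each other.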

Another important issue is how the Standard Deviation scheme is carried out in the examined software packages, as the classes are completely different in each of them (Tab. 3). The most suitable way of creating classes is offered by ArcInfo, because the user defines not the number of classes, as in the other two programs, but the class width as a fraction of the standard deviation (1σ, 1/2 σ or 1/3 σ). On the other hand, its disadvantage is that the mean is not a class boundary but is always placed in the middle of a class. In GeoMedia, the empty classes created beyond the range of values in order to reach the specified number of classes are pointless. It must be emphasised that Standard Deviation in GeoMedia is the worst data classification scheme, as shown by both analyses: its GVF factor is the lowest and, in the second analysis, its SDCM* is the highest.
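The two ways of placing the mean in a Standard Deviation scheme can be compared with a short Python sketch. It is a generic illustration under stated assumptions, not the algorithm of any of the programs: the population standard deviation is used, breaks are kept only within the data range, and the function and parameter names are hypothetical.

```python
import math
from statistics import mean, pstdev


def std_dev_breaks(values, step=1.0, mean_is_boundary=True):
    """Class breaks spaced step * sigma apart, limited to the data range.

    mean_is_boundary=True  puts a break exactly at the mean;
    mean_is_boundary=False centres one class on the mean instead.
    Assumes the data are not all identical (sigma > 0).
    """
    m, s = mean(values), pstdev(values)
    width = step * s
    shift = 0.0 if mean_is_boundary else 0.5
    lo, hi = min(values), max(values)
    j_min = math.ceil((lo - m) / width - shift)
    j_max = math.floor((hi - m) / width - shift)
    return [m + (j + shift) * width for j in range(j_min, j_max + 1)]


sample = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values: mean 5, sigma 2
print(std_dev_breaks(sample, 1.0, mean_is_boundary=True))
print(std_dev_breaks(sample, 1.0, mean_is_boundary=False))
```

With the mean as a boundary, the map separates units below and above the average; with the mean in the middle of a class, units slightly below and slightly above the average share one class.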

The process of classification simplifies the reception of a map. On such maps the spatial structure of the phenomenon is easier for a general reader to assess. The figure below (Fig. 7) shows the classified data set in the form of choropleth maps produced with different classification methods and software packages.

Fig. 7. The choropleth maps presenting forestation elaborated using different classification schemes in the researched software

A visual assessment of these maps leads to the conclusion that highly forested areas lie in the west (Bory Dolnośląskie) and southwest of Lower Silesia, while the central part has low forestation and the northern part contains areas of diverse forestation. This results from the landscape and land cover of the region. Looking carefully at the maps, one notices that the Equal Interval method gives the same choropleth map in each of the examined programs, as does the Equal Count method, although the latter yields different class ranges. Analysing the maps elaborated with the Natural Breaks classification scheme, the user spots a difference between the choropleth maps from MapInfo and ArcInfo. Even though the maps created in CommonGIS and MapInfo look the same, the class breaks are different. This may also affect the quantitative interpretation of the phenomenon because, as mentioned before, the reading of a choropleth map is based on the assumption that the mean of the class boundaries is assigned to the reference unit. The reception of the presented phenomenon may differ between choropleth maps created with the Standard Deviation scheme. GeoMedia delivers an almost uniform depiction: only three classes can be distinguished on the map (two classes are empty), whereas on the MapInfo and ArcInfo maps five classes can be discerned but their spatial distribution differs. The way the Standard Deviation scheme is performed in each program was explained above.

As the legend constitutes the key to map interpretation, it is of great importance. The analysis indicates that the classification carried out in CommonGIS is the best according to the rule: group similar values and separate dissimilar ones. However, there are some disadvantages: the legend is created automatically by the software, and the colour selection does not let the user define colours by RGB (Red/Green/Blue) values. That is why the colours on the choropleth maps from CommonGIS differ from those in the other three programs, and the legend in the form of a continuous bar is incorrect according to the generally known rules of elaborating choropleth maps (Fig. 7).

Quality indicators

It should be stressed that CommonGIS is the only one of the examined programs that offers indicators describing the quality of data classification. There are two indicators, expressed in percent and shown as horizontal bars (Fig. 8).

Fig. 8. Quality indicators (CommonGIS)

The first one (Quality of the classification) expresses how far the classification is from the original data set (it measures the loss of precision resulting from the classification), and the second one (Quality Vs. best) indicates how far it is from the optimal classification (the ratio of the first value to the value for the statistically optimal classification with the same number of classes) [6]. These indicators are computed with the same algorithms as in the Optimal method, using the mean, median or entropy.

Other techniques of data classification

The Quantile method of classification available in MapInfo has not been examined in the analysis above, as it does not conform to the universal definition of quantiles found in the cartographic or statistical literature and is not offered by the other programs. This method uses so-called weighted quantiles: the ranges are defined on the basis of a variable other than the visualized one. Unfortunately, the “Help” option gives the user no explanation of this methodology, so maps created with this technique may be incorrect. The classification is performed according to the following algorithm: the data set sorted by one variable is divided into classes on the basis of a second variable; a class range is determined as in the Equal Interval classification of the second variable; then, starting with the lowest value, the values of the second variable are added up for as long as the sum is less than or equal to the class range. This explanation can also be found on a forum of MapInfo users [7]. Most often the range is quantiled by area, in which case the choropleth map delivers a different kind of information. This method of classification is called the geographical quantiles method [5]. When using this technique the user should be aware that the two variables should be related in some way (population and population density; population and buying power index), otherwise the map can be interpreted wrongly. Moreover, it should be pointed out that only relative data are presented on a choropleth map, whereas the variable on which the quantiling is based is recommended to be absolute. With the Quantile method one can create the “mosaic concentration map” developed by Uhorczak [4]. He proposed depicting the population density of Poland on a choropleth map with classes determined on the basis of the population figures. The second set is divided into 10 classes (10% of the population each). A map constructed in this way delivers information about the concentration of the population, e.g. the counties of average population density, which were inhabited by 10% of the population, covered ¼ of Poland.
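The weighted-quantile rule as described verbally above can be sketched in Python. This follows the description only and is not MapInfo's actual code; the function name, the record layout and the sample figures are hypothetical, and the target share per class is assumed to be the sum of the second variable divided by the number of classes.

```python
def weighted_quantile_classes(records, k):
    """Split records into at most k classes by accumulating a weight variable.

    records: list of (mapped_value, weight) pairs, e.g. (density, population).
    A class is closed as soon as adding the next unit would push the running
    weight over the target share; the last class takes whatever remains.
    """
    ordered = sorted(records)                   # sort by the mapped variable
    target = sum(w for _, w in ordered) / k     # weight share per class
    classes, current, running = [], [], 0
    for value, weight in ordered:
        if current and running + weight > target and len(classes) < k - 1:
            classes.append(current)
            current, running = [], 0
        current.append((value, weight))
        running += weight
    classes.append(current)
    return classes


# Hypothetical units: (population density, population)
units = [(10, 100), (20, 150), (30, 250), (40, 120), (50, 130), (60, 250)]
for members in weighted_quantile_classes(units, 4):
    print([value for value, _ in members], sum(w for _, w in members))
```

In this made-up example every class holds a quarter of the total population, while the number of units per class varies, which is the kind of information a concentration map is meant to convey.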

Unfortunately, despite the rules described above, the program does not work in this way, as tests on different data sets have shown. The class boundaries seem to be random rather than created according to the assumption that each range extends up to, but not over, the value calculated as the sum of the values divided by the number of classes. This observation is confirmed by questions from users struggling with the same problem that are posted on the MapInfo webpage [7].

This type of classification, based on an additional characteristic of the data set, can also be performed in CommonGIS. Using the cumulative curve, the user can add an attribute on the basis of which the classes will be determined. The interactive character of the classification bar, cumulative curve and choropleth map helps in the proper delineation of classes.

Another classification scheme not yet described is Nested Means, offered only by CommonGIS. According to Pasławski [3], this method belongs to the second group: methods based on the values of the data set (with the mean as the statistical measure). The automated classification proceeds according to a known algorithm. Its disadvantage is that the number of classes is always a multiple of two.
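The commonly described nested means procedure can be sketched in Python: the data set is split at its mean, and each part is split again at its own mean. This is a generic sketch, not CommonGIS code; the function name, the `depth` parameter and the sample values are assumptions. In this form the procedure yields 2, 4 or 8 classes for depths 1, 2 and 3.

```python
def nested_means_breaks(values, depth):
    """Inner class breaks obtained by recursively splitting at the mean.

    depth=1 gives one break (two classes), depth=2 up to three breaks
    (four classes), and so on; degenerate subsets are not split further.
    """
    def split(subset, level):
        if level == 0 or len(subset) < 2:
            return []
        m = sum(subset) / len(subset)
        below = [v for v in subset if v < m]
        above = [v for v in subset if v >= m]
        return split(below, level - 1) + [m] + split(above, level - 1)

    return split(list(values), depth)


sample = [1, 2, 3, 4, 10, 20, 30, 50]  # made-up, right-skewed values
print(nested_means_breaks(sample, 2))
```

Because every break is a mean, the scheme adapts to skewed distributions, but the user cannot ask for, say, five or seven classes.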

In CommonGIS the user can also create a bivariate choropleth map using the ‘cross-classification’ option. This issue has been described in detail by Leonowicz [2].

Useful remarks on elaborating maps based on classified data

Even though the classification is automatic, there are some things that a novice user should be aware of. Those already underlined are the usefulness of graphs, the number of classes and the assignment of values to a particular class. Besides this, a GeoMedia user has to know that in the Map by Ranges dialog box it is possible to change the class boundaries in Labels, but the change is reflected only in the legend: the data are not reclassified, so the map is simply wrong. The advantage of GeoMedia over the other programs is that it has no default method of classification, hence the created map cannot be the result of coincidence.

In ArcInfo, as on the histogram in CommonGIS, it is possible to eliminate so-called outliers from the classification process in Data Exclusion, which is part of the Classification dialog box. There are two kinds of elimination: Exclusion (using SQL) and Sampling (three methods).

CommonGIS offers the option of transferring the categorization into the table as a new attribute (Add to table button). The table can then be arranged according to class membership, and other methods can be applied to this new attribute (Fig. 9) [6].

Fig. 9. Example of categorization of classified data (CommonGIS)

CONCLUSIONS

The comparison of the classification tools offered by ArcInfo, MapInfo, GeoMedia and CommonGIS presented in the paper gives a view of their usefulness and correctness. The problem of choosing the number of classes and the technique for their determination suitable for a particular data set has been explained. The role of charts presenting the statistical distribution of data as one of the factors in cartographic generalization was pointed out. The diagrams are significant as they indicate which of the known and available classification patterns suits the data set, and they help to assess whether the automatically created class boundaries are correct. Moreover, as noted before, they prevent the creation of a pointless number of classes. As presented in the paper, CommonGIS offers the widest range of graphs (histogram, cumulative curve, dispersion graph, ranged distribution), whereas ArcInfo and MapInfo offer two types: Scatter chart and histogram. However, the histogram in ArcInfo is far more useful than the default one created in MapInfo. GeoMedia has no option for creating graphs. This leads to the conclusion that, among the examined programs, CommonGIS has the best tools for indicating the suitable classification scheme for a particular data set, followed by ArcInfo, then MapInfo and GeoMedia.

The analyses were based on the optimization assumption. Hence, not surprisingly, the best scheme is Natural Breaks/Optimal, which only confirms that the examined software automatically carries out this classification according to an optimization algorithm. However, the optimal classification is used very rarely, as can be seen in published atlases: the most common choropleth map, and the simplest to interpret, is the one with equal interval classes. It has also been pointed out that, when reading a map, one assigns the mean of the class boundary values to the reference units. Therefore the second analysis was carried out, which indicated Equal Interval as the best classification pattern.

Moreover, it has been noticed that although the same method is applied in each program, different class boundary values are determined. ArcInfo is the only one of the examined programs that clearly allocates the observations to the classes, and CommonGIS is the only one that takes the mean of the values on either side of the “gap” as the boundary value (in the Optimal and Equal Size methods). Additionally, CommonGIS has three optimization algorithms, which are used both to determine class boundaries and as indicators of classification quality.

REFERENCES

1. Kraak M.-J., Ormeling F.J., 1998. Cartography. Visualization of spatial data. Addison Wesley Longman, London.

2. Leonowicz A., 2003. Kartogram w programie CommonGIS [Choropleth map in CommonGIS]. Polski Przegląd Kartograficzny. Tom 35, 2003, nr 3. PTG i PPWK. Warszawa [in Polish].

3. Pasławski J., 1992. Kartogram jako forma prezentacji kartograficznej [Choropleth map as a form of cartographic presentation]. Wydawnictwo Uniwersytetu Warszawskiego. Warszawa [in Polish].

4. Ratajski L., 1989. Metodyka kartografii społeczno-gospodarczej [Methodology of socio-economic cartography]. Wyd. II. Warszawa – Wrocław. PPWK [in Polish].

5. Robinson et al., 1988. Podstawy kartografii [Elements of cartography]. PWN. Warszawa [in Polish].

6. CommonGIS Reference Manual, 2004. www.commongis.com

7. http://testdrive.mapinfo.com

Accepted for print: 29.11.2006

Katarzyna Galant
Institute of Geodesy and Geoinformatics,
Wrocław University of Environmental and Life Sciences, Poland
Grunwaldzka 53, 50-357 Wrocław, Poland
email: galant@kgf.ar.wroc.pl
