
13-10-2010

QUALITY ASSESSMENT OF VOLUNTEERED GEOGRAPHIC INFORMATION (VGI)
BASED ON OPEN WEB MAP SERVICES AND ISO/TC 211 19100-FAMILY STANDARDS

Hans-Jörg Stark(1)

(1)University of Applied Sciences Northwestern Switzerland, Institute of Geomatics Engineering,
     Muttenz, Switzerland, hansjoerg.stark@fhnw.ch  

This article summarises the findings of the master's thesis 'Quality assurance of crowdsourced geocoded address-data within OpenAddresses. Concepts and implementation.' (Stark, 2010).

1.  Introduction

Geocoded address data are of high value (Hancock, 2010) as reference datasets for a broad range of applications such as delivery services, emergency services and business mapping. However, their value depends heavily on their quality in terms of positional accuracy, correct spelling and currency. If the quality of the reference dataset is poor, the resulting geocoding results will be equally poor (Ratcliffe, 2001, 2004; Zandbergen, 2007). In European countries, especially German-speaking countries, high-quality geodata is available through either public or commercial organisations (Auer and Zipf, 2009), but its cost is high. This situation led to the conception and implementation of the open geo-data project OpenAddresses (OA) in 2007 (Stark, 2009), the aim of which is to collect geocoded addresses as volunteered geographic information (VGI) in a central database.

As useful as the integration of volunteers into information collection may be, the quality of the gathered information remains a valid concern (Goodchild, 2008). According to Agichtein et al. (2008: 183) 'The quality of user-generated content varies drastically from excellent to abuse and spam.' The acceptance of (spatial) data in general by the user community depends heavily on the data's quality. Thus research in the field of quality assurance of VGI is necessary.

The ISO/TC 211 19100 family of standards provides a framework for conceptualising, assessing and documenting the quality of geospatial information. These standards are used as the reference in the conception of quality assurance for OA.

1.1 Approach of Quality Assessment of OpenAddresses
      To assess the quality of OA, a reference dataset or service must cover the complete area of investigation. Originally, OA focussed solely on Swiss address data. However, since OA has received more and more international contributions, the reference resource should, in addition to being openly accessible, also provide international data. Therefore, Open Web Map Services (OWMS) (Jain, 2007) such as Google Maps, Bing Maps and Yahoo! Maps are used as the reference dataset, and their suitability for the task of quality assessment for OA is investigated. The challenge in this context is that the dataset to be assessed claims to have higher accuracy than the reference dataset to which it is compared.

Two basic steps are necessary to perform the quality assessment of OA with OWMS. Firstly, the three OWMS introduced above must themselves be assessed individually. Secondly, it must be determined how the results of the OWMS assessment can be used to appraise each address collected in the OA project.

1.2 Volunteered Geographic Information
      The general concept of volunteer-contributed geographic information has been described by many authors and is well documented (Fischer, 2008; Flanagin and Metzger, 2008; Coleman et al., 2009; Elwood, 2009). In the area of community-based VGI, to which OA belongs, the most prominent project is certainly OpenStreetMap (OSM). There is also the area of commercially oriented VGI, i.e., enterprises that take advantage of VGI data for commercial gain.

1.3 Geocoded Address Data
      In business mapping and other fields, high-resolution geocoded address data are often used to analyze spatial distributions, customer densities, etc. Address gazetteers and administrative units also take advantage of these data (Harris et al., 2006). In health geography and epidemiology micro-geographic analyses based on geocoded address-data are now common (Gatrell and Senior, 2005; Messina et al., 2006). Most importantly, this form of analysis demands not only high spatial accuracy for each application area but also completeness of the reference data (Goldberg, 2008).

1.4 Quality Assessment in general
       The term ‘quality’ covers various characteristics that are hard to quantify, and experts have not reached consensus on a single definition. For some people, a high-quality product is one without errors; for others it is one that meets the expectations of a consumer. In the context of spatial data, the term fitness for use (Jakobsson and Tsoulos, 2007) is used quite often. It means that the same product may conform to the quality requirements of one context but not to those of another. Because OA is a dynamic project, the focus here is on attribute accuracy and spatial accuracy. Attribute accuracy mainly consists of completeness of information and correct spelling. Spatial accuracy is defined as the deviation, or error distance, between the true location and either the location provided by the OWMS geocoder (in the case of the OWMS assessment) or the user-entered position (in the case of OA). Figure 1 illustrates how buildings are located along a street, using Gellertstrasse in Basel as an example: some buildings are close to the street, others are farther away. Such characteristics have a direct impact on the quality of street geocoding results, because error distances can vary greatly for the street-based (linear) geocoding algorithms used within OWMS.
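The error distance defined above can be illustrated with a minimal sketch. It assumes that both positions are given as (x, y) pairs in a projected, metric coordinate system (for example a national grid); the function names and the sample coordinates are hypothetical and are not taken from the thesis.

```python
import math

def error_components(true_xy, observed_xy):
    """Return the deviation (dx, dy) of an observed position from the
    true position. Both inputs are (x, y) tuples in the same projected
    coordinate system (assumption: planar, metric units)."""
    dx = observed_xy[0] - true_xy[0]
    dy = observed_xy[1] - true_xy[1]
    return dx, dy

def error_distance(true_xy, observed_xy):
    """Return the Euclidean error distance between the true position
    and the observed (geocoded or user-entered) position."""
    dx, dy = error_components(true_xy, observed_xy)
    return math.hypot(dx, dy)

# Hypothetical example: the observed position lies 3 units east and
# 4 units north of the true position.
true_position = (613000.0, 266500.0)
observed_position = (613003.0, 266504.0)
print(error_components(true_position, observed_position))
print(error_distance(true_position, observed_position))
```

Keeping the x- and y-components separate, rather than only the distance, makes it possible to study the deviations per direction. For geographic (latitude/longitude) coordinates a geodesic distance would be needed instead of the planar formula used here.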

Additionally, the issue of malicious data entry must be addressed. Within any VGI project there is a risk that data is intentionally falsified as an act of vandalism. This could mean that address values are incorrect or that addresses are positioned incorrectly. The presented approach showed that, with the use of OWMS, such malicious data can be detected or at least flagged in OA.

From the ISO/TC 211 19100 family of standards, ISO/TC 211:19113 (2001) (Quality principles), ISO/TC 211:19114 (2001) (Quality evaluation procedures) and ISO/TC 211:19138 (2006) (Data quality measures) are applied in the quality assessment process.

In order to serve as reference for the assessment process of OA the above mentioned three OWMS must be assessed. A complete dataset of geocoded addresses of the Canton of Solothurn (cadastral data consisting of 93,623 addresses) serves as the reference data for this first quality assessment (OWMS assessment).

1.5 Open Web Map Services
       All three OWMS discussed provide application programming interfaces (APIs) that offer a range of client operations, among them geocoding. Because the three OWMS use different spatial datasets as reference data and different geocoding algorithms, their geocoding results for the same address differ. Figure 1 presents a number of sample addresses in Basel’s Gellertstrasse and clearly shows the differences between the three OWMS geocoding results.

 
Figure 1. Map excerpt showing OWMS derived locations versus true locations of addresses in Basel at Gellertstrasse

2.  Quality Assessment of Open Web Map Services

2.1 Attribute and Spatial Accuracy
       Each of the reference addresses is geocoded by all three OWMS; the results are stored in a database and examined for attribute completeness and error distance. Certain constraints were applied to yield the best possible error distance, not biased by either bad geocoding quality or bad thematic accuracy. None of the three OWMS geocoders achieved 100% attribute completeness.

Error distances were investigated in more detail to obtain the best possible estimators of threshold values for each OWMS with regard to the OA data quality assessment. Following Zimmerman et al. (2007), the error distance of each address is analysed separately in the x- and y-directions. ISO/TC 211:19138 (2006, p. 42) suggests applying a threshold value emax to determine the mean value of positional uncertainties excluding outliers.

Because the range of deviations can vary greatly (cf. Figure 1), setting a precise definition for emax is difficult. The approach to determining emax involved analysing the x- and y-components of the deviations. To exclude gross errors, only addresses whose x- and y-components of the deviation fall within the 95% quantile of all values were considered for analysis. The distribution of the deviations in the x- and y-directions was analysed for each OWMS; Figures 2 and 3 show it for Bing Maps as an example.

 
Figure 2. Histogram of x-direction deviations for Bing Maps
Figure 3. Histogram of y-direction deviations for Bing Maps
  
The definition of emax is derived from the computed values of the 95% quantile in the x- and y-directions for each OWMS. To obtain reasonable estimators of threshold values for the quality assessment of positional accuracy in OA, the maximum distance of the 95% quantile in the x- and y-directions defines the threshold used to determine outliers (cf. Table 1).

                          Bing Maps   Google Maps   Yahoo! Maps
Threshold Quantile 95%    67.08       15.36         42.62
Threshold Outlier         111.76      40.81         68.41

Table 1. Threshold values for quality assessment of OA data for positional accuracy
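The quantile-based derivation of a threshold can be sketched as follows. This is one possible reading of the procedure, not the implementation used in the thesis: it assumes that the 95% quantile is taken over the absolute deviations in each direction (nearest-rank method) and that the larger of the two quantiles serves as the threshold. The function names and sample values are hypothetical, and the sketch is not claimed to reproduce the figures in Table 1.

```python
import math

def quantile(values, percent):
    """Nearest-rank quantile: the smallest value such that at least
    `percent` percent of the values are less than or equal to it."""
    if not values:
        raise ValueError("quantile() requires at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]

def deviation_threshold(dx_values, dy_values, percent=95):
    """Return the larger of the per-direction quantiles of the absolute
    x- and y-deviations (assumed interpretation of the threshold)."""
    qx = quantile([abs(d) for d in dx_values], percent)
    qy = quantile([abs(d) for d in dy_values], percent)
    return max(qx, qy)

# Hypothetical deviations for 100 addresses.
dx = [float(i) for i in range(1, 101)]
dy = [2.0 * i for i in range(1, 101)]
print(deviation_threshold(dx, dy))
```

Using a quantile rather than the maximum deviation keeps a few grossly misplaced geocoding results from inflating the threshold.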

It must be emphasised at this point that the presented findings and figures apply primarily to Switzerland. In other countries data quality of OWMS may vary and thus threshold values should be assessed accordingly (cf. Outlook).

3.  Quality Assessment of OpenAddresses

3.1 Approach
      Unlike the OWMS assessment, the quality assessment of OA is dynamic, i.e., a new address that is entered or an existing one that is altered is assessed immediately. The basic idea is to send the user-entered address parameter values to the three OWMS and evaluate the information they return. If the spelling of the user-entered address values matches that of the values returned by the OWMS, it can be assumed that the address values were entered correctly. A binary approach is applied for attribute accuracy.

In terms of positional accuracy the user entered position is compared to the OWMS returned positions for the specific address. The computed error distance - user entered position versus OWMS position - is compared to the corresponding threshold values for each OWMS.
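A minimal sketch of this comparison is given below. The threshold values are those listed in Table 1 and are assumed to be in the same unit as the computed error distance. The three rating labels and the function name are assumptions made for illustration; the source does not specify how the two thresholds are combined into a rating.

```python
# Threshold values per OWMS as listed in Table 1.
THRESHOLDS = {
    "Bing Maps": {"quantile95": 67.08, "outlier": 111.76},
    "Google Maps": {"quantile95": 15.36, "outlier": 40.81},
    "Yahoo! Maps": {"quantile95": 42.62, "outlier": 68.41},
}

def rate_position(service, error_distance):
    """Rate a user-entered position against one OWMS.

    The labels are hypothetical:
    'plausible' - within the 95% quantile threshold,
    'check'     - between the quantile and the outlier threshold,
    'suspect'   - beyond the outlier threshold (possible gross error).
    """
    limits = THRESHOLDS[service]
    if error_distance <= limits["quantile95"]:
        return "plausible"
    if error_distance <= limits["outlier"]:
        return "check"
    return "suspect"

# Hypothetical error distances between a user-entered position and
# the positions returned by the three OWMS.
distances = {"Bing Maps": 80.0, "Google Maps": 12.5, "Yahoo! Maps": 300.0}
for service, distance in distances.items():
    print(service, rate_position(service, distance))
```

Because each OWMS has its own thresholds, the same error distance can be plausible against one service and suspect against another, so the ratings are best reported per service.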

3.2 Proof of concept
      In order to test whether the OWMS quality assessment was successful and serves as a reference for the quality assessment of OA data a set of test-addresses was used. These test-addresses were classified into three categories: the first category contained addresses with correct locations, the second category contained addresses with small positional errors (e.g. the position was defined as slightly outside the building) while the third category contained addresses with gross positional errors. For all addresses the address parameter values were entered without errors.

The goal of the test was to evaluate whether a) correct addresses were indicated as correct, b) addresses with gross positional errors (=malicious edits) could be detected and c) whether this OWMS based approach is able to detect addresses with small positional errors.

3.3 Results
      User-entered address parameter values are considered correct if at least one of the three OWMS returns a true match for these values. With this rule, statements on the correctness of the attribute values of addresses are reliable in around 77% of cases. The remaining cases result from the strict binary comparison algorithm that was applied: especially when characters are added to house numbers (e.g. '37a'), OWMS geocoders do not return identical values, and the user-entered input is erroneously classified as potentially wrong. Consequently, in only around 23% of cases an additional manual check of the entered values must be conducted unnecessarily. Since this is a Type I error (false positive), it causes only unnecessary effort but does not harm the quality of the data.
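The binary attribute check described above can be sketched as follows. The field names and the dictionary-based data layout are assumptions for illustration only; the real OWMS responses are structured differently and would first have to be mapped to such a form.

```python
# Hypothetical address fields compared in the binary check.
FIELDS = ("street", "house_number", "postcode", "city")

def is_true_match(entered, returned):
    """Strict binary comparison: every field must be identical."""
    return all(entered.get(f) == returned.get(f) for f in FIELDS)

def attributes_confirmed(entered, owms_results):
    """An address counts as confirmed if at least one OWMS result
    is a true match for the user-entered values."""
    return any(is_true_match(entered, result) for result in owms_results)

entered = {"street": "Gellertstrasse", "house_number": "37a",
           "postcode": "4052", "city": "Basel"}

# Hypothetical responses: all three services drop the letter suffix.
responses = [dict(entered, house_number="37") for _ in range(3)]

if attributes_confirmed(entered, responses):
    print("attribute values confirmed")
else:
    print("flag for manual check")
```

In this hypothetical case the correctly entered house number '37a' is flagged for a manual check because no service returns an identical value, which corresponds to the false-positive case discussed above.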

Positional accuracy is more difficult to assess because error distances between the true location and the location interpolated by the OWMS vary greatly. Two constraints are applied for the spatial quality assessment: one regarding deviations, the other regarding OWMS geocoding level information. Under the first constraint, none of the maliciously misreported addresses is classified as correct. The second constraint correctly identifies 92.7% of addresses with gross positional errors. Small positional errors could not be detected with this approach, and further research is needed to find alternative approaches for handling them.

In order to post-process the entered or altered addresses, a web-based user interface is available that lists the latest addresses along with the values of their quality assessment (cf. Figure 4). This interface indicates both the attribute conformance and the computed error distance, along with a colour-coded rating. A small map allows a visual check of the user-entered position.


Figure 4. Overview of OpenAddresses quality assessment

The presented work confirms that a less accurate reference dataset can help in assessing a more accurate dataset, serving as an indicator especially for gross errors.

4.  Outlook

Since OA is operating globally a concept of "global quality managers" could be evaluated. This means that for certain regions or countries qualified and identified persons act as quality managers. In this case further investigation on the threshold values must be conducted for each country or region.

References

  • Agichtein, E., C. Castillo, et al. 2008. 'Finding high-quality content in social media'. Proceedings of the international conference on Web search and web data mining, Palo Alto.
  • Auer, M. and A. Zipf 2009. 'How do free and Open Geodata and Open Standards fit together? From Sceptisim versus high Potential to real Applications.' The First Open Source GIS UK Conference, Nottingham.
  • Coleman, D. J., Y. Georgiadou, et al. 2009. 'Volunteered Geographic Information: The Nature and Motivation of Produsers.' International Journal of Spatial Data Infrastructures Research vol 4, no, pp. 332
  • Elwood, S. 2009. 'Geographic Information Science: new geovisualization technologies - emerging questions and linkages with GIScience research.' Progress in Human Geography vol 33, no 2, pp. 256
  • Fischer, F. 2008. 'Collaborative mapping. How Wikinomics is Manifest in the Geo-Information Economy.' GEOInformatics vol 11, no 2, pp. 28
  • Flanagin, A. J. and M. J. Metzger 2008. 'The credibility of volunteered geographic information.' GeoJournal vol no 72, pp. 137
  • Gatrell, A. C. and M. L. Senior 2005. 'Health and healthcare applications'. Geographical Information Systems. Principles, Techniques, Management, and Applications. P. A. Longley, M. F. Goodchild, D. J. Maguire and D. W. Rhind. Hoboken, New Jersey, John Wiley & Sons, pp. 925 - 938.
  • Goldberg, D. W. 2008. A Geocoding Best Practices Guide, Springfield, IL: North American Association of Central Cancer Registries.
  • Goodchild, M. F. 2006. 'Foreword'. Fundamentals of spatial data quality. R. Devillers and R. Jeansoulin. London, ISTE, pp. 13 - 16.
  • Goodchild, M. F. 2007. 'Citizens as sensors: the world of volunteered geography.' GeoJournal vol no 69, pp. 211
  • Goodchild, M. F. 2008. 'Citizens as sensors.' GIS Trends + Markets vol no 6, pp. 27
  • Hancock, C. 2010. 'Address management for emergency services.' GEOconnexion International Magazine vol 9, no 2, pp. 20.
  • Harris, R., P. Sleight, et al. 2006. Geodemographics, GIS and Neighbourhood Targeting, John Wiley & Sons, Ltd.
  • ISO/TC 211:19113 2001. 'Geographic Information - Quality Principles', International Organization for Standardization (ISO), pp. 1 - 32.
  • ISO/TC 211:19114 2001. 'Geographic Information - Quality Evaluation Procedures', International Organization for Standardization (ISO), pp. 1 - 71.
  • ISO/TC 211:19138. 2006. 'Text for TS 19138 Geographic Information - Data quality measures, as sent to ISO for publication.' viewed March 25 2010, <www.isotc211.org/protdoc/211n2029/>.
  • Jain, A. 2007. 'Mechanisms for validation of volunteer data in open web map services.' viewed March 11 2010, <www.ncgia.ucsb.edu/projects/vgi/docs/supp_docs/Jain_paper.pdf>.
  • Jakobsson, A. and L. Tsoulos 2007. 'The Role of Quality in Spatial Data Infrastructures'. 23rd International Cartographic Conference, Moscow.
  • Messina, J. P., A. M. Shortridge, et al. 2006. 'Evaluating Michigan's community hospital access: spatial methods for decision support.' International Journal of Health Geographics vol 5, no 42.
  • Oort, P. A. J. v. 2006. 'Spatial data quality: from description to application'. viewed March 6 2010.
  • Ratcliffe, J. H. 2001. 'On the accuracy of TIGER-type geocoded address data in relation to cadastral and census areal units.' Int. J. Geographical Information Science vol 15, no 5, pp. 473
  • Ratcliffe, J. H. 2004. 'Geocoding crime and a first estimate of a minimum acceptable hit rate.' Int. J. Geographical Information Science vol 18, no 1, pp. 61.
  • Stark, H.-J. 2009. 'OpenAddresses - Free geocoded street addresses'. Applied Geoinformatics for Society and Environment, Stuttgart.
  • Stark, H.-J. 2010. 'Quality assurance of crowdsourced geocoded address-data within OpenAddresses. Concepts and implementation.' Centre for GeoInformatics (Z_GIS) viewed July 29 2010, <www.unigis.ac.at/club/bibliothek/pdf/40138.pdf>.
  • Zandbergen, P. A. 2007. 'Influence of geocoding quality on environmental exposure assessment of children living near high traffic roads.' BMC Public Health vol 7, no 37.
  • Zimmerman, D. L., F. Xiangming, et al. 2007. 'Modeling the probability distribution of positional errors incurred by residential address geocoding.' International Journal of Health Geographics vol 6, no 1, pp. 1