Authors: Thomas N. Herzog, Fritz J. Scheuren, William E. Winkler
Editors: Thomas N. Herzog, Fritz J. Scheuren, William E. Winkler (Eds.)
Title: Data Quality and Record Linkage Techniques
Publisher: Springer-Verlag
ISBN: 9780387695051
Edition: 1
Price: CHF 104.90
Category: Computer Science
Language: English
Pages: 234
Copy protection: Watermark/DRM
Supported devices: PC/MAC/eReader/Tablet
Format: PDF

This book offers a practical understanding of the issues involved in improving data quality through editing, imputation, and record linkage. The first part deals with methods and models, focusing on the Fellegi–Holt edit-imputation model, the Little–Rubin multiple-imputation scheme, and the Fellegi–Sunter record linkage model. The second part presents case studies in which these techniques are applied in a variety of areas, including mortgage guarantee insurance, medical and biomedical applications, highway safety, and social insurance, as well as the construction of list frames and administrative lists. Throughout, the book offers a mixture of practical advice, mathematical rigor, management insight, and philosophy.

7 Automatic Editing and Imputation of Sample Survey Data (p. 61)

7.1. Introduction

As discussed in Chapter 3, missing and contradictory data are endemic in computer databases. In Chapter 5, we described a number of basic data editing techniques that can be used to improve the quality of statistical data systems. By an edit we mean a set of values for a specified combination of data elements within a database that are jointly unacceptable (or, equivalently, jointly acceptable). Certainly, we can use edits of the types described in Chapter 5.
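To make the definition concrete, here is a minimal sketch (not taken from the book) of edits expressed as jointly unacceptable value combinations; the field names and rules are hypothetical. A record fails an edit when its values hit one of those unacceptable combinations.

# Minimal sketch: edits as jointly unacceptable value combinations.
# Field names and edit rules are hypothetical examples.

RECORD = {"age": 15, "marital_status": "married", "relationship": "spouse"}

# Each edit is a predicate that returns True when the combination of
# values is unacceptable (i.e., the record fails that edit).
EDITS = {
    "minor_married": lambda r: r["age"] < 16 and r["marital_status"] == "married",
    "spouse_not_married": lambda r: r["relationship"] == "spouse"
                                    and r["marital_status"] != "married",
}

def failed_edits(record):
    """Return the names of all edits the record violates."""
    return [name for name, is_bad in EDITS.items() if is_bad(record)]

print(failed_edits(RECORD))   # ['minor_married']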

In this chapter, we discuss automated procedures for editing (i.e., cleaning up) and imputing (i.e., filling in) missing data in databases constructed from data obtained from respondents in sample surveys or censuses. To accomplish this task, we need efficient ways of developing statistical data edit/imputation systems that minimize development time, eliminate most errors in code development, and greatly reduce the need for human intervention.

In particular, we would like to drastically reduce, or eliminate entirely, the need for humans to change/correct data. The goal is to improve survey data so that they can be used for their intended analytic purposes.

One such important purpose is the publication of estimates of totals and subtotals that are free of self-contradictory information. We begin by discussing editing procedures, focusing on the model proposed by Fellegi and Holt [1976]. Their model was the first to provide fast, reproducible, table-driven methods that could be applied to general data. It was also the first to ensure that a record could be corrected in a single pass through the data.
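The sketch below illustrates the minimal-change idea behind Fellegi–Holt error localization in a deliberately brute-force way: it searches for a smallest set of fields whose values can be altered so that every edit is satisfied. The domains, edits, and field names are hypothetical, and the book's actual method relies on implicit edits and set covering rather than exhaustive search.

from itertools import combinations, product

# Brute-force sketch (not the book's algorithm) of Fellegi-Holt style
# error localization: find a smallest set of fields whose values can be
# changed so that the record passes every edit.

DOMAINS = {
    "age_group": ["child", "adult"],
    "marital_status": ["single", "married"],
    "relationship": ["child_of_head", "spouse"],
}

# Each edit returns True when the record is acceptable with respect to it.
EDITS = [
    lambda r: not (r["age_group"] == "child" and r["marital_status"] == "married"),
    lambda r: not (r["relationship"] == "spouse" and r["marital_status"] != "married"),
]

def passes_all(record):
    return all(edit(record) for edit in EDITS)

def minimal_change(record):
    """Return (fields_to_change, corrected_record) changing as few fields as possible."""
    fields = list(DOMAINS)
    for k in range(len(fields) + 1):                 # try 0, 1, 2, ... changed fields
        for subset in combinations(fields, k):
            for values in product(*(DOMAINS[f] for f in subset)):
                candidate = dict(record, **dict(zip(subset, values)))
                if passes_all(candidate):
                    return subset, candidate
    return None, record                              # no consistent record exists

record = {"age_group": "child", "marital_status": "married", "relationship": "spouse"}
print(minimal_change(record))
# e.g. (('age_group',), {'age_group': 'adult', 'marital_status': 'married', 'relationship': 'spouse'})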

Prior to Fellegi and Holt, records were iteratively and slowly changed with no guarantee that any final set of changes would yield a record that satisfied all edits. We then describe a number of schemes for imputing missing data elements, emphasizing the work of Rubin [1987] and Little and Rubin [1987, 2002].

Two important advantages of the Little–Rubin approach are that (1) probability distributions are preserved by the use of defensible statistical models and (2) estimated variances include a component due to the imputation. In some situations, the Little–Rubin methods may need extra information about the non-response mechanism.
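As a concrete illustration of advantage (2), the following sketch applies Rubin's combining rules to estimates from m completed data sets: the total variance adds a between-imputation component to the average within-imputation variance. The numbers are invented purely for illustration.

from statistics import mean, variance

# Sketch of Rubin's rules for combining m multiply-imputed estimates.
# q[i] is the point estimate and u[i] its estimated variance from the
# i-th completed data set.

def combine(q, u):
    m = len(q)
    q_bar = mean(q)                 # combined point estimate
    w = mean(u)                     # within-imputation variance
    b = variance(q)                 # between-imputation variance (sample variance)
    t = w + (1 + 1 / m) * b         # total variance includes an imputation component
    return q_bar, t

estimates = [102.1, 99.8, 101.5, 100.7, 103.0]
variances = [4.2, 3.9, 4.5, 4.1, 4.0]
print(combine(estimates, variances))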

For instance, if certain high-income individuals have a stronger tendency to not report or misreport income, then a specific model for the income-reporting of these individuals may be needed. In other situations, the missing-data imputation can be done via methods that are straightforward extensions of hot-deck. We provide details of hot-deck and its extensions later in this chapter.
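Ahead of that fuller treatment, here is a minimal sketch of basic hot-deck imputation under assumed field names: a missing value is copied from a randomly selected donor record in the same imputation class.

import random

# Minimal hot-deck sketch (field names are hypothetical): a missing value
# is filled in from a randomly chosen "donor" record in the same
# imputation class (here, the same age group).

random.seed(0)

records = [
    {"age_group": "30-39", "income": 52000},
    {"age_group": "30-39", "income": 61000},
    {"age_group": "30-39", "income": None},    # recipient with missing income
    {"age_group": "40-49", "income": 75000},
]

def hot_deck(records, field, class_field):
    for rec in records:
        if rec[field] is None:
            donors = [r for r in records
                      if r[class_field] == rec[class_field] and r[field] is not None]
            if donors:                          # leave missing if no donor exists
                rec[field] = random.choice(donors)[field]
    return records

print(hot_deck(records, "income", "age_group"))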

Ideally, we would like to have an all-purpose, unified edit/imputation model that incorporates the features of the Fellegi–Holt edit model and the Little–Rubin multiple imputation model. Unfortunately, we are not aware of such a model. However, Winkler [2003] provides a unified approach to edit and imputation when all of the data elements of interest can be considered to be discrete.

Preface
Contents
About the Authors
1 Introduction
1.1. Audience and Objective
1.2. Scope
1.3. Structure
Part 1 Data Quality: What It Is, Why It Is Important, and How to Achieve It
2 What Is Data Quality and Why Should We Care?
2.1. When Are Data of High Quality?
2.2. Why Care About Data Quality?
2.3. How Do You Obtain High-Quality Data?
2.4. Practical Tips
2.5. Where Are We Now?
3 Examples of Entities Using Data to Their Advantage/Disadvantage
3.1. Data Quality as a Competitive Advantage
3.2. Data Quality Problems and Their Consequences
3.3. How Many People Really Live to 100 and Beyond? Views from the United States, Canada, and the United Kingdom
3.4. Disabled Airplane Pilots – A Successful Application of Record Linkage
3.5. Completeness and Accuracy of a Billing Database: Why It Is Important to the Bottom Line
3.6. Where Are We Now?
4 Properties of Data Quality and Metrics for Measuring It
4.1. Desirable Properties of Databases/Lists
4.2. Examples of Merging Two or More Lists and the Issues That May Arise
4.3. Metrics Used When Merging Lists
4.4. Where Are We Now?
5 Basic Data Quality Tools
5.1. Data Elements
5.2. Requirements Document
5.3. A Dictionary of Tests
5.4. Deterministic Tests
5.5. Probabilistic Tests
5.6. Exploratory Data Analysis Techniques
5.7. Minimizing Processing Errors
5.8. Practical Tips
5.9. Where Are We Now?
Part 2 Specialized Tools for Database Improvement
6 Mathematical Preliminaries for Specialized Data Quality Techniques
6.1. Conditional Independence
6.2. Statistical Paradigms
6.3. Capture–Recapture Procedures and Applications
7 Automatic Editing and Imputation of Sample Survey Data
7.1. Introduction
7.2. Early Editing Efforts
7.3. Fellegi–Holt Model for Editing
7.4. Practical Tips
7.5. Imputation
7.6. Constructing a Unified Edit/Imputation Model
7.7. Implicit Edits – A Key Construct of Editing Software
7.8. Editing Software
7.9. Is Automatic Editing Taking Up Too Much Time and Money?
7.10. Selective Editing
7.11. Tips on Automatic Editing and Imputation
7.12. Where Are We Now?
8 Record Linkage – Methodology
8.1. Introduction
8.2. Why Did Analysts Begin Linking Records?
8.3. Deterministic Record Linkage
8.4. Probabilistic Record Linkage – A Frequentist Perspective
8.5. Probabilistic Record Linkage – A Bayesian Perspective
8.6. Where Are We Now?
9 Estimating the Parameters of the Fellegi–Sunter Record Linkage Model
9.1. Basic Estimation of Parameters Under Simple Agreement/Disagreement Patterns
9.2. Parameter Estimates Obtained via Frequency-Based Matching
9.3. Parameter Estimates Obtained Using Data from Current Files
9.4. Parameter Estimates Obtained via the EM Algorithm
9.5. Advantages and Disadvantages of Using the EM Algorithm to Estimate m- and u-probabilities
9.6. General Parameter Estimation Using the EM Algorithm
9.7. Where Are We Now?
10 Standardization and Parsing
10.1. Obtaining and Understanding Computer Files
10.2. Standardization of Terms
10.3. Parsing of Fields
10.4. Where Are We Now?
11 Phonetic Coding Systems for Names
11.1. Soundex System of Names
11.2. New York State Identification and Intelligence System (NYSIIS) Phonetic Decoder
11.3. Where Are We Now?
12 Blocking
12.1. Independence of Blocking Strategies
12.2. Blocking Variables
12.3. Using Blocking Strategies to Identify Duplicate List Entries