Authors: Thomas N. Herzog, Fritz J. Scheuren, William E. Winkler
Editors: Thomas N. Herzog, Fritz J. Scheuren, William E. Winkler (Eds.)
Title: Data Quality and Record Linkage Techniques
Publisher: Springer-Verlag
ISBN: 9780387695051
Edition: 1
Price: CHF 104.90
Category: Computer Science
Language: English
Pages: 234
Copy protection: Watermark/DRM
Supported devices: PC/MAC/eReader/Tablet
Format: PDF

This book offers a practical understanding of the issues involved in improving data quality through editing, imputation, and record linkage. The first part deals with methods and models, focusing on the Fellegi–Holt edit-imputation model, the Little–Rubin multiple-imputation scheme, and the Fellegi–Sunter record linkage model. The second part presents case studies in which these techniques are applied in a variety of areas, including mortgage guarantee insurance, medical and biomedical applications, highway safety, and social insurance, as well as the construction of list frames and administrative lists. Throughout, the book offers a mixture of practical advice, mathematical rigor, management insight, and philosophy.

7 Automatic Editing and Imputation of Sample Survey Data (p. 61)

7.1. Introduction

As discussed in Chapter 3, missing and contradictory data are endemic in computer databases. In Chapter 5, we described a number of basic data editing techniques that can be used to improve the quality of statistical data systems. By an edit we mean a set of values for a specified combination of data elements within a database that are jointly unacceptable (or, equivalently, jointly acceptable). Certainly, we can use edits of the types described in Chapter 5.
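To make the definition concrete, here is a minimal sketch (not taken from the book) of edits expressed as jointly unacceptable value combinations; the field names and rules are hypothetical. A record fails an edit when its values hit one of those unacceptable combinations.

# Minimal sketch: edits as jointly unacceptable value combinations.
# Field names and edit rules are hypothetical examples.

RECORD = {"age": 15, "marital_status": "married", "relationship": "spouse"}

# Each edit is a predicate that returns True when the combination of
# values is unacceptable (i.e., the record fails that edit).
EDITS = {
    "minor_married": lambda r: r["age"] < 16 and r["marital_status"] == "married",
    "spouse_not_married": lambda r: r["relationship"] == "spouse"
                                    and r["marital_status"] != "married",
}

def failed_edits(record):
    """Return the names of all edits the record violates."""
    return [name for name, is_bad in EDITS.items() if is_bad(record)]

print(failed_edits(RECORD))   # ['minor_married']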

In this chapter, we discuss automated procedures for editing (i.e., cleaning up) and imputing (i.e., filling in) missing data in databases constructed from data obtained from respondents in sample surveys or censuses. To accomplish this task, we need efficient ways of developing statistical data edit/imputation systems that minimize development time, eliminate most errors in code development, and greatly reduce the need for human intervention.

In particular, we would like to drastically reduce, or eliminate entirely, the need for humans to change/correct data. The goal is to improve survey data so that they can be used for their intended analytic purposes.

One such important purpose is the publication of estimates of totals and subtotals that are free of self-contradictory information. We begin by discussing editing procedures, focusing on the model proposed by Fellegi and Holt [1976]. Their model was the first to provide fast, reproducible, table-driven methods that could be applied to general data. It was also the first to ensure that a record could be corrected in a single pass through the data.
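The sketch below illustrates the minimal-change idea behind Fellegi–Holt error localization in a deliberately brute-force way: it searches for a smallest set of fields whose values can be altered so that every edit is satisfied. The domains, edits, and field names are hypothetical, and the book's actual method relies on implicit edits and set covering rather than exhaustive search.

from itertools import combinations, product

# Brute-force sketch (not the book's algorithm) of Fellegi-Holt style
# error localization: find a smallest set of fields whose values can be
# changed so that the record passes every edit.

DOMAINS = {
    "age_group": ["child", "adult"],
    "marital_status": ["single", "married"],
    "relationship": ["child_of_head", "spouse"],
}

# Each edit returns True when the record is acceptable with respect to it.
EDITS = [
    lambda r: not (r["age_group"] == "child" and r["marital_status"] == "married"),
    lambda r: not (r["relationship"] == "spouse" and r["marital_status"] != "married"),
]

def passes_all(record):
    return all(edit(record) for edit in EDITS)

def minimal_change(record):
    """Return (fields_to_change, corrected_record) changing as few fields as possible."""
    fields = list(DOMAINS)
    for k in range(len(fields) + 1):                 # try 0, 1, 2, ... changed fields
        for subset in combinations(fields, k):
            for values in product(*(DOMAINS[f] for f in subset)):
                candidate = dict(record, **dict(zip(subset, values)))
                if passes_all(candidate):
                    return subset, candidate
    return None, record                              # no consistent record exists

record = {"age_group": "child", "marital_status": "married", "relationship": "spouse"}
print(minimal_change(record))
# e.g. (('age_group',), {'age_group': 'adult', 'marital_status': 'married', 'relationship': 'spouse'})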

Prior to Fellegi and Holt, records were iteratively and slowly changed with no guarantee that any final set of changes would yield a record that satisfied all edits. We then describe a number of schemes for imputing missing data elements, emphasizing the work of Rubin [1987] and Little and Rubin [1987, 2002].

Two important advantages of the Little–Rubin approach are that (1) probability distributions are preserved by the use of defensible statistical models and (2) estimated variances include a component due to the imputation. In some situations, the Little–Rubin methods may need extra information about the non-response mechanism.
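As a concrete illustration of advantage (2), the following sketch applies Rubin's combining rules to estimates from m completed data sets: the total variance adds a between-imputation component to the average within-imputation variance. The numbers are invented purely for illustration.

from statistics import mean, variance

# Sketch of Rubin's rules for combining m multiply-imputed estimates.
# q[i] is the point estimate and u[i] its estimated variance from the
# i-th completed data set.

def combine(q, u):
    m = len(q)
    q_bar = mean(q)                 # combined point estimate
    w = mean(u)                     # within-imputation variance
    b = variance(q)                 # between-imputation variance (sample variance)
    t = w + (1 + 1 / m) * b         # total variance includes an imputation component
    return q_bar, t

estimates = [102.1, 99.8, 101.5, 100.7, 103.0]
variances = [4.2, 3.9, 4.5, 4.1, 4.0]
print(combine(estimates, variances))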

For instance, if certain high-income individuals have a stronger tendency to not report or misreport income, then a specific model for the income-reporting of these individuals may be needed. In other situations, the missing-data imputation can be done via methods that are straightforward extensions of hot-deck. We provide details of hot-deck and its extensions later in this chapter.
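Ahead of that fuller treatment, here is a minimal sketch of basic hot-deck imputation under assumed field names: a missing value is copied from a randomly selected donor record in the same imputation class.

import random

# Minimal hot-deck sketch (field names are hypothetical): a missing value
# is filled in from a randomly chosen "donor" record in the same
# imputation class (here, the same age group).

random.seed(0)

records = [
    {"age_group": "30-39", "income": 52000},
    {"age_group": "30-39", "income": 61000},
    {"age_group": "30-39", "income": None},    # recipient with missing income
    {"age_group": "40-49", "income": 75000},
]

def hot_deck(records, field, class_field):
    for rec in records:
        if rec[field] is None:
            donors = [r for r in records
                      if r[class_field] == rec[class_field] and r[field] is not None]
            if donors:                          # leave missing if no donor exists
                rec[field] = random.choice(donors)[field]
    return records

print(hot_deck(records, "income", "age_group"))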

Ideally, we would like to have an all-purpose, unified edit/imputation model that incorporates the features of the Fellegi–Holt edit model and the Little–Rubin multiple imputation model. Unfortunately, we are not aware of such a model. However, Winkler [2003] provides a unified approach to edit and imputation when all of the data elements of interest can be considered to be discrete.

Preface
Contents
About the Authors
1 Introduction
1.1. Audience and Objective
1.2. Scope
1.3. Structure
Part 1 Data Quality: What It Is, Why It Is Important, and How to Achieve It
2 What Is Data Quality and Why Should We Care?
2.1. When Are Data of High Quality?
2.2. Why Care About Data Quality?
2.3. How Do You Obtain High-Quality Data?
2.4. Practical Tips
2.5. Where Are We Now?
3 Examples of Entities Using Data to Their Advantage/Disadvantage
3.1. Data Quality as a Competitive Advantage
3.2. Data Quality Problems and Their Consequences
3.3. How Many People Really Live to 100 and Beyond? Views from the United States, Canada, and the United Kingdom
3.4. Disabled Airplane Pilots – A Successful Application of Record Linkage
3.5. Completeness and Accuracy of a Billing Database: Why It Is Important to the Bottom Line
3.6. Where Are We Now?
4 Properties of Data Quality and Metrics for Measuring It
4.1. Desirable Properties of Databases/Lists
4.2. Examples of Merging Two or More Lists and the Issues That May Arise
4.3. Metrics Used When Merging Lists
4.4. Where Are We Now?
5 Basic Data Quality Tools
5.1. Data Elements
5.2. Requirements Document
5.3. A Dictionary of Tests
5.4. Deterministic Tests
5.5. Probabilistic Tests
5.6. Exploratory Data Analysis Techniques
5.7. Minimizing Processing Errors
5.8. Practical Tips
5.9. Where Are We Now?
Part 2 Specialized Tools for Database Improvement
6 Mathematical Preliminaries for Specialized Data Quality Techniques
6.1. Conditional Independence
6.2. Statistical Paradigms
6.3. Capture–Recapture Procedures and Applications
7 Automatic Editing and Imputation of Sample Survey Data
7.1. Introduction
7.2. Early Editing Efforts
7.3. Fellegi–Holt Model for Editing
7.4. Practical Tips
7.5. Imputation
7.6. Constructing a Unified Edit/Imputation Model
7.7. Implicit Edits – A Key Construct of Editing Software
7.8. Editing Software
7.9. Is Automatic Editing Taking Up Too Much Time and Money?
7.10. Selective Editing
7.11. Tips on Automatic Editing and Imputation
7.12. Where Are We Now?
8 Record Linkage – Methodology
8.1. Introduction
8.2. Why Did Analysts Begin Linking Records?
8.3. Deterministic Record Linkage
8.4. Probabilistic Record Linkage – A Frequentist Perspective
8.5. Probabilistic Record Linkage – A Bayesian Perspective
8.6. Where Are We Now?
9 Estimating the Parameters of the Fellegi–Sunter Record Linkage Model
9.1. Basic Estimation of Parameters Under Simple Agreement/Disagreement Patterns
9.2. Parameter Estimates Obtained via Frequency-Based Matching
9.3. Parameter Estimates Obtained Using Data from Current Files
9.4. Parameter Estimates Obtained via the EM Algorithm
9.5. Advantages and Disadvantages of Using the EM Algorithm to Estimate m- and u-probabilities
9.6. General Parameter Estimation Using the EM Algorithm
9.7. Where Are We Now?
10 Standardization and Parsing
10.1. Obtaining and Understanding Computer Files
10.2. Standardization of Terms
10.3. Parsing of Fields
10.4. Where Are We Now?
11 Phonetic Coding Systems for Names
11.1. Soundex System of Names
11.2. New York State Identification and Intelligence System (NYSIIS) Phonetic Decoder
11.3. Where Are We Now?
12 Blocking
12.1. Independence of Blocking Strategies
12.2. Blocking Variables
12.3. Using Blocking Strategies to Identify Duplicate List Entries