| Preface | 5 |
|---|---|
| Contents | 7 |
| About the Authors | 12 |
| 1 Introduction | 13 |
| 1.1. Audience and Objective | 13 |
| 1.2. Scope | 13 |
| 1.3. Structure | 14 |
| Part 1 Data Quality: What It Is, Why It Is Important, and How to Achieve It | 17 |
| 2 What Is Data Quality and Why Should We Care? | 19 |
| 2.1. When Are Data of High Quality? | 19 |
| 2.2. Why Care About Data Quality? | 22 |
| 2.3. How Do You Obtain High-Quality Data? | 23 |
| 2.4. Practical Tips | 25 |
| 2.5. Where Are We Now? | 25 |
| 3 Examples of Entities Using Data to their Advantage/Disadvantage | 29 |
| 3.1. Data Quality as a Competitive Advantage | 29 |
| 3.2. Data Quality Problems and their Consequences | 32 |
| 3.3. How Many People Really Live to 100 and Beyond? Views from the United States, Canada, and the United Kingdom | 37 |
| 3.4. Disabled Airplane Pilots – A Successful Application of Record Linkage | 38 |
| 3.5. Completeness and Accuracy of a Billing Database: Why It Is Important to the Bottom Line | 38 |
| 3.6. Where Are We Now? | 39 |
| 4 Properties of Data Quality and Metrics for Measuring It | 41 |
| 4.1. Desirable Properties of Databases/Lists | 41 |
| 4.2. Examples of Merging Two or More Lists and the Issues that May Arise | 43 |
| 4.3. Metrics Used when Merging Lists | 45 |
| 4.4. Where Are We Now? | 47 |
| 5 Basic Data Quality Tools | 49 |
| 5.1. Data Elements | 49 |
| 5.2. Requirements Document | 50 |
| 5.3. A Dictionary of Tests | 51 |
| 5.4. Deterministic Tests | 52 |
| 5.5. Probabilistic Tests | 56 |
| 5.6. Exploratory Data Analysis Techniques | 56 |
| 5.7. Minimizing Processing Errors | 58 |
| 5.8. Practical Tips | 58 |
| 5.9. Where Are We Now? | 60 |
| Part 2 Specialized Tools for Database Improvement | 62 |
| 6 Mathematical Preliminaries for Specialized Data Quality Techniques | 63 |
| 6.1. Conditional Independence | 63 |
| 6.2. Statistical Paradigms | 65 |
| 6.3. Capture–Recapture Procedures and Applications | 66 |
| 7 Automatic Editing and Imputation of Sample Survey Data | 73 |
| 7.1. Introduction | 73 |
| 7.2. Early Editing Efforts | 75 |
| 7.3. Fellegi–Holt Model for Editing | 76 |
| 7.4. Practical Tips | 77 |
| 7.5. Imputation | 78 |
| 7.6. Constructing a Unified Edit/Imputation Model | 83 |
| 7.7. Implicit Edits – A Key Construct of Editing Software | 85 |
| 7.8. Editing Software | 87 |
| 7.9. Is Automatic Editing Taking Up Too Much Time and Money? | 90 |
| 7.10. Selective Editing | 91 |
| 7.11. Tips on Automatic Editing and Imputation | 91 |
| 7.12. Where Are We Now? | 92 |
| 8 Record Linkage – Methodology | 93 |
| 8.1. Introduction | 93 |
| 8.2. Why Did Analysts Begin Linking Records? | 94 |
| 8.3. Deterministic Record Linkage | 94 |
| 8.4. Probabilistic Record Linkage – A Frequentist Perspective | 95 |
| 8.5. Probabilistic Record Linkage – A Bayesian Perspective | 103 |
| 8.6. Where Are We Now? | 104 |
| 9 Estimating the Parameters of the Fellegi–Sunter Record Linkage Model | 105 |
| 9.1. Basic Estimation of Parameters Under Simple Agreement/Disagreement Patterns | 105 |
| 9.2. Parameter Estimates Obtained via Frequency-Based Matching | 106 |
| 9.3. Parameter Estimates Obtained Using Data from Current Files | 108 |
| 9.4. Parameter Estimates Obtained via the EM Algorithm | 109 |
| 9.5. Advantages and Disadvantages of Using the EM Algorithm to Estimate m- and u-probabilities | 113 |
| 9.6. General Parameter Estimation Using the EM Algorithm | 115 |
| 9.7. Where Are We Now? | 118 |
| 10 Standardization and Parsing | 119 |
| 10.1. Obtaining and Understanding Computer Files | 121 |
| 10.2. Standardization of Terms | 122 |
| 10.3. Parsing of Fields | 123 |
| 10.4. Where Are We Now? | 126 |
| 11 Phonetic Coding Systems for Names | 127 |
| 11.1. Soundex System of Names | 127 |
| 11.2. New York State Identification and Intelligence System (NYSIIS) Phonetic Decoder | 131 |
| 11.3. Where Are We Now? | 133 |
| 12 Blocking | 135 |
| 12.1. Independence of Blocking Strategies | 136 |
| 12.2. Blocking Variables | 137 |
| 12.3. Using Blocking Strategies to Identify Duplicate List Entries | 138 |