Volume 3 Number 1 Copyright 1997
A Process for Improving Data Quality
Mary Jane Willshire, Colorado Technical University, and Donna Meyen, MITRE Corporation
Key Words: DQEF, Improvement Plans, Priorities, Modeling, Attributes
Abstract: We introduce a Data Quality Engineering Framework (DQEF) that an organization can customize. This DQEF can both measure current data quality and be used to develop data quality improvement plans. We show how organizations can use this framework and demonstrate how it may be tailored to an organization's needs.
Databases that have multiple sources and contributors, have multiple data managers, and support a diverse set of users present significant challenges in sustaining data quality. Research in the fields of data and information quality is becoming comprehensive. As early as 1988, Dvorak and Richters defined quality parameters for a specific application. [2] McKeating noted that data quality should be specific to the environment where the data are used when he wrote in 1992, "Data must be more than merely consistent, it must reflect reality." [8] Levitin [6] and Strong [11] also mentioned this concept when they defined a "relevance attribute," while Wand [14] used the word "meaningful" to reflect the same idea. In 1995, Madnick expressed similar ideas when describing the problem of "context interchange." [7]
How is an organization to proceed once it has determined that it should examine the quality of its data? In this paper, we present a Data Quality Engineering Framework (DQEF) and a process for implementing the DQEF that organizations can customize to fit their needs.
The Data Quality Engineering Framework is a vehicle that an organization can use to define a model of its data environment, identify relevant data quality attributes, analyze data quality attributes in their current (or future) context, and provide guidance for data quality improvement. We developed DQEF so that data would be treated consistently within the context of specific systems (whether automated or not). This context includes system functions, data, and most importantly, data users. We believe the DQEF is the only known data quality framework that specifically addresses temporal data quality.
The DQEF is flexible enough to accommodate general data-centric paradigms and includes specific tools to attack many data quality problems. By using a "modeling phase" before the traditional "define, analyze, and improve" cycle, data engineers and users may obtain a better understanding of the data environment. Because the DQEF considers the temporal issues in a data environment's evolution, we believe it is a valuable tool for identifying and addressing future data quality problems.
2.1 Developing a Data Quality Engineering Model
The first step in developing a DQEF is to develop an appropriate functional and data modeling paradigm (e.g., mathematical, production-based, data flow) to produce a Data Quality Engineering Model. It is important to carefully describe the environment that the data supports, so that the context is clearly understood. The needs of primary and other data users should also be described in the model. Once one determines the scope of the data environment, all other tasks associated with this step can be accomplished within this specific context. These tasks include, but are not limited to, defining information and data requirements, defining information and data flow, and defining business rules. Because the only constant in life is change, we must consider the model, the current operating environment, and any future environments.
2.2 Define Data Quality Attributes
DQEF developers must establish precise definitions of the data quality parameters that are specific to the applications and that will appeal to both primary and secondary data users. User requirements (e.g., critical needs), functional context, recommendations from the literature, suggestions from experts, and best engineering judgment will determine the selection and definition of data quality parameters.
2.3 Collect, Measure, and Analyze Data Quality Attributes
Next, the DQEF developers select methods of collecting data (e.g., manually or automatically) and of assessing it (e.g., mathematical model, relational model, entity- or attribute-specific data), and make quality assessments using at least two different assessment methods. The assessments are then recorded. It is important to use multiple assessment methods to adequately evaluate quantitative and qualitative measures in order to provide users with the greatest benefits. DQEF developers should choose assessment methods based upon the user requirements and the data. The DQEF developers then identify problem areas and their probable causes. At this point, developers may use automated tools to facilitate interpretation.
2.4 Identify, Evaluate, Select, Apply, and Analyze Results
The final DQEF development task is to identify, evaluate, and select remedies. These are based on the current and future functional context or environment, type of data, processes affecting the data, recommendations from the literature, experience, and best engineering judgment. The developers select a remedy, or some combination of remedies, which they incorporate into the data life cycle. The developers perform a second assessment, using the same methods used for the initial assessment. They then perform an analysis of the adjusted data life cycle vs. data quality to uncover areas requiring additional attention. The developers use the results to refine and adjust the input, processes, product, and/or model.
DQEF is a step-by-step process that an organization may tailor to conform to its unique needs. In the following sections we make some suggestions that can serve as a starting point for data quality improvement, but each organization should determine what best suits its needs.
3.1 Modeling Existing Data
The first step in applying the DQEF process is to obtain a clear understanding of the existing data, its users, and its context. This is accomplished by developing a Data Quality Engineering Model (DQEM) as described above. Existing databases provide a starting point for the model, but this alone is insufficient. The developers must both examine the data model for existing databases and include their working environment in the DQEM. The context in which the databases are used, how they are used, and for what purposes have a significant impact on this model. The users and the use each group makes of the data must be identified. Whenever possible, future users must be identified. The developers should determine whether future users' needs will be the same as current users' needs. The more detailed the model, the easier it will be to pinpoint target areas for quality improvement.
3.2 Define Data Quality Attributes
The second step in the process is to define which data quality attributes are important for the organization. The end result of this activity will be a prioritized list of attributes based on present and future users' most important data quality requirements. Due to pragmatic concerns (e.g., time, money, resources), it may not be possible to achieve all the desired data quality improvements. For example, an analysis might indicate that data quality would improve if records in a database included the date on which specific values were entered or last verified as correct. This would give some indication as to the age (and perhaps validity) of the values. But an organization might not have the resources to implement such a procedure.
One way for an organization to accomplish this step would be to select a collection of data quality attributes. In one recent survey, Strong identified and categorized a list of 179 data quality attributes. [11] The authors recommend that organizations use this list; the attributes are both qualitative (subjective) and quantitative (objective). An organization that chose to do this would develop a precise organization-specific definition for each attribute. For instance, "easy to change" might correspond to a user requirement asking, "Can the data be manipulated easily?" "Reputation" could mean "What is the source of the data?" An organization may also decide to use only some attributes on Strong's list, or expand the list to include more attributes. Whatever an organization decides, it needs a balanced mixture of objective and subjective attributes. The final list is used as input for the next step.
3.3 Determine Data Quality Priorities
This step begins the evaluation of the quality of existing data. Data users receive the list of data quality attributes developed and customized in the previous step. DQEF developers ask various types of data users to select the data quality attributes that they feel are most important. Then, the results are analyzed by user type. The form of this analysis should be tailored to the organization. One method that may be used is to tabulate the raw numbers (how many users select each attribute), then sort the attributes into major categories. There are four major categories identified by Strong [11]: accuracy, accessibility, relevance, and representation. Using a few major categories makes it easier to evaluate the results of user needs surveys.
The frequency of user response is used to prioritize the list of data quality attributes for current and future data environments. The categories (with their tallied responses) enable the organization to determine the most desirable data quality attributes for its needs. The organization uses this information in later steps to determine its current data quality and to locate areas where it can apply remedies.
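The tabulation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the survey responses, the user types, and the mapping of each attribute to one of the four major categories are all hypothetical, and an organization would substitute its own customized attribute list.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from attribute to one of the four major categories.
CATEGORY_OF = {
    "timely": "relevance",
    "error-free": "accuracy",
    "easy to change": "representation",
    "retrievable": "accessibility",
}

# Hypothetical survey responses: (user type, attributes that user selected).
responses = [
    ("analyst", ["timely", "error-free"]),
    ("analyst", ["error-free", "retrievable"]),
    ("data entry", ["easy to change", "error-free"]),
    ("manager", ["timely", "retrievable"]),
]


def tally(responses):
    """Count selections per attribute, overall and by user type."""
    overall = Counter()
    by_user_type = defaultdict(Counter)
    for user_type, selected in responses:
        for attribute in set(selected):  # count each attribute once per respondent
            overall[attribute] += 1
            by_user_type[user_type][attribute] += 1
    return overall, by_user_type


def by_category(overall):
    """Roll attribute counts up into the major categories."""
    totals = Counter()
    for attribute, count in overall.items():
        totals[CATEGORY_OF[attribute]] += count
    return totals


overall, by_user_type = tally(responses)
# Sort by descending frequency, breaking ties alphabetically for a stable order.
priorities = sorted(overall.items(), key=lambda item: (-item[1], item[0]))
category_totals = by_category(overall)
```

Here `priorities` is the prioritized attribute list, `by_user_type` supports the analysis by user type, and `category_totals` gives the tallied responses per major category.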
3.4 Evaluate Existing Data Quality Levels
Based upon the data quality attributes that were found in the previous step to be the most important, an organization can evaluate how well the current system satisfies data quality attributes. This current data quality assessment can be conducted in several ways, some of which are suggested below. The methods selected should match the type of data and user requirements for quality, as well as organizational characteristics. For better results, we recommend that at least two distinct methods be used. At this point the evaluation may be done manually or by using automated tools.
Figure 1 lists some relevant objective and subjective quality indicators that have been suggested in the literature [10] [13] as indicative of database quality. The first ten objective indicators (internal attributes) listed in Figure 1 are currently used in several automated tools (software) to help measure structural and representational database quality. The remaining two indicators are subjective internal attributes used either manually or in concert with automated tools to judge actual database quality and fitness for use. Database users also subjectively evaluate data quality and provide a high, medium, or low rating.
Figure 1. Indications for Measuring Database Quality
Objective Indications
1. Range of Values
2. Domain Values
3. Cyclic Redundancy Check
4. Units
5. Business Rules
6. Consistency Checks
7. Standard Definitions
8. Metadata Checks
9. Presence of Value
10. Linking (e.g., origin/source)
Subjective Indications
11. Subject Matter Expert Evaluation
12. User Evaluation
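As a rough illustration of how three of the objective indications in Figure 1 (presence of value, range of values, and domain values) might be checked automatically, consider the Python sketch below. The record layout, field names, valid ranges, and domain set are assumptions made for this example; they are not taken from any of the tools cited.

```python
# Hypothetical record layout and rules; none of these names come from the paper.
VALID_STATES = {"CO", "MA", "DC"}  # assumed domain of allowed values


def check_record(record):
    """Return the names of the objective indications the record fails."""
    failures = []
    # Presence of value: required fields must be non-empty.
    for field in ("client_id", "state", "age"):
        if record.get(field) in (None, ""):
            failures.append(f"presence:{field}")
    # Range of values: age must fall within an assumed plausible range.
    age = record.get("age")
    if isinstance(age, int) and not 0 <= age <= 120:
        failures.append("range:age")
    # Domain values: a non-empty state must be one of the allowed codes.
    state = record.get("state")
    if state not in (None, "") and state not in VALID_STATES:
        failures.append("domain:state")
    return failures


records = [
    {"client_id": 1, "state": "CO", "age": 42},
    {"client_id": 2, "state": "ZZ", "age": 150},
    {"client_id": 3, "state": "", "age": 30},
]
report = {r["client_id"]: check_record(r) for r in records}
# Fraction of records that pass every check.
pass_rate = sum(1 for failures in report.values() if not failures) / len(records)
```

A summary figure such as `pass_rate` is one possible objective measurement; the subjective indications (items 11 and 12) would still have to be gathered from experts and users.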
The next step, after taking measurements, is to use the measurements to evaluate current data quality. Again, there are several ways to do this, and the organization should select the method(s) that best suit its needs. We describe four methods below.
One method is a decision-analysis approach described by Kaomea. [4] In a given decision scenario this method computes the values of various data quality attributes. Kaomea claims that the value of a data quality attribute may be computed as the product of related sub-attributes. For example, the value of the attribute "accuracy" may be computed as the product of "source accuracy," "source credibility," and "data clarity." The values for other data quality attributes may be computed in the same way. Kaomea then would have data users construct a decision tree that models the data scenario under consideration and insert the computed data quality values into the decision tree. Later, Kaomea would have data users perform a sensitivity analysis by varying each sub-attribute value and comparing the outcomes.
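The product rule for an attribute value can be illustrated with a short Python sketch. The numeric scores and the 0-to-1 scale are assumptions made for this example; the description above does not fix a scale, and the decision-tree step is not modeled here.

```python
from math import prod


def attribute_value(sub_attributes):
    """Value of an attribute as the product of its sub-attribute values."""
    return prod(sub_attributes.values())


# Hypothetical sub-attribute scores on an assumed 0..1 scale.
accuracy_subs = {
    "source accuracy": 0.9,
    "source credibility": 0.8,
    "data clarity": 0.95,
}
accuracy = attribute_value(accuracy_subs)  # 0.9 * 0.8 * 0.95
```

Because the sub-attributes are multiplied, a single weak sub-attribute pulls the whole attribute value down, which is what makes the later sensitivity analysis informative.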
Another method of measuring data quality is by extending the Entity-Relationship (ER) or the Object-Oriented (OO) model to include quality attributes. [5] [12] The original theory proposed by Kon [5] defined a data quality parameter as a "qualitative or subjective dimension by which a user evaluates data quality. Source credibility and timeliness are examples." Kon also defined a data quality indicator as a "data dimension that provides objective information about the data. Source, creation time, and collection method are examples." Kon then extended the entity-relationship model to attach data quality parameters to entity attributes defined in the relations of Codd's relational model. [1] Kon advises adding data quality indicators (assigned values) to data quality parameters. Data users would therefore be able to estimate data quality from the value of data quality parameters. Figure 2 illustrates this approach.
Notice that Figure 2 is a standard ER diagram augmented with rectangles that represent data quality parameters and an indicator for each parameter. These symbols represent additional data, stored in the database, whose sole purpose is to indicate the quality of the associated attribute values. For example, the entity "Client" has an "address" attribute. Not only would the address be stored in the database, but a quality parameter ("timeliness," which is an indication of the age of this data) is also stored in the database. If someone who uses Client data is interested in how timely this address is, the user can check the age of the address to estimate its quality. Unlike the decision-analysis approach, Kon's method allows users to assess database quality.
Figure 2. Data Quality Extension of the Entity-Relationship Model [5]
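The idea of storing quality indicators alongside an attribute value can be sketched in Python as follows. This is an illustrative data structure only, not Kon's notation: the class name, the field names, and the one-year threshold used to rate timeliness are all assumptions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class TaggedValue:
    """An attribute value stored together with its quality indicators."""
    value: str
    source: str    # quality indicator: where the value came from
    created: date  # quality indicator: when the value was recorded

    def age_in_days(self, today: date) -> int:
        """Age of the value, derived from the creation-time indicator."""
        return (today - self.created).days


def timeliness(tagged: TaggedValue, today: date, max_age_days: int = 365) -> str:
    """Hypothetical rule mapping the age indicator to a timeliness rating."""
    return "current" if tagged.age_in_days(today) <= max_age_days else "stale"


# A Client address stored with its indicators (all values are made up).
address = TaggedValue(
    "12 Main St, Denver, CO",
    source="client form",
    created=date(1995, 3, 1),
)
rating = timeliness(address, today=date(1997, 1, 15))
```

A user interested in how timely the address is reads the stored indicator rather than guessing; the threshold that turns an age into a rating would be set by each organization.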
A third method of evaluating data quality is based on mathematics, as in Morey. [9] This work considers a data life cycle in an information management system, in particular a transaction-based system. This assessment method predicts database accuracy, which differs from assessing sensitivity to a particular attribute or assessing overall database quality.
Ishii [3] described a fourth method for measuring data quality. It is qualitative and is called a "reduction-based data quality calculus." Ishii's method focuses on qualitative attributes of data obtained from several data sources. Unlike the other three methods, Ishii's method uses subjective measurements. His derivation of overall data quality is based on an algorithm that uses special data quality values and their relationships to each other as input and computes an overall data quality value. The data quality attribute relationships are dictated by each user's needs.
3.5 Identify Remedies
By this point an organization should have a good model of its data, an understanding of user data quality requirements, and an assessment of current data quality. The organization must now develop alternative remedies and model the impact of each. This is accomplished by building on the work done in the previous steps and performing a sensitivity analysis on the current data quality level to determine which remedies produce the greatest positive results for the users' most important data needs. The method of performing this sensitivity analysis depends on which method was used in the previous step to measure current data quality. For instance, assume that Kaomea's decision-analysis approach [4] was used as an assessment method. Then we can compute the data quality attribute's value as the product of the values of its sub-attributes. We then conduct a sensitivity analysis by adjusting the values of the sub-attributes, one by one, and comparing the final outcome values. This measures the data's sensitivity to each of the sub-attributes whose values, in turn, may be adjusted. The data model and the prioritized list of data quality attributes indicate which values to adjust.
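A one-at-a-time sensitivity analysis over the product rule might look like the Python sketch below. The sub-attribute scores, the 0-to-1 scale, and the step size `delta` are assumptions chosen for illustration; a full treatment would compare outcomes of the decision tree rather than the raw product.

```python
from math import prod


def sensitivity(sub_attributes, delta=0.05):
    """Change in the product when each sub-attribute is raised by `delta` in turn."""
    baseline = prod(sub_attributes.values())
    effects = {}
    for name, value in sub_attributes.items():
        adjusted = dict(sub_attributes)
        adjusted[name] = min(1.0, value + delta)  # keep scores within 0..1
        effects[name] = prod(adjusted.values()) - baseline
    return baseline, effects


# Hypothetical current scores for the sub-attributes of "accuracy".
subs = {
    "source accuracy": 0.9,
    "source credibility": 0.6,
    "data clarity": 0.95,
}
baseline, effects = sensitivity(subs)
# The sub-attribute whose improvement raises the attribute value the most.
most_influential = max(effects, key=effects.get)
```

With these made-up scores, raising the weakest sub-attribute yields the largest gain, which suggests where a remedy would have the most effect.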
Once the organization determines which values most influence outcomes, corresponding remedies may be selected. If timeliness is an important data quality attribute for the organization and the sensitivity analysis shows that more up-to-date data have a large positive impact on quality, then a recommended remedy might be to update the database more frequently or distribute the updates to remote sites more often. This step clearly must be tailored to an organization's needs and constraints.
After an organization develops remedies, it must also develop a thorough cost/benefit analysis. A prioritized list of remedies then becomes the basis for improving the organization's data quality. We recommend that changes be implemented according to an organization's mission and culture.
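One simple way to turn cost and benefit estimates into a prioritized list is to rank remedies by their benefit/cost ratio. The Python sketch below assumes hypothetical remedies, estimates, and a budget, and uses a greedy selection that is only a heuristic; it is not a method prescribed by the framework.

```python
from dataclasses import dataclass


@dataclass
class Remedy:
    name: str
    benefit: float  # estimated gain, in any consistent unit
    cost: float     # estimated cost, in the same unit

    @property
    def ratio(self) -> float:
        return self.benefit / self.cost


def prioritize(remedies, budget):
    """Rank by benefit/cost ratio, then greedily pick what fits the budget."""
    ranked = sorted(remedies, key=lambda r: r.ratio, reverse=True)
    selected, remaining = [], budget
    for remedy in ranked:
        if remedy.cost <= remaining:
            selected.append(remedy.name)
            remaining -= remedy.cost
    return [r.name for r in ranked], selected


# Hypothetical remedies with made-up estimates.
remedies = [
    Remedy("update database weekly", benefit=40.0, cost=10.0),
    Remedy("add last-verified date field", benefit=30.0, cost=20.0),
    Remedy("distribute updates to remote sites daily", benefit=50.0, cost=25.0),
]
ranked, selected = prioritize(remedies, budget=35.0)
```

Remedies left out of `selected` are not discarded; they remain on the ranked list and can be reconsidered when the budget or the estimates change.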
3.6 Remeasure and Iterate
After putting changes in place an organization should re-evaluate its data quality using the same approaches as the initial assessment. This will allow the organization to determine the impact of remedies and perhaps refine a list of future remedies. We believe that organizations should be aware that data quality requirements change over time and not all changes may be correctly anticipated. Cost/benefit ratios may prohibit immediate implementation of a remedy. Over time the ratio may change, making a marginally beneficial remedy more attractive later.
We've presented a general Data Quality Engineering Framework for defining, measuring, and improving an organization's data quality levels. We've also shown a step-by-step process for implementing this framework together with suggestions for particular methods to use in various steps of the process. This framework can be customized, and we have provided some suggestions for tailoring it to the needs of various organizations. The framework takes into account the needs of current and future data users.
1. E.F. Codd, "A Relational Model of Data for Large Shared Data Banks," Communications of the ACM, Vol. 13, no. 6 (1970): 377-387.
2. C.A. Dvorak and J.S. Richters, "A Framework for Defining the Quality of Communications Services," IEEE Communications Magazine, (October 1988): 17-23.
3. A. Ishii, Y. Jang, and R. Wang, "A Qualitative Approach to Automatic Data Quality Judgment," Massachusetts Institute of Technology (MIT) Sloan School of Management, Cambridge, MA, TDQM-93-02, (May 1993).
4. P. Kaomea, "Valuation of Data Quality: A Decision Analysis Approach," Massachusetts Institute of Technology (MIT) Sloan School of Management, Cambridge, MA, TDQM-94-09, (September 1994).
5. H. Kon, S. Madnick, and R. Wang, "Data Quality Requirements Analysis and Modeling," Massachusetts Institute of Technology (MIT) Sloan School of Management, Cambridge, MA, TDQM-92-03, (December 1992).
6. A. Levitin and T. Redman, "Quality Dimensions of a Conceptual View," Massachusetts Institute of Technology (MIT) Sloan School of Management, Cambridge, MA, TDQM-95-04, (February 1995).
7. S. Madnick, "Integrating Information from Global Systems: Dealing with the 'On-and-Off-Ramps' of the Information Superhighway," Massachusetts Institute of Technology (MIT) Sloan School of Management, Cambridge, MA, TDQM-95-12, (February 1995).
8. A. McKeating, "Quality Can Stop Dirty Data," Computerworld, Vol. 26, no. 49, (1992): 33.
9. R.C. Morey, "Estimating and Improving the Quality of Information in a MIS," Communications of the ACM, Vol. 25, no. 5, (1982): 337-42.
10. SRA Corporation, "Data Quality Engineering: Imposing Order on Chaos," Briefing Slides, SRA Corporation, Washington, DC, 1996.
11. D. Strong and R. Wang, "Beyond Accuracy: What Data Quality Means to Data Consumers," Massachusetts Institute of Technology (MIT) Sloan School of Management, Cambridge, MA, TDQM-94-10, (October 1994).
12. S.Y. Tu and R. Wang, "Modeling Data Quality and Context Through Extension of the ER Model," Massachusetts Institute of Technology (MIT) Sloan School of Management, Cambridge, MA, TDQM-93-13, (October 1993).
13. Vality Technology, "Product Brief: The Integrity Data Re-engineering Tool," Vality Technology, Boston, MA, (July 1996).
14. Y. Wand and R. Wang, "Anchoring Data Quality Dimensions in Ontological Foundations," Massachusetts Institute of Technology (MIT) Sloan School of Management, Cambridge, MA, TDQM-94-03, (June 1994).