DATA QUALITY September 1998

Volume 4 Number 1  Copyright 1998


A Hierarchical Approach to Improving Data Quality

Marcey L. Abate, Kathleen V. Diegert, Sandia National Laboratories and Heather W. Allen, Heather Allen & Associates

Key Words:  Attributes, Dimensions, Categories, Assessment, Lewycky Model, Weaknesses, Deficiencies

Abstract:

Defining data quality in terms of conformance and utility, we show that it is a multi-dimensional and hierarchical concept. Using a hierarchy proposed in the recent literature, we draw parallels to the Lewycky Model, which is commonly used in root cause analysis. We believe that this parallel indicates that data quality problems can be effectively assessed and remedied using a structured approach. By examining data quality problems within a hierarchical framework, we show that data quality cannot be improved independently of the process producing the data or the context in which the data is to be used. Our results imply that purely technological approaches, in the form of technology driven solutions, are necessary, but not sufficient, to provide sustained data quality improvements. To ensure long term data quality, better data acquisition and processing system design is essential.

Contents

  1. Introduction
  2. Definition of Data Quality
  3. Assessing Data Quality
  4. Data Quality Improvements
  5. Conclusion
  6. References

1. Introduction

Recent advances in technology have allowed organizations to create, store, and process massive amounts of data. As data are increasingly used to support organizational activities, poor data quality may negatively affect organizational effectiveness and efficiency. Such negative impacts are often manifested as direct and indirect costs. Examples of such costs (U.S. Dept. of Defense, 1997) include recurring costs for preventing, appraising, and correcting data errors, and for prevention-based training. Despite these potential costs, we still find it common for organizations to ignore the impact of poor data quality on their activities. We have found, for example, that it is still common for organizations to impose goals of 100% customer satisfaction, no defects, and no accidents upon their employees, with no explicit recognition that the data that will be used to achieve these goals typically suffer various deficiencies. We believe that only by remedying data deficiencies can organizations' goals and activities be fully realized.

To remedy these deficiencies, one must approach data quality in a structured manner. Our paper therefore investigates a structure proposed in the recent literature and discusses several data quality solutions and improvements. It is clear to us that data quality cannot be improved independently of the processes producing the data, nor of the context in which the data is to be used. While this seems obvious, it has generally not been recognized by those who espouse traditional data quality methods. For example, edit checks are commonly implemented in new data quality hardware, software, and interfaces. Such purely technological approaches are necessary, but not sufficient, to provide sustained data quality improvements. To ensure consumers receive long term data quality improvements, data producers must improve data-related processes. Before discussing data quality assessment and improvement, we define data quality.

2. Definition of Data Quality

A standard definition of data quality does not exist in the current literature (Kon et al. 1993; Firth and Wang 1993; Kaomea 1994). Based on our survey of recent literature, we believe that the International Organization for Standardization (ISO) supplies an acceptable definition of data quality using accepted terminology from the quality field. ISO is a federation of national standards bodies. Its working groups, drawn from most of the world's nations, forge international agreements which are published as International Standards. These standards are documented agreements containing technical specifications or other precise criteria to be used consistently as rules, guidelines, or definitions of characteristics, to ensure that materials, products, processes and services are fit for their purpose. Like other ISO standards, ISO quality standards are frequently updated to reflect advances in quality methodology.

Among the many ISO standards is ISO 8402: Quality Management and Quality Assurance Vocabulary. ISO 8402 formally defines quality as "the totality of characteristics of an entity that bear on its ability to satisfy stated and implied needs." Thus, we can define data to be of the required quality if it satisfies the requirements stated in a particular specification and the specification reflects the implied needs of the user. In other words, an acceptable level of quality has been achieved if the data conforms to a defined specification and the specification correctly reflects the intended use. We refer to these two conditions as conformance and utility, respectively. We believe that making conformance and utility the qualifying indicators for data quality is consistent not only with the ISO 8402 standard but also with traditional references which describe quality as "fitness for use", "fitness for purpose", "customer satisfaction", or "conformance to requirements" (Juran 1989; Crosby 1980).

We believe conformance and utility assessment, and thus data quality assessment, is enabled when stated and implied needs are translated into a set of characteristics having specific criteria. Structured analysis of these characteristics, together with careful planning, should provide a data quality assessment that reveals key data quality problems, root causes for the problems, and solutions for improving both conformance and utility. The following section provides a framework within which to implement such a data quality assessment.
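As a minimal sketch of this translation, the following Python fragment represents a specification as a set of named criteria and keeps the two questions separate: whether the data conform to the specification, and whether the specification covers the needs of the intended use. The record fields, criteria, and need names are hypothetical and were chosen only for illustration; they do not come from ISO 8402 or from the cited literature.

```python
from typing import Callable, Dict, List, Set

Record = Dict[str, object]
Criterion = Callable[[Record], bool]

# Hypothetical specification: each stated need is translated into one testable criterion.
specification: Dict[str, Criterion] = {
    "id_present": lambda r: r.get("id") is not None,
    "reading_in_range": lambda r: isinstance(r.get("reading"), (int, float))
    and 0 <= r["reading"] <= 100,
}


def conformance(records: List[Record], spec: Dict[str, Criterion]) -> Dict[str, float]:
    """Fraction of records that satisfy each criterion in the specification."""
    total = len(records)
    return {
        name: (sum(1 for r in records if test(r)) / total if total else 0.0)
        for name, test in spec.items()
    }


def uncovered_needs(needs: Set[str], spec: Dict[str, Criterion]) -> Set[str]:
    """Needs of the intended use that no criterion addresses (a utility gap)."""
    return needs - set(spec)


# Hypothetical data: one record lacks an id, one reading is out of range.
records: List[Record] = [
    {"id": 1, "reading": 42.0},
    {"id": None, "reading": 17.5},
    {"id": 3, "reading": 250.0},
    {"id": 4, "reading": 88.0},
]

# Hypothetical needs; "reading_is_recent" has no matching criterion in the specification.
needs = {"id_present", "reading_in_range", "reading_is_recent"}

print(conformance(records, specification))
print(uncovered_needs(needs, specification))
```

In this sketch the data could conform perfectly to the specification and still be of inadequate quality, because the specification itself omits one of the user's needs.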

3. Assessing Data Quality

Data Quality Attributes

A set of characteristics, or data quality attributes, is required for the objective and measurable assessment of conformance and utility, and hence for the assessment of data quality. Commonly used attributes to measure data quality include accuracy, completeness, consistency, reliability, timeliness, uniqueness, and validity. However, we feel that initially assessing data quality by using specific attributes suffers from a number of deficiencies.

Clearly, depending on user requirements, the appropriate set of attributes and attribute levels may differ. But even if an appropriate set can be defined, it is likely that interdependencies will exist among attributes. For example, data that are generated or arrive too late are untimely and may also leave the data set unacceptably incomplete. Similar relationships exist between timeliness and accuracy, and between completeness and accuracy (Kon et al. 1993). Such interdependencies may make it difficult, if not impossible, to define a minimal and orthogonal set of data quality attributes. Finally, analyzing specific subsets of attributes may make it difficult to identify systemic data quality problems, because the same root cause may precipitate a number of similar data quality problems. Failure to recognize such patterns may lead to missed opportunities for improvement. These ideas are consistent with the use of hierarchical models in accident and root cause analysis (Leveson 1995; Wilson et al. 1993).
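The interdependency between timeliness and completeness can be illustrated with a small sketch. The fragment below computes two attribute-level measures for a hypothetical batch of expected records. The record layout, the cutoff hour, and the two measure definitions are assumptions made for illustration; they are not taken from the cited literature.

```python
from typing import List, Optional, Tuple

# Each hypothetical record: (identifier, arrival hour, or None if it never arrived).
Batch = List[Tuple[str, Optional[int]]]


def timeliness(batch: Batch, cutoff_hour: int) -> float:
    """Fraction of the records that arrived which did so by the cutoff."""
    arrived = [hour for _, hour in batch if hour is not None]
    if not arrived:
        return 0.0
    return sum(1 for hour in arrived if hour <= cutoff_hour) / len(arrived)


def completeness_at(batch: Batch, cutoff_hour: int) -> float:
    """Fraction of the expected records that are present at the cutoff."""
    if not batch:
        return 0.0
    present = sum(1 for _, hour in batch if hour is not None and hour <= cutoff_hour)
    return present / len(batch)


# Record "c" arrives after the cutoff; record "d" never arrives.
batch: Batch = [("a", 8), ("b", 9), ("c", 14), ("d", None)]
# Same batch, except that "c" now arrives on time.
batch_fixed: Batch = [("a", 8), ("b", 9), ("c", 10), ("d", None)]

for label, b in (("late", batch), ("fixed", batch_fixed)):
    print(label, timeliness(b, 12), completeness_at(b, 12))
```

Changing only the arrival time of record "c" moves both measures at once, which is why the two attributes cannot be treated as independent in this example.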

Data Quality Dimensions

Rather than approaching data quality assessment at an attribute level, we believe an approach is needed that organizes attributes so as to better identify data quality problems. Such an organization would naturally group related attributes together. Again, this approach is consistent with accident and root cause analysis, and is applied in Wang et al. (1994), where the authors define a "data quality dimension" as a set of data quality attributes that represent a single aspect or construct of data quality. Grouping attributes into dimensions offers several advantages.

Thus, by organizing attributes into data quality dimensions, many difficulties encountered when dealing with singular attributes can be effectively addressed. Dimensions are not only more comprehensive than individual attributes; they also organize and reduce the material that must be comprehended. Moreover, by analyzing dimensions, a data quality researcher may discover systemic root causes of data errors. Considering isolated, singular attributes may lead a researcher to analyze only errors that occur infrequently and thus have a specific (or rare) cause.

Wang et al. (1994) discuss how to construct specific data quality dimensions. The authors first gathered 179 data quality attributes from the data quality literature, from researchers, and from consumers. They then used factor analysis to collapse this list of attributes into fifteen data quality dimensions. Table 1 shows the dimensions, with a brief description of each.

Table 1: Data Quality Dimensions and Descriptions
Dimension                     Description
Access Security               Access to data must be restricted, and hence, kept secure.
Accessibility                 Data must be available or easily and quickly retrievable.
Accuracy                      Data must be correct, reliable, and certified free of error.
Appropriate Amount of Data    The quantity or volume of available data must be appropriate.
Believability                 Data must be accepted or regarded as true, real, and credible.
Completeness                  Data must be of sufficient breadth, depth, and scope for the task at hand.
Concise Representation        Data must be compactly represented without being overwhelming.
Ease of Understanding         Data must be clear, without ambiguity, and easily comprehended.
Interpretability              Data must be in appropriate language and units, and the data definitions must be clear.
Objectivity                   Data must be unbiased (unprejudiced) and impartial.
Relevancy                     Data must be applicable and helpful for the task at hand.
Representational Consistency  Data must always be presented in the same format and compatible with previous data.
Reputation                    Data must be trusted or highly regarded in terms of their source or content.
Timeliness                    The age of the data must be appropriate for the task at hand.
Value-Added                   Data must be beneficial and provide advantages from their use.

Whereas attributes represent the lowest level at which data quality problems can be identified and understood, the dimensions shown in Table 1 represent a second, higher level of understanding. That is, attributes represent the lowest level mechanism by which data problems become apparent, and dimensions represent conditions whose presence or absence allows data quality problems to become apparent at the attribute level. This is important to recognize, given that a data quality assessment typically has the overall goals of assessing data quality problems, identifying root causes for data quality problems, and improving data quality. Achieving these goals would therefore involve assessing and identifying data deficiencies at a low level by analyzing attributes, and then identifying problem areas by grouping attributes according to the appropriate dimension. Stopping an analysis and drawing conclusions at the attribute level may thus prevent the identification of broader, systemic problems. Again, a hierarchical approach to pattern recognition for problem identification and solution is consistent with traditional root cause analysis and will be further developed below.

Data Quality Categories

Although data may have quality problems in one dimension while being satisfactory in others, it is possible that a single root cause could precipitate problems in multiple dimensions. As with attributes, it appears that an inherent grouping of dimensions may exist that would further help researchers recognize patterns of data quality problems. Wang et al. (1994) observed that dimensions seem to form several natural families, or categories, as shown in Table 2.

Table 2: Categories of Data Quality Dimensions
Category          Dimensions                        Deficiencies May Indicate
Intrinsic         Accuracy, Objectivity,            A lack of process or weakness in the current process
                  Believability, Reputation         for creating data values that correspond to the
                                                    actual or true values.

Contextual        Value-Added, Relevancy,           A lack of process or weakness in the current process
                  Timeliness, Completeness,         for producing data pertinent to the tasks of the
                  Appropriate Amount of Data        user.

Representational  Interpretability,                 A lack of process or weakness in the current process
                  Ease of Understanding,            for supplying data that are intelligible and clear.
                  Representational Consistency,
                  Concise Representation

Accessible        Accessibility, Access Security    A lack of process or weakness in the current process
                                                    for providing readily available and obtainable data.

Using the category definitions from Wang et al. (1994), one can infer that a pattern of deficiencies within dimensions categorized as intrinsic may indicate either that a process does not exist to create data values that correspond to the actual or true values, or that an existing process suffers from a weakness that prevents it from doing so. Similarly, a pattern of deficiencies within dimensions categorized as contextual may indicate a current process weakness or the lack of a process for creating data pertinent to user tasks. A pattern of deficiencies in representational dimensions suggests that a process for supplying intelligible and clear data does not exist, or that the current process is not capable of doing so. A pattern of deficiencies in those dimensions categorized as accessible may reveal that no process exists to make data readily available and obtainable, or that the current process does not do so.
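The grouping in Table 2 lends itself to a simple lookup. The following Python sketch transcribes the categories and their dimensions from Table 2 and groups a set of deficient dimensions under their categories. The function name and the example assessment result are hypothetical, and deciding how many deficient dimensions constitute a "pattern" is left to the analyst; the sketch only performs the grouping.

```python
from typing import Dict, Iterable, List

# Categories and their dimensions, transcribed from Table 2.
CATEGORY_DIMENSIONS: Dict[str, List[str]] = {
    "Intrinsic": ["Accuracy", "Objectivity", "Believability", "Reputation"],
    "Contextual": ["Value-Added", "Relevancy", "Timeliness", "Completeness",
                   "Appropriate Amount of Data"],
    "Representational": ["Interpretability", "Ease of Understanding",
                         "Representational Consistency", "Concise Representation"],
    "Accessible": ["Accessibility", "Access Security"],
}

# The process that may be lacking or weak for each category (from Table 2).
PROCESS_PURPOSE: Dict[str, str] = {
    "Intrinsic": "creating data values that correspond to the actual or true values",
    "Contextual": "producing data pertinent to the tasks of the user",
    "Representational": "supplying data that are intelligible and clear",
    "Accessible": "providing readily available and obtainable data",
}


def deficiencies_by_category(deficient_dimensions: Iterable[str]) -> Dict[str, List[str]]:
    """Group deficient dimensions under their categories, omitting empty categories."""
    flagged = set(deficient_dimensions)
    grouped: Dict[str, List[str]] = {}
    for category, dimensions in CATEGORY_DIMENSIONS.items():
        hits = [d for d in dimensions if d in flagged]
        if hits:
            grouped[category] = hits
    return grouped


# Hypothetical result of a dimension-level assessment.
found = ["Timeliness", "Completeness", "Accuracy"]

for category, dims in deficiencies_by_category(found).items():
    print(f"{category} ({', '.join(dims)}): may indicate a lack of process or "
          f"weakness in the current process for {PROCESS_PURPOSE[category]}")
```

Two deficient dimensions falling in the same category, as Timeliness and Completeness do here, is the kind of grouping that may point to a single weakness in the underlying process.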

The categories shown in Table 2 can be interpreted as the third level for understanding data quality problems. At the third level of understanding, one would find the constraints or weaknesses of processes that allowed the conditions at the second level to occur or exist. A problem analysis using the data quality categories in Table 2 may indicate constraints on, or weaknesses in, the processes by which data is acquired and manipulated, and these processes are responsible for the presence or absence of second level conditions. A third level is necessary given the previously stated overall goals of assessing data quality problems, identifying root causes of problems, and improving data quality: it is the level at which root causes related to the processes that acquire and manipulate data can be identified and interpreted. Of course, this identification requires integrating the information gathered at each level of the hierarchy. Organizing information up to the third level provides an opportunity to recognize patterns of recurring problems that affect data quality. A hierarchical approach to pattern recognition for problem solving is consistent with traditional root cause analysis, as we will show.

A Hierarchy for Data Quality Assessment

Referring to the relationships among attributes, dimensions, and categories as shown in Wang et al. (1994), we use Figure 1 to illustrate a hierarchy for analyzing data quality problems. In this figure, the lowest boxes denote attributes that must be specified for a data quality assessment, the second level contains the dimensions, and the third level denotes the categories. Although future research is needed to confirm the validity of particular categories and dimensions, we believe the hierarchy forms a valid basis for implementing a data quality assessment.

Figure 1: Hierarchy for Data Quality Problems

Previously, we mentioned that using a hierarchy for pattern recognition is consistent with accepted accident analysis methods. In particular, Peter Lewycky proposed a three-level model for understanding accidents by organizing causality (Lewycky 1987; Leveson 1995). Leveson (1995) describes the Lewycky model as having three levels: a lowest level of understanding causality that describes the accident's mechanism; a second level that includes the presence or absence of conditions that allowed the events at the first level to occur; and a third level that includes constraints, or the lack of constraints, that allowed the conditions at the second level to cause the events at the first level, or that allowed those conditions to exist at all. According to Leveson, the third level may include weaknesses or constraints on technical and physical conditions, social dynamics and human acts, management systems and organizational culture, and governmental or socioeconomic policies. Figure 2 shows a graphical representation of this hierarchical accident model.

Figure 2: Lewycky Hierarchical Model for Accident Causes

We alluded earlier to the relationships between the Lewycky model and the hierarchical model for data quality problems; we restate them here for clarity. Data quality attributes represent the lowest level mechanism by which data quality problems become apparent. Thus, attributes map to Level 1 of the Lewycky model. Because data quality dimensions represent conditions whose presence or absence allows data quality problems to become apparent at the attribute level, dimensions can be viewed as corresponding to Level 2 in the Lewycky model. Finally, the organization of information according to data quality categories may indicate constraints on the existence of, or weaknesses in, the processes that acquire and manipulate data. These processes are responsible for the presence or absence of second level conditions. So, data quality categories may be viewed as analogous to Level 3 in the Lewycky model.

Applying a structured hierarchical approach like the Lewycky model to accident analysis has proven effective in identifying and eliminating the root causes of accidents. Our goal in writing this paper is to suggest that the data quality hierarchy presented in Figure 1 can likewise aid in identifying and eliminating the root causes of data quality problems, and in suggesting improvement opportunities. This assumption is justified by our observation that only by working through the data quality hierarchy to categorize dimensions at the third level can the key questions about root causes be answered.

Thus, we believe applying a data quality hierarchy involves gathering information at the attribute level and organizing this information at the dimension and category levels. When the information at each level of the data quality hierarchy is integrated, the answers to the previously posed questions, and thus the recognition of patterns in data quality problems, may become apparent. The ability to gather and integrate information is necessary to identify and treat the root causes of data quality problems, rather than only fixing a specific causal factor. Once the root causes are identified, appropriate solutions for remedying the problem can be proposed and implemented. We next discuss the connection between the data quality hierarchy and improvements.

4. Data Quality Improvements

As in accident analysis, effective improvements will remedy root causes by acting upon an underlying process or system, rather than just fixing a causal factor. In the Lewycky model, these improvements are associated with systems, cultures, and policies, and are revealed at Level 3 of the model. Similarly, data quality improvements are associated with the systems or processes whose weaknesses are revealed at the category level of the data quality hierarchy. This is because investigation of the dimensions within each category shows that the weaknesses revealed by each of the four categories correspond to either a utility or a conformance deficiency (and thus a data quality problem).

Utility deficiencies indicate a needed change to the process or system used to acquire data. Conformance deficiencies may require changes to the processes used to both acquire and manipulate data. Changing the process by which data is acquired can be considered an alteration to a fundamental process, while changing the process by which data is manipulated involves altering only technological processes. The methods by which technological processes are altered can be labeled "technology driven" approaches to improvement. Those methods used to alter the fundamental processes can be labeled "data driven" approaches. The relationship between the weaknesses revealed at the category level of the data quality hierarchy and the appropriate approach to improvement is represented in Figure 3.

Figure 3: Relationship Between Level 3 Weaknesses and Approaches to Improvements

Figure 3 shows that there are two basic categories of improvement opportunities: those that require the alteration of fundamental processes and those that require the alteration of technological processes. Each kind of process is altered in its own way. In the terminology of the FY97 FIRM proposal (Sandia National Laboratories, 1997), improvements that require altering technological processes call for technology driven solutions, while those that require altering fundamental processes call for data driven solutions. Examples of technology driven solutions are the installation of new hardware or Commercial Off the Shelf (COTS) software, or the development of a new interface. Data driven solutions may involve statistical analysis, human factors engineering, requirements engineering, and modeling and simulation.

Figure 3 also shows that to improve data utility it is necessary to implement data driven solutions, because applications of technology (technology driven solutions) will not affect data acquisition processes. Although it is intuitive that the utility of data cannot be improved independently of the process producing the data, traditional approaches to solving data quality problems have often involved only the application of technology driven solutions. In part, this may be because altering technological processes is typically straightforward, although potentially expensive. While it is true that technology driven solutions can address some conformance issues, sustained data quality improvement also requires the application of data driven solutions.
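The relationships described in this section can be sketched as two small lookups: one from the type of deficiency to the processes that may need to change, and one from each process to the corresponding improvement approach. The Python names below are hypothetical and chosen for illustration; the mappings themselves restate the text of this section and are not a reproduction of Figure 3.

```python
from typing import Dict, FrozenSet

DATA_DRIVEN = "data driven"
TECHNOLOGY_DRIVEN = "technology driven"

# Processes that a deficiency may require changing, as stated in this section:
# utility deficiencies concern data acquisition; conformance deficiencies may
# concern both acquisition and manipulation.
PROCESSES_TO_CHANGE: Dict[str, FrozenSet[str]] = {
    "utility": frozenset({"acquire"}),
    "conformance": frozenset({"acquire", "manipulate"}),
}

# Acquisition is a fundamental process (data driven approaches);
# manipulation is a technological process (technology driven approaches).
APPROACH_FOR_PROCESS: Dict[str, str] = {
    "acquire": DATA_DRIVEN,
    "manipulate": TECHNOLOGY_DRIVEN,
}


def candidate_approaches(deficiency: str) -> FrozenSet[str]:
    """Improvement approaches that may be needed for a given type of deficiency."""
    return frozenset(APPROACH_FOR_PROCESS[p] for p in PROCESSES_TO_CHANGE[deficiency])


for deficiency in ("utility", "conformance"):
    print(deficiency, sorted(candidate_approaches(deficiency)))
```

In this sketch a technology driven approach never appears on its own: it is a candidate only for conformance deficiencies, and even there alongside a data driven approach.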

5. Conclusion

We have shown that data quality, defined in terms of conformance and utility, is a multi-dimensional and hierarchical concept. We also showed that a hierarchy proposed in the recent literature is compatible with a well-accepted hierarchical model for root cause analysis. This indicates that data quality problems may also be effectively assessed and remedied by using a structured approach.

Placing data quality problems within a hierarchical framework also revealed that data quality cannot be improved independently of the process that produced the data, nor of the context in which the data is to be used. This implies that purely technological approaches, in the form of technology driven solutions, are necessary, but not sufficient, to provide sustained data quality improvements. To ensure long term data quality improvements, research efforts should also include data driven solutions directed at processes and systems to ensure both conformance and utility.

6. References

Crosby, Philip. 1980. Quality is free: The art of making quality certain. New York: Mentor.

Firth, Chris, and Richard Wang. 1993. Closing the data quality gap: Using ISO 9000 to study data quality. Working paper TDQM-93-03, MIT TDQM Research Program, E53-320, 50 Memorial Drive, Cambridge, MA 02139.

Juran, J. M. 1989. Juran on leadership for quality: An executive handbook. New York: Free Press.

Kaomea, Peter. 1994. Valuation of data quality: A decision analysis approach. Working paper TDQM-94-09, MIT TDQM Research Program, E53-320, 50 Memorial Drive, Cambridge, MA 02139.

Kon, Henry, Jacob Lee, and Richard Wang. 1993. A process view of data quality. Working paper TDQM-93-01, MIT TDQM Research Program, E53-320, 50 Memorial Drive, Cambridge, MA 02139.

Leveson, Nancy. 1995. Safeware: System safety and computers. New York: Addison-Wesley.

Lewycky, Peter. 1987. Notes toward an understanding of accident causes. Hazard Prevention (March/April): 6-8.

Redman, Thomas C. 1995. Improve data quality for competitive advantage. Sloan Management Review 36 (Winter): 99-107.

Redman, Thomas C. 1992. Data quality: Management and technology. New York: Bantam Books.

Sandia National Laboratories. 1997. FSAS information requirements modeling fiscal year 1997. FIRM proposal submitted to the Federal Aviation Administration.

Strong, Diane, Yang Lee, and Richard Wang. 1994. The life of data quality projects. Working paper, MIT TDQM Research Program, E53-320, 50 Memorial Drive, Cambridge, MA 02139.

U.S. Dept. of Defense. 1997. DoD Guidelines on data quality management. Defense Information Systems Agency.

Wang, R.Y., and D. Strong. 1996. Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems 12 (4): 5-34.

Wang, Richard, Diane Strong, and Lisa Guarascio. 1994. An empirical investigation of data quality dimensions: A data consumer's perspective. Working paper TDQM-94-01, MIT TDQM Research Program, E53-320, 50 Memorial Drive, Cambridge, MA 02139.

Wilson, Paul F., Larry D. Dell, and Gaylord F. Anderson. 1993. Root cause analysis: A tool for total quality management. Milwaukee, Wisc.: ASQC Quality Press.


This work was supported by the United States Department of Energy under contract DE-AC04-94AL85000. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the U.S. Department of Energy.