DATA QUALITY   September 1998

Volume 4  Number 1    Copyright 1998


Continuously Improving Data Quality in Persistent Databases

Paul L. Bowen, University of Queensland; David A. Fuhrer, Consultant; Frank M. Guess, University of Tennessee

Key Words: Continuous Improvement, Process Management, Databases, Transactions

Abstract:

Data quality problems lead to bad decisions, monetary losses, and other negative consequences for stakeholders. In this paper, we discuss ways organizations have used statistical techniques to continuously improve data quality in persistent databases such as fixed assets, inventory, accounts receivable, and customer information.  We apply statistical process improvement techniques to the concept of a data quality service objective (i.e., data quality over the service life of a dataset). We describe the relationship between statistical process management and transaction processing.  We present a procedure that can be used to test whether or not the data quality service objective for a persistent database is being met. We show how organizations can anticipate and prevent data problems and continuously improve data quality. We use an inventory database to illustrate how our concepts and procedures can be applied to both transaction processing and to persistent databases themselves. Organizations that use different types of persistent databases can use our approach for continuously improving data quality. Implementing the strategies described in this paper can help managers develop a culture of data quality improvement.

Contents

  1. Introduction
  2. Continuously Improving Data Quality
  3. Related Literature
  4. Implementing Statistical Process Control and the Data Quality Service Level
  5. Testing Current Data Quality
  6. Summary
  7. References
  8. Figures

1. Introduction

Data quality problems may cause serious financial problems for organizations. Data quality problems recently cost a fiber-optics manufacturer $500,000 when a mislabelled shipment caused the wrong cable to be laid along the bottom of a lake, caused a brokerage firm to lose $500 million when a dealer entered an incorrect exchange rate, and caused the U.S. government to lose over $2 billion in federal loan monies (Firth 1996). Organizations increasingly rely on their information systems to integrate and support their business processes (Wang and Kon 1993). These information systems and the quality of the data they contain affect customers' perceptions of the quality of purchased products and services (Svanks 1988; Wang, Storey, and Firth 1995; Wang and Strong 1996). Unfortunately, some databases contain many errors (Johnson, Leitch, and Neter 1981; Morey 1982; Laudon 1986; Ballou and Pazer 1985). Inaccurate data reduce the value of information systems and lead to poor decisions (Hilton 1979; Hilton 1981; Hilton et al. 1981).

For more than 70 years, statistical quality techniques have been used to monitor and improve product quality in manufacturing (Shewhart 1925, 1931). Recently, it has become apparent that applying these techniques can provide continuous improvements to all organizational processes, including information systems (Caby, Pautke, and Redman 1995; Huh et al. 1990; Redman 1994). Improving data quality is becoming even more important because information systems are crucial for organizations to exploit the opportunities that information science and information technology will offer during the next decade (Drucker 1995).

This paper illustrates how statistical process control (SPC) techniques can be used to continuously improve data accuracy in persistent databases such as fixed assets, inventory, accounts receivable, and customer information. For example, organizations can use cause and effect diagrams and Pareto charts to implement data improvement processes (see Figures 1 through 4). Using graphical tools can produce iterative, continuous process improvements. These are shown in a p-chart (Figure 5) and a data quality service level chart (Figure 6), which is a chart of data quality over a dataset's service life.

We next outline the steps required to continuously improve data quality. Our third section reviews related data quality and quality control literature. In the fourth section we discuss how quality control of transaction processing relates to data quality in an inventory database. In section five we present a sampling and testing procedure to estimate the current data quality service level and to determine when that minimum data quality service level is reached. The last section summarizes our paper.

2. Continuously Improving Data Quality

Improving data quality in persistent databases requires preventing errors in input transactions, preventing errors in the persistent data, and estimating the data quality of the persistent data. First, to prevent transaction errors, input errors can be investigated and the findings used to construct cause and effect diagrams. Pareto charts can be used to show how frequently individual errors occur, so managers can prioritize and plan improvements. P-charts, which measure the percent of nonconforming items in a group, can show the accuracy rate of records input over time. Data quality will continuously improve if organizations not only employ statistical tools but also work with data suppliers to prevent non-conforming data from being input into databases.

Second, to prevent errors in the persistent data, records in the database can be sampled periodically. Errors discovered during these investigations are likely to reveal additional error sources that can be shown on cause and effect diagrams and Pareto charts. Analyzing these diagrams and charts can lead to further improvements in input controls and organizational procedures.

Third, databases model reality. As errors occur in input transactions and events occur in other processes (e.g., damage or theft of items in inventory), recorded values in the persistent data diverge from true values. Data quality in persistent databases can be estimated by taking a sample and determining how many records have values that differ from the true values. According to this quality methodology, when the estimated percentage of erroneous records becomes too large (i.e., when the database fails to meet the current minimum data quality service level), an investigation (or audit) of the "real world" entities (values) represented in the database can be conducted and data quality restored to an acceptable level. That is, errors in persistent databases can be corrected or cleared by determining the correct data values and making the necessary corrections (Bowen 1992). Obviously, the expected benefits from conducting the audit or clearing (e.g., better decisions or enhanced customer relations) must exceed the cost. For high-volume, transaction-only databases, clearings may not be economically feasible.

3. Related Literature

3.1 Data Quality

History teaches that organizations experience data quality problems. Neter and Loebbecke (1975) reported that, in one medium-sized manufacturing company, 155 of the 217 unique inventory item lines audited contained errors. Using data provided by a major accounting firm, Ramage, Krieger, and Spero (1979) found that twelve of the twenty-nine inventory cases they examined had error rates above 25%. Data quality problems in other accounting databases have been reported by Kinney (1979); Hylas and Ashton (1982); Willingham and Wright (1985); Ham, Losell, and Smieliauskas (1985); Kreutzfeldt and Wallace (1986); and Icerman and Hillison (1991). Other important databases that were found to have data quality problems included criminal justice (Laudon 1986) and military systems (Morey 1982).

Properties associated with data quality include accuracy, timeliness, and relevance (Hilton 1979; Specht 1986). Information economics research demonstrated that accuracy is a major determinant of an information system's value (Feltham 1972; Itami 1977; Hilton, Swieringa, and Hoskin 1981). Recent research indicates that data accuracy is becoming more rather than less important (Fox, Levitin, and Redman 1994).

The concern for data quality is reflected in the attempts to model, improve, and define it. Several researchers have developed quantitative models to help auditors, controllers, and information system developers evaluate, improve, and manage data accuracy (Yu and Neter 1973; Cushing 1974; Hamlen 1980; Ballou and Pazer 1985a, 1985b; Morey 1982; Paradice and Fuerst 1991; Bowen 1992). These models provide assistance for identifying, evaluating, monitoring, and managing input controls. Other researchers have begun to define, model, and measure the quality of data, including accuracy, accumulated in information systems (Dillard 1992; Jang et al. 1992; Wang and Kon 1993; Wang and Reddy 1993; Fox et al. 1994; Wand and Wang 1996). Some researchers advocate using rule-based integrity constraints (Parsaye and Chignell 1993), integrity analysis (Svanks 1988), total data quality management (Wang and Kon 1993), and table level control procedures (Bowen, Schneider, and Fields 1995) to improve incoming and accumulated data quality. In particular, Huh et al. (1990) and Redman (1994) advocate using statistical process control (SPC) to measure, control, and improve data quality.

3.2 Statistical Process Control and Information Systems

Statistical process control procedures were used to measure and improve manufacturing processes beginning in the 1920s (Shewhart 1925, 1931). Cause and effect diagrams (like those shown in Figures 1 and 3) display the causal factors affecting the process, that is, the sources of data errors. Properly used, these diagrams capture as many potential sources of data errors as possible and reflect the efforts of people with different backgrounds and perspectives. Cause and effect diagrams help organize knowledge about sources of variation and facilitate discussion and investigation of problems. They can be effective tools for improving operations and performance. Analysts can use Pareto charts (like those shown in Figures 2 and 4) to show the frequencies of events and help compare the causes of data quality problems. Pareto charts can focus attention on areas that have historically caused problems, and can help determine in which order sources of data errors should be addressed (Leitnaker et al. 1996, pp. 43-51).

Control charts for attributes (like the p-charts shown in Figure 5) can be valuable aids for determining when and where to dedicate resources for quality improvement (Grant and Leavenworth 1980, 6). They can be used to avoid, first, searching for trouble and making adjustments when no trouble exists and, second, failing to search for trouble and not taking action when trouble does exist (Deming 1952, 14). Control charts can also be used as an estimating device, e.g., to estimate process parameters such as the mean and standard deviation of nonconforming item lines (Montgomery 1991, 106; Leitnaker et al. 1996, 72-74). Ideally, control charts should be developed throughout system analysis, design, conversion, and implementation to ensure data collection and maintenance processes will be problem free.

Many organizations already apply statistical quality control procedures to information systems. Gilbert, Reeve, and Wannemacher (1988) discussed how a major oil company used cause and effect diagrams and control charts to analyze and improve system response times. LePage (1990) explained how a large insurance company used statistical process control (SPC) to monitor the quality of data and of data entry controls. Huh et al. (1990) described data quality control efforts at AT&T. Other data quality control efforts have been described or advocated by Svanks (1988); Liepens (1990); Loebl (1990); Flournoy and Hearne (1990); and Anderson, Dumphy, and Wilson (1990). Statistical quality management techniques must continue to evolve to provide continuous process improvements as organizations increase their reliance on information systems. Future systems must identify and respond to internal and external changes to support changing organizational structures (such as short-lived teams) and to provide reasonable assurance to management that organizational objectives are being met. A recent paper by Wang, Storey, and Firth (1995) provides an overview of such data quality methodology.

4. Implementing Statistical Process Control and the Data Quality Service Level

We advocate a three-step approach to improve data quality: sampling update transactions, sampling records in persistent databases, and enhancing input controls and procedures. We feel these procedures can reduce errors in transactions, reduce the rate at which errors accumulate in the persistent data, and lead to a higher optimal data quality service level.

4.1 SPC and Transactions

One would begin improving data quality by using classic quality methodology to examine a random sample of input transactions and determine whether or not each transaction in the sample contains an error. A stratified sampling method may be used, e.g., selecting samples based on possible error causes, dollar value, or data source. The sample may be obtained manually, or the organization can develop software to select the sample from transactions as they occur (Weber 1999; Vasarhelyi and Halper 1991). Causes of data errors must be thoroughly investigated. Results of the investigations can be used to construct cause and effect diagrams (see Figure 1) and Pareto charts (see Figure 2). Cause and effect diagrams are used to categorize the sources of the transaction errors and depict the relationships between the error sources. Pareto charts illustrate the relative magnitude of the sources of errors. Organizations can use cause and effect diagrams and the Pareto charts to help reduce data entry error rates.
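
The tabulation behind a Pareto chart can be sketched in a few lines of Python. This is only an illustration: the function name `pareto_table` and the error causes and counts are hypothetical and are not taken from the figures in this paper.

```python
from collections import Counter


def pareto_table(error_causes):
    """Return (cause, count, cumulative percent) rows, most frequent cause first."""
    counts = Counter(error_causes)
    total = sum(counts.values())
    rows = []
    running = 0
    for cause, count in counts.most_common():
        running += count
        rows.append((cause, count, round(100 * running / total, 1)))
    return rows


# Hypothetical causes recorded while investigating 20 erroneous transactions.
observed_causes = (
    ["keying error"] * 12
    + ["illegible source document"] * 5
    + ["wrong item code"] * 2
    + ["equipment fault"] * 1
)

for cause, count, cumulative in pareto_table(observed_causes):
    print(f"{cause:<28}{count:>4}{cumulative:>8.1f}%")
```

Sorting by frequency and accumulating percentages is what lets managers see which few causes account for most of the errors and address those first.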

Using p-charts to plot the percent of non-conforming transactions over time (see Figure 5), organizations can summarize and track the quality of input transactions. Traditionally, upper and lower control limits for p-charts are set at three standard deviations above and below the mean percent non-conforming (Ishikawa 1976, 80). Estimates of the percent of non-conforming transactions can be obtained by investigating random samples for each time period, usually daily. Some out-of-control conditions, like points outside control limits, runs, trends, and periodicity (Ishikawa 1976, 76-78), are causes of particular concern.
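
As a minimal sketch of the three standard deviation rule, the following Python code computes p-chart limits under the assumption that every period uses the same sample size, so that the standard deviation of the sample proportion is the square root of p(1 - p)/n. The function names and the daily counts are hypothetical. The sketch flags only points outside the control limits; runs, trends, and periodicity would need separate checks.

```python
import math


def p_chart_limits(nonconforming_counts, sample_size):
    """Center line and three-sigma limits for a p-chart with equal sample sizes."""
    periods = len(nonconforming_counts)
    p_bar = sum(nonconforming_counts) / (periods * sample_size)
    sigma = math.sqrt(p_bar * (1 - p_bar) / sample_size)
    lower = max(0.0, p_bar - 3 * sigma)  # a proportion cannot fall below 0
    upper = min(1.0, p_bar + 3 * sigma)  # or rise above 1
    return p_bar, lower, upper


def points_outside_limits(nonconforming_counts, sample_size):
    """Return the 1-based periods whose proportion lies outside the control limits."""
    _, lower, upper = p_chart_limits(nonconforming_counts, sample_size)
    return [
        period
        for period, count in enumerate(nonconforming_counts, start=1)
        if not lower <= count / sample_size <= upper
    ]


# Hypothetical daily samples of 200 transactions each.
daily_errors = [8, 10, 7, 9, 11, 8, 25, 9, 10, 8]

p_bar, lcl, ucl = p_chart_limits(daily_errors, sample_size=200)
print(f"center={p_bar:.4f} LCL={lcl:.4f} UCL={ucl:.4f}")
print("periods outside limits:", points_outside_limits(daily_errors, 200))
```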

Using p-charts to monitor transactions can help organizations improve the reliability of the data input process. For example, using p-charts can help organizations identify changes in data entry error rates caused by problems such as poorly trained data entry personnel or defective equipment. Correcting these problems will bring the data entry process into statistical control. Moreover, p-charts can be used during volume testing and parallel conversion to pinpoint problems before they become significant and costly.

4.2 SPC and Persistent Data

Even if no errors occur during data entry procedures, errors can still collect in persistent databases. For example, customer databases may accumulate errors because customers fail to notify an organization about relocations, births, deaths, or other events. Inventory databases may amass errors because of events like obsolescence, deterioration, damage, spoilage, and theft.

Data quality in persistent databases is controlled by determining materiality guidelines and setting a minimum data quality service level. Under inspect and repair methodology, data quality is restored by reconciling the recorded values in the database with the observed values of the corresponding real world entities whenever records containing material errors exceed some maximum percent established by management. The minimum data quality service level is defined as 100 minus this maximum allowable percent error. If the data are error free at a point in time, say t sub zero, data quality will decline as errors accumulate in the database until a minimum data quality service level is reached. At that time, inspection and repair of nonconforming data will restore data quality and the process will repeat. The organization using such a methodology should select the data quality level at which the expected costs of reconciliations that restore data quality equal the expected costs of decision errors caused by data errors.
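
The decline and restore cycle can be sketched as a small simulation. The constant weekly decline, the function name, and the numbers below are assumptions made purely for illustration; real error accumulation need not be linear.

```python
def simulate_service_level(weeks, weekly_decline, minimum_level):
    """Track data quality (percent of records without material errors) by week.

    Assumes a constant decline per week and a clearing that restores quality
    to 100 as soon as the level falls below the minimum service level.
    """
    level = 100.0
    history = []
    clearings = []
    for week in range(1, weeks + 1):
        level -= weekly_decline
        if level < minimum_level:
            clearings.append(week)
            level = 100.0  # inspection and repair restores data quality
        history.append(level)
    return history, clearings


# Hypothetical scenarios: the same 95 percent minimum, two accumulation rates.
_, fast = simulate_service_level(weeks=12, weekly_decline=1.5, minimum_level=95.0)
_, slow = simulate_service_level(weeks=12, weekly_decline=0.75, minimum_level=95.0)
print("clearings with faster error accumulation:", fast)
print("clearings with slower error accumulation:", slow)
```

Halving the accumulation rate in this sketch reduces the number of clearings needed over the same period, which is the mechanism by which a slower decline lowers the cost of maintaining a given service level.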

Traditionally, either all errors found in persistent data or a random sample of errors were investigated. Results of these investigations were used to construct cause and effect diagrams (see Figure 3) and Pareto charts (see Figure 4). In addition to reflecting errors from data entry processes, investigating errors in persistent data revealed additional causes of data errors and yielded Pareto charts that focused attention on different causes of errors.

4.3 Improve Input Controls and Procedures

Traditional data quality investigations yield two sets of cause and effect diagrams and Pareto charts: one set for transactions and one set for persistent data. Improvements to transactions will likely involve enhanced input controls and better data capture methods; they will be technically oriented. Improvements to the persistent databases usually focus on policies and procedures; they will be organizationally or people oriented.

Traditionally, enhanced input controls can take a variety of forms. First, input controls can focus directly on the data, i.e., additional or enhanced field checks, record checks, file checks, and batch checks can be designed and implemented (Weber 1999). Input controls can also focus on data structure, i.e., how data can be restructured to prevent insertion, update, and deletion anomalies (Date 1995). Third, personnel practices can be changed to promote data quality improvements. For example, an organization can improve training programs, provide ergonomic furniture and equipment, and reduce noise levels. Fourth, the organization can use better data capture methods. These methods might include greater mechanization of the data capture process, e.g., more extensive use of bar codes, and more information sharing, e.g., accepting and transmitting electronic data between trading partners.

Investigations of persistent databases may lead to other changes. The organizational structure might be changed to more closely align employees who enter data with the people using that data to make decisions. The organization might implement incentive systems that reward improvements to data quality and ideas for its continuous enhancement. The prestige of data entry personnel might be enhanced through higher entry requirements, greater recognition of organizational contributions, and better compensation. Other, more situation-specific changes might be made. For a fixed asset database, for example, the organization could strengthen its policies and procedures by allocating all items of equipment to specific individuals and ensuring that all individuals are aware of their total responsibility to care for the equipment.

4.4 Expected Results

SPC produces a number of positive results. First, when Pareto charts are used to address data entry errors (see Figure 2), the rate of errors produced in the data entry process decreases and the upper control bound for errors declines. Second, when Pareto charts are used to address other problems that affect the persistent data (for example, through procedural, organizational, and reward system changes), the rate at which errors accumulate also falls (see Figure 4). Addressing the problems in the Pareto charts should produce a less steep slope of decline in the data quality service level (i.e., errors in the persistent data accumulate at a slower rate). Lowering the error accumulation rate will, in turn, make raising the minimum data quality service level more economical. Third, the increased attention and focus on data quality may also contribute to lower errors by generating greater awareness of its importance, i.e., by producing a "Hawthorne effect." Fourth, improved data quality can increase confidence in the data, improve decisions, and make the benefits of higher data quality more apparent. These positive results may, in turn, lead to further increases in data quality. That is, improvements in data quality may produce synergistic effects. Figures 5 and 6 show the anticipated behavior of the upper control limit on data entry errors and the minimum data quality service level. These figures illustrate how statistical data quality techniques continuously improve input accuracy rates, lower the frequency of data cleaning operations, and raise the data quality service level.

5. Testing Current Data Quality

Data quality can be measured as 100 minus the percent of records containing material errors. This measure of data quality is analogous to measuring data accuracy on a continuous scale from 0 to 1 as proposed by Wang and Reddy (1993). We next explain how to test whether current data quality exceeds the current minimum data quality service level.

5.1 Stating the Current Hypothesis

(Editor's note: Due to limitations of html character sets, several expressions in the next paragraphs are written out as they would be in e-mail text.)

Let p sub zero be the current minimum data quality service level. Management wants to test the hypothesis that the current level of data quality (DQ) exceeds the service objective versus the alternative that it does not:

H sub zero: DQ > or = p sub zero   versus   H sub one: DQ < p sub zero

Note that p sub zero will be repeatedly revised upward as continuous process improvements are made to both input controls and organizational procedures. Let alpha sub zero be the alpha level and n the sample size chosen by management. Recall that alpha sub zero is used to control the chance of type I errors given the null hypothesis is true. In this statement of the hypotheses, type I errors cause the organization to incur the heavy cost of a complete audit when none is needed. Type II errors cause the organization to use data that do not meet the minimum data quality service objective and, thus, increase the number of decision errors caused by lower quality data. The organization can reduce type II errors by increasing the number of entities sampled, i.e., by increasing n. Increasing the sample size, however, increases the cost of performing the periodic checks on the quality of the data. Alternatively, the organization can reverse the null and alternative hypotheses. Management can then use alpha sub zero to control the risk that the data do not meet the minimum data quality service objective.
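
The effect of the sample size on type II errors can be sketched numerically. The code below assumes that the number of sampled records within tolerance is binomial and that the null hypothesis is rejected when the conventional lower-tail probability Pr[T < or = t], computed at p sub zero, falls below alpha sub zero. The function names and the true quality level of 0.90 are hypothetical.

```python
from math import comb


def binomial_cdf(k, n, p):
    """Pr[T <= k] for T ~ Binomial(n, p); returns 0 when k is negative."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def critical_value(n, p0, alpha0):
    """Largest count c with Pr[T <= c | p0] < alpha0 (-1 if there is none)."""
    c = -1
    while c + 1 <= n and binomial_cdf(c + 1, n, p0) < alpha0:
        c += 1
    return c


def type_ii_error(n, p0, alpha0, true_quality):
    """Chance of failing to reject when data quality is really true_quality."""
    c = critical_value(n, p0, alpha0)
    return 1 - binomial_cdf(c, n, true_quality)


# Hypothetical case: the objective is 0.95 but true data quality is only 0.90.
for n in (100, 400):
    beta = type_ii_error(n, p0=0.95, alpha0=0.10, true_quality=0.90)
    print(f"n={n}: probability of a type II error is about {beta:.3f}")
```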

5.2 Estimating Current Data Quality

To estimate the current data quality of the database, management needs to know the number of records that have material errors. If the database has a large number of records, investigating all records may be time consuming and expensive. The organization can estimate the percent of records that have material errors by investigating a sample of the real world entities (values) modelled by the information system and comparing the results to the corresponding records. (If real world entities are not of approximately equal value, the organization could use stratified or dollar unit sampling.) If the entities are chosen randomly, the sample will provide an unbiased estimate of the percent of the records in the entire database that contain material errors. The results of the sample can be used to compute the probability that the database accuracy level is acceptable, i.e., to test the hypothesis that the current level of data quality exceeds the minimum data quality service objective.
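
The sampling step can be sketched as follows. The dictionaries, the tolerance parameter, and the function name are assumptions for illustration. In practice the observed values would be obtained by investigating only the sampled entities, and the `observed` mapping merely stands in for that investigation.

```python
import random


def estimate_data_quality(recorded, observed, sample_size, tolerance=0, seed=None):
    """Estimate the fraction of records without material errors from a simple random sample."""
    rng = random.Random(seed)
    sampled_ids = rng.sample(sorted(recorded), sample_size)  # sampling without replacement
    within = sum(
        1
        for item in sampled_ids
        if abs(recorded[item] - observed[item]) <= tolerance
    )
    return within / sample_size


# Hypothetical inventory of 1,000 item lines; 50 recorded quantities are wrong.
observed_counts = {item: 20 for item in range(1000)}
recorded_counts = dict(observed_counts)
for item in range(0, 1000, 20):
    recorded_counts[item] += 3

estimate = estimate_data_quality(recorded_counts, observed_counts, sample_size=100, seed=1)
print(f"estimated data quality: {estimate:.2f}")
```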

Formally, let C be the set of all entities, i.e., C = {ci | 1 < or = i < or = N} where N is the total number of entities. The random sample will consist of n entities, n < N. Let S be the sample set of n entities drawn from C, i.e., S = {sj | 1 < or = j < or = n} where S is a subset of C. For each sj that is an element of S, the actual and recorded values will be compared. Let the indicator function I(·) be defined as

If each record has the same chance of containing a material error, i.e., if Pr[I(sj) = 1] is the same for all j = 1, ... , n, then the sample data vector

E = [I(s1), ... , I(sn)]

is a set of identically distributed random variables. Before the sample is taken, E is a random vector of data quality indicators. Under the assumption of independence, E yields a class of Bernoulli random variables and the exact sampling distribution of the test statistic

will be a binomial random variable. After the sample is taken, E is a vector of constant 0s and 1s. For a particular realization of the test statistic, e.g., T=t, we can compute the p-value as (Ross 1987, 231):

The null hypothesis that the current level of data quality exceeds the service objective should be rejected if p-value < alpha sub zero. Rejecting H sub zero indicates that the organization should perform a complete investigation of the records and values to improve the current data quality.
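
The test can be sketched in Python under stated assumptions: T counts the sampled entities whose recorded values are within tolerance, and at the boundary of the null hypothesis T is binomial with parameters n and p sub zero. The conventional lower-tail p-value is Pr[T < or = t]. The optional `strict` flag computes Pr[T < t] instead; it is included because that form reproduces the values tabulated in Table 1 (for example, 0.06309 at t = 92), which is an observation from the numbers rather than a statement made in the text. The function names are hypothetical.

```python
from math import comb


def binomial_cdf(k, n, p):
    """Pr[T <= k] for T ~ Binomial(n, p); returns 0 when k is negative."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def p_value(t, n, p0, strict=False):
    """Lower-tail probability of observing t (or fewer) records within tolerance."""
    upper = t - 1 if strict else t
    return binomial_cdf(upper, n, p0)


def needs_complete_investigation(t, n, p0, alpha0, strict=False):
    """Reject the null hypothesis when the p-value falls below alpha0."""
    return p_value(t, n, p0, strict) < alpha0


for t in (94, 93, 92, 91):
    print(
        t,
        round(p_value(t, 100, 0.95), 5),
        round(p_value(t, 100, 0.95, strict=True), 5),
    )
```

The two conventions differ by one item line in where the rejection threshold falls, so an organization applying the test should fix the convention before setting its decision rule.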

5.3 Anticipating an Unacceptable Service Level

The procedure described in the previous section tests for violations of the current data quality service objective. Management can also use past data in a regression or time series analysis to predict when a complete investigation of the real world entities should be conducted. Creating data quality control charts such as those shown in the figures can help management develop a culture of data quality improvement. Unfortunately, these improvements cannot totally eliminate all errors because some data quality problems are beyond management's control.
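
One simple form of such a prediction is an ordinary least squares line fitted to the data quality estimates observed since the last clearing. The linear trend, the function names, and the weekly estimates below are assumptions for illustration only; a time series model may suit real data better.

```python
def fit_line(xs, ys):
    """Ordinary least squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def week_minimum_is_reached(weeks, quality, minimum_level):
    """Week at which the fitted line crosses the minimum level (None if not declining)."""
    slope, intercept = fit_line(weeks, quality)
    if slope >= 0:
        return None
    return (minimum_level - intercept) / slope


# Hypothetical weekly data quality estimates (percent) since the last clearing.
weeks = [1, 2, 3, 4]
quality = [99.2, 97.9, 97.1, 95.8]
print(week_minimum_is_reached(weeks, quality, minimum_level=95.0))
```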

5.4 An Example

Over the past several years, organizations have worked hard to reduce their inventories. These efforts have helped organizations reduce warehouse storage requirements, inventory carrying costs (e.g., interest charges), and losses from theft, damage, and obsolescence. Some manufacturing organizations can convert to just in time (JIT) inventory systems and virtually eliminate their inventories. Some wholesalers, retailers, or JIT suppliers cannot totally eliminate their inventories. These organizations rely on their information systems to make decisions about their inventory. Data problems may cause managers to make decisions that generate over-stock or under-stock conditions. Either condition adversely affects an organization's profitability.

5.4.1 Data Entry

Managing and improving data entry requires sampling transactions, and constructing cause and effect diagrams, Pareto charts, and control charts. It also requires using the insights obtained from statistical analyses to enhance data entry processes. For example, transaction data for an inventory system might consist of receipts, shipping, and physical inventory data. Statistical sampling requires obtaining a random sample of transactions and investigating each transaction selected to determine whether it was correct or not. For paperless systems, the sample could be obtained by performing duplicate entry of a random sample of transactions.

If data are captured on source media (e.g., paper purchase orders or recordings of telephone orders), then a sample may be selected from the transaction log and traced back to the source media. For incorrect transactions, the reason for the error must be determined. Results of error investigations are used to construct cause and effect diagrams and Pareto charts. Addressing the largest sources of errors on the Pareto charts should result in reductions in the upper control bounds on the control chart. The error rate for each time period can be plotted on the control chart. If the control chart indicates an out of control condition, the errors can be investigated, the reason for the out of control condition determined, and appropriate actions taken.

Figure 1 shows a possible cause and effect diagram for a data entry process and Figure 2 illustrates a potential distribution of sources of data entry errors. Figure 5 presents a hypothetical control chart for data entry errors, showing how the upper control limit falls as problems associated with the data entry process are remedied.

5.4.2 Persistent Data

Monitoring and improving the persistent inventory database requires sampling inventory items and constructing cause and effect diagrams, Pareto charts, and data quality service level graphs. Insights from these activities are used to enhance the organizational efforts affecting inventory and inventory data. Sampling may be conducted by randomly selecting inventory items, physically counting the number of units of each of these items, and comparing the physical count with the amounts in the database. Alternatively, an organization may sample inventory items during stocking and order filling processes. Both alternatives could use stratified sampling (including item selection algorithms that reflect item usage or other measures of the importance of individual inventory items).

Investigating selected items requires a method for determining real-world values. In warehouses, physical inventory counts will usually achieve this objective. In other situations (e.g., customer data) determining the true values may be more difficult. Results of investigating discrepancies between inventory item records and the actual state of the inventory, whether for sampled items or for the complete physical inventory, may be used to construct cause and effect diagrams and Pareto charts. The data quality (i.e., 1 minus the error rate) for each time period may be plotted on the data quality graph. If the data quality is below the current minimum acceptable data quality service level, then all inventory items should be investigated, i.e., a complete physical inventory should be performed.

Improving data entry processes will result in less steep slopes in charts showing data quality degradation. Remedying the largest sources of data quality errors on the Pareto charts for the inventory database should result in even less steep slopes. Figure 3 shows a possible cause and effect diagram for the organizational factors affecting data quality of the inventory database and Figure 4 portrays a distribution of the effects of various organizational factors on the quality of persistent data. Figure 6 illustrates that improving the data entry process and improving organizational procedures, first, decreases the rate of decline of data quality; second, reduces the frequency of clearings when data quality reaches a minimum data quality service level; and third, improves the minimum data quality service level.

5.4.3 Long-Term Effects

Decreasing upper control bounds as reflected in the control charts, slower declines in data quality, and higher minimum data quality service levels can yield higher overall data quality, more predictable data quality, lower data quality maintenance costs, and better decisions. These benefits mean that the organization will enjoy greater returns from its information systems investments. Implementing data quality methodology is likely to make an organization more data quality conscious. Improved decision making as the result of better information quality can generate more information quality improvements, and more innovative ways to use information systems to achieve competitive and strategic advantages. The organization is likely to become a more desirable trading partner because of the accuracy of the data it provides to organizations with which it does business.

5.4.4 Sampling Persistent Inventory Data

Finally, we use a simple example to show how traditional statistical quality control can improve data quality. Suppose the organization stocks 10,000 different item lines of inventory (N = 10,000). Assume that management has selected a minimum data quality service level of 95% (p sub zero = 0.95), a weekly sample size of 100 item lines (n = 100), and an alpha level of 10% (alpha sub zero = 0.10). Table 1 illustrates the calculation of the probability that the inventory database exceeds the required service level (p-value) for each number of acceptable item lines in the sample (t). In this example, management would request a complete physical inventory count if fewer than 93 of the 100 unique inventory item lines sampled were within required tolerances.

TABLE 1. Probability of Satisfying the Current Data Quality Objective

Sample Size   Data Quality Service     No. Item Lines Within   Probability Service Objective
(n)           Objective (p sub zero)   Tolerance (t)           Satisfied (p-value)
100           0.95                     100                     0.99408
100           0.95                      99                     0.96292
100           0.95                      98                     0.88174
100           0.95                      97                     0.74216
100           0.95                      96                     0.56402
100           0.95                      95                     0.38400
100           0.95                      94                     0.23399
100           0.95                      93                     0.12796
100           0.95                      92*                    0.06309*
100           0.95                      91*                    0.02819*
100           0.95                      90*                    0.01147*

* Below the example alpha sub zero = 0.10.
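The figures in Table 1 can be reproduced with a binomial model. The sketch below assumes that the tabulated p-value for t acceptable item lines equals P(X <= t - 1), where X follows a binomial distribution with n trials and success probability p sub zero. That convention is inferred from the tabulated values and is not stated in the text. The function names are illustrative, and the finite population size (N = 10,000) is ignored.

```python
from math import comb


def p_value(t: int, n: int = 100, p0: float = 0.95) -> float:
    """Return P(X <= t - 1) for X ~ Binomial(n, p0).

    Assumption: this is the convention that matches the values
    printed in Table 1; the paper does not state the formula.
    """
    return sum(comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(t))


def smallest_passing_count(n: int = 100, p0: float = 0.95,
                           alpha: float = 0.10) -> int:
    """Return the smallest t whose p-value is at least alpha.

    Samples with fewer acceptable item lines than this would
    trigger a complete physical inventory count.
    """
    return next(t for t in range(n + 1) if p_value(t, n, p0) >= alpha)


if __name__ == "__main__":
    # Print the rows of Table 1 (t = 100 down to t = 90).
    for t in range(100, 89, -1):
        flag = "*" if p_value(t) < 0.10 else ""
        print(f"{t:3d}  {p_value(t):.5f}{flag}")
    print("count triggered if t <", smallest_passing_count())
```

Under these assumptions, the threshold returned by `smallest_passing_count()` agrees with the decision rule in the example: a complete count is requested when fewer than 93 sampled item lines are within tolerance.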

Suppose the results of the weekly samples were as shown in Table 2. Then management could anticipate requesting a complete physical inventory count approximately every four weeks (in this example, after weeks 4, 7, and 11). This time period could change if the rate at which transactions enter the system increases or decreases, if data entry employees change or modify their behavior, or if the data entry methods, procedures, and controls are altered. Indeed, the cost of frequent complete physical inventory counts may motivate management to install new equipment, upgrade employee training, and redesign data entry and control processes to reduce error rates and, thus, the frequency of complete physical counts.

TABLE 2. Example Results of Weekly Samples

Week   No. Item Lines Within Tolerance (t)
 1     100
 2      98
 3      94
 4      92*
 5      99
 6      96
 7      90*
 8      99
 9      94
10      95
11      91*
12     100

* Triggered a complete physical inventory count that restored near perfect data quality. 
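The anticipated interval between complete counts can be checked directly against the Table 2 data. The short sketch below applies the decision rule from the example (fewer than 93 item lines within tolerance triggers a count) and measures the weeks between triggers. Treating week 0 as the start of the first cycle is an assumption made only for illustration, as are the variable names.

```python
# Weekly counts of item lines within tolerance, copied from Table 2
# (index 0 corresponds to week 1).
weekly_counts = [100, 98, 94, 92, 99, 96, 90, 99, 94, 95, 91, 100]

# Decision rule from the example: fewer than 93 of 100 triggers a count.
THRESHOLD = 93

# Weeks in which the sample falls below the threshold.
trigger_weeks = [
    week for week, t in enumerate(weekly_counts, start=1) if t < THRESHOLD
]

# Weeks elapsed between successive triggers; the first cycle is
# assumed to start at week 0.
intervals = [b - a for a, b in zip([0] + trigger_weeks, trigger_weeks)]
mean_interval = sum(intervals) / len(intervals)

print("trigger weeks:", trigger_weeks)
print("intervals:", intervals)
print(f"mean interval: {mean_interval:.2f} weeks")
```

With only three triggers in twelve weeks, the average is a rough guide rather than a precise estimate of the clearing cycle.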

6. Summary

Statistical quality techniques can be used to continuously improve data quality in persistent databases. These techniques allow organizations to measure the quality of their data, to detect and assess changes (up or down) in data quality, and to improve the management and control of the database. Our paper identified benefits of using statistical quality control techniques to enhance data quality, demonstrated how statistical quality control of transaction processing can improve data quality in the inventory database, showed how to perform statistical tests, and discussed how to anticipate and prevent violations of the service objective. We believe that implementing the strategies described in this paper can help management develop a culture of continuous data quality improvement. Improving data quality in persistent databases will yield many natural benefits including better decision making, increased profitability, and positive outcomes for stakeholders.

7. References

Anderson, K. M., D. C. Dumphy, and P. W. F. Wilson. 1990. Management of data quality in a long-term epidemiologic study: The Framingham heart study. In Data quality control: Theory and pragmatics, ed. Liepens and Uppuluri, 57-68. New York: Marcel Dekker.

Ballou, D. P., and H. L. Pazer. 1985. Modeling data and process quality in multi-input, multi-output information systems. Management Science (February): 150-62.

Ballou, D. P., and H. L. Pazer. 1987. Cost/quality tradeoffs for control procedures in information systems. Omega: International Journal of Management Science 15, no. 6: 509-21.

Bowen, P. L. 1992. Managing data quality in accounting information systems: A stochastic clearing system approach. Ph.D. diss., University of Tennessee.

Bowen, P. L., G. P. Schneider, and K. T. Fields. 1995. Managing data quality in client/server environments. IS Audit & Control Journal 4: 28-35.

Caby, E. C., P. W. Pautke, and T. C. Redman. 1995. Strategies for improving data quality. Data Quality 1, no. 1 (March): 4-12.

Cushing, B. E. 1974. A mathematical approach to the analysis and design of internal control systems. Accounting Review (January): 24-41.

Date, C. J. 1995. An introduction to database systems. 6th ed. Reading, Mass.: Addison-Wesley Publishing.

Deming, W. E. 1952. Elementary principles of the statistical control of quality. Tokyo: Nippon Kagaku Gijutsu Remmei.

Dillard, R. A. 1992. Using data quality measures in decision making algorithms. IEEE Expert 7, no. 6 (December): 63-72.

Drucker, P. F. 1995. Managing in a time of great change. New York: Truman Talley Books/Dutton.

Feltham, G. A. 1972. Information evaluation. Sarasota, Fla.: American Accounting Association.

Firth, C. P. 1996. Data quality in practice: Experience from the frontline. The 1996 Conference on Information Quality, Massachusetts Institute of Technology, October 25-26. (Also see http://sunflower.singnet.com.sg/~cfirth/dataquality/. Special viewing authorization may be needed.)

Flournoy, N., and L. B. Hearne. 1990. Quality control for a shared multidisciplinary database. In Data quality control: Theory and pragmatics, ed. Liepens and Uppuluri, 43-56. New York: Marcel Dekker.

Fox, C., A. Levitin, and T. C. Redman. 1994. The notion of data and its quality dimensions. Information Processing & Management 30, no. 1: 9-19.

Gilbert, K. C., J. M. Reeve, and R. A. Wannemacher. 1988. Improving information system efficiency through statistical process control. Journal of Engineering Computing and Applications (Fall): 21-27.

Grant, E. L., and R. S. Leavenworth. 1980. Statistical quality control. New York: McGraw-Hill Book Company.

Ham, J., D. Losell, and W. Smieliauskas. 1985. An empirical study of error characteristics in accounting populations. Accounting Review (July): 387-406.

Hamlen, S. S. 1980. A chance-constrained mixed integer programming model for internal control design. Accounting Review (October): 578-93.

Hilton, R. W. 1979. The determinants of information system value: An illustrative analysis. Journal of Accounting Research (Autumn): 411-35.

Hilton, R. W. 1981. The determinants of information system value: Synthesizing some general results. Management Science 27 (January): 57-64.

Hilton, R. W., R. J. Swieringa, and R. E. Hoskin. 1981. Perception of accuracy as a determinant of information value. Journal of Accounting Research (Spring): 86-108.

Huh, Y. U., F. R. Keller, T. C. Redman, and A. R. Watkins. 1990. Data quality. Information and Software Technology (October): 559-65.

Hylas, R. E., and R. H. Ashton. 1982. Audit detection of financial statement errors. Accounting Review (October): 751-65.

Icerman, R. C., and W. A. Hillison. 1991. Disposition of audit-detected errors: Some evidence on evaluative materiality. Auditing: A Journal of Practice & Theory (Spring): 22-34.

Ishikawa, K. 1976. Guide to quality control. Tokyo: Asian Productivity Organization.

Itami, H. 1977. Adaptive behavior: Management control and information analysis. Sarasota, Fla.: American Accounting Association.

Jang, Y., H. B. Kon, and R. Y. Wang. 1992. A data consumer-based approach to supporting data quality judgment. Proceedings of the Second Annual Workshop on Information Technology and Systems. Dallas, Texas.

Johnson, J. R., R. A. Leitch, and J. Neter. 1981. Characteristics of errors in accounts receivable and inventory audits. Accounting Review (April): 270-293.

Kinney, W. R., Jr. 1979. The predictive power of limited information in preliminary analytical review: An empirical study. Journal of Accounting Research (Supplement): 148-65.

Kreutzfeldt, R. W., and W. A. Wallace. 1986. Error characteristics in audit populations: Their profile and relationship to environmental factors. Auditing: A Journal of Practice & Theory (Fall): 20-43.

Laudon, K. C. 1986. Data quality and due process in large interorganizational record systems. Communications of the ACM (January): 4-11.

Leitnaker, M. G., R. D. Sanders, and C. Hild. 1996. The power of statistical thinking: Improving industrial processes. Reading, Mass.: Addison-Wesley.

LePage, N. J. 1990. Data quality control at United States Fidelity and Guaranty Company. In Data quality control: Theory and pragmatics, ed. Liepens and Uppuluri, 25-42. New York: Marcel Dekker.

Liepens, G. E. 1990. Reflections on validation and quality assessment of FPC form 4 data. In Data quality control: Theory and pragmatics, ed. Liepens and Uppuluri, 19-24. New York: Marcel Dekker.

Liepens, G. E., and V. R. R. Uppuluri, eds. 1990. Data quality control: Theory and pragmatics. New York: Marcel Dekker.

Loebl, A. S. 1990. Accuracy and relevance and the quality of data. In Data quality control: Theory and pragmatics, ed. Liepens and Uppuluri, 105-144. New York: Marcel Dekker.

Montgomery, D. C. 1991. Introduction to statistical quality control. 2nd ed. New York: John Wiley and Sons.

Morey, R. C. 1982. Estimating and improving the quality of information in a MIS. Communications of the ACM (May): 337-342.

Neter, J., and J. K. Loebbecke. 1975. Behavior of major statistical estimators in sampling audit populations. New York: AICPA.

Paradice, D. B., and W. L. Fuerst. 1991. An MIS data quality methodology based on optimal error detection. Journal of Information Systems (Spring): 48-66.

Parsaye, K., and M. Chignell. 1993. Data quality control with smart databases. AI Expert 8, no. 5 (May): 22-27.

Ramage, J. G., A. M. Krieger, and L. L. Spero. 1979. An empirical study of error characteristics in audit populations. Journal of Accounting Research (Supplement): 72-102.

Redman, T. C. 1994. Data quality for telecommunications. IEEE Journal on Selected Areas in Communications 12, no. 2 (February): 306-312.

Ross, S. M. 1987. Introduction to probability and statistics for engineers and scientists. New York: John Wiley and Sons.

Shewhart, W. A. 1925. The application of statistics as an aid in maintaining quality of a manufactured product. Journal of the American Statistical Association (December): 546-548.

Shewhart, W. A. 1931. Economic control of quality of manufactured product. New York: Van Nostrand.

Specht, P. H. 1986. Job characteristics as indicants of CBIS data requirements. MIS Quarterly (September): 270-86.

Svanks, M. I. 1988. Integrity analysis: Methods for automating data quality assurance. Information and Software Technology (December): 595-605.

Vasarhelyi, M. A., and F. B. Halper. 1991. The continuous audit of online systems. Auditing: A Journal of Practice & Theory (Spring): 110-25.

Yu, S., and J. Neter. 1973. A stochastic model of the internal control system. Journal of Accounting Research (Autumn): 273-95.

Wand, Y., and R. Y. Wang. 1996. Anchoring data quality dimensions in ontological foundations. Communications of the ACM 39, no. 11 (November): 86-95.

Wang, R. Y., ed. 1993. Information technology in action: Trends and perspectives. Englewood Cliffs, New Jersey: Prentice Hall.

Wang, R. Y., and H. B. Kon. 1993. Toward total data quality management. In Information technology in action: Trends and perspectives, ed. R. Y. Wang, 179-97. Englewood Cliffs, New Jersey: Prentice Hall.

Wang, R. Y., and M. P. Reedy. 1993. Quality data objects. Working paper TDQM-92-06, MIT TDQM Research Program, E53-320, 50 Memorial Drive, Cambridge, Ma. 02139.

Wang, R. Y., V. Storey, and C. Firth. 1995. A framework for analysis of data quality research. IEEE Transactions on Knowledge and Data Engineering 7, no. 4: 623-640.

Wang, R. Y., and D. Strong. 1996. Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems 4 (Spring): 5-34.

Weber, R. 1988. EDP auditing. 2nd ed. New York: McGraw-Hill.

Weber, R. 1999. Information systems control and audit. Upper Saddle River, N.J.: Prentice Hall.

Willingham, J. J., and W. F. Wright. 1985. Financial statement errors and internal control judgments. Auditing: A Journal of Practice & Theory (Fall): 57-70.

8. Figures

Figure 1

Figure 2

Figure 3

Figure 4

Figures 5 and 6