Volume 4 Number 1 Copyright 1998
Author contact information:
Richard G. Mathieu (corresponding author), Department of Production & Decision Sciences, Cameron School of Business, University of North Carolina at Wilmington. Internet: mathieur@uncwil.edu.
Omar Khalil, Department of Marketing and Business Information Systems, College of Business and Industry, University of Massachusetts at Dartmouth. Internet: okhalil@umassd.edu
Current evidence indicates that poor data quality is pervasive and has a significant negative impact on business success. Information-system (IS) professionals are typically charged with managing the enterprise's data resources, yet few have received formal training in techniques for improving data quality. There is a clear need to integrate topics related to data quality into courses on database systems. This paper describes the importance of data quality and defines data quality in a four-dimensional framework. The contents of five popular database system textbooks are analyzed for their coverage of data quality concepts. This analysis is used to suggest four specific modules that can be used by information-system trainers and educators in a course on database systems.
Poor data quality is pervasive and costly to industry. Redman reports that error rates of 1-5% are typical, with an estimated immediate cost of about 10% of revenue (Redman, 1996). Customers, suppliers, distributors, and employees are negatively impacted through poor service, billing errors, and inconvenience. Data quality problems are exacerbated in large organizational databases where data are collected from multiple data sources. Strong, Lee, and Wang (1996) caution that information-system professionals should seek not only to improve data accuracy, but should also consider data accessibility and data relevance as they relate to the context of the data consumers' tasks.
Businesses have implemented programs to improve data quality to enhance competitive advantage. AT&T used its data quality program to suggest opportunities to reengineer their billing system; as a result, billing errors were reduced by two orders of magnitude (Redman, 1996). Data warehouses are used by organizations to improve customer service and managerial decision making. A major issue in building and maintaining a data warehouse is data quality. Typically, organizations will initially spend considerable time ensuring quality of data, but the focus on data quality gradually fades. Without proper data quality processes, the data warehouse will begin to accumulate "dirty data" (Garcia, 1997).
While the importance of data quality is recognized, most information-systems curricula do not directly address the issue. Textbooks in database systems, system analysis and design, management information systems, and decision support systems typically pay little attention to data quality. While prior model curricula gave little attention to quality issues, both the IRMA/DAMA Curriculum Model (1996) and the IS'97 Model Curriculum (Davis et al., 1997) have placed significant emphasis on quality. However, the primary focus in the IS'97 Model Curriculum is on the "principles of quality improvement" and on "software quality." No specific mention of data or information quality is made in the IS'97 Model Curriculum. Data quality is addressed indirectly through such topics as data integrity, EDP auditing, data dictionaries, and software development procedures. The IRMA/DAMA Curriculum Model lists "quality control and information" as an organizational issue in a course on Information Technology and "information quality and control" as a topic in a course on Information Resources Management Principles. Interestingly, the IRMA/DAMA curriculum proposes a course entitled "Data Resource Structures and Administration" that has as an objective "to provide students with a basic knowledge of various concepts of data resources administration and database management," yet nowhere does the course directly address data quality issues. Even with these two model curricula, most university information-system curricula still give little attention to data quality as an important concept. At best, the typical IS student is exposed to a variety of topics that have an impact on data quality, but is not equipped with a broad understanding of the principles behind measuring, tracking, and improving data quality in an organization.
The purpose of this paper is to assist information-system educators, instructors, and trainers in teaching data quality principles within a course on database systems. First, the literature on data quality is surveyed, and from this survey a data quality framework is presented. This framework is used to analyze the coverage given to data quality in a sample of database textbooks. Finally, implications drawing from the database-textbook analysis are used to suggest specific topics that should be taught in the undergraduate database course. These suggestions are made in the form of four teaching modules:
Many data quality programs have focused exclusively on improving data accuracy. Today, many researchers point out that data quality is a multidimensional issue that needs to be assessed from the data consumer's perspective. Wang and Strong (1996) identified four dimensions of data quality consisting of fifteen measurable attributes:
Becker (1998) categorizes seven common data quality problems seen by end-users: (1) data corruption due to incorrect conversion, (2) historical and current data having different meanings, (3) the same data having more than one data definition, (4) missing data, (5) hidden data, (6) missing granularity, and (7) violation of integrity rules. The research strongly suggests that information system professionals should strive to deliver more than accurate and objective data to the users. Conventional control techniques such as edit checks, database integrity constraints, and EDP auditing have traditionally been used to improve data accuracy. However, IS professionals must give attention to techniques that help data consumers perform their tasks by developing and maintaining systems that are flexible enough to allow easy aggregation and manipulation of data, that are responsive enough to provide relevant and easily interpreted information, and that are secure and robust enough to prohibit accidental or criminal data corruption.
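The conventional controls mentioned above (edit checks and database integrity constraints) can be illustrated with a minimal sketch. It assumes Python's standard `sqlite3` module and a hypothetical `customer` table; the table, its columns, and the sample rows are invented for illustration and do not come from any of the cited sources.

```python
import sqlite3

# Hypothetical schema: table and column names are assumptions for illustration.
SCHEMA = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    email       TEXT NOT NULL UNIQUE,
    state       TEXT NOT NULL CHECK (length(state) = 2),
    balance     REAL NOT NULL CHECK (balance >= 0)
)
"""

def load_rows(conn, rows):
    """Insert rows one at a time; return the rows the constraints rejected."""
    rejected = []
    for row in rows:
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO customer (customer_id, email, state, balance) "
                    "VALUES (?, ?, ?, ?)",
                    row,
                )
        except sqlite3.IntegrityError as exc:
            rejected.append((row, str(exc)))
    return rejected

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
rows = [
    (1, "ann@example.com", "NC", 120.0),   # valid
    (2, "bob@example.com", "Mass", 40.0),  # violates CHECK on state length
    (3, "ann@example.com", "MA", 10.0),    # violates UNIQUE on email
    (4, None, "NC", 5.0),                  # violates NOT NULL on email
    (5, "eve@example.com", "NC", -1.0),    # violates CHECK on balance
]
rejected = load_rows(conn, rows)
accepted = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
```

Constraints of this kind address accuracy only. They do nothing for the accessibility, relevance, or interpretability concerns raised in the same paragraph.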
Data should be viewed as a business asset. Organizations hold managers accountable for business assets like capital, raw materials, machinery and equipment, and employees. Organizations that wish to possess high quality data should recognize data as a business asset and explicitly define management responsibilities for them. Unlike many other organizational assets, data are highly dynamic. Understanding the data life cycle is important to understanding the nature of data. Redman (1996) proposes a data life cycle that includes two cycles: the data acquisition cycle and the data usage cycle. The data acquisition cycle has the ultimate goal of storing data. Thus data modeling and obtaining data values are necessary components of acquiring data. The four steps of the data acquisition cycle are:
It is important to note that while database modeling and design and data storage are typically viewed as the province of an Information Technology group, the process of populating the database typically rests with the manager(s) responsible for the business processes that use the data.
The data usage cycle starts with data that have previously been stored. While many business uses of data are routine, such as customer billing and paying employees, data are also used for making decisions. Using data in the decision-making process typically requires combining large amounts of data, putting them into a new context, and interpreting the output. The four steps of the data usage cycle are:
While query design and processing are viewed as the province of IT professionals, the manipulation of content and the presentation of results are often considered to be in the purview of the user (data consumer).
According to the U.S. Defense Information System Agency (Cykana & Stern, 1996), the root causes of poor data quality can be attributed to four primary areas:
An understanding of the processes that generate, use, and store data is essential to understanding data quality. Business processes typically extend horizontally across the organization, and process owners should be made responsible for the quality of data that they produce or use. Te'eni (1993) documents the importance of treating the behavioral aspects of data quality and proposes a model to deal with data quality problems. Redman (1996) recommends the use of an information model called Functions of Information Processing (FIP) to help process owners describe information chains. The FIP diagram models how data are created, moved, stored, filtered, queued, and associated in an information chain, and is quite useful in identifying sources of data quality problems. Experience in the United States Department of Defense revealed that a majority of data errors can be attributed to process problems (Dvir and Evans, 1996). Hence, IS analysts are encouraged to examine the existing processes that support data entry, the assignment and execution of data quality responsibilities, and methods used to exchange data.
Data quality problems often stem from system deficiencies rooted in poorly documented modifications, incomplete user training and user manuals, or systems that have been extended beyond their original intent. IS professionals should examine system modifications, user training, user manuals, engineering change requests, and problem reports in an effort to improve data quality. Poor data quality often stems from inadequate organizational policies. Clear management responsibilities should be established within the organization. A data policy should cover: (1) security, privacy, and rules of use; (2) inventory of data assets; (3) data sharing and availability; (4) data architecture; (5) planning; and (6) the role of quality (Redman, 1996).
Improving the quality of data in an organization is often a daunting task. Organizations typically hold enormous quantities of data spread over many divisions that employ different technologies. A data quality program is essential for improving data quality within an organization. Redman (1996) states that a good data quality program should:
A Total Quality Management (TQM) framework for improving data quality has been proposed by both Dvir and Evans (1996) and Becker (1998). Inherent to this process is the need to translate data-customer needs into metrics, a team-oriented approach to continuous quality improvement, and benchmarking performance. Many of the classical techniques of statistical quality control, such as Pareto charts and control charts, can be applied to the measurement, tracking, and improvement of data quality.
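As a rough sketch of how these two statistical quality control techniques might be applied to data, the following Python code computes three-sigma limits for a p-chart of record error rates and orders error causes for a Pareto chart. The function names, the equal sample size, and the sample data are all assumptions made for illustration; none of them comes from the cited sources.

```python
import math
from collections import Counter

def p_chart_limits(error_counts, sample_size):
    """Return (center, lower, upper) three-sigma limits for a p-chart.

    error_counts holds the number of records found in error in each
    sample; every sample is assumed to contain sample_size records.
    """
    p_bar = sum(error_counts) / (len(error_counts) * sample_size)
    sigma = math.sqrt(p_bar * (1 - p_bar) / sample_size)
    return p_bar, max(0.0, p_bar - 3 * sigma), min(1.0, p_bar + 3 * sigma)

def out_of_control(error_counts, sample_size):
    """Indexes of samples whose error proportion falls outside the limits."""
    _, lower, upper = p_chart_limits(error_counts, sample_size)
    return [i for i, count in enumerate(error_counts)
            if not lower <= count / sample_size <= upper]

def pareto(error_causes):
    """Order causes by frequency and attach the cumulative share of errors."""
    counts = Counter(error_causes).most_common()
    total = sum(n for _, n in counts)
    running, table = 0, []
    for cause, n in counts:
        running += n
        table.append((cause, n, running / total))
    return table

# Made-up data: records in error per weekly sample of 200 records.
weekly_errors = [4, 6, 5, 3, 7, 5, 21, 4]
flagged = out_of_control(weekly_errors, sample_size=200)

# Made-up classification of ten observed errors by cause.
causes = ["missing value"] * 5 + ["wrong format"] * 3 + ["duplicate"] * 2
ranking = pareto(causes)
```

In this sketch the limits are computed from the same samples that are being judged, which suits a first look at historical data. Once a process is stable, a team would normally fix the limits and compare new samples against them.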
The traditional course in database systems focuses on database modeling, query languages, database design, and database development. These topics dominate the database textbooks currently available. Frost (1997) argues that many graduates emerge unprepared to apply database technology to solve business problems, and goes on to suggest that a course on database systems should focus on placing students in teams so that they can learn to design solutions to real business problems. Rolier (1993) stresses the importance of a database project in the undergraduate IS curriculum and emphasizes the importance of the implementation experience in understanding concepts such as referential integrity, commit and rollback, and domains. Hossein (1992) describes a project-oriented, database life cycle approach to teaching the undergraduate database course. However, data quality is seldom explicitly mentioned in a database course or textbook. Topics that relate indirectly to data quality, such as data integrity, security, and concurrency control, are widely covered. Unfortunately, the student does not generally receive instruction on the overall importance of data quality in the design and implementation of databases.
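A small classroom exercise can make referential integrity and commit and rollback concrete. The sketch below assumes Python's standard `sqlite3` module and a hypothetical `department`/`employee` schema; the schema and the `hire` helper are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    "CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute("""
    CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        dept_id INTEGER NOT NULL REFERENCES department(dept_id)
    )
""")
conn.execute("INSERT INTO department VALUES (10, 'Accounting')")
conn.commit()

def hire(conn, employees):
    """Insert all employees as one transaction; undo everything on failure."""
    try:
        with conn:  # commit if the block succeeds, roll back if it raises
            conn.executemany("INSERT INTO employee VALUES (?, ?, ?)", employees)
        return True
    except sqlite3.IntegrityError:
        return False

def employee_count(conn):
    return conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0]

# Department 99 does not exist, so the whole batch is rolled back.
bad_ok = hire(conn, [(1, "Ann", 10), (2, "Bob", 99)])
count_after_bad = employee_count(conn)

# Both rows reference an existing department, so the batch commits.
good_ok = hire(conn, [(1, "Ann", 10), (2, "Bob", 10)])
count_after_good = employee_count(conn)
```

The first batch shows that a single orphaned reference undoes the valid row inserted alongside it. Students can discuss whether all-or-nothing loading is the right policy for a given business process.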
Certainly, increased emphasis on the principles of quality management and software quality will greatly aid the IS professional wishing to deal with data quality. Indeed the principles of Total Quality Management have been suggested as tools to improve classroom teaching (Thomson, 1994). However, it is felt that the omission of data quality as a core IS curriculum topic will leave the student with an incomplete perception of the role of data quality in the information system. This comes at a time when information system professionals are increasingly being held responsible for data quality within the organization.
A major objective of this paper is to analyze several of the popular database system textbooks to determine their coverage of topics related to data quality. This analysis will then be used to suggest specific topics that can be integrated into a database systems course that introduces students to data quality. There are many different approaches that can be applied to the study of data quality. Wang, Storey, and Firth (1995) created a framework for analyzing data quality research based on an analogy between product manufacturing and data manufacturing. Other approaches to data quality relate to the data life cycle, value chain analysis, EDP auditing, and database integrity.
Our data quality framework meets the following criteria: (1) complete coverage of topics necessary for a student to understand all aspects of data quality, (2) understandable and consistent with the concepts students need to learn, and (3) pedagogically consistent with the existing IS undergraduate curriculum. The ultimate purpose of the data quality framework is to analyze the curriculum topics that relate to data quality. The four dimensions of the framework consist of:
Within each of these dimensions several subdimensions exist. Associated with each subdimension is a set of specific topical areas. This detailed framework permitted a thorough and consistent analysis of the textbooks. Table 1 provides a detailed description of the topics covered within each dimension of the data quality framework.
| | Data Quality Dimensions | Topics Covered within DQ Dimensions |
|---|---|---|
| A | Data life cycle: Data acquisition | overall view |
| A1 | Define the View (Logical Database Design) | E/R diagrams, normalization, database models |
| A2 | Implement view (Physical database design) | integrity constraints, indexes, denormalization |
| A3 | Obtain values (Populating the database) | data entry, data import, downloading |
| A4 | Update records (Storage, update and deletion of records) | data dictionary, metadata, database architectures (distributed, client/server, centralized) |
| B | Data life cycle: Data use | overall view |
| B1 | Define subview | query design, SQL, QBE, other DDL's |
| B2 | Retrieve data | query performance and optimization, concurrency control (deadlock handling, locking) |
| B3 | Manipulate data | sort, aggregate, reformat, classify, analyze |
| B4 | Present results | report design and layout, forms, graphical presentation of results |
| B5 | Use of data | role of data in decision making, user interface design |
| C | Process management | importance of business processes on data quality |
| C1 | Business processes | role of data suppliers, data processors, and data users in a business process |
| C2 | Process design and modeling | data flow diagrams, workflow diagrams, enterprise modeling |
| C3 | Management issues | accountability for data, legal issues, policies |
| C4 | Issues in transaction processing | definition of transaction, rollback recovery, time stamping |
| D | Data Quality Control | importance and role of data quality |
| D1 | Data quality metrics | accuracy, timeliness, relevance |
| D2 | Management issues related to data quality | planning and administration of the data quality function |
| D3 | Quality control for data | auditing, sampling, tracking, assessment |
| D4 | Database security | authorization rules, authentication, encryption |
The data quality framework was used to evaluate coverage of data quality topics in several representative database textbooks. The analysis was not intended to rank textbooks in order to select a "best" textbook. Rather, the analysis was done to demonstrate the coverage given to data quality topics in popular, representative textbooks. Five textbooks suitable for the introductory undergraduate class on database systems were chosen for analysis, based on popularity (market share), pedagogical emphasis (MIS or computer science), and obvious coverage of topics related to quality. The textbooks chosen for analysis were:
The Kroenke (1997), McFadden and Hoffer (1994), and Watson (1996) texts would generally be considered textbooks suited for MIS curricula, while the Date (1995) and Silberschatz, Korth, & Sudarshan (1997) texts are geared more towards a computer science curriculum. The Watson (1996) text was chosen for analysis because it is the only known database textbook with explicit material devoted to quality improvement. In order to evaluate the coverage that a textbook gives to a particular data quality dimension, book content was evaluated on whether the material presented was theoretical or definitional in nature and whether or not an example application was presented. Table 2 shows the legend for the textbook evaluations.
| Legend | Description |
|---|---|
| T | Theoretical (provides a theoretical framework of the topic with detailed explanation) |
| D | Definition (provides a definition of the topic with a brief explanation) |
| A | Applied (shows an application or an example related to the topic) |
| T/A | Theoretical with Applied application or example |
| D/A | Definition with Applied application or example |
| N | None (topic not covered) |
Table 3 shows the analysis of the five database textbooks across the data acquisition phases of the data life cycle. As expected, all texts provided detailed theoretical coverage with examples related to logical and physical database design and the issues related to data storage. However, all texts were weak on material related to populating the database with data. Only the McFadden and Hoffer text provided extensive coverage on the overall database design process as it relates to system analysis, design, and implementation.
| | Data Acquisition topic | Kroenke | Date | McFadden & Hoffer | Watson | Silberschatz, Korth & Sudarshan |
|---|---|---|---|---|---|---|
| A | Overall view of data life cycle | D/A | D | T/A | D | D |
| A1 | Database design (logical) | T/A | T/A | T/A | T/A | T/A |
| A2 | Database design (physical) | T/A | T/A | T/A | T/A | T/A |
| A3 | Populating the database | D | D | D/A | D | D |
| A4 | Storage, update, and deletion of records | T/A | T/A | T/A | T/A | T/A |
Table 4 shows the analysis of the five database textbooks across the data use phases of the data life cycle. None of the texts provides a detailed explanation of the data-usage life cycle, although McFadden & Hoffer do present an example that demonstrates all stages of data in a database application. All texts provided detailed coverage of query languages with a strong emphasis on SQL. No material was presented on techniques for planning a query (e.g., concept mapping). Only the more technically focused books provide coverage of issues related to query performance and optimization (retrieve data). While all textbooks provide strong coverage of data manipulation, this coverage is limited to manipulation within the query itself (manipulate data). Only the Silberschatz, Korth, & Sudarshan text covers material related to human manipulation of the data as a pre- or post-query activity. Little coverage is given to how data are presented to the user (report and form design) or to how information is used in organizational decision making.
| | Data Use topic | Kroenke | Date | McFadden & Hoffer | Watson | Silberschatz, Korth & Sudarshan |
|---|---|---|---|---|---|---|
| B. | Overall view of data life cycle: data use | N | N | D/A | D | N |
| B1. | Define subview | T/A | T/A | T/A | T/A | T/A |
| B2. | Retrieve data | D/A | T/A | T/A | D/A | T/A |
| B3. | Manipulate data | T/A | T/A | T/A | T/A | T/A |
| B4. | Present results | A | N | D/A | N | D |
| B5. | Use of data | N | N | D | D | D/A |
Table 5 shows the analysis of the five database textbooks in terms of business process concepts. None of the texts provides good overall coverage on the role of business processes in data quality. McFadden & Hoffer do provide an example application that shows the role of data in a business process. All of the texts provide detailed coverage of issues related to transaction processing. The text by Silberschatz, Korth, & Sudarshan provides the most detailed technical coverage in this area.
| | Process Management topic | Kroenke | Date | McFadden & Hoffer | Watson | Silberschatz, Korth & Sudarshan |
|---|---|---|---|---|---|---|
| C. | Overall importance of process management on data quality | N | N | D | D | N |
| C1. | Business processes | D | N | D/A | D | N |
| C2. | Process design and modeling | N | N | T/A | N | N |
| C3. | Management issues (Policies, legal, accountability) | T | N | T | T | D |
| C4. | Issues in transaction processing | T/A | T/A | T/A | T | T/A |
Table 6 shows the analysis of the five database textbooks in terms of data quality control concepts. All of the textbooks give detailed coverage on database security. Only the Watson text provides detailed coverage on the overall importance of data quality and management issues related to data quality. However, even the Watson text does not cover the need for a program that involves measuring, tracking, assessing, and improving data quality.
| | Data Quality Control topic | Kroenke | Date | McFadden & Hoffer | Watson | Silberschatz, Korth & Sudarshan |
|---|---|---|---|---|---|---|
| D. | Overall importance of data quality control | N | N | N | T | N |
| D1. | Quality metrics | N | N | N | D | N |
| D2. | Management issues related to data quality | D | N | D | T | N |
| D3. | Quality control for data | N | N | N | N | N |
| D4. | Database security | T/A | T/A | T/A | T/A | T/A |
We conclude that data quality is not explicitly covered in database system textbooks. However, database design issues that relate to data quality are typically covered. These topics include: logical and physical database design, data storage, defining a subview (designing a query), manipulating data in a query, transaction processing, and database security. These topics are more technical in nature.
The organizational component of a database system is not typically emphasized in an undergraduate database class, nor is data quality from the user's perspective. These topics include, among others, report and form design, the role of data in business process design, populating a database, and system quality. While many of these system issues are, or should be, covered in courses on system analysis and design, the database course is the ideal place for the student to see relationships between system issues and database design issues that relate to the quality of data.
A content analysis of our small sample of database textbooks shows that none gave coverage of techniques for assessing, tracking, and improving data quality in an organization. To raise the awareness of data quality in the database system class, the following classroom "modules" are suggested for use by the IS instructor. While these modules have been used in the database course, they may also be integrated into a course on systems analysis and design, information resource management, or the introductory course in management information systems.
Through the use of examples, demonstrate to the class the importance of data quality. Redman (1996) provides many examples taken from the popular press. Perhaps the most convincing argument is a personal example in which a hotel chain incorrectly recorded a hotel reservation made by Redman and his wife; the result was an estimated loss of over $5,000 to the organization. Demonstrate how the business process is the root cause of the data quality issue. Won't every student in your class have a personal example where inaccurate data cost them time and/or money?
Show students both the Data Acquisition and the Data Use Life Cycle. Emphasize that data are dynamic and change over time. Demonstrate that data quality is determined both by effective database design and by the end-user's perception of the data. It is important that IS professionals work with end users to deliver high quality reports and queries. Give examples of a database with highly accurate data, but with long response times and poorly written reports. Who will be responsible for this low quality database?
Give students a hands-on project where they are asked to measure, assess, track, and improve data quality. Ideally, this would be a project that uses an extant database. Student teams would develop metrics for assessing data quality, develop diagrams to assess existing business processes, develop techniques for tracking data quality metrics over time, and develop solutions for improving data quality. Notice that this project goes beyond a data audit in that it analyzes the existing system and generates solutions. Solutions may be in the form of improvements to the business process, the database design, the system architecture, or the existing policies and procedures.
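As a starting point for such a project, student teams might compute a few simple metrics and log them at regular intervals. The following Python sketch measures completeness and validity for each field and packages the results as a dated snapshot. The metric definitions, the validation rules, and the sample records are all assumptions chosen for illustration; a real project would derive them from the needs of the data consumers.

```python
import re
from datetime import date

# Hypothetical validity rules, one per field to be measured.
RULES = {
    "email": lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    "zip": lambda v: re.fullmatch(r"\d{5}", v) is not None,
}

def field_metrics(records, field, rule):
    """Completeness: share of records that have a value for the field.
    Validity: share of the present values that pass the rule."""
    values = [r.get(field) for r in records]
    present = [v for v in values if v not in (None, "")]
    completeness = len(present) / len(values) if values else 0.0
    validity = sum(rule(v) for v in present) / len(present) if present else 0.0
    return {"completeness": completeness, "validity": validity}

def snapshot(records, as_of):
    """One dated row of metrics, suitable for appending to a tracking log."""
    row = {"as_of": as_of.isoformat()}
    for field, rule in RULES.items():
        row[field] = field_metrics(records, field, rule)
    return row

# Made-up records standing in for an extract from an existing database.
records = [
    {"email": "ann@example.com", "zip": "28403"},
    {"email": "bob@example", "zip": "2840"},    # malformed email and zip
    {"email": "", "zip": "02747"},              # missing email
    {"email": "eve@example.com", "zip": None},  # missing zip
]
snap = snapshot(records, date(1998, 1, 15))
```

Snapshots taken weekly or monthly give the team a time series to chart, so the effect of a process or design change can be checked against the metrics rather than assumed.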
Data quality can be emphasized throughout the database course by supplying relevant tips for improving data quality. Database Design Tips (Table 7) and Tips for Database Testing (Table 8) illustrate strategies used by information system professionals and IS faculty teaching the database course.
| | Database Design Tip | Description of Design Tip |
|---|---|---|
| 1 | Create a data value as few times as possible. | Inconsistencies between multiple values often go unnoticed until they are the source of a problem. |
| 2 | Store data in as few databases as possible. | Multiple storage makes it difficult to maintain consistency, especially when data change. |
| 3 | Put data in machine-readable form as early in the business process as possible. | Computers and scanners are better than people at activities such as reading and inputting data. However, do not assume that computerized data collection is 100% accurate. |
| 4 | Minimize data format changes within the business process. | If format changes are necessary, use computers, not people, to make format changes. |
| 5 | When obtaining data for the first time, obtain them just before they are first needed. | Existing data values change rapidly. Capture changes to data values as soon as possible after they change. |
| 6 | Discontinue gathering and storing data that are no longer useful. | Plan for periodic review of data needs. When data are no longer useful, they need not be destroyed, simply moved to secondary storage. |
| 7 | Employ codes that are easy for data creators and users to understand. | Avoid long, numeric, meaningless coding conventions in favor of short, meaningful words or abbreviations. |
| 8 | Place edits as near as possible to data creation or modification. | Use edits as input criteria to a database, as opposed to exit criteria from a database to an application. |
| 9 | Employ single-fact data wherever possible. | Single-fact data help reduce code complexity and simplify operators' jobs. |
| | Database Testing Tip | Description |
|---|---|---|
| 1 | Create test data | Create sample files that are small enough to be manageable, but large enough to be representative. Use mostly real data. Include a full set of matching records from each group of related tables. |
| 2 | Run the new database in parallel with the existing system | This helps to uncover gaps in the application such as missing reports, transaction posting or archiving procedures. Run through a full testing cycle. |
| 3 | Check for duplicate values and errors in data | Where errors are found make sure that all necessary integrity constraints are implemented. |
| 4 | Verify contents of database through report and query generation | Use record counts, batch totals, hash totals, and cross-footing tests to verify data. Prime trouble spots are end-of-month and end-of-year procedures. |
| 5 | Develop auditing procedures as part of a data quality program | Regular procedures should be developed to test the accuracy of the database. This should be part of an overall program to improve data quality in the organization. |
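Testing tips 3 and 4 lend themselves to a short demonstration. The Python sketch below checks a loaded table for duplicate keys and compares record counts, batch totals, and hash totals between a source file and the loaded data. The helper names, the field names, and the invoice data are hypothetical and serve only to illustrate the idea.

```python
from collections import Counter

def duplicate_keys(rows, key):
    """Return the key values that occur more than once, in sorted order."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)

def control_totals(rows, amount_field, hash_field):
    """Record count, batch total of a monetary field, and hash total of an
    identifier field (a sum with no business meaning, used only to compare
    two copies of the same data)."""
    return {
        "record_count": len(rows),
        "batch_total": round(sum(row[amount_field] for row in rows), 2),
        "hash_total": sum(row[hash_field] for row in rows),
    }

# Made-up source file and the (faulty) result of loading it.
source = [
    {"invoice": 1001, "amount": 250.00},
    {"invoice": 1002, "amount": 99.95},
    {"invoice": 1003, "amount": 10.05},
]
loaded = source[:2] + [{"invoice": 1002, "amount": 99.95}] + source[2:]

dupes = duplicate_keys(loaded, "invoice")
source_totals = control_totals(source, "amount", "invoice")
loaded_totals = control_totals(loaded, "amount", "invoice")
totals_match = source_totals == loaded_totals
```

Here the mismatch in all three control totals signals that the load went wrong, and the duplicate check pinpoints the offending invoice. In line with tip 3, the follow-up would be to confirm that a uniqueness constraint is in place on the key.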
Data quality is an increasingly important topic in business and research. However, most information system professionals have not received training in the skills necessary to identify and improve data quality. Textbooks suitable for a course in database systems typically do not explicitly address the issue of data quality. While many of the database design issues associated with data quality are covered in the database textbooks, the system issues that involve the interface between the users, the database, and the business processes are typically ignored. Some of these topics are addressed in analysis and design courses, but usually not within the context of data quality.
The teaching of data quality principles must extend beyond the university curriculum and be integrated into corporate training programs for information system professionals. Professional certification programs such as those run by the Institute for Certification of Computing Professionals (ICCP) and the Data Resource Management Association (DAMA) should incorporate data quality as a core concept. In addition, software-specific training should seek to educate information system professionals on data quality issues. This paper presented four modules that might alert information system professionals to data quality and emphasize the relationship between data quality and the database design techniques that are the traditional domain of a database systems course.