Volume 7 Number 1 Copyright 2001 ISSN 02308
Data Quality in the Healthcare Industry
Michael S. Gendron and Marianne J. D'Onofrio, Central Connecticut State University
Key Words: Quality, Data, Healthcare, Managers, Survey
Abstract:
We examined Wang and Strong's data quality dimensions for three sectors of the healthcare industry. We eliminated five dimensions from Wang and Strong's model and analyzed the efficacy of the remaining fifteen. Our statistical analyses indicate that fifteen of Wang and Strong's data quality dimensions are sufficient to define data quality in all sectors of the healthcare industry. We found, however, that each segment of the healthcare industry must develop a set of domain-specific dimensions that supplement the generic fifteen.
Contents:
1. Introduction
The healthcare industry is composed of a complex set of organizations. This creates an environment with stringent needs for data quality. As in all organizations, healthcare managers rely on data to manage and lead. The business contribution of leadership and management activities depends on the quality of the decisions that are made and, concomitantly, the quality of the data used to make them. Therefore, all data must have a level of quality appropriate for the decisions of which they will be a part (Ballou and Tayi, 1999). Workplace managers are thus data consumers, and the perceived quality of data and information sources affects how managers use them to make decisions (Larcker and Parker, 1980). This study examines how healthcare managers perceive data quality.
Data Use Within the Healthcare Industry
Organizations may be divided along a continuum from for-profit to non-profit (Bozeman, 1987). In this study we divide the healthcare industry into three sectors: for-profit, mixed, and non-profit. While each of the three sectors participates in the delivery of healthcare services, each is somewhat unique in its orientation. For example, within public health agencies data is often used inferentially to support public policy. Death certificates are an example of data used to derive public policy. Death certificate data are often incorrect, in part because of the decline in autopsies over the past several decades in the United States (Altman, 1998). One may compare death certificate data to data used within health maintenance organizations, where public health statistics often set health-plan utilization policies, but financial data is used to run the HMO. Some health maintenance organizations have gone out of business because managers did not pay attention to financial data (Winslow, 1997).
| Organization Type | Rationale | Sector |
|---|---|---|
| Public Health Agency | Receives funding primarily through government sources, with some grant funds. Not market driven. Produces intellectual property and public goods. | Non-Profit |
| Health Maintenance Organization | Receives some funding from the U.S. Government through Medicare and Medicaid. Driven by market forces. Produces private goods for individual consumers and government. | Mixed |
| Pharmaceutical Corporation | Receives the majority of its funding through market forces and a small amount through government grant programs. Almost exclusively driven by market conditions. Produces corporate intelligence and goods. | For-Profit |
Data Quality Attributes
There is a long-standing belief that data quality has many attributes (Agmon and Ahituv, 1987; King and Epstein, 1983; Seddon, 1997). Recently Wang and Strong proposed a data quality framework that includes the categories of accuracy, relevancy, representation, and accessibility (Wang and Strong, 1996). Wang and Strong proposed 20 data quality dimensions, which they later pared down to 15, and which they assembled into 4 categories. They demonstrated that 15 dimensions were relevant to a generic population (generic in the sense that the study did not sample any specific domain or industry, but a large number of domains and industries). Since we felt that several of the dimensions Wang and Strong eliminated had validity in the healthcare industry, our work uses their original 20 dimensions (Table 2). This allowed us to test the "eliminated dimensions" within the healthcare industry.
| Category | Dimension |
|---|---|
| Accuracy of Data | Believability |
| | Accuracy |
| | Objectivity |
| | Reputation |
| Relevancy of Data | Value Added |
| | Relevancy |
| | Timeliness |
| | Completeness |
| | Appropriate Amount of Data |
| Representation of Data | Interpretability |
| | Ease of Understanding |
| | Representational Consistency |
| | Concise Representation |
| Accessibility of Data | Accessibility |
| | Access Security |
| Eliminated Dimensions | Traceability |
| | Variety of Data Sources |
| | Ease of Operation |
| | Flexibility |
| | Cost-Effectiveness |
We also looked at the domain specificity of the data quality dimensions to determine whether there were any significant differences among the sectors of the healthcare industry that we studied.
2. Methods
In this section we explain the methodology for our study by describing the sampling frame, survey instruments, and the survey distribution process. We also explain how we derived our hypotheses.
Sampling Frame
The population we studied was healthcare executives in the United States. We examined three sub-populations -- healthcare executives employed by pharmaceutical companies, health maintenance organizations, and public health agencies. Our sampling frame was drawn from a national database of businesses maintained by InfoUSA (INFOUSA.COM, 1999).
We drew stratified random samples using Standard Industrial Classification (SIC) codes and job titles, a design chosen with chi-square as our primary statistical test in mind. The SIC codes we used were those that appeared to best capture the industry sub-populations under study. We used job titles to stratify by management level.
| Sub-Population | SIC Code | Description |
|---|---|---|
| Public Health Agencies | 943102 | Government Public Health Programs |
| Health Maintenance Organizations | 809904 | Health Maintenance Organizations |
| Pharmaceutical Corporations | 283498 | Pharmaceutical Preparation |
| | 308916 | Pharmaceutical Equipment and Supplies |
| | 355913 | Pharmaceutical Machinery |
| | 591207 | Pharmaceutical Consultants |
| | 873108 | Pharmaceutical Research Labs |
InfoUSA codes its data with a finite number of job titles. We grouped the titles into management levels according to Anthony's framework (Anthony, 1965). We drew a stratified random sample of subjects, by management level, from within each group of SIC codes.
The sample was stratified using the membership percentages from the public health organizations sampling frame to determine the number of subjects to select from each management group. We feel these percentages are representative of those found in most organizations. We combined middle and lower management sampling frames because the database provided by InfoUSA combined these two management levels by using similar titles, making it impossible to differentiate between them. Because of ambiguity among titles, management level (according to Anthony), and sub-populations, we asked subjects to self-report their management level.
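The proportional allocation described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `allocate_sample` and the stratum shares are hypothetical, since the actual membership percentages are not reported here, and largest-remainder rounding is one reasonable way (not necessarily the one used) to make the stratum counts sum to the 300 subjects drawn per sub-population.

```python
from math import floor


def allocate_sample(total, shares):
    """Split `total` subjects across strata in proportion to `shares`.

    Uses largest-remainder rounding so the counts sum exactly to `total`.
    """
    weight_sum = sum(shares.values())
    exact = {name: total * w / weight_sum for name, w in shares.items()}
    counts = {name: floor(x) for name, x in exact.items()}
    leftover = total - sum(counts.values())
    # Hand the remaining subjects to the strata with the largest fractional parts.
    by_remainder = sorted(
        exact, key=lambda name: exact[name] - counts[name], reverse=True
    )
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Hypothetical shares; the actual percentages are not given in the text.
example_shares = {"senior": 0.25, "middle_and_lower": 0.75}
print(allocate_sample(300, example_shares))
```

Because middle and lower management were combined in the sampling frame, the sketch uses only two strata.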
Survey Instruments
The survey questionnaire used in our study was based on items reported by Wang and Strong (Wang and Strong, 1996). Items to be included on the questionnaire were developed during a four-step process to determine a 4-category data quality framework which included 15 dimensions of data quality. Our study essentially used the data quality dimensions determined in step 2 (below), but we include steps 1, 3, and 4 to show our complete methodology:
Step 1. We conducted a survey of two groups of subjects. The first group was composed of 112 data consumers working in industry. The second group was 25 students in an MBA program at a large university in the Eastern United States. We derived 179 data quality attributes from this survey process.
Step 2. We conducted a second survey which used the 179 attributes as items. We mailed questionnaires to 1,500 respondents, randomly chosen from the alumni of an MBA program at a large university in the Eastern United States, and asked them to rate the importance of the 179 data quality attributes. We obtained 355 usable survey questionnaires. We performed a factor analysis on the survey responses using a principal components method to rate attributes. Our factor analysis yielded 29 components which explained 73.9% of the variance. We excluded 9 of the components, based on several exclusion criteria, leaving 20 components, or dimensions of data quality. These dimensions explained 59.3% of the variance. We next performed a reliability assessment of the item sets forming the 20 data quality dimensions using Cronbach's Alpha test, which yielded coefficients between 0.69 and 0.98.
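For readers unfamiliar with the reliability assessment mentioned in Step 2, the following is a minimal sketch of the standard Cronbach's alpha formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The function name `cronbach_alpha` and the ratings are hypothetical and purely illustrative; they are not data from the study.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for `items`, a list of per-item score lists.

    Each inner list holds one item's ratings, in the same respondent order.
    """
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items")
    # Total score per respondent, summed across items.
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(scores) for scores in items)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))


# Hypothetical ratings for a three-item set from five respondents.
ratings = [
    [1, 2, 2, 4, 5],
    [1, 3, 2, 4, 4],
    [2, 2, 3, 5, 5],
]
print(round(cronbach_alpha(ratings), 3))
```

Values closer to 1 indicate that the items in a set move together, which is the sense in which coefficients between 0.69 and 0.98 support treating each item set as a single dimension.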
Prior to performing sorting studies (Steps 3 and 4), we proposed a 4-category preliminary data quality framework. We used sorting studies to test initial groupings and adjust dimensions to encompass target categories. We selected 30 students, who were enrolled in an evening MBA program, to participate in a two-stage sorting study.
Step 3. We randomly selected 18 subjects from among the evening MBA students to participate in our first sorting study. Subjects were each given 20 3x5 index cards with a data quality dimension and a short description of the dimension printed on each card. Subjects were asked to sort their 20 cards into three, four, or five category piles. The subjects were then asked to label each of the piles.
We then adjusted the original assignment of dimensions into categories, based upon this phase-one study. Specifically, five data quality dimensions were eliminated because they were not rated as important by the subjects or were not placed consistently in category piles. Our first sorting study determined the adjusted data quality categories to be used in the second phase sorting study (Step 4).
Step 4. The remaining twelve MBA students participated in the second phase of the sorting study. We gave subjects 15 3x5 index cards. But rather than being asked to place the cards into self-defined piles, the subjects were given an updated list of categories developed during the phase 1 sorting study. We then asked subjects to place cards into the updated category that best represented a data quality dimension. This sorting study confirmed the placement of 15 dimensions into 4 categories and yielded the final generic four-category framework of data quality (see Table 2).
Thus, we started with Wang and Strong's 20 data quality dimensions, then pared the dimensions down to 15. Our study used Wang and Strong's 20 data quality dimensions as items for our survey questionnaire. Then (Steps 3 and 4) we employed different techniques to confirm and refine Wang and Strong's dimensions. We believe our methodology is appropriate for three reasons. First, we wanted to verify that 15 data quality dimensions are important to healthcare managers. Second, we wanted to determine whether the 5 dimensions we eliminated were not important to healthcare managers, thus supporting the notion that 15 data quality dimensions are base elements of data quality but that different sectors may need to add industry-specific dimensions. Third, we note that Wang and Strong used all 20 data quality dimensions to describe a generic population.
The survey questionnaire we used in our study requested that the subjects provide their job function and that they rate the importance of 20 data quality dimensions to their job decisions. To determine job function we asked respondents, "Which most closely describes your function?" and provided four choices (see Table 4). If the respondents didn't see their job function listed (i.e., they selected option "D"), they were asked to provide their title and a brief description of their duties. This additional information was collected so we could re-code the response to the correct management level.
| Option | Job Function |
|---|---|
| A | SENIOR MANAGEMENT - My job is long term and large scope. It deals with corporate goals and external factors. The tasks are non-repetitive and usually not well structured. I set organizational goals. |
| B | MIDDLE OR DEPARTMENT MANAGEMENT - My job scope is medium range. I acquire resources to meet long term goals. The tasks are more frequent, more structured, and well defined. I initiate procedures to meet company goals. |
| C | PROJECT MANAGER OR GENERAL EMPLOYEE - My job scope is short range. I implement corporate goals. My job requires that I perform daily tasks effectively and efficiently. My tasks are repetitive, well understood and well defined. Most transactions use the same information and the same process. |
| D | My job is not described above. |
The importance rating section of the survey was composed of 20 items and descriptions (see Table 5). The items and their descriptions are identical to Wang and Strong's 20 dimensions of data quality, presented on a mapped Likert scale: 1 -- extremely important, 2 -- very important, 3 -- important, 4 -- not very important, 5 -- not important at all. Respondents were asked to rate the importance of the 20 data quality dimensions to decisions they make on their job.
| Dimension of Data Quality (Definition) | Dimension of Data Quality (Definition) |
|---|---|
| ACCESS SECURITY - (data cannot be accessed by competitors, data are of a proprietary nature, access to data can be restricted, secure) | FLEXIBILITY - (adaptable, flexible, extendable, expandable) |
| ACCESSIBILITY - (accessible, retrievable, speed of access, available, up-to-date) | INTERPRETABILITY - (interpretable) |
| ACCURACY - (data are certified error-free, accurate, correct, flawless, reliable, errors can be easily identified, the integrity of the data, precise) | OBJECTIVITY - (unbiased, objective) |
| APPROPRIATE AMOUNT OF DATA - (the amount of data is appropriate to the task at hand) | RELEVANCY - (applicable, relevant, interesting, usable) |
| BELIEVABILITY - (believable) | REPRESENTATIONAL CONSISTENCY - (data are continuously represented in the same format, are consistently represented, consistently formatted, data are compatible with previous data) |
| COMPLETENESS - (the breadth, depth and scope of information contained in the data) | REPUTATION - (the reputation of the data source, the reputation of the data) |
| CONCISE - (well-presented, concise, compactly represented, well-organized, aesthetically pleasing, form of presentation, well formatted, format of the data) | TIMELINESS - (age of data) |
| COST EFFECTIVENESS - (cost of data accuracy, cost of data collection, cost effective) | TRACE-ABILITY - (well-documented, easily traced, verifiable) |
| EASE OF OPERATION - (easily joined, easily changed, easily updated, easily downloaded/uploaded, data can be used for multiple purposes, manipulatable, easily aggregated, easily reproduced, data can be easily integrated, easily customized) | VALUE ADDED - (data provide competitive advantage, data add value to operations) |
| EASE OF UNDERSTANDING - (easily understood, clear, readable) | VARIETY OF DATA AND DATA SOURCES - (a variety of data and data sources are available) |
Questionnaire Mail Out - Mail Back Procedure
We mailed survey questionnaires to selected subjects, together with a laser printed cover letter addressed to each respondent. A prepaid business reply envelope was included. Letters and envelopes were addressed to individuals, since the literature indicates this increases the response rate (Rea and Parker, 1997). Three weeks after we mailed the questionnaires we mailed a follow-up letter to non-responders, together with another survey questionnaire. After six weeks, we mailed a final follow-up letter to non-responders. Our final follow-up letter included a 3" by 5" "lift letter" printed on bright paper and stapled to the cover letter. Three weeks after the final follow-up letter, we ceased data collection.
Each questionnaire had a unique control number, to help us manage returned questionnaires and subsequent follow-up mailings. Our correspondence with respondents explained that we used control numbers to manage mailings and that control numbers were not associated with individual respondents. We assured subjects that their responses would be confidential and their anonymity would be protected.
Hypothesis Testing
Our research builds on Wang and Strong's study of data quality dimensions (Wang and Strong, 1996). We believe that our 15 data quality dimensions can be used as a foundation, but that managers need to supplement our dimensions with additional dimensions that are domain specific. To that end, we evaluated the five dimensions that we eventually eliminated. We believe that further study will uncover more relevant data quality dimensions.
We examined 20 data quality dimensions so we could develop an understanding of how the dimensions might be perceived differently within different sub-populations in the healthcare industry. We evaluated healthcare managers' ratings for all 20 dimensions of data quality. The following hypotheses support our objectives. For the sake of brevity, we state only the alternate hypotheses. These represent our hypothesized outcomes.
H1 Validation of 15 Data Quality Dimensions
(Alternate) We considered a generic dimension to meet healthcare managers' inclusion criteria if it attained a mean rating of at least "important" (a mean rating less than or equal to 3.00, since lower values denote greater importance). We used this hypothesis to test 15 generic data quality dimensions. We hypothesized that all 15 dimensions would receive enough support from healthcare managers to support their inclusion in a generic definition of data quality.
H2A Importance of Traceability and Cost-Effectiveness
(Alternate) We considered cost-effectiveness and traceability, two of the 5 data quality dimensions we eliminated, to meet healthcare managers' inclusion criteria ONLY if they attained a mean rating of at least "important" (a mean rating less than or equal to 3.00).
H2B Variety of Data Sources, Ease of Operation, and Flexibility
(Alternate) We hypothesized that healthcare managers would not consider these data quality dimensions "important," that is, that each would attain a mean rating greater than or equal to 3.00 (on this scale, higher values denote lower importance).
H3 Importance Ratings Comparing Sub-Populations
(Alternate) There will be a difference in the importance rating of data quality dimensions between sub-populations. We tested this hypothesis on all 20 dimensions, using the Chi-Square test (alpha = .05, and 2 degrees of freedom).
Our hypotheses H1, H2A, and H2B are necessarily parsimonious, and restricted to the use of mean ratings attained for those dimensions under test. That allowed us to conform to Wang and Strong's original study, where they eliminated dimensions that were not uniformly placed into a categorical framework nor highly ranked in importance. Our study tested individual dimensions rather than the categorical framework. We felt that if healthcare managers ranked data quality dimensions as "important" the dimensions were, in fact, important in the healthcare industry.
3. Results
We mailed survey questionnaires to 900 subjects, in three different sub-populations in the healthcare industry (300 subjects per sub-population). The overall response rate was 30.55%. Individual sub-populations had different response rates, which are summarized in Table 6.
| | Public Health Agency | Health Maintenance Organization | Pharmaceutical Corporation |
|---|---|---|---|
| Number of Subjects | 300 | 300 | 300 |
| Number of Respondents | 149 | 61 | 65 |
| Response Rate | 49.67% | 20.33% | 21.67% |
Non-responses fell into four categories: bad addresses, those unable to respond because of confidentiality concerns, those unwilling to respond, and non-responses for unknown reasons. We summarize non-response rates in Table 7.
Four subjects who returned the survey questionnaire uncompleted cited confidentiality issues. These subjects indicated that they could not respond to our study because they were in litigation over data-related issues or they feared being drawn into litigation at some future time. They indicated they were concerned that, despite the promise of anonymity given all respondents, their individual responses might someday become evidentiary material in a legal case. These anecdotes, and the response clustering described below, may point to a high level of social desirability bias when subjects are asked questions about data quality perceptions. We believe this was the greatest cause of survey non-response, and we feel that the response clustering we experienced is directly related to social desirability. Our survey asked about data quality and decision-making, which is clearly a sensitive topic in the healthcare industry. Our experience is similar to what other researchers have seen when asking sensitive questions. We are envisioning ways to control for this bias in future studies.
| Reason | Number of Subjects | Percent Non-Responding |
|---|---|---|
| Bad Addresses | 51 | 8.16% |
| Confidentiality Related | 4 | 0.64% |
| Unwilling to Respond, but Returned Survey | 6 | 0.96% |
| Did not respond in any way | 564 | 90.24% |
| Total Non-Respondents | 625 | |
| Non-Response Rate | 69.44% | |
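The figures in Tables 6 and 7 can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts reported in the two tables; the variable and function names are illustrative. It confirms that respondents and non-respondents together account for all 900 mailed questionnaires and recomputes the percentages (the overall response rate works out to 275 of 900, or about 30.6%).

```python
MAILED_PER_GROUP = 300

# Counts as reported in Table 6.
respondents = {
    "Public Health Agency": 149,
    "Health Maintenance Organization": 61,
    "Pharmaceutical Corporation": 65,
}

# Counts as reported in Table 7.
non_respondents = {
    "Bad Addresses": 51,
    "Confidentiality Related": 4,
    "Unwilling to Respond, but Returned Survey": 6,
    "Did not respond in any way": 564,
}


def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


total_mailed = MAILED_PER_GROUP * len(respondents)
total_responded = sum(respondents.values())
total_missing = sum(non_respondents.values())

# Every mailed questionnaire should be accounted for exactly once.
assert total_responded + total_missing == total_mailed

for group, n in respondents.items():
    print(f"{group}: {percent(n, MAILED_PER_GROUP):.2f}% responded")
for reason, n in non_respondents.items():
    print(f"{reason}: {percent(n, total_missing):.2f}% of non-respondents")
print(f"Non-response rate: {percent(total_missing, total_mailed):.2f}%")
```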
Overall, the response set contained an unexpectedly narrow clustering of responses, yielding little variability across the importance ratings of the 20 data quality dimensions. Respondents almost always (96.23% of all responses) rated the 20 data quality dimensions as "extremely important," "very important," or "important." Only 3.77% of the responses were rated as "not very important" or "not important at all." This may be explained by: 1) the earlier work upon which this study is based determined the 20 dimensions should be rated as at least "important" within a generic population (Wang and Strong, 1996), or 2) respondents did not want to appear flippant about any data quality dimension, and therefore did not rate anything as less than "important." We will explore these hypotheses further in a future study.
Due to the lack of response variability, we collapsed the three lowest rating categories ("important," "not very important," and "not important at all") into a new rating named "LVI" (less than very important) for our chi-square analysis. To ensure that we lost no information, we performed chi-square tests on both collapsed and un-collapsed responses. There was no significant difference between the two.
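The collapsing step can be illustrated with a short sketch. It assumes that ratings 1 and 2 are kept as they are and that ratings 3 through 5 are recoded to a single LVI level coded as 3, which matches the rating endpoints reported later for value added. The function name and the sample responses are hypothetical.

```python
from collections import Counter

LVI = 3  # collapsed code for "less than very important"


def collapse_rating(rating):
    """Map a 1-5 importance rating onto the collapsed 3-level scale."""
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError(f"rating out of range: {rating!r}")
    return rating if rating <= 2 else LVI


# Hypothetical responses for one dimension, for illustration only.
raw = [1, 1, 2, 3, 2, 4, 1, 5, 3, 2]
collapsed = Counter(collapse_rating(r) for r in raw)
print(sorted(collapsed.items()))  # [(1, 3), (2, 3), (3, 4)]
```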
4. Testing Hypotheses
We tested Hypothesis 1 to confirm whether the 15 generic dimensions are applicable to the healthcare industry by determining their importance to managers. As we stated, all 15 dimensions attained a mean rating of at least "important" (< 3.00, with a lower value denoting greater importance). The mean importance ratings for the 15 generic data quality dimensions are provided in Table 8.
| Generic Dimension | Mean | Number |
|---|---|---|
| Access Security | 1.67 | 270 |
| Accessibility | 1.65 | 271 |
| Accuracy | 1.47 | 271 |
| Appropriate Amount of Data | 2.09 | 271 |
| Believability | 1.73 | 265 |
| Completeness | 1.84 | 267 |
| Concise | 2.01 | 270 |
| Ease of Understanding | 1.78 | 271 |
| Interpretability | 1.98 | 270 |
| Objectivity | 1.87 | 264 |
| Relevancy | 1.89 | 288 |
| Representational Consistency | 2.03 | 268 |
| Reputation | 1.97 | 270 |
| Timeliness | 1.91 | 271 |
| Value Added | 2.06 | 269 |
1 = Extremely Important, 2 = Very Important, 3 = Important,
4 = Not Very Important, 5 = Not Important At All
We tested Hypothesis 2 to determine whether two of the eliminated dimensions (H2A: cost-effectiveness and trace-ability) could attain a level of statistical significance that would indicate that they should be included in a domain specific definition of data quality while three (H2B: ease of operation, flexibility, and variety of data sources) would not. Based on our experience in the healthcare industry we felt that cost-effectiveness and trace-ability were significant dimensions in the healthcare industry, but we were not convinced that ease of operation, flexibility, and variety of data sources were. Our statistical analysis showed that all 5 eliminated dimensions exceeded a rating of "important" and were skewed toward "extremely important."
| Eliminated Dimensions | Overall Mean |
|---|---|
| Cost Effectiveness | 2.07 |
| Ease of Operation | 1.92 |
| Flexibility | 2.14 |
| Traceability | 1.98 |
| Variety of Data Sources | 2.31 |
1 = Extremely Important, 2 = Very Important, 3 = Important,
4 = Not Very Important, 5 = Not Important At All
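The decision rule behind Hypotheses 1 and 2 reduces to comparing each mean rating with the scale value for "important." The sketch below applies that rule to the means reported in Tables 8 and 9; the names `IMPORTANT` and `meets_threshold` are illustrative, and the cutoff of 3.00 (inclusive) follows the statement of H1.

```python
IMPORTANT = 3.00  # scale value for "important"; lower means more important

# Mean ratings as reported in Table 8.
generic_means = {
    "Access Security": 1.67, "Accessibility": 1.65, "Accuracy": 1.47,
    "Appropriate Amount of Data": 2.09, "Believability": 1.73,
    "Completeness": 1.84, "Concise": 2.01, "Ease of Understanding": 1.78,
    "Interpretability": 1.98, "Objectivity": 1.87, "Relevancy": 1.89,
    "Representational Consistency": 2.03, "Reputation": 1.97,
    "Timeliness": 1.91, "Value Added": 2.06,
}

# Mean ratings as reported in Table 9.
eliminated_means = {
    "Cost Effectiveness": 2.07, "Ease of Operation": 1.92,
    "Flexibility": 2.14, "Traceability": 1.98,
    "Variety of Data Sources": 2.31,
}


def meets_threshold(mean_rating, cutoff=IMPORTANT):
    """True when a dimension is rated at least "important" on average."""
    return mean_rating <= cutoff


all_means = {**generic_means, **eliminated_means}
failing = [name for name, m in all_means.items() if not meets_threshold(m)]
least_important = max(all_means, key=all_means.get)

print("Dimensions below the threshold:", failing)
print("Least important on average:", least_important)
```

Even the dimension with the weakest mean rating, Variety of Data Sources at 2.31, sits well inside the cutoff.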
We tested Hypothesis 3 to determine whether the healthcare industry managers' rating of data quality dimensions is homogeneous. Of the 20 dimensions of data quality we tested, only value added (Chi-square = 11.160, df = 4, p = .025) exhibited any significant differences between healthcare industry sub-populations. Table 10 shows the descriptive statistics for value added.
| Organization | Mean | Mode | Number |
|---|---|---|---|
| Overall | 2.06 | 3 | 269 |
| Public Health | 2.21 | 3 | 146 |
| Health Maintenance | 1.87 | 1 | 60 |
| Pharmaceutical Corporations | 1.90 | 1 (note *) | 63 |
Rating Endpoints: 1 = Extremely Important and 3 = LVI
* Multiple modes exist, smallest value is shown
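Two quick consistency checks on the value-added statistics can be sketched as follows. The first recomputes the p-value from the reported chi-square statistic and df = 4, using the closed form of the chi-square upper tail that holds for even degrees of freedom. The second recomputes the overall mean in Table 10 as a respondent-weighted average of the sub-population means. The function name `chi2_sf_even_df` is illustrative, not taken from any library.

```python
from math import exp, factorial


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even `df`.

    For df = 2k the tail has the closed form
    exp(-x/2) * sum((x/2)**i / i! for i in range(k)).
    """
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    return exp(-half) * sum(half**i / factorial(i) for i in range(df // 2))


# Reported test statistic for value added.
p_value = chi2_sf_even_df(11.160, 4)
print(f"p = {p_value:.3f}")  # p = 0.025

# (mean, number of respondents) per sub-population, from Table 10.
groups = {
    "Public Health": (2.21, 146),
    "Health Maintenance": (1.87, 60),
    "Pharmaceutical Corporations": (1.90, 63),
}
n_total = sum(n for _, n in groups.values())
pooled_mean = sum(mean * n for mean, n in groups.values()) / n_total
print(f"pooled mean = {pooled_mean:.2f} over {n_total} respondents")
```

Both checks agree with the reported figures: the recomputed p-value rounds to .025, and the pooled mean rounds to 2.06 over 269 respondents.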
5. Discussion
Our study determined that healthcare managers consider not only Wang and Strong's 15 data quality dimensions important within the healthcare industry, but the 5 eliminated dimensions as well. Because only 15 dimensions were relevant to the generic sample studied by Wang and Strong, while we found that all 20 dimensions were important to healthcare managers, we believe that 15 dimensions are a base set that needs to be supplemented with other domain-specific dimensions.
When we examined the healthcare industry by sub-population, we found there was little significant variability among the individual dimensions. Value-added was the only dimension that showed significant variability among sub-populations of healthcare managers. Our analysis of the value-added ratings shows an interesting pattern. Health maintenance organizations and pharmaceutical corporations appear to exhibit similar importance ratings, indicating similar perceptions of value-added. Their ratings of value-added were skewed toward "very important." When we examined public health agencies, we found the perceptions were reversed, with 43.5% of respondents rating value-added as LVI (less than "very important"; that is, "important," "not very important," or "not important at all"). After reviewing a statistical breakdown of LVI, it appears that public health agencies are less concerned than HMOs and pharmaceutical companies about decision-making data adding value to their organization. We feel that this is most likely due to public health agencies' external public-good focus and their usual lack of concern about profitability.
6. Conclusion
We suggest that researchers explore the domain specificity of data quality and examine specific industries, with the goal of discovering data quality frameworks that are industry specific. The four-category framework proposed by Wang and Strong also needs to be tested with industry specific samples.
References
Agmon, N., and N. Ahituv. 1987. Assessing data reliability in an information system. Journal of Management Information Systems, 4: 34-44.
Altman, L. 1998. The doctor's world: getting it right on the facts of death. The New York Times, 16 January, p. F7.
Anthony, R.N. 1965. Planning and control systems: a framework for analysis. Cambridge: Harvard University Press.
Ballou, D., and G. Tayi. 1999. Enhancing data quality in data warehouse environments. Communications of the ACM, 42: 73-78.
Bozeman, B. 1987. All organizations are public. San Francisco: Jossey-Bass, Inc.
INFOUSA.COM. 1999. http://www.infousa.com.
King, W., and B. Epstein. 1983. Assessing information value: an experimental study. Decision Sciences, 14: 34-35.
Larcker, D., and L. Parker. 1980. Perceived usefulness of information: a psychometric examination. Decision Sciences, 11: 121-134.
Rea, L., and R. Parker. 1997. Designing and conducting survey research: a comprehensive guide, 2nd ed. San Francisco: Jossey-Bass, Inc.
Seddon, J. 1997. In pursuit of quality. Dublin: Oak Tree Press.
Wang, R., and D. Strong. 1996. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems, 12: 5-34.
Winslow, O. 1997. Oxford health plans nemesis - data quality. The Wall Street Journal, 11 December, p. B1.
Comments: dqemail@aol.com (12-01-2001)