Volume 3 Number 1 Copyright 1997
Modeling Database Error Rates
Elizabeth Pierce, Indiana University of Pennsylvania
Key Words: Simulation, Queues, Alumni Data, Out-of-Date Addresses, Errors
Abstract: The author developed a data quality simulation model that allows data managers to measure, forecast, and analyze the data quality of their databases. Erroneous data values in a database are modeled as entities in a queuing system. In this queuing system model, the important parameters are the arrival patterns of new errors and the length of time an error remains in the database. These queuing parameters are influenced by several risk factors, categorized as the inherent, control, and detection risks. These risk factors are, in turn, influenced by variables describing the types of error, entity, attribute, and system involved. By modeling the queuing parameters as a function of variables (such as a population's demographic characteristics) or the way the database is managed, these variables can be related to the quality of a database. Using simulation, we studied the effect of changes in these variables on data quality levels in a database. We demonstrate the methodology using out-of-date address data in an alumni information database at the University of Michigan.
How good are my data? This question is being asked more and more often as managers use data stored in databases for decision-making. Redman estimated that payroll record changes have a 1% error rate, billing records have a 2-7% error rate, and the error rate for credit records may be as high as 30%. [1] In 1992, The Wall Street Journal reported that 25 of 50 information executives it surveyed believed their corporate information was less than 95% accurate. Almost all of them said that databases maintained by individual departments were not good enough to be used for important decisions. [2] Knight reached a similar conclusion after surveying 501 corporations with annual sales of more than $20 million. Two-thirds of the Information Systems managers he polled reported data quality problems. [3]
Data Quality
In this study, "poor quality data" are defined as erroneous values assigned to attributes of some entity. A "datum" is a single piece of information. "Data" are items of information. A datum also contains the triple specification of an entity, attribute, and value. An "entity" is a person, place, thing, or event about which data are created and stored. An "attribute" is a field of meaningful data about an entity. A record in a database or file, for example, contains attributes of one entity.
Because an erroneous datum can have a wide range of problems, for simplicity's sake we will separate errors into four categories - completeness, consistency, currency, and accuracy - suggested by Redman. [4] A datum is consistent if its value satisfies a set of constraints such as formal rules, logical requirements, or relational requirements vis-a-vis other variables. A datum is non-current or out-of-date if its recorded value was true in the past but no longer agrees with the present true value. A datum is accurate if its true value agrees with its recorded value. One should also note that "accuracy" is not necessarily equivalent to "currency." An accurate datum might never have been current, and a current datum might not be accurate.
Several types of risks can influence the percentage of incomplete, inconsistent, or out-of-date data for a given attribute in a database. One is inherent risk: the probability that a record entering a database will possess an erroneous datum. Another is control risk: the probability that the system's edit checks will not detect the error and will allow the erroneous datum to enter the database.
If there is no correction of the erroneous data in a database, we can compute the database error rate as the product of the inherent risk and the control risk (IR * CR). Redman made the analogy that a database's error rate is like a lake where the pollution level rises and falls with the pollution levels of its incoming streams. [1] For example, if the error rate for incoming data in the transaction stream is 10% and the control systems monitoring the stream are 50% effective in error prevention, the database error rate is 5%.
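The arithmetic in this example can be written out as a short Python sketch. The function name and the conversion from control effectiveness to control risk (one minus effectiveness) are illustrative assumptions rather than part of the original model description.

```python
def database_error_rate(inherent_risk, control_risk):
    """Error rate when no errors are corrected after entry: IR * CR."""
    for name, p in (("inherent_risk", inherent_risk),
                    ("control_risk", control_risk)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be a probability between 0 and 1")
    return inherent_risk * control_risk


# Example from the text: 10% of incoming data are erroneous and the edit
# checks stop 50% of those errors, so the control risk is 1 - 0.50 = 0.50.
incoming_error_rate = 0.10
control_effectiveness = 0.50
rate = database_error_rate(incoming_error_rate, 1.0 - control_effectiveness)
print(f"{rate:.0%}")
```

Note that the control risk is the share of errors that slip past the edit checks, so a more effective control system means a smaller second factor.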
Many times error correction is done after erroneous data reach a database. Then database administrators must consider the detection risk. This is the probability that, over time, database administrators or those using the database will not detect and correct erroneous data. If a database administrator routinely checks and corrects erroneous data in a database, the database error rate should be lower than the incoming error rate.
Risk Management Strategies
The previous section suggests that a database manager has three options: manage the arrival of database errors (Inherent Risk and Control Risk); manage the time an error can survive in the database before correction takes place (Detection Risk); or manage a combination of both. Deciding which option to pursue depends on a variety of characteristics that influence the error arrival rate and error survival time in a database. By modeling the error arrival rate and error survival time as functions of these variables, a data manager can understand how they alter database data quality. As Figure 1 shows, a database's inherent and control risks govern the error arrival rate. The detection risk governs the time an error spends in a database. The type of error, entity, attribute, or system can influence all three types of risk.
Figure 1. Model of Error Arrival Rates and Survival Times Factors
Risk Factors
Error types influence risk levels. Incomplete and inconsistent data lend themselves to computerized edit checks, so these types of errors are easy to detect upon arrival and have a low control risk. On the other hand, it may be very difficult to control the rate at which information becomes out-of-date (inherent risk) or to detect inaccurate information on input (control risk). Thus a data manager may be forced to focus on detection strategies to increase the probability that these types of erroneous data will be detected over time in the database (detection risk).
The type of entity (person, place, thing, or event about which you want to store data) can also determine the risk level. The entity may possess demographic characteristics. Different sources of information can cause some entities to have a higher error risk than others. Errors in records may be easier to detect for some entities than for others (detection risk). It may be easier to verify the information for certain groups of entities upon input (control risk).
The type of attribute (a field of meaningful data about an entity) can also determine the risk level. Certain attributes may have a higher level of inherent risk, control risk, or detection risk depending upon the attribute's characteristics; for example, how often the attribute's values change, how complex an attribute's value can be, and how often the attribute is referenced. Other relevant characteristics could include whether or not the attribute can be assigned a default value, uses meaningful codes, or is a key field. Finally, the system itself can determine risk factors. System characteristics include variables like hardware, software, complexity, stability, auditing, training, design, and reliability.
Complicating things further is time. Databases are dynamic. Records are constantly being added, deleted, or updated in the database. An error may exist in the database for some time before it is detected and eliminated. In other cases, an erroneous datum may not be detected and may be deleted only when a record is purged from the database.
One way to capture variables in a cohesive model is to treat the database errors like entities in a queuing system. In such a system, consider how the error arrival rate and the time-to-detect and service errors influence the number of errors that accumulate in the database. Moreover, by modeling the error arrival rate and service as functions of variables (such as the entity's demographics or the type of detection method), we can change these variables and examine their effect on the database's error rates. By simulating queuing system performance over time, it is possible to predict not only the amount of erroneous data in the database, but their time in the database and the impact of various changes on the amount of erroneous data in the database.
Applying a Queuing System Model
To demonstrate the use of a simulated queue to analyze data quality, we looked at the alumni database of the University of Michigan Business School. The data owners were particularly interested in the problem of out-of-date home address information. The home address is the primary way the university contacts alumni for fund raising and reunions. Each time an alumnus moves, that individual's address information becomes out-of-date and, therefore, erroneous.
How the Alumni Office Collects Its Information
The Central Alumni group at the University of Michigan is responsible for maintaining the Oracle database containing alumni records. Each college within the university has an alumni coordinator who coordinates alumni activities specific to that office. In addition to collecting information on its own, the Central Alumni group receives alumni information collected by individual colleges.
Using the College of Business as an example, the collection process works like this. The University of Michigan maintains names and addresses of Business School alumni in a database. (There may be several addresses stored for an alumnus, such as work address and seasonal address, but to simplify this analysis we used the home address, which is the most common.) Unless an alumnus chooses to inform the alumni office in advance of a move (and this is rare), the Central Alumni group usually learns about the move in one of two ways.
In the first instance, the Central Alumni group sends an address file downloaded from the alumni database to an address change vendor called NCOA (National Change of Address). This vendor matches the alumni address file to its master file and sends the Central Alumni group a file of address changes. The Central Alumni group then updates the alumni database. This occurs about three times per year. The university assumes that NCOA has up-to-date and accurate address information. The head of the Central Alumni group reports he is satisfied with the NCOA updates and has not had any major problems using its data.
The second way the Central Alumni group learns of a move is when they receive information from an individual college. For the College of Business, news about alumni moves usually arrives when the college mails alumni publications. This occurs quarterly. The mailing label affixed to the College of Business alumni magazine requests address change information. If the postal service is unable to deliver the magazine, it will be returned with any available updated address information. Administrative assistants at the College of Business's alumni office are responsible for sorting incoming mail, separating the address change information, and passing it to the Central Alumni group so it can be incorporated into the Central database. This process is also subject to error, because the postal service or college alumni group can make mistakes in recording or transmitting the address change information.
Finally, erroneous information can find its way into an alumni database in a number of ways besides those already mentioned. Information may be miskeyed, mistakes in identity are sometimes made, and some alumni try to hide their whereabouts. Both the Central Alumni group and college alumni groups spend a substantial amount of time trying to locate "lost" alumni.
Collecting and Analyzing Alumni Data
The author used several data sources when modeling the University of Michigan alumni database update processes. These included alumni mortality rates, new alumni arrival rates, alumni demographic information, alumni database operations, the alumni move rate, and the error life of out-of-date addresses. The simulation is based upon data collected during a single year (Spring 1994-Spring 1995). While the simulation produced move statistics that were a reasonable approximation to the actual number of moves that year, one can think of scenarios where a boom economic year or a depressed economic year could produce a far different move rate.
Another assumption that should be stressed is that, for simplicity's sake, the author restricted the study to changes in the home address field. While the home address field is good for tracking the whereabouts of the majority of alumni, it is not perfect. Some alumni prefer mailings to go to their offices. Some University of Michigan retirees maintain two addresses (in Florida and Michigan, for example). The alumni office designs its mailing software to produce mailing labels for one seasonal address or another, depending upon the time of year. Some alumni are tired of requests for donations, so they deliberately try to fool the alumni office as to their whereabouts. And some alumni may reside in nursing homes and, due to failing health and memory, may have difficulty communicating changes in their status to the alumni office.
In the simulations, the author assumed that alumni move at most once every three months. The data indicated this assumption was true for 99% of the alumni in the study. The simulations assumed that the number of out-of-date addresses in the database starts at zero (i.e., the database is initially accurate). This was done, in part, to determine how quickly the database can become contaminated. The author also lumped students pursuing doctorates with MBAs, since Ph.D. candidates were fewer than 1% of the Business School's alumni population.
The author also assumed that both NCOA and the postal service reported back accurate information with equivalent data quality levels. This was done for simplification because the author did not have enough data to know what the differences in data quality were. The simulations did consider the growth in "lost" alumni. Typically, the Central Alumni group waits until the "tracer population" reaches some "critical mass," then tries to determine if these alumni are alive, in nursing homes, or living at some new address.
The simulations assumed that the NCOA and postal service feedback used by the alumni office was based upon NCOA and postal service returns from major alumni publications. These are the principal sources of address change data used in this paper. The author did not incorporate mailings, self-reporting, or miscellaneous sources of information into the study. Because the simulations were based upon the characteristics of the University of Michigan Business School during 1994-95, the author intends the findings to be used to demonstrate that data managers can use basic statistics, queuing theory, and simulations to develop strategies for improving data quality. Every data manager or database administrator should consider database growth rates and the error patterns of various fields in the database.
(In June, 1997, the author contacted the University of Michigan Central Alumni office to determine what, if any, changes had occurred in the two years since she conducted the research that is the basis for this paper. Those in the alumni office who work with the system again stated that, overall, they were satisfied with the way the system worked.)
Every system has its own set of characteristics that may influence the number and lifetimes of data errors. These, in turn, may dictate different management strategies for improving data quality.
Sources Used
As previously mentioned, the author used a number of sources to understand and model address updating of the University of Michigan alumni database. What follows is a discussion of these sources.
Alumni Mortality Rates: The mortality rates used in the simulations were taken from the 1991 Metropolitan Life Tables (the total persons categories). In the simulation runs, these mortality rates were applied against the number of alumni in the various age groups to simulate the change in the size of the various age groups over time. One major assumption the author made is that the Business School alumni population matched the demographics upon which the Metropolitan Life Tables statistics were based.
Arrival Rates of New Alumni: These data were obtained from the alumni Oracle database at the University of Michigan, which contains the list of graduates, their degrees (Master of Business Administration, MBA, and Bachelor of Business Administration, BBA), and graduation dates. We were primarily interested in the population of graduates who had one business degree. We used a SAS program to create a report identifying the number of first-time business degree graduates in each year for each degree category.
To model future arrivals of new alumni, we graphed the number of new alumni by year with first-time BBA and MBA degrees. In examining the graph, we felt that a normal distribution best described the number of graduates with BBAs and MBAs. A goodness-of-fit test confirmed that this was a reasonable assumption.
We used this information in a second set of simulations to model the number of new alumni being added each year to the business alumni population. In using this distribution, we assumed that the number of alumni added each year would be stable over time. While this may be true in the near future, this assumption becomes less valid over time as the University's programs and policies change.
Alumni Demographic Information: We obtained these data from the alumni office Oracle database at the University of Michigan. We assumed that common demographic information such as name, sex, status of alumni (living, dead, inactive), birth date, and degree information was accurate. We regarded other demographic information (participation in alumni events and employment history) as being too unreliable for analysis.
Operations of the Alumni Database: We obtained information about alumni database management by interviewing Central Alumni office managers and staff in the Business School. During the course of a year, the author made daily visits to the alumni office to examine new information about alumni who had changed their addresses.
Alumni Move Rate: During the year of the study, we obtained 4,866 records of changed addresses involving 4,391 business school alumni (some moved more than once during the year). This information was supplied by the U.S. Postal Service, which provided returned address-change information whenever it could not deliver a Business School alumni publication. These data include a date (Month/Year) on which mail forwarding is to begin. The Postal Service retains this information, so it can stop forwarding mail after a year has passed. In addition, we searched the Central Alumni database to find alumni whose home address was corrected by the NCOA address vendor before an alumni publication was mailed (some alumni do not receive alumni publications or have publications sent to their work address).
We merged both hand-collected postal records and the query records from the database into a large file, which we then checked manually. For a move to have occurred, there had to be some indication from either the database or postal records that the alumnus had moved from one address to another during the 1994-1995 study period. We eliminated a significant number of records because there was evidence to suggest that they were corrections to existing current addresses.
There were also records that we believed indicated valid moves, but we were unsure of the exact move dates. If the Postal Service did not supply move dates, we used the earliest date we had on file when the change was detected. Because the alumni office publishes frequently, we could almost always identify the quarter in which the move occurred. While our final file contains some errors, we believe it is a good indicator of the number of moves during that period for the business alumni population.
Using this final move file, we matched the business alumni move data to a file of demographic data, using their unique identification numbers, so we could study the demographic characteristics of relocated alumni. We also broke the data down by quarter in which the move occurred and by demographic groups (e.g., alumni age and time elapsed since graduation).
Error Life of Out-of-Date Addresses: This information was the most problematic. For about half the moves we did not have an exact move date, which made identifying the exact time between the move and final database update difficult to establish. Moreover, several quirks in the update process also made it difficult to establish a definitive distribution model for the time an erroneous address could survive in the database. Consequently, we augmented the small amount of data we felt were accurate with information obtained by interviewing alumni office employees about their database update procedures. Using this strategy, we felt we could tentatively find a distribution for time-to-update address records data.
For the NCOA records, the head of the alumni records office mentioned that he could count on the data being processed sometime between 3 and 7 days, with 5 days being the average. He also felt the database updates followed a similar pattern. This suggested that the updates follow a uniform distribution. Similarly, in the case of mail returns, we again took what little information we had and augmented it with information from alumni office employees. The distributions are shown in Tables 1 and 2.
Table 1. Detection Distribution for Mass Mailing Returns
| Mailing Detection Time | Percentage of Returns |
| 0-14 days | 0% |
| 14-21 days | 5% |
| 21-28 days | 25% |
| 28-35 days | 35% |
| 35-42 days | 15% |
| 42-49 days | 10% |
| 49-56 days | 5% |
| 56-63 days | 2% |
| 63-70 days | 1% |
| 70-77 days | 1% |
| 77-84 days | 1% |
Table 2. Correction Distribution for Mass Mailing Returns
| Mailing Correction Time | Percentage of Returns |
| 0-3 days | 0% |
| 3-7 days | 25% |
| 7-14 days | 25% |
| 14-35 days | 25% |
| 35-56 days | 25% |
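As a minimal sketch of how service times could be drawn from these distributions, the Python code below samples from the bins of Tables 1 and 2 and from a uniform 3 to 7 day vendor turnaround. The bin probabilities come from the tables; the choice to spread each draw uniformly within its bin, the function names, and the random seed are assumptions made for illustration (the study itself used GPSS/H).

```python
import random

# (low_day, high_day, probability) bins copied from Tables 1 and 2.
# The 0% bins (0-14 days and 0-3 days) are omitted because they are
# never drawn.
DETECTION_BINS = [
    (14, 21, 0.05), (21, 28, 0.25), (28, 35, 0.35), (35, 42, 0.15),
    (42, 49, 0.10), (49, 56, 0.05), (56, 63, 0.02), (63, 70, 0.01),
    (70, 77, 0.01), (77, 84, 0.01),
]
CORRECTION_BINS = [(3, 7, 0.25), (7, 14, 0.25), (14, 35, 0.25), (35, 56, 0.25)]


def sample_binned(bins, rng):
    """Pick a bin by its probability, then a day uniformly inside the bin.

    Uniform spread within a bin is an assumption; the tables give only
    the bin totals.
    """
    weights = [p for _, _, p in bins]
    low, high, _ = rng.choices(bins, weights=weights, k=1)[0]
    return rng.uniform(low, high)


def sample_vendor_days(rng):
    """Vendor turnaround: uniform between 3 and 7 days, mean 5."""
    return rng.uniform(3, 7)


rng = random.Random(1997)  # fixed seed so a run can be repeated
detect = [sample_binned(DETECTION_BINS, rng) for _ in range(10_000)]
correct = [sample_binned(CORRECTION_BINS, rng) for _ in range(10_000)]
vendor = [sample_vendor_days(rng) for _ in range(10_000)]
```

Adding a detection draw to a correction draw gives one simulated service time for an address found through a mass mailing.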
Modeling the Move Arrival Rate
To model the alumni move rate, we used a year's mail returns and alumni and census information about household duration (time-in-household between moves). In addition, the alumni office supplied demographic data about alumni and information about the alumni database, which included the times when address changes were detected and the database updated.
We found that a strong relationship existed between move rates and the age of the alumni. As alumni aged, the move rates gradually decreased. The one exception is for BBAs during the first few years after graduation. In this case the move rate increased before declining to converge with the MBA move pattern. This may be caused by BBAs working a few years and either relocating to new jobs or attending graduate school before settling down. We also found that move patterns varied by quarter. As one might expect, the peak move periods occurred in the spring and summer, with fewer moves in the fall and winter.
Because we had only a year of move data, we made the assumption that the number of alumni moves per quarter follows a Poisson distribution with a rate $\lambda_{tj}$. The Poisson distribution is particularly well suited for modeling the number of events occurring per unit of time or space and is characterized by the equation:

$$P(X_{tj} = i) = \frac{e^{-\lambda_{tj}} \, \lambda_{tj}^{\,i}}{i!}$$

where i = 0, 1, 2, ..., maximum number of moves, t represents the quarter in which the moves occur, and j refers to the specific category of alumni whose move rate we are interested in determining. By "category" we mean some subset of all Business School alumni such as all first time MBA alumni who graduated in 1965. By identifying the move rates for the individual categories of alumni, we planned to sum these rates to obtain an overall move rate, $\lambda_t = \sum_j \lambda_{tj}$, for the total business alumni population for that quarter.
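The Poisson probabilities and the summing of category rates described above can be sketched in Python as follows. The category names and rate values are hypothetical and are not taken from the study; summing the rates relies on the usual assumption that the categories move independently.

```python
import math


def poisson_pmf(i, rate):
    """P(X = i) for a Poisson count with the given mean rate."""
    if i < 0 or rate < 0:
        raise ValueError("i and rate must be non-negative")
    if rate == 0:
        return 1.0 if i == 0 else 0.0
    # Work in logs so large counts do not overflow the factorial.
    return math.exp(-rate + i * math.log(rate) - math.lgamma(i + 1))


# Hypothetical quarterly move rates for three alumni categories j.
category_rates = {"MBA 1965": 1.5, "MBA 1990": 12.0, "BBA 1992": 9.5}

# Summing independent Poisson rates gives the overall rate for the quarter.
overall_rate = sum(category_rates.values())

# Probability of seeing exactly 20 moves in the quarter under that rate.
p_twenty = poisson_pmf(20, overall_rate)
```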
Two approaches were used to model the regression equations that related alumni characteristics to move rates. We first tried a classic regression approach using SAS. We also tried a more sophisticated approach using Markov Chain Monte Carlo (MCMC) sampling methods to create a hierarchical Bayes regression model [5, 6, 7, 8]. Both techniques eventually led to the same type of regression equation, with almost identical final results.
After trying a number of models, we decided the best fit for MBAs and older BBAs was a model that related the quarterly number of moves to the log of the independent variable "time elapsed since graduation." This variable was used instead of age because it was both easier to collect and was strongly correlated with age.
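The paper does not report the fitted equation, so the following Python sketch only illustrates one plausible form: an ordinary least squares fit of quarterly moves against the natural log of years since graduation. The function name, the linear-in-log form, and the data are all assumptions; the data are generated from a made-up line purely to show that the fit recovers it, and they are not study results.

```python
import math


def fit_log_model(years_since_graduation, quarterly_moves):
    """Ordinary least squares fit of moves = a + b * ln(years)."""
    x = [math.log(y) for y in years_since_graduation]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(quarterly_moves) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y)
              for xi, yi in zip(x, quarterly_moves))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data generated from moves = 40 - 9 * ln(years).
years = [1, 2, 5, 10, 20, 40]
moves = [40 - 9 * math.log(y) for y in years]
a, b = fit_log_model(years, moves)
```

A negative slope b corresponds to the pattern reported in the text, in which move rates fall as time since graduation grows.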
Modeling the Servicing of Errors
Detection of out-of-date addresses was straightforward for the alumni database. Therefore, the survival time for an out-of-date address was based upon three components: the time until some sort of feedback is initiated (queue time), the time to detect the out-of-date address once the feedback starts (feedback or detect time), and the time to correct the out-of-date address once it is detected (correct time). The University of Michigan Alumni Office primarily uses information from mass mailings and address-change vendors to detect out-of-date addresses. Self-reporting appears to be a minor contributor of address changes. Mass mailings are regularly scheduled events during the course of the year. As previously mentioned, according to alumni office employees the vendor returns the changed addresses within 3 to 7 days. Relying on the employees' experiences and our own observations of the mailing process, we decided to model the time to detect and correct the out-of-date addresses from the mass mailings according to the empirical distributions in Tables 1 and 2.
Modeling the Growth of the Database
Because the number of alumni in the database changes over time, it is important to model the arrival rate of new alumni as well as the death rate of current alumni. Using estimates based on new alumni arrival data from the past 15 years, the annual number of first time BBA graduates was approximated by a normal distribution with a mean = 303 and a standard deviation = 17. The annual number of new MBA graduates was approximated by a normal distribution with mean = 534 and standard deviation = 55. As stated previously, we modeled the alumni mortality rate using statistics from the Metropolitan Life tables.
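A minimal Python sketch of the arrival side of this growth model is shown below, using the means and standard deviations given above. Rounding to whole graduates, clipping at zero, the function name, and the seed are assumptions added for illustration; mortality is left out of the sketch.

```python
import random


def draw_new_alumni(rng):
    """Draw one year's first-time BBA and MBA graduates."""
    # Means and standard deviations are the ones reported in the text.
    bba = rng.gauss(303, 17)
    mba = rng.gauss(534, 55)
    # Head counts are whole and non-negative; rounding and clipping at
    # zero are assumptions added for this sketch.
    return max(0, round(bba)), max(0, round(mba))


rng = random.Random(42)  # fixed seed so a run can be repeated
cohorts = [draw_new_alumni(rng) for _ in range(10_000)]
mean_bba = sum(b for b, _ in cohorts) / len(cohorts)
mean_mba = sum(m for _, m in cohorts) / len(cohorts)
```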
Implementing Queuing System Simulation
We simplified the initial simulations of the database's out-of-date address rates by using several assumptions. To make time calculations easier, we used a simulated year of 360 days with 30-day months. As the simulation ran through the course of the year, we updated the move rate at the beginning of each quarter. We also ran the simulation through an initial "warm up" year so that the first few months of the year would not start empty. Finally, we assumed for the initial simulations that we could obtain the correct updated address information for all out-of-date addresses.
For the simulation, we generated the moves that represented out-of-date addresses using a Poisson distribution with the overall move rate $\lambda_t$ specified for that quarter. We gave each alumni move a unique identifier and recorded the time of the move. At the times when either the vendor's address checks or a mass mailing took place, we assigned those alumni moves that accumulated since the last detection period a queue time, a feedback time, and a correction time. All information about the moves was recorded in an address-change log. We ran each simulation experiment at least ten times, using a different random number seed each time.
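The logging procedure can be sketched in Python as follows. This is not the GPSS/H program used in the study: the quarterly rates, the detection days, and the single combined service time (feedback plus correction) are hypothetical simplifications, and moves are generated with exponential gaps so that the number of moves per quarter is Poisson.

```python
import random

YEAR_DAYS = 360      # simulated year of twelve 30-day months, as in the text
QUARTER_DAYS = 90


def simulate_year(quarterly_rates, detection_days, service_time, seed):
    """Build an address-change log for one simulated year.

    quarterly_rates: expected moves per quarter (each must be > 0).
    detection_days: sorted days of the year on which a vendor run or
        mass mailing starts.
    service_time: function(rng) -> days to detect and correct an address.
    """
    rng = random.Random(seed)
    log = []
    for quarter, rate in enumerate(quarterly_rates):
        start = quarter * QUARTER_DAYS
        daily_rate = rate / QUARTER_DAYS
        # Exponential gaps between moves give a Poisson number of moves
        # in the quarter.
        day = start + rng.expovariate(daily_rate)
        while day < start + QUARTER_DAYS:
            # The move waits for the next scheduled detection event; after
            # the last one it waits for the first event of the next year.
            event = next((d for d in detection_days if d >= day),
                         detection_days[0] + YEAR_DAYS)
            service = service_time(rng)
            log.append({
                "id": len(log) + 1,
                "move_day": day,
                "queue_time": event - day,
                "service_time": service,
                "corrected_day": event + service,
            })
            day += rng.expovariate(daily_rate)
    return log


# Hypothetical inputs: quarterly move rates, three vendor runs placed
# roughly in mid-January, early June, and early October, and a uniform
# 3-7 day vendor turnaround.
rates = [900, 1300, 1500, 1100]
vendor_runs = [15, 155, 275]
runs = [simulate_year(rates, vendor_runs, lambda r: r.uniform(3, 7), seed)
        for seed in range(10)]
```

Each element of `runs` is one replication with its own seed, mirroring the practice of repeating every experiment at least ten times.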
At the end of the simulated year, we ran a second program that analyzed the address-change log information. This program counted the total number of addresses that were considered out-of-date during all or part of each month under consideration. We used this information to calculate L, the expected number of out-of-date addresses in the database each month. The program also calculated the time the out-of-date addresses existed. We used this information to calculate W, the expected time for out-of-date addresses to exist in the database over the course of a year. We also kept track of the expected size of the alumni population, Lr, so we could calculate the mean monthly estimated out-of-date address percentage rate, (100 x L/Lr).
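A small Python sketch of this log analysis is given below. The function names, the 30-day month, and the three-entry log are illustrative assumptions; the log is hand-made so the counts can be checked by eye and is not study data.

```python
MONTH_DAYS = 30


def monthly_counts(log, months=12):
    """Count addresses out of date during all or part of each month."""
    counts = []
    for m in range(months):
        start, end = m * MONTH_DAYS, (m + 1) * MONTH_DAYS
        counts.append(sum(1 for move, fixed in log
                          if move < end and fixed > start))
    return counts


def mean_error_life(log):
    """Mean days between a move and the correction of the address (W)."""
    return sum(fixed - move for move, fixed in log) / len(log)


def out_of_date_rate(count, population):
    """Out-of-date address percentage, 100 x L / Lr."""
    return 100.0 * count / population


# Tiny hand-made log of (move_day, corrected_day) pairs.
log = [(10, 50), (40, 45), (100, 130)]
counts = monthly_counts(log)
w = mean_error_life(log)
rate_month_2 = out_of_date_rate(counts[1], population=50)
```

In this toy log the first address is out of date in months 1 and 2, the second only in month 2, and the third in months 4 and 5, so an address that straddles a month boundary is counted in both months.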
Effect of Varying Error Detection and Correction Strategies
We designed our first set of simulation runs to investigate the effect of changing the detection strategy used to detect out-of-date addresses over one time period. We simulated seven detection strategies, which differed by the type and timing of the feedback used to detect erroneous address information. Our results for the mean out-of-date address percentage rates and survival times for the seven different detection strategies are summarized in Table 3.
Table 3. Out-of-Date Address Rates Under Different Detection Strategies
| GPSS/H Estimates of Queuing Characteristics | Strategy 1 | Strategy 2 | Strategy 3 | Strategy 4 | Strategy 5 | Strategy 6 | Strategy 7 |
| Monthly Mean Out-of-Date Address Rate (L/Lr) | 4.43% | 6.77% | 4.84% | 4.44% | 5.10% | 5.04% | 4.40% |
| Mean Out-of-Date Address Life (W) in Days for the Year | 60.4 | 107.5 | 68.3 | 56.0 | 69.1 | 68.3 | 69.7 |
Detection Strategy 1: The Current System
Under the current system (i.e., the system used during the 1994-95 study
year), the alumni office schedules vendor change-of-address checks in
mid-January, early June, and early October. Mass mailings are sent out in
mid-March, mid-June, mid-October, and early December.
Detection Strategy 2: Mass Mailing Only
If the alumni office did not use an address-check vendor, its update
schedule would be based solely on the postal returns from the mass mailings.
Detection Strategy 3: Vendor (Version 1)
If the alumni office decided to no longer rely on postal returns from mass
mailings and used the address-check vendor exclusively, its update schedule
would be based on runs in mid-January, early June, and early October.
Detection Strategy 4: Vendor (Version 2)
If the alumni office decided to schedule vendor runs four times a year, they
might try a schedule with address-check runs at the beginning of January,
April, July, and October.
Detection Strategy 5: Vendor (Version 3)
Another possible schedule for vendor runs would be to make the runs at the
beginning of May, September, and January.
Detection Strategy 6: Vendor (Version 4)
Another revised schedule for vendor runs would be to make the runs at
the beginning of January, June, and September.
Detection Strategy 7: Vendor (Version 5)
One more revised vendor run schedule would involve making runs in mid-February,
mid-June, and mid-October.
Other Possible Detection Strategies
Although we only modeled seven detection strategies, the possibilities are endless for incorporating any desired detection strategy into a simulation. But one drawback of simulation is that while it helps to identify the best strategy among a given set of choices, it will not identify an optimal detection strategy or schedule beyond the given set.
Detection Strategy Performance
When comparing other detection strategies to strategy 1 (the strategy used during the 1994-95 research period), we observed that dropping vendor runs (detection strategy 2) would increase the average monthly out-of-date address rate by 53% and increase the time an out-of-date address remains in the database by 78%. This occurs because it takes much longer to detect and correct a changed address using mass mailings than it takes when a vendor address-change service is used. Using address-change vendor services four times per year (detection strategy 4) would give approximately the same average monthly out-of-date address rate, while reducing the expected out-of-date address time in the database by 7%. If the alumni office used "vendor only" strategies (detection strategies 3, 5, 6, or 7), we observed several results, which depended upon when the vendor runs were scheduled. Strategies 3, 5, and 6 caused 9% to 15% increases in the average monthly out-of-date address rates, and approximately a 14% increase in the expected out-of-date address life in the database when compared to detection strategy 1. Only detection strategy 7 gave a slight decrease in the average monthly out-of-date address rate, while its expected out-of-date address life was 15% higher than observed using detection strategy 1.
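The percentage comparisons quoted here follow directly from the Table 3 values. The Python sketch below recomputes them against strategy 1; the function and variable names are illustrative.

```python
# Values copied from Table 3: (mean monthly out-of-date rate in percent,
# mean out-of-date address life in days) for each detection strategy.
TABLE_3 = {
    1: (4.43, 60.4), 2: (6.77, 107.5), 3: (4.84, 68.3), 4: (4.44, 56.0),
    5: (5.10, 69.1), 6: (5.04, 68.3), 7: (4.40, 69.7),
}


def percent_change(value, baseline):
    """Relative change of value against baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


base_rate, base_life = TABLE_3[1]
changes = {
    s: (percent_change(rate, base_rate), percent_change(life, base_life))
    for s, (rate, life) in TABLE_3.items()
}
for s, (d_rate, d_life) in sorted(changes.items()):
    print(f"strategy {s}: rate {d_rate:+.0f}%, life {d_life:+.0f}%")
```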
These results suggest that it is possible to use address-change vendor services only three times per year and still achieve an out-of-date address rate similar to that of either the current system or "vendor only" services used four times per year. The best approach among those tested is to select vendor run times that complement the peak move times. Therefore, if one's error rate varies throughout the year (as is typical of address changes), one should consider an irregular update schedule to minimize the monthly error rate.
Figure 2. Average Monthly Out-of-Date Address Pattern for Strategy 7
Figure 2 depicts the simulated average monthly out-of-date address rates for detection strategy 7. It shows that the percentage of out-of-date addresses peaks just before a vendor run begins and then falls off dramatically. By not updating out-of-date addresses during periods with low move rates, the alumni office can concentrate its vendor runs in peak move periods. In effect, the alumni office can keep the overall mean out-of-date address percentage low over a longer period of time.
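The sawtooth pattern described above can be illustrated with a small deterministic sketch. This is not the GPSS/H model used in the study: it tracks only expected values, works at monthly resolution, and assumes that each vendor run corrects every outstanding out-of-date address. The monthly move counts in `MOVES_PER_MONTH` and the function names are hypothetical, chosen only to show a summer peak.

```python
# Hypothetical expected number of newly out-of-date addresses per month
# (January..December), peaking in summer. Illustrative values only.
MOVES_PER_MONTH = [60, 60, 70, 80, 120, 150, 150, 160, 130, 80, 60, 60]


def monthly_backlog(moves_per_month, run_months, years=3):
    """Return the expected backlog for each month of the final simulated year.

    run_months holds 0-based month indices (0 = January). Simplifying
    assumption: a vendor run corrects every outstanding out-of-date address,
    and the level is recorded just before that correction is applied.
    """
    backlog = 0.0
    history = []
    for t in range(12 * years):
        month = t % 12
        backlog += moves_per_month[month]   # new out-of-date addresses arrive
        history.append(backlog)             # level just before any vendor run
        if month in run_months:
            backlog = 0.0                   # vendor run clears the backlog
    return history[-12:]                    # drop the warm-up years


def mean(values):
    return sum(values) / len(values)


# Runs in February, June, and October (monthly approximation of strategy 7).
strategy_7_like = monthly_backlog(MOVES_PER_MONTH, {1, 5, 9})
# No vendor runs at all: the backlog only grows.
no_runs = monthly_backlog(MOVES_PER_MONTH, set())

print([round(x) for x in strategy_7_like])
print(mean(strategy_7_like) < mean(no_runs))
```

Changing the set of run months makes it possible to compare alternative schedules under the same assumed move profile.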
It is important to note that minimizing the overall month-to-month out-of-date address rate may not necessarily be the best strategy to follow. A continuously used database may require a low month-to-month out-of-date address rate, while a database that is used only periodically may operate effectively with a high out-of-date address rate, provided that the corrections are applied before the database is used.
Effect of Varying Error Arrival Rates
Another parameter of the queuing model that we can change is the error arrival rate. For out-of-date addresses, it is the move rate of the alumni population that dictates how quickly address information becomes obsolete. Younger alumni move more frequently than older alumni. As the alumni population grows and ages, the number of alumni in various age groups changes and, as a consequence, the address move rate in the alumni database changes as well. To calculate the changes over time, we assumed that the arrival pattern for new alumni remained stable over the time period being considered.
Based on our assumptions about the changing move rate and alumni population growth, we calculated the out-of-date address rates for 1, 5, 10, 15, 20, and 30 years using detection strategy 7. Our results are summarized in Table 4.
Table 4. Out-of-Date Address Rates Over Time
GPSS/H Estimates of Queuing Characteristics Over Time (columns give elapsed time in years)
| Statistics | 1 | 5 | 10 | 15 | 20 | 30 |
| --- | --- | --- | --- | --- | --- | --- |
| Monthly Mean Out-of-Date Address Number (L) | 1192 | 1267 | 1341 | 1396 | 1436 | 1610 |
| Monthly Mean Out-of-Date Address Rate (L/Lr) | 4.398% | 4.254% | 4.081% | 3.926% | 3.795% | 3.548% |
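As a quick consistency check on Table 4, the two rows can be combined to back out the denominator of the rate. The sketch below assumes that Lr denotes the total number of address records in the database, which the ratio L/Lr suggests but which this section does not state explicitly. Under that assumption, the implied record count grows from about 27,100 at year 1 to about 45,400 at year 30, which is consistent with a rising number of out-of-date addresses alongside a falling rate.

```python
# Values copied from Table 4.
years = [1, 5, 10, 15, 20, 30]
mean_out_of_date = [1192, 1267, 1341, 1396, 1436, 1610]      # L
rate_percent = [4.398, 4.254, 4.081, 3.926, 3.795, 3.548]    # L/Lr, in percent

# Assumption: Lr is the total number of address records, so Lr = L / (L/Lr).
implied_records = {
    year: count / (pct / 100.0)
    for year, count, pct in zip(years, mean_out_of_date, rate_percent)
}

for year, lr in implied_records.items():
    print(f"year {year:>2}: implied Lr is about {lr:,.0f}")
```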
An analysis of the simulation runs showed that although the number of movers gradually increases over time, the percentage of movers in the alumni population decreases. Currently the largest alumni classes are the most recent. Because younger alumni move more than older alumni, movers currently make up a larger percentage of the alumni population. Assuming that future class sizes are not dramatically larger than present classes, the population of older alumni will increase, despite their increased mortality rate. This growth in the part of the alumni population that moves infrequently will cause the overall percentage of out-of-date addresses to decrease over time, despite the growing number of movers in the alumni population.
For these simulations, we assume that the Poisson distribution used to describe the arrival of address changes holds, but that the Poisson rate parameter changes over time due to demographic shifts in the alumni population. By simulating the alumni population's aging (using mortality data and then using this new demographic data in the move-rate regression equation) we produced a new move rate. Our results depended on the mortality statistics and our assumptions about the shape of the move-rate distribution.
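The demographic effect described above can be shown with a weighted-average calculation. The following is only a sketch: the three age groups, their sizes, and their annual move rates are hypothetical and are not taken from the study's move-rate regression or mortality data. It illustrates how the expected number of movers (which sets the Poisson rate parameter) can rise while the overall move rate falls as older, less mobile groups grow.

```python
def aggregate_moves(group_sizes, move_rates):
    """Return (expected movers per year, overall move rate) for a population.

    group_sizes and move_rates are parallel lists, one entry per age group.
    """
    movers = sum(n * r for n, r in zip(group_sizes, move_rates))
    return movers, movers / sum(group_sizes)


# Hypothetical annual move rates for young, middle-aged, and older alumni.
MOVE_RATES = [0.30, 0.15, 0.05]

# Hypothetical group sizes now and after the population has grown and aged.
today = [10_000, 8_000, 6_000]
later = [10_500, 12_000, 14_000]

movers_now, rate_now = aggregate_moves(today, MOVE_RATES)
movers_later, rate_later = aggregate_moves(later, MOVE_RATES)

print(f"now:   {movers_now:.0f} movers, {rate_now:.1%} of alumni")
print(f"later: {movers_later:.0f} movers, {rate_later:.1%} of alumni")
```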
Effect of Varying Assumptions About the Database
We explored what happens when the simulations allow a certain percentage of address information to remain out-of-date because correct information is not available. Using 1994-95 data, it appeared that about 5.5% of the moves resulted in out-of-date addresses that were detected but could not be updated, because the correct information needed to complete the update was lacking. We assumed this was a typical year for "losing" alumni and, taking into consideration the growth of the alumni population and changes in the move rate over time, we created a program to simulate the number of "lost" alumni. These lost alumni usually remain lost in the alumni database until a special effort is made to find them. Since special searches can be expensive and time consuming, the alumni office usually waits until the number of lost alumni reaches a certain level before conducting these searches.
Therefore, we designed these simulations to anticipate how quickly the number of lost alumni would grow under the assumption that each year the alumni office loses track of approximately 5.5% of alumni who move. The simulation results showed that it takes about six years for the tracer rate to reach about 5% of the total alumni population, assuming one started from a database where 100% of the alumni are "found," as depicted in Figure 3. Using this technique, it is possible to experiment with other yearly rates and initial levels of out-of-date addresses to get an idea of how fast alumni are being lost. This information may be useful in determining when to schedule alumni censuses or special searches to find missing alumni.
Figure 3. Growth of Lost Alumni Over Time
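The growth of lost alumni described above can be approximated with a simple recurrence. The sketch below is a deliberately reduced version of the idea, not the program used in the study: it ignores population growth and changes in the move rate, and the 16% annual move rate is a hypothetical value chosen only for illustration. Only the 5.5% share of movers who become lost comes from the text.

```python
def lost_fraction_by_year(annual_move_rate, lost_share_of_movers, years):
    """Return the cumulative fraction of lost alumni at the end of each year.

    Starts from a database in which every alumnus is "found". Each year a
    fraction of the still-found alumni moves, and a share of those movers
    cannot be traced. Population growth is ignored (simplifying assumption).
    """
    lost = 0.0
    fractions = []
    for _ in range(years):
        lost += (1.0 - lost) * annual_move_rate * lost_share_of_movers
        fractions.append(lost)
    return fractions


def years_to_reach(threshold, annual_move_rate, lost_share_of_movers,
                   max_years=100):
    """Return the first year in which the lost fraction reaches the threshold."""
    fractions = lost_fraction_by_year(annual_move_rate, lost_share_of_movers,
                                      max_years)
    for year, fraction in enumerate(fractions, start=1):
        if fraction >= threshold:
            return year
    return None  # threshold not reached within max_years


LOST_SHARE_OF_MOVERS = 0.055   # from the 1994-95 data discussed in the text
ANNUAL_MOVE_RATE = 0.16        # hypothetical, for illustration only

print(years_to_reach(0.05, ANNUAL_MOVE_RATE, LOST_SHARE_OF_MOVERS))
```

Varying the move rate, the lost share, or the threshold gives a rough sense of when a special search or alumni census might be worth scheduling.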
The author's research with the University of Michigan's alumni database demonstrates that the number of database errors can be simulated as entities in a queuing system. By measuring the effects of various elements on the error arrival rate and error survival time in the database, we can use simulation to model the existing database queuing system. We can then change parts of the simulation model. Our experiments gave alumni database managers valuable insights about the effectiveness of current detection and correction strategies and alternative strategies. Using a queuing system model also allows a data manager to estimate the impact of current error arrival rates on the future number of database errors. Finally, a data manager can use the queuing system model described in this paper to test various assumptions about the behavior of variables and parameters that describe the database.
1. T.C. Redman, Data Quality: Management and Technology (New York: Bantam Books, 1992).
2. W.M. Bulkeley, "Databases Are Plagued by Reign of Error," The Wall Street Journal, 26 May 1992, B2.
3. B. Knight, "The Data Pollution Problem," ComputerWorld, 28 September 1992, 81-84.
4. T.C. Redman, "The Notion of Data and its Quality Dimensions," Information Processing and Management, Vol. 30 no. 1 (January, 1994): 9-19.
5. A.E. Gelfand and A.F.M. Smith, "Sampling-Based Approaches to Calculating Marginal Densities," Journal of the American Statistical Association, Vol. 85 (1990): 398-409.
6. S. Geman and D. Geman, "Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 6 (1984): 721-741.
7. W.K. Hastings, "Monte Carlo Sampling Methods Using Markov Chains and Their Applications," Biometrika, Vol. 57 (1970): 97-109.
8. A.F.M. Smith and G.O. Roberts, "Bayesian Computation Via the Gibbs Sampler and Related Markov Chain Monte Carlo Methods," Journal of the Royal Statistical Society Series B, Vol. 55 (1993): 3-23.
Comments: dqemail@aol.com (10/03/98)