Data Quality, September 1999
Volume 5, Number 1. Copyright 1999


Data Errors in Neural Network and Linear Regression Models: An Experimental Comparison

Barbara D. Klein and Donald F. Rossin, University of Michigan-Dearborn, School of Management

Key Words:  Neural Networks; Linear Regression

Abstract:

Neural network and linear regression models are used in a wide variety of business domains. Typically, data used to construct and use these models are assumed to be free of errors. Many organizations use databases that contain errors, and these errors may have a significant effect on the predictive accuracy of the models. However, practitioners working with data containing errors have little basis for selecting an appropriate type of forecasting tool because little is known about the effect of data errors on predictions made by models such as neural networks and linear regression.

This paper uses a real-world example, the prediction of the net asset values of mutual funds, to investigate the effect of data errors on neural network and linear regression models. We report the results of two experiments. The experiments show that the error rate and magnitude of error in data used in both model prediction (test data) and model construction (training data) significantly affect the predictive accuracy of neural network and linear regression models. We found linear regression models to be more severely affected by errors in test data and neural network models to be more severely affected by errors in training data. Our findings have implications for practitioners seeking to understand the comparative sensitivity of linear regression and neural network models to errors in test and training data.



1.  Introduction

There is strong evidence (e.g., Laudon, 1986; Morey, 1982; Redman, 1992, 1995, 1996) that data stored in organizational databases have a significant rate of errors. Several researchers have investigated the effect of data errors on the outputs of computer-based models (e.g., Ballou and Pazer, 1985; Ballou et al., 1987; Bansal et al., 1993). Our investigation builds on this prior research by comparing the effect of data quality on neural network and linear regression models. We use a financial application in our research.

Data errors may affect the predictive accuracy of models in two ways. First, the training data used to build a model may contain errors. Second, even if training data are free of errors, once a model is used for forecasting a user may input test data containing errors to the model.

In general, when claims about the predictive accuracy of neural network and linear regression models are made, it is assumed that data used to train the models and data input to make predictions are free of errors. In this study we relax this assumption by asking two questions: (1) What is the effect of errors in test data on predictions made using neural network and linear regression models? and (2) What is the effect of errors in training data on predictions made using neural network and linear regression models? The first question examines the effect of data errors when a model is used for forecasting. The second question examines the effect of data errors during model construction.

We believe an understanding of the effect of data errors on neural network and linear regression models is particularly important because the availability of inexpensive software packages for personal computers makes the development and use of these models by end-users feasible. Researchers have argued that end-user computing has increased the potential for data errors in computer applications (Boockholdt, 1989). As end users develop applications, it is possible that fewer data validation methods such as logic tests and control totals will be in place and it is likely that less rigorous testing will occur before applications are used in production (Corman, 1988; Davis, 1984; Davis et al., 1983; Panko, 1998).

The remaining sections of this paper present (1) a review of relevant prior research on data quality, (2) a brief explanation of neural network and linear regression models, (3) a description of the neural network and linear regression models constructed in the study, (4) a discussion of the methodology of two experiments, (5) the results of two experiments, and (6) conclusions.


2. Background

Data quality is generally recognized as a multidimensional concept (Wand and Wang, 1996; Wang and Strong, 1996). While no single definition of data quality has been accepted by researchers working in this area, there is agreement that data accuracy, currency, completeness, and consistency are important areas of concern (Agmon and Ahituv, 1987; Ballou and Pazer, 1985; Davis and Olson, 1985; Fox et al., 1993; Huh et al., 1990; Madnick and Wang, 1992; Wand and Wang, 1996; Wang and Strong, 1996; Zmud, 1978). This investigation adopts the ideas of Ballou and Pazer (1985) which include four dimensions: accuracy, timeliness, completeness, and consistency. This study is primarily concerned with data accuracy, defined as conformity between a recorded data value and the corresponding actual data value.

Prior research determined that organizational databases are not, in general, free of errors. Between one and twenty percent of data items in critical organizational databases are estimated to be inaccurate (e.g., Laudon, 1986; Morey, 1982; Redman, 1992, 1995).

Data quality problems have been found to affect the accuracy and timeliness of economic data published by the United States government (Hershey, 1995; Morgenstern, 1963). Both Standard & Poor's Compustat® (with its Price Earnings Dividend tape) and the Center for Research in Security Prices (with its monthly stock return CRSP tape) sell databases containing monthly price information. Two studies (Bennin, 1980; Rosenberg and Houglet, 1974) found that large errors are possible in each database. Inaccurate data have also been reported in a student loan database maintained by the U.S. Department of Education (Knight, 1992), in records maintained by the U.S. Department of Agriculture (Dead farmer, 1992), and in records maintained by credit reporting bureaus (Consumer enemy, 1991).

Data errors are acknowledged to be a significant problem by at least some information system managers. In a survey of fifty Chief Information Officers of large organizations, half were found to believe that the usefulness of their organization's data is limited because of data accuracy problems (Nayar, 1993). Knight (1992) reports the findings of a study in which two-thirds of surveyed organizations acknowledged problems stemming from inaccurate or incomplete data.

Several studies have investigated the effect of data errors on the outputs of computer-based models. Bansal et al. (1993) studied the effect of errors in test data on predictions made by neural network and linear regression models. Ballou and his colleagues have conducted a stream of research on the effect of data errors on information system outputs (Ballou and Pazer, 1985, 1995; Ballou et al., 1987; Ballou and Tayi, 1989; Ballou et al., 1998). O’Leary (1993) investigated the effect of data errors in the context of a rule-based artificial intelligence system. Each of these studies is discussed in turn below.

Bansal et al. (1993) compared the effect of errors in test data on linear regression and neural network models. Their research used models to predict the prepayment rate of mortgage-backed security portfolios with a training data set that was free of errors. They then constructed test data sets containing data errors to evaluate the sensitivity of these models to data errors. The size of the data errors (5 percent, 10 percent, 15 percent, and 20 percent) and the fraction of the data set containing errors (4 percent, 8 percent, and 12 percent) were manipulated. The linear regression and neural network forecasts were evaluated using two metrics: (1) R² as a measure of predictive accuracy and (2) a payoff measure designed to capture the value of an accurate forecast to a portfolio manager. Error size had a statistically significant effect on predictive accuracy for both the linear regression and the neural network models, and on the measure of payoff for linear regression. The fraction of the data set containing errors had a statistically significant effect on predictive accuracy and the payoff measure for linear regression but had no effect on either metric for the neural network model. They concluded that the neural network model is more robust than the linear regression model as data quality decreases.

Ballou and Pazer (1985) presented a model for analyzing the effect of data errors on the outputs of information systems. The objective of this model was to understand the way in which data errors are magnified or dampened as data are manipulated in an information system. The model was applied to an analysis of the impact of data errors in a spreadsheet model (Ballou et al., 1987). The researchers examined the problem of the selection of an appropriate forecasting model. Four variables were forecasted using ten different historical data sets containing errors. For the four variables, data errors were found to affect the selection of a forecasting model for at least six and for as many as nine of the ten historical data sets.

In other studies, Ballou and his colleagues have examined the allocation of resources to data quality improvement projects (Ballou and Tayi, 1989), developed a framework for analyzing tradeoffs between the accuracy and timeliness dimensions of data quality (Ballou and Pazer, 1995), and developed a framework applying total quality management to the measurement of data quality (Ballou et al., 1998).

O’Leary (1993) presented a general methodology for analyzing the impact of data accuracy on the performance of an artificial intelligence system designed to generate rules from data stored in a database. The methodology can be applied to artificial intelligence systems that analyze data and generate a set of rules of the form "if x then y." It is often assumed that a subset of the generated rules is added to the system's rule base on the basis of a measure of the "goodness" of each rule. O'Leary showed that data errors can affect the subset of rules that are added to the rule base and that inappropriate rules may be retained while useful rules are discarded if data accuracy is ignored.


3.0 Neural Network And Linear Regression Models

Both neural network and linear regression models are widely used as forecasting tools. We will briefly discuss each, and discuss financial applications research using both models.

3.1 Neural Network Models

A neural network is a type of model that can be used to predict continuously-valued outputs or to classify observations. Neural network models have been applied to a variety of problem domains such as the prediction of graduate student success (Hardgrave et al., 1994), the prediction of bank failure (Tam and Kiang, 1992), the analysis of product quality in refineries (Wadi, 1996), and the forecasting of extended warranty claims (Wasserman and Sudjianto, 1996). Typically one has a neural network learn about a problem by training it with examples. Training algorithms search for a set of weights that offer the best fit with the given examples. Once trained, a network can be used to make predictions.

Although several architectures for neural networks have been developed, the scope of this study is limited to the back propagation feed forward neural network architecture. This is the most popular type of architecture among researchers and users of neural network models (Jain and Mao, 1996).

Figure 1 shows a visual representation of a typical back propagation neural network. It has three layers: an input layer, which receives information from the environment; a hidden layer; and an output layer, which transmits a response back to the environment. Connections indicate where information flows between processing elements. In this network, inputs are processed through a hidden layer to an output layer. The basic objective of back propagation is to minimize the mean squared error between the actual output and the desired output as specified in the training set.

Updating of a back propagation neural network consists of two phases, a forward phase and a backward phase. During the forward phase, an input, in the sense of paired values for A_in and B_in, is presented and propagated forward through the network to compute an output value for each processing element (PE). This is accomplished by summing the products of the weights associated with the connections into a particular PE and the outputs carried on those connections.

Using a linear output or activation function for F_out allows the output to take on any value. Sometimes it is desired that the output range be between 0 and 1. In this case, a sigmoid or a sine function is used.

The backward phase adjusts the weights associated with the nodes. Starting at the output node, where the error measure of desired output minus actual output is readily available, the error measure is propagated back through the layers toward the input node. More detailed information can be found in summary papers (e.g., Masson and Wang, 1990; Wang and Malakooti, 1992; Zahedi, 1991).
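In symbols, the two phases can be sketched as follows. The notation is assumed here for illustration and is not taken from the original: $o_i$ is the output of PE $i$, $w_{ij}$ is the weight on the connection from PE $i$ to PE $j$, $d_n$ and $o_n$ are the desired and actual network outputs for training example $n$, and $\eta$ is a learning rate.

```latex
\begin{align*}
o_j &= F_{out}\Big(\sum_i w_{ij}\, o_i\Big)
  && \text{(forward phase: weighted sum passed through the activation function)}\\
E &= \frac{1}{N}\sum_{n=1}^{N} \big(d_n - o_n\big)^2
  && \text{(mean squared error over the $N$ training examples)}\\
\Delta w_{ij} &= -\,\eta\, \frac{\partial E}{\partial w_{ij}}
  && \text{(backward phase: gradient-descent weight adjustment)}
\end{align*}
```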

Neural networks are used by both academics and practitioners in the area of financial analysis. Chiang et al. (1996) developed neural network and linear regression models to forecast the net asset value of mutual funds and found that neural network models perform better than linear regression models. Yoon et al. (1993) compared the performance of back propagation neural network models and discriminant analysis for predicting the performance of stocks and found that neural network models made more accurate classifications than discriminant analysis. Schoneburg (1990) developed neural network models to predict daily stock prices for three German stocks. Jain and Nag (1995) applied a neural network to the problem of pricing initial public offerings. Neural network based software to make predictions about financial instruments is also commercially available (e.g., Neural Applications, 1997; Trendy Systems, 1997).

3.2 Linear Regression Models

Linear regression is a statistical tool for modeling the relationship between a dependent variable and one or more independent variables. Linear regression models the dependent variable as a linear function of one or more independent variables as shown in the equation below.
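In standard notation (the symbols are assumed here, not taken from the original), a linear regression model with $k$ independent variables can be written as:

```latex
\[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon
\]
```

Here $Y$ is the dependent variable, $X_1, \ldots, X_k$ are the independent variables, $\beta_0, \ldots, \beta_k$ are the parameters to be estimated, and $\varepsilon$ is a random error term.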

The parameters of the linear regression model are typically estimated using the least-squares method which results in a line that minimizes the sum of squared vertical distances from the observed data points to the line (Lewis-Beck, 1980; Neter et al., 1990).

Practitioners employ linear regression as a forecasting tool in tasks such as sales prediction (Mentzer and Cox, 1984; Sanders, 1994). Linear regression is also a recognized forecasting tool for financial applications (Bansal et al., 1993; Chiang et al., 1996; Cole, 1994; Jabbour, 1994; Jankus, 1997; Mark, 1995).


4.0 Model Construction

Most applications of neural network and linear regression models assume that all data used to construct the model and all data input to the model for forecasting are accurate. The remainder of the paper presents an investigation into the performance of these models when this assumption is relaxed.

The application for study in this paper is the prediction of prices or net asset values (NAV) of mutual funds. Mutual funds consist of diversified portfolios of stocks that are managed by professionally trained individuals. They have become a major investment vehicle for Americans. We selected the prediction of NAV of mutual funds as the application domain for examining our research questions for two reasons. First, prior research shows that NAV can be predicted with a reasonable level of error (Chiang et al., 1996). Since the objective of our study is to compare NAV predictions made with data containing no errors to NAV predictions made when data are corrupted, the most important criterion for selecting a sample application domain is that predictions made with input data that are free of errors are reasonably good. The prediction of NAV meets this test.

Second, prior research provides insights about a set of relevant input variables that predict the NAV of mutual funds (Chiang et al., 1996). Recent studies (Balvers et al., 1990; Breen et al., 1990; Campbell, 1987; Cochrane, 1991; Fama and French, 1989; Ferson and Harvey, 1993; French et al., 1987; Glosten et al., 1993; Pesaran and Timmermann, 1994, 1995) show that economic variables can be used to predict stock returns. Because mutual funds are simply groupings of stocks, prices or net asset values (NAV) of mutual funds should reflect known economic information. Economic variables such as gross national product and the consumer price index have been used as exogenous variables in prior research on the prediction of the NAV of mutual funds (Chiang et al., 1996).

The remainder of this section explains how the neural network and linear regression models used for forecasting the NAV of mutual funds in this study were built.

4.1 Data Collection

To start the construction of the neural network and linear regression models for predicting the net asset value for a mutual fund, 14 economic variables were identified as input. They are specified and defined in Figure 2. A 10-year economic data set (1986-1995) was constructed (Statistical Abstract, 1996). In addition, end-of-year net asset values for 213 U.S. mutual funds were obtained (Individual Investor’s Guide, 1997). The criterion for inclusion was having historical net asset value figures back to 1987.

Figure 2: Potential Independent Variables

| Name | Description |
|---|---|
| GDP | Gross Domestic Product (in billions of dollars). Output attributable to all labor and property supplied by United States residents. |
| CD* | Consumption Demand (in billions of dollars). Personal consumption expenditures. |
| ID | Investment Demand (in billions of dollars). Investment spending by firms. Excludes residential investments. |
| GD* | Government Demand (in billions of dollars). U.S. government spending. Includes consumption expenditures and gross investment. |
| NEX | Net Exports (in billions of dollars). Net exports of goods and services. |
| CPI* | Consumer Price Index. Measure of the average change in prices over time in a fixed market basket of goods and services. 1982-84 = 100. |
| M1* | Money, M1 (in billions of dollars). Includes currency in the hands of the nonbank public, travelers checks, demand deposits, and other checkable deposits. |
| M2 | Money, M2 (in billions of dollars). Includes M1 plus money market funds, savings deposits, and small time deposits. |
| UR | Unemployment Rate. Percent of the labor force unemployed. |
| TBR | Treasury Bill Rate. Interest rate for 3-month Treasury bill. |
| FFR | Federal Funds Rate. |
| CILEAD | Composite Index - Leading Indicators. 1987 = 100. |
| CICOIN | Composite Index - Coincident Indicators. 1987 = 100. |
| CILAG | Composite Index - Lagging Indicators. 1987 = 100. |

Note: Asterisk indicates selection for model development.

Because the purpose of this research is to study the effect of data quality on neural network and linear regression forecasting, we decided to limit the number of input variables to a manageable number. Since it is impossible to directly determine the relevant input variables with neural networks (Bansal et al., 1993), stepwise linear regression was conducted for the 213 mutual funds to limit the number of input variables. A 5 percent significance level (the SPSS default) was used to bring variables into the models. Four input variables were chosen based on the number of times each had been selected in the regression step. These variables are identified by an asterisk in Figure 2. In addition, it was decided to limit the number of mutual funds to 10 per fund type. Fund type definitions are per the Individual Investor’s Guide (1997). The 40 randomly chosen funds are indicated in Figure 3.

Figure 3: Randomly Chosen Mutual Funds

| Aggressive Growth (out of 64 possible) | Growth (out of 80 possible) |
|---|---|
| Fairmont | Fidelity Capital Appreciation |
| Fidelity Sel Air Transportation | Fiduciary Capital Growth |
| Fidelity Sel Automotive | Founders Growth |
| Fidelity Sel Brokerage & Investment | Janus Fund |
| Fidelity Sel Computers | Mathers |
| Fidelity Sel Leisure | Meridian |
| Fidelity Sel Software & Computer | Schwartz Value |
| INVESCO Dynamics | Scudder Equity Trust: Capital Growth |
| Kaufmann | Sound Shore |
| USAA Aggressive Growth | Vanguard/Morgan Growth |

| Balanced (out of 24 possible) | Growth & Income (out of 45 possible) |
|---|---|
| Dodge & Cox Balanced | AARP Growth & Income |
| Fidelity Puritan | Berger Growth & Income |
| Founders Balanced | Dreyfus Third Century |
| Greenspring | Fidelity Sel Utilities Growth |
| INVESCO Industrial Income | IAI Growth & Income |
| Northeast Investors Trust | INVESCO Value: Value Equity |
| SAFECO Income | SAFECO Equity |
| Strong Asset Allocation | Stratton Monthly Dividend Shares |
| USAA Income | Strong Total Return |
| Value Line Income | T. Rowe Price Growth & Income |

4.2 Construction of Neural Network Models

When constructing neural network models, we used the first nine years of data (the training set). We used data from a tenth year (the testing set) to develop the NAV forecast for a specified mutual fund.

We made various additional parameter value decisions using a combination of trial and error and experience. For example, we decided to have one hidden layer with six nodes and one output node when predicting NAV for a particular mutual fund for 1996. This is in addition to the four input nodes, which were used earlier. Figure 4 gives a simplified schematic representation of the neural network.

We also chose a learning rate of 0.10, a momentum rate of 0.10, and 0.30 for initial weights. We chose an activation function of hyperbolic tangent for the hidden layer and linear for the output node based on software advice. We set stopping rules to allow the neural network enough time to make significant adjustments. The maximum number of learning epochs was 10,000. (A learning epoch indicates the network going through the nine years of training data once.) Finally, better solutions were indicated by a decrease in the minimum average error. Decreases occur frequently early in training. Therefore, we decided that the procedure should be stopped if 1,000 epochs occurred without a change in minimum average error. All runs were conducted using NeuroShell 2 software (NeuroShell 2, 1996).
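The configuration described above can be sketched in plain Python. This is a minimal illustration, not the NeuroShell 2 implementation: the function name `train_network`, the interpretation of "0.30 for initial weights" as a uniform range of plus or minus 0.30, the use of average absolute error as the "average error", per-example weight updates, and the assumption that inputs and targets are already scaled are all assumptions made for this sketch.

```python
import math
import random


def train_network(inputs, targets, n_hidden=6, learning_rate=0.10, momentum=0.10,
                  init_weight=0.30, max_epochs=10_000, patience=1_000, seed=0):
    """Train a one-hidden-layer network (tanh hidden nodes, linear output node).

    inputs  -- list of equal-length lists of floats, assumed already scaled
    targets -- list of floats, assumed already scaled
    Returns (predict, best_error): a prediction function using the final
    weights, and the minimum average absolute training error observed.
    """
    rng = random.Random(seed)
    n_in = len(inputs[0])
    # Assumption: initial weights are drawn uniformly from [-0.30, +0.30].
    # The last weight in each row acts on a constant bias input of 1.0.
    w_hid = [[rng.uniform(-init_weight, init_weight) for _ in range(n_in + 1)]
             for _ in range(n_hidden)]
    w_out = [rng.uniform(-init_weight, init_weight) for _ in range(n_hidden + 1)]
    dw_hid = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
    dw_out = [0.0] * (n_hidden + 1)

    def forward(x):
        x_ext = list(x) + [1.0]
        hidden = [math.tanh(sum(w * v for w, v in zip(row, x_ext))) for row in w_hid]
        output = sum(w * h for w, h in zip(w_out, hidden + [1.0]))
        return x_ext, hidden, output

    best_error, stale_epochs = float("inf"), 0
    for _ in range(max_epochs):
        total_abs_error = 0.0
        for x, target in zip(inputs, targets):
            x_ext, hidden, output = forward(x)
            error = target - output  # desired output minus actual output
            total_abs_error += abs(error)
            # Backward phase: propagate the error to the hidden nodes
            # (tanh derivative is 1 - h^2) before changing any weights.
            deltas = [error * w_out[j] * (1.0 - hidden[j] ** 2) for j in range(n_hidden)]
            h_ext = hidden + [1.0]
            for j in range(n_hidden + 1):
                dw_out[j] = learning_rate * error * h_ext[j] + momentum * dw_out[j]
                w_out[j] += dw_out[j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    dw_hid[j][i] = (learning_rate * deltas[j] * x_ext[i]
                                    + momentum * dw_hid[j][i])
                    w_hid[j][i] += dw_hid[j][i]
        average_error = total_abs_error / len(inputs)
        if average_error < best_error:
            best_error, stale_epochs = average_error, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break  # no improvement in the minimum average error for `patience` epochs
    return (lambda x: forward(x)[2]), best_error
```

With four input variables and the defaults above, the function builds the 4-6-1 network described in the text; `max_epochs` and `patience` correspond to the two stopping rules.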

Again, we constructed a neural network for each of the 40 mutual funds using the 9 oldest years of the data for training. The 1995 test data was then input to the appropriate neural networks to predict a NAV value for each of the 40 mutual funds for the end-of-year 1996. Actual end-of-year 1996 NAV values and predicted end-of-year NAV values were compared using mean absolute percent error (MAPE) as a measure of accuracy. This formed the base case for the neural network model.
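The accuracy measure can be sketched as follows. The function name `mape` and the choice to express the result in percent are assumptions for illustration; the text does not spell out the exact formula used.

```python
def mape(actual, predicted):
    """Mean absolute percent error between two equal-length sequences, in percent."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and the same length")
    # Average the absolute error of each prediction relative to its actual value.
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)
```

In the base case, `actual` would hold the 40 actual end-of-year 1996 NAV values and `predicted` the 40 corresponding forecasts.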

4.3 Construction of Linear Regression Models

The data used to construct the linear regression models are identical to the data used to construct the neural network models. We constructed a separate linear regression model for each of the 40 mutual funds using the 9 oldest years of the data for training. The 1995 test data was then input to the appropriate linear regression model to predict a NAV value for each of the 40 mutual funds for the end-of-year 1996. Actual end-of-year 1996 NAV values and predicted end-of-year NAV values were compared using mean absolute percent error (MAPE) as a measure of accuracy. This comparison formed the base case for linear regression.


5.0 Experimental Methodology

We conducted two experiments to examine the research questions. Each experiment was conducted first for the neural network model and then for the linear regression model. Both experiments used the same task (the prediction of NAV of mutual funds), the same data set, and the same dependent variable. The experimental factors were the same in both experiments, although the levels of the factors were different.

Experiment 1 examined the first research question: What is the effect of errors in test data on predictions made using neural network and linear regression models? Experiment 2 examined the second research question: What is the effect of errors in training data on predictions made using neural network and linear regression models?

5.1 Experimental Data Set

A sample data set for one of the mutual funds used in both experiments is shown in Figure 5. The training data contains 36 data items (four economic variables in the columns by nine years in the rows). The test data contains four data items (four economic variables by one year).

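Figure 5. Example Base Data Set for Fairmont Mutual Fund

| Data Set | Year for Economic Variables | CD | GD | CPI | M1 | NAV for Fairmont | Year for NAV Variable |
|---|---|---|---|---|---|---|---|
| Training Data | 1986 | 2892.7 | 938.5 | 109.6 | 724 | 14.96 | 1987 |
| Training Data | 1987 | 3094.5 | 992.8 | 113.6 | 750 | 15.19 | 1988 |
| Training Data | 1988 | 3349.7 | 1032.0 | 118.3 | 787 | 16.02 | 1989 |
| Training Data | 1989 | 3594.8 | 1095.1 | 124.0 | 794 | 12.17 | 1990 |
| Training Data | 1990 | 3839.3 | 1176.1 | 130.7 | 826 | 17.02 | 1991 |
| Training Data | 1991 | 3975.1 | 1225.9 | 136.2 | 897 | 19.41 | 1992 |
| Training Data | 1992 | 4219.8 | 1263.8 | 140.3 | 1024 | 22.43 | 1993 |
| Training Data | 1993 | 4454.1 | 1289.9 | 144.5 | 1129 | 24.06 | 1994 |
| Training Data | 1994 | 4698.7 | 1314.7 | 148.2 | 1149 | 27.02 | 1995 |
| Test Data | 1995 | 4924.3 | 1358.5 | 152.4 | 1125 | 26.45 | 1996 |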
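Using the Figure 5 values, the base-case linear regression of Section 4.3 can be sketched as below. This is an illustrative least-squares fit, not the SPSS procedure used in the study: the helper `fit_ols`, its column scaling, and the normal-equations solver are assumptions chosen to keep the example free of external libraries.

```python
# Figure 5 rows: (CD, GD, CPI, M1) for year t, paired with Fairmont NAV for year t + 1.
ECONOMIC = [
    [2892.7, 938.5, 109.6, 724.0],    # 1986
    [3094.5, 992.8, 113.6, 750.0],    # 1987
    [3349.7, 1032.0, 118.3, 787.0],   # 1988
    [3594.8, 1095.1, 124.0, 794.0],   # 1989
    [3839.3, 1176.1, 130.7, 826.0],   # 1990
    [3975.1, 1225.9, 136.2, 897.0],   # 1991
    [4219.8, 1263.8, 140.3, 1024.0],  # 1992
    [4454.1, 1289.9, 144.5, 1129.0],  # 1993
    [4698.7, 1314.7, 148.2, 1149.0],  # 1994
]
NAV = [14.96, 15.19, 16.02, 12.17, 17.02, 19.41, 22.43, 24.06, 27.02]  # 1987-1995
TEST_ROW, TEST_NAV = [4924.3, 1358.5, 152.4, 1125.0], 26.45            # 1995 -> 1996


def fit_ols(rows, targets):
    """Least-squares fit with an intercept; returns a prediction function."""
    n, k = len(rows), len(rows[0])
    # Center and scale each column so the normal equations are better conditioned.
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    scales = [max(abs(r[j] - means[j]) for r in rows) or 1.0 for j in range(k)]

    def design(row):
        return [1.0] + [(row[j] - means[j]) / scales[j] for j in range(k)]

    X = [design(r) for r in rows]
    p = k + 1
    # Augmented normal equations [X'X | X'y].
    A = [[sum(x[a] * x[b] for x in X) for b in range(p)]
         + [sum(x[a] * y for x, y in zip(X, targets))] for a in range(p)]
    # Gaussian elimination with partial pivoting, then back substitution.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, p):
            factor = A[r][col] / A[col][col]
            for c in range(col, p + 1):
                A[r][c] -= factor * A[col][c]
    beta = [0.0] * p
    for r in reversed(range(p)):
        beta[r] = (A[r][p] - sum(A[r][c] * beta[c] for c in range(r + 1, p))) / A[r][r]
    return lambda row: sum(b * v for b, v in zip(beta, design(row)))


predict = fit_ols(ECONOMIC, NAV)            # nine training years, 36 data items
forecast = predict(TEST_ROW)                # 1995 economic values -> 1996 NAV
abs_pct_error = 100.0 * abs(TEST_NAV - forecast) / TEST_NAV
print(f"1996 NAV forecast: {forecast:.2f} (absolute percent error {abs_pct_error:.1f}%)")
```

Averaging the absolute percent errors of the 40 funds gives the base-case MAPE for linear regression.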

  

5.2 Experimental Factors

There are two factors in each experiment: (1) fraction-error and (2) amount-error. Fraction-error is the percentage of the data items in the appropriate part of the data set (the test data in experiment 1 and the training data in experiment 2) that are perturbed. Amount-error is the percentage by which the data items selected under the fraction-error factor are perturbed.

1. Fraction-error. Since fraction-error is defined as a percent of the data items in a data set, the number of data items that are changed for a given level of fraction-error is determined by multiplying the fraction-error by the total number of data items in the data set.
Experiment 1. The test data used in experiment 1 contained four data items (one value for each of the four economic variables for 1995). This experiment examines all of the possible data items that could be perturbed. The four levels for the fraction-error factor are: 25 percent (1 data item perturbed), 50 percent (2 data items perturbed), 75 percent (3 data items perturbed), and 100 percent (4 data items perturbed).
Experiment 2. The training data used in experiment 2 contained 36 data items (one value for each of the four economic variables for nine years). Four levels of the fraction-error factor were tested: 5 percent (2 data items perturbed), 10 percent (4 data items perturbed), 15 percent (5 data items perturbed), and 20 percent (7 data items perturbed).
2. Amount-error. For both experiments, the amount-error factor had two levels: (1) plus or minus 5 percent and (2) plus or minus 10 percent.
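The item counts listed above follow from multiplying each fraction-error level by the number of data items. The sketch below reproduces them; rounding to the nearest whole item is an assumption that matches the counts reported for experiment 2, and the function name `items_perturbed` is illustrative.

```python
def items_perturbed(fraction_error, n_items):
    """Number of data items to perturb at a given fraction-error level."""
    # Assumption: the product is rounded to the nearest whole data item.
    return round(fraction_error * n_items)


# Experiment 1: 4 test data items. Experiment 2: 36 training data items.
experiment_1 = [items_perturbed(f, 4) for f in (0.25, 0.50, 0.75, 1.00)]
experiment_2 = [items_perturbed(f, 36) for f in (0.05, 0.10, 0.15, 0.20)]
print(experiment_1)  # [1, 2, 3, 4]
print(experiment_2)  # [2, 4, 5, 7]
```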

5.3 Experimental Design

The experimental design is shown in Figure 6. We used the same experimental design for the neural network model and for the linear regression model. Both experiments have four levels for the fraction-error factor and two levels for the amount-error factor for each model type (neural network and linear regression). For each combination of fraction-error and amount-error, four runs with random combinations of economic variables were performed for each of the 40 randomly chosen mutual funds. This gives a total of 1,280 runs for neural networks and 1,280 runs for linear regression in each experiment.

Figure 6: Experimental Design (identical for the linear regression and neural network models)

| | Experiment 1 (Errors in Test Data) | Experiment 2 (Errors in Training Data) |
|---|---|---|
| Experimental factors: fraction-error levels | 4 (25%, 50%, 75%, 100%) | 4 (5%, 10%, 15%, 20%) |
| Experimental factors: amount-error levels | x 2 (5% and 10%) | x 2 (5% and 10%) |
| Sampling procedure: number of random combinations of economic variables considered within each fraction-error level | x 4 | x 4 |
| Sampling procedure: number of mutual funds | x 40 | x 40 |
| Total number of problems considered | = 1,280 | = 1,280 |
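The run count in the design can be checked by enumerating every cell. The sketch below uses the experiment 1 levels; variable names are illustrative.

```python
from itertools import product

fraction_levels = (0.25, 0.50, 0.75, 1.00)  # experiment 1; experiment 2 uses 0.05 to 0.20
amount_levels = (0.05, 0.10)
combinations = range(1, 5)                   # four random combinations per fraction-error level
funds = range(40)                            # 40 randomly chosen mutual funds

# One run per (fraction-error, amount-error, combination, fund) for each model type.
runs = list(product(fraction_levels, amount_levels, combinations, funds))
runs_per_cell = len(runs) // (len(fraction_levels) * len(amount_levels))

print(len(runs))       # 1280
print(runs_per_cell)   # 160
```

Each fraction-error and amount-error cell therefore averages over 160 runs per model type.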

 

Although the levels of the fraction-error factor are different in the two experiments, the sampling procedure is the same. For each fraction-error level, economic variables were randomly selected to be perturbed. This was repeated a total of four times per level. Figure 7 shows the results for experiment 1.

Figure 7: Four Combinations of Economic Variables for Each Fraction-Error Level in Experiment 1

| Fraction-Error Level | Combination 1 | Combination 2 | Combination 3 | Combination 4 |
|---|---|---|---|---|
| 25% | (CD) | (CPI) | (GD) | (M1) |
| 50% | (CD, GD) | (CD, M1) | (GD, CPI) | (GD, M1) |
| 75% | (CD, CPI, GD) | (CD, GD, M1) | (CD, GD, M1) | (CPI, GD, M1) |
| 100% | (CD, CPI, GD, M1) | (CD, CPI, GD, M1) | (CD, CPI, GD, M1) | (CD, CPI, GD, M1) |

 

Next, for each level of the amount-error factor, we randomly assigned each economic variable either a positive or negative sign to indicate the appropriate amount-error to be applied. Figure 8 shows the results for experiment 1. The procedure for experiment 2 differs only in the number of economic variables that were randomly selected to be perturbed for the four tested levels of the fraction-error factor.

Figure 8: Randomly Assigned Percentage Increase (+) Over Base Value or Decrease (-) for a Given Amount-Error Level in Experiment 1

| Fraction-Error Level | Combination 1 | Combination 2 | Combination 3 | Combination 4 |
|---|---|---|---|---|
| 25% | (CD): + | (CPI): - | (GD): + | (M1): + |
| 50% | (CD, GD): +, + | (CD, M1): -, + | (GD, CPI): -, + | (GD, M1): +, + |
| 75% | (CD, CPI, GD): +, -, - | (CD, GD, M1): -, -, - | (CD, GD, M1): +, +, - | (CPI, GD, M1): +, -, + |
| 100% | (CD, CPI, GD, M1): -, -, -, + | (CD, CPI, GD, M1): +, +, -, - | (CD, CPI, GD, M1): +, -, +, + | (CD, CPI, GD, M1): -, -, -, - |

The economic variables to be perturbed and the positive or negative change applied for amount-error were the same for the neural network and linear regression models.
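The perturbation step can be illustrated with the 1995 test values from Figure 5. The function name `perturb` and the dictionary layout are assumptions for this sketch; the example applies the 50 percent fraction-error level, combination 2 of Figure 8 (CD decreased, M1 increased), at the 10 percent amount-error level.

```python
BASE_1995 = {"CD": 4924.3, "GD": 1358.5, "CPI": 152.4, "M1": 1125.0}


def perturb(base, signs, amount_error):
    """Return a copy of `base` with each variable in `signs` raised or lowered.

    signs        -- maps a variable name to "+" (increase) or "-" (decrease)
    amount_error -- size of the change as a fraction of the base value
    """
    perturbed = dict(base)
    for name, sign in signs.items():
        if sign not in ("+", "-"):
            raise ValueError(f"sign for {name} must be '+' or '-'")
        factor = 1.0 + amount_error if sign == "+" else 1.0 - amount_error
        perturbed[name] = base[name] * factor
    return perturbed


# Two of the four test data items are perturbed; GD and CPI keep their base values.
test_with_errors = perturb(BASE_1995, {"CD": "-", "M1": "+"}, 0.10)
print(test_with_errors)
```

The perturbed values are then fed to the unmodified models in experiment 1, while in experiment 2 the same kind of change is applied to training data before the models are rebuilt.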

5.4 Dependent Variable

In both experiments, actual end-of-year 1996 NAV values and predicted end-of-year 1996 NAV values were compared using mean absolute percent error (MAPE) as a measure of accuracy.


6.0 Experimental Results

For both experiments, we present MAPE results for each combination of fraction-error and amount-error for neural networks and linear regression. We then discuss the results of ANOVA tests and independent samples t-tests conducted to test for the effect of fraction-error and amount-error on MAPE for each type of model. Finally, we report the findings of tests we performed to determine which combinations of fraction-error and amount-error are significantly different from the base case scenario with no data errors for each type of model. We discuss each result first for the neural network model and second for the linear regression model.

6.1 Experiment 1 Results: Errors in Test Data

6.1.1 Predictive Accuracy Results

Table 1 shows predictive accuracy results, using the simulated inaccuracies for amount-error and fraction-error for the NAV forecasts for 1996 for the neural network and linear regression models. These results reflect the use of the appropriately perturbed portion of the test data. Each cell reflects average values for 160 estimations (four runs for 40 mutual funds).

 

Table 1: Experimental Results: MAPE Values for Neural Network and Linear Regression Models as Accuracy of Test Data Varies

Neural Network: Values of MAPE
Amount Error | Fraction Error
             | 0% (0 errors) | 25% (1 error) | 50% (2 errors) | 75% (3 errors) | 100% (4 errors)
0%           | 9.5           |               |                |                |
5%           |               | 9.8           | 9.3            | 10.3           | 13.0*
10%          |               | 10.4          | 12.2*          | 12.7*          | 18.9*

Linear Regression: Values of MAPE
Amount Error | Fraction Error
             | 0% (0 errors) | 25% (1 error) | 50% (2 errors) | 75% (3 errors) | 100% (4 errors)
0%           | 16.8          |               |                |                |
5%           |               | 21.4          | 26.8*          | 43.3*          | 34.8*
10%          |               | 26.7*         | 43.7*          | 50.6*          | 58.4*
Notes:

(1) Data used to obtain these results were the test data. The 0% fraction error and 0% amount error cells reflect the accuracy of the unmodified test data used in conjunction with the unmodified linear regression and neural network models. All other cells reflect average accuracy results for 4 simulated estimations involving appropriately simulated data inaccuracies for 40 funds.

(2) Cell values marked with an asterisk are values significantly (p<.05) different than the relevant base case MAPE.  

Neural network model. The left side of Table 1 shows the results for the neural network model. Overall, MAPE increases as fraction-error increases from 25 percent to 100 percent, indicating a decrease in predictive accuracy; the one exception is a small dip from 9.8 to 9.3 between 25 percent and 50 percent fraction-error when amount-error is equal to 5 percent. At every level of fraction-error, MAPE also increases as amount-error increases from 5 percent to 10 percent, again indicating a decrease in predictive accuracy.
Linear regression model. The right side of Table 1 shows the results for the linear regression model. In general, MAPE increases as fraction-error increases from 25 percent to 100 percent, indicating a decrease in predictive accuracy. The exception occurs when amount-error is equal to 5 percent: MAPE decreases (from 43.3 to 34.8) as fraction-error increases from 75 percent to 100 percent. At every level of fraction-error, MAPE also increases as amount-error increases from 5 percent to 10 percent, again indicating a decrease in predictive accuracy.

6.1.2 ANOVA Tests

We conducted a separate two-factor analysis of variance (ANOVA) test for each type of model to test for the effect of the independent variables on MAPE. The independent variables are fraction-error (25 percent, 50 percent, 75 percent, and 100 percent) and amount-error (plus or minus 5 percent, and plus or minus 10 percent).

Table 2 shows the results of the ANOVA tests. For the neural network model, we found significant main effects for fraction-error and amount-error, as well as a significant interaction between them (p<.05). For the linear regression model, we found significant main effects for fraction-error and amount-error (p<.05); the interaction was not significant. These results indicate that both fraction-error and amount-error have an effect on the predictive accuracy of the neural network and linear regression models.

Table 2: Significance of Varying Amount-Error and Fraction-Error on Predictive Performance of Neural Network and Linear Regression Models: ANOVA Results for Varying Testing Data

Factor                                  | Significance criterion | Neural Network MAPE | Linear Regression MAPE
Fraction-error                          | F(0.05;3;1272)=2.60    | 17.976*             | 12.786*
Amount-error                            | F(0.05;1;1272)=3.84    | 22.500*             | 19.008*
Fraction-error/Amount-error interaction | F(0.05;3;1272)=2.60    | 3.145*              | 1.962

Note: Significant results (p<.05) are marked with an asterisk (*).
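The two-factor ANOVA summarized in Table 2 can be illustrated with a small standard-library Python sketch for a balanced design. The function below is a textbook sum-of-squares decomposition, not the authors' software, and the data are synthetic; only the design shape (4 fraction-error levels, 2 amount-error levels, 160 observations per cell) is taken from the text.

```python
import random
from statistics import mean


def two_way_anova(cells):
    """Balanced two-factor ANOVA.

    cells[i][j] holds the observations for level i of factor A and
    level j of factor B; every cell must have the same number of values.
    Returns (degrees_of_freedom, f_statistics) as two dicts.
    """
    a, b, n = len(cells), len(cells[0]), len(cells[0][0])
    cell_means = [[mean(cell) for cell in row] for row in cells]
    a_means = [mean(row) for row in cell_means]
    b_means = [mean(cell_means[i][j] for i in range(a)) for j in range(b)]
    grand = mean(a_means)  # valid because the design is balanced

    ss_a = b * n * sum((m - grand) ** 2 for m in a_means)
    ss_b = a * n * sum((m - grand) ** 2 for m in b_means)
    ss_ab = n * sum(
        (cell_means[i][j] - a_means[i] - b_means[j] + grand) ** 2
        for i in range(a) for j in range(b)
    )
    ss_error = sum(
        (x - cell_means[i][j]) ** 2
        for i in range(a) for j in range(b) for x in cells[i][j]
    )

    df = {"A": a - 1, "B": b - 1, "AB": (a - 1) * (b - 1), "error": a * b * (n - 1)}
    ms_error = ss_error / df["error"]
    f = {
        "A": ss_a / df["A"] / ms_error,
        "B": ss_b / df["B"] / ms_error,
        "AB": ss_ab / df["AB"] / ms_error,
    }
    return df, f


# Synthetic data shaped like the experiment (values are made up).
rng = random.Random(1)
synthetic = [[[rng.gauss(10 + i + 2 * j, 3) for _ in range(160)]
              for j in range(2)] for i in range(4)]
df, f = two_way_anova(synthetic)
```

With this design shape the degrees of freedom come out as 3 for fraction-error, 1 for amount-error, 3 for the interaction, and 4 x 2 x (160 - 1) = 1272 for error, which agrees with the F(0.05;3;1272) and F(0.05;1;1272) criteria listed in Table 2. Each F statistic is then compared with its criterion value.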

6.1.3 Independent Samples t-tests

When there are more than two levels of a factor, ANOVA results do not indicate where the significant differences occur. For example, while fraction-error is a significant factor, this difference may come as fraction-error changed from 25 percent to 50 percent, 50 to 75 percent, or 75 to 100 percent. It could also have resulted from a larger increase, such as 25 percent to 75 percent or 25 percent to 100 percent. We performed independent samples t-tests in order to determine exactly where significant differences occurred.

Neural network model. Two conclusions can be drawn for the neural network model. First, for the 5 percent amount-error, significant differences (p < .05) were found between fraction-errors of 25 percent and 100 percent, 50 percent and 100 percent, and 75 percent and 100 percent. Second, for the 10 percent amount-error, significant differences (p < .05) were found between fraction-errors of 25 percent and 75 percent, 25 percent and 100 percent, 50 percent and 100 percent, and 75 percent and 100 percent.
Linear regression model. Two conclusions can be drawn for the linear regression model. First, for the 5 percent amount-error, significant differences (p < .05) were found between fraction-errors of 25 percent and 75 percent, 25 percent and 100 percent, and 50 percent and 75 percent. Second, for the 10 percent amount-error, significant differences (p < .05) were found between fraction-errors of 25 percent and 50 percent, 25 percent and 75 percent, 25 percent and 100 percent, and 50 percent and 100 percent.
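A pairwise comparison of the kind reported above can be sketched as follows. The paper does not state which variant of the independent samples t-test was used, so the pooled-variance (Student) form here is an assumption, as is the normal approximation used for the p-value; the function name and sample data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def pooled_t_test(x, y):
    """Independent samples t-test with pooled variance.

    Returns (t, degrees_of_freedom, approximate_two_sided_p).
    The p-value uses the standard normal distribution in place of
    Student's t, which is only reasonable for large samples.
    """
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df
    t = (mean(x) - mean(y)) / sqrt(pooled_var * (1 / nx + 1 / ny))
    p_approx = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, df, p_approx


# Hypothetical absolute percent errors at two fraction-error levels.
low_fraction = [1.0, 2.0, 3.0, 4.0, 5.0]
high_fraction = [2.0, 3.0, 4.0, 5.0, 6.0]
t_stat, dof, p_value = pooled_t_test(low_fraction, high_fraction)
```

In the experiments each cell holds 160 estimations, so a comparison between two cells would have 318 degrees of freedom, large enough that the normal approximation is close to the exact t distribution.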

6.1.4 Comparisons with Base Case Scenarios

The ANOVA results indicate that there are differences in predictive accuracy at different levels of fraction-error and amount-error. However, they do not show which combinations of fraction-error and amount-error have MAPE significantly different than the base case scenario with no data errors (MAPE of 9.5 for the neural network model and 16.8 for the linear regression model). We constructed confidence intervals around the means shown in Table 1 for the experimental conditions to determine which values are significantly different than the base case scenario for each type of forecasting model. We identified combinations of fraction-error and amount-error with MAPE different than the base case scenario for the relevant model type at a level of significance of .05 with an asterisk in Table 1.

Neural network model. When amount-error is equal to 5 percent for the neural network model, the scenario with fraction-error equal to 100 percent has MAPE significantly higher (p < .05) than the base case scenario with MAPE of 9.5. When amount-error is equal to 10 percent for the neural network model, the scenarios with fraction-error equal to 50 percent, 75 percent, and 100 percent have MAPE significantly higher (p < .05) than the base case scenario.
Linear regression model. When amount-error is equal to 5 percent for the linear regression model, the scenarios with fraction-error equal to 50 percent, 75 percent, and 100 percent have MAPE significantly higher (p < .05) than the base case scenario with MAPE of 16.8. When amount-error is equal to 10 percent for the linear regression model, the scenarios with fraction-error equal to 25 percent, 50 percent, 75 percent, and 100 percent have MAPE significantly higher (p < .05) than the base case scenario.
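The base-case comparison can be sketched as a confidence interval check. The paper does not describe how its intervals were constructed, so the normal-approximation interval below is an assumption, and the function name and sample data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def differs_from_base(errors, base_mape, confidence=0.95):
    """Build a confidence interval around the mean of errors (absolute
    percent errors for one experimental condition) and report whether
    base_mape falls outside it.

    Returns (is_different, (lower, upper)).
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95 percent
    center = mean(errors)
    half_width = z * stdev(errors) / sqrt(len(errors))
    lower, upper = center - half_width, center + half_width
    return not (lower <= base_mape <= upper), (lower, upper)


# 160 made-up absolute percent errors with a mean of 15.
condition_errors = [12.0, 14.0, 16.0, 18.0] * 40
is_different, interval = differs_from_base(condition_errors, base_mape=9.5)
```

A condition would be flagged with an asterisk in Table 1 when the base case MAPE (9.5 for the neural network model, 16.8 for linear regression) lies outside its interval.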

6.2 Experiment 2 Results: Errors in Training Data

6.2.1 Predictive Accuracy Results

We show predictive accuracy results, using the simulated inaccuracies for amount-error and fraction-error for the NAV forecasts for 1996 in Table 3 for the neural network and linear regression models. The results reflect the use of the appropriately perturbed portion of the training data. Each cell reflects average values for 160 estimations (four runs for 40 mutual funds).

Table 3: Experimental Results: MAPE Values for Neural Network and Linear Regression Models as Accuracy of Training Data Varies

Neural Network: Values of MAPE
Amount Error | Fraction Error
             | 0% (0 errors) | 5% (2 errors) | 10% (4 errors) | 15% (5 errors) | 20% (7 errors)
0%           | 9.5           |               |                |                |
5%           |               | 11.9*         | 9.1            | 8.2*           | 15.4*
10%          |               | 14.6*         | 10.8*          | 12.9*          | 17.3*

Linear Regression: Values of MAPE
Amount Error | Fraction Error
             | 0% (0 errors) | 5% (2 errors) | 10% (4 errors) | 15% (5 errors) | 20% (7 errors)
0%           | 16.8          |               |                |                |
5%           |               | 13.2*         | 10.7*          | 10.1*          | 9.5*
10%          |               | 11.0*         | 9.9*           | 10.2*          | 12.0*
Notes:

(1) Data used to obtain these results were the training data. The 0% fraction error and 0% amount error cells reflect the accuracy of the unmodified test data used in conjunction with the unmodified linear regression and neural network models. All other cells reflect average accuracy results for 4 simulated estimations involving appropriately simulated data inaccuracies for 40 funds.

(2) Cell values marked with an asterisk are values significantly (p<.05) different than the relevant base case MAPE.

Neural network model. The left side of Table 3 shows the results for the neural network model. As fraction-error increases from 5 percent to 10 percent, MAPE decreases, indicating an increase in predictive accuracy. A comparison of the results for 5 percent and 15 percent fraction-error shows that MAPE is lower for 15 percent fraction-error, indicating better predictive accuracy. As fraction-error increases from 15 percent to 20 percent, MAPE increases, indicating a decrease in predictive accuracy. Over the range of fraction-error tested, predictive accuracy is best for the inner levels (10 percent and 15 percent) and poorest at the outer levels (5 percent and 20 percent). For all levels of fraction-error, as amount-error increases from 5 percent to 10 percent, MAPE increases, indicating a decrease in predictive accuracy.
Linear regression model. The right side of Table 3 shows the results for the linear regression model. When amount-error is equal to 5 percent, MAPE decreases as fraction-error increases, indicating an increase in predictive accuracy. When amount-error is equal to 10 percent, (1) MAPE decreases as fraction-error shifts from 5 percent to 10 percent, indicating an increase in predictive accuracy, and (2) MAPE increases as fraction-error shifts from 10 percent to 20 percent, indicating a decrease in predictive accuracy. When fraction-error is equal to 5 percent or 10 percent, MAPE decreases as amount-error increases from 5 percent to 10 percent, indicating an increase in predictive accuracy. When fraction-error is equal to 15 percent, MAPE is nearly identical for the scenario with amount-error equal to 5 percent and the scenario with amount-error equal to 10 percent. When fraction-error is equal to 20 percent, MAPE increases as amount-error increases from 5 percent to 10 percent, indicating a decrease in predictive accuracy.

6.2.2 ANOVA Tests

We conducted a separate two-factor analysis of variance (ANOVA) test for each type of model to test for the effect of the independent variables on MAPE. The independent variables are fraction-error (5 percent, 10 percent, 15 percent, and 20 percent) and amount-error (plus or minus 5 percent, and plus or minus 10 percent).

Table 4 shows the results of the ANOVA tests. For the neural network model, we found main effects for both fraction-error and amount-error (p<.05). For the linear regression model, we found a main effect for fraction-error and an interaction effect between fraction-error and amount-error (p<.05). We believe this interaction is important, and an analysis of the dependent variable suggests that a transformation of the variable is not appropriate (Neter et al., 1990). For linear regression, at the lower levels of fraction-error (5 percent and 10 percent), predictive accuracy is best at the higher level of amount-error (10 percent). At the highest level of fraction-error (20 percent), predictive accuracy is best at the lower level of amount-error (5 percent). These results indicate that both fraction-error and amount-error affect the predictive accuracy of the neural network and linear regression models, although for linear regression amount-error acts through its interaction with fraction-error rather than as a main effect.

Table 4: Significance of Varying Amount-Error and Fraction-Error on Predictive Performance of Neural Network and Linear Regression Models: ANOVA Results for Varying Training Data

Factor                                  | Significance criterion | Neural Network MAPE | Linear Regression MAPE
Fraction-error                          | F(0.05;3;1272)=2.60    | 24.020*             | 3.042*
Amount-error                            | F(0.05;1;1272)=3.84    | 20.316*             | 0.046
Fraction-error/Amount-error interaction | F(0.05;3;1272)=2.60    | 1.275               | 3.812*

Note: Significant results (p<.05) are marked with an asterisk (*).

6.2.3 Independent Samples t-tests

When there are more than two levels of a factor, ANOVA results do not indicate where the significant differences occur. We performed independent samples t-tests in order to determine exactly where significant differences occurred.

Neural network model. Based on our research, we draw four conclusions for the neural network model:
Linear regression model. For the linear regression model, at the 5 percent amount-error, we found significant differences (p < .05) between fraction-errors of 5 percent and 10 percent, 5 percent and 15 percent, and 5 percent and 20 percent. For the linear regression model, at the 10 percent amount-error, we found no significant differences (p < .05).

6.2.4 Comparisons with Base Case Scenarios

The ANOVA results indicate that there are differences in predictive accuracy at different levels of fraction-error and amount-error. However, they do not show which combinations of fraction-error and amount-error have MAPE significantly different than the base case scenario with no data errors (MAPE of 9.5 for the neural network model and 16.8 for the linear regression model). We constructed confidence intervals around the means shown in Table 3 for the experimental conditions to determine which values are significantly different than the base case scenario for each type of forecasting model. Combinations of fraction-error and amount-error with MAPE different than the base case scenario for the relevant model type at a level of significance of .05 are identified with an asterisk in Table 3.

Neural network model. When amount-error is equal to 5 percent for the neural network model, the scenario with fraction-error equal to 15 percent has MAPE significantly lower than the base case scenario and the scenarios with fraction-error equal to 5 percent and 20 percent have MAPE significantly higher than the base case scenario with MAPE of 9.5 (p < .05). When amount-error is equal to 10 percent for the neural network model, the scenarios with fraction-error equal to 5 percent, 10 percent, 15 percent, and 20 percent have MAPE significantly higher than the base case scenario (p < .05).
Linear regression model. When amount-error is equal to 5 percent for the linear regression model, the scenarios with fraction-error equal to 5 percent, 10 percent, 15 percent, and 20 percent have MAPE significantly lower than the base case scenario with MAPE of 16.8 (p < .05). When amount-error is equal to 10 percent for linear regression, the scenarios with fraction-error equal to 5 percent, 10 percent, 15 percent, and 20 percent have MAPE significantly lower than the base case scenario (p < .05).

7.0 Conclusion

Several conclusions can be drawn about the sensitivity of the neural network and linear regression models to data errors. The first set of conclusions addresses the effect of errors in test data. The second set addresses the effect of errors in training data.

1. Errors in test data. For the neural network model, we demonstrated that predictive accuracy decreases as the magnitude of errors (amount-error) increases and as the error rate (fraction-error) increases. We found that predictive accuracy decreases as the error rate increases, which is a departure from the work of Bansal et al. (1993) who discuss a neural network application that is not affected by the error rate of test data. One difference between this study and the work of Bansal et al. (1993) is that our levels of fraction-error range from 25 percent to 100 percent while the levels of fraction-error used in the Bansal et al. (1993) study are markedly lower (4 percent, 8 percent, and 12 percent). Thus, our study shows that variations in the error rate of test data may affect the predictive accuracy of neural network models at these higher levels.

For the linear regression model, we show that predictive accuracy decreases as the magnitude of errors (amount-error) increases and as the error rate (fraction-error) increases. All scenarios with data errors except the case of 5 percent amount-error and 25 percent fraction-error have predictive accuracy significantly worse than the base case scenario without data errors. This finding is consistent with the work of Bansal et al. (1993).

2. Errors in training data. For the neural network model, we demonstrated that predictive accuracy decreases as the magnitude of errors increases. We also show that the error rate affects predictive accuracy, but that the relationship between the error rate and predictive accuracy is not a simple increasing function. As the error rate begins to increase from 5 percent, predictive accuracy first increases and then decreases as the error rate hits 20 percent. The pattern of results when amount-error is equal to 5 percent is particularly interesting. When the error rate is equal to 10 percent, predictive accuracy is not significantly different than the base case scenario with no errors. When the error rate is equal to 15 percent, predictive accuracy is significantly better than the base case scenario with no errors. When the error rate reaches 20 percent, predictive accuracy is significantly poorer than the base case scenario. The overall pattern of results when amount-error is equal to 10 percent is similar in that predictive accuracy is better when the error rate is equal to 10 and 15 percent and poorer when the error rate is equal to 5 and 20 percent.

For the linear regression model, we demonstrated that the predictive accuracy of a linear regression model built to forecast the NAV of mutual funds is better when errors exist in training data than when training data are free of errors. All of the scenarios with errors have predictive accuracy significantly better than the base case scenario without data errors.

We believe our conclusions about the effect of errors in training data are a significant contribution to the literature on data quality because ours is the first research about the effect of errors in training data on neural network and linear regression models.

In addition to testing the sensitivity of the neural network and linear regression models to data errors, we compared the relative performance of the models to determine which model is more sensitive to errors in test data and which model is more sensitive to errors in training data.

To determine which type of model is more sensitive to errors in test data, we conducted a paired comparison t-test for the MAPE values reported in Table 1, which indicate the predictive accuracy of the linear regression and neural network models at different levels of the fraction-error and amount-error factors. This test shows that the neural network model performs better than the linear regression model (p<.0001) when errors exist in test data. On average, the MAPE of the neural network model is 26.14 percentage points lower than the MAPE of the linear regression model when forecasts are made using test data with errors.

To determine which type of model is more sensitive to errors in training data, we conducted a paired comparison t-test for the MAPE values reported in Table 3, which indicate the predictive accuracy of the linear regression and neural network models at different levels of the fraction-error and amount-error factors. This test shows that the linear regression model performs better than the neural network model (p<.0001) when errors exist in training data. On average, the MAPE of the linear regression model is 1.69 percentage points lower than the MAPE of the neural network model when the models are constructed using training data with errors. Table 5 presents the results of the paired comparison t-tests (Neter et al., 1990).

Table 5: Paired Comparison t-Tests for Cell Means

Experiment    | Mean  | Standard Error | t-statistic | Significance level
Testing Data  | 26.14 | 1.55           | 16.45       | 0.0001
Training Data | -1.69 | 0.26           | -4.20       | 0.0001

Note:

The paired t-tests compare the performance as measured by mean absolute percent error of the linear regression and neural network models as fraction-error and amount-error vary. A positive value for the mean indicates that the neural network model performed better than the linear regression model. A negative value for the mean indicates that the neural network model performed worse than the linear regression model.
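The mean differences in Table 5 can be checked against the cell means published in Tables 1 and 3. The sketch below pairs the eight error conditions in each table and takes linear regression MAPE minus neural network MAPE, following the sign convention in the note above. The function and variable names are hypothetical. The standard error and t-statistic it computes are based only on the eight rounded cell means, so they should not be expected to match Table 5, because the paper does not state how many paired observations entered the authors' tests.

```python
from math import sqrt
from statistics import mean, stdev

# Cell means transcribed from Tables 1 and 3 (conditions with errors only),
# ordered by amount-error (5%, then 10%) and then by fraction-error level.
NN_TEST = [9.8, 9.3, 10.3, 13.0, 10.4, 12.2, 12.7, 18.9]
LR_TEST = [21.4, 26.8, 43.3, 34.8, 26.7, 43.7, 50.6, 58.4]
NN_TRAIN = [11.9, 9.1, 8.2, 15.4, 14.6, 10.8, 12.9, 17.3]
LR_TRAIN = [13.2, 10.7, 10.1, 9.5, 11.0, 9.9, 10.2, 12.0]


def paired_comparison(lr, nn):
    """Return (mean difference, standard error, t) for LR minus NN MAPE.

    A positive mean indicates that the neural network model had the
    lower MAPE, that is, the better predictive accuracy.
    """
    diffs = [l - n for l, n in zip(lr, nn)]
    d_bar = mean(diffs)
    se = stdev(diffs) / sqrt(len(diffs))
    return d_bar, se, d_bar / se


test_mean, test_se, test_t = paired_comparison(LR_TEST, NN_TEST)
train_mean, train_se, train_t = paired_comparison(LR_TRAIN, NN_TRAIN)
```

On the published cell means, the test-data mean difference is 26.1375, which rounds to the 26.14 reported in Table 5, and the training-data mean difference is -1.70, close to the reported -1.69; the small gap is consistent with rounding in the published cells.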

The results of our study have implications for practitioners working in a variety of settings characterized by imperfect data. They suggest that understanding the error rate and the magnitude of errors in training and test data should be an important consideration when choosing between alternative forecasting tools such as neural network and linear regression models. This study shows that for one forecasting task, a neural network model provides better predictive accuracy than a linear regression model when data are free of errors and when test data contain errors. However, for the same forecasting task, a linear regression model provides better predictive accuracy than a neural network model when training data contain errors. This result suggests that those who build and use forecasting models need to be aware of the error rate and magnitude of error in both training and test data. Test data, which are typically more recent, may be more accurate than training data if the processes associated with gathering and storing data have improved over time. Conversely, training data may be more accurate than test data if the accuracy of a given data value improves as it ages (for example, when preliminary figures are later revised). Ballou and Pazer (1995) discuss economic data disseminated by the U.S. government as an example of data that improve in accuracy over time.

The results of our study also have implications for practitioners who have made the decision to use either a neural network or linear regression model and find themselves working with test data containing errors. Our findings suggest that an understanding of the error rate and the magnitude of errors in a data set should be important considerations for users of these models and that devoting resources to lowering the error rate in test data is likely to be beneficial.

Our results also suggest that the error rate of a data set used to build a neural network or linear regression model should be an important consideration. Our discovery that lowering the error rate of training data can decrease the predictive accuracy of both types of models under some conditions is of particular practical importance given the potential cost of lowering the error rate.

Although it would be rash to rely on the results of a single study as the basis for conclusions about the effect of errors on neural network and linear regression models in general, such conclusions may be drawn on the basis of a body of evidence collected through additional research. Our study demonstrates that the outputs of neural network and linear regression models developed to make predictions in one problem domain are sensitive to data errors. The results suggest that additional studies designed to examine the effect of data errors on the outputs of neural network and linear regression models in other problem domains would be worthwhile.

Until a body of evidence addressing the research questions across application domains is constructed, we suggest that designers and users of neural network and linear regression models who are interested in understanding the relationship between data errors and predictive accuracy for a problem domain follow the methodology outlined in this paper. We also suggest that a module for analyzing the effect of data errors be added to neural network and traditional statistical analysis software packages so that users working in other domains can more easily understand the effect of data errors on their work.
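As one possible shape for such a module, the sketch below runs a small grid over the number of perturbed input variables and the amount-error level, and reports average MAPE for each combination when the errors are placed in the data used for prediction. Everything here is an illustrative assumption: the function names, the toy model, and the data are hypothetical, and the default of four runs per cell simply echoes the four runs used in the experiments.

```python
import random


def mape(actual, predicted):
    """Mean absolute percent error, in percent."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def error_sensitivity(predict, rows, actual, variables, counts, amounts, runs=4, seed=0):
    """Average MAPE of predict() for each (count, amount) combination, where
    count input variables are shifted by +/- amount before prediction."""
    rng = random.Random(seed)
    table = {}
    for count in counts:
        for amount in amounts:
            scores = []
            for _ in range(runs):
                # Pick the variables to perturb and a random sign for each.
                signs = {v: rng.choice([1, -1]) for v in rng.sample(variables, count)}
                noisy = [
                    {v: x * (1 + signs.get(v, 0) * amount) for v, x in row.items()}
                    for row in rows
                ]
                scores.append(mape(actual, [predict(row) for row in noisy]))
            table[(count, amount)] = sum(scores) / runs
    return table


def toy_model(row):
    """Stand-in for a fitted model; a real study would plug in its own."""
    return 2.0 * row["a"] + row["b"]


rows = [{"a": 1.0, "b": 1.0}, {"a": 2.0, "b": 3.0}, {"a": 4.0, "b": 1.5}]
actual = [toy_model(r) for r in rows]  # error-free inputs reproduce the targets
table = error_sensitivity(toy_model, rows, actual, ["a", "b"],
                          counts=[0, 1, 2], amounts=[0.05, 0.10])
```

The same loop could be pointed at training data instead by refitting the model on perturbed rows inside the inner loop and predicting on clean rows.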

 


References

Agmon, N., and N. Ahituv. 1987. Assessing data reliability in an information system. Journal of Management Information Systems 4: 34-44.

Ballou, D., and H. Pazer. 1985. Modeling data and process quality in multi-input, multi-output information systems. Management Science 31: 150-162.

Ballou, D., and H. Pazer. 1995. Designing information systems to optimize the accuracy-timeliness tradeoff. Information Systems Research 6: 51-72.

Ballou, D., H. Pazer, S. Belardo, and B. Klein. 1987. Implications of data quality for spreadsheet analysis. Data Base 18: 13-19.

Ballou, D., and G. Tayi. 1989. Methodology for allocating resources for data quality enhancement. Communications of the ACM 32: 320-329.

Ballou, D., R. Wang, H. Pazer, and G. Tayi. 1998. Modeling information manufacturing systems to determine information product quality. Management Science 44: 462-484.

Balvers, R., T. Cosimano, and B. McDonald. 1990. Predicting stock returns in an efficient market. Journal of Finance 45: 1109-1128.

Bansal, A., R. Kauffman, and R. Weitz. 1993. Comparing the modeling performance of regression and neural networks as data quality varies: A business value approach. Journal of Management Information Systems 10: 11-32.

Bennin, R. 1980. Error rates in CRSP and COMPUSTAT: A second look. Journal of Finance 35: 1267-1271.

Boockholdt, J. 1989. Implementing security and integrity in micro-mainframe networks. MIS Quarterly 13: 135-144.

Breen, W., L. Glosten, and R. Jagannathan. 1990. Predictable variations in stock index returns. Journal of Finance 44: 1177-1189.

Campbell, J. 1987. Stock returns and the term structure. Journal of Financial Economics 18: 373-399.

Chiang, W., T. Urban, and G. Baldridge. 1996. A neural network approach to mutual fund net asset value forecasting. Omega 24: 205-215.

Cochrane, J. 1991. Production-based asset pricing and the link between stock returns and economic fluctuations. Journal of Finance 46: 209-238.

Cole, C. S. 1994. Forecasting interest rates with eurodollar futures rates. Journal of Futures Markets 14: 37-50.

Consumer enemy no. 1. 1991. Newsweek, 28 October, 42, 47.

Corman, L. 1988. Data integrity and security of the corporate data base: The dilemma of end user computing. Data Base 19: 1-5.

Davis, G. 1984. Caution: User developed systems can be dangerous to your organization. MISRC Working Paper 82-04, MIS Research Center, University of Minnesota.

Davis, G., D. Adams, and C. Schaller. 1983. Auditing & EDP. New York: American Institute of Certified Public Accountants, Inc.

Davis, G., and M. Olson. 1985. Management information systems: Conceptual foundations, structure, and development. New York: McGraw-Hill Book Company.

Dead farmer syndrome haunts efforts to trim USDA offices. 1993. Minneapolis Star Tribune 19 April, p. 5A.

Fama, E., and K. French. 1989. Business conditions and expected returns on stocks and bonds. Journal of Financial Economics 25: 23-49.

Ferson, W., and C. Harvey. 1993. The risk and predictability of international equity returns. Review of Financial Studies 6: 527-566.

Fox, C., A. Levitin, and T.C. Redman. 1993. The notion of data and its quality dimensions. Information Processing & Management 30: 9-19.

French, K., G. Schwert, and R. Stambaugh. 1987. Expected stock returns and volatility. Journal of Financial Economics 19: 3-30.

Glosten, L., R. Jagannathan, and D. Runkle. 1993. On the relation between the expected value and the volatility of the nominal excess returns on stocks. Journal of Finance 48: 1779-1802.

Hardgrave, B., R. Wilson, and K. Walstrom. 1994. Predicting graduate student success: A comparison of neural networks and traditional techniques. Computers and Operations Research 21: 249-263.

Hershey, R. D. 1995. US is considering a large overhaul of economic data. New York Times, 16 January, pp. A1 and D3.

Huh, Y., F. Keller, T.C. Redman, and A. Watkins. 1990. Data quality. Information and Software Technology 32: 559-565.

The individual investor’s guide to low-load mutual funds. 1997. 16th ed. Chicago, IL: American Association of Individual Investors.

Jabbour, G. M. 1994. Prediction of future currency exchange rates from current currency futures prices: The case of GM and JY. Journal of Futures Markets 14: 25-36.

Jain, A., and J. Mao. 1996. Artificial neural networks: A tutorial. Computer 29: 31-44.

Jain, B., and B. Nag. 1995. Artificial neural network models for pricing initial public offerings. Decision Sciences 26: 283-302.

Jankus, J. C. 1997. Relating global bond yields to macroeconomic forecasts. Journal of Portfolio Management 23: 96-101.

Knight, B. 1992. The data pollution problem. Computerworld 26: 81-83.

Laudon, K. 1986. Data quality and due process in large interorganizational record systems. Communications of the ACM 29: 4-11.

Lewis-Beck, M. S. 1980. Applied regression: An introduction. Newbury Park, CA: Sage Publications, Inc.

Madnick, S., and R. Wang. 1992. Introduction to the TDQM research program. Total Data Quality Management Research Program Working Paper #92-01.

Mark, N. C. 1995. Exchange rates and fundamentals: Evidence on long-horizon predictability. The American Economic Review 85: 201-218.

Masson, E., and Y. Wang. 1990. Introduction to computation and learning in artificial neural networks. European Journal of Operational Research 47: 1-28.

Mentzer, J. T., and J.E. Cox. 1984. Familiarity, application, and performance of sales forecasting techniques. Journal of Forecasting 3: 27-36.

Morey, R. 1982. Estimating and improving the quality of information in a MIS. Communications of the ACM 25: 337-342.

Morgenstern, O. 1963. On the accuracy of economic observations. Princeton, NJ: Princeton University Press.

Nayar, M. 1993. Achieving information integrity: A strategic imperative. Information Systems Management 10: 51-61.

NeuroShell 2. 1996. 4th ed. Frederick, MD: Ward Systems Group.

Neural Applications Corporation: Intelligent process optimization. 1997, June 27. http://www.neural.com/ .

Neter, J., W. Wasserman, and M. Kutner. 1990. Applied linear statistical models. 3rd ed. Homewood, IL: Irwin.

O'Leary, D. 1993. The impact of data accuracy on system learning. Journal of Management Information Systems 9: 83-98.

Panko, R. R. 1998. What we know about spreadsheet errors. Journal of End User Computing 10, no. 2: 15-21.

Pesaran, M., and A. Timmermann. 1994. Forecasting stock returns: An examination of stock market trading in the presence of transaction costs. Journal of Forecasting 13: 335-367.

Pesaran, M., and A. Timmermann. 1995. Predictability of stock returns: Robustness and economic significance. Journal of Finance 50: 1201-1228.

Redman, T. C. 1992. Data quality: Management and technology. New York: Bantam Books.

Redman, T. C. 1995. Improve data quality for competitive advantage. Sloan Management Review 36: 99-107.

Redman, T. C. 1996. Data quality for the information age. Norwood, MA: Artech House, Inc.

Rosenberg, B., and M. Houglet. 1974. Error rates in CRSP and COMPUSTAT data bases and their implications. Journal of Finance 29: 1303-1310.

Sanders, N. R. 1994. Forecasting practices in US corporations: Survey results. Interfaces 24: 92-100.

Schoneburg, E. 1990. Stock price prediction using neural networks: A project report. Neurocomputing 2: 17-27.

Statistical abstract of the United States. 1996. Washington, D.C.: U.S. Bureau of the Census, Government Printing Office.

Tam, K., and M. Kiang. 1992. Managerial applications of neural networks: The case of bank failure predictions. Management Science 38: 926-947.

Trendy Systems, LLC. End of day S&P futures trading. 1997, June 27. http://www.trendysystems.com/ .

Wadi, I. 1996, November 25. Neural network model predicts naphtha cut point. Oil & Gas Journal 94: 67-70.

Wand, Y., and R. Wang. 1996. Anchoring data quality dimensions in ontological foundations. Communications of the ACM 39: 86-95.

Wang, J., and B. Malakooti. 1992. A feedforward neural network for multiple criteria decision making. Computers and Operations Research 19: 151-167.

Wang, R., and D. Strong. 1996. Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems 12: 5-34.

Wasserman, G., and A. Sudjianto. 1996. A comparison of three strategies for forecasting warranty claims. IEEE Transactions 28: 967-977.

Yoon, Y., G. Swales, and T. Margavio. 1993. A comparison of discriminant analysis versus artificial neural networks. Journal of Operational Research 44: 51-60.

Zahedi, F. 1991. An introduction to neural networks and a comparison with artificial intelligence and expert systems. Interfaces 21: 25-38.

Zmud, R. 1978. An empirical investigation of the dimensionality of the concept of information. Decision Sciences 9: 187-195.
