Wednesday 16 May 2018

Trouble-shooting for Excel regression reporting: some advice

Trouble-shooting for Excel regression reporting: some advice.

Some Independent Study students report on the following issues as related to Excel regression reports when conducting multiple regression analysis:

1. t-value and p-value figures are contradictory: I suggest that you examine the figures of p-values. If the figure is 2.3E-3; that means the figure is 0.00.23. You could choose the comma format, allowing for more decimal points to let Excel shows the actual figures in a way that you could understand better.  My experiment 1 result below is indicative (y variable figures are all the same):

Experiment 1: variable y value is constant

Y-variable X1 X2 X3 X4 X5
1.2 21 14 0 23 47
1.2 23 15 1 24 48
1.2 18 16 0 25 48
1.2 19 17 1 26 50
1.2 22 18 0 27 51
1.2 17 19 0 28 52
1.2 21 20 0 29 53
1.2 20 21 1 30 50
1.2 20 22 1 24 55
1.2 22 23 0 32 56
1.2 25 24 1 20 55
1.2 25 25 1 34 58
1.2 20 12 0 23 47
1.2 21 12 1 19 60
1.2 20 15 1 26 45
1.2 18 17 1 21 42



  Coefficients Standard Error t Stat P-value
Intercept 1.2 1.88E-31 6.4E+30 2.1E-304
X1 -1E-33 7.76E-33 -0.13156 0.897945
X2 -1.1E-33 5.22E-33 -0.21339 0.835314
X3 2.48E-32 3.17E-32 0.781673 0.452516
X4 1.32E-33 4.6E-33 0.287302 0.779746
X5 1.81E-34 3.67E-33 0.049329 0.961629

Revised report format

  Coefficients Standard Error t Stat P-value
Intercept             1.2000000             0.0000000                                                     6,396,897,226,530,190,000,000,000,000,000.0000000             0.0000000
X1 -          0.0000000             0.0000000 -                                                                                                                         0.1315552             0.8979452
X2 -          0.0000000             0.0000000 -                                                                                                                         0.2133856             0.8353138
X3             0.0000000             0.0000000                                                                                                                           0.7816735             0.4525163
X4             0.0000000             0.0000000                                                                                                                           0.2873024             0.7797462
X5             0.0000000             0.0000000                                                                                                                           0.0493285             0.9616288


If the p-value is #NUM! while the t-value is very large, that may means that the data set has some extreme figures, e.g. all x1 variable figures are the same, see my experiment 2 result below). [Note: the explanation of #NUM! from Microsoft can be found in this link].

2. If the t-value is very large, then what can be done? Try to check the underlying data file figures to see if there are outlier figures, which explain why this is the case; or, it is due to data entry errors. Another way to deal with that is to use a record as based record to calculate sensitivity figures. Sensitivity figures (in % terms) tend not to vary so significantly as the absolute figures. Anyway, even if the t-value is very large, e.g. 1,000, it is still possible to conduct a hypothesis testing by comparing the sample t-value of 1,000 with the critical value of 1.96. At the same time, students could also consider to revise or introduce some new quantitative tests to explore the data set further.
3. It is also useful to revise the formula design to explore the underlying data set further.
4. If all efforts to cope with the technical problem fails, try to stick to one set of figures for analysis, i.e., the t-value of the p-value. This is the means of last resort. This step has risk, however; see point 5b below.
5a. If the b value is 0 and the standard error figure is also 0 for an x-variable, my experiment finding shows that the variable figures for this x-variable in the file is a constant value.  See my experiment result as follows:

Experiment 2 : variable x1 is a constant value.

Y-variable X1 X2 X3 X4 X5
21 1.2 14 0 23 47
23 1.2 15 1 24 48
18 1.2 16 0 25 48
19 1.2 17 1 26 50
22 1.2 18 0 27 51
17 1.2 19 0 28 52
21 1.2 20 0 29 53
20 1.2 21 1 30 50
20 1.2 22 1 24 55
22 1.2 23 0 32 56
25 1.2 24 1 20 55
25 1.2 25 1 34 58
20 1.2 12 0 23 47
21 1.2 12 1 19 60
20 1.2 15 1 26 45
18 1.2 17 1 21 42


Coefficients Standard Error t Stat P-value
Intercept 9.617407577 6.683169194 1.439048945 0.177978
X1 0 0 65535 #NUM!
X2 0.145132095 0.197826159 0.733634501 0.478517
X3 0.68778005 1.214593666 0.566263491 0.582583
X4 -0.032302779 0.178243671 -0.181228194 0.859485
X5 0.17517698 0.132191307 1.325177764 0.211979


Anyway, it is useful to examine the underlying data file to figure out what happens in the data pattern. It also points to the need to make further refinement of the research design, notably on the regression formula setting. One needs to especially review the choice of x1 in the formula as shown in my experiment above.
5b. When the data set consists of records with such extreme values, the Excel calculation may run into some difficulties, resulting in certain calculation errors. On this point, I am not able to verify however. I do feel that more fundamentally, the formula design and data gathering approach need to be revised or refined in this case.


Two additional final viewpoints:
Point 1: It is very rare that the problem of Regression reporting is due to Excel bugs. And, after you resolve the problem, try to justify your revised research design in your Independent Study report.
Point 2: Although I do not have the technical knowledge to answer all Excel problem, I feel that we do not need multiple regression analysis to study a data set with a large amount of extreme figures. Extreme data patterns, in this case, can simply be studied with more basic descriptive statistics, e.g., scatter diagrams, in my view.

No comments:

Post a Comment