Jun 27, 2005, 1:57:44 AM
On June 19, Robin Edwards wrote, regarding a data set in SPSS

I discussed, involving "model building",

RE> Totally lacking access to anything to with SPSS, I would

RE> be very pleased to be told how I might get hold of that

RE> data set. Sounds like an interesting and instructional

RE> exercise to try analysing it from a standing start.

On June 21, I wrote

RF> I don't see many eager volunteers to help on providing the

RF> data. Let me try to see if that data is STILL used in the

RF> current SPSS Manual.

Not getting any response about the data set, I wrote on June 22,

"I just remembered now that I reviewed the book by Cox and Snell,

"Applied Statistics, Principles and Practice", in JASA (1984,

229-231) in which I re-analyzed their most detailed analysis in

the book, a multiple regression model building example, and

showed that as carefully as they showed many things THEY

considered that most other analysts would have missed, that

THEIR analysis was still too cut-and-dried. I produced several

alternative models (based on their data) in my review that

proved superior to their final model by significant PRACTICAL

significance margins as well as statistical significance. "

"JASA is widely accessible. Perhaps you or other readers may be

interested in taking a look at THAT example while I give it

another couple of days for anyone to help out with the SPSS data

before I type it MYSELF, and challenge everyone who thinks they

are good at "model building" to try their hands on it. "

I don't think any help is forthcoming, so here's the data, from

the 1975 SPSS Manual:

INVDEX Investors Index 1940 = 100

GNP Gross National Product (scaled)

C.PROF Corporate Profits before taxes

C.DIVD Corporate Dividends paid

INVDEX GNP C.PROF C.DIVD

76.4 7678 269 216 (1935)

99.5 8022 351 251 (1936)

105.9 8820 403 250

86.7 8871 362 290

83.7 9536 541 304

70.7 10911 619 317

61.7 12486 801 273

58.7 14816 917 243

76.3 15357 882 233

76.6 15927 858 211

91 15552 852 195

105.8 15251 966 230

96.8 15446 1008 286

102.8 15735 908 240

100 16343 851 278

120.3 17471 1065 361

153.8 18547 1034 300

158.2 20027 1081 296

146.5 20794 1089 287

165.6 20186 953 282

212.7 21920 1206 321

245.9 23811 1313 340

236 24117 1202 364

218.8 24397 1242 371

242.6 25242 1378 388

256.9 15849 1295 397

326.1 25615 1314 436

314.4 28287 1422 470

336 29740 1525 511

394 31650 1718 583

433.1 33814 1836 629 (1965)

408.5 35822 1762 655 (1966)

The example was used to illustrate how to use the Multiple

Regression Procedure to fit INVDEX to three predictor variables

(GNP, C.PROF, and C.DIVD) and what outputs are given. The output

showed EVERYTHING to be highly statistically significant (anyone

can run this set of data and will notice that, which is obvious).

It turned out that the highly statistically significant results

are highly DECEPTIVE!

There are many hidden and important LESSONS behind the ROUTINE

fitting of this data set (let's say for prediction purposes).

This would be a TYPICAL problem in "model building" using multiple

regression methods -- that's all the hint I am going to give now.

If you think that fitting the data as given would yield a "good

result" (as was given in the SPSS Manual), because the R^2 is

well over 0.9, or whatever else that impressed you about the

fit, you deserve to FLUNK any course in "model building" OR

"data analysis"! Not even a low D!

So, the challenge is to do SOMETHING right! There is no "best"

answer or model for the data, but some are clearly far superior

to others -- that's the usual result in almost all problems of

fitting models to data. For this EXERCISE, there is NO need to

attempt to explain anything in terms of cause or influence.

Just a matter of FITTING and using the model for prediction

intervals at various values of the predictors.

So, there's something to be done besides stuffing the data into

the SPSS package and getting the standard multiple regression results.

Let's see what discoveries and what model anyone might have found

or points to make for this exercise.

-- Bob.

Jun 27, 2005, 3:59:54 AM
>

> The example was used to illustrate the how to use the Multiple

> Regression Procedure to fit INVDEX to three predictor variables

> (GNP, C.PROF, and C.DIVD) and what outputs are given. The output

> showed EVERYTHING to be highly statistically significant (anyone

> can run this set of data and will notice that, which is obvious).

>

this (output from R):

> reg3=lm(NVDEX~GNP+C.DIVD+C.PROF)

> anova(reg3)

Analysis of Variance Table

Response: NVDEX

Df Sum Sq Mean Sq F value Pr(>F)

GNP 1 323110 323110 318.7243 < 2.2e-16 ***

C.DIVD 1 35447 35447 34.9658 2.312e-06 ***

C.PROF 1 3 3 0.0028 0.9582

Residuals 28 28385 1014

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova() adds terms to the ANOVA table sequentially, so the order of the

terms is important. I don't know how the SS's are calculated in SPSS.
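For anyone who wants to see the order dependence concretely, here is a minimal sketch (my own illustration with made-up data, in Python/NumPy rather than R, since only the principle matters): sequential (Type I) sums of squares are the reductions in residual SS as each term enters, so a predictor's SS changes with its position whenever the predictors are correlated.

```python
# Illustration (not from the thread): sequential (Type I) SS depend on
# the order in which correlated predictors enter the model.
import numpy as np

def rss(X, y):
    # Residual sum of squares from an OLS fit of y on X.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

rng = np.random.default_rng(0)
n = 30
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=n)   # x2 strongly correlated with x1
y = 2.0 * x1 + rng.normal(size=n)

one = np.ones(n)
# SS attributed to x1 when it enters first vs when it enters after x2.
ss_x1_first = rss(np.column_stack([one]), y) - rss(np.column_stack([one, x1]), y)
ss_x1_last = (rss(np.column_stack([one, x2]), y)
              - rss(np.column_stack([one, x2, x1]), y))
print(ss_x1_first, ss_x1_last)   # very different when x1, x2 are correlated
```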

Actually, I think that both Reef Fish and I are right, but finding out

why we get different results is educational in itself, so I'm not going

to give it away. It's obvious once you spot it.

Bob

Jun 27, 2005, 5:09:15 AM
While I'm no economist, I have to ask whether 1960 is a typ0.

--Jerry

Jun 27, 2005, 9:19:26 AM
I'm no economist either (and would be highly insulted to be

called one ;-) ), but I think the GNP figure should be 25849,

not 15849.

Cheers,

Russell

Jun 27, 2005, 10:06:51 AM
LESSON #1. ALWAYS examine the data for gross (and not so gross)

anomalies.

Jerry and Russell are making a good start toward their Fish

University "A".

Anon Bob O'Hara is maintaining the solid "F" he had earned

in this ng.

The 15849 is of course an obvious typo, not by ME (it took me

about 10 minutes to type the data, about 30 minutes to write

a multiple regression program in SPEAKEASY because I have NO

access to any statistical package; and at least an hour to

find and correct the half a dozen or so typos of MINE <G> by

checking against the results I had done 30 years ago). The

typos were in the 1975 SPSS Manual!

The TYPO was what contributed to all THREE variables being

statistically significant in the SPSS Manual -- without it ...

that's the next chapter/Lesson. :-)

In this case, the typo was so obvious that Jerry and Russell

spotted it without doing any analysis. But many gross

outliers and anomalies are seen during various graphical

displays or analyses of the data.

The early spotting of the gross typo took us to first base --

there were TWO typos in the 1975 SPSS Manual. For subsequent

analysis, change the 1960 and 1961 values of GNP as follows:

15849 ----> 25849 (good guess by Russell)

25615 ----> 26515 (common transposition error, "56" for "65")

The corrected values were used by me in my Data Analysis

Lecture Notes since 1975. The corrections were taken from

the 1972 SPSS Manual.

NOW we have the corrected data set we can start the model building

exercise.

-- Bob.

Jun 27, 2005, 10:20:55 AM
Reef Fish wrote:

> In this case, the typo was so obvious that Jerry and Russell

> spotted it without doing any analysis. But many gross

> outliers and anomalies are seen during various graphical

> displays or analyses of the data.

Actually, I spotted it by graphing the data, which I consider analysis.

You can rouse any of my students from a sound sleep, shine a bright

light in his/her eyes, and shout at the top of your lungs, "What's the

first thing you do with a set of data?" Without missing a beat, s/he'll

mumble, "Display it..." and go back to sleep.
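Jerry's rule can even be sketched numerically (my illustration; the 20% jump threshold is an arbitrary choice): a crude scan of successive changes in the GNP column as typed above immediately flags the 1960 entry.

```python
# The GNP column as typed in the original post (including the suspect value).
gnp = [7678, 8022, 8820, 8871, 9536, 10911, 12486, 14816, 15357, 15927,
       15552, 15251, 15446, 15735, 16343, 17471, 18547, 20027, 20794, 20186,
       21920, 23811, 24117, 24397, 25242, 15849, 25615, 28287, 29740, 31650,
       33814, 35822]

# Flag any entry that jumps more than 20% from its predecessor.
suspects = [i for i in range(1, len(gnp))
            if abs(gnp[i] - gnp[i - 1]) / gnp[i - 1] > 0.20]
print(suspects)  # → [25, 26]: the 1960 value (15849) breaks the smooth trend
```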

--Jerry

Jun 27, 2005, 10:55:23 AM
Reef Fish wrote:

> NOW we have the corrected data set we can start the model building

> exercise.

Bob, would you double check CDIVD for 1950? It seems out of whack with

the GNP. The GNP looks smooth over time. The CDIVD spikes in 1950.

Thanks!

Jun 27, 2005, 11:01:48 AM
Good for you, Jerry!

That's the Corollary to Lesson 1. :-)

You must not get much sleep, because I didn't post the data until 1:57

am!

That was why I didn't think you had time to have even looked at any

display of the data, at 5 am.

Here's another SPSS anecdote. When I was at the GSB, U. Chicago, in

1970, I had a basement office across the hall from a Ph.D. grad student

in Behavioral Sciences who was completing his thesis: Bill Ouchi.

http://www.williamouchi.com/

http://www.absoluteastronomy.com/encyclopedia/W/Wi/William_Ouchi.htm

Bill was struggling for over 2 MONTHS over some SPSS results he

couldn't understand, and came to me for help. The FIRST thing I

asked him was to show me his DATA -- NOT any of his SPSS regression

results.

It turned out that 2 MONTHS of his life went down the drain because

he mispunched (or mis-specified the input format of) some

numbers in the programs he ran. :-)

Bill is smart enough that he soon had chaired Professorships, first

at Stanford, and later at UCLA. If anyone ever runs across Bill,

I am sure he'll remember what happened, and won't mind my telling

this anecdote about his SPSS runs. :-)

-- Bob.

Jun 27, 2005, 11:18:02 AM
It could well be, but it's unimportant in our model building EXERCISE.

The data I gave are without typos (of mine) from the 1975 SPSS Manual.

I did not look at the C.DIVD data nearly as carefully because that

variable turned out to be "not needed" (I am giving part of the

solution/future-Lessons away now <G> by that remark).

I haven't looked at any SPSS Manual (or had any copy) since about the

mid 1980s. But if someone can check and correct it, that would be fine

with me.

It will not alter any of my discussions to come, on the problem based

on the data ACTUALLY given, or some further minor corrections to it.

-- Bob.

Jun 27, 2005, 11:19:17 AM
Yes, but that could be real. I'm no more CEO than economist, but

there are reasons beyond pure macro economics for dividends

to vary. What I want to know is why the heading says the Investors

Index is defined as 100 in 1940, but in fact it is 1949 when it

equals 100.

Cheers,

Russell

--

All too often the study of data requires care.

Jun 27, 2005, 11:50:05 AM
Russell...@wdn.com wrote:

> Yes, but that could be real. I'm no more CEO than economist, but

> there are reasons beyond pure macro economics for dividends

> to vary. What I want to know is why the heading says the Investors

> Index is defined as 100 in 1940, but in fact it is 1949 when it

> equals 100.

>

> Cheers,

> Russell

Yes, but plot CDIVD, GNP, and YEAR against each other and you'll see why

I asked. I can't recall anything that would make '50 special. It

wasn't a presidential election year. The Korean War started in June,

but it lasted 3 years. It might be tied to something to do with it

being 5 years after the end of WWII, but that's stretching. When in

doubt, I always ask.

Unfortunately, not yet having retired, I've got a couple of projects to

work on today. Whatever else these data are, they are short time

series. I'm not sure how Bob is proposing we deal with them. Are we to

"pretend" they are 32 iidrvs as though it were a typical multiple

regression problem? The pre-1950 data make me especially anxious about

that approach. Take a look at what's going on in the scatterplots pre-1950.

http://www.tufts.edu/~gdallal/invdex.jpg

http://www.tufts.edu/~gdallal/3d.jpg

I'll try to get back to this tonight, although I suspect others will

pick the data apart long before then.

Jun 27, 2005, 11:55:15 AM
Jerry Dallal wrote:

> Reef Fish wrote:

>

>> In this case, the typo was so obvious that Jerry and Russell

>> spotted it without doing any analysis. But many gross

>> outliers and anomalies are seen during various graphical

>> displays or analyses of the data.

>

>

> Actually, I spotted it by graphing the data, which I consider analysis.

Me too. That's also the technique that led me to seeing the typo, and

what the basic problem is with the data. I decided that it's not

worth progressing further without seeking expert advice from an

economist: I don't even know how the response variable is arrived at

(oh, and I assume that there is a typo in the explanation as well).

Bob

Jun 27, 2005, 12:12:25 PM
Russell...@wdn.com wrote:

> Yes, but that could be real. I'm no more CEO than economist, but

> there are reasons beyond pure macro economics for dividends

> to vary. What I want to know is why the heading says the Investors

> Index is defined as 100 in 1940, but in fact it is 1949 when it

> equals 100.

You'll do well as a "copy editor" for journals and books. They are

the ones who pick out MY typos, grammatical errors, and other faux

pas in the English language. :)

*I* claim that TYPO error! :-)

SPSS Manuals did give the correct definition, and it was in my notes,

that 1949 = 100. The key "9" was too close to "0" for my fat finger.

In any event, that typo is inconsequential, treating this data as a

mere EXERCISE with given data. I discount all ECONOMICS and

CORPORATE substance in the data because there are many valid concerns

that cannot be usefully discussed on those subjects relative to the

data SPSS used for Multiple Regression.

Given the above, are you ready to do some STATISTICAL analysis and

data-fitting/model-building?

-- Bob.

Jun 27, 2005, 12:19:21 PM
Here are a couple of loess smoothers. One for all of the data, another

for pre-1950 to provide additional detail.

http://www.tufts.edu/~gdallal/loess_all.jpg

http://www.tufts.edu/~gdallal/loess_pre1950.jpg

Jun 27, 2005, 12:36:20 PM
Anon Bob O'Hara. Your grade of "F" was earned in Fish University,

spotting the error notwithstanding, precisely because of your paragraph

above AS WELL AS showing the regression results from "R" even after

you'd discovered what was an OBVIOUS typo/blunder in the data.

If you had to do any TENTATIVE analysis, such as removing the one

row with the obvious data blunder, it would have been appropriate

-- all you had to do was ASK, as Jerry and Russell did, instead

of plunging right into the GARBAGE pool.

I'll temporarily withdraw your "F", and would welcome your further

analysis based on the CORRECTED data.

That's actually my greatest criticism of BATCH statistical packages

such as SPSS and SAS which may spew out 5 to 10 pages of output of

one PROC after another, all of which may have been invalidated by

the result of the FIRST graphical display!

That's the advantage of an "interactive" software which does one

small task at a time, to accomplish what George Box (JASA article

on Science and Statistics) and my Statistical Encyclopedia

article on "Interactive Data Analysis" talk about, in terms of

the ITERATIVE process of "model building".

As soon as SOMETHING is detected to require a change of course

or further examination, NO FURTHER RESULT should be printed that

would prove to be inappropriate given the finding during the

iterative process.

-- Bob.

Jun 27, 2005, 12:48:29 PM
Jerry Dallal wrote:

> Russell...@wdn.com wrote:

> > Yes, but that could be real. I'm no more CEO than economist, but

> > there are reasons beyond pure macro economics for dividends

> > to vary. What I want to know is why the heading says the Investors

> > Index is defined as 100 in 1940, but in fact it is 1949 when it

> > equals 100.

> >

> > Cheers,

> > Russell

>

>

> Yes, but plot CDIVD, GNP, and YEAR against each other and you'll see why

> I asked. I can't recall anything that would make '50 special. It

> wasn't a presidential election year. The Korean War started in June,

> but it lasted 3 years. It might be tied to something to do with it

> being 5 years after the end of WWII, but that's stretching. When in

> doubt, I always ask.

So far, so good.

>

> Unfortunately, not yet having retired, I've got a couple of projects to

> work on today.

You ASSUMED incorrectly that someone who has "retired" from academia

(out of DISGUST at those who sold their souls to the Devil) does not

have projects (more important than yours) to work on today. ;-)

> Whatever else these data are, they are short time

> series. I'm not sure how Bob is proposing we deal with them. Are we to

> "pretend" they are 32 iidrvs as though it were a typical multiple

> regression problem?

Yes, for the EXERCISE, as I had indicated in my reply to Russell,

on the same post you are following-up on, because it was done in

SPSS as a Multiple Regression example.

> The pre-1950 data make me especially anxious about

> that approach. Take a look at what's going on in the scatterplots pre-1950.

> http://www.tufts.edu/~gdallal/invdex.jpg

> http://www.tufts.edu/~gdallal/3d.jpg

It's all deja vu. :-) But you are GIVEN a dataset to FIT a

regression model, to predict INVDEX. The question is "what CAN

you do" and not "how much can I complain" about the given DATA,

which are (reasonably) assumed to be CORRECT.

>

> I'll try to get back to this tonight, although I suspect others will

> pick the data apart long before then.

Don't bother to pick the DATA apart. Do some DATA fitting, using

MODEL BUILDING methods, with the data as given.

-- Bob.

Jun 27, 2005, 12:55:51 PM
If you're grading me I'd actually want to think about what I'm

doing :-) and while it may look like I have nothing else to do,

I'm trying to stomp out bugs in a computer program by day

and renovate my house by night, so I'll just have to kibitz

about typos.

Cheers,

Russell

Jun 27, 2005, 12:58:36 PM
Reef Fish wrote:

>

> Jerry Dallal wrote:

>

>>Whatever else these data are, they are short time

>>series. I'm not sure how Bob is proposing we deal with them. Are we to

>>"pretend" they are 32 iidrvs as though it were a typical multiple

>>regression problem?

>

>

> Yes, for the EXERCISE, as I had indicated in my reply to Russell,

> on the same post you are following-up on, because it was done in

> SPSS as a Multiple Regression example.

>

>

In that case, better to label the variables A,B,C,D,... especially when

the requested analysis might not be appropriate given the labels. Just

because SPSS analyzed the data by using multiple regression doesn't mean

that such an analysis is the right way to go. Is YEAR one of the

predictors? Then, I'll be ready to do some model building.

Jun 27, 2005, 1:18:36 PM
Jerry Dallal wrote:

> Reef Fish wrote:

> >

> > Jerry Dallal wrote:

> >

> >>Whatever else these data are, they are short time

> >>series. I'm not sure how Bob is proposing we deal with them. Are we to

> >>"pretend" they are 32 iidrvs as though it were a typical multiple

> >>regression problem?

> >

> >

> > Yes, for the EXERCISE, as I had indicated in my reply to Russell,

> > on the same post you are following-up on, because it was done in

> > SPSS as a Multiple Regression example.

> >

>

> In that case, better to label the variables A,B,C,D,... especially when

> the requested analysis might not be appropriate given the labels.

You may, if you wish. But the labels were in SPSS's example and the

data do come with those labels.

> Just

> because SPSS analyzed the data by using multiple regression doesn't mean

> that such an analysis is the right way to go.

That's certainly correct. After the model-fitting EXERCISE, I'll

be more than happy to discuss WHY the SPSS type of regression

analysis is NOT the right way to go -- no ifs or buts about it.

But it certainly furnishes a nice example of DATA to a regression

EXERCISE, to contrast what SPSS didn't do right! We have already

found ONE -- that whoever did the example for the SPSS Manual

certainly did not EXAMINE the data.

There are many more generic LESSONS to come, in model-building

relative to regression analyses.

> Is YEAR one of the

> predictors? Then, I'll be ready to do some model building.

That's part of the INFO (auxiliary) given in SPSS. I am giving

everyone the FULL DISCLOSURE. You can use it (or not use it) in

WHATEVER way you deem appropriate. :-) It was NOT used in

the SPSS example either as a predictor or auxiliary information.

That's all PART of the consideration in ANY Data Analysis project!

-- Bob.

Jun 27, 2005, 1:15:22 PM
FWIW, blindly throwing everything, including year, into a multiple

linear regression equation (no interactions, no transformations) gives

an RMS of 622. Using GNP alone, a linear-linear spline with a knot at

13770 gives an RMS of 335.
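A sketch of how such a linear-linear spline can be fit (my reconstruction, not Jerry's actual code: plain least squares on a hinge term, using the corrected GNP values from earlier in the thread, so the exact RMS may differ from his 335):

```python
# Sketch of a linear-linear ("broken stick") spline in GNP with a knot
# at 13770, fit by ordinary least squares. Uses the corrected GNP values.
import numpy as np

invdex = np.array([76.4, 99.5, 105.9, 86.7, 83.7, 70.7, 61.7, 58.7, 76.3,
                   76.6, 91, 105.8, 96.8, 102.8, 100, 120.3, 153.8, 158.2,
                   146.5, 165.6, 212.7, 245.9, 236, 218.8, 242.6, 256.9,
                   326.1, 314.4, 336, 394, 433.1, 408.5])
gnp = np.array([7678, 8022, 8820, 8871, 9536, 10911, 12486, 14816, 15357,
                15927, 15552, 15251, 15446, 15735, 16343, 17471, 18547,
                20027, 20794, 20186, 21920, 23811, 24117, 24397, 25242,
                25849, 26515, 28287, 29740, 31650, 33814, 35822], float)

def rss(X, y):
    # Residual sum of squares from an OLS fit of y on X.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

knot = 13770.0
hinge = np.maximum(gnp - knot, 0.0)   # 0 below the knot, linear above it

r_line = rss(np.column_stack([np.ones_like(gnp), gnp]), invdex)
r_spline = rss(np.column_stack([np.ones_like(gnp), gnp, hinge]), invdex)
print(r_line, r_spline)   # the spline nests the line, so its RSS is smaller
```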

Jun 27, 2005, 1:32:16 PM
Russell...@wdn.com wrote:

> If you're grading me I'd actually want to think about what I'm

> doing :-)

That's tacitly assumed to be true of ANY data analyst worth

even a little bit of salt, whether he is graded by the Fish

University or not. :-)

> and while it may look like I have nothing else to do,

> I'm trying to stomp out bugs in a computer program by day

> and renovate my house by night, so I'll just have to kibitz

> about typos.

Forget about the typos. See my further remarks (latest, right

before this post) to Jerry Dallal about how to view the data

as GIVEN, and use any or all INFO given in the SPSS Manual in

the model-building EXERCISE, just as a discussion of what

SPSS (or anyone looking at that data from SPSS in a multiple

regression) SHOULD have done.

BTW, except for the "F"s, I wouldn't be so insensitive or

presumptuous to give letter grades to others.

So, nearly everyone is SAFE in that regard. I hope EVERYONE

who seriously tried to do some model building of this data

as an EXERCISE will learn some valuable LESSONS that may not

have occurred to them before.

Some of what *I* have to say can only be found in *MY* Data

Analysis Lecture Notes -- that anyone will be welcome to

criticize or challenge. So, I am EXPOSING myself to GRADES

or attacks given by all the Quacks and Malpractitioners (they

wouldn't know enough to contribute anything), AS WELL AS

(and more likely) challenges by those who are competent

on the subject of model-fitting in regression analysis, with

real DATA (what SOME in this ng may have never seen. LOL)

-- Bob.

Jun 27, 2005, 2:10:54 PM
Without even graphing the data, I can see that they are time

trending, and probably have unit roots. That suggests

"spurious regression". To analyze these data, therefore,

we'll have to see whether any of the variables are cointegrated

(what Granger won his Nobel for in 2003). If so, they

will have to be analyzed via cointegration or error-correction

methods and, if not, will have to be differenced to avoid

spurious regression.
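The spurious-regression point can be illustrated with simulated data (my sketch, not part of the original post): two independent random walks typically show a sizable correlation in levels, which vanishes once the series are differenced.

```python
# Sketch: spurious correlation between independent random walks, removed
# by first-differencing. The data here are simulated, not the SPSS data.
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = np.cumsum(rng.normal(size=n))   # random walk 1 (a unit-root series)
y = np.cumsum(rng.normal(size=n))   # random walk 2, independent of x

r_levels = float(np.corrcoef(x, y)[0, 1])                   # often large
r_diffs = float(np.corrcoef(np.diff(x), np.diff(y))[0, 1])  # near zero
print(r_levels, r_diffs)
```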

Jun 27, 2005, 3:11:13 PM
This is also an example of why it is critical to either (double enter and

verify) or (enter and proofread) data whenever it is possible.
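As a minimal sketch of the double-entry idea (my illustration with made-up values): type the data twice, independently, and diff the two versions before any analysis.

```python
# Two independently typed versions of the same column of data.
entry1 = [76.4, 99.5, 105.9, 86.7]
entry2 = [76.4, 99.5, 105.9, 68.7]   # a transposition slip in the 4th value

# Report (index, version 1, version 2) for every disagreement.
mismatches = [(i, a, b) for i, (a, b) in enumerate(zip(entry1, entry2))
              if a != b]
print(mismatches)  # → [(3, 86.7, 68.7)]
```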

Art

A...@DrKendall.org

Social Research Consultants

University Park, MD USA

(301) 864-5570

Jun 27, 2005, 3:57:44 PM
bdmccu...@drexel.edu wrote:

> Without even graphing the data, I can see that they are time

> trending, and probably have unit roots. That suggests

> "spurious regression". To analyze these data, therefore,

> we'll have to see whether any of the variables are cointegrated

> (what Granger won his Nobel for in 2003).

Granger won his Nobel for 2003 on THAT?

I know the Economics Nobel has really been scratching the bottom

of the barrel for winners, but if that's what Granger won his

Nobel prize for, they could have given me the Nobel prize 30 years

ago. :-)

I take it back!! They (the Nobel Committee) COULDN'T because

Nobel (for his wife's "affair" with a mathematician) had made

sure that no mathematician or statistician could win a Nobel

Prize (for the lack of such a category).

> If so, they

> will have to be analyzed via cointegration or error-correction

> methods and, if not, will have to be differenced to avoid

> spurious regression.

Your comment about spurious correlation on time series is a

WELL KNOWN fact, but a good one to point out. It'll come into

play in subsequent LESSONS that go with this EXERCISE.

But a Nobel prize? You jest! That's KID STUFF, for ANY

applied statistician who knows anything about data analysis.

Thanks for your comment all the same. The Nobel mention was

definitely grossly exaggerated. :-)

-- Bob.

Jun 27, 2005, 4:16:25 PM
http://nobelprize.org/economics/laureates/2003/index.html

"for methods of analyzing economic time series with common

trends (cointegration)"

Don't get me started on the Nobel Prize for Economics...

Cheers,

Russell

Jun 27, 2005, 4:27:31 PM
"Reef Fish" <Large_Nass...@Yahoo.com> writes:

> I take it back!! They (the Nobel Committee) COULDN'T because

There is no "Nobel Committee"

> Nobel (for his wife's "affair" with a mathematician) had made

Nobel never married.

Jun 27, 2005, 4:31:34 PM
This is what I get for even looking that up. I just saw that

the website lists the prize as "The Bank of Sweden Prize

in Economic Sciences in Memory of Alfred Nobel".

Economic *Sciences*! That's an oxymoron given the

way economics is presently practiced in far too many

cases. OTOH the Voodoo Economics of the Reagan

administration was tautological. Usually I have 12 months

to get over my aggravation over the previous prize, but

now I have this intervening aggravation. The Bank of

Sweden couldn't think of anything better to do with its

money?!

Cheers,

Russell

Jun 27, 2005, 4:58:22 PM
Torkel Franzen wrote:

> "Reef Fish" <Large_Nass...@Yahoo.com> writes:

>

> > I take it back!! They (the Nobel Committee) COULDN'T because

>

> There is no "Nobel Committee"

There are certainly Nobel award committees. You think they came

from random drawing as in the Reader's Digest sweepstakes? :-)

>

> > Nobel (for his wife's "affair" with a mathematician) had made

>

> Nobel never married.

That may explain it. It must've been Nobel's "wife to be" who

ran away with a mathematician instead of marrying him.

-- Bob.

Jun 27, 2005, 5:05:47 PM
Come on, Russell, start on it. :-)

Always interested in hearing others' versions about it. Nothing

personally against any of the Nobel Prize winners in Economics.

Several of them are even people I know personally. These

include about half a dozen of my former colleagues at the

University of Chicago.

But the Prize itself and its recent selections were a joke!

-- Bob.

Jun 27, 2005, 5:12:54 PM
"Reef Fish" <Large_Nass...@Yahoo.com> writes:

> There are certainly Nobel award committees. You think they came

> from random drawing as in the Reader's Digest sweepstakes? :-)

The prizes are awarded by different bodies.

> That may explain it. It must've been Nobel's "wife to be" who

> ran away with a mathematician instead of marrying him.

The whole story is pure invention.

Jun 27, 2005, 7:49:27 PM
If I haven't overlooked any posts of substance in this thread, over

6 hours have passed since your post and no one has offered any

tentative or definitive fitted model.

Let's just say you're on the GENERAL right track.

You can't eat RMS and you can't drink it.

Here's a suggestion for you (or anyone else) to try -- to get a

taste of PRACTICAL significance vs STATISTICAL significance, and

get a Reality Check of how your or any other fitted model fare,

in prediction.

HOLD OUT the data for 1966.

Use the data prior to 1966 to build the fitting/predicting model.

Now get a Prediction Interval for the INVDEX for 1966 with the

value(s) of the predictor variables for that same year to

assess how well (or how badly) it did.

You may also devise some kind of non-standard and non-textbook

measure for predictive performance, such as the average width

of prediction intervals for several rows of data close to the

row to be predicted.
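Bob's suggested reality check can be sketched as follows (my illustration, not his solution: GNP as the lone predictor, the corrected GNP values, and a normal quantile of 1.96 as a rough stand-in for the exact t quantile):

```python
# Sketch: hold out 1966, fit INVDEX ~ GNP on 1935-1965 by OLS, and form
# an approximate 95% prediction interval for the 1966 INVDEX.
import numpy as np

invdex = np.array([76.4, 99.5, 105.9, 86.7, 83.7, 70.7, 61.7, 58.7, 76.3,
                   76.6, 91, 105.8, 96.8, 102.8, 100, 120.3, 153.8, 158.2,
                   146.5, 165.6, 212.7, 245.9, 236, 218.8, 242.6, 256.9,
                   326.1, 314.4, 336, 394, 433.1, 408.5])
gnp = np.array([7678, 8022, 8820, 8871, 9536, 10911, 12486, 14816, 15357,
                15927, 15552, 15251, 15446, 15735, 16343, 17471, 18547,
                20027, 20794, 20186, 21920, 23811, 24117, 24397, 25242,
                25849, 26515, 28287, 29740, 31650, 33814, 35822], float)

X, y = np.column_stack([np.ones(31), gnp[:31]]), invdex[:31]   # 1935-1965
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
s2 = float(resid @ resid) / (len(y) - 2)        # residual variance

x0 = np.array([1.0, gnp[31]])                   # the held-out 1966 row
yhat = float(x0 @ beta)
se = float(np.sqrt(s2 * (1.0 + x0 @ np.linalg.inv(X.T @ X) @ x0)))
lo, hi = yhat - 1.96 * se, yhat + 1.96 * se
print((lo, yhat, hi), "actual 1966 INVDEX:", invdex[31])
```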

I am looking forward to SOME actual models, their statistical

results and significance measures, as well as some comments

and discussion about how the model was arrived at, and how

they perform.

-- Bob.

Jun 28, 2005, 1:09:07 AM
Reef Fish wrote:

>

> bdmccu...@drexel.edu wrote:

>

>>Without even graphing the data, I can see that they are time

>>trending, and probably have unit roots. That suggests

>>"spurious regression". To analyze these data, therefore,

>>we'll have to see whether any of the variables are cointegrated

>>(what Granger won his Nobel for in 2003).

>

>

> Granger won his Nobel for 2003 on THAT?

>

> I know the Economics Nobel has really been scratching the bottom

> of the barrel for winners, but if that's what Granger won his

> Nobel price, they could have given me the Nobel prize 30 years

> ago. :-)

>

> I take it back!! They (the Nobel Committee) COULDN'T because

> Nobel (for his wife's "affair" with a mathematician) had made

> sure than no mathematician or statistician could win a Nobel

> Prize (for the lack of such a category).

>

The "Nobel" prize for economics isn't an actual Nobel prize: it was

first awarded in 1969. It's actually called "The Bank of Sweden Prize

in Economic Sciences in Memory of Alfred Nobel".

Source: http://nobelprize.org/economics/

Bob

Jun 28, 2005, 9:23:20 AM6/28/05

to

True, as I pointed out earlier, but it is popularly known

as "the Nobel Prize in Economics" (at least in the U.S.), as

when the news anchor on TV says, "The Nobel Prize in Economics

was awarded today to an economist for a piece of work that

bears no relation to reality." (OK, they don't really say the

last part, but IMO in most cases they should. True, all models

are wrong, some are useful. But in economics more are wrong in

more ways and less useful than in just about any "science" with

which I am familiar.) Another example of the bias of the

liberal media distorting the truth, I guess. ;-)

Cheers,

Russell

Jun 28, 2005, 9:29:07 AM6/28/05

to

Russell...@wdn.com writes:

> (OK, they don't really say the

> last part, but IMO in most cases they should. True, all models

> are wrong, some are useful. But in economics more are wrong in

> more ways and less useful than in just about any "science" with

> which I am familiar.)

In Sweden, it has been argued that the association of the economics

prize with the name of Nobel is unfortunate, and can be expected to

devalue the proper Nobel prizes.

Jun 28, 2005, 9:35:48 AM6/28/05

to

Swedes are an intelligent, thoughtful people.

Cheers,

Russell

Jun 28, 2005, 3:44:48 PM6/28/05

to

In article <1119851864.7...@g14g2000cwa.googlegroups.com>,

Reef Fish <Large_Nass...@Yahoo.com> wrote:

> On June 19, Robin Edwards wrote, regarding a data set in SPSS

> I discussed, involving "model building",


Many thanks for providing this data set, Bob. Seems like I started

quite a hare with my simple request!

I have snipped all the comments, the data and everything. It has been

much repeated in the other postings.

As I wrote a few days ago I have had a first look at the data, and I

kept a log of my operations. I should point out that I deliberately

avoided reading any replies to Bob's post before doing anything with

the data. My journal, below, thus knows nothing of all the words

posted after RF's data arrived. I've now read the ones I downloaded

yesterday evening (27 June). As I write it is Tuesday evening, 28

June, and I have not downloaded anything today.

Here's my journal:-

*********************************

Data provided by Bob on 27 June 05

I shall look at this before reading others' posts.

1. Scan (eyeball) the data. No missing values. Good! Clearly time

series, so could mean trouble.

2 Notice that it is reminiscent of the famous Longley data set.

3 Import into 1st (my stats software).

4 Run naive multiple regression. Note that the software produces a

warning message. "Very possibly correlated independent variables.

Check regression diagnostics." Thus I viewed the initial run as of

doubtful value.

5 So, computed regression diagnostics. Warnings about very high

multiple correlation coeffs and the equivalent variance inflation

factors. GNP and C.PROF have VIFs of 14.3 and 12.2, so one of them is

effectively redundant as a potential predictor. C.DIVD has VIF of 3.64

in this company. Note that Row 26 is a highly influential point ("HAT"

value 2.14, with next highest 1.199). Looks like an "outlier".

6 Look at Row 26. Ha ha! There it is. 15849. Clearly a typo.

Should be 25849.

7 Repair data set.

8 Repeat operations with new data. Similar (but of course different)

diagnostics on VIFs. Row 26 is no longer influential. The most

influential points are 31 and 32.

9 Look at INVDEX as a time series by Cusum plot. Noted possible steps

at 1950, 1954, 1960 and possibly 1963.

10 Generate multi-plot of all four variables plus Year (10 plots on

one diagram). This shows clearly that the predictor most likely to be

useful as a model for INVDEX is C.DIVD. The other two (closely similar

as noted in

the regression diagnostics) have a "hook", or, dare I say it, a hockey

stick, shape when INVDEX is plotted against them. Year gives a similar

but less angular plot.

11 Try a regression of INVDEX on C.DIVD. Produces adj R-Sq 0.8736, t

value for C.DIV 14.65

Forecast for 337.75 (Mean of C.DIV) of 176.9, L and U 95% interval for

a further single point is

94.46 to 259.4. For C.DIV = 700 (a reasonable extrapolation) values

are 400.8, 494.3 and 587.9. The regression plot looks fine.

12 Try a regression with C.DIV and GNP. Adj R-Sq 0.93585 - looks

good! But is it? Forecast value for mean of GNP and C.DIVD gives

118.2, 176.9 and 235.7, noticeably better than the simple regrn.

Now try C.DIVD 700 with GNP at its mean of 19314. Values are 262.86

(lower 95%), forecast 348.7 and upper 95% 434.6. These are nonsense!

No doubt the reason is the very high multiple correlations,

of which I've had warnings. Haven't tried C.Prof, but the result will

be almost exactly like the model with GNP in it. Good results very

close to the mean values and meaningless forecasts elsewhere.

13 My current choice for the best model is just

INVDEX = -119 + 0.87621*C.DIVD.

I'll do a bit more thinking about this, but can't hold out much hope of

an improvement. Maybe inspiration or advice will come from someone.

I'll post this and then have a look at all the other contributions, to

see where I've gone wrong.

**************************
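Two of the journal's numbers can be checked with trivial arithmetic. The sketch below is mine, not Robin's: it back-computes R-squared from the quoted VIFs (VIF = 1/(1 - R^2)) and re-evaluates the step-13 fitted line at the step-11 predictor values.

```python
# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j
# on the other predictors; invert it to see what the quoted VIFs imply.
def r_squared_from_vif(v):
    return 1.0 - 1.0 / v

# The step-13 fitted line, coefficients exactly as quoted in the journal.
def invdex_from_cdivd(cdivd):
    return -119.0 + 0.87621 * cdivd

print(round(r_squared_from_vif(14.3), 2))   # GNP: R^2 about 0.93
print(round(invdex_from_cdivd(337.75), 1))  # 176.9, matching step 11
```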

That's what I wrote as I went along with the analyses earlier this

evening.

So, you can start shooting.

I should point out that I'm not a statistician - a mere long retired

industrial chemist, who came across stats via the experiment design

route, in 1956, from a book by Brownlee "Industrial Experimentation"

which was written to help industry during WW2. Looked dry as dust -

especially to someone who is no natural mathematician, but I liked the

notions of ANOVA and fractional factorials. Thought they might save me

some work!

I've looked at the original postings - very interesting!

Now to send this and download all the newer postings.

Cheers, Robin

Jun 28, 2005, 4:29:40 PM6/28/05

to

G Robin Edwards wrote:

> In article <1119851864.7...@g14g2000cwa.googlegroups.com>,

> Reef Fish <Large_Nass...@Yahoo.com> wrote:

> > On June 19, Robin Edwards wrote, regarding a data set in SPSS

> > I discussed, involving "model building",

>

> Many thanks for providing this data set, Bob. Seems like I started

> quite a hare with my simple request!

>

> I have snipped all the comments, the data and everything. It has been

> much repeated in the other postings.

>

> As I wrote a few days ago I have had a first look at the data, and I

> kept a log of my operations. I should point out that I deliberately

> avoided reading any replies to Bob's post before doing anything with

> the data.

Excellent!!

I was somewhat worried about the unfair influence (good or bad) of

others. I tried to refrain from saying much, but what others had said

about what they found is hard to ignore unless you don't read them.

> My journal, below, thus knows nothing of all the words

> posted after RF's data arrived. I've now read the ones I downloaded

> yesterday evening (27 June). As I write it is Tuesday evening, 28

> June, and I have not downloaded anything today.

Nothing happened today (in non-time-series terms) except your post just now.

I'll withhold comments until I get something more from Jerry Dallal

(I hope he'll find time to add more to what he had already done)

and anyone else.

>

> Here's my journal:-

>

> *********************************

>

> Data provided by Bob on 27 June 05

>

> I shall look at this before reading others' posts.

>

> 1. Scan (eyeball) the data. No missing values. Good! Clearly time

> series, so could mean trouble.

Both good observations.

>

> 2 Notice that it is reminiscent of the famous Longley data set.

Not in the multicollinearity sense. BTW, here's a "side lesson":

Highly correlated independent variables (such as r > .9) do

not NECESSARILY imply collinearity problems. On the other

hand you MAY have a singular correlation matrix even if ALL

of the pairwise correlations are < .2 say.
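A quick simulated illustration of that side lesson (my construction, not from the posted data): take 26 independent columns plus their sum. Every pairwise correlation stays around 1/sqrt(26), roughly 0.2, yet the 27-column design is exactly singular because the last column is a linear combination of the rest.

```python
import math
import random

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((x - mb) ** 2 for x in b))
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (sa * sb)

random.seed(0)
n, p = 4000, 26
cols = [[random.gauss(0, 1) for _ in range(n)] for _ in range(p)]
total = [sum(vals) for vals in zip(*cols)]  # exact linear combination

# No pairwise correlation with the redundant column looks alarming...
worst = max(abs(corr(total, c)) for c in cols)
print(round(worst, 2))  # roughly 0.2, yet X'X is singular by construction
```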

>

> 3 Import into 1st (my stats software).

>

> 4 Run naive multiple regression. Note that the software produces a

> warning message. "Very possibly correlated independent variables.

> Check regression diagnostics." Thus viewed the inital run as of

> doubtful value.

See my "side lesson" above. Your PROGRAM may be issuing FALSE or

MISLEADING warnings. ANOTHER abuse of the "correlation coeff". :-)

The software must examine the EIGENVALUES of X'X to correctly

detect multicollinearity conditions/problems!!
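For the 2x2 case the eigenvalue point is visible in closed form: a correlation matrix [[1, r], [r, 1]] has eigenvalues 1 + r and 1 - r, so its condition number is (1 + r)/(1 - r). A sketch (mine, purely to illustrate):

```python
def condition_2x2(r):
    """Condition number of [[1, r], [r, 1]]: eigenvalues are 1 +/- |r|."""
    return (1 + abs(r)) / (1 - abs(r))

print(round(condition_2x2(0.9)))   # 19: already shaky
print(round(condition_2x2(0.99)))  # 199: near-singular
```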

I'll stop my comments here. Will resume with the rest of your

analysis/results at the conclusion of the Million $ Challenge. :-)

Thanks for your effort and interest. I think almost everyone

will learn SOMETHING from it. The misunderstanding of the

detection and effects of multicollinearity ranks among the

highest "regression abuses" I know (next to the "expected

sign" fallacy).

Stay tuned under the LESSON 2 thread.

-- Bob.

Jun 28, 2005, 10:10:07 PM6/28/05

to

Given that this is a time series, it's hard to get worked up about an

inappropriate analysis, so I'll stop with


invdex = 181.9 - 0.016 gnp

+ 0.028 I(gnp>=13770)*(gnp-13770)

+ 0.114 cprof + 56.7 I(year=1961)

where I(x)=1 if x is true and 0 otherwise

ResMS = 203.5

Unless I've made a typo transcribing. The ResMS is correct, though.

The residual plot gives the impression that the variability is larger

for larger predicted values.
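For readers who want to reproduce the fit, the right-hand side can be coded directly. The coefficients are rounded as quoted in the post, so fitted values will only match to that rounding.

```python
def invdex_hat(gnp, cprof, year):
    """Jerry's quoted model: a hinge in GNP at 13770 plus a one-year
    indicator for 1961."""
    hinge = max(0.0, gnp - 13770.0)        # I(gnp>=13770)*(gnp-13770)
    bump = 1.0 if year == 1961 else 0.0    # I(year=1961)
    return 181.9 - 0.016 * gnp + 0.028 * hinge + 0.114 * cprof + 56.7 * bump
```

Note that below the knot the GNP slope is -0.016, while above it the two GNP terms combine to a net slope of +0.012.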

Jun 28, 2005, 11:39:41 PM6/28/05

to

It doesn't appear there'll be any more entries. I suspect Jerry

is either too busy or his knot and spline software lacks a feature

for him to get prediction intervals or PRACTICAL significance

assessments.

So, I'll go ahead and finish commenting on your analysis here,

then continue my "lessons" without Jerry. He can always do more

after I show what I did (30 years ago), which was a better fit than

his.

>

> > 5 So, computed regression diagnostics. Warnings about very high

> > multiple correlation coeffs and the equivalent variance inflation

> > factors. GNP and C.PROF have VIFs of 14.3 and 12.2, so one of them is

> > effectively redundant as a potential predictor. C.DIVD has VIF of 3.64

> > in this company. Note that Row 26 is a highly influential point ("HAT"

> > value 2.14, with next highest 1.199). Looks like an "outlier".

> >

> > 6 Look at Row 26. Ha ha! There it is. 15849. Clearly a typo.

> > Should be 25849.

> >

> > 7 Repair data set.

You were a bit late here. I had corrected that value at about 10 am

the same morning I posted the original data at 1:59 am. So, you must

have stopped looking as soon as you saw my original data.

In any event data examination and graphical displays should have

been done first, before doing any computation such as correlations.

> >

> > 8 Repeat operations with new data. Similar (but of course different)

> > diagnostics on VIFs. Row 26 is no longer influential. The most

> > influential points are 31 and 32.

If you had done some scatter plots of INVDEX vs the other variables,

you would have noticed the obvious "elbow" Jerry found, and the same

elbow AUTOBOX found using time-series methods. See LESSON 2 for

details.

> >

> > 9 Look at INVDEX as a time series by Cusum plot. Noted possible steps

> > at 1950, 1954, 1960 and possibly 1963.

> >

> > 10 Generate multi-plot of all four variables plus Year (10 plots on

> > one diagram). This shows clearly that the predictor most likely to be

> > useful as a model for INVDEX is C.DIVD. The other two (closely similar

> > as noted in

> > the regression diagnostics) have a "hook", or, dare I say it, a hockey

> > stick, shape when INVDEX is plotted against them. Year gives a similar

> > but less angular plot.

Ah, you DID notice the "hockey stick" which I called the "elbow", but

you let the golden goose walk by and grabbed the quacking duck

instead! :-) Again, see my continuation of LESSON 2.

> >

> > 11 Try a regression of INVDEX on C.DIVD. Produces adj R-Sq 0.8736, t

> > value for C.DIV 14.65

> > Forecast for 337.75 (Mean of C.DIV) of 176.9, L and U 95% interval for

> > a further single point is

> > 94.46 to 259.4. For C.DIV = 700 (a reasonable extrapolation) values

> > are 400.8, 494.3 and 587.9. The regression plot looks fine.

> >

> > 12 Try a regression with C.DIV and GNP. Adj R-Sq 0.93585 - looks

> > good! But is it? Forecast value for mean of GNP and C.DIVD gives

> > 118.2, 176.9 and 235.7, noticeably better than the simple regrn.

> > Now try C.DIVD 700 with GNP at its mean of 19314. Values are 262.86

> > (lower 95%), forecast 348.7 and upper 95% 434.6. These are nonsense!

> > No doubt the reason is the very high multiple correlations,

> > of which I've had warnings. Haven't tried C.Prof, but the result will

> > be almost exactly like the model with GNP in it. Good results very

> > close to the mean values and meaningless forecasts elsewhere.

These are nice exploratory steps. Unfortunately, you did not take

advantage of the "golden goose" and ended with this model:

> >

> > 13 My current choice for the best model is just

> >

> > INVDEX = -119 + 0.87621*C.DIVD.

This was based on all 32 observations, and would have yielded an

MSE of 1582, which is almost 5 times the MSE (or RMS) of 335 Jerry

got with GNP alone as the predictor, which was also about HALF the

RMS of 622 of the SPSS-like multiple regression model with the

kitchen sink thrown in.

> >

> > I'll do a bit more thinking about this, but can't hold out much hope of

> > an improvement. Maybe inspiration or advice will come from someone.

> >

> > I'll post this and then have a look at all the other contributions, to

> > see where I've gone wrong.

> >

> > **************************

> >

> > That's what I wrote as I went along with the analyses earlier this

> > evening.

Very nicely documented. It helped others see your thought process as well

as where and how you missed the boat, so to speak, when you read my

LESSON 2.

> >

> > So, you can start shooting.

Sorry, the golden goose already walked away. :-)

> >

> > I should point out that I'm not a statistician - a mere long retired

> > industrial chemist, who came across stats via the experiment design

You certainly showed much better insight and thoughtfulness in your

exploratory step than MOST "applied statisticians" would have done,

sort of like the SPSS Manual example -- "Garbage IN, Garbage Out".

They might even get busy discussing whether the SIGN of one of the

coefficients is right or not. :-)

> > route, in 1956, from a book by Brownlee "Industrial Experimentation"

> > which was written to help industry during WW2. Looked dry as dust -

> > especially to someone who is no natural mathematician, but I liked the

> > notions of ANOVA and fractional factorials. Thought they might save me

> > some work!

> >

> > I've looked at the original postings - very interesting!

> >

> > Now to send this and download all the newer postings.

> >

> > Cheers, Robin

Data analysis and model-building are things that always have UNIQUE

features in every data set, and only those trained to look out for

them and take advantage of any "golden goose" they see during the

iterative process can consistently do well.

Thanks to your voluntary participation (at the risk of being shot),

I believe you've contributed more than you realized, by helping OTHERS

see, the next time they get hold of ANY multiple regression data set,

that life is much more interesting and fruitful than just throwing

all the variables into a large-scale model and looking only at

correlations and coefficient signs.

Now you, or any reader who is following this EXERCISE, may continue

reading my continuation of the LESSON 2 thread, on this same data set.

-- Bob.

Jun 28, 2005, 11:46:44 PM6/28/05

to

Jerry Dallal wrote:

> Given that this is a time series, it's hard to get worked up about an

> inappropriate analysis, so I'll stop with

>

> invdex = 181.9 - 0.016 gnp +

> + 0.028 I(gnp>=13770)*(gnp-13770)

> + 0.114 cprof + 56.7 I(year=1961)

> where I(x)=1 if x is true and 0 otherwise

>

> ResMS = 203.5

I was about to give up on you (when I was typing my follow-up to

Robin's post in this thread).

>

> Unless I've made a typo transcribing. The ResMS is correct, though.

>

> The residual plot gives the impression that the variability is larger

> for larger predicted values.

Now I have to take a look at your model before continuing with the

LESSON 2 thread. Meanwhile, if it's not time consuming, you may

want to give a PREDICTION interval for the held-out last row of

data (1966), pretending that you're trying to predict the INVDEX

for that year using your model and the GNP value of 35822 for that

year.

-- Bob.

Jun 29, 2005, 12:12:40 AM6/29/05

to

Jerry Dallal wrote:

> Given that this is a time series, it's hard to get worked up about an

> inappropriate analysis, so I'll stop with

>

> invdex = 181.9 - 0.016 gnp +

> + 0.028 I(gnp>=13770)*(gnp-13770)

> + 0.114 cprof + 56.7 I(year=1961)

> where I(x)=1 if x is true and 0 otherwise

>

> ResMS = 203.5

It didn't take long for me to have absorbed all your wisdom above.

So, I'll interpret it for the readers, comment on it, and will be

ready to finish LESSON 2, and later lessons right after this.

I can't resist my mock horror, "But your SIGN of GNP is WRONG,

Jerry !!!!! How can you EXPLAIN to anyone that the INVDEX will

drop when the GNP value RISES?" :-)

Multiple Regression "expected sign" abusers take note!

Your model is basically taking advantage of the "hockey stick" in

GNP vs year (you took the "elbow" to be between 1941 and

1942, estimating the "knot" point to be 13770 for GNP). You used

CPROF as the 2nd indep. variable, and then you saved a bundle of

SSE by adjusting for ONE value of the fitted function at 1961,

to drop the residual from 56.7 to 0 (I presume), thus lowering

the SSE by 56.7^2 or a rather substantial 3,214.89. :-)

That's why your RMS is so much smaller than your previous 335.

I consider that "cheating" (in a non-criminal way <G>) because if

you play that game by setting residuals to 0 at will, you can

drop the RMS even further, but the model will hardly be a good

or valid one for prediction purposes. It's more like OVER-FITTING.

Let's move over to the continuation of the LESSON 2 thread.

-- Bob.

Jun 29, 2005, 10:20:52 AM6/29/05

to

Reef Fish wrote:

>

> Jerry Dallal wrote:

>

>>Given that this is a time series, it's hard to get worked up about an

>>inappropriate analysis, so I'll stop with

>>

>>invdex = 181.9 - 0.016 gnp +

>> + 0.028 I(gnp>=13770)*(gnp-13770)

>> + 0.114 cprof + 56.7 I(year=1961)

>> where I(x)=1 if x is true and 0 otherwise

>>

>>ResMS = 203.5

>

>

> It didn't take long for me to have absorbed all your wisdom above.

> So, I'll interpret it for the readers, comment on it, and will be

> ready to finish LESSON 2, and later lessons right after this.

>

> I can't resist my mock horror, "But your SIGN of GNP is WRONG,

> Jerry !!!!! How can you EXPLAIN to anyone that the INVDEX will

> drop when the GNP value RISES?" :-)


That's easy. It's a corollary to "Never interpret main effects in the

presence of an interaction!"

Even more shocking is the decision to constrain the first part of the

spline to be horizontal!

[copied from another post in the same thread. Quotation from Reef Fish:

"Then it occurred to me that it made sense for the relation to be

nearly horizontal during the WWII era and then both variables

were growing strong in a linear fit in the post-war years."]

*Forcing* the sign?!?! The HORROR! :-)

> Multiple Regression "expected sign" abusers take note!

Indeed!

>

> Your model is basically taking advantage of the "hockey plug" in

> GNP vs year (which you took the "elbow" to be between 1941 and

> 1942, estimating the "knot" point to be 13770 for GNP. You used

> CPROF as the 2nd indep. variable, and then you saved a bundle of

> SSE by adjusting for ONE value of the fitted function at 1961,

> to drop the residual from 56.7 to 0 (I presume), thus lowering

> the SSE by 56.7^2 or a rather substantial 3,214.89. :-)

>

> That's why your RMS is so much smaller than your previous 335.

> I consider that "cheating" (in a non-criminal way <G>) because if

> you play that game by setting residuals to 0 at will, you can

> drop the RMS even further, but the model will hardly be a good

> or valid one for prediction purposes. It's more like OVER-FITTING.

RSS, yes, RMS, not necessarily. In any analysis I do, 1961 is an

outlier. If year 'x' isn't squirrelly, the contribution of I(x) will be

negligible. Fitting I(x) is equivalent to setting an observation aside

based on its externally Studentized residual. Granted, probability

theory guarantees that there will be some large externally Studentized

residual, but the one for 1961 is huge. I(1961) has a t statistic of

3.74 and an observed significance level of 0.00091, which survives even

a Bonferroni adjustment, (32*0.00091=0.02912). While this might be

overfitting for data from biological units, these are US national level

economic data. 1961 sticks out like a sore thumb.
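For readers unfamiliar with the adjustment Jerry applies here: with m candidate tests (one per year, so m = 32), the Bonferroni-adjusted observed significance level is just min(1, m*p). A trivial sketch:

```python
def bonferroni(p, m):
    """Bonferroni-adjusted p-value for one of m simultaneous tests."""
    return min(1.0, m * p)

print(round(bonferroni(0.00091, 32), 5))  # 0.02912, as in the post
```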

Jun 29, 2005, 6:43:09 PM6/29/05

to

In article <1120016381.1...@g14g2000cwa.googlegroups.com>,

Reef Fish <Large_Nass...@Yahoo.com> wrote:


Does this apply to multiple correlations? All I'm trying to do is to

make reasonably sure that the data do not suffer from the effects of

multicollinearity. Thus a "warning" is issued if the software thinks

there's a possibility of strong multiple correlation, something that I

think might be not noticed in pairwise plots in some cases.

> >

> > The software must examine the EIGENVALUES of X'X to correctly

> > detect multicollinearity conditions/problems!!

Can you explain to me how this approach might relate (if at all) to

computing Cholesky inverse roots? I wrote this software in about 1977

to run on a Commodore PET (32K memory for programs and data) and I

can't remember the details of the technique, or even why I thought it

might be useful :-(
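One way to see the connection (my sketch, with no claim about what the 1977 PET program actually did): a Cholesky factorisation of the correlation matrix breaks down gracefully when the matrix is near-singular. The pivot under the square root shrinks towards zero, flagging the same ill-conditioning that a tiny eigenvalue would.

```python
import math

def cholesky(a):
    """Lower-triangular L with L*L' = a, for symmetric positive-definite a.
    A pivot (diagonal entry of L) near zero signals near-singularity."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

L = cholesky([[1.0, 0.999], [0.999, 1.0]])
print(round(L[1][1], 4))  # tiny second pivot: the matrix is nearly singular
```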

You must remember that I do not have broadband. I switch on my machine

once each evening, and download at about 4k characters/second. I'm on

line for perhaps 2 - 4 minutes, so it is far from real time!

> In any event data examination and graphical displays should have been

> done first, before doing any computation such as correlations.

> > >

> > > 8 Repeat operations with new data. Similar (but of course

> > > different) diagnostics on VIFs. Row 26 is no longer influential.

> > > The most influential points are 31 and 32.

> If you had done some scatter plots of INVDEX vs the other variables,

> you would have notice the obvious "elbow" Jerry found, and the same

> elbow AUTOBOX found using time-series methods. See LESSON 2 for

> details.

> > >

> > > 9 Look at INVDEX as a time series by Cusum plot. Noted possible

> > > steps at 1950, 1954, 1960 and possibly 1963.

> > >

> > > 10 Generate multi-plot of all four variables plus Year (10 plots

> > > on one diagram). This shows clearly that the predictor most

> > > likely to be useful as a model for INVDEX is C.DIVD. The other

> > > two (closely similar as noted in the regression diagnostics) have

> > > a "hook", or, dare I say it, a hockey stick, shape when INVDEX is

> > > plotted against them. Year gives a similar but less angular plot.

> Ah, you DID notice the "hockey stick" which I called the "elbow", but

> you let the golden goose walk by and grabbed the quacking duck

> instead! :-) Again, see my continuation of LESSON 2.

I always plot data in various ways :-) But running linear regressions

is such a simple step (specifying a model takes moments and the

calculations a second or so for ordinary size data sets) that it might

as well be done. One never knows what might turn up.

> > >

> > > 11 Try a regression of INVDEX on C.DIVD. Produces adj R-Sq

> > > 0.8736, t value for C.DIV 14.65 Forecast for 337.75 (Mean of

> > > C.DIV) of 176.9, L and U 95% interval for a further single point

> > > is 94.46 to 259.4. For C.DIV = 700 (a reasonable extrapolation)

> > > values are 400.8, 494.3 and 587.9. The regression plot looks

> > > fine.

> > >

> > > 12 Try a regression with C.DIV and GNP. Adj R-Sq 0.93585 -

> > > looks good! But is it? Forecast value for mean of GNP and

> > > C.DIVD gives 118.2, 176.9 and 235.7, noticeably better than the

> > > simple regrn. Now try C.DIVD 700 with GNP at its mean of 19314.

> > > Values are 262.86 (lower 95%), forecast 348.7 and upper 95%

> > > 434.6. These are nonsense! No doubt the reason is the very high

> > > multiple correlations, of which I've had warnings. Haven't tried

> > > C.Prof, but the result will be almost exactly like the model with

> > > GNP in it. Good results very close to the mean values and

> > > meaningless forecasts elsewhere.

> These are nice exploratory steps. Unfortunately, you did not take

> advantage of the "golden goose" and ended with this model:

> > >

> > > 13 My current choice for the best model is just

> > >

> > > INVDEX = -119 + 0.87621*C.DIVD.

> This was based on all 32 observations, and would have yield a

> MSE of 1582,

Agreed

> which is almost 5 times the MSE (or RMS) of 335 Jerry

> got with GNP alone as the predictor, which was also about HALF the

> RMS of 622 of the SPSS-like multiple regression model with the

> kitchen sink thrown in.

Also agreed

I look forward to Lesson 2.

However, I worry a bit about inferential statistics when the fitted

model has been arrived at by a stepwise programme driven by the outcome

of preliminary analyses which lead to modification of the originally

proposed model. I had always thought that technical specification of

the model should precede any analytical work. Is this correct?

Jun 29, 2005, 2:40:43 PM6/29/05

to

Jerry Dallal wrote:

> Reef Fish wrote:

> >

> > Jerry Dallal wrote:

> >

> >>Given that this is a time series, it's hard to get worked up about an

> >>inappropriate analysis, so I'll stop with

> >>

> >>invdex = 181.9 - 0.016 gnp +

> >> + 0.028 I(gnp>=13770)*(gnp-13770)

> >> + 0.114 cprof + 56.7 I(year=1961)

> >> where I(x)=1 if x is true and 0 otherwise

> >>

> >>ResMS = 203.5

> >

> >

> > It didn't take long for me to have absorbed all your wisdom above.

> > So, I'll interpret it for the readers, comment on it, and will be

> > ready to finish LESSON 2, and later lessons right after this.

> >

> > I can't resist my mock horror, "But your SIGN of GNP is WRONG,

> > Jerry !!!!! How can you EXPLAIN to anyone that the INVDEX will

> > drop when the GNP value RISES?" :-)

>

> That's easy. It's a corollary to "Never interpret main effects in the

> presence of an interaction!"

For the sake of not getting off the present topic, I'll leave your

statement alone, though it didn't quite technically or precisely

apply to the example in question. :)

> Even more shocking is the decision to constrain the first part of the

> spline to be horizontal!

>

> [copied from another post in the same thread. Quotation from Reef Fish:

> "Then it occurred to me that it made sense for the relation to be

> nearly horizontal during the WWII era and then both variables

> were growing strong in a linear fit in the post-war years."]

>

> *Forcing* the sign?!?! The HORROR! :-)

Actually that's not quite in MY case; perhaps more in YOURS, for

forcing a horizontal spline.

In MY case, I was merely THROWING away historical data that are

justifiable to be thrown away (BTW, deleting ANY observation, let

alone more than 1, is a serious business that MUST be justified

other than "it didn't fit") because it made economic as well as

commonsense that there are two different relations in pre-war/war

and post-war eras. And even if it DIDN'T make any strong sense,

it made sense (in such a time series) to discard data in a

DISTANT past, when the object is to predict the most recent

present.

>

> > Multiple Regression "expected sign" abusers take note!

>

> Indeed!

>

> >

> > Your model is basically taking advantage of the "hockey plug" in

> > GNP vs year (which you took the "elbow" to be between 1941 and

> > 1942, estimating the "knot" point to be 13770 for GNP. You used

> > CPROF as the 2nd indep. variable, and then you saved a bundle of

> > SSE by adjusting for ONE value of the fitted function at 1961,

> > to drop the residual from 56.7 to 0 (I presume), thus lowering

> > the SSE by 56.7^2 or a rather substantial 3,214.89. :-)

> >

> > That's why your RMS is so much smaller than your previous 335.

> > I consider that "cheating" (in a non-criminal way <G>) because if

> > you play that game by setting residuals to 0 at will, you can

> > drop the RMS even further, but the model will hardly be a good

> > or valid one for prediction purposes. It's more like OVER-FITTING.

>

> RSS, yes, RMS, not necessarily.

NOW we can get down to some concrete discussion of NUMBERS. Not

necessarily, yes. But VERY easily so.

For my argument and illustration, I'll have to do a bit of detective

work since I don't know any of your residuals except the 56.7 one.

But I can infer, from your previous result of MSE = 335 with only

the GNP, and the location of spline and knot value (on the entire 32

observations, hence 28 df?) that your SSE was 9380.

The setting of ONE residual of 56.7 to zero (and nothing else)

would have reduced the SSE to 6166 and the MSE to 228 on 27 df.
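The detective work is just SSE = MSE x df. A two-line sketch of the arithmetic (numbers from the post; the one-df-per-fixed-residual bookkeeping is my assumption):

```python
def mse_after_zeroing(mse, df, residual):
    """Recover SSE = mse*df, remove one residual's squared contribution,
    and recompute the mean square on one fewer degree of freedom."""
    sse = mse * df
    return (sse - residual ** 2) / (df - 1)

print(round(335 * 28))                          # inferred SSE: 9380
print(round(mse_after_zeroing(335, 28, 56.7)))  # 228, as stated above
```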

Your actual model (for which I assume 25 df, but doesn't really

matter if it's one or two more, or less) would imply SSE = 5088,

given your MSE of 203.5.

If you "fix" another residual of 30 by setting it to zero, you

would have reduced the SSE df to 24, but reduced the MSE to 182.

"Fix" a third residual of size 30, you would have reduced the MSE

to 157! And so on.

I believe in general, the fixing of ONE unusually large residual

must be based NOT on the size of the residual, but on REASONS

you can explain on WHY it's excessively large.

There are many examples of such in the analysis of time series.

For example, the daily national total on the use of fireworks

would most likely have a peak on the 4th of July; likewise for

alcohol consumption on special events/days of the year.

In this EXERCISE, there was nothing you attached to 1961 other

than having observed a large residual. In the grand scheme of

things, that's at least a faux pas or peccadillo. Those are

the grades James J. Kilpatrick assigned to various erroneous

usages of the English language in his nationally syndicated

columns on "crimes, misdemeanors, faux pas, and peccadillos".

I commit many peccadillos every day! :-)

> In any analysis I do, 1961 is an outlier.

But that's not a good enough reason for "deleting" or neutralizing

the negative residual effect for the year 1961.

> If year 'x' isn't squirrelly, the contribution of I(x) will be

> negligible. Fitting I(x) is equivalent to setting an observation aside

> based on its externally Studentized residual. Granted, probability

> theory guarantees that there be some large externally Studentized

> residual, but the one for 1961 is huge. I(1961) has a t statistic of

> 3.74 and an observed significance level of 0.00091, which survives even

> a Bonferroni adjustment (32*0.00091 = 0.02912). While this might be

> overfitting for data from biological units, these are US national level

> economic data. 1961 sticks out like a sore thumb.
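
As a sketch of this screen (not the actual INVDEX fit), here are externally Studentized residuals for a straight-line fit on synthetic data with one planted outlier; the function name and data are invented for illustration:

```python
import math

def ext_studentized(x, y):
    """Externally Studentized residuals for a simple linear regression."""
    n, p = len(x), 2                                    # intercept + slope
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    e = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]     # raw residuals
    sse = sum(ei * ei for ei in e)
    t = []
    for xi, ei in zip(x, e):
        h = 1.0 / n + (xi - xbar) ** 2 / sxx            # leverage
        s2_del = (sse - ei * ei / (1 - h)) / (n - p - 1)  # leave-one-out MSE
        t.append(ei / math.sqrt(s2_del * (1 - h)))
    return t

x = [float(i) for i in range(10)]
y = [2 * xi + (0.3 if i % 2 == 0 else -0.3) for i, xi in enumerate(x)]
y[5] += 10.0                                  # the planted "1961"
t = ext_studentized(x, y)
worst = max(range(10), key=lambda i: abs(t[i]))
print(worst, round(abs(t[worst]), 1))
```

The Bonferroni step is then just the multiply-the-osl-by-n adjustment quoted above: an observed significance level of 0.00091 on 32 observations becomes 32*0.00091 = 0.02912.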

Now you're talking almost like a social scientist who is happy to

throw away anything that doesn't fit until what's left fits TOO well.

:-)

I prefer your previous model of having a knot and using only GNP,

resulting in an MSE of 335. I also prefer mine over yours for reasons

of Box's principle of "parsimony" or Occam's razor.

-- Bob.

Jun 29, 2005, 3:23:20 PM6/29/05

to

I wrote:

> When Hinkley's tests for two-phase regression are performed, the

> observed significance levels for equality of the slopes and for the

> slope of the later part of the data being 0 are <0.0001. The osl for

> the slope of the earlier data being different from 0 is 0.1552.

strike "different from" in the last line. I was stating null

hypotheses, or trying to, anyway.

Jun 29, 2005, 3:21:49 PM6/29/05

to

Reef Fish wrote:

>

> I prefer your previous model of having a knot and using only GNP,

> resulting in an MSE of 335. I also prefer mine over yours for reasons

> of Box's principle of "parsimony" or Occam's razor.

>

> -- Bob.

>


I agree, especially about the vicious cycle of setting data aside. I

was just passing the time of day. The labels matter/they don't matter.

It's a time series/treat them as independent observations. It's the

wrong analysis/let's do it anyway. Hard to get worked up over the data

in a situation like this.

When Hinkley's tests for two-phase regression are performed, the

observed significance levels for equality of the slopes and for the

slope of the later part of the data being 0 are <0.0001. The osl for

the slope of the earlier data being different from 0 is 0.1552.

1961 was odd enough that I'd go back and check it. It's INVDEX that

appears to be odd, even when plotted against year. It's the

scatterplots more than any statistical quantity that led me to question

it. If the number is what was reported, then that's what it is.
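
Hinkley's tests themselves are not reproduced here, but the underlying two-phase (one-knot) model can be sketched as a least-squares scan over candidate knots. The data below are synthetic, not the GNP series, and the helper names are invented:

```python
def solve3(A, rhs):
    """Gauss-Jordan elimination for a 3x3 linear system."""
    m = [row[:] + [r] for row, r in zip(A, rhs)]
    for c in range(3):
        piv = max(range(c, 3), key=lambda i: abs(m[i][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(3):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [a - f * p for a, p in zip(m[r], m[c])]
    return [m[i][3] / m[i][i] for i in range(3)]

def broken_stick_sse(x, y, knot):
    """Fit y = b0 + b1*x + b2*max(0, x - knot) by least squares."""
    cols = [[1.0] * len(x), list(x), [max(0.0, xi - knot) for xi in x]]
    xtx = [[sum(a * b for a, b in zip(u, v)) for v in cols] for u in cols]
    xty = [sum(a * yi for a, yi in zip(u, y)) for u in cols]
    beta = solve3(xtx, xty)
    fit = [beta[0] + beta[1] * xi + beta[2] * max(0.0, xi - knot) for xi in x]
    return beta, sum((yi - fi) ** 2 for yi, fi in zip(y, fit))

# Synthetic series: flat before x = 6, rising after.
x = [float(i) for i in range(12)]
y = [1.0 if xi < 6 else 1.0 + 2.0 * (xi - 6) for xi in x]

# Scan candidate knots and keep the one with the smallest SSE.
best = min(range(2, 10), key=lambda k: broken_stick_sse(x, y, float(k))[1])
print(best)
```

The formal two-phase tests (equal slopes, zero slope in each phase) would then be built on the fitted coefficients; only the fit itself is shown here.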

Jun 29, 2005, 9:59:01 PM6/29/05

to

G Robin Edwards wrote:

> In article <1120016381.1...@g14g2000cwa.googlegroups.com>,

> Reef Fish <Large_Nass...@Yahoo.com> wrote:

< GIGANTIC snip to get to the technical points in question >

>

Robin>> 4 Run naive multiple regression. Note that the software

> > > > produces a warning message. "Very possibly correlated

> > > > independent variables. Check regression diagnostics." Thus

> > > > I viewed the initial run as of doubtful value.

> > >

> > > See my "side lesson" above. Your PROGRAM may be issuing FALSE or

> > > MISLEADING warnings. ANOTHER abuse of the "correlation coeff". :-)

>

> Does this apply to multiple correlations? All I'm trying to do is to

> make reasonably sure that the data do not suffer from the effects of

> multicollinearity. Thus a "warning" is issued if the software thinks

> there's a possibility of strong multiple correlation, something that I

> think might not be noticed in pairwise plots in some cases.

There are two separate and unrelated points here.

1. Pairwise correlations among independent variables tell NOTHING

about multicollinearity, unless one of them is ridiculously

high, say .99999. Even if ALL of the pairwise correlations are

less than .2, say, you can still have multicollinearity problems;

the X'X matrix MAY even be "singular", with the regression

coefficients undeterminable.

2. The MULTIPLE R (or R^2) is completely different. A high R is

a GOOD thing, because it is the simple correlation between

the observed Y and the fitted Y! The higher the R, the better

the fit.

That's why I said your SOFTWARE may have been written by someone

who doesn't know the statistical theory and issued false and/or

erroneous warnings, as indicated by your descriptions.
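
In miniature, the eigenvalue check looks like this. For two standardized predictors with correlation r, X'X in correlation form has eigenvalues 1+r and 1-r, so the condition number (largest over smallest eigenvalue) blows up as r approaches 1; the helper name is invented:

```python
import math

def eig_sym2(a, b, c):
    """Eigenvalues (largest first) of the symmetric 2x2 matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2.0
    d = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    return mean + d, mean - d

# X'X in correlation form for two standardized predictors with correlation r.
for r in (0.2, 0.9, 0.99999):
    big, small = eig_sym2(1.0, r, 1.0)
    print(r, big / small)   # condition number grows as r -> 1
```

With more predictors the same idea applies to the full eigen-decomposition of X'X: one near-zero eigenvalue flags a near-singular design even when every pairwise correlation looks modest.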

>

> > > The software must examine the EIGENVALUES of X'X to correctly

> > > detect multicollinearity conditions/problems!!

>

> Can you explain to me how this approach might relate (if at all) to

> computing Cholesky inverse roots? I wrote this software in about 1977

> to run on a Commodore PET (32K memory for programs and data) and I

> can't remember the details of the technique, or even why I thought it

> might be useful :-(

The Cholesky decomposition is one of the MANY matrix factorizations

used in solving linear systems and inverting matrices; it is

related to, but distinct from, an eigen-decomposition.

It is tangential to the statistical interpretation of the regression

results. A numerical analysis book or a book on Statistical

Computing would likely address the question much more adequately

than I can or want to do it here.
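
For concreteness, a bare-bones Cholesky factorization (A = L L^T for a symmetric positive definite A) looks like this; the 2x2 matrix is a made-up example, not anything from the regression above:

```python
import math

def cholesky(A):
    """Lower-triangular L with A = L L^T, for symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

A = [[4.0, 2.0], [2.0, 3.0]]
L = cholesky(A)
# Multiply L by its transpose to confirm the factorization reproduces A.
recon = [[sum(L[i][k] * L[j][k] for k in range(2)) for j in range(2)]
         for i in range(2)]
print(L, recon)
```

In regression software this factorization of X'X is typically used to solve the normal equations by two triangular back-substitutions rather than by forming an explicit inverse.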

> > Now you, or any reader who is following this EXERCISE, may continue

> > reading my continuation of the LESSON 2 thread, on this same data set.

>

> I look forward to Lesson 2.

>

> However, I worry a bit about inferential statistics when the fitted

> model has been arrived at by a stepwise programme driven by the outcome

> of preliminary analyses which lead to modification of the originally

> proposed model. I had always thought that technical specification of

> the model should precede any analytical work. Is this correct?

That is essentially correct and your worry is appropriate. ANY kind

of "automatic" selection method ignores the probability model

assumptions (or their violations) and searches only for the best "fit".

Stepwise-type regression is a cheap way to get some PLAUSIBLE

candidate models on the basis of "fit" only. Once found, the

analyst must examine the residuals as carefully as they would do

in a "manual" iterative process.

Often, the "best" fitting models have to be abandoned in favor of

worse fitting but more appropriate models from a residual analysis

point of view.
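
A toy forward-selection pass of the kind described might look like this (data and variable names invented). Note that it chases SSE reduction only, which is exactly why the residual checks must come afterwards:

```python
def solve(A, rhs):
    """Gauss-Jordan elimination with partial pivoting for an n x n system."""
    n = len(A)
    m = [row[:] + [r] for row, r in zip(A, rhs)]
    for c in range(n):
        piv = max(range(c, n), key=lambda i: abs(m[i][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [v - f * p for v, p in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def sse(cols, y):
    """Least-squares SSE for the design whose columns are `cols`."""
    A = [[sum(u * v for u, v in zip(c1, c2)) for c2 in cols] for c1 in cols]
    rhs = [sum(u * yi for u, yi in zip(c, y)) for c in cols]
    beta = solve(A, rhs)
    fit = [sum(b * c[i] for b, c in zip(beta, cols)) for i in range(len(y))]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fit))

n = 8
x1 = [float(i) for i in range(n)]
x2 = [float(i * i) for i in range(n)]
x3 = [1.0, -1.0] * (n // 2)               # pure noise column
y = [3.0 + 2.0 * a + 0.5 * b for a, b in zip(x1, x2)]

chosen = [[1.0] * n]                      # start with the intercept
pool = {"x1": x1, "x2": x2, "x3": x3}
order = []
while pool:
    name = min(pool, key=lambda k: sse(chosen + [pool[k]], y))
    if sse(chosen + [pool[name]], y) >= sse(chosen, y) - 1e-9:
        break                             # no worthwhile reduction: stop
    chosen.append(pool.pop(name))
    order.append(name)
print(order)
```

Nothing in the loop looks at normality, independence, or homoscedasticity; those diagnostics on the residuals of each candidate model are the "manual" step the search cannot replace.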

Note in LESSON 3, I indicated before I exhibited the final model,

that I had gone through all of the normality, independence and

homoscedasticity tests on the residuals.

-- Bob.

Jun 30, 2005, 9:38:53 PM6/30/05

to

"Torkel Franzen" <tor...@sm.luth.se> wrote in message

news:vcb7jge...@beta19.sm.ltu.se...

You think Kahneman, Harsanyi, Nash, Sharpe, Markowitz, Modigliani, Friedman

and Samuelson are not Nobel quality?

IMHO, it is the Nobel Peace Prize which devalues all others.

Abe

Jul 1, 2005, 3:25:12 AM7/1/05

to

Nash, if he should win anything at all, should have won it in

Mathematics. But we all know that Math and Stat were SCREWED out

of the Nobel categories. ;^)

Nash won it for being schizo AND being crazy at the right time,

when the ECON category was already running out of worthy

candidates YEARS ago, and had started the tradition of finding

excuses for giving the prize to NON-economists. Simon is

another example. There are other notable examples in recent

years.

There were MANY more qualified Mathematicians for a Nobel Prize

(the Fields Medal winners, e.g., a coveted Mathematics prize

which Nash never won), had they not been SCREWED by Nobel himself. :-)

So there!

And why didn't you mention Miller, another NON-economist, who

won it on the strength of having been Modigliani's TEACHER,

and the co-winner on the Miller-Modigliani theory. :)

Friedman? He should have won it LONG before he actually did.

His theory was directly opposed to Samuelson's, and since the

"banking committee" was obviously pro-Samuelson and anti-Friedman,

they kept him OFF the Nobel prize as long as they could, which

was already obvious to EVERYONE who knew anything about ECONOMICS

that Uncle Milty :-) should have won it, perhaps even before

Samuelson did.

Besides, Milton Friedman could have won a Nobel Prize for his

STATISTICAL work, for the years he collaborated and SUPERVISED

the likes of Fred Mosteller, Jimmie Savage, and other Nobel-

prize-deserving statisticians, in STATISTICAL analyses, had

Statisticians not been SCREWED out of the Nobel category, because

Nobel mis-associated the field as Mathematics.

>

> IMHO, it is the Nobel Peace Prize which devalues all others.

>

> Abe

Are you throwing your hat into the ring for the Nobel Peace

Prize, (Honest) Abe? :-) If it accelerates its devaluation

rate, and you live to be a thousand-year-old man (curse to

medical science <g>), even YOU may have a SHOT at it, (Honest)

Abe.

-- Bob (NOT Hogg; NOT Anon-O'Hara) the Reef Fish.

Jul 1, 2005, 3:35:14 AM7/1/05

to

"Abe Kohen" <ako...@xenon.stanford.edu> writes:

> You think Kahneman, Harsanyi, Nash, Sharpe, Markowitz, Modigliani, Friedman

> and Samuelson are not Nobel quality?

I know nothing about these people. The argument, as I have seen it

presented, turned not on the qualities of individual recipients, but on

the nature of the subject.

Jul 1, 2005, 3:38:21 AM7/1/05

to

"Reef Fish" <Large_Nass...@Yahoo.com> writes:

> Statisticians not be SCREWED out of the Nobel category, because

> Nobel mis-associated the field as Mathematics.

I see that you stick with grit and determination to your tradition

of promoting your favorite fantasies!

Jul 1, 2005, 10:32:08 AM7/1/05

to

In article <1120202705....@g14g2000cwa.googlegroups.com>,

Reef Fish <Large_Nass...@Yahoo.com> wrote:

>Abe Kohen wrote:

>> "Torkel Franzen" <tor...@sm.luth.se> wrote in message

>> news:vcb7jge...@beta19.sm.ltu.se...

>> > Russell...@wdn.com writes:

>> > > (OK, they don't really say the

>> > > last part, but IMO in most cases they should. True, all models

>> > > are wrong, some are useful. But in economics more are wrong in

>> > > more ways and less useful than in just about any "science" with

>> > > which I am familiar.)

>> > In Sweden, it has been argued that the association of the economics

>> > prize with the name of Nobel is unfortunate, and can be expected to

>> > devalue the proper Nobel prizes.

>> You think Kahneman, Harsanyi, Nash, Sharpe, Markowitz, Modigliani, Friedman

>> and Samuelson are not Nobel quality?

>Nash, if he should win anything at all, should have won it in

>Mathematics. But we all know that Math and Stat were SCREWED out

>of the Nobel categories. ;^)

Stat was not even a topic at the time, and the story about

Math being screwed out has long been discredited.

[Much deleted; I only agree with part of it.]

>> IMHO, it is the Nobel Peace Prize which devalues all others.

>> Abe

>Are you throwing your hat into the ring for the Nobel Peace

>Prize, (Honest) Abe? :-) If it accelerates its devaluation

>rate, and you live to be a thousand-year-old man (curse to

>medical science <g>), even YOU may have a SHOT at it, (Honest)

>Abe.

Considering the people who have been awarded the Peace

Prize, and how little most of them did for peace, and in

fact how much most of them did to hinder the cause of

peace, I have to agree completely with Abe.

--

This address is for information only. I do not claim that these views

are those of the Statistics Department or of Purdue University.

Herman Rubin, Department of Statistics, Purdue University

hru...@stat.purdue.edu Phone: (765)494-6054 FAX: (765)494-0558

Jul 1, 2005, 11:01:01 AM7/1/05

to

What fantasy? About Nobel SCREWING the Mathematicians and

Statisticians?

Then how do you explain why there is NO category of Nobel Prize for

Mathematics OR Statistics? Even if you discount the recent history

of Statistics, Mathematics as a science far exceeded all the other

sciences in the history of contribution to all sciences.

Andrew Wiles and Taniyama and Shimura should have shared a Nobel

Prize, had they NOT been screwed by Nobel himself, for slaying

the Grandest Dragon of Mathematics, Fermat's Last Theorem,

which stood unproved for some 350 years until they came along!

Wiles was credited with the actual proof, but Wiles would not

have been able to prove it without the Taniyama-Shimura conjecture

which itself stood for decades unproved.

It would take some 3rd rate economist to find some trivial

or contrived use of Fermat's Last Theorem, and THEN Wiles will be

awarded the Nobel Prize in ECONOMICS (given by the Swedish

central bank <G>) for some economic nonsense.

That's the way it'll be.

Mathematicians are SCREWED by Nobel!

But for YOUR consolation, even if Nobel gives 1000 Prizes to

mathematicians EVERY year, Torkel Franzen would not surface to

the top 100,000 within the next millennium. Trust me! :-)

-- Bob.

Jul 1, 2005, 11:03:51 AM7/1/05

to

As I wrote in another post, don't get me started...

Nash, Harsanyi, Kahneman, yes, and some others (Arrow comes to

my mind), but IMO the good work for which this prize has been

given has been mostly the highly mathematical results which have

broader interest (at least to mathematicians) and applications

beyond economics. Also good work for which the prize has been

given is the work that shows that most of economics, as it is

presently practiced by the dominant school in the subject, is

built on a foundation of sand (again Arrow comes immediately

to mind, along with Kahneman). Often the closer to empirical

the work is, the worse it is, IMO, in terms of pure scientific

value and actual scien