Data Science Classwork: ANOVA and Post Hoc Tests

Pictures are great but they're not quite precise enough to tell us if there's a statistically significant difference between groups of data.

Which brings us to the ANOVA (analysis of variance) test. ANOVA requires a categorical explanatory variable and a quantitative response variable, which is perfect for my research questions (I mean, my Mars Missions).

Mission 1: Determine if there's an association between the number of visible layers in a crater and its latitude. (Or, in hypothesis testing lingo, the alternative hypothesis is that craters that are grouped by their number of layers are, on average, found in different latitudes. The null hypothesis is that there's no difference in latitudes between the groups.)

ANOVA code in SAS

Running the code gives the following results:

The ANOVA Procedure

Dependent Variable: LATITUDE_CIRCLE_IMAGE Crater Latitude (degrees)

Class Level Information
Class	Levels	Values
NUMBER_LAYERS	6	0 1 2 3 4 5

Number of Observations Read	384343
Number of Observations Used	384343

Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	5	2004589.2	400917.8	356.57	<.0001
Error	384337	432133768.4	1124.4
Corrected Total	384342	434138357.6

R-Square	Coeff Var	Root MSE	LATITUDE_CIRCLE_IMAGE Mean
0.004617	-465.7665	33.53150	-7.199209

Source	DF	Anova SS	Mean Square	F Value	Pr > F
NUMBER_LAYERS	5	2004589.224	400917.845	356.57	<.0001

And within that giant amount of information, there's one major point: the p-value (which is listed as Pr > F in the table) is less than 0.0001, so the null hypothesis can be rejected. In other words, the craters have been grouped into 6 groups (0 layers, 1 layer, and so on up to 5 layers) and for each group, a mean latitude was calculated. Since the p-value is so low, we can conclude that the mean latitudes for the groups are not all equal to each other. Which is sorta useful to know but it'd be really great to know which groups are different.
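For anyone wondering where the other numbers in that table come from, here's a minimal sketch that recomputes a few of them from the sums of squares. It's written in Python purely for illustration (it is not the SAS code that produced the output above), the variable names are made up, and the inputs are copied straight from the Mission 1 table.

```python
import math

# Values copied from the Mission 1 ANOVA table.
ss_model, df_model = 2004589.2, 5
ss_error, df_error = 432133768.4, 384337
ss_total = 434138357.6  # "Corrected Total" sum of squares

# Mean square = sum of squares / degrees of freedom.
ms_model = ss_model / df_model
ms_error = ss_error / df_error

# F is the ratio of between-group to within-group mean squares.
f_value = ms_model / ms_error

# R-Square is the share of total variation explained by the grouping.
r_square = ss_model / ss_total

# Root MSE is the square root of the error mean square.
root_mse = math.sqrt(ms_error)

print(f"F = {f_value:.2f}, R-Square = {r_square:.6f}, Root MSE = {root_mse:.4f}")
```

The F value is just the ratio of the two mean squares, and R-Square is the fraction of the total sum of squares that the grouping accounts for (under half a percent in this case).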

That's where the Duncan Test (a Post Hoc test) comes in. The results from the Duncan Test are shown below.

Means with the same letter are not significantly different.
Duncan Grouping		Mean	N	NUMBER_LAYERS
	A	12.680	3435	2
	A
B	A	2.390	739	3
B	A
B	A	0.671	5	5
B	A
B	A	-1.497	15467	1
B	A
B	A	-2.094	85	4
B
B		-7.649	364612	0

This table indicates that craters with 1, 2, 3, 4, and 5 layers do not have statistically significant differences in their mean latitudes (they all share the letter A). The same goes for craters with 0, 1, 3, 4, and 5 layers (they all share the letter B). However, craters with no visible layers do have a different latitude (on average) than craters with 2 layers, which makes a lot of sense, given that this is the graph of latitude vs number of layers:

Probably should have included this one in my last post, too. Oh well. Better late than never.
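The rule for reading the Duncan table for Mission 1 ("means with the same letter are not significantly different") can be sketched in a few lines. This is an illustrative Python snippet, not part of the SAS run; the dictionary and the function name are hypothetical, and the letters are copied by hand from the table.

```python
from itertools import combinations

# Letter groupings copied from the Mission 1 Duncan table
# (key: NUMBER_LAYERS, value: the letters shown on that group's row).
duncan_letters = {
    2: {"A"},
    3: {"A", "B"},
    5: {"A", "B"},
    1: {"A", "B"},
    4: {"A", "B"},
    0: {"B"},
}

def significantly_different_pairs(letters):
    """Return the pairs of groups that share no Duncan letter."""
    pairs = []
    for a, b in combinations(sorted(letters), 2):
        # Two groups differ significantly only if their letter sets are disjoint.
        if not (letters[a] & letters[b]):
            pairs.append((a, b))
    return pairs

different_pairs = significantly_different_pairs(duncan_letters)
print(different_pairs)
```

With these letters, the only pair that comes back is craters with 0 layers versus craters with 2 layers, which matches the reading of the table above.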

Mission 2: Determine if there's an association between the number of layers and the morphology of the ejecta.

For this mission, my alternative hypothesis is that the morphologies (SLE, DLE, MLE, Pd, and Rd) have different mean numbers of visible layers. The null hypothesis is that they all have the same mean number of layers.

More ANOVA code

Here are the results of the ANOVA test:

The ANOVA Procedure

Class Level Information
Class	Levels	Values
morph	5	DLE MLE Pd Rd SLE

Number of Observations Read	384343
Number of Observations Used	44625

Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	4	17456.69025	4364.17256	32610.8	<.0001
Error	44620	5971.31679	0.13383
Corrected Total	44624	23428.00704

R-Square	Coeff Var	Root MSE	NUMBER_LAYERS Mean
0.745121	65.51155	0.365822	0.558409

Source	DF	Anova SS	Mean Square	F Value	Pr > F
morph	4	17456.69025	4364.17256	32610.8	<.0001

This one also has a p-value less than 0.0001, so the null hypothesis can be rejected (i.e., the morphologies don't all have the same mean number of layers). But, again, it'd be nice to know which ones are significantly different from each other and so we run another post hoc test:

Means with the same letter are not significantly different.
Duncan Grouping	Mean	N	morph
A	3.1102	581	MLE

B	1.9950	2777	DLE

C	1.0033	14196	SLE

D	0.1230	27069	Rd
D
D	0.0000	2	Pd

Three types (MLE, DLE, and SLE) are in their own little groups - each one has a mean that's significantly different from all the others. The other two types (Rd and Pd) share the letter D, so their means are not significantly different from each other (keep in mind that the Pd group only has 2 craters in it). That means the number of layers is associated with morphology when comparing MLE, DLE, SLE, and the combined Rd/Pd group, but the test can't tell Rd and Pd apart.
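As a quick sanity check on how the Mission 2 tables fit together, the group means weighted by their N should land very close to the overall NUMBER_LAYERS mean reported in the ANOVA output (0.558409). Here's a small Python sketch of that check; it's only an illustration with made-up variable names, and the means in the Duncan table are rounded, so a tiny mismatch is expected.

```python
# (mean, N) per morphology, copied from the Mission 2 Duncan table.
groups = {
    "MLE": (3.1102, 581),
    "DLE": (1.9950, 2777),
    "SLE": (1.0033, 14196),
    "Rd": (0.1230, 27069),
    "Pd": (0.0000, 2),
}

# Total craters used; should match "Number of Observations Used".
total_n = sum(n for _, n in groups.values())

# N-weighted average of the group means.
weighted_mean = sum(mean * n for mean, n in groups.values()) / total_n

# Fraction of the used craters that fall in the Pd group.
pd_share = groups["Pd"][1] / total_n

print(f"N = {total_n}, weighted mean = {weighted_mean:.4f}, Pd share = {pd_share:.6f}")
```

It also shows how little the Pd group contributes: 2 craters out of the 44625 used.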

Bonus Mission: Determine if there's an association between the number of layers and the depth of a crater.

Alternative hypothesis: Craters grouped by their number of visible layers will have different average depths. (Null: the depths won't be different, regardless of how many layers are visible).

The final bit of ANOVA code

And the results:

The ANOVA Procedure

Dependent Variable: DEPTH_RIMFLOOR_TOPOG Crater Depth (km)

Class Level Information
Class	Levels	Values
NUMBER_LAYERS	6	0 1 2 3 4 5

Number of Observations Read	384343
Number of Observations Used	384343

Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	5	3714.16500	742.83300	18850.4	<.0001
Error	384337	15145.50177	0.03941
Corrected Total	384342	18859.66676

R-Square	Coeff Var	Root MSE	DEPTH_RIMFLOOR_TOPOG Mean
0.196937	261.7589	0.198512	0.075838

Source	DF	Anova SS	Mean Square	F Value	Pr > F
NUMBER_LAYERS	5	3714.164997	742.832999	18850.4	<.0001

Once again, the p-value is less than 0.0001 so there's at least one group that's got a different average depth from the others.

Here are the post hoc test results:

Means with the same letter are not significantly different.
Duncan Grouping	Mean	N	NUMBER_LAYERS
A	1.59400	5	5

B	1.33776	85	4

C	1.04597	739	3

D	0.55734	3435	2

E	0.42667	15467	1

F	0.05414	364612	0

This is a pretty cool little table because it says that all of the groups have different mean depths. That means that craters with 0 layers have a different mean depth than craters with 1 layer and that both of those groups have different mean depths than craters with 2 layers (and so on). It's not an incredibly surprising result but it's still neat to see that it's true.

In the next post... I have no idea what we'll be doing with the data but I'm sure it'll be fun.

Thursday, November 12, 2015
