Updated 12/13/00
Multivariate statistics
-used for analyzing data in which many simultaneous measurements have been made.
Objectives of Multivariate Methods:
1) Data reduction
-phenomenon being studied is represented as simply as possible w/out sacrificing information (selection index)
-data reduction can make interpretation easier or make it clearer which variables are most important
2) Sorting or grouping
-groups of 'similar' objects or variables are created based on measured characteristics (cluster analysis, ordination, principal component analysis)
-alternatively, rules for classifying objects into well defined groups may be required (discriminant analysis)
3) Investigation of dependence among variables.
-nature of relationship among variables is of interest
-are all variables mutually independent, or are one or more variables dependent upon the others? If so, how? (MANOVA, multiple correlation and regression)
4) Prediction
-relationship among variables must be determined for the purpose of predicting the values of one or more variables on the basis of observations on the other variables (multiple regression)
5) Hypothesis testing
-a specific statistical hypothesis, formulated in terms of the parameters of the multivariate populations, is tested (includes several multivariate methods)
Considerations for multivariate techniques (seem useful, but are not easy to apply)
1) Computational difficulty-almost impossible to conduct w/out matrix notation. Matrix algebra is not difficult, but some fundamental exposure to it is required to understand how multivariate analyses are conducted.
2) Number of observations-typically, multivariate analyses require a lot of information and cannot be used effectively if you have made measurements on only a few subjects. Don't use multivariate approaches if samples or experimental units (EU's) are difficult to obtain or may die; a missing observation typically results in the loss of all data for that subject.
3) Data organization-typically in data matrices or arrays in which rows list all the observations on one sampling unit and each column lists the values of one of the observed variables on all the sampling units.
-data may be continuous, discrete, or both
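A minimal sketch of the data layout described above, assuming Python with NumPy (neither is part of the original notes). The values are invented for illustration; the point is that one missing measurement removes the whole sampling unit.

```python
import numpy as np

# Hypothetical data matrix: rows = sampling units, columns = observed variables.
# All values are invented for illustration only.
data = np.array([
    [4.1, 12.0, 0.8],
    [3.9, np.nan, 0.7],   # one missing measurement on the second unit
    [5.2, 15.5, 1.1],
    [4.7, 13.2, 0.9],
])

# Listwise deletion: a unit with any missing value is dropped entirely.
complete = data[~np.isnan(data).any(axis=1)]

print(data.shape)      # 4 units, 3 variables
print(complete.shape)  # the second unit is lost despite two valid measurements
```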
Hotelling's T²
(analogous to a univariate t-test)
Can evaluate:
-whether sets of means obtained from criterion variables (dependent or response variables) differ significantly from the values specified under some null hypothesis (H0).
e.g.: Is the N-chemical profile acceptable based on EPA standards (compare your values to standard)
-whether two sets of means are different from each other (difference does not equal 0).
e.g.: U Florida gators testing latest formulation of Gatorade. 10 healthy individuals given latest formulation and 10 given regular Gatorade. Sweat rate, sodium content of sweat, and K content of sweat are criterion or dependent variables.
You don't want to compare each response variable separately b/c they are probably dependent upon each other.
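The two-group comparison above can be sketched as follows, assuming Python with NumPy and the standard pooled-covariance form of the two-sample T² statistic. The function name `hotelling_two_sample` and the simulated sweat data are hypothetical, not taken from the notes. The sketch returns T² and its F transformation with degrees of freedom; looking up a p-value is left out.

```python
import numpy as np

def hotelling_two_sample(x1, x2):
    """Two-sample Hotelling's T^2 using a pooled covariance matrix.

    x1, x2: arrays of shape (n_subjects, n_variables).
    Returns (T2, F, df1, df2), where F = T2 * df2 / (p * (n1 + n2 - 2)).
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n1, p = x1.shape
    n2 = x2.shape[0]
    diff = x1.mean(axis=0) - x2.mean(axis=0)
    # atleast_2d keeps the single-variable case (p = 1) working
    s1 = np.atleast_2d(np.cov(x1, rowvar=False))
    s2 = np.atleast_2d(np.cov(x2, rowvar=False))
    pooled = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)
    t2 = (n1 * n2) / (n1 + n2) * (diff @ np.linalg.solve(pooled, diff))
    df1, df2 = p, n1 + n2 - p - 1
    f_stat = t2 * df2 / (p * (n1 + n2 - 2))
    return float(t2), float(f_stat), df1, df2

# Simulated (invented) data: 10 subjects per drink, 3 response variables
# (sweat rate, sweat sodium, sweat K). Not real measurements.
rng = np.random.default_rng(0)
new_drink = rng.normal([1.0, 50.0, 5.0], [0.2, 5.0, 0.5], size=(10, 3))
regular = rng.normal([1.2, 55.0, 5.5], [0.2, 5.0, 0.5], size=(10, 3))

t2, f_stat, df1, df2 = hotelling_two_sample(new_drink, regular)
print(f"T2 = {t2:.3f}, F = {f_stat:.3f} on ({df1}, {df2}) df")
```

With a single response variable, T² reduces to the square of the pooled two-sample t statistic, which is the sense in which the test is analogous to a univariate t-test.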
Multiple Regression
-y predicted from a set of x's
$y' = a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$
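-coefficients are chosen so that estimates of y correlate most highly with observed values of y (yield smallest squared deviations)
-R² = squared multiple correlation coefficient (proportion of variance in y associated with the x's)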
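A minimal sketch of the fit described above, assuming Python with NumPy and ordinary least squares. The function name `fit_multiple_regression` and the simulated data are hypothetical. It also checks that R² equals the squared correlation between observed and predicted y.

```python
import numpy as np

def fit_multiple_regression(X, y):
    """Least-squares fit of y' = a + b1*x1 + ... + bk*xk.

    Returns (coef, y_hat, r2); coef[0] is the intercept a.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])  # column of 1s for a
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    y_hat = design @ coef
    ss_res = np.sum((y - y_hat) ** 2)     # deviations left unexplained
    ss_tot = np.sum((y - y.mean()) ** 2)  # total variation in y
    return coef, y_hat, 1.0 - ss_res / ss_tot

# Simulated (invented) data: 40 observations, 3 predictors.
rng = np.random.default_rng(0)
X_obs = rng.normal(size=(40, 3))
y_obs = 2.0 + 1.5 * X_obs[:, 0] - 0.5 * X_obs[:, 1] + rng.normal(size=40)

coef_demo, y_fit, r2_demo = fit_multiple_regression(X_obs, y_obs)
r_demo = np.corrcoef(y_obs, y_fit)[0, 1]  # multiple correlation R
print("coefficients:", np.round(coef_demo, 3))
print("R2:", round(r2_demo, 3), " corr(y, y')^2:", round(r_demo ** 2, 3))
```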
Multiple Correlation
-as above, R² used to evaluate relationship
-partial R²'s helpful in understanding contribution of each variable
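One common way to compute a partial R² is to compare the full model with a model that omits one variable: (SSE_reduced - SSE_full) / SSE_reduced. The sketch below assumes that definition, Python with NumPy, and simulated data; the helper names `sse` and `partial_r2` are hypothetical.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares from a least-squares fit with intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)

def partial_r2(X, y):
    """Partial R^2 of each column of X, given all the other columns."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    full = sse(X, y)
    values = []
    for j in range(X.shape[1]):
        reduced = sse(np.delete(X, j, axis=1), y)  # model without x_j
        values.append((reduced - full) / reduced)
    return values

# Simulated (invented) data: x1 matters a lot, x2 very little.
rng = np.random.default_rng(1)
X_sim = rng.normal(size=(50, 2))
y_sim = 3.0 * X_sim[:, 0] + 0.1 * X_sim[:, 1] + rng.normal(size=50)

p_r2 = partial_r2(X_sim, y_sim)
print("partial R2 for x1, x2:", [round(v, 3) for v in p_r2])
```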
MANOVA-Multivariate ANOVA
-involves analyses with > 1 response (or criterion) variable
-like ANOVA (but more difficult to analyze and interpret)
-predictor variables are class/discrete variables
-response variables are continuous
-considerations for MANOVA
- are criterion (response) variables inter-related?
-evaluated with correlation matrices. If there is no relationship, then use multiple ANOVA's
- does number of response variables substantially increase Type I error risk? Yes, if doing multiple ANOVA's-but there are methods to adjust alpha (like conservative multiple comparison procedures).
- quantity of data required for MANOVA and other multivariate techniques may be prohibitive
-many observations required to:
detect differences (increased power)
satisfy the assumption of multivariate normality (large sample size really helps)
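The Type I error concern above can be made concrete with a little arithmetic. This sketch assumes the separate ANOVAs are independent tests, each run at the same alpha, and uses the Bonferroni correction as one example of a conservative adjustment; the function names are hypothetical.

```python
def familywise_error(alpha, k):
    """Chance of at least one false positive in k independent tests."""
    return 1 - (1 - alpha) ** k

def bonferroni_alpha(alpha, k):
    """Per-test alpha that keeps the familywise rate at or below alpha."""
    return alpha / k

for k in (1, 3, 5, 10):
    fw = familywise_error(0.05, k)
    adj = bonferroni_alpha(0.05, k)
    print(f"{k:2d} response variables: familywise error = {fw:.3f}, "
          f"Bonferroni per-test alpha = {adj:.4f}")
```

Independence is an assumption of the sketch. Correlated response variables change the exact familywise figure, although the Bonferroni bound still holds.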
Canonical Correlation
-method for correlating two derived (canonical) variables
in multiple correlation, you would have y and x1, x2, x3
in canonical correlation, you have y1, y2 correlated with x1, x2, x3
-canonical variables represent a weighted combination of other variables
-Rc² = squared canonical correlation coefficient
-describes proportion of variance of derived variable that is associated w/ the variance in the other derived variable
Group 1: y1, y2
Group 2: x1, x2, x3
1st canonical correlation (e.g. y1=weight, y2=cholesterol; x1=pushups, x2=situps, x3=aerobic endurance)
$u_1 = a_1 y_1 + a_2 y_2$
$v_1 = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$
a and β selected to maximize correlation b/w u1 and v1
2nd canonical correlation (superscript (2) marks the second set of weights, not that the values are squared; the measured variables are the same)
$u_2 = a_1^{(2)} y_1 + a_2^{(2)} y_2$
$v_2 = \beta_1^{(2)} x_1 + \beta_2^{(2)} x_2 + \beta_3^{(2)} x_3$
a and β selected to maximize correlation b/w u2 and v2
Constraints (successive canonical variables emphasize totally different things; often, only the 1st canonical correlation is significant):
$\mathrm{corr}(u_1, u_2) = 0$
$\mathrm{corr}(v_1, v_2) = 0$
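A minimal sketch of how the canonical variables and their correlations can be computed, assuming Python with NumPy and a QR-plus-SVD approach (one of several equivalent ways to do it). The function name `canonical_correlations` and the simulated fitness data are hypothetical. The sketch also checks the two constraints numerically.

```python
import numpy as np

def canonical_correlations(Y, X):
    """Canonical correlations between the columns of Y and the columns of X.

    Returns (r, U, V): r holds the canonical correlations (largest first),
    U and V hold the canonical variables u1, u2, ... and v1, v2, ...
    """
    Yc = np.asarray(Y, dtype=float)
    Xc = np.asarray(X, dtype=float)
    Yc = Yc - Yc.mean(axis=0)  # center each variable
    Xc = Xc - Xc.mean(axis=0)
    qy, ry = np.linalg.qr(Yc)  # orthonormal basis for each variable set
    qx, rx = np.linalg.qr(Xc)
    u_mat, r, vt = np.linalg.svd(qy.T @ qx, full_matrices=False)
    a = np.linalg.solve(ry, u_mat)  # weights applied to y1, y2
    b = np.linalg.solve(rx, vt.T)   # weights applied to x1, x2, x3
    return r, Yc @ a, Xc @ b

# Simulated (invented) data: 30 subjects, y = (weight, cholesterol),
# x = (pushups, situps, aerobic endurance). Not real measurements.
rng = np.random.default_rng(0)
X_fit = rng.normal(size=(30, 3))
Y_health = np.column_stack([
    -0.8 * X_fit[:, 2] + rng.normal(size=30),
    -0.5 * X_fit[:, 0] + rng.normal(size=30),
])

r, U, V = canonical_correlations(Y_health, X_fit)
print("canonical correlations:", np.round(r, 3))
print("squared (Rc2):", np.round(r ** 2, 3))
print("corr(u1, u2):", round(np.corrcoef(U[:, 0], U[:, 1])[0, 1], 6))
print("corr(v1, v2):", round(np.corrcoef(V[:, 0], V[:, 1])[0, 1], 6))
```

With two y variables and three x variables there are at most two canonical correlations, one per pair (u1, v1) and (u2, v2).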