TERMINOLOGY OF ANALYSIS OF VARIANCE

C. Patrick Doncaster

Once you have familiarised yourself with the terminology of Analysis of Variance you will find it easier to grasp many of the parametric techniques that you read about in statistics books. Some of the terms described below may be referred to by any of several names, as indicated at the head of each entry. They are illustrated here with a simple example of statistical analysis, in which a biologist wishes to explain variation in the body weights of a sample of people according to different variables such as their height, sex and nationality.

 

See also Examples of Analysis of Variance and Covariance for a comprehensive description of designs, Analysis of Variance and Covariance in R for running analyses, and the Lexicon of Statistical Modelling for further definitions.

 

1. Variable

A property that varies in a measurable way between subjects in a sample.

2. Response variable, Dependent variable, Y

Describes the measurements, usually on a continuous scale, of the variable of interest (e.g. weight: what causes variation in weight?). If these measurements are free to vary in response to the explanatory variable(s), statistical analysis will reveal the explanatory power of the hypothesised source(s) of variation.

3. Explanatory variable, Independent variable, Predictor variable, Treatment, Factor, Effect, X

The non-random measurements or observations (e.g. treatments fixed by experimental design), which are hypothesised in a statistical model to have predictive power over the response variable. This hypothesis is tested by calculating sums of squares and looking for variation in Y between levels of X that exceeds the variation within levels. An explanatory variable can be categorical (e.g. sex, with 2 levels of male and female), or continuous (e.g. height, with a continuum of possible values). The explanatory variable is assumed to be 'independent' in the sense of being independent of the response variable: i.e. weight can vary with height, but height is independent of weight. The values of X are assumed to be measured precisely, without error, permitting an accurate estimate of their influence on Y.

4. Variates, Replicates, Observations, Scores, Data points

The replicate observations of the response variable (Y1, Y2, ..., Yi, ..., YN) measured at each level of the explanatory variable. These are the data points, each usually obtained from a different subject to ensure that the sample size reflects N independent replicates (i.e. it is not inflated by non-independent data: 'pseudoreplication').

5. Sample

The collection of observations measured at a level of X (e.g. body weights were measured from one sample of males and another of females to test the effect of Sex on Weight). If X is continuous, the sample comprises all the measures of Y on X (e.g. Weight on Height).
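
The worked sketches that follow assume a small data frame in R of the kind described here. The object name 'people' and every value in it are invented purely for illustration; they are not data from the text.

    # Hypothetical data: body weight (kg) and height (cm) of 16 subjects,
    # with their sex and nationality recorded as categorical factors
    people <- data.frame(
      Weight      = c(62, 58, 75, 81, 55, 60, 78, 84, 64, 59, 80, 86, 57, 63, 76, 88),
      Height      = c(165, 160, 180, 185, 158, 162, 178, 183, 168, 161, 182, 187, 159, 166, 177, 189),
      Sex         = factor(rep(c("F", "F", "M", "M"), times = 4)),
      Nationality = factor(rep(c("UK", "FR", "DE", "ES"), each = 4))
    )
    str(people)   # two continuous variables and two factors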

6. Sum of squares

The squared distance between each data point (Yi) and the sample mean, summed for all N data points. The squared deviations measure variation in a form which can be partitioned into different components that sum to give the total variation (e.g. the component of variation between samples and the component of variation within samples).
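
As a minimal sketch of this partition, using the hypothetical 'people' data frame above with Sex as the explanatory variable:

    # Total sum of squares: squared deviations of each weight from the
    # grand mean, summed over all N subjects
    SS.total <- sum((people$Weight - mean(people$Weight))^2)

    # Partition into between-sample and within-sample components
    grand.mean <- mean(people$Weight)
    group.mean <- tapply(people$Weight, people$Sex, mean)
    group.n    <- tapply(people$Weight, people$Sex, length)
    SS.between <- sum(group.n * (group.mean - grand.mean)^2)
    SS.within  <- sum((people$Weight - group.mean[people$Sex])^2)

    SS.between + SS.within   # sums to SS.total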

7. Variance

The variance in a normally distributed population is described by the average of N squared deviations from the mean. Variance usually refers to a sample, however, in which case it is calculated as the sum of squares divided by N-1 rather than N. Its positive root is then the standard deviation, SD, which describes the dispersion of normally distributed variates (e.g. 95% lying within 1.96 standard deviations of the mean when N is large).
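
A short sketch of the calculation for the hypothetical weights, checked against R's built-in functions:

    # Sample variance = sum of squares / (N - 1); its positive square root
    # is the standard deviation
    y <- people$Weight
    N <- length(y)
    sum((y - mean(y))^2) / (N - 1)   # matches R's built-in var(y)
    sqrt(var(y))                     # the standard deviation, as given by sd(y)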

8. Statistical model, Y = X + ε

A statement of the hypothesised relationship between the response variable and the predictor variable. A simple model would be: Weight = Sex + ε. The '=' does not signify a literal equality, but a statistical dependency. So the statistical analysis is going to test the hypothesis that variation in the response variable on the left of the equals sign (Weight) is explained or predicted by the factor on the right (Sex), in addition to a component of random variation (the error term, ε, 'epsilon'). An Analysis of Variance will test whether significantly more of the variation in Weight falls between the categories of 'male' and 'female', and so is explained by the independent variable 'Sex', than lies within each category (the random variation). The error term is often dropped from the model description, though it is always present in the model structure, as the random variation against which to calibrate the variation between levels of X in testing for a significant explanation (the F-ratio).
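
In R the model is written as a formula, with the response on the left of '~' and the explanatory variable(s) on the right; the error term is implicit and is never typed. A sketch, using the hypothetical 'people' data frame:

    Weight ~ Sex                       # Weight = Sex + ε
    Weight ~ Height                    # a continuous predictor: simple regression
    aov(Weight ~ Sex, data = people)   # fit the model to the data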

9. Null hypothesis, H0

While a statistical model can propose a hypothesis, that Y depends on X, the statistical analysis can only seek to reject a null hypothesis: that Y does not vary with X. This is because it is always easier to find out how different things are than to know how much they are the same, so the statistician's easiest objective is to establish the probability of a deviation away from random expectation rather than towards any particular alternative. Thus does science in general proceed cautiously by a process of refutation. If the analysis reveals a sufficiently small probability that the null hypothesis is true, then we can reject it and state that Y evidently depends on X in some way.

10. One-way ANOVA, Y = X

An Analysis of Variance (ANOVA) to test the model hypothesis that variation in the response variable Y can be partitioned into the different levels of a single explanatory variable X (e.g. Weight = Sex). If X is a continuous variable, then the analysis is equivalent to a linear regression, which tests for a significant slope in the best fit line describing change of Y with X (e.g. Weight with Height).
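
Both versions can be sketched in one line each with the hypothetical 'people' data frame:

    # One-way ANOVA: does Weight differ between the levels of Sex?
    summary(aov(Weight ~ Sex, data = people))

    # With a continuous explanatory variable the same model is a linear
    # regression, testing for a non-zero slope of Weight on Height
    summary(lm(Weight ~ Height, data = people))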

11. Two-way ANOVA, Y = X1 + X2 + X1*X2

Test of the hypothesis that variation in Y can be explained by one or both variables X1 and X2. If X1 and X2 are categorical and Y has been measured only once in each combination of levels of X1 and X2, then the interaction effect X1*X2 cannot be estimated. Otherwise a significant interaction term means that the effect of X1 is modulated by X2 (e.g. the effect of Sex, X1, on Weight, Y, depends on Nationality, X2). If one of the explanatory variables is continuous, then the analysis is equivalent to a linear regression with one line for each level of the categorical variable (e.g. graph of Weight by Height, with one line for males and one for females): different intercepts signify a significant effect of the categorical variable, different slopes signify a significant interaction effect with the continuous variable.
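
A sketch of both cases in R, again with the hypothetical 'people' data frame (which has more than one observation per Sex-by-Nationality combination, so the interaction can be estimated):

    # Two categorical predictors with their interaction:
    # Sex * Nationality expands to Sex + Nationality + Sex:Nationality
    summary(aov(Weight ~ Sex * Nationality, data = people))

    # One categorical and one continuous predictor: different intercepts
    # test the Sex effect, different slopes test the Sex:Height interaction
    summary(aov(Weight ~ Sex * Height, data = people))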

12. Error, Residual

The amount by which an observed variate differs from the value predicted by the model. Errors or residuals are the segments of scores not accounted for by the analysis. In Analysis of Variance, the errors are assumed to be independent of each other, and normally distributed about the sample means. They are also assumed to be identically distributed for each sample (since the analysis is seeking only a significant difference between sample means), which is known as the assumption of homogeneity of variances.
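
These assumptions can be inspected from a fitted model; a sketch using the hypothetical data:

    # Residuals are the observed weights minus the values fitted by the model
    fit <- aov(Weight ~ Sex, data = people)
    residuals(fit)            # one residual per subject
    plot(fit, which = 1:2)    # residuals-vs-fitted and normal Q-Q plots, for
                              # checking homogeneity of variances and normality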

13. Normal distribution

A bell-shaped frequency distribution of a continuous variable. The formula for the normal distribution contains two parameters: the mean, giving its location, and the standard deviation, giving the shape of the symmetrical 'bell'. This distribution arises commonly in nature when myriad independent forces, themselves subject to variation, combine additively to produce a central tendency. The technique of Analysis of Variance is constructed on the assumption that the component of random variation takes a normal distribution. This is because the sums of squares that are used to describe variance in an ANOVA accurately reflect the true variation between and within samples only if the residuals are normally distributed about sample means.
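
A quick sketch of the curve and its two parameters (the mean and SD used here are arbitrary illustrative values):

    # Normal density located at a mean of 70 with a standard deviation of 10
    curve(dnorm(x, mean = 70, sd = 10), from = 40, to = 100,
          xlab = "Weight", ylab = "Probability density")
    pnorm(1.96) - pnorm(-1.96)   # about 0.95 of variates lie within 1.96 SD of the mean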

14. Degrees of freedom

The numbers of pieces of information about the 'noise' from which an investigator wishes to extract the 'signal'. The F-ratio in an Analysis of Variance is always presented with two sets of degrees of freedom, the first corresponding to one less than the number of samples or levels, a, of the explanatory variable (a - 1), and the second to the remaining error degrees of freedom (N - a). For example, F3,23 = 3.10, P < 0.05 would describe a significant effect of Nationality (4 nations, giving 3 degrees of freedom for the effect) on Weight (27 subjects, giving 23 error degrees of freedom) in a one-way ANOVA. A continuous factor has one degree of freedom, so the linear regression ANOVA has 1 and N-2 degrees of freedom (e.g. a significant Height effect: F1,25 = 4.27, P < 0.05).
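
The same book-keeping appears in R's ANOVA tables. Note that the hypothetical 'people' data frame has a = 4 nationalities and N = 16 subjects, giving 3 and 12 degrees of freedom rather than the 3 and 23 of the 27-subject example in the text:

    # Categorical factor: a - 1 = 3 df for Nationality, N - a = 12 error df
    summary(aov(Weight ~ Nationality, data = people))

    # Continuous predictor: 1 df for Height, leaving N - 2 = 14 error df
    anova(lm(Weight ~ Height, data = people))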

15. F-ratio

The statistic calculated by Analysis of Variance, which reveals the significance of the hypothesis that Y depends on X. It comprises the ratio of two mean-squares: MS[X] / MS[ε]. The mean-square, MS, is the average sum of squares: the sum of squared deviations from the mean (as defined above), for X or ε, divided by the appropriate degrees of freedom. This is why the F-ratio is always presented with two degrees of freedom, one used to create the numerator MS[X], and one the denominator, MS[ε]. The F-ratio tells us precisely how much more of the variation in Y is explained by X (MS[X]) than is due to random, unexplained, variation (MS[ε]). A large ratio indicates a significant effect of X. In fact, the observed F-ratio is connected by a complicated equation to the exact probability of a true null hypothesis (i.e. that the true ratio equals unity), but you can use standard tables to find out whether the observed F-ratio indicates a significant relationship.
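
A sketch of the calculation by hand, reusing SS.between and SS.within from the Sum of squares example above (Sex effect on Weight in the hypothetical data, so a = 2 and N = 16); the probability comes from pf() rather than printed tables:

    MS.X     <- SS.between / (2 - 1)     # mean square for Sex, df = a - 1
    MS.error <- SS.within / (16 - 2)     # error mean square, df = N - a
    F.ratio  <- MS.X / MS.error
    F.ratio
    pf(F.ratio, df1 = 1, df2 = 14, lower.tail = FALSE)   # exact P-value

    # The same quantities in a single ANOVA table:
    summary(aov(Weight ~ Sex, data = people))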

16. Significance

This is the probability of mistakenly rejecting a null hypothesis that is actually true. In the biological sciences a critical value of P = 0.05 is generally taken as marking an acceptable boundary of significance. A large F-ratio signifies a small probability that the null hypothesis is true. Thus finding a significant Nationality effect, F3,23 = 3.10, P < 0.05, means that the variation in weight between the samples from four nations is 3.10 times greater than the variation within samples, and that tables of the F-distribution tell us we can have greater than 95% (i.e. > (1 - 0.05) × 100%) confidence in an effect of nationality on weight (i.e. less than 5% confidence in the null hypothesis of no effect). The significant Height effect in the linear regression (F1,25 = 4.27, P < 0.05) means that the regression slope of Weight with Height is significantly different from horizontal. This regression line takes the form y = B0 + B1x, and 95% confidence intervals for the estimated slope are obtained as B1 ± t0.05[N-2] SE(B1); if the slope is significant, then these intervals will not encompass zero.
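
The slope, its standard error and its confidence interval can all be read from a fitted regression; a sketch with the hypothetical data:

    # Regression of Weight on Height; a significant slope gives a 95%
    # confidence interval that excludes zero
    reg <- lm(Weight ~ Height, data = people)
    summary(reg)   # slope estimate B1, its SE, and the t and P values
    confint(reg)   # B1 ± t0.05[N-2] * SE(B1), the 95% interval for the slope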