Content

Agreement

Agreement analyses

Continuous (intra-class correlation etc.)

Categorical (kappa etc.)

Reliability and reducibility (Cronbach alpha etc.)

Agreement analyses

Menu location: Analysis_Agreement

These methods look at the agreement of a set of measurements across a sample of individuals. Your data should be prepared so that each row represents data from an individual. Some literature lists these methods under 'reliability testing'.

· CONTINUOUS (intra-class correlation etc.)

· CATEGORICAL (kappa etc.)

· REDUCIBILITY (principal components analysis and Cronbach alpha)

Download a free 10 day StatsDirect trial

Agreement of continuous measurements

Menu location: Analysis_Analysis of Variance_Agreement.

The function calculates one way random effects intra-class correlation coefficient, estimated within-subjects standard deviation and a repeatability coefficient (Bland and Altman 1996a and 1996b, McGraw and Wong, 1996).

Intra-class correlation coefficient is calculated as:

$$r_I = \frac{m \cdot SSB - SST}{(m - 1) \cdot SST}$$

- where m is the number of observations per subject, SSB is the sum of squares between subjects and SST is the total sum of squares (as per one way ANOVA above).

Within-subjects standard deviation is estimated as the square root of the residual mean square from one way ANOVA.

The repeatability coefficient is calculated as:

- where m is the number of observations per subject, z is a quantile from the standard normal distribution (usually taken as the 5% two tailed quantile of 1.96) and xw is the estimated within-subjects standard deviation (calculated as above).

Intra-subject standard deviation is plotted against intra-subject means and Kendall's rank correlation is used to assess the interdependence of these two variables.

An agreement plot is constructed by plotting the maximum differences from each possible intra-subject contrast against intra-subject means and the overall mean is marked as a line on this plot.

A Q-Q plot is given; here the sum of the difference between intra-subject observations and their means are ordered and plotted against an equal order of chi-square quantiles.

Agreement analysis is best carried out under expert statistical guidance.

Example

From Bland and Altman (1996a).

Test workbook (Agreement worksheet: 1st, 2nd, 3rd, 4th).

Four peak flow measurements were made for each of twenty children:

1st	2nd	3rd	4th
190	220	200	200
220	200	240	230
260	260	240	280
210	300	280	265
270	265	280	270
280	280	270	275
260	280	280	300
275	275	275	305
280	290	300	290
320	290	300	290
300	300	310	300
270	250	330	370
320	330	330	330
335	320	335	375
350	320	340	365
360	320	350	345
330	340	380	390
335	385	360	370
400	420	425	420
430	460	480	470

To analyse these data using StatsDirect you must first enter them into a workbook or open the test workbook. Then select Agreement from the Analysis of Variance section of the Analysis menu.

Agreement

Variables: 1st, 2nd, 3rd, 4th

Intra-class correlation coefficient (one way random effects) = 0.882276

Estimated within-subjects standard deviation = 21.459749

For within-subjects sd vs. mean, Kendall's tau b = 0.164457 two sided P = 0.3296

Repeatability (for alpha = 0.05) = 59.482297
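The calculation can be checked by hand. The following is a minimal Python sketch, not StatsDirect's own code, that recomputes the three statistics from the peak flow table using one way ANOVA sums of squares. It takes the intra-class correlation as (m·SSB − SST) / ((m − 1)·SST). The repeatability line is an assumption: it uses the Bland and Altman form sqrt(2) × z × within-subjects sd, which is the form that reproduces the value reported above. The function name `agreement_stats` is illustrative only.

```python
import math

# Peak flow data from the example: 20 children (rows) x 4 measurements (columns).
data = [
    [190, 220, 200, 200], [220, 200, 240, 230], [260, 260, 240, 280],
    [210, 300, 280, 265], [270, 265, 280, 270], [280, 280, 270, 275],
    [260, 280, 280, 300], [275, 275, 275, 305], [280, 290, 300, 290],
    [320, 290, 300, 290], [300, 300, 310, 300], [270, 250, 330, 370],
    [320, 330, 330, 330], [335, 320, 335, 375], [350, 320, 340, 365],
    [360, 320, 350, 345], [330, 340, 380, 390], [335, 385, 360, 370],
    [400, 420, 425, 420], [430, 460, 480, 470],
]


def agreement_stats(rows, z=1.959964):
    """Return (ICC, within-subjects sd, repeatability) for a balanced design."""
    n = len(rows)            # number of subjects
    m = len(rows[0])         # observations per subject
    grand_mean = sum(map(sum, rows)) / (n * m)
    subject_means = [sum(r) / m for r in rows]

    # One way ANOVA sums of squares: between subjects, within subjects, total.
    ssb = m * sum((mu - grand_mean) ** 2 for mu in subject_means)
    ssw = sum((x - mu) ** 2 for r, mu in zip(rows, subject_means) for x in r)
    sst = ssb + ssw

    icc = (m * ssb - sst) / ((m - 1) * sst)
    sw = math.sqrt(ssw / (n * (m - 1)))       # root of the residual mean square
    repeatability = math.sqrt(2) * z * sw     # assumed Bland-Altman form
    return icc, sw, repeatability


icc, sw, rep = agreement_stats(data)
print(f"ICC = {icc:.6f}, within-subjects sd = {sw:.6f}, repeatability = {rep:.6f}")
```

The printed values should agree with the StatsDirect output above to the precision shown there.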

Kappa and Maxwell

Menu location: Analysis_Miscellaneous_Kappa & Maxwell.

Agreement Analysis

For the case of two raters, this function gives Cohen's kappa (weighted and unweighted) and Scott's pi as measures of inter-rater agreement for two raters' categorical assessments (Fleiss, 1981; Altman, 1991; Scott 1955). For three or more raters, this function gives extensions of the Cohen kappa method, due to Fleiss and Cuzick (1979) in the case of two possible responses per rater, and Fleiss, Nee and Landis (1979) in the general case of three or more responses per rater.

If you have only two categories then Scott's pi is the statistic of choice (with confidence interval constructed by the Donner-Eliasziw (1992) method) for inter-rater agreement (Zwick, 1988).

Weighted kappa partly compensates for a problem with unweighted kappa, namely that it is not adjusted for the degree of disagreement. Disagreement is weighted in decreasing priority from the top left (origin) of the table. StatsDirect uses the following definitions for weight (1 is the default):

1. w(ij)=1-abs(i-j)/(g-1)

2. w(ij)=1-[(i-j)/(g-1)]²

3. User defined (this is only available via workbook data entry)

g = categories

w = weight

i = category for one observer (from 1 to g)

j = category for the other observer (from 1 to g)

In broad terms a kappa below 0.2 indicates poor agreement and a kappa above 0.8 indicates very good agreement beyond chance.

Guide (Landis and Koch, 1977):

Kappa	Strength of agreement
< 0.2	Poor
> 0.2 ≤ 0.4	Fair
> 0.4 ≤ 0.6	Moderate
> 0.6 ≤ 0.8	Good
> 0.8 ≤ 1	Very good

N.B. You cannot reliably compare kappa values from different studies because kappa is sensitive to the prevalence of different categories, i.e. if one category is observed more commonly in one study than another then kappa may indicate a difference in inter-rater agreement which is not due to the raters.

Agreement analysis with more than two raters is a complex and controversial subject, see Fleiss (1981, p. 225).

Disagreement Analysis

StatsDirect uses the methods of Maxwell (1970) to test for differences between the ratings of the two raters (or k nominal responses with paired observations).

Maxwell's chi-square statistic tests for overall disagreement between the two raters. The general McNemar statistic tests for asymmetry in the distribution of subjects about which the raters disagree, i.e. disagreement more over some categories of response than others.

Data preparation

You may present your data for the two-rater methods as a fourfold table in the interactive screen data entry menu option. Otherwise, you may present your data as responses/ratings in columns and rows in a worksheet, where the columns represent raters and the rows represent subjects rated. If you have more than two raters then you must present your data in the worksheet column (rater) row (subject) format. Missing data can be used where raters did not rate all subjects.

Technical validation

All formulae for kappa statistics and their tests are as per Fleiss (1981):

For two raters (m=2) and two categories (k=2):

- where n is the number of subjects rated, w is the weight for agreement or disagreement, p_o is the observed proportion of agreement, p_e is the expected proportion of agreement, p_ij is the fraction of ratings i by the first rater and j by the second rater, and s_o is the standard error for testing that the kappa statistic equals zero.

For three or more raters (m>2) and two categories (k=2):

- where x_i is the number of positive ratings out of m_i raters for subject i of n subjects, and s_o is the standard error for testing that the kappa statistic equals zero.

For three or more raters and categories (m>2, k>2):

- where s_oj is the standard error for testing kappa equal to zero for each rating category separately, and s_o bar is the standard error for testing kappa equal to zero for the overall kappa across the k categories. Kappa hat is calculated as for the m>2, k=2 method shown above.

Example

From Altman (1991).

Altman quotes the results of Brostoff et al. in a comparison not of two human observers but of two different methods of assessment. These methods are RAST (radioallergosorbent test) and MAST (multi-RAST) for testing the sera of individuals for specifically reactive IgE in the diagnosis of allergies. Five categories of result were recorded using each method:

		RAST
		negative	weak	moderate	high	very high
MAST	negative	86	3	14	0	2
	weak	26	0	10	4	0
	moderate	20	2	22	4	1
	high	11	1	37	16	14
	very high	3	0	15	24	48

To analyse these data in StatsDirect you may select kappa from the miscellaneous section of the analysis menu. Choose the default 95% confidence interval. Enter the above frequencies as directed on the screen and select the default method for weighting.

For this example:

General agreement over all categories (2 raters)

Cohen's kappa (unweighted)

Observed agreement = 47.38%

Expected agreement = 22.78%

Kappa = 0.318628 (se = 0.026776)

95% confidence interval = 0.266147 to 0.371109

z (for k = 0) = 11.899574

P < 0.0001

Cohen's kappa (weighted by 1-Abs(i-j)/(k - 1))

Observed agreement = 80.51%

Expected agreement = 55.81%

Kappa = 0.558953 (se = 0.038019)

95% confidence interval for kappa = 0.484438 to 0.633469

z (for kw = 0) = 14.701958

P < 0.0001

Scott's pi

Observed agreement = 47.38%

Expected agreement = 24.07%

Pi = 0.30701

Disagreement over any category and asymmetry of disagreement (2 raters)

Marginal homogeneity (Maxwell) chi-square = 73.013451, df = 4, P < 0.0001

Symmetry (generalised McNemar) chi-square = 79.076091, df = 10, P < 0.0001
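The point estimates above can be reproduced from the RAST/MAST table. The following Python sketch is an illustration under stated assumptions rather than StatsDirect's implementation: it uses the usual definition kappa = (observed − expected) / (1 − expected), the weight definitions 1 and 2 listed earlier, and for Scott's pi an expected agreement based on the pooled marginal proportions of the two raters. Standard errors and confidence intervals are not attempted. The function names `weights`, `kappa` and `scott_pi` are illustrative only.

```python
# Rows = MAST, columns = RAST, categories ordered negative ... very high.
table = [
    [86, 3, 14, 0, 2],
    [26, 0, 10, 4, 0],
    [20, 2, 22, 4, 1],
    [11, 1, 37, 16, 14],
    [3, 0, 15, 24, 48],
]


def weights(g, scheme="unweighted"):
    """Agreement weights w(ij) for g ordered categories."""
    def w(i, j):
        if scheme == "linear":        # definition 1: 1 - abs(i-j)/(g-1)
            return 1 - abs(i - j) / (g - 1)
        if scheme == "quadratic":     # definition 2: 1 - [(i-j)/(g-1)]^2
            return 1 - ((i - j) / (g - 1)) ** 2
        return 1.0 if i == j else 0.0  # unweighted: exact agreement only
    return [[w(i, j) for j in range(g)] for i in range(g)]


def kappa(tab, scheme="unweighted"):
    """Return (observed, expected, kappa) for a square two-rater table."""
    g = len(tab)
    n = sum(map(sum, tab))
    row = [sum(r) / n for r in tab]
    col = [sum(tab[i][j] for i in range(g)) / n for j in range(g)]
    w = weights(g, scheme)
    po = sum(w[i][j] * tab[i][j] / n for i in range(g) for j in range(g))
    pe = sum(w[i][j] * row[i] * col[j] for i in range(g) for j in range(g))
    return po, pe, (po - pe) / (1 - pe)


def scott_pi(tab):
    """Return (observed, expected, pi) using pooled marginal proportions."""
    g = len(tab)
    n = sum(map(sum, tab))
    po = sum(tab[i][i] for i in range(g)) / n
    pooled = [(sum(tab[i]) + sum(r[i] for r in tab)) / (2 * n) for i in range(g)]
    pe = sum(p * p for p in pooled)
    return po, pe, (po - pe) / (1 - pe)


for label, result in [
    ("unweighted kappa", kappa(table)),
    ("linear weighted kappa", kappa(table, "linear")),
    ("Scott's pi", scott_pi(table)),
]:
    po, pe, stat = result
    print(f"{label}: observed = {po:.2%}, expected = {pe:.2%}, value = {stat:.6f}")
```

The observed agreement, expected agreement and statistic printed for each method should match the corresponding lines of the output above.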

Note that for calculation of standard errors for the kappa statistics, StatsDirect uses a more accurate method than that which is quoted in most textbooks (e.g. Altman, 1991).

The statistically highly significant z tests indicate that we should reject the null hypothesis that the ratings are independent (i.e. kappa = 0) and accept the alternative that agreement is better than one would expect by chance. Do not put too much emphasis on the kappa statistic test; it makes a lot of assumptions and falls into error with small numbers.

The statistically highly significant Maxwell test statistic above indicates that the raters disagree significantly in at least one category. The generalised McNemar statistic indicates the disagreement is not spread evenly.
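To make the symmetry test concrete, the sketch below computes the generalised McNemar (Bowker) chi-square from the off-diagonal cells of the RAST/MAST table: each pair of cells (i, j) and (j, i) contributes (n_ij − n_ji)² / (n_ij + n_ji). This is an illustrative Python sketch, not StatsDirect's code. Two assumptions are made: pairs with no observations are skipped (they contribute nothing), and the degrees of freedom are counted as g(g − 1)/2 over all pairs, which matches the df = 10 reported above. The Maxwell marginal homogeneity statistic needs a matrix inversion and is not shown.

```python
# Rows = MAST, columns = RAST, categories ordered negative ... very high.
table = [
    [86, 3, 14, 0, 2],
    [26, 0, 10, 4, 0],
    [20, 2, 22, 4, 1],
    [11, 1, 37, 16, 14],
    [3, 0, 15, 24, 48],
]


def bowker_symmetry(tab):
    """Return (chi-square, df) for the test of symmetry in a square table."""
    g = len(tab)
    chi2 = 0.0
    for i in range(g):
        for j in range(i + 1, g):
            pair_total = tab[i][j] + tab[j][i]
            if pair_total > 0:  # empty pairs are skipped (assumption)
                chi2 += (tab[i][j] - tab[j][i]) ** 2 / pair_total
    df = g * (g - 1) // 2       # all off-diagonal pairs counted (assumption)
    return chi2, df


chi2, df = bowker_symmetry(table)
print(f"Symmetry chi-square = {chi2:.6f}, df = {df}")
```

The largest contributions come from the weak/negative and high/moderate pairs, which is where the two methods disagree most unevenly.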

Principal components analysis and alpha reliability coefficient

Menu locations:

Analysis_Regression & Correlation_Principal Components

Analysis_Agreement_Reliability & Reducibility

This function provides principal components analysis (PCA), based upon correlation or covariance, and Cronbach's coefficient alpha for scale reliability.

See questionnaire design for more information on how to use these methods in designing questionnaires or other study methods with multiple elements.

Principal components analysis is most often used as a data reduction technique for selecting a subset of "highly predictive" variables from a larger group of variables. For example, in order to select a sample of questions from a thirty-question questionnaire you could use this method to find a subset that gave the "best overall summary" of the questionnaire (Johnson and Wichern, 1998; Armitage and Berry, 1994; Everitt and Dunn, 1991; Krzanowski, 1988).

There are problems with this approach, and principal components analysis is often wrongly applied and badly interpreted. Please consult a statistician before using this method.

PCA does not assume any particular distribution of your original data but it is very sensitive to variance differences between variables. These differences might lead you to the wrong conclusions. For example, you might be selecting variables on the basis of sampling differences and not their "real" contributions to the group. Armitage and Berry (1994) give an example of visual analogue scale results to which principal components analysis was applied after the data had been transformed to angles as a way of stabilising variances.

Another problem area with this method is the aim for an orthogonal or uncorrelated subset of variables. Consider the questionnaire problem again: it is fair to say that a pair of highly correlated questions are serving much the same purpose, thus one of them should be dropped. The component dropped is most often the one that has the lower correlation with the overall score. It is not reasonable, however, to seek optimal non-correlation in the selected subset of questions. There may be many "real world" reasons why particular questions should remain in your final questionnaire. It is almost impossible to design a questionnaire where all of the questions have the same importance to every subject studied. For these reasons you should cast a net of questions that cover what you are trying to measure as a whole. This sort of design requires strong knowledge of what you are studying combined with strong appreciation of the limitations of the statistical methods used.

Everitt and Dunn (1991) outline PCA and other multivariate methods. McDowell and Newell (1996) and Streiner & Norman (1995) offer practical guidance on the design and analysis of questionnaires.

Factor analysis vs. principal components

Factor analysis (FA) is a child of PCA, and the results of PCA are often wrongly labelled as FA. A factor is simply another word for a component. In short, PCA begins with observations and looks for components, i.e. working from data toward a hypothetical model, whereas FA works the other way around. Technically, FA is PCA with some rotation of axes. There are different types of rotations, e.g. varimax (axes are kept orthogonal/perpendicular during rotations) and oblique Procrustean (axes are allowed to form oblique patterns during rotations), and there is disagreement over which to use and how to implement them. Unsurprisingly, FA is misused a lot. There is usually a better analytical route that avoids FA; you should seek the advice of a statistician if you are considering it.

Data preparation

To prepare data for principal components analysis in StatsDirect you must first enter them in the workbook. Use a separate column for each variable (component) and make sure that each row corresponds to the observations from one subject. Missing data values in a row will cause that row / subject to be dropped from the analysis. You have the option of investigating either correlation or covariance matrices; most often you will need the correlation matrix. As discussed above, it might be appropriate to transform your data before applying this method.

For the example of 0 to 7 scores from a questionnaire you would enter your data in the workbook in the following format. You might want to transform these data first (Armitage and Berry, 1994).

	Question 1	Question 2	Question 3	Question 4	Question 5
subject 1	5	7	4	1	5
subject 2	3	3	2	2	6
subject 3	2	2	4	3	7
subject 4	0	0	5	4	2

Internal consistency and deletion of individual components

Cronbach's alpha is a useful statistic for investigating the internal consistency of a questionnaire. If each variable selected for PCA represents test scores from an element of a questionnaire, StatsDirect gives the overall alpha and the alpha that would be obtained if each element in turn were dropped. If you are using weights then you should use the weighted scores. You should not enter the overall test score as this is assumed to be the sum of the elements you have specified. For most purposes alpha should be above 0.8 to support reasonable internal consistency. If the deletion of an element causes a considerable increase in alpha then you should consider dropping that element from the test. StatsDirect highlights increases of more than 0.1 but this must be considered along with the "real world" relevance of that element to your test. A standardised version of alpha is calculated by standardising all items in the scale so that their mean is 0 and variance is 1 before the summation part of the calculation is done (Streiner and Norman, 1995; McDowell and Newell, 1996; Cronbach, 1951). You should use standardised alpha if there are substantial differences in the variances of the elements of your test/questionnaire.
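As a rough illustration of the alpha calculations described above, the Python sketch below uses the textbook form alpha = k/(k − 1) × (1 − sum of item variances / variance of the total score), recomputes it with each element dropped in turn, and obtains the standardised version by rescaling every item to mean 0 and variance 1 first. It is a sketch under those assumptions, not StatsDirect's code, and the function names are illustrative. The data are the four-subject layout example from the data preparation section, so the numbers will not match the test workbook results shown later.

```python
from statistics import mean, stdev, variance

# Layout example from "Data preparation": one list of scores per question.
names = ["Question 1", "Question 2", "Question 3", "Question 4", "Question 5"]
items = [
    [5, 3, 2, 0],
    [7, 3, 2, 0],
    [4, 2, 4, 5],
    [1, 2, 3, 4],
    [5, 6, 7, 2],
]


def cronbach_alpha(cols):
    """Alpha from item variances and the variance of the summed score."""
    k = len(cols)
    totals = [sum(scores) for scores in zip(*cols)]  # overall score per subject
    item_var = sum(variance(c) for c in cols)
    return k / (k - 1) * (1 - item_var / variance(totals))


def standardise(col):
    """Rescale one item to mean 0 and (sample) variance 1."""
    mu, sd = mean(col), stdev(col)
    return [(x - mu) / sd for x in col]


def alpha_report(cols, labels):
    """Overall alpha plus (alpha, change) when each element is dropped."""
    overall = cronbach_alpha(cols)
    dropped = {}
    for idx, label in enumerate(labels):
        rest = cols[:idx] + cols[idx + 1:]
        a = cronbach_alpha(rest)
        dropped[label] = (a, a - overall)
    return overall, dropped


for title, cols in [("raw", items), ("standardised", [standardise(c) for c in items])]:
    overall, dropped = alpha_report(cols, names)
    print(f"{title} alpha = {overall:.6f}")
    for label, (a, change) in dropped.items():
        print(f"  drop {label}: alpha = {a:.6f}, change = {change:+.6f}")
```

A positive change for a dropped element means internal consistency improves without it, which is the pattern the text above suggests looking for.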

Technical Validation

Singular value decomposition (SVD) is used to calculate the variance contribution of each component of a correlation or covariance matrix (Krzanowski, 1988; Chan, 1982):

The SVD of an n by m matrix X is USV' = X. U and V are orthogonal matrices, i.e. V'V = VV' = I, where V' is the transpose of V. U is a matrix formed from column vectors (n elements each) and V is a matrix formed from row vectors (m elements each). S is a symmetrical (diagonal) matrix with positive diagonal entries in non-increasing order. If X is a mean-centred, n by m matrix where n>m and rank r = m (i.e. full rank) then the first r columns of V are the first r principal components of X. The positive eigenvalues of X'X and XX' are the squares of the diagonals in S. The coefficients or latent vectors are contained in V.

Principal component scores are derived from U and S via a S as trace{(X-Y)(X-Y)'}. For a correlation matrix, the principal component score is calculated for the standardized variable, i.e. the original datum minus the mean of the variable then divided by its standard deviation.

Scale reversal is detected by assessing the correlation between the input variables and the scores for the first principal component.

A lower confidence limit for Cronbach's alpha is calculated using the sampling theory of Kristof (1963) and Feldt (1965):

- where F is the F quantile for a 100(1-p)% confidence limit, k is the number of variables and n is the number of observations per variable.

Example

Test workbook (Agreement worksheet: Question 1, Question 2, Question 3, and Question 4).

Principal components (correlation)

Sign was reversed for: Question 3; Question 4

Component	Eigenvalue (SVD)	Proportion	Cumulative
1	1.92556	48.14%	48.14%
2	1.305682	32.64%	80.78%
3	0.653959	16.35%	97.13%
4	0.114799	2.87%	100%
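The Proportion and Cumulative columns follow directly from the eigenvalues: each proportion is an eigenvalue divided by the sum of all eigenvalues (for a correlation matrix of four variables that sum is 4). A short Python sketch using the values in the table above, with the illustrative function name `variance_shares`:

```python
# Eigenvalues copied from the table above.
eigenvalues = [1.92556, 1.305682, 0.653959, 0.114799]


def variance_shares(values):
    """Return (proportion, cumulative proportion) for each eigenvalue."""
    total = sum(values)
    shares, running = [], 0.0
    for ev in values:
        running += ev
        shares.append((ev / total, running / total))
    return shares


for component, (prop, cum) in enumerate(variance_shares(eigenvalues), start=1):
    print(f"{component}\t{prop:.2%}\t{cum:.2%}")
```

Here the first two components together account for just over 80% of the total variance.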

With raw variables:

Scale reliability alpha = 0.54955 (95% lower confidence limit = 0.370886)

Variable dropped	Alpha	Change
Question 1	0.525396	-0.024155
Question 2	0.608566	0.059015
Question 3	0.411591	-0.13796
Question 4	0.348084	-0.201466

With standardized variables:

Scale reliability alpha = 0.572704 (95% lower confidence limit = 0.403223)

Variable dropped	Alpha	Change
Question 1	0.569121	-0.003584
Question 2	0.645305	0.072601
Question 3	0.398328	-0.174376
Question 4	0.328003	-0.244701

You can see from the results above that questions 3 and 4 seemed to have scales going in opposite directions to the other two questions, so they were reversed before the final analysis. Dropping question 2 improves the internal consistency of the overall set of questions, but this does not bring the standardised alpha coefficient to the conventionally acceptable level of 0.8 and above. It may be necessary to rethink this questionnaire.