Comparative judgment (CJ) is an alternative method for assessing competences, based on Thurstone’s law of comparative judgment. Assessors are asked to compare pairs of students’ work (representations) and judge which one is better on a certain competence. These judgments are analyzed using the Bradley–Terry–Luce model, resulting in logit estimates for the representations. In this context, the Scale Separation Reliability (SSR), which originates in Rasch modeling, is typically used as a reliability measure. However, to the knowledge of the authors, it has never been systematically investigated whether the meaning of the SSR can be transferred from Rasch modeling to CJ.
As the meaning of this reliability measure matters for both assessment theory and practice, the current study investigates it. A meta-analysis was performed on 26 CJ assessments. For every assessment, split halves were constructed by assessor. The rank orders of the whole assessment and of the halves were correlated and compared with SSR values using Bland–Altman plots. Comparing the correlation between the two halves of an assessment with the SSR of the whole assessment showed that the SSR is a good measure of split-half reliability. Comparing the SSR of one of the halves with the correlation between the two respective halves showed that the SSR can also be interpreted as an interrater correlation. For the interpretation of the SSR as a correlation with the truth, the results are mixed.
comparative judgment (CJ), Scale Separation Reliability (SSR), reliability theory, Rasch measurement, IRT
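As an illustration of the quantities named in the abstract above, the following is a minimal sketch of how logit estimates and an SSR value could be obtained from pairwise judgments. It is not the authors' implementation. It assumes the Bradley–Terry–Luce model is fitted with a simple iterative (minorization-maximization) update, that standard errors are taken from the Fisher information, and that the SSR is computed in the usual Rasch separation form, (observed variance minus mean squared standard error) divided by observed variance. The function names `fit_btl` and `ssr` and the toy data are hypothetical.

```python
import math


def fit_btl(n_items, comparisons, iters=500):
    """Fit a Bradley-Terry-Luce model to (winner, loser) index pairs.

    Returns logit estimates (centred on zero) and their standard errors.
    """
    wins = [0] * n_items
    n = [[0] * n_items for _ in range(n_items)]  # n[i][j]: times i met j
    for w, l in comparisons:
        wins[w] += 1
        n[w][l] += 1
        n[l][w] += 1
    losses = [sum(n[i]) - wins[i] for i in range(n_items)]
    # Necessary (not sufficient) condition for finite estimates.
    if min(wins) == 0 or min(losses) == 0:
        raise ValueError("every representation needs a win and a loss")

    p = [1.0] * n_items  # strengths on the odds scale
    for _ in range(iters):
        new_p = [
            wins[i] / sum(n[i][j] / (p[i] + p[j])
                          for j in range(n_items) if j != i)
            for i in range(n_items)
        ]
        # Rescale so the geometric mean is 1, i.e. the logits sum to zero.
        g = math.exp(sum(math.log(x) for x in new_p) / n_items)
        p = [x / g for x in new_p]
    theta = [math.log(x) for x in p]

    # Standard errors from the Fisher information of each estimate.
    se = []
    for i in range(n_items):
        info = 0.0
        for j in range(n_items):
            if j != i and n[i][j]:
                p_ij = 1.0 / (1.0 + math.exp(theta[j] - theta[i]))
                info += n[i][j] * p_ij * (1.0 - p_ij)
        se.append(1.0 / math.sqrt(info))
    return theta, se


def ssr(theta, se):
    """Scale Separation Reliability in the assumed Rasch separation form."""
    mean = sum(theta) / len(theta)
    var_obs = sum((t - mean) ** 2 for t in theta) / (len(theta) - 1)
    mse = sum(s ** 2 for s in se) / len(se)
    return (var_obs - mse) / var_obs


# Hypothetical toy data: 4 representations, every pair judged 4 times,
# the higher-numbered representation winning 3 of the 4 judgments.
comparisons = []
for i in range(4):
    for j in range(i + 1, 4):
        comparisons += [(j, i)] * 3 + [(i, j)]

theta, se = fit_btl(4, comparisons)
reliability = ssr(theta, se)
```

A split-half check of the kind described in the abstract could reuse `fit_btl` on the judgments of two disjoint groups of assessors and correlate the resulting rank orders; that step is left out here to keep the sketch short.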
Comparative Judgement (CJ) aims to improve the quality of performance-based assessments by letting multiple assessors judge pairs of performances. CJ is generally associated with high levels of reliability, but reliability also varies considerably between assessments. This study investigates which assessment characteristics influence the level of reliability. A meta-analysis was performed on the results of 49 CJ assessments. The results show that the number of comparisons had an effect on the level of reliability. In addition, the probability of reaching an asymptote in the reliability, i.e., the point beyond which considerable additional effort yields only a slight increase in reliability, was larger for experts and peers than for novices. For a reliability of .70, between 10 and 14 comparisons per performance are needed; for a reliability of .90, this rises to between 26 and 37 comparisons.
Comparative Judgement (CJ); Scale Separation Reliability (SSR); performance-based assessment; task-complexity; meta-analysis