23 Dec Rater effects
Are you an experienced teacher, juggling assessment types and methods easily, but still looking for some inspiration? Or are you more of a casual passerby, not knowing what some of the terms you come across on this website mean? In either case, you’ve come to the right place! In this column, we provide an overview of testing and assessment methods, explain jargon and scientific terms in simple language, and supplement with some literature suggestions for those who’d like to keep reading. Today we talk about rater effects.
Assessment is difficult, as any teacher can attest, especially when complex skills are involved. Creativity, teamwork, and speaking or writing skills, for example, are difficult to test using fill-in-the-blank or multiple-choice questions. Open tasks such as essays, portfolios, and presentations are therefore usually used to test such skills.
The wide variation in what students produce in open tasks makes assessment difficult and time-consuming. Assessors can use analytical assessment methods, such as rubrics, but these can never cover every possible student response. Assessors therefore rely partly on their own judgement, which lets subjectivity creep into the assessment process. That subjectivity creates differences between assessors that can have undesirable effects on the student’s score. After all, who the assessor is should have no influence on the score, yet some influence is almost inevitable. Here we briefly discuss some common rater effects.
This effect occurs when the freedom in assessment is used, deliberately or not, for purposes other than unbiased assessment. This is the case, for example, when assessors give lower average scores to show that their subject is difficult, or give a particular student a higher score to encourage him or her.
Halo and horn effect
In the halo and horn effect, the assessor is influenced by characteristics of the student that usually have nothing to do with the competency being assessed, such as the student’s previous performance, handwriting, or the teacher’s general impression of the student. The halo effect occurs when an assessor rates a weaker performance too highly because the student usually performs well or because the assessor has a positive overall image of the student. The performance is then overvalued. In the opposite case, when an assessor rates a good performance too low because the student usually performs less well or because the assessor has a more negative overall view of the student, we speak of a horn effect. Here the performance is undervalued.
This effect occurs when the assessor adjusts the standard to how students perform. For example, assessors may become less strict if, after a number of assessments, it appears that most students answer the same question incorrectly.
Restriction of range
Restriction of range refers to an assessor’s personal tendency not to use the full scale. For example, a lenient assessor is more likely to rate above average, and a strict assessor is more likely to rate below average. There are also assessors who tend to always rate in the middle of the scale. In that last case, we also speak of ‘central tendency’.
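For readers who keep their scores in a spreadsheet or script, the pattern described above can be made visible with a few lines of Python. The sketch below is only an illustration: the scores, the 0-10 scale, the function name `describe_rater`, and both cutoff values are assumptions chosen for this example, not established norms.

```python
from statistics import mean, pstdev

# Hypothetical scores on a 0-10 scale: each rater scored the same five products.
scores = {
    "rater_a": [8, 9, 8, 9, 10],  # mostly high scores
    "rater_b": [2, 3, 4, 3, 2],   # mostly low scores
    "rater_c": [5, 5, 6, 5, 5],   # clustered in the middle
    "rater_d": [2, 5, 8, 4, 10],  # uses most of the scale
}

SCALE_MIN, SCALE_MAX = 0, 10
MIDPOINT = (SCALE_MIN + SCALE_MAX) / 2


def describe_rater(values, spread_cutoff=1.5, shift_cutoff=2.0):
    """Return a rough label for how a rater uses the scale.

    Both cutoffs are arbitrary illustration values (assumptions).
    """
    avg = mean(values)
    spread = pstdev(values)  # population standard deviation of the scores
    if spread >= spread_cutoff:
        return "uses a wide range"
    if avg - MIDPOINT >= shift_cutoff:
        return "narrow range, high (lenient?)"
    if MIDPOINT - avg >= shift_cutoff:
        return "narrow range, low (strict?)"
    return "narrow range, middle (central tendency?)"


labels = {rater: describe_rater(values) for rater, values in scores.items()}
for rater, label in labels.items():
    print(f"{rater}: mean={mean(scores[rater]):.1f}, {label}")
```

A narrow range is only a signal, not proof: the products themselves may simply have been of similar quality, which is why the labels end in a question mark.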
Sequence effect
We speak of a sequence effect when the assessor’s assessment is guided by prior assessments. For example, after a series of weaker products, an assessor will give an average-quality product a higher rating than if that product followed some very good products.
This effect occurs when multiple assessors rate the same product very differently because they use different criteria, interpret criteria differently, or weight aspects differently.
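When several assessors have scored the same products, large differences of this kind can be spotted by comparing the highest and lowest score per product. The following is a minimal sketch under assumed data: the essay names, assessor names, scores, the helper functions `score_gap` and `flag_disagreements`, and the `max_gap` threshold are all hypothetical.

```python
# Hypothetical scores (0-10) that three assessors gave to the same four products.
ratings = {
    "essay_1": {"assessor_a": 7, "assessor_b": 7, "assessor_c": 8},
    "essay_2": {"assessor_a": 4, "assessor_b": 9, "assessor_c": 6},
    "essay_3": {"assessor_a": 5, "assessor_b": 6, "assessor_c": 5},
    "essay_4": {"assessor_a": 9, "assessor_b": 3, "assessor_c": 8},
}


def score_gap(product_scores):
    """Difference between the highest and lowest score for one product."""
    values = list(product_scores.values())
    return max(values) - min(values)


def flag_disagreements(all_ratings, max_gap=2):
    """List the products whose score gap exceeds max_gap (an assumed threshold)."""
    return [
        product
        for product, product_scores in all_ratings.items()
        if score_gap(product_scores) > max_gap
    ]


flagged = flag_disagreements(ratings)
print("Products to discuss:", flagged)
```

The flagged products are good candidates for a conversation between assessors about which criteria they used and how they weighted them.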
Because assessors are humans and not computers, rater effects are almost impossible to avoid completely. However, you can try to reduce the likelihood that they occur. To begin with, it helps a lot to be aware of the possible rater effects, so reading this article is already a step in the right direction.
Furthermore, you can use comparative judgement methods. After all, research shows that comparative judgement does lead to higher reliability and validity. How exactly comparative judgement helps prevent rater effects will be explained in a subsequent article.
De Gruijter, D. N. M. (2008). Toetsing en toetsanalyse. Leiden: ICLON, Sectie Onderwijsontwikkeling Universiteit Leiden.
Van Berkel, H. (2017). Toetsen in het hoger onderwijs. A. Bax, & D. Joosten-ten Brinke (Eds.). Bohn Stafleu van Loghum, pp. 287-289.
Van Gasse, R., Bouwer, R., Goossens, M., & De Maeyer, S. (2017). Competenties kwaliteitsvol beoordelen met D-PAC. Examens: Tijdschrift voor de Toetspraktijk, 1(1), 11-17.