
03 Feb Relative vs absolute assessment and standards
‘Comparative judgement? I’m not going to do that. We work with absolute standards. I don’t want students to be assessed against each other; that’s not fair.’ When we give presentations or workshops on comparative judgement, we often hear this misunderstanding from attendees. But comparative judgement does not necessarily mean you are using a relative standard. How come? And what exactly do these different forms of assessment and standard setting mean? Let us explain.
Assessment and standard setting
First, consider the difference between assessment and standard setting. Although the two concepts have a lot in common, there is an important distinction. Assessing is the process of assigning a judgement to the quality of a particular product. Setting a standard is about the decision you make with that assessment: does someone pass or fail? Does someone get an 8 or a 9? By setting a standard, you turn your assessment into a decision about the performance level.
Are you assessing? Then you are not necessarily setting a standard. Indeed, comparative judgement explicitly separates these two processes. Comparative judgement is based on comparing in order to assess. Comparison is a very natural process that ensures reliable and valid assessment of complex tasks¹,². Multiple assessors compare student works at the same time to arrive at a common ranking, in which the works are ordered from the weakest to the strongest.
The ranking alone does not yet say anything about grades or passing and failing. In other words, you are comparing to arrive at a judgement, but you are not yet setting a standard. You know that one work is better than another, but you do not yet know how good a work is compared to a standard. The ranking could span from a 6 to a 7 on a scale of 1 to 10, but equally from a 1 to a 10. So comparative judgement is a relative process, but you can still choose whether to set the standard absolutely or relatively.
Relative standard setting
Both absolute and relative standard setting are about how you arrive at summative judgements on students’ exams or tests (e.g. from 1 to 10, or from A to F). Relative standard setting means that you let the results depend on the performance of the group³. Suppose that, on a knowledge test, the best student scores 35 out of 40 points. With relative standard setting, this student gets, for example, a 10, and all other students get grades based on this result. This way of working is also called grading on a curve.
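The example above can be sketched in a few lines of Python. This is a hypothetical illustration of one simple curving scheme, not a prescribed method: the best raw score in the group is mapped to the top grade, and every other grade is scaled proportionally to it.

```python
def curve_grades(raw_scores, max_grade=10):
    """Map each raw score to a grade relative to the group's best score."""
    best = max(raw_scores)
    return [round(score / best * max_grade, 1) for score in raw_scores]

# The best student (35 points) gets a 10; everyone else is graded
# relative to that result, not to the test's 40-point maximum.
print(curve_grades([35, 28, 21, 14]))  # → [10.0, 8.0, 6.0, 4.0]
```

Note that the test’s actual maximum (40 points) plays no role here: the group’s best performance defines the scale, which is exactly what makes the standard relative.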
There are many criticisms of relative standard setting⁴. For instance, a student in a high-performing class may be at a disadvantage compared to one in a low-performing class: with a score of 25 on the exam, he or she will get a lower grade in the high-performing class than in the low-performing class. It may also be that a large part of the class scores so poorly that, based on the learning objectives, they should not pass the exam. Yet with relative standard setting, a certain percentage of students essentially always passes³. Students who actually learned too little might still get a pass, and students who learned very well might fail because (too) many students do even better.
Absolute standard setting
Knowing that there are so many drawbacks to relative standard setting, it seems logical to adopt absolute standard setting. With absolute standard setting, the grading is independent of the performance of others. You establish in advance a clear list of standards that must be met to achieve a certain grade³. This way of setting the standard is common in most exams in Flanders and the Netherlands.
However, there are also disadvantages to absolute standard setting. First, you want to use a reasoned standard: what is the minimum level that students have to achieve, and why that level? And how does this translate into a concrete grade on an exam? We often choose a score of 55% or 60% correct as the pass mark, but why exactly? It often proves difficult to justify this specific choice.
In addition, absolute standards do not take into account the difficulty of the exam itself or the circumstances⁵. Take the COVID-19 period as an example. Perhaps the results of students in your subject were suddenly lower than in other years. Is it then appropriate to maintain the same standard for these students? A lower result reflects not only the students’ competence but also the teaching offered and the difficulty of the exam or test. Perhaps the score of the best student in the group was the highest achievable score this year. The Cohen-Schotanus method is a compromise between absolute and relative standard setting that takes into account the difficulty of the exam⁵. The score of the group of best-performing students then counts as the highest achievable score.
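The core idea of the Cohen-Schotanus compromise can be sketched as follows. This is a simplified illustration, not the full published procedure (which also addresses, for example, correction for guessing): the mean score of the best-performing students, rather than the test’s theoretical maximum, serves as the reference point, and the pass mark is set at a fixed percentage of that reference. The top-5% and 60% values below are illustrative assumptions.

```python
def cohen_schotanus_cutoff(scores, top_fraction=0.05, pass_percentage=0.60):
    """Pass mark as a fixed percentage of the top performers' mean score."""
    ranked = sorted(scores, reverse=True)
    n_top = max(1, round(len(ranked) * top_fraction))
    reference = sum(ranked[:n_top]) / n_top  # best performers define "100%"
    return reference * pass_percentage

# If the best performance this year is 35/40, the pass mark adapts to
# that reference instead of sitting at a fixed 60% of 40 points.
scores = [35, 33, 30, 28, 25, 24, 22, 20, 18, 15]
print(cohen_schotanus_cutoff(scores))  # → 21.0
```

The absolute element is the fixed percentage; the relative element is the reference score, which moves with the difficulty of this particular exam.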
In general, there are pros and cons to both methods of standard setting, and it is important to make a reasoned choice between them. Do you aim to admit the 10% best students to a programme? Then you might want to use relative standards. Do you want to make sure that every student achieves a certain standard because the consequences of low mastery could be severe, such as crashes or deaths? Then absolute standard setting is the more obvious choice.
Rubric or comparative?
There is something to be said for separating assessing and standard setting more explicitly. When using a criteria list or a rubric, points are often awarded per element of the assessment: think of 0, 1, or 2 points per category of the rubric. This score is usually easy to translate into a decision such as a pass/fail or a grade. It is therefore not surprising that, as an assessor, you see assessing and deciding as one and the same activity: while assessing, you are already working towards the final score a student is going to get.
The disadvantage of this way of working is that various rater effects occur. An important effect in this context is central tendency: assessors tend to give a score in the middle of a scale and prefer not to give an extreme score⁶. The use of a rubric may thus cause under- or over-scoring, because assessors prefer not to award full marks on a criterion even when the description accompanying that criterion fits the performance. Comparative judgement does not yet involve scoring and making decisions, which, among other things, minimises the central tendency effect.
From comparative judgement to standard setting
In Comproved, you arrive at a ranking after assessing. You can then choose to set the standard absolutely or relatively. For instance, you can set the standard absolutely by choosing two works on the ranking and assigning them a grade in a joint discussion with colleagues. You can also determine the grades by adding two works (benchmarks) that have already been graded in advance to the comparisons. The tool calculates the grades of all works based on the two works you graded yourself. With this method, it is certainly possible that no student gets a pass, or they all do. If you want to know more about how standard setting in Comproved works practically, read on here.
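The benchmark approach can be sketched schematically. This is an illustrative simplification, not Comproved’s actual implementation: comparative judgement places each work on a continuous quality scale, and once two benchmark works on that scale have fixed grades, a linear mapping determines the grade of every other work. All values and the function name are hypothetical.

```python
def grades_from_benchmarks(scale_values, bench_a, bench_b):
    """bench_a and bench_b are (scale_value, grade) pairs for two benchmarks."""
    (xa, ga), (xb, gb) = bench_a, bench_b
    slope = (gb - ga) / (xb - xa)  # grade gained per unit of quality scale
    return [round(ga + slope * (x - xa), 1) for x in scale_values]

# Five works on a quality scale from -2.0 to 2.0; the two benchmarks
# were graded 5 and 8 in advance.
print(grades_from_benchmarks([-2.0, -1.0, 0.0, 1.0, 2.0],
                             bench_a=(-1.0, 5), bench_b=(1.0, 8)))
# → [3.5, 5.0, 6.5, 8.0, 9.5]
```

Note how the mapping follows the ranking wherever it lies: if all works sit above both benchmarks, everyone passes; if all sit below, no one does.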
Conclusion
Comparative judgement is not the same as relative standard setting. You use a comparative approach to arrive at a reliable and valid ranking, after which you can use different methods to set the standard. The main advantage is that with comparative judgement, you have ultimately determined the final judgement collectively with as few rater effects as possible, whether the judgement is pass/fail or a specific grade. Want to know more about comparative judgement? Read on here!
Literature
¹Lesterhuis, M., Bouwer, R., van Daal, T., Donche, V., & De Maeyer, S. (2022). Validity of comparative judgement scores: how assessors evaluate aspects of text quality when comparing argumentative texts. Frontiers in Education, 7. https://doi.org/10.3389/feduc.2022.823895
²Verhavert, S., De Maeyer, S., Donche, V., & Coertjens, L. (2017). Scale separation reliability: what does it mean in the context of comparative judgement? Applied Psychological Measurement, 9, 1-18. https://doi.org/10.1177/0146621617748321
³Lok, B., McNaught, C., & Young, K. (2016). Criterion-referenced and norm-referenced assessments: compatibility and complementarity. Assessment & Evaluation in Higher Education, 41(3), 450-465. https://doi.org/10.1080/02602938.2015.1022136
⁴Kjærgaard, A., Buhl-Wiggers, J., & Mikkelsen, E. N. (2023). Does gradeless learning affect students’ academic performance? A study of effects over time. Studies in Higher Education. https://doi.org/10.1080/03075079.2023.2233007
⁵Cohen-Schotanus, J., & Van der Vleuten, C. P. M. (2010). A standard setting method with the best performing students as point of reference: Practical and affordable. Medical Teacher, 32, 154-160. https://doi.org/10.3109/01421590903196979
⁶Leckie, G., & Baird, J. (2011). Rater effects on essay scoring: A multilevel analysis of severity drift, central tendency, and rater experience. Journal of Educational Measurement, 48(4), 399-418. https://doi.org/10.1111/j.1745-3984.2011.00152.x