27 Nov

Reliability of assessments
Are you an experienced teacher, juggling assessment types and methods with ease, but still looking for some inspiration? Or are you more of a casual passerby, unsure what some of the terms you come across on this website mean? In either case, you’ve come to the right place! In this column, we provide an overview of testing and assessment methods, explain jargon and scientific terms in simple language, and add some literature suggestions for those who’d like to read further. Today we talk about reliability of assessments.
Assessment and reliability go hand in hand. Along with validity, reliability is perhaps the most important concept related to assessment. At the same time, determining reliability is not always easy. If the exam board asks you to report the reliability of your test, where do you get that information? With a multiple-choice test you may know where to look, but what if your assessment consists of an authentic assignment that is graded more holistically? We’ll help you out!
What is reliability?
Assessments can take many forms, depending on your purpose. Students can take a knowledge test that asks them to reproduce factual knowledge. If your subject is about acquiring skills, an assignment might be used instead.
In all cases, the reliability of an assessment is about the precision with which you measure. High reliability means you are assessing consistently (repeatably). Suppose you administer a test and grade it. The next year you administer the same test under roughly the same conditions: the student groups are similar and they have had the same education. If your test is reliable, you expect equivalent results from the students in both years. Note: reliability is not about what you assess; that is the validity of a test. High reliability is necessary for validity, but does not by itself guarantee it.
Reliability is often calculated numerically. There are different measures of reliability. A few examples:
- Internal consistency: If you are trying to measure the same competency or skill with multiple questions or assignments in the same test, you expect a student to score about the same on all these questions or assignments. If this is the case, the test is internally consistent. Cronbach’s Alpha is often used as a measure.
- Inter-rater reliability: The degree to which different, separate assessors assess similarly. Cohen’s Kappa is a commonly used measure for this; it corrects for the probability that assessors happened to arrive at the same assessment by coincidence. If you do not correct for this, correlation can also be calculated (e.g. Pearson’s r).
- Intra-assessor reliability: The degree to which an assessor who reviews the same work twice arrives at a similar assessment both times. This, too, can be computed as a correlation.
- Test-retest reliability: The degree to which you consistently get the same results among students when you administer the same test again. Usually it is impractical to administer the same test to the same students twice under exactly the same conditions.
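To make the first two measures concrete, here is a minimal sketch of how they can be computed. The data, scores, and function names are hypothetical, purely for illustration: Cronbach’s Alpha from a students-by-questions score matrix, and Cohen’s Kappa for two assessors grading the same set of works.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's Alpha for an (n_students x n_questions) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of questions
    item_vars = scores.var(axis=0, ddof=1)       # variance per question
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa: agreement between two raters, corrected for chance."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    p_o = np.mean(a == b)                        # observed agreement
    # expected agreement if both raters had judged independently
    p_e = sum(np.mean(a == c) * np.mean(b == c) for c in np.union1d(a, b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical data: 5 students, 4 questions each scored 0-5
scores = [[4, 5, 4, 5],
          [2, 3, 2, 2],
          [5, 5, 4, 5],
          [1, 2, 1, 2],
          [3, 3, 3, 4]]
print(round(cronbach_alpha(scores), 2))  # → 0.98 (internally consistent)

# Two assessors grading the same 8 works as pass/fail (hypothetical)
a = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
print(round(cohens_kappa(a, b), 2))      # → 0.47 (moderate agreement)
```

Note how Kappa (0.47) is well below the raw agreement rate (0.75): the correction acknowledges that with only two grades available, assessors will often agree by chance alone.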
In daily practice, reliability is rarely calculated exactly, especially for more complex assignments. Measures of inter-rater reliability are useful only under limited circumstances: Cohen’s Kappa, for example, is defined for exactly two assessors, so it is no longer appropriate when more assessors are involved. If you as a teacher do want to calculate the reliability of a test yourself, it is therefore not always easy unless you are experienced with statistics. Sometimes you have no way of collecting suitable data at all, for example because having all the work reviewed by multiple assessors would take far too much time.
There are several ways through which you can increase the reliability of assessments. When assessing skills or competencies, you use answer models, criteria lists or observation forms to assess quality. The aim then is to state as clearly as possible what performance is required or what elements of an answer are correct. This can take the form of a rubric, for example, in which you describe what each level of performance looks like. You increase reliability because all assessors use the same instrument.
Still, a tool like a rubric is not enough. Earlier we wrote about the rater effects that can exist. These affect inter- and intra-assessor reliability, which means you want to avoid them as much as possible. If there are multiple assessors involved in the assessment, it is also important that these assessors use the assessment instruments in the same way. You can organize a calibration session where you review a work together and agree on the use of your assessment tool. In addition, you can try to have assessors assess at the same time of day, in the same circumstances.
The common denominator in all these measures is that you assume objectivity to ensure reliability. One might ask whether that is feasible in every situation. When assessing knowledge, being objective is relatively easy, because a correct answer leaves little room for interpretation. When assessing complex skills, more is involved: criteria in rubrics can hardly be made clear and explicit enough to truly objectively assess how well a student can do something¹. This leads to differences between assessors, which in turn reduces reliability.
There is a way to do justice to differences in teachers’ interpretations and take them into account to achieve a reliable assessment, especially in open and complex tasks. This can be done through comparative judgement. Comparative judgement is based on the principle that people assess more reliably when they compare student work than when they score each work individually²,³. As a result, inter-rater reliability is higher compared to using rubrics or assessment forms.
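To make the principle tangible: comparative judgement tools typically fit a statistical model that turns many pairwise judgements into a quality score per work, from which a ranking follows. A minimal sketch, with hypothetical judgement data, using the Bradley–Terry model (a common choice for paired comparisons; the specific data and names here are illustrative, not how any particular tool implements it):

```python
import numpy as np

def bradley_terry(n_works, judgements, iters=100):
    """Estimate a quality score per work from pairwise judgements,
    using the classic iterative MLE update for the Bradley-Terry model."""
    beat = np.zeros((n_works, n_works))
    for winner, loser in judgements:
        beat[winner, loser] += 1           # winner was judged better once
    n_pair = beat + beat.T                 # times each pair was compared
    total_wins = beat.sum(axis=1)          # "wins" per work
    p = np.ones(n_works)                   # start with equal strengths
    for _ in range(iters):
        denom = (n_pair / (p[:, None] + p[None, :])).sum(axis=1)
        p = total_wins / denom             # MLE fixed-point update
        p /= p.sum()                       # fix the overall scale
    return p

# Hypothetical judgements: each tuple is (better work, worse work)
judgements = [(0, 1), (0, 1), (1, 0), (0, 2), (0, 2),
              (1, 2), (1, 2), (2, 1)]
strengths = bradley_terry(3, judgements)
print(np.argsort(strengths)[::-1])  # → [0 1 2]: works ranked best to worst
```

The occasional “upset” judgements, such as work 2 once beating work 1, do not flip the ranking; they just pull the estimated scores closer together. Adding more comparisons sharpens the estimates, which is why making extra comparisons increases the reliability of the result.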
Would you like to use comparative judgement in your teaching? Comproved is very suitable for this. In Comproved you can always see the reliability of your assessment in the results of your assessment. If the reliability is too low, you can increase it at any time by making more comparisons. Contact us, we are happy to help you!
¹Coertjens, L., Lesterhuis, M., Verhavert, S., Van Gasse, R., & De Maeyer, S. (2017). Teksten beoordelen met criterialijsten of via paarsgewijze vergelijking: een afweging van betrouwbaarheid en tijdsinvestering. Pedagogische Studiën, 94(4), 283–303. Link: https://repository.uantwerpen.be/docman/irua/e71ea9/147930.pdf
²Verhavert, S., Bouwer, R., Donche, V., & De Maeyer, S. (2019). A meta-analysis on the reliability of comparative judgement. Assessment in Education: Principles, Policy & Practice, 26(5), 541–562. Link: https://www.tandfonline.com/doi/pdf/10.1080/0969594X.2019.1602027?casa_token=jkUUf2kviAQAAAAA:IpNFEQH1vcDjIQc3dz6Yl-dlOS4AqRZJ4fHnksy2-llneI5VnPYVFOoQh8yIt9_N92tHz2oPBEp17Q
³Lesterhuis, M., Donche, V., De Maeyer, S., Van Daal, T., Van Gasse, R., Coertjens, L., Verhavert, S., Mortier, A., Coenen, T., Vlerick, P., et al. (2015). Competenties kwaliteitsvol beoordelen: brengt een comparatieve aanpak soelaas? Tijdschrift voor hoger onderwijs, 33(2), 55–67. Link: http://hdl.handle.net/10067/1283920151162165141