Unintended consequences of approaches to marking and assessment

Painting of a group of people facing one way, with one person facing the other. Original artwork by Kelly Zou, MA Illustration student at Edinburgh College of Art.

In this post, Michael Daw, Director of Quality in the Deanery of Biomedical Sciences, writes about the impact of different marking schemes and how our beliefs about what we are achieving when assessing students are not always backed up by evidence…


As Director of Quality in the Deanery of Biomedical Sciences, I have become aware that a number of our external examiners comment that we do not use the full range of marks. Most often this refers to a lack of high marks (e.g. above 80%), but also to a generally narrow range of marks. As a marker in the Deanery, I am aware that a range of different marking schemes is used across our courses. Deanery-wide guidance on preparing and using marking schemes has long been proposed, but I felt that, before issuing this guidance, we needed to understand the impact of different marking schemes on mark distributions. For example, external examiners have pushed for the subdivision of A-grade descriptors into specific descriptors for A1, A2 and A3, in the belief that this will encourage high A marks. However, there is currently no evidence to suggest whether this is effective.

To address this, I analysed all the marking schemes for Year 1-3 courses run by Biomedical Sciences in Edinburgh, along with all the 4th-year courses that are required core courses for specific Biomedical Sciences honours programmes. I excluded programmes only available to intercalating medical students, and 4th-year elective courses, which often have very small cohorts. I also analysed the mark distributions for each assignment. In total, this covered 152 assessments with 13,745 submissions.

For brevity, I will not describe all of the analysis or results here, but I am happy to share them, along with details of the statistical tests supporting the conclusions, with anyone who is interested. Analysis of the example above shows that, surprisingly, assignments where A descriptors are subdivided resulted in a smaller proportion of marks of 80% or greater than those with a single A-grade descriptor.

Figure 1. Sub-dividing A-grade descriptors is associated with a reduction in the proportion of marks ≥ 80%.

One major categorisation of marking schemes is whether they are holistic or analytic. A holistic marking scheme results in a single mark that represents the piece of work as a whole, whilst an analytic scheme results in a series of marks for individual components, which are combined to produce a final mark. A significant proportion of our marking schemes lie between these categories: there is a single mark, but a grid is provided to give indicative grades for each category. I term these “semi-analytic”. Other assessments, such as multiple choice exams or calculations, have a single best answer (SBA) and so do not use either type of marking scheme. Proponents of analytic schemes argue that they provide greater reliability of marks and agreement between markers, whilst critics argue that they may result in narrow ranges of marks by failing to properly reward work that is brilliant in a single respect. I found no difference in the interquartile range of marks between holistic, semi-analytic and analytic schemes, whilst SBA assignments resulted in a much larger range. This argues against the supposedly restrictive effect of analytic approaches.

Figure 2. Single best answer assessments produce a wide range of marks. Holistic, semi-analytic and analytic approaches to marking have no effect on interquartile range of marks.
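For anyone who would like to run a similar comparison on their own courses, the sketch below shows one way the interquartile range and the proportion of high marks could be calculated per marking-scheme category. It is a minimal illustration only: the file name and column names (marks.csv, scheme_type, mark) are hypothetical rather than taken from the dataset used here, and the same grouping could equally be done by assessment type.

```python
import pandas as pd

# Hypothetical input: one row per submission, with a 'scheme_type' column
# (holistic / semi-analytic / analytic / SBA) and a 'mark' column holding
# the percentage mark awarded.
marks = pd.read_csv("marks.csv")

def iqr(series):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return series.quantile(0.75) - series.quantile(0.25)

summary = marks.groupby("scheme_type")["mark"].agg(
    iqr=iqr,                                   # spread of marks within each category
    prop_80_plus=lambda s: (s >= 80).mean(),   # proportion of marks >= 80%
    n="count",                                 # number of submissions
)
print(summary.round(2))
```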

None of our analytic or SBA assignments were double marked, but there was no difference in the correlation between markers for holistic and semi-analytic schemes, suggesting that neither approach is inherently more consistent.

Figure 3. Holistic and semi-analytic approaches to marking are equally consistent between markers.
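Similarly, a rough sense of marker agreement can be obtained by correlating first- and second-marker marks within each scheme category. The sketch below is purely illustrative: the column names (marker1_mark, marker2_mark) are hypothetical, and Pearson correlation is only one of several possible agreement measures.

```python
import pandas as pd

# Hypothetical input: one row per double-marked submission, with the mark
# awarded by each marker and the marking-scheme category.
double_marked = pd.read_csv("double_marked.csv")

# Pearson correlation between first- and second-marker marks, per scheme type,
# as a simple proxy for between-marker consistency.
agreement = double_marked.groupby("scheme_type").apply(
    lambda g: g["marker1_mark"].corr(g["marker2_mark"])
)
print(agreement.round(2))
```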

Together with other findings not discussed here, the analysis does not preferentially support the use of either analytic or holistic schemes.

One of the main findings was that the style of marking scheme made little difference to mark distributions. However, in the course of the analysis, it became apparent that it is difficult to disentangle the effect of marking schemes from that of the type of assessment used. In fact, the type of assessment appears to have a greater effect on both the interquartile range of marks and the proportion of marks of 80% and above.

Figure 4. Different forms of assessment are associated with markedly different ranges of marks.
Figure 5. Different forms of assessment are associated with markedly different proportions of marks ≥ 80%.

One especially striking example is that of exams with long-form or essay-style answers. These are often thought of as a strong test of students' ability to apply knowledge and, as such, an effective discriminator. Contrary to this, these assignments produced the smallest proportion of marks above 80% (8 out of 14 courses did not award any marks in this range), and only presentations had a smaller interquartile range. Whilst presentations represent an important graduate skill and an authentic assessment, the same cannot be said for exam essays. This finding calls into question the utility of this type of assessment or, at least, our expectations of student performance in it.

Overall, this analysis shows that our beliefs about what we are achieving when assessing are not always backed up by evidence. These sorts of counter-intuitive findings can be very difficult to identify at the level of a single course, suggesting that School-level analysis is important to guide our practice.

Preliminary analysis was carried out by Omolabake Fakunle.


Michael Daw

Dr Michael Daw is a Senior Lecturer and Director of Quality in the Deanery of Biomedical Sciences. He graduated from Edinburgh in 1998 and returned in 2010 as a Research Fellow, before developing an increasing interest in teaching and learning.


Kelly Zou

Kelly Zou is a storyteller, illustrator, and ex-programmer. She is based in the UK and is currently studying for an MA in Illustration at Edinburgh College of Art. She received an Honorable Mention in the 2019 3×3 Illustration Awards, her work was chosen for the 2019 and 2020 Asia Illustrations Collections, and she illustrated the book “The Lost Flower Children” by Janet Taylor Lisle, published in China. She loves telling stories.
Website: https://www.kellyzou.com/

Instagram: @kelly_zxj

3 comments

  1. Fascinating analysis, Michael. Also really good to see students' work used to illustrate.
    MCQs (SBA or other) and lab marks appear to generate a wider range of results, but I wonder whether you're looking at 'scores' rather than marks; have scores in your MCQs been translated to the University marking scale for reporting? Or are they actually returning more fails, Cs and Ds?
    Long-form essays: I agree with your comments querying their dominance. Their relevance to employment (a key element of authenticity) may be queried, but they also tend to be unreliable, i.e. second marking returns a different grade, or a student's performance in a series of these can be quite divergent. Do we continue to use so many essay Qs because they are easiest for teachers to set? Probably we should be using a wider variety of methods, and thinking more about 'why' for each.

  2. For clarification: all marks are percentages, not “scores”, i.e. MCQs do indeed result in more Cs, Ds and fails.
    We need to be careful to differentiate “essays” from “exam essays”. The mark distributions are quite different.
    I suspect that both forms of essay are common both because they always have been (they are the default) and because many staff think they are a good way to test students' ability to integrate multiple sources of information and critically analyse them. For exams, in particular, I think the assumption is that you need to really understand a subject to answer a question in a time-limited setting.

  3. Thanks Michael, this is a very helpful, thought-provoking piece.

    I think in Humanities and Soc Sci, most assessments are interpretive essays and these are formally assessed using the 'semi-analytic' approach. Based on what people say when they justify their mark in conversations between 1st and 2nd markers, I think most people do their semi-analytic marking 'top-down' (i.e. they get a holistic sense of the overall quality of an essay, and then use that to justify the grades they give for specific criteria), whereas some do it more 'bottom-up' (building their justification for the final mark on the grades they give for the various criteria).

    There’s also the question of whether anyone gives implicitly (or perhaps explicitly) different weightings to various criteria. Do we make it clear to students if, for example, for a given assignment ‘originality’ matters much more than ‘structure’, ‘presentation’, and ‘evidence’?
