Differential Attainment in Summative Assessments within Postgraduate Medical Education & Training

This discussion paper has been prepared for the expert roundtable exploring ‘Differential Attainment in PG Medical Education and Training’, planned for 17 September 2020. This will be the first engagement exercise launching the 2020 Thematic series on Tackling differential attainment in Healthcare professions, bringing together an interdisciplinary Alliance on equality in healthcare professions. The paper presents a preliminary outline of the current evidence on differential attainment (DA) in high-stakes postgraduate summative assessment, explores its impact, deliberates on known causes and discusses a number of potential solutions. It is written with a view to presenting the case for tackling DA in PG summative assessments and will be accompanied by a prioritised selection of ‘focussed questions and solutions’ to be discussed at the roundtable with subject experts. The paper and roundtable will form part of, and contribute to, the thematic synthesis in the section on ‘Assessment formative and summative’. As described in the ‘protocol’, they will therefore be followed by a focussed systematic review and engagement with priority setting partnerships (via questionnaires, focus groups and workshops), and will culminate in an expert consensus. The final outcome will be presented as synthesised recommendations, solutions, policy enablers and areas for further research.

What is Differential Attainment?
Differential attainment (DA) is a term used to describe the variations in levels of educational achievement that occur between different demographic groups undertaking the same assessment. UK doctors from Black, Asian and Minority Ethnic (BAME) groups, and International Medical Graduates (IMGs), i.e. doctors whose Primary Medical Qualification (PMQ) is from a medical school outside the UK, have consistently poorer outcomes in assessments and recruitment than white doctors and UK medical school graduates. 1 2 Differential attainment has been recognised as a challenge for medical professionals and educators since the 1990s.

How big is the problem?
Ethnic minority medical graduates in the UK have 2.5 times higher odds of failing high-stakes exams. 3 Summative assessments for the membership of the Royal Colleges of Physicians (MRCP), General Practitioners (MRCGP) and Psychiatrists (MRCPsych), amongst others, have shown a consistent medium-sized ethnicity effect and a larger country of PMQ effect. This translates to a 10-15% gap in pass rates for UK BAME candidates and a larger gap of approximately 30-50% in pass rates for IMGs.

The CSA - submitting it to the panel of examiners, who will carry out objective assessments using the same criteria as used in the CSA.6 The advantage is that the GPRs can select from the consultations carried out in their own surgery environment rather than in an artificial environment involving actors.
Although, understandably, this poses some logistical challenges for trainees, especially those working remotely because of personal risks such as pregnancy or other health conditions, this format may well serve as a 'trial run' of an alternative option. There is also concern that the CSA may be an outdated method of assessment that does not reflect the changing nature of general practice.7 8

Why is Differential Attainment a problem?
Moral and Ethical Impact
Clearly, the attainment gap based on ethnicity (and country of origin) poses a significant social justice issue. The fact that these attainment gaps have persisted for decades with no institutional redress compounds the ethical and moral problem and makes the case for urgent remediation.
For IMGs, whose visas or permission to remain in the UK may be dependent on exam success, this creates uncertainty, economic instability, anxiety and undue distress. In practice, the attainment gap serves to multiply the microaggressions that BAME students, trainees and staff face in clinical and educational settings. 9 BAPIO has received testimonies from a large number of individuals in which exam-related stress has been specifically identified as a source of great personal and professional difficulty. 10

Workforce and Financial Impact
Around a third of UK medical students (n ~ 11000) and graduates (who are not Consultants or GPs) are of BAME origin (n ~ 28000).11 IMGs also constitute a very large part of the workforce and especially so in some specialities such as Psychiatry and General Practice where they constitute >35% of the workforce.12 In 2019, the number of IMGs entering the General Medical Council (GMC) register exceeded the number of UK graduates. 13 These numbers illustrate the scale and extent of the impact of DA.
The dependence of the UK National Health Service (NHS) on IMGs to deliver patient care is also evident in the high vacancy rates across the country in many clinical specialties and geographical locations, and in the high cost of providing locum cover to run essential services.14 If clinical examinations prove an unfair barrier to career progression, this may represent a significant workforce challenge with a direct adverse impact on patient care.15 Furthermore, failure in high-stakes examinations is costly (approximately £65,000 per failure) and poses a huge economic burden in further education and ancillary costs and at the organisational level. 16, 17

Impact on Patient Care
A sense of equality among health workers translates to better team working, which inevitably leads to better patient outcomes and satisfaction for the organisation. It is known that the proportion of staff believing the employing organisation provides equal opportunities for career progression or promotion "was a very important predictor of patient satisfaction." 9 Unfortunately, BAME staff routinely report microaggressions at work.

Several factors have been implicated as causative or contributory in DA. Prior educational attainment generally predicts future academic attainment, but multivariate analysis of data shows that DA in medical school finals persists even after accounting for prior educational attainment. DA also persists after accounting for socio-economic deprivation. In fact, ethnic differences in attainment persist even after controlling for type of school, personality, motivation, study habits and mental health of candidates, as well as linguistic ability, which is often cited as a cause of DA. Ethnic differences in attainment persist after controlling for one's own first language and parents' first language. 24

There is a range of factors related either to the examination itself or to the training environment leading up to the examination that may explain DA. IMGs often face additional difficulties which impede examination success due to differences in educational experience, content familiarity and language, some of which may be amenable to modification or additional support.25 Apart from the factors that have been ruled out (see above), possible candidate factors that have been implicated include relationships with peers, relationships with educators, the presence of undiagnosed and undetected learning disability such as dyslexia, and undue pressure from expectations of passing/failure. 24 Factors relating to examinations may include unconscious or conscious bias in examiners, in the recruitment of examiners, in the choice of exam questions or case selection for OSCE stations, or in standard setting and/or applying the set standards in the exam. 26,27

Are summative exams unfair?
Esmail and Roberts' study of the academic performance of ethnic minority candidates and discrimination in the MRCGP examinations between 2010 and 2012 showed that, even after controlling for performance on the machine-marked AKT, ethnic minority UK graduates were nearly four times and international medical graduates 14 times as likely to fail their first CSA attempt as white candidates. The authors concluded that "subjective bias due to racial discrimination in the CSA may be a cause of failure for UK trained candidates and IMGs." 28, 29 However, in the courts the examination was judged lawful. Others, too, have argued that DA is indicative of a true attainment gap, based on the consistent and correlated DA seen in candidates taking both MRCGP and MRCP (UK) exams, 30 31 the lack of proven ethnicity or gender bias in examiners on two-examiner stations in MRCP exams, 32 and the lack of proven role-player bias in CSA exams. 33 It is worth noting, however, that gender or ethnicity bias has not been disproven in single-examiner stations. Unconscious bias training, often provided to examiners and role players to mitigate DA, has proved ineffective, 34 and while systematic review evidence suggests that discrimination is unlikely to be the sole cause of DA, 3 the current evidence clearly does not rule out covert or overt discrimination as a cause of DA.
Assessment oversight committees and annual programmatic evaluations, while recommended, will not guarantee fairness within postgraduate medical education programmes, but they can provide a window into 'hidden' threats to fairness, as everything from training experiences to assessment practices may be open to scrutiny. 35 'Ensuring Fairness in Clinical Training and Assessment: Principles and examples of good practice', recommended by the BMA, outlines a few principles that need to be considered with respect to assessment methods.

Current Difficulties with Objective Structured Clinical Examinations (OSCE)
When evaluated against the standard criteria, independent of its ethnicity effect, a few problems emerge with the current traditional OSCE format.
Firstly, the artifice of OSCEs makes validity a significant concern. Rating scales and checklist assessment tools used to improve reliability end up rewarding mechanistic "performance" from candidates. A striking example of this problem is the paradoxical third-person rating of empathy often used in OSCEs assessing communication skills. OSCEs that reward feigned empathy rather than actual empathy have been blamed for the striking reduction in empathy seen in medical students as they progress through their medical training.36 Validity depends on high levels of fidelity, but that is usually lacking as OSCEs usually test isolated skills in a fragmented fashion. 37 38 OSCEs improve their reliability coefficients by increasing the duration of the exam, but they remain susceptible to biases in the sampling of stations. Standard setting in high-stakes exams is done variably for different cohorts and, while this could be improved, there remains the variability in examiners. All exams do review the "hawks and doves" in their examiner pool, but this categorical distinction may mask granular details, for example the finding that IMG examiners may be more hawkish. 39

Another interesting finding is that the performance of IMGs in the MRCGP clinical skills assessment was better predicted by scores on a situational judgement test, evaluating interpersonal skills, than by achievement on a knowledge-based test. 17 This is supported by previous reports that scores in part 2 of the GMC's Professional and Linguistic Assessment Board (PLAB) examination, rather than those in part 1, predicted performance in the clinical components of MRCP and MRCGP CSA exams. 31 This is of particular concern given the known discriminatory ethnicity effect (against BAME candidates) that is a consistent feature of the Situational Judgement Test. 40

Assessment does drive learning, and summative examinations clearly have a role not merely in quality assurance but also in promoting essential learning and practice that delivers high-quality and safe care for patients. However, this depends on high-quality, specific and credible feedback being delivered to failed candidates, with tailored remediation. Currently, the feedback given to failed candidates fails to meet any of these criteria. Pertinently, there is no evidence to link success or failure in OSCE-style exams with patient safety or patient outcomes.

Alternatives to OSCEs - Programmatic Assessment; multiple low-stakes assessments
There is some shift in focus within medical education, from learning discrete skills and knowledge to continuous learning with authentic tasks focused on transfer to clinical practice. The GMC's Generic Professional Capabilities Framework signals this direction very clearly and is now leading to changes in postgraduate curricula across the board. 41 The underlying message is clear: we need to move from "shows how" to "does".
The public expect their doctors to be capable of working in a range of different situations and settings, and there is wide understanding that no single assessment method can capture it all. The current assessment strategy, focusing as it does on summative assessment at a single point in time, gives little weight to longitudinal assessments.
Narrative feedback embedded in a dialogue (rather than one-way provision of feedback) is significantly more impactful in developing complex clinical skills than scores. Longitudinal and more diverse programmatic assessment can address the inherent difficulties in relying on a single data point viz. the summative OSCE examination. Moving from a sum of a few summative/formative assessments to a programme of multiple low-stakes assessment would provide multiple data points which can be optimised for learning. The format of assessments can be varied at various data points which would improve the validity of assessment.
Current summative examinations are focused on delivering a categorical pass/fail distinction, and considerable effort is expended in designing exams that are defensible. The main focus of the assessment is this decision rather than the primary function of assessment, which is to drive patient-centred learning.
Switching from decision-oriented to feedback-oriented multiple assessments, with varying degrees of stakes at each data point, would generate feedback focused on improving the quality of care for patients, something that current assessment strategies do not emphasise. Crucially, such longitudinal assessment delivers non-surprising results in the final stages of the assessment. The fact that failure in high-stakes assessment comes as a surprise to both trainers and trainees has been a significant problem with current summative exams. Those likely to fail should be identified earlier in their learning trajectory and remedial action instituted.
Such programmatic assessments are being used in many centres across the world, including in the USA, Canada and Holland. Within the UK setting, the current system of Workplace Based Assessments, Annual Review of Competency Progression and summative paper exams including OSCEs could be adapted relatively easily to create a more longitudinal, systematic and programmatic assessment. This would empower trainers to use their professional judgement (rather than relying on standard setting or on narrow checklists, which have been associated with reduced validity). Increasing the number of data points would increase the diversity of the assessment sample and potentially the diversity of the examiner pool and, aided by procedural bias reduction methods, should deliver an assessment that puts person-centred care and learning, rather than pass/fail decisions, at its heart.

Initiatives so far
• Following the legal challenge, the GMC and some Royal Colleges have had regular discussions with BAPIO and have produced examination preparation resources as well as enhanced guidance for trainers.
• RCGP has introduced an exceptional 5th attempt for some candidates in the CSA.
• A Health Education North West Pilot programme for enhanced training has been shown to improve outcomes of CSA resits.

• Use real patients rather than role players.
• Two examiners, rather than one, may mark at every station; virtual examiners, as employed in some USA systems, may reduce undue stress.
• Video of the assessment should be made available to failing candidates.
• The number of attempts may be increased or made unlimited as long as the doctor is continuing in active medical practice.
• Culvert Scoring: The Education Supervisor provides a 'culvert score' for the trainee about 6 months prior to the proposed finishing date of training. This score ranges from 0-3 depending on the overall performance of the candidate during the whole period of training and will be influenced by overall knowledge, communication skills, quality of the WPBA and several other factors. This score is not disclosed to the trainee but is available to the examining body. If a candidate falls marginally short of the CSA pass score, this culvert score may be added to the marks obtained in the CSA examination. If the candidate has already achieved the pass mark, there is no need to use a culvert score.
• Weight allocation: "Weights" may be assigned to the current three parts of the assessment (i.e. WPBA, AKT and CSA). The actual weights may be decided following a survey of trainees, trainers and examiners. Weighted scores from all three assessments may then be combined to provide the accreditation score. The required accreditation score may be fixed beforehand, again based on the survey results, for example at 65% or 70%.
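
The culvert scoring and weight allocation proposals in the list above are simple arithmetic, and a minimal sketch may help make them concrete for discussion. Everything below is an illustrative assumption rather than part of any existing RCGP or GMC scheme: the names `Weights`, `accreditation_score` and `csa_with_culvert` are hypothetical, component scores are assumed to be expressed on a common 0-100 scale, the example weights are invented, and "marginally falling short" is read simply as "below the pass mark", so that a culvert score of at most 3 can only rescue a candidate within 3 marks of passing.

```python
from dataclasses import dataclass


@dataclass
class Weights:
    """Hypothetical weights for the three components; they must sum to 1."""
    wpba: float
    akt: float
    csa: float


def accreditation_score(wpba: float, akt: float, csa: float, w: Weights) -> float:
    """Weighted combination of three component scores (assumed 0-100 each)."""
    if abs(w.wpba + w.akt + w.csa - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return w.wpba * wpba + w.akt * akt + w.csa * csa


def csa_with_culvert(csa_marks: float, pass_mark: float, culvert: int) -> float:
    """Add the supervisor's culvert score (0-3) only if the candidate falls short.

    Assumption: any score below the pass mark counts as 'falling short'; the
    culvert score is ignored when the pass mark has already been reached.
    """
    if not 0 <= culvert <= 3:
        raise ValueError("culvert score must be between 0 and 3")
    if csa_marks >= pass_mark:
        return csa_marks
    return csa_marks + culvert
```

For example, with invented weights of 0.25 (WPBA), 0.25 (AKT) and 0.5 (CSA), component scores of 60, 70 and 80 combine to an accreditation score of 72.5, which would clear a 65% or 70% threshold. A candidate scoring 72 against a hypothetical CSA pass mark of 74 with a culvert score of 3 would reach 75, whereas a candidate scoring 60 would reach only 63 and still fail.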

• Promoting cultural safety, cultural humility and decolonization of the curriculum and content
• Address the conscious and unconscious biases that exist amongst tutors as well as examiners