Interrater consistency

1.

Development and Validation of a Training-Embedded Speaking Assessment Rating Scale: A Multifaceted Rasch Analysis in Speaking Assessment

Authors: Houman Bijani, Bahareh Hashempour, Salim Said Bani Orabah

Source: Research in English Education, Volume 7, Issue 3 (2022), 32-45

Keywords: bias, interrater consistency, intrarater consistency, multifaceted Rasch measurement (MFRM), rater training, rating scale

Subject area: Linguistics

Performance testing with rating scales has become widespread in second/foreign language oral assessment. However, no single study has applied Multifaceted Rasch Measurement (MFRM) to the facets of test takers’ ability, raters’ severity, group expertise, and scale category together. Twenty EFL teachers scored the speaking performance of 200 test takers before and after a rater training program, using an analytic rating scale consisting of fluency, grammar, vocabulary, intelligibility, cohesion, and comprehension categories. The results showed that the categories remained at different levels of difficulty even after the training program. This does not mean the training program was ineffective: the analysis showed that training made raters sufficiently consistent in rating each category of the scale at the post-training phase, which indicates that raters could discriminate among the scale’s categories. The results also indicated that MFRM can help improve rater training and validate the functioning of the rating scale descriptors. Training helped raters use the scale’s band descriptors more efficiently, which reduced the halo effect. The findings suggest that stakeholders should establish training programs to help raters make appropriate use of rating scale categories that differ in difficulty. Further research could compare the outcome of this study with that of a study using a holistic rating scale in oral assessment.

2.

Facet Variability in the Light of Rater Training in Measuring Oral Performance: A Multifaceted Rasch Analysis (scientific article, Ministry of Science)

Authors: Houman Bijani, Salim Said Bani Orabah

Source: Issues in Language Teaching (ILT), Vol. 11, No. 2, December 2022, 255-290

Keywords: bias, interrater consistency, multifaceted Rasch measurement (MFRM), rater training, severity/leniency

Subject area: Linguistics

Because oral assessment is subjective, much attention has been given to obtaining a satisfactory level of consistency among raters. However, consistency alone might not result in valid decisions. One matter at the core of both reliability and validity in oral performance assessment is rater training. Recently, Multifaceted Rasch Measurement (MFRM) has been adopted to address the problem of rater bias and inconsistency; however, no research has incorporated the facets of test takers’ ability, raters’ severity, task difficulty, group expertise, scale criterion category, and test version together in a single study along with their two-way impacts. Moreover, little research has investigated how long rater training effects last. Consequently, this study explored the influence of a training program and feedback by having 20 raters score the oral production of 300 test takers, as measured by the Community English Program (CEP) test, in three phases: before, immediately after, and long after the training program. The results indicated that training can lead to higher interrater reliability and to lower severity/leniency and bias. However, it does not bring raters to total unanimity; it only makes them more self-consistent. Although rater training might result in higher internal consistency among raters, it cannot eradicate individual differences. That is, experienced raters, due to their idiosyncratic characteristics, did not benefit as much as inexperienced ones. The study also showed that the effects of training might not endure in the long run; ongoing training is therefore required so that raters can regain consistency.

3.

Construct Validation of a Rating Scale through a Training Program: A Multifaceted Rasch Analysis in Speaking Assessment (scientific article, Ministry of Science)

Authors: Wander Lowie, Houman Bijani, Mohammadreza Oroji, Zeinab Khalafi, Pouya Abbasi

Source: Iranian Journal of Applied Linguistics (IJAL), Vol. 26, No. 2, September 2023, 48-80

Keywords: bias, interrater consistency, intrarater consistency, multifaceted Rasch measurement (MFRM), rater training, rating scale

Subject area: Linguistics

Performance testing with rating scales has become highly widespread in second/foreign language oral assessment. However, few studies have used a pre- and post-training design to investigate the impact of a training program on reducing raters’ biases toward the rating scale categories and thereby increasing their consistency measures. Besides, no study has used Multifaceted Rasch Measurement (MFRM) with the facets of test takers’ ability, raters’ severity, task difficulty, group expertise, scale category, and test version all in a single study. Twenty EFL teachers rated the oral performances of 200 test takers before and after a training program, using an analytic rating scale comprising fluency, grammar, vocabulary, intelligibility, cohesion, and comprehension categories. The results indicated that MFRM can be used to investigate raters’ scoring behavior and can help improve rater training and validate the functioning of the rating scale descriptors. Training can also lead to higher interrater consistency and lower severity/leniency; it cannot turn raters into duplicates of one another, but it can make them more self-consistent. Training helped raters use the scale’s band descriptors more efficiently, which reduced the halo effect. Finally, the raters showed improved consistency and reduced rater-by-scale-category bias after the training program. The remaining differences in bias measures can probably be attributed to raters interpreting the scoring rubrics in different ways, owing to their confusion about how to apply the scale accurately.