PTAM 2018-1 | Rater effects: Advances in item response modeling of human ratings – Part II

Rater effects: Advances in item response modeling of human ratings – Part II

Thomas Eckes (Ed.)

CONTENT

Testing psychometric properties of the CFT 1-R for students with special educational needs
Jörg-Henrik Heine, Markus Gebhard, Susanne Schwab,
Phillip Neumann, Julia Gorges & Elke Wild
Full article .pdf (Diamond Open Access)

Guest Editorial
Rater effects: Advances in item response modeling of human ratings – Part II

Thomas Eckes
Full article .pdf (Diamond Open Access)

A tale of two models: Psychometric and cognitive perspectives on rater-mediated assessments using accuracy ratings
George Engelhard, Jr., Jue Wang, & Stefanie A. Wind
Full article .pdf (Diamond Open Access)

Modeling rater effects using a combination of Generalizability Theory and IRT
Jinnie Choi & Mark R. Wilson
Full article .pdf (Diamond Open Access)

Comparison of human rater and automated scoring of test takers' speaking ability and classification using Item Response Theory
Zhen Wang & Yu Sun
Full article .pdf (Diamond Open Access)

Item response models for human ratings: Overview, estimation methods, and implementation in R
Alexander Robitzsch & Jan Steinfeld
Full article .pdf (Diamond Open Access)

ABSTRACTS

Testing psychometric properties of the CFT 1-R for students with special educational needs
Jörg-Henrik Heine, Markus Gebhard, Susanne Schwab, Phillip Neumann, Julia Gorges & Elke Wild

Abstract
The Culture Fair Intelligence Test CFT 1-R (Weiß & Osterland, 2013) is one of the most widely used tests in Germany for diagnosing learning disabilities (LD). The test is constructed according to classical test theory and provides age-specific norms for students with LD in special schools. In our study, we analyzed the test results of 138 students in special schools and 166 students with LD in inclusive settings in order to test measurement invariance between students with LD educated in these two different settings. Data were analyzed within an IRT framework using a non-iterative approach to (item) parameter recovery. This approach parallels the principle of limited-information estimation, which allows for IRT analyses based on small datasets. Analyses of differential item functioning (DIF) as well as tests for global and local model violations with regard to the two subgroups were conducted. The results confirmed the assumption of measurement invariance across inclusive and exclusive (special school) educational settings for students with LD.

Keywords: Measurement invariance, Rasch model, item parameter recovery, limited information estimation, learning disabilities
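The invariance check described in the abstract can be sketched in R. The snippet below is a minimal illustration only, not the authors' analysis: it fits a dichotomous Rasch model by conditional maximum likelihood with the eRm package and screens for DIF across the two educational settings, whereas the paper itself relies on a non-iterative pairwise approach to parameter recovery. The objects resp (a 0/1 response matrix) and setting (a factor coding special school vs. inclusive setting) are hypothetical.

# Minimal sketch (not the authors' code): Rasch-based DIF/invariance screen
library(eRm)

# resp:    hypothetical 0/1 item-response matrix (rows = students)
# setting: hypothetical factor, "special school" vs. "inclusive"
fit <- RM(resp)                        # Rasch model, conditional ML
lr  <- LRtest(fit, splitcr = setting)  # Andersen LR test across settings
summary(lr)                            # global test of invariance
Waldtest(fit, splitcr = setting)       # item-level Wald tests (DIF screen)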

Jörg-Henrik Heine
Technical University of Munich
TUM School of Education
Centre for International Student Assessment (ZIB)
Arcisstr. 21
D-80333 München, Germany

joerg.heine@tum.de


Guest Editorial
Rater effects: Advances in item response modeling of human ratings – Part II
Thomas Eckes

Thomas Eckes, PhD
TestDaF Institute
University of Bochum
Universitätsstr. 134
44799 Bochum, Germany

thomas.eckes@testdaf.de


A tale of two models: Psychometric and cognitive perspectives on rater-mediated assessments using accuracy ratings
George Engelhard, Jr., Jue Wang, & Stefanie A. Wind

Abstract
The purpose of this study is to discuss two perspectives on rater-mediated assessments: a psychometric and a cognitive perspective. In order to obtain high-quality ratings in rater-mediated assessments, it is essential to be guided by both perspectives. It is also important that the specific models selected are congruent and complementary across perspectives. We discuss two measurement models based on Rasch measurement theory (Rasch, 1960, 1980) to represent the psychometric perspective, and we emphasize the Rater Accuracy Model (Engelhard, 1996, 2013). We build specific judgment models to reflect the cognitive perspective on rater scoring processes based on Brunswik's Lens model framework. We focus on differential rater functioning in our illustrative analyses. Raters who hold inconsistent perceptions may provide different ratings, and this may cause various types of inaccuracy. We use a data set consisting of ratings of 100 essays written by Grade 7 students, scored by 20 operational raters and three expert raters. Student essays were scored using an analytic rating rubric for two domains: (1) idea development, organization, and cohesion; and (2) language usage and conventions. Explicit consideration of both psychometric and cognitive perspectives has important implications for rater training and for maintaining the quality of ratings obtained from human raters.

Keywords: Rater-mediated assessments, Rasch measurement theory, Lens model, Rater judgment, Rater accuracy
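As a rough sketch of how accuracy ratings of this kind are typically formed and scaled (an assumption about the general approach, not the authors' analysis), operational ratings can be compared against expert criterion ratings and the resulting accuracy indicators scaled with a Rasch model in R; all object names below are hypothetical.

# Rough sketch, not the authors' analysis: form accuracy indicators by
# comparing operational with expert (criterion) ratings, then scale them
# with a Rasch model (raters in the role of "persons", essays as "items").
library(eRm)

# op_ratings: hypothetical 20 x 100 matrix of operational ratings (raters x essays)
# criterion:  hypothetical vector of 100 expert (criterion) ratings, one per essay
accuracy <- 1 * sweep(op_ratings, 2, criterion, FUN = "==")  # 1 = accurate
fit <- RM(accuracy)   # rater accuracy measures on the logit scale
summary(fit)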

George Engelhard, Jr., Ph.D.
Professor of Educational Measurement and Policy
Quantitative Methodology Program, Department of Educational Psychology
325W Aderhold Hall
The University of Georgia
Athens, Georgia 30602, U.S.A.

gengelh@uga.edu


Modeling rater effects using a combination of Generalizability Theory and IRT
Jinnie Choi & Mark R. Wilson

Abstract
Motivated by papers on approaches to combining generalizability theory (GT) and item response theory (IRT), we suggest an approach that extends previous research to more complex measurement situations, such as those with multiple human raters. The proposed model is a logistic mixed model that contains the variance components needed for multivariate generalizability coefficients. Once properly set up, the model can be estimated by straightforward maximum likelihood estimation. We illustrate the use of the proposed method with a real multidimensional polytomous item response data set from a classroom assessment in which multiple human raters scored the responses.

Keywords: generalizability theory, item response theory, rater effect, generalized linear mixed model
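A minimal sketch of the general idea (assumed names, not the authors' model specification): a logistic mixed model with crossed random effects for persons, items, and raters can be fitted by maximum likelihood in R with lme4, and its variance components are the ingredients of generalizability coefficients. The data frame d and its columns are hypothetical.

# Minimal sketch (assumed names, not the authors' specification)
library(lme4)

# d: hypothetical long-format data with a 0/1 score and crossed
#    person, item, and rater identifiers
fit <- glmer(score ~ 1 + (1 | person) + (1 | item) + (1 | rater),
             data = d, family = binomial)

# Variance components; these are the building blocks of G-theory-style
# generalizability coefficients for the rater-mediated design.
as.data.frame(VarCorr(fit))[, c("grp", "vcov")]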

Jinnie Choi
Research Scientist at Pearson
221 River Street
Hoboken, NJ 07030

jinnie.choi@pearson.com


Comparison of human rater and automated scoring of test takers' speaking ability and classification using Item Response Theory
Zhen Wang & Yu Sun

Abstract
Automated scoring has been developed and has the potential to provide solutions to some of the obvious shortcomings of human scoring. In this study, we investigated whether SpeechRaterSM scores and a series of combined SpeechRaterSM and human scores were comparable to human scores for the speaking test of an English language assessment. We found systematic patterns across the five scenarios tested, based on item response theory.

Keywords: SpeechRater, human scoring, item response theory, ability estimation and classification

Zhen (Jane) Wang
Senior Psychometrician
Educational Testing Service
Psychometrics, Statistics and Data Sciences (PSDS)
Rosedale-Anrig
Princeton, NJ, U.S.A.

jwang@ets.org


Item response models for human ratings: Overview, estimation methods, and implementation in R
Alexander Robitzsch & Jan Steinfeld

Abstract
Item response theory (IRT) models for human ratings aim to represent item and rater characteristics by item and rater parameters. First, an overview of different IRT models (many-facet rater models, covariance structure models, and hierarchical rater models) is presented. Next, different estimation methods and their implementation in R software are discussed. Furthermore, suggestions on how to choose an appropriate rater model are made. Finally, the application of several rater models in R is illustrated by a sample dataset.

Keywords: multiple ratings, many-facet rater model, hierarchical rater model, R packages, parameter estimation, item response models
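To give a flavor of what such an R-based analysis can look like (a sketch under assumed data, not the article's worked example), a many-facet Rasch model with a rater facet can be fitted with the TAM package; resp and rater_id are hypothetical.

# Sketch only: many-facet Rasch model with a rater facet in TAM
library(TAM)

# resp:     hypothetical matrix of polytomous item scores
# rater_id: hypothetical vector identifying the rater for each row of resp
facets <- data.frame(rater = rater_id)
fit <- tam.mml.mfr(resp = resp, facets = facets,
                   formulaA = ~ item + rater + item:step)
summary(fit)   # item difficulties, rater severities, step parameters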

Alexander Robitzsch, PhD
Leibniz Institute for Science and Mathematics Education (IPN) at Kiel University
Olshausenstraße 62
D-24118 Kiel, Germany

robitzsch@ipn.uni-kiel.de

Psychological Test and Assessment Modeling
Volume 60 · 2018 · Issue 1
Pabst, 2018
ISSN 2190-0493 (Print)
ISSN 2190-0507 (Internet)