Advances in Rasch modeling: New applications and directions

Steffen Brandt, Mark Moulton & Brent Duckor (Eds.)

CONTENT

Modeling DIF for simulations: Continuous or categorical secondary trait?
Christine E. DeMars

Application of evolutionary algorithm-based symbolic regression to language assessment: Toward nonlinear modeling
Vahid Aryadoust

Special Topic:
Advances in Rasch modeling: New applications and directions
Guest editors: Steffen Brandt, Mark Moulton & Brent Duckor

Guest editorial
Steffen Brandt, Mark Moulton & Brent Duckor

Determinants of artificial DIF – a study based on simulated polytomous data
Curt Hagquist & David Andrich

The validity of polytomous items in the Rasch model – The role of statistical evidence of the threshold order
Thomas Salzberger

Modeling for directly setting theory-based performance levels
David Torres Irribarra, Ronli Diakow, Rebecca Freund & Mark Wilson

Evaluating the quality of analytic ratings with Mokken scaling
Stefanie A. Wind

ABSTRACTS

Modeling DIF for simulations: Continuous or categorical secondary trait?
Christine E. DeMars

Abstract
For DIF studies, item responses are often simulated using a unidimensional item response theory (IRT) model, with the item difficulty varying by group for the DIF item(s). This implies that the item is easier for all members of one group. However, many researchers may want to conceptualize the secondary trait underlying DIF as continuous. In this conceptualization, one group has higher average scores on the factor causing the DIF, but there is variance within groups. Multidimensional IRT models allow item responses to be generated to correspond to this perspective. Data were simulated under both unidimensional and multidimensional models, and effect sizes were estimated from the resulting item responses using the Mantel-Haenszel DIF procedure. The bias and empirical standard errors of the effect sizes were virtually identical. Thus, practitioners using observed-score methods of DIF detection can trust results from DIF simulation studies regardless of the underlying model used to create the data.

Keywords: DIF, IRT, Mantel-Haenszel
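
To make the simulation design concrete, the sketch below (an illustration only, not the author's code) generates dichotomous responses under a unidimensional 2PL model in which one studied item is shifted harder for the focal group, and then estimates the Mantel-Haenszel effect size on the ETS delta scale by pooling 2x2 tables across rest-score strata. All sample sizes, parameter values, and variable names are hypothetical, and the multidimensional generating condition discussed in the abstract is omitted.

```python
# Minimal sketch (not the author's code): simulate categorical-group DIF under
# a unidimensional 2PL model and estimate the Mantel-Haenszel effect size.
import numpy as np

rng = np.random.default_rng(0)
n_per_group, n_items, dif_item, dif_size = 2000, 20, 0, 0.5

a = rng.uniform(0.8, 1.6, n_items)                       # discriminations
b = rng.normal(0.0, 1.0, n_items)                        # difficulties
theta = rng.normal(0.0, 1.0, 2 * n_per_group)            # abilities
group = np.repeat([0, 1], n_per_group)                   # 0 = reference, 1 = focal

# Uniform DIF: the studied item is harder for every focal-group member.
b_mat = np.tile(b, (2 * n_per_group, 1))
b_mat[group == 1, dif_item] += dif_size
p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b_mat)))
x = (rng.random(p.shape) < p).astype(int)

# Mantel-Haenszel: pool 2x2 tables across rest-score strata.
rest = x.sum(axis=1) - x[:, dif_item]
num = den = 0.0
for s in np.unique(rest):
    m = rest == s
    ref, foc = m & (group == 0), m & (group == 1)
    A, B = x[ref, dif_item].sum(), ref.sum() - x[ref, dif_item].sum()
    C, D = x[foc, dif_item].sum(), foc.sum() - x[foc, dif_item].sum()
    num += A * D / m.sum()
    den += B * C / m.sum()

alpha_mh = num / den                                     # common odds ratio
mh_d_dif = -2.35 * np.log(alpha_mh)                      # ETS delta metric
print(f"MH D-DIF = {mh_d_dif:.2f}")                      # negative: DIF against focal group
```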

Christine E. DeMars, PhD
James Madison University
Center for Assessment & Research
MSC 6806, 298 Port Republic Road
Harrisonburg, Virginia 22807, USA
demarsce@jmu.edu


Application of evolutionary algorithm-based symbolic regression to language assessment: Toward nonlinear modeling
Vahid Aryadoust

Abstract
This study applies evolutionary algorithm-based (EA-based) symbolic regression to assess how well metacognitive strategy use, as measured by the metacognitive awareness listening questionnaire (MALQ), and lexico-grammatical knowledge predict listening comprehension proficiency among English learners. Initially, the psychometric validity of the MALQ subscales, the lexico-grammatical test, and the listening test was examined using the logistic Rasch model and the Rasch-Andrich rating scale model. Next, linear regression found both sets of predictors to have weak or inconclusive effects on listening comprehension; however, the results of EA-based symbolic regression suggested that lexico-grammatical knowledge and two of the five metacognitive strategies tested strongly and nonlinearly predicted listening proficiency (R² = .64). Constraining prediction modeling to linear relationships is argued to jeopardize the validity of language assessment studies, potentially leading them to inaccurately contradict otherwise well-established language assessment hypotheses and theories.

Keywords: evolutionary algorithm-based symbolic regression, lexico-grammatical knowledge, listening comprehension, metacognitive awareness, regression
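
The following sketch illustrates, in generic form, what evolutionary symbolic regression looks like in code. It uses the open-source gplearn package (a genetic-programming implementation) rather than whatever software the study actually employed, and all data and variable names are synthetic placeholders, not the MALQ or listening data analysed in the article.

```python
# Minimal sketch (not the study's code): genetic-programming symbolic
# regression with gplearn, evolving a nonlinear expression for an outcome.
import numpy as np
from gplearn.genetic import SymbolicRegressor

rng = np.random.default_rng(1)
n = 300
lexgram = rng.normal(size=n)               # placeholder lexico-grammatical scores
strategy = rng.normal(size=n)              # placeholder metacognitive subscale
# Placeholder nonlinear outcome; in the study the functional form was estimated from data.
listening = 0.6 * lexgram + 0.4 * strategy ** 2 + rng.normal(scale=0.3, size=n)

X = np.column_stack([lexgram, strategy])
est = SymbolicRegressor(population_size=1000, generations=20,
                        function_set=('add', 'sub', 'mul', 'div'),
                        parsimony_coefficient=0.001, random_state=0)
est.fit(X, listening)

pred = est.predict(X)
r2 = 1.0 - np.sum((listening - pred) ** 2) / np.sum((listening - listening.mean()) ** 2)
print(est._program)                        # the evolved symbolic expression
print(f"R^2 = {r2:.2f}")
```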

Vahid Aryadoust, PhD
National University of Singapore
Centre for English Language Communication
10 Architecture Drive
Singapore 117511
vahidaryadoust@gmail.com


Determinants of artificial DIF – a study based on simulated polytomous data
Curt Hagquist & David Andrich

Abstract
A general problem in DIF analyses is that items favouring one group can induce the appearance of DIF in other items favouring the other group. Artificial DIF denotes this kind of DIF, which is an artefact of the procedure for identifying DIF, in contrast to real DIF, which is inherent to an item.
The purpose of this paper is to elucidate how real DIF, both uniform and non-uniform, referenced to the expected value curve, induces artificial DIF; how this DIF affects the person parameter estimates; and how different factors, in particular the alignment of person and item locations, affect real and artificial DIF.
The results show that the same basic principles apply to non-uniform DIF as to uniform DIF, but that the effects on person measurement are less pronounced in non-uniform DIF. Similar to artificial DIF induced by real uniform DIF, the size of artificial DIF is determined by the magnitude of the real non-uniform DIF. In addition, in both uniform and non-uniform DIF, the magnitude of artificial DIF depends on the location of the items relative to the distribution of the persons. In contrast to uniform DIF, the direction of non-uniform real DIF (e.g. favouring one group or the other) is affected by the location of the items relative to the distribution of the persons. The results of the simulation study also confirm that regardless of type of DIF, in the person estimates, artificial DIF never balances out real DIF.

Keywords: differential item functioning, uniform, non-uniform, artificial, Rasch models
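
As background for the kind of data the abstract refers to, the sketch below generates polytomous responses from the partial credit (Rasch) model and builds in real uniform DIF by shifting one item's location for the focal group. It is a generic illustration with made-up values, not the authors' simulation design, and the DIF detection and resolution steps that give rise to artificial DIF are not shown.

```python
# Minimal sketch (not the authors' code): partial credit model responses with
# real uniform DIF on one item (the item is uniformly harder for the focal group).
import numpy as np

rng = np.random.default_rng(2)
n_per_group, n_items, n_cats = 1000, 10, 4               # scores 0..3
dif_item, dif_size = 0, 0.5

delta = np.linspace(-1.0, 1.0, n_items)                  # item locations
tau = np.array([-0.8, 0.0, 0.8])                         # centred thresholds
theta = rng.normal(0.0, 1.0, 2 * n_per_group)
group = np.repeat([0, 1], n_per_group)

def pcm_probs(th, d, tau):
    """P(X = x) for x = 0..m under the partial credit model."""
    cum = np.concatenate([[0.0], np.cumsum(th - d - tau)])   # sum_{k<=x}(th - d - tau_k)
    e = np.exp(cum - cum.max())
    return e / e.sum()

x = np.zeros((2 * n_per_group, n_items), dtype=int)
for person in range(2 * n_per_group):
    for item in range(n_items):
        d = delta[item] + (dif_size if item == dif_item and group[person] == 1 else 0.0)
        x[person, item] = rng.choice(n_cats, p=pcm_probs(theta[person], d, tau))

# With equal ability distributions, the focal group scores lower on the DIF item.
print(x[group == 0, dif_item].mean(), x[group == 1, dif_item].mean())
```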

Curt Hagquist, PhD
Centre for Research on Child and
Adolescent Mental Health
Karlstad University
SE-651 88 Karlstad, Sweden
curt.hagquist@kau.se


The validity of polytomous items in the Rasch model – The role of statistical evidence of the threshold order
Thomas Salzberger

Abstract
Rating scales involving more than two response categories are a popular response format in measurement in education, health and business sciences. Their primary purpose lies in the increase of information and thus measurement precision. For these objectives to be met, the response scale has to provide valid scores, with higher numbers reflecting more of the property to be measured. Thus, the response scale is closely linked to construct validity, since any kind of malfunctioning would jeopardize measurement. While tests of fit are not necessarily sensitive to violations of the assumed order of response categories, the order of empirical threshold estimates provides insight into the functionality of the scale. The Rasch model and, specifically, the so-called Rasch-Andrich thresholds are unique in providing this kind of evidence. The conclusion whether thresholds are to be considered truly ordered or disordered can be based on empirical point estimates of thresholds. Alternatively, statistical tests can be carried out that take the standard errors of the threshold estimates into account. Such tests might either stress the need for evidence of ordered thresholds or the need for a lack of evidence of disordered thresholds. However, both approaches are associated with unacceptably high error rates. A hybrid approach that accounts for both evidence of ordered and disordered thresholds is suggested as a compromise. While the usefulness of statistical tests for a given data set is still limited, they provide some guidance for modifying the response scale in future applications.

Keywords: Polytomous Rasch model, threshold order
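
The following sketch shows one way a Wald-type check of the order of two adjacent Rasch-Andrich threshold estimates can be coded, combining the "evidence of order" and "evidence of disorder" perspectives into a single coarse verdict. It is a simplified illustration of the idea, not the paper's exact hybrid procedure, and it treats the two estimates as independent, which is an additional simplification.

```python
# Minimal sketch (not the paper's exact procedure): Wald-type check of whether
# two adjacent Rasch-Andrich threshold estimates are statistically ordered.
from math import sqrt
from scipy.stats import norm

def threshold_order_check(tau1, se1, tau2, se2, alpha=0.05):
    """tau2 should exceed tau1 if the categories function as intended.
    Assumes (simplistically) that the two estimates are independent."""
    z = (tau2 - tau1) / sqrt(se1 ** 2 + se2 ** 2)
    crit = norm.ppf(1.0 - alpha)
    if z > crit:
        return "evidence of ordered thresholds"
    if z < -crit:
        return "evidence of disordered thresholds"
    return "inconclusive: neither order nor disorder is statistically established"

# Point estimates suggest disorder, but the difference is small relative to its
# standard error, so the verdict is inconclusive rather than 'disordered'.
print(threshold_order_check(tau1=0.10, se1=0.12, tau2=-0.05, se2=0.13))
```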

Thomas Salzberger, PhD
WU Wien, Institute for Marketing Management
& Institute for Statistics and Mathematics
Welthandelsplatz 1
1020 Vienna, Austria
Thomas.Salzberger@wu.ac.at


Modeling for directly setting theory-based performance levels
David Torres Irribarra, Ronli Diakow, Rebecca Freund & Mark Wilson

Abstract
This paper presents the Latent Class Level-PCM as a method for identifying and interpreting latent classes of respondents according to empirically estimated performance levels. The model, which combines elements from latent class models and reparameterized partial credit models for polytomous data, can simultaneously (a) identify empirical boundaries between performance levels and (b) estimate an empirical location of the centroid of each level. This provides more detailed information for establishing performance levels and interpreting student performance in the context of these levels. The paper demonstrates the use of the Latent Class L-PCM on an assessment of student reading proficiency for which there are strong ties between the hypothesized theoretical levels and the polytomously scored assessment data. Graphical methods for evaluating the estimated levels are illustrated.

Keywords: Construct Modeling, Performance Levels, Ordered Latent Class Analysis, Standard Setting, Level Partial Credit Model
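
For orientation only, a latent class model in which each performance level (class) has its own location on a partial credit scale typically has a marginal response probability of the following general form; this is a generic sketch of the model family, not necessarily the exact reparameterization used in the paper:

$$ P(\mathbf{X}=\mathbf{x}) \;=\; \sum_{c=1}^{C}\pi_c \prod_{i=1}^{I} \frac{\exp\Big(\sum_{k=1}^{x_i}(\theta_c-\delta_{ik})\Big)}{\sum_{h=0}^{m_i}\exp\Big(\sum_{k=1}^{h}(\theta_c-\delta_{ik})\Big)} $$

where $\pi_c$ is the proportion of respondents in level $c$, $\theta_c$ is the estimated location (centroid) of that level, $\delta_{ik}$ are the item step parameters, and the empty sum for $x_i = 0$ is defined as zero.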

David Torres Irribarra
Centro de Medición MIDE UC
Pontificia Universidad Católica de Chile
Santiago, Chile
davidtorres@uc.cl


Evaluating the quality of analytic ratings with Mokken scaling
Stefanie A. Wind

Abstract
Greatly influenced by the work of Rasch (1960/1980), Mokken (1971) presented a nonparametric scaling procedure that is based on the theory of invariant measurement, but draws upon less strict requirements related to the scale of measurement. Because they are theoretically and empirically related to Rasch models, Mokken’s nonparametric models have been recognized as a useful exploratory tool for examining data in terms of the basic requirements for invariant measurement before the application of a parametric model. In particular, recent research has explored the use of polytomous versions of Mokken’s (1971) nonparametric scaling models as a technique for evaluating the quality of holistic ratings (Wind & Engelhard, in press) and rating scales (Wind, 2014) for performance assessments in terms of the requirements for invariant measurement. The current study continues the extension of Mokken scaling to performance assessments by exploring the degree to which Mokken-based rating quality indices can be used to explore the quality of ratings assigned within domains on an analytic rubric. Using an illustrative analysis, this study demonstrates the use of a generalized rating design to explore the quality of analytic ratings within the framework of Mokken scaling. Findings from the illustrative analysis suggest that a generalized rating design can be used to examine the quality of analytic ratings in terms of the requirements for invariant measurement.

Keywords: Mokken scaling, analytic rubric, performance assessment, nonparametric item response theory
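
To illustrate the basic machinery behind Mokken-based rating quality indices, the sketch below computes pairwise Loevinger scalability coefficients for polytomous ratings: the observed covariance of two rating columns divided by the maximum covariance attainable given their marginal distributions (obtained by pairing the sorted scores). The data and dimensions are toy placeholders, not the study's generalized rating design.

```python
# Minimal sketch (not the author's code): pairwise Loevinger scalability
# coefficients H_jk for polytomous ratings, the building blocks of a Mokken scale.
import numpy as np

def pairwise_h(x):
    """x: (n_persons, n_columns) matrix of polytomous scores."""
    n_cols = x.shape[1]
    h = np.full((n_cols, n_cols), np.nan)
    for j in range(n_cols):
        for k in range(j + 1, n_cols):
            obs = np.cov(x[:, j], x[:, k])[0, 1]
            # Maximum covariance given the marginals: comonotonic (sorted) pairing.
            mx = np.cov(np.sort(x[:, j]), np.sort(x[:, k]))[0, 1]
            h[j, k] = h[k, j] = obs / mx
    return h

# Toy analytic ratings on a 0-4 scale for three hypothetical rubric domains.
rng = np.random.default_rng(3)
ability = rng.normal(size=200)
ratings = np.clip(np.round(ability[:, None] * 1.2 + 2.0
                           + rng.normal(scale=0.7, size=(200, 3))), 0, 4)
print(np.round(pairwise_h(ratings), 2))
```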

Stefanie A. Wind, PhD
College of Education
The University of Alabama
313C Carmichael Hall, Box 870231
Tuscaloosa, AL 35487-0231, USA
swind@ua.edu

Psychological Test and Assessment Modeling
Volume 57 · 2015 · Issue 3
Pabst, 2015
ISSN 2190-0493 (Print)
ISSN 2190-0507 (Internet)