PTAM 2024-1

CONTENT

Disentangling fluid and crystallized intelligence by means of Bayesian structural equation modeling and correlation-preserving mean plausible values
André Beauducel, Richard Bruntsch, Martin Kersting
DOI: https://doi.org/10.2440/001-0010
Full article .pdf (Diamond Open Access)

Examining the viability of the Continuous Matching Task in mobile assessment compared to laboratory testing
Johann-Christoph Münscher
DOI: https://doi.org/10.2440/001-0011
Full article .pdf (Diamond Open Access)

The impact of filtering out rapid-guessing examinees on PISA 2015 country rankings
Michalis P. Michaelides, Militsa G. Ivanova, Demetris Avraam
DOI: https://doi.org/10.2440/001-0012
Full article .pdf (Diamond Open Access)

Unraveling Performance on Multiple-Choice and Free-Response University Exams: A Multilevel Analysis of Study Time, Lecture Attendance and Personality Traits
Tuulia M. Ortner, Verena Keneder, Sonja Breuer, Freya M. Gruber, Thomas Scherndl
DOI: https://doi.org/10.2440/001-0013
Full article .pdf (Diamond Open Access)

Perception-based anticipation of social conflicts (ASK): IRT analysis of a new image-based method
Gebhard Sammer & Annette Kroiß
DOI: https://doi.org/10.2440/001-0014
Full article .pdf (Diamond Open Access)

Parameter Recovery for the Four-Parameter Item Response Model: A Comparison of Marginal Maximum Likelihood and Markov Chain Monte Carlo Approaches
Hoan Do & Gordon P. Brooks
DOI: https://doi.org/10.2440/001-0015
Full article .pdf (Diamond Open Access)

ABSTRACTS

Disentangling fluid and crystallized intelligence by means of Bayesian structural equation modeling and correlation-preserving mean plausible values

André Beauducel, Richard Bruntsch, Martin Kersting

Abstract:
The present study reports Bayesian confirmatory factor analyses of data from an extensive computer-based intelligence test battery used in applied assessment in Switzerland. Bayesian confirmatory factor analysis makes it possible to constrain the variability and distribution of model parameters according to theoretical expectations by means of priors. Posterior distributions of the model parameters are then obtained through a Bayesian estimation procedure. A large sample of 4,677 participants completed the test battery, which comprises 21 different tasks. Factors for crystallized intelligence, fluid intelligence, memory, and basic skills/clerical speed were obtained. The latter factor differs from the speed factors of several other tests in that it encompasses speeded performance on moderately complex tasks. Three types of models were compared: for one type, only the expected salient loadings were freely estimated and all cross-loadings were fixed to zero (i.e., independent clusters), whereas for the other two types, normally distributed priors with zero mean were defined for the cross-loadings. The latter two types differed in the amount of prior variance defined. Results show that defining substantial prior variances for the cross-loadings in Bayesian confirmatory factor analysis makes it possible to overcome limitations of the independent clusters model. In order to estimate individual scores for the factors, mean plausible values were computed. However, the intercorrelations of the mean plausible values substantially overestimated the true correlations of the factors. To improve the discriminant validity of individual score estimates, it was therefore proposed to compute correlation-preserving mean plausible values. The findings can be applied to derive estimates for factorial scoring of a test battery, especially when cross-loadings of subtests must be expected.
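
As a rough illustration of the correlation-preserving idea described above, the following numpy sketch assumes the adjustment amounts to whitening the person-level mean plausible values and re-coloring them with the model-implied factor correlations; all names and numbers are hypothetical and this is not the authors' implementation:

import numpy as np

def correlation_preserving_scores(mpv, phi):
    """Rescale mean plausible values so that their intercorrelations
    match a target factor correlation matrix phi."""
    # Standardize each column of mean plausible values
    z = (mpv - mpv.mean(axis=0)) / mpv.std(axis=0, ddof=1)
    # Empirical correlation of the mean plausible values (typically inflated)
    r = np.corrcoef(z, rowvar=False)
    # Whiten away the inflated correlations, then re-color with phi
    return z @ np.linalg.inv(np.linalg.cholesky(r)).T @ np.linalg.cholesky(phi).T

# Hypothetical example: four factors whose true intercorrelation is 0.30,
# but whose mean plausible values correlate at about 0.55
rng = np.random.default_rng(1)
phi = np.full((4, 4), 0.30); np.fill_diagonal(phi, 1.0)
mpv = rng.multivariate_normal(np.zeros(4), np.full((4, 4), 0.55) + 0.45 * np.eye(4), size=500)
print(np.round(np.corrcoef(correlation_preserving_scores(mpv, phi), rowvar=False), 2))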

Keywords: Fluid intelligence, crystallized intelligence, Bayesian confirmatory factor analysis, mean plausible values

Correspondence: André Beauducel, University of Bonn, Department of Psychology, Kaiser-Karl-Ring 9, 53111 Bonn, Germany, email: beauducel@uni-bonn.de


Examining the viability of the Continuous Matching Task in mobile assessment compared to laboratory testing

Johann-Christoph Münscher

Abstract:
Two measures of attention, the Continuous Matching Task (CMT, measuring alertness) and the Stroop task (measuring selective attention), were applied under two conditions: in the laboratory using a standardized apparatus and in mobile measurements using participants’ smart devices. Both are cognitive performance tasks reliant on processing speed. In past research, implementing this type of measurement on mobile devices was called into question and the psychometric quality was assumed to be low. The present study aims to evaluate whether the CMT can yield equivalent results from guided laboratory testing and self-administered mobile measurements. The Stroop task results are evaluated in the same way, and the results of the two tasks are compared. Both tasks were implemented identically in the two conditions, with only slight modifications to the methods of input. Comparing and analyzing the results revealed that the CMT is not consistent across conditions and prone to age effects on mobile devices. Consequently, it is largely not suited for mobile assessment. The Stroop task showed more consistent measurements, although characteristic shortcomings were also observed. Generally, mobile assessment using response-time-based measurements appears to be problematic when tasks are more technically demanding.
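
To illustrate the kind of cross-condition consistency check at issue here (hypothetical response-time data and an assumed scoring by mean reaction time; this is not the study's actual analysis), one might correlate each participant's laboratory score with the mobile score and test for a level shift:

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
lab_rt = rng.normal(520, 60, size=80)              # hypothetical mean RTs (ms) in the lab
mobile_rt = lab_rt + rng.normal(40, 80, size=80)   # noisier and slower on mobile devices

r, p_r = stats.pearsonr(lab_rt, mobile_rt)         # consistency of individual differences
t, p_t = stats.ttest_rel(mobile_rt, lab_rt)        # systematic shift between administrations
print(f"lab-mobile r = {r:.2f} (p = {p_r:.3f}); paired t = {t:.2f} (p = {p_t:.3f})")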

Keywords: Mobile Assessment, Continuous Performance Tasks, Computer Applications, Attention, Reaction Times

Correspondence: Johann-Christoph Münscher
orcid.org/0000-0002-8434-7970
Department of Aviation and Space Psychology, German Aerospace Center (DLR) Institute of Aerospace Medicine.
E-mail: Johann-Christoph.Muenscher@dlr.de


The impact of filtering out rapid-guessing examinees on PISA 2015 country rankings

Michalis P. Michaelides, Militsa G. Ivanova, Demetris Avraam

Abstract:
International large-scale assessments are low-stakes tests for examinees, whose motivation to perform at their best may therefore not be high. For this reason, such programs have been criticized on the grounds that they may not validly depict individual and aggregate achievement levels. In this paper, we examine whether filtering out examinees who rapid-guess impacts country score averages and rankings. Building on an earlier analysis that identified rapid guessers using two different methods, we re-estimated country average scores and rankings in three subject tests of PISA 2015 (Science, Mathematics, Reading) after filtering out rapid-guessing examinees. Results suggest that country mean scores increase for all countries after filtering, but in most conditions the change in rankings is minimal, if any. A few exceptions with considerable changes in rankings were observed in the Science and Reading tests with methods that were more liberal in identifying rapid guessing. Lack of engagement and effort is a validity concern for individual scores, but has only a minor impact on aggregate scores and country rankings.
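
A minimal pandas sketch of threshold-based filtering of this kind (the column names, the 3-second rapid-guessing threshold, and the 90% response-time-effort cutoff are illustrative assumptions, not the identification methods used in the paper):

import pandas as pd

log = pd.DataFrame({                                   # one row per examinee x item
    "examinee": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "country":  ["A", "A", "A", "A", "A", "A", "B", "B", "B"],
    "rt":       [45.0, 2.1, 38.0, 1.5, 1.8, 2.0, 60.0, 52.0, 47.0],   # seconds
    "score":    [1, 0, 1, 0, 0, 1, 1, 1, 0],
})

RAPID_THRESHOLD = 3.0   # responses faster than this are treated as rapid guesses
MIN_EFFORT = 0.90       # keep examinees answering at least 90% of items with effort

log["rapid"] = log["rt"] < RAPID_THRESHOLD
rte = 1.0 - log.groupby("examinee")["rapid"].mean()    # response-time effort per examinee
engaged = rte[rte >= MIN_EFFORT].index

before = log.groupby("country")["score"].mean()
after = log[log["examinee"].isin(engaged)].groupby("country")["score"].mean()
print(pd.DataFrame({"before": before, "after": after}))   # country means rise after filtering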

Keywords: Rapid guessing, response time effort, PISA, filtering

Correspondence: Michalis P. Michaelides
Institutional address: Dept. of Psychology, 1 Panepistimiou Avenue, 2109 Aglantzia, P.O. Box 20537, 1678 Nicosia, Cyprus
Email: Michaelides.michalis@ucy.ac.cy


Unraveling Performance on Multiple-Choice and Free-Response University Exams: A Multilevel Analysis of Study Time, Lecture Attendance and Personality Traits

Tuulia M. Ortner, Verena Keneder, Sonja Breuer, Freya M. Gruber, Thomas Scherndl

Abstract:
Assessment methods impact student learning and performance. Various recommendations address challenges of assessment in education, emphasizing test validity and reliability, aligning with ongoing efforts in psychological assessment to prevent test bias, a concern also relevant in evaluating student learning outcomes. Examinations in education commonly use either free-response (FR) or multiple-choice (MC) response formats, each with its advantages and disadvantages. Despite frequent reports of high construct equivalence between them, certain group differences based on differing person characteristics still need to be explained. In this study, we aimed to investigate how test takers’ characteristics and behavior—particularly test anxiety, risk propensity, conscientiousness, lecture attendance, and study time—impact test scores in exams with FR and MC format. Data were collected from 376 students enrolled in one of two psychology lectures at a large Austrian university at the beginning of the semester and post-exam in a real-life setting. Multilevel analyses revealed that, overall, students achieved higher scores on FR items compared to MC items. Less test anxiety, higher conscientiousness, and more study time significantly increased student examination performance. Lecture attendance impacted performance differently according to the exam items’ response format: Students who attended more lectures scored higher on the MC items compared to the FR items. Risk propensity exhibited no significant effect on exam scores. The results offer deeper insights into the nuanced interplay between academic performance, personality, and other influencing factors with the aim of establishing more reliable and valid performance tests in the future. Limitations and implications of the results are discussed.
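
A sketch of a multilevel specification in this spirit (items nested within students, a random student intercept, and an attendance-by-format interaction; the simulated data and variable names are hypothetical, only a subset of the predictors is included, and this is not the authors' exact model):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n_students, n_items = 120, 20
students = np.repeat(np.arange(n_students), n_items)
fr_format = np.tile(np.r_[np.zeros(n_items // 2), np.ones(n_items // 2)], n_students)  # 0 = MC, 1 = FR
anxiety = np.repeat(rng.normal(0, 1, n_students), n_items)
attendance = np.repeat(rng.integers(0, 13, n_students), n_items)
study_time = np.repeat(rng.normal(10, 3, n_students), n_items)
intercepts = np.repeat(rng.normal(0, 0.5, n_students), n_items)   # student-level random effect
score = (0.6 + 0.1 * fr_format - 0.2 * anxiety + 0.02 * attendance * (1 - fr_format)
         + 0.03 * study_time + intercepts + rng.normal(0, 1, n_students * n_items))

df = pd.DataFrame({"student": students, "score": score, "fr_format": fr_format,
                   "anxiety": anxiety, "attendance": attendance, "study_time": study_time})

# Random intercept for students; attendance is allowed to interact with the response format
model = smf.mixedlm("score ~ fr_format * attendance + anxiety + study_time",
                    data=df, groups=df["student"])
print(model.fit().summary())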

Keywords: evaluation methods, student evaluation, test performance, response format, personality

Correspondence: Univ. Prof. Dr. Tuulia M. Ortner, Department of Psychology, University of Salzburg. tuulia.ortner@sbg.ac.at


Perception-based anticipation of social conflicts (ASK): IRT analysis of a new image-based method

Gebhard Sammer & Annette Kroiß

Abstract:
People perceive everyday situations very individually and react to them depending on their personality, lifestyle, development or a psychological disorder. In this study, an image-based method was developed to capture the perception of social situations that represent potential social conflict situations. To date, there is no comparable psychometric instrument.
Based on what we know about social conflict, the construction of the items mainly included resource conflict and conflict of interest in various social situations. Eighteen images were presented as an online survey with a 7-point response scale. In addition, personality, anxiety, anger, alexithymia, and sociodemographic data were collected.
A total of 2074 participants were recruited via online portals, social media, and mailing lists. The cases usable for analysis (N = 1831) included disproportionately more women and more highly educated individuals.
The items and the scale were tested against a Rasch model for polytomous responses, the partial credit model. After the number of response categories was reduced to four, the item categories were ordered and fitted to the model. Person estimates showed misfit for about 6% of cases. Model tests supported the unidimensionality of the scale (Martin-Löf test) and revealed that some response categories were likely to deviate from the model (Wald-type test). The Andersen likelihood ratio test indicated model validity when smaller samples were created and tested through resampling (n = 300, 400, 500). Reliability based on classical test theory was sufficient (McDonald’s ω = 0.75, N = 1973). However, the validity of the model is preliminary and needs to be tested empirically with a new data set without pooling of response categories.
Although the model fit statistics are preliminary due to the post-hoc pooling of response categories, it is proposed that the test be used to assess social cognition in the context of anticipating social conflict (ASK), preferably for research questions and theory building on the perception of conflict-prone social situations.
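
For reference, the partial credit model underlying this analysis assigns category probabilities as in the following generic sketch (the step parameters are hypothetical, not the ASK item estimates):

import numpy as np

def pcm_category_probs(theta, thresholds):
    """Partial credit model: probabilities of categories 0..m for one item,
    given a person location theta and step parameters delta_1..delta_m."""
    steps = np.concatenate(([0.0], np.cumsum(theta - np.asarray(thresholds))))
    expd = np.exp(steps - steps.max())       # subtract the maximum for numerical stability
    return expd / expd.sum()

thresholds = [-1.0, 0.2, 1.3]                # a hypothetical 4-category item, as after pooling
for theta in (-2.0, 0.0, 2.0):
    print(theta, np.round(pcm_category_probs(theta, thresholds), 3))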

Keywords: social cognition, social conflict, polytomous Rasch model, partial credit model, IRT

Correspondence: Gebhard Sammer, Centre for Psychiatry, Justus Liebig University Giessen, Klinikstrasse 36, 35392 Giessen, Germany; e-mail: gebhard.sammer@uni-giessen.de


Parameter Recovery for the Four-Parameter Item Response Model: A Comparison of Marginal Maximum Likelihood and Markov Chain Monte Carlo Approaches

Hoan Do & Gordon P. Brooks

Abstract:
This study assessed the parameter recovery accuracy of Marginal Maximum Likelihood (MML) and two Markov Chain Monte Carlo (MCMC) methods, Gibbs and Hamiltonian Monte Carlo (HMC), under the four-parameter unidimensional binary item response function. Data were simulated under a mixed factorial design with sample size (1,000; 2,500; and 5,000 respondents) and latent trait distribution (normal and negatively skewed) as the between-subjects factors, and estimation method (MML, Gibbs, and HMC) as the within-subjects factor. Results indicated that, in general, MML was more heavily impacted by latent trait skewness, but MML also improved its performance more strongly than MCMC when sample size increased. The two MCMC methods remained advantageous with lower root mean square errors (RMSE) of item parameter recovery across all conditions under investigation, but increasing the sample size correspondingly narrowed the gap between MML and MCMC regardless of the theta distribution. Gibbs and HMC provided nearly identical outcomes across all conditions, and no considerable difference between these two MCMC methods was detected. Sample size and latent trait distribution had little observable effect on trait score estimation by MCMC and Expected a Posteriori following MML (MML-EAP), which were essentially unbiased and had similar RMSE across all conditions. Discussions of the findings and model calibration issues are presented together with practical implications and future research recommendations.
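
For reference, the four-parameter logistic item response function adds an upper asymptote d < 1 to the 3PL, and recovery accuracy in such simulations is typically summarized by the root mean square error between generating and estimated parameters; the sketch below uses hypothetical values, not the study's conditions or results:

import numpy as np

def irf_4pl(theta, a, b, c, d):
    """Four-parameter logistic IRF: discrimination a, difficulty b,
    lower asymptote c (guessing), upper asymptote d (slipping)."""
    return c + (d - c) / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)
print(np.round(irf_4pl(theta, a=1.2, b=0.0, c=0.15, d=0.95), 3))

true_b = np.array([-1.0, 0.0, 1.0])          # generating difficulties
est_b = np.array([-0.9, 0.1, 1.2])           # estimates from some calibration
print(f"RMSE(b) = {np.sqrt(np.mean((est_b - true_b) ** 2)):.3f}")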

Keywords: four-parameter IRT model, Marginal Maximum Likelihood (MML), Markov chain Monte Carlo (MCMC), Gibbs sampling, Hamiltonian Monte Carlo (HMC)

Correspondence:
Hoan Do, Ph.D., Research Scientist – Quantitative Sciences, Clinical Outcomes Solutions, Tucson, AZ 85718, td898911@ohio.edu


Psychological Test and Assessment Modeling
Volume 66 · 2024 · Issue 1

Pabst, 2024
ISSN 2190-0493 (Print)
ISSN 2190-0507 (Internet)