Assessing Critical Thinking in Open-ended Answers: An Automatic Approach

Antonella Poce
Roma Tre University
antonella.poce@uniroma3.it

Francesca Amenduni
Roma Tre University
amendoonia@gmail.com

Carlo De Medio
Roma Tre University
carlo.demedio@uniroma3.it

Alessandra Norgini
Roma Tre University
alessandra.norgini@student.unisi.it

Abstract

The role of Higher Education (HE) in promoting Critical Thinking (CT) is increasingly acknowledged. Constructed-response tasks (CRTs) are recognized as necessary for CT assessment, although they present problems related to scoring quality and cost (Ku, 2009). Researchers (Liu, Frankel, & Roohr, 2014) have proposed automated scoring to address these concerns. The present work compares the features of different Natural Language Processing (NLP) techniques adopted to improve the reliability of a prototype designed to automatically assess six CT sub-skills in CRTs: use of language, argumentation, relevance, importance, critical evaluation, and novelty (Poce, 2017). We present the first (1.0) and second (2.0) versions of the CT prototype and their respective reliability results. Our research question is the following: What level of reliability do the 1.0 and 2.0 automatic CT assessment prototypes show compared with expert human evaluation? Data were collected in two phases, to measure the reliability of prototypes 1.0 and 2.0 respectively, from a total of 264 participants and 592 open-ended answers. Two human assessors rated all of these responses on each sub-skill on a scale of 1-5. Similarly, NLP approaches were adopted to compute a feature for each dimension. Quadratic Weighted Kappa and the Pearson product-moment correlation were used to evaluate between-human agreement and human-NLP agreement. Preliminary findings based on the first data set suggest an adequate level of between-human rating agreement and a lower level of human-NLP agreement (r > .43 for the Relevance and Importance subscales). We are continuing the analysis of the data collected in the second phase and expect to complete it in June 2020.
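The two agreement statistics named in the abstract are standard and can be sketched directly. The snippet below is a minimal illustration (not the authors' implementation) of Quadratic Weighted Kappa and the Pearson product-moment correlation for two raters scoring the same responses on the 1-5 scale described above; the rating vectors are invented for demonstration.

```python
from collections import Counter

def quadratic_weighted_kappa(a, b, min_r=1, max_r=5):
    """QWK between two integer rating vectors on the scale [min_r, max_r]."""
    n = len(a)
    n_cat = max_r - min_r + 1
    # Observed contingency matrix of rating pairs.
    obs = [[0] * n_cat for _ in range(n_cat)]
    for x, y in zip(a, b):
        obs[x - min_r][y - min_r] += 1
    # Marginal rating frequencies for each rater.
    ha = Counter(x - min_r for x in a)
    hb = Counter(y - min_r for y in b)
    num = den = 0.0
    for i in range(n_cat):
        for j in range(n_cat):
            w = (i - j) ** 2 / (n_cat - 1) ** 2   # quadratic disagreement weight
            expected = ha[i] * hb[j] / n          # expected count under independence
            num += w * obs[i][j]
            den += w * expected
    return 1.0 - num / den

def pearson_r(a, b):
    """Pearson product-moment correlation between two rating vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

# Hypothetical ratings from two assessors on six responses.
human = [3, 4, 2, 5, 1, 3]
model = [3, 4, 2, 5, 1, 3]
print(quadratic_weighted_kappa(human, model))  # 1.0 for perfect agreement
print(pearson_r(human, model))                 # 1.0 for perfect agreement
```

Both statistics equal 1 under perfect agreement; QWK additionally penalizes large rating discrepancies more heavily than near-misses, which is why it is the usual choice for ordinal scoring agreement.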
