Teachers who have found themselves racking their brains to create suitable test items or staring forlornly at a stack of ungraded papers have likely wished they could wave a magic wand, Harry Potter-style, to get that test to write itself or whisk those ungraded papers into a stack of graded ones. Could AI grant that wish?
Two recent conference papers offer some interesting insights into whether AI, specifically the GPT-4 large language model (LLM) that powers ChatGPT, can serve as just such a magic wand.
In the first study, researchers at Stanford University examined the viability of using GPT-4 to generate test items that assess sentence reading efficiency, an aspect of reading fluency in which students read simple sentences (such as "Children play on the playground.") and decide whether each statement is true or false.
Regularly tracking student progress on this measure requires building a high-quality item bank of hundreds of sentences, which is no small task. So the research team wanted to see whether GPT-4 could build those items. They compared the results of testing 234 students on expert-written sentences versus sentences created by the LLM (130 true and 130 false sentences in each set). Remarkably, students scored similarly on the expert- and AI-created sentences. However, humans were still needed to screen the AI-generated sentences first to weed out items that were ambiguous ("A hill is flat and square."), dangerous ("Babies drink gasoline."), or subjective ("Dolls are fun to play with.").
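For readers curious about what the item-generation step can look like in practice, here is a minimal sketch, assuming the OpenAI Python SDK and an API key; the prompt wording, batch size, and model name are illustrative assumptions rather than details taken from the study, and, as the researchers found, every generated sentence would still need a human screener before entering an item bank.

```python
from openai import OpenAI

# Illustrative sketch only, not the Stanford team's pipeline.
# Assumes the OpenAI Python SDK (v1+) and an OPENAI_API_KEY in the environment.
client = OpenAI()

PROMPT = (
    "Write 10 short, simple sentences for a sentence reading efficiency test. "
    "Half should be clearly true and half clearly false. "
    "Avoid ambiguous, dangerous, or subjective statements. "
    "Start each line with TRUE: or FALSE:."
)

response = client.chat.completions.create(
    model="gpt-4",  # model name is an assumption; the paper describes using GPT-4
    messages=[{"role": "user", "content": PROMPT}],
)

# Candidate items go to a human reviewer before they are used with students.
for line in response.choices[0].message.content.splitlines():
    print(line)
```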
In the second study, researchers in the UK used a dataset of 1,700 human-scored student responses to open-ended science and history questions to train GPT-4 to mark the responses as correct or incorrect. GPT-4's scoring matched the human scoring 85 percent of the time, which is close to the 87 percent agreement among the human scorers themselves. Short-answer responses to open-ended questions are a more effective way to assess student learning, the researchers note, yet teachers often rely heavily on multiple-choice test items because they are easier to grade. If AI can reliably evaluate short-answer responses, it has the potential to raise the quality of student learning while saving teachers time.
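As a rough illustration of what automated marking might look like, the sketch below asks an LLM for a one-word verdict on a single short answer; the question, example answer, and prompt format are hypothetical, and a real setup, like the one in the study, would be calibrated against a large set of human-scored responses before anyone trusted its output.

```python
from openai import OpenAI

client = OpenAI()  # illustrative sketch; assumes the OpenAI Python SDK and an API key


def mark_short_answer(question: str, correct_example: str, student_answer: str) -> str:
    """Ask the model for a one-word verdict: CORRECT or INCORRECT."""
    prompt = (
        f"Question: {question}\n"
        f"Example of a correct answer: {correct_example}\n"
        f"Student answer: {student_answer}\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )
    response = client.chat.completions.create(
        model="gpt-4",  # model name is an assumption
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the marking as consistent as possible
    )
    return response.choices[0].message.content.strip()


# Hypothetical example
print(mark_short_answer(
    "Why does ice float on water?",
    "Ice is less dense than liquid water.",
    "Because frozen water is lighter than the same amount of liquid water.",
))
```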
Neither study suggests that AI offers a magic wand for developing and scoring classroom assessments, at least not yet. But with some caution and common sense, LLMs may help teachers do both, making the essential work of assessment and grading a bit easier.
End Notes
1. Zelikman, E., Ma, W., Tran, J., Yang, D., Yeatman, J., & Haber, N. (2023). Generating and evaluating tests for K-12 students with language model simulations: A case study on sentence reading efficiency. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 2190–2205). Singapore: Association for Computational Linguistics.
2. Henkel, O., Hills, L., Boxer, A., Roberts, B., & Levonian, Z. (2024, July). Can large language models make the grade? An empirical study evaluating LLMs' ability to mark short answer questions in K-12 education. In Proceedings of the Eleventh ACM Conference on Learning @ Scale (pp. 300–304). Atlanta: Association for Computing Machinery.