EVALUATING CONVERSATIONAL AND PLATFORM-INTEGRATED GENERATIVE AI FOR AUTOMATED, TIMELY FEEDBACK IN PROGRAMMING EDUCATION: A QUASI-EXPERIMENTAL STUDY UTILIZING GPT-4O-MINI

Faculty of Computer Science and Engineering, Meiji University of Technology, Tokyo, Japan

Abstract

Context: Effective feedback is critical for novice programmers, but providing it in a timely and scalable manner poses a significant challenge in higher education [13], [14], [37]. Generative Artificial Intelligence (GenAI), particularly Large Language Models (LLMs) trained on code [9], [36], offers a promising avenue to automate this process [1], [22].

Objectives: This quasi-experimental study aimed to evaluate the usability, student perceptions, and academic impact of two distinct GenAI-assisted feedback tools, both powered by GPT-4o-mini: a conversational assistant (tutorB@t) and a platform-embedded tool integrated with a virtual code evaluator (tutorBot+).
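
For context, the interaction pattern shared by both tools can be sketched as a single LLM call. The snippet below is an illustrative approximation only, not the authors' published implementation: it assumes the official openai Python client, a hypothetical generate_feedback helper, and invented prompt wording; the evaluator_report parameter stands in for the virtual code evaluator's output that tutorBot+ consumes.

```python
# Illustrative sketch only: the paper does not publish tutorBot+'s source.
# Assumes the official `openai` client (pip install openai) and an
# OPENAI_API_KEY in the environment; prompt wording is hypothetical.
from openai import OpenAI

client = OpenAI()

def generate_feedback(exercise: str, submission: str, evaluator_report: str) -> str:
    """Ask GPT-4o-mini for formative feedback on a student submission.

    `evaluator_report` stands in for the virtual code evaluator's output
    (e.g., failed test cases) that a platform-embedded tool would pass along.
    """
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "You are a programming tutor. Give hints, not solutions."},
            {"role": "user",
             "content": f"Exercise:\n{exercise}\n\nSubmission:\n{submission}\n\n"
                        f"Evaluator report:\n{evaluator_report}"},
        ],
        temperature=0.2,  # keep feedback consistent across students
    )
    return response.choices[0].message.content
```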

Methods: The study involved 91 undergraduate computer science students, 37 of whom were assigned to the experimental AI-assisted group; the remaining 54 served as the comparison group. We measured student programming performance and passing rates, and used the System Usability Scale (SUS) [6] to assess the perceived utility and ease of use of the developed tools.
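
The SUS scores reported in the Results are on the instrument's standard 0-100 scale. As a reference for how per-student scores behind averages such as 70.6 are derived, here is the standard Brooke scoring rule; the function name and example responses are illustrative, not taken from the study.

```python
# Standard SUS scoring (Brooke, 1996): a minimal sketch.
def sus_score(responses: list[int]) -> float:
    """Score one respondent's 10 SUS items (each rated 1-5)."""
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("SUS needs ten responses on a 1-5 scale")
    contributions = [
        (r - 1) if i % 2 == 0 else (5 - r)  # odd-numbered items: r-1; even: 5-r
        for i, r in enumerate(responses)
    ]
    return sum(contributions) * 2.5         # scale the 0-40 sum to 0-100

# Example: a fairly positive respondent
print(sus_score([4, 2, 4, 1, 5, 2, 4, 2, 4, 2]))  # 80.0
```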

Results: Students highly valued the immediacy and accessibility of the AI feedback. Perception scores were positive, with tutorB@t achieving a SUS score of 70.6 and tutorBot+ scoring 65.2, and a high intent to reuse (81% and 79%, respectively). Crucially, despite positive perceptions, the study found no statistically significant difference in objective programming performance or passing rates between the groups. This outcome is attributed primarily to factors such as a lack of group homogeneity, external academic pressures, and occasional student misunderstanding of the GenAI-provided feedback.
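
The abstract does not name the statistical procedures behind the null result. The sketch below assumes two common choices for this design, a Welch t-test on performance scores and a chi-square test on passing counts; all data values are invented placeholders, not the study's data.

```python
# Hedged sketch of the group comparison; the study's exact tests are not
# specified here, and the numbers below are placeholders.
from scipy import stats

# Hypothetical per-group performance scores (0-100)
ai_group = [72, 65, 80, 58, 91, 77, 69]
control  = [70, 62, 84, 55, 88, 74, 71]

t_stat, p_perf = stats.ttest_ind(ai_group, control, equal_var=False)

# Passing counts as [passed, failed] per group (placeholder numbers
# consistent with n = 37 and n = 54)
table = [[25, 12],   # AI-assisted group
         [36, 18]]   # comparison group
chi2, p_pass, dof, _ = stats.chi2_contingency(table)

print(f"performance: p = {p_perf:.3f}; passing rate: p = {p_pass:.3f}")
```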

Conclusion: Timely, automated feedback from GenAI is highly valued by students for its accessibility. Yet the current study suggests that usability limitations, student misunderstandings, and external factors may mask its direct academic impact, highlighting the need for more refined integration and for future research incorporating affective measures [15], [38] to fully understand and unlock the pedagogical potential of LLM-based feedback [33].

Keywords

References

Azaiz, I., Kiesler, N., & Strickroth, S. (2024). Feedback-generation for programming exercises with GPT-4. In Proceedings of the 2024 Innovation and Technology in Computer Science Education V. 1 (ITiCSE 2024) (pp. 31–37). ACM. https://doi.org/10.1145/3649217.3653594
Bailey, R., & Garner, M. (2010). Is the feedback in higher education assessment worth the paper it is written on? Teachers' reflections on their practices. Teaching in Higher Education, 15(2), 187–198. https://doi.org/10.1080/13562511003620019
Bangs, J. (2007). Teaching perfect and imperfect competition with context-rich problems. SSRN Electronic Journal, 92(3), 463. https://doi.org/10.2139/ssrn.1024000
Bassner, P., Frankford, E., & Krusche, S. (2024). Iris: An AI-driven virtual tutor for computer science education. In Proceedings of the 2024 Innovation and Technology in Computer Science Education V. 1 (pp. 394–400). Milan, Italy: Association for Computing Machinery. https://doi.org/10.1145/3649217.3653543
Bills, S., Cammarata, N., Mossing, D., Tillman, H., Gao, L., Goh, G., Sutskever, I., Leike, J., Wu, J., & Saunders, W. (2023). Language models can explain neurons in language models. Available at https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html
Brooke, J. (1996). SUS—a quick and dirty usability scale. In Usability Evaluation in Industry (pp. 189–194). London: Taylor & Francis.
Bull, C., & Kharrufa, A. (2024). Generative artificial intelligence assistants in software development education: A vision for integrating generative artificial intelligence into educational practice, not instinctively defending against it. IEEE Software, 41(2), 52–59. https://doi.org/10.1109/ms.2023.3300574
Cardoso-Júnior, A., & Faria, R. M. D. D. (2021). Psychometric assessment of the Instructional Materials Motivation Survey (IMMS) instrument in a remote learning environment. Revista Brasileira de Educação Médica, 45(4), e197. https://doi.org/10.1590/1981-5271v45.4-20210066.ing
Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., ... Zaremba, W. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
Kumar Tiwari, S. (2023). Integration of AI and machine learning with automation testing in digital transformation. International Journal of Applied Engineering & Technology, 5(S1), 95–103. Roman Science Publications.
Kesarpu, S., & Dasari, H. P. (2025). Kafka event sourcing for real-time risk analysis. International Journal of Computational and Experimental Science and Engineering, 11(3). https://doi.org/10.22399/ijcesen.3715
Singh, V. (2024). The impact of artificial intelligence on compliance and regulatory reporting. Journal of Electrical Systems, 20(11s), 4322–4328. https://doi.org/10.52783/jes.8484
Real-time financial data processing using Apache Spark and Kafka. (2025). International Journal of Data Science and Machine Learning, 5(01), 137–169. https://doi.org/10.55640/ijdsml-05-01-16
