EVALUATING CONVERSATIONAL AND PLATFORM-INTEGRATED GENERATIVE AI FOR AUTOMATED, TIMELY FEEDBACK IN PROGRAMMING EDUCATION: A QUASI-EXPERIMENTAL STUDY UTILIZING GPT-4O-MINI

Faculty of Computer Science and Engineering, Meiji University of Technology, Tokyo, Japan

Abstract

Context: Effective feedback is critical for novice programmers, but providing it in a timely and scalable manner poses a significant challenge in higher education [13], [14], [37]. Generative Artificial Intelligence (GenAI), particularly Large Language Models (LLMs) trained on code [9], [36], offers a promising avenue to automate this process [1], [22].

Objectives: This quasi-experimental study aimed to evaluate the usability, student perceptions, and academic impact of two distinct GenAI-assisted feedback tools, both powered by GPT-4o-mini: a conversational assistant (tutorB@t) and a platform-embedded tool integrated with a virtual code evaluator (tutorBot+).
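
For context, the interaction pattern shared by both tools can be sketched as a single LLM call. The snippet below is an illustrative approximation only, not the authors' published implementation: it assumes the official openai Python client, a hypothetical generate_feedback helper, and invented prompt wording; the evaluator_report parameter stands in for the virtual code evaluator's output that tutorBot+ consumes.

```python
# Illustrative sketch only: the paper does not publish tutorBot+'s source.
# Assumes the official `openai` client (pip install openai) and an
# OPENAI_API_KEY in the environment; prompt wording is hypothetical.
from openai import OpenAI

client = OpenAI()

def generate_feedback(exercise: str, submission: str, evaluator_report: str) -> str:
    """Ask GPT-4o-mini for formative feedback on a student submission.

    `evaluator_report` stands in for the virtual code evaluator's output
    (e.g., failed test cases) that a platform-embedded tool would pass along.
    """
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "You are a programming tutor. Give hints, not solutions."},
            {"role": "user",
             "content": f"Exercise:\n{exercise}\n\nSubmission:\n{submission}\n\n"
                        f"Evaluator report:\n{evaluator_report}"},
        ],
        temperature=0.2,  # keep feedback consistent across students
    )
    return response.choices[0].message.content
```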

Methods: The study involved 91 undergraduate computer science students, 37 of whom were assigned to the experimental AI-assisted group; the remaining 54 served as the comparison group. We measured student programming performance and passing rates, and used the System Usability Scale (SUS) [6] to assess the perceived utility and ease of use of the developed tools.
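
The SUS scores reported in the Results are on the instrument's standard 0-100 scale. As a reference for how per-student scores behind averages such as 70.6 are derived, here is the standard Brooke scoring rule; the function name and example responses are illustrative, not taken from the study.

```python
# Standard SUS scoring (Brooke, 1996): a minimal sketch.
def sus_score(responses: list[int]) -> float:
    """Score one respondent's 10 SUS items (each rated 1-5)."""
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("SUS needs ten responses on a 1-5 scale")
    contributions = [
        (r - 1) if i % 2 == 0 else (5 - r)  # odd-numbered items: r-1; even: 5-r
        for i, r in enumerate(responses)
    ]
    return sum(contributions) * 2.5         # scale the 0-40 sum to 0-100

# Example: a fairly positive respondent
print(sus_score([4, 2, 4, 1, 5, 2, 4, 2, 4, 2]))  # 80.0
```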

Results: Students highly valued the immediacy and accessibility of the AI feedback. Perception scores were positive, with tutorB@t achieving a SUS score of 70.6 and tutorBot+ scoring 65.2, and a high intent to reuse (81% and 79%, respectively). Crucially, despite positive perceptions, the study found no statistically significant difference in objective programming performance or passing rates between the groups. This outcome is attributed primarily to factors such as a lack of group homogeneity, external academic pressures, and occasional student misunderstanding of the GenAI-provided feedback.
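
The abstract does not name the statistical procedures behind the null result. The sketch below assumes two common choices for this design, a Welch t-test on performance scores and a chi-square test on passing counts; all data values are invented placeholders, not the study's data.

```python
# Hedged sketch of the group comparison; the study's exact tests are not
# specified here, and the numbers below are placeholders.
from scipy import stats

# Hypothetical per-group performance scores (0-100)
ai_group = [72, 65, 80, 58, 91, 77, 69]
control  = [70, 62, 84, 55, 88, 74, 71]

t_stat, p_perf = stats.ttest_ind(ai_group, control, equal_var=False)

# Passing counts as [passed, failed] per group (placeholder numbers
# consistent with n = 37 and n = 54)
table = [[25, 12],   # AI-assisted group
         [36, 18]]   # comparison group
chi2, p_pass, dof, _ = stats.chi2_contingency(table)

print(f"performance: p = {p_perf:.3f}; passing rate: p = {p_pass:.3f}")
```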

Conclusion: Timely, automated feedback from GenAI is highly valued by students for its accessibility. Yet the current study suggests that usability limitations, student misunderstandings, and external factors may mask its direct academic impact, highlighting the need for more refined integration and for future research incorporating affective measures [15], [38] to fully understand and unlock the pedagogical potential of LLM-based feedback [33].

Keywords

References

Azaiz, I., Kiesler, N., & Strickroth, S. (2024). Feedback-generation for programming exercises with GPT-4. In Proceedings of the 2024 Innovation and Technology in Computer Science Education V. 1 (ITiCSE 2024) (pp. 31–37). ACM. https://doi.org/10.1145/3649217.3653594
Bailey, R., & Garner, M. (2010). Is the feedback in higher education assessment worth the paper it is written on? Teachers' reflections on their practices. Teaching in Higher Education, 15(2), 187–198. https://doi.org/10.1080/13562511003620019
Bangs, J. (2007). Teaching perfect and imperfect competition with context-rich problems. SSRN Electronic Journal, 92(3), 463. https://doi.org/10.2139/ssrn.1024000
Bassner, P., Frankford, E., & Krusche, S. (2024). Iris: An AI-driven virtual tutor for computer science education. In Proceedings of the 2024 Innovation and Technology in Computer Science Education V. 1 (pp. 394–400). Milan, Italy: Association for Computing Machinery. https://doi.org/10.1145/3649217.3653543
Bills, S., Cammarata, N., Mossing, D., Tillman, H., Gao, L., Goh, G., Sutskever, I., Leike, J., Wu, J., & Saunders, W. (2023). Language models can explain neurons in language models. Available at https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html
Brooke, J. (1996). SUS—a quick and dirty usability scale. In Usability Evaluation in Industry (pp. 189–194). London: Taylor & Francis.
Bull, C., & Kharrufa, A. (2024). Generative artificial intelligence assistants in software development education: A vision for integrating generative artificial intelligence into educational practice, not instinctively defending against it. IEEE Software, 41(2), 52–59. https://doi.org/10.1109/ms.2023.3300574
Cardoso-Júnior, A., & Faria, R. M. D. D. (2021). Psychometric assessment of the Instructional Materials Motivation Survey (IMMS) instrument in a remote learning environment. Revista Brasileira de Educação Médica, 45(4), e197. https://doi.org/10.1590/1981-5271v45.4-20210066.ing
Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., ... Zaremba, W. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
Kumar Tiwari, S. (2023). Integration of AI and machine learning with automation testing in digital transformation. International Journal of Applied Engineering & Technology, 5(S1), 95–103. Roman Science Publications.
Kesarpu, S., & Dasari, H. P. (2025). Kafka event sourcing for real-time risk analysis. International Journal of Computational and Experimental Science and Engineering, 11(3). https://doi.org/10.22399/ijcesen.3715
Singh, V. (2024). The impact of artificial intelligence on compliance and regulatory reporting. Journal of Electrical Systems, 20(11s), 4322–4328. https://doi.org/10.52783/jes.8484
Real-time financial data processing using Apache Spark and Kafka. (2025). International Journal of Data Science and Machine Learning, 5(01), 137–169. https://doi.org/10.55640/ijdsml-05-01-16
