Recent research has explored the creation of questions from code submitted by students. These Questions about Learners' Code (QLCs) are created through program analysis, exploring execution paths, and then creating code comprehension questions from these paths and the broader code structure. Responding to the questions requires reading and tracing the code, which is known to support students' learning. At the same time, computing education researchers have witnessed the emergence of Large Language Models (LLMs) that have taken the community by storm. Researchers have demonstrated the applicability of these models especially in the introductory programming context, outlining their performance in solving introductory programming problems and their utility in creating new learning resources. In this work, we explore the capability of the state-of-the-art LLMs (GPT-3.5 and GPT-4) in answering QLCs that are generated from code that the LLMs have created. Our results show that although the state-of-the-art LLMs can create programs and trace program execution when prompted, they easily succumb to similar errors that have previously been recorded for novice programmers. These results demonstrate the fallibility of these models and perhaps dampen the expectations fueled by the recent LLM hype. At the same time, we also highlight future research possibilities such as using LLMs to mimic students as their behavior can indeed be similar for some specific tasks.

## Overview

- Explores how the large language models behind ChatGPT (GPT-3.5 and GPT-4) perform on program comprehension tasks
- Evaluates the models' ability to answer Questions about Learners' Code (QLCs) generated from programs the models themselves wrote
- Provides insights into the capabilities and limitations of these models for introductory programming tasks

## Plain English Explanation

This paper investigates how well the artificial intelligence (AI) system ChatGPT can understand and explain computer programs. [ChatGPT is a large language model](https://aimodels.fyi/papers/arxiv/evaluation-chatgpt-usability-as-code-generation-tool) that is trained to generate human-like text, and the researchers were curious to see how it would perform on tasks related to programming.

The researchers asked ChatGPT a series of questions about short programs, such as "What does this code do?" or "What is the output of this program?" The questions were generated automatically from programs the model itself had written. The goal was to see whether ChatGPT could accurately describe the purpose and behavior of that code. Reading and tracing code in this way is an important skill for [introductory programming students](https://aimodels.fyi/papers/arxiv/cseprompts-benchmark-introductory-computer-science-prompts), and it is known to support learning.

By evaluating ChatGPT's responses, the researchers gained insights into the AI's strengths and weaknesses in understanding code. They found that ChatGPT could generally explain simple programs accurately but struggled with more complex code, making errors similar to those previously recorded for novice programmers. This suggests that while large language models like ChatGPT can be useful tools for [generating and understanding natural language](https://aimodels.fyi/papers/arxiv/automatic-generation-evaluation-reading-comprehension-test-items), they may have limitations when it comes to [reasoning about the intricacies of computer programs](https://aimodels.fyi/papers/arxiv/large-language-models-mathematicians).

## Technical Explanation

The researchers conducted a series of experiments to assess ChatGPT's performance on program comprehension tasks. They selected a set of code snippets representing a range of programming concepts, from simple conditional statements to more complex data structures and algorithms.

For each code snippet, the researchers asked ChatGPT questions that tested its understanding of the program's functionality, such as "What is the output of this code?" or "Describe what this code does." ChatGPT's responses were then evaluated by human raters for accuracy and completeness.
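To make the idea of program-derived questions more concrete, here is a minimal sketch of how questions and their answer keys might be generated from a program by static analysis. It is not the authors' tooling: the function name `generate_qlcs`, the question wording, and the sample program are all assumptions made for illustration, and the sketch only inspects the syntax tree rather than exploring execution paths.

```python
import ast


def generate_qlcs(source: str) -> list[dict]:
    """Build simple comprehension questions (with answer keys) from Python source.

    Hypothetical helper for illustration; real QLC generators are richer.
    """
    tree = ast.parse(source)
    questions = []

    # Names that are assigned somewhere in the program (assignment or loop targets).
    assigned = sorted({
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store)
    })
    if assigned:
        questions.append({
            "question": "Which variable names are assigned in this program?",
            "answer": assigned,
        })

    # Line numbers where a for or while loop starts.
    loop_lines = sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, (ast.For, ast.While))
    )
    if loop_lines:
        questions.append({
            "question": "On which line(s) does a loop begin?",
            "answer": loop_lines,
        })

    # Parameter names of each function definition.
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            questions.append({
                "question": f"What are the parameters of the function '{node.name}'?",
                "answer": [arg.arg for arg in node.args.args],
            })

    return questions


# Illustrative program; not taken from the study.
SAMPLE = """def total(values):
    result = 0
    for v in values:
        result += v
    return result
"""

if __name__ == "__main__":
    for item in generate_qlcs(SAMPLE):
        print(item["question"], "->", item["answer"])
```

Because the answer key comes from the program itself, each question can be posed to a model (or a student) and graded against that key. Questions about run-time behavior, such as how many times a loop body runs, would need execution tracing on top of this.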

The results showed that ChatGPT was generally able to provide accurate explanations for simple programs, but struggled with more complex code. The researchers found that ChatGPT's performance was influenced by factors such as the length and complexity of the code, the programming concepts involved, and the specific wording of the questions.

The researchers also noted that ChatGPT sometimes generated plausible-sounding but incorrect responses, highlighting the need for caution when relying on large language models for tasks that require precise reasoning about code behavior. [This aligns with findings from other studies](https://aimodels.fyi/papers/arxiv/enhancing-general-agent-capabilities-low-parameter-llms) that have explored the limitations of large language models in mathematical and technical domains.
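One practical way to guard against plausible-sounding but wrong answers is to check a claimed output against what the program actually prints. The sketch below assumes trusted, side-effect-free snippets, and the helper names `actual_output` and `check_claim` are hypothetical. It illustrates the general idea and is not a procedure described in the paper.

```python
import contextlib
import io


def actual_output(source: str) -> str:
    """Run trusted Python source and return everything it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        # exec is only acceptable here because the snippet is assumed trusted.
        exec(source, {})
    return buffer.getvalue()


def check_claim(source: str, claimed_output: str) -> bool:
    """Return True if the claimed output matches the real output (ignoring outer whitespace)."""
    return actual_output(source).strip() == claimed_output.strip()


# Illustrative snippet; not taken from the study.
SNIPPET = """total = 0
for i in range(1, 4):
    total += i
print(total)
"""

if __name__ == "__main__":
    print(check_claim(SNIPPET, "6"))   # claim matches what the program prints
    print(check_claim(SNIPPET, "10"))  # a confident but wrong claim is rejected
```

A check like this only covers questions with a mechanically verifiable answer. Free-text explanations of what code does still need human or rubric-based review.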

## Critical Analysis

The researchers acknowledge several limitations of their study. First, the code snippets used were relatively short and focused on introductory programming concepts, so the findings may not generalize to more complex, real-world code. Additionally, the study focused on how the models answer questions about code they had generated themselves, and did not evaluate other programming skills, such as modifying existing code.

Another potential limitation is the reliance on human raters to evaluate ChatGPT's responses. While the researchers took steps to ensure consistency, there could still be some subjectivity in the assessment process. [It would be interesting to see if the results hold up under more rigorous, automated evaluation methods](https://aimodels.fyi/papers/arxiv/automatic-generation-evaluation-reading-comprehension-test-items).

Overall, the researchers provide valuable insights into the capabilities and limitations of large language models like ChatGPT when it comes to program comprehension. While these models may be useful tools for certain tasks, such as [generating natural language descriptions of code](https://aimodels.fyi/papers/arxiv/large-language-models-mathematicians), the findings suggest they may not be sufficient for tasks that require deep understanding and reasoning about the intricacies of computer programs.

## Conclusion

This study offers a nuanced perspective on the use of large language models like ChatGPT for programming-related tasks. While the results suggest that ChatGPT can provide accurate explanations for simple code, the model's performance degrades as the complexity of the code increases. This highlights the need for continued research and development to [enhance the general capabilities of large language models](https://aimodels.fyi/papers/arxiv/enhancing-general-agent-capabilities-low-parameter-llms) in technical domains.

The findings also have implications for the potential use of large language models in educational settings, where they could be leveraged to support [introductory programming instruction](https://aimodels.fyi/papers/arxiv/cseprompts-benchmark-introductory-computer-science-prompts). However, the limitations identified in this study suggest that such models should be used with caution and as part of a broader, multifaceted approach to teaching programming concepts.

Taken together, this research clarifies where large language models like ChatGPT succeed and fail at program comprehension. It also points to future research possibilities, such as using LLMs to mimic students, since their behavior can be similar on some specific tasks.