Large Language Models (LLMs) have achieved remarkable success, and instruction tuning is the critical step in aligning LLMs with user intentions. In this work, we investigate how instruction tuning adjusts pre-trained models, with a focus on intrinsic changes. Specifically, we first develop several local and global explanation methods, including a gradient-based method for input-output attribution, and techniques for interpreting patterns and concepts in self-attention and feed-forward layers. The impact of instruction tuning is then studied by comparing the explanations derived from the pre-trained and instruction-tuned models. This approach provides an internal perspective on the model shifts at a human-comprehensible level. Our findings reveal three significant impacts of instruction tuning: 1) It empowers LLMs to recognize the instruction parts of user prompts, and promotes response generation that is constantly conditioned on the instructions. 2) It encourages the self-attention heads to capture more word-word relationships about instruction verbs. 3) It encourages the feed-forward networks to rotate their pre-trained knowledge toward user-oriented tasks. These insights contribute to a more comprehensive understanding of instruction tuning and lay the groundwork for future work that aims at explaining and optimizing LLMs for various applications. Our code and data are publicly available at https://github.com/JacksonWuxs/Interpret_Instruction_Tuning_LLMs.

## Overview

- Large language models (LLMs) have achieved remarkable success, with instruction tuning being a critical step in aligning them with user intentions.
- This work investigates how instruction tuning adjusts pre-trained models, focusing on intrinsic changes.
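- The researchers developed local and global explanation methods, including gradient-based input-output attribution and techniques for interpreting self-attention and feed-forward layers, and compared the explanations obtained from pre-trained and instruction-tuned models.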

## Plain English Explanation

Large language models (LLMs) are powerful AI systems that can understand and generate human-like text. A key step in making these models useful is "instruction tuning," which trains them to follow specific instructions from users.

In this study, the researchers looked at how instruction tuning changes the inner workings of LLMs. They developed new ways to "look under the hood" and explain what's happening inside the models, both before and after instruction tuning.

The main findings are:

1. Instruction tuning helps LLMs recognize which parts of a user's prompt are instructions, and keeps the generated response focused on those instructions throughout.

2. Instruction tuning changes the model's "attention" mechanism, making its attention heads capture more word-to-word relationships involving instruction verbs.

3. Instruction tuning also adjusts the model's core knowledge and abilities, rotating them towards being more useful for user-oriented tasks.

These insights help us better understand how instruction tuning shapes LLMs to be more aligned with human intentions. This lays the groundwork for future work on explaining and improving LLMs for real-world applications.

## Technical Explanation

The researchers first developed local and global explanation methods to analyze the inner workings of LLMs. These include a gradient-based technique for attributing the model's outputs to individual parts of the input, as well as ways to interpret the patterns and concepts captured by the model's self-attention and feed-forward layers.
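The gradient-based attribution idea can be illustrated with a minimal sketch. It is not the authors' implementation: the toy "model" (mean-pooled embeddings followed by a linear layer), the function names `output_logprob` and `gradient_times_input`, and the use of finite differences in place of automatic differentiation are all assumptions made for illustration. Each input token is scored by the magnitude of gradient times input, and the scores are normalized to sum to 1.

```python
import math

def output_logprob(embeddings, weights, target):
    """Toy stand-in for a language model (an assumption, not a real LLM):
    mean-pool the token embeddings, apply a linear layer, and return the
    log-softmax score of the target output."""
    dim = len(embeddings[0])
    pooled = [sum(tok[d] for tok in embeddings) / len(embeddings) for d in range(dim)]
    logits = [sum(w * x for w, x in zip(row, pooled)) for row in weights]
    log_norm = math.log(sum(math.exp(z) for z in logits))
    return logits[target] - log_norm

def gradient_times_input(embeddings, weights, target, eps=1e-5):
    """Score each input token by sum_d |df/de_d * e_d|, with the gradient
    estimated by central differences, then normalize scores to sum to 1."""
    scores = []
    for i, tok in enumerate(embeddings):
        total = 0.0
        for d in range(len(tok)):
            plus = [list(t) for t in embeddings]
            minus = [list(t) for t in embeddings]
            plus[i][d] += eps
            minus[i][d] -= eps
            grad = (output_logprob(plus, weights, target)
                    - output_logprob(minus, weights, target)) / (2 * eps)
            total += abs(grad * tok[d])
        scores.append(total)
    norm = sum(scores) or 1.0  # avoid division by zero for all-zero inputs
    return [s / norm for s in scores]

# Invented toy inputs: three "tokens" with 2-dimensional embeddings.
toy_embeddings = [[1.0, 0.0], [0.0, 0.0], [0.5, 0.5]]
toy_weights = [[2.0, 0.0], [0.0, 1.0]]
print(gradient_times_input(toy_embeddings, toy_weights, target=0))
```

With a real model, the finite-difference loop would normally be replaced by a single backward pass through an autograd framework; the scoring rule stays the same.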

They then used these explanation methods to compare pre-trained LLMs and LLMs that had undergone instruction tuning. This allowed them to see how the instruction tuning process changes the model's behavior and inner representations.
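One simple way to make such a comparison concrete is to measure how much of the total input importance falls on the instruction tokens of a prompt, once per model. The sketch below is an assumed illustration, not a metric taken from the paper: the helper name `instruction_share` is hypothetical, and the importance values are invented toy numbers rather than results.

```python
def instruction_share(importance, is_instruction):
    """Fraction of total token importance that lands on instruction tokens.

    importance     -- per-token importance scores (non-negative floats)
    is_instruction -- per-token booleans marking the instruction part
    """
    total = sum(importance)
    if total == 0:
        return 0.0
    on_instruction = sum(s for s, flag in zip(importance, is_instruction) if flag)
    return on_instruction / total

# Invented toy values for a four-token prompt whose first two tokens
# form the instruction. These are NOT measurements from any model.
mask = [True, True, False, False]
pretrained_importance = [0.1, 0.1, 0.4, 0.4]
tuned_importance = [0.3, 0.3, 0.2, 0.2]

shift = instruction_share(tuned_importance, mask) - instruction_share(pretrained_importance, mask)
print(f"change in instruction share: {shift:+.2f}")
```

A positive shift would indicate that the tuned model's outputs depend more heavily on the instruction tokens than the pre-trained model's outputs do.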

The key findings were:

1. Instruction tuning helps the model recognize the instruction parts of the input, and keeps response generation conditioned on those instructions throughout.

2. Instruction tuning encourages the model's self-attention heads to capture more word-word relationships involving instruction verbs.

3. Instruction tuning also encourages the model's core "feed-forward" processing to shift its pre-trained knowledge towards being more useful for user-oriented tasks.
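To give a feel for what interpreting a feed-forward layer at a human-readable level can look like, a common interpretability approach is to project a direction in the model's hidden space onto the vocabulary embeddings and read off the highest-scoring tokens. The sketch below assumes this approach for illustration only; the summary does not confirm it as the authors' exact procedure, and the function name `top_tokens`, the four-word vocabulary, and the 2-dimensional vectors are all made up.

```python
def top_tokens(direction, token_embeddings, k=3):
    """Rank vocabulary tokens by dot product with a hidden-space direction
    and return the k best-matching tokens."""
    scored = [
        (sum(a * b for a, b in zip(direction, vec)), tok)
        for tok, vec in token_embeddings.items()
    ]
    # Sort by score only, highest first; ties keep insertion order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tok for _, tok in scored[:k]]

# Invented toy vocabulary with 2-dimensional embeddings (illustrative only).
toy_vocab = {
    "write": [1.0, 0.0],
    "summarize": [0.9, 0.1],
    "the": [0.0, 1.0],
    "of": [-0.2, 0.8],
}

# A hypothetical direction extracted from a feed-forward weight matrix.
direction = [1.0, 0.0]
print(top_tokens(direction, toy_vocab, k=2))
```

Comparing the token lists obtained from the same layer before and after instruction tuning is one way to see whether the stored knowledge has moved toward user-oriented tasks.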

These insights contribute to a deeper understanding of how instruction tuning works to align LLMs with human intentions. The researchers make their code and data publicly available to support further work in this area.

## Critical Analysis

The paper provides a thoughtful and rigorous analysis of the impact of instruction tuning on LLMs. The researchers' development of new explanation methods is a valuable contribution, as it allows for a more nuanced and interpretable view of these complex models.

However, the study covers a limited set of models and tasks. It remains to be seen whether the findings generalize to other LLMs and a broader range of applications. Additionally, the paper does not delve into potential risks or unintended consequences of instruction tuning, such as [backdoor vulnerabilities](https://aimodels.fyi/papers/arxiv/instructions-as-backdoors-backdoor-vulnerabilities-instruction-tuning) or [misalignment with human feedback](https://aimodels.fyi/papers/arxiv/understanding-learning-dynamics-alignment-human-feedback).

Further research could explore the [psychometric and predictive power of LLMs](https://aimodels.fyi/papers/arxiv/psychometric-predictive-power-large-language-models) under different instruction tuning regimes, or investigate more [effective instruction tuning techniques](https://aimodels.fyi/papers/arxiv/flawn-t5-empirical-examination-effective-instruction-tuning) that balance capabilities and alignment.

## Conclusion

This study provides valuable insights into how instruction tuning shapes the inner workings of large language models. By developing new explanation methods, the researchers were able to uncover several key impacts of the instruction tuning process: improved recognition of instructions, attention heads that capture more relationships involving instruction verbs, and a shift in the feed-forward networks' pre-trained knowledge towards user-oriented tasks.

These findings contribute to a deeper understanding of how to align LLMs with human intentions, which is crucial as these models become increasingly ubiquitous in real-world applications. The publicly available code and data from this work will also support further research in this important area of AI development.