PairRM is a Pairwise Reward Model for Large Language Models (LLMs) developed by the LLM-Blender team. It takes an instruction and a pair of output candidates as input, and outputs a score for each candidate to measure their relative quality. Unlike other reward models that encode and score each candidate independently, PairRM compares the candidates side by side to identify subtle differences. It is based on microsoft/deberta-v3-large, which makes it a compact 0.4B-parameter model. PairRM was trained on a diverse collection of six human-preference datasets, as described in the LLM-Blender paper, which lets it evaluate and compare the quality of LLM outputs across a wide range of tasks and domains.

Model inputs and outputs

Inputs

- **Instruction**: The task or prompt given to the LLM.
- **Pair of output candidates**: Two potential responses generated by the LLM for the given instruction.

Outputs

- **Relative quality score**: A score indicating the relative quality of the two output candidates, with a higher score indicating the first candidate is better than the second.

Capabilities

PairRM can be used to assess the quality of LLM outputs efficiently in a local environment. It can (re-)rank a list of candidate outputs, effectively functioning as an LLM evaluator. PairRM can also enhance the decoding process of LLMs by performing "best-of-n sampling", that is, reranking N sampled outputs to select the highest-quality response.

What can I use it for?

The LLM-Blender team suggests that PairRM can be used for a variety of applications, such as:

- **LLM Evaluation**: Assessing the quality of outputs from LLMs in a local environment, allowing for efficient model selection and comparison.
- **Decoding Enhancement**: Improving the decoding process of LLMs by reranking a set of sampled outputs to select the highest-quality response.
- **RLHF Alignment**: Using PairRM in conjunction with Reinforcement Learning from Human Feedback (RLHF) methods to further align instruction-tuned LLMs with human preferences.

Things to try

A key property of PairRM is that it compares output candidates side by side rather than scoring them independently. This allows it to identify subtle differences in quality that other reward models may miss. Developers and researchers could explore how this pairwise scoring approach affects performance compared to other reward models, and how it could be leveraged in different LLM optimization and evaluation workflows.
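
To make the reranking and best-of-n ideas described above more concrete, here is a minimal Python sketch of how pairwise comparison scores could be turned into a ranking. It does not load or call PairRM. The `toy_compare` function is a hypothetical stand-in that scores candidates by word overlap with the instruction, and the names `rank_candidates` and `best_of_n`, as well as the choice to sum scores over all ordered pairs, are assumptions made for illustration rather than the LLM-Blender team's implementation.

```python
from itertools import permutations
from typing import Callable, List

# Assumed interface: (instruction, candidate_a, candidate_b) -> relative score,
# where a positive value means candidate_a is preferred over candidate_b.
PairwiseScorer = Callable[[str, str, str], float]


def toy_compare(instruction: str, cand_a: str, cand_b: str) -> float:
    """Hypothetical stand-in for a pairwise reward model (this is NOT PairRM).

    Counts how many instruction words each candidate reuses and returns
    the difference between the two counts.
    """
    words = set(instruction.lower().split())

    def overlap(text: str) -> int:
        return len(words & set(text.lower().split()))

    return float(overlap(cand_a) - overlap(cand_b))


def rank_candidates(
    instruction: str, candidates: List[str], compare: PairwiseScorer
) -> List[int]:
    """Return candidate indices ordered from best to worst.

    Each candidate's total is the sum of its relative scores against
    every other candidate (one simple aggregation choice among many).
    """
    totals = [0.0] * len(candidates)
    for i, j in permutations(range(len(candidates)), 2):
        totals[i] += compare(instruction, candidates[i], candidates[j])
    return sorted(range(len(candidates)), key=lambda k: totals[k], reverse=True)


def best_of_n(
    instruction: str, candidates: List[str], compare: PairwiseScorer
) -> str:
    """Best-of-n selection: keep only the top-ranked candidate."""
    return candidates[rank_candidates(instruction, candidates, compare)[0]]


instruction = "Name the capital of France"
candidates = [
    "I am not sure.",
    "The capital of France is Paris.",
    "France is a country.",
]
print(rank_candidates(instruction, candidates, toy_compare))  # [1, 2, 0]
print(best_of_n(instruction, candidates, toy_compare))
```

In a real setup, `toy_compare` would be replaced by a function that queries PairRM for the relative score of two candidates. Comparing all ordered pairs costs N × (N − 1) comparisons for N candidates, which is usually manageable for small N but worth keeping in mind when sampling many outputs.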

Updated 5/15/2024