Large Language Models (LLMs) have demonstrated significant potential in handling complex reasoning tasks through step-by-step rationale generation. However, recent studies have raised concerns regarding hallucinations and flaws in their reasoning process. Substantial efforts are being made to improve the reliability and faithfulness of the generated rationales. Some approaches model reasoning as planning, while others focus on annotating for process supervision. Nevertheless, the planning-based search process often results in high latency due to the frequent assessment of intermediate reasoning states and the extensive exploration space. Additionally, supervising the reasoning process with human annotation is costly and challenging to scale for LLM training. To address these issues, in this paper, we propose a framework to learn planning-based reasoning through Direct Preference Optimization (DPO) on collected trajectories, which are ranked according to synthesized process rewards. Our results on challenging logical reasoning benchmarks demonstrate the effectiveness of our learning framework, showing that our 7B model can surpass strong counterparts like GPT-3.5-Turbo.

## Overview

- Large language models (LLMs) have shown promise in complex reasoning tasks through step-by-step rationale generation.
- However, recent studies have raised concerns about hallucination and flaws in their reasoning process.
- Efforts are underway to improve the reliability and faithfulness of the generated rationales.
- Some approaches model reasoning as planning, while others focus on annotating for process supervision.
- Planning-based search often results in high latency due to frequent assessment of intermediate reasoning states and the large exploration space.
- Supervising the reasoning process with human annotation is costly and challenging to scale for LLM training.

## Plain English Explanation

Large language models are powerful AI systems that can understand and generate human-like text. Researchers have found that these models can be used for complex reasoning tasks, where the model explains its thought process step-by-step. However, recent studies have shown that these models can sometimes produce incorrect or nonsensical reasoning, a problem known as "hallucination." [https://aimodels.fyi/papers/arxiv/llm-reasoners-new-evaluation-library-analysis-step]

To address this issue, researchers are exploring different approaches to improve the reliability and accuracy of the models' reasoning process. Some researchers are treating reasoning as a planning problem, where the model plans out a sequence of steps to arrive at the final answer. [https://aimodels.fyi/papers/arxiv/improving-language-model-reasoning-self-motivated-learning] Others are focusing on process supervision, where humans annotate the individual steps of a model's reasoning so that the model can be trained on step-level feedback rather than only on the final answer. [https://aimodels.fyi/papers/arxiv/direct-preference-optimization-video-large-multimodal-models]

However, these approaches have their own challenges. The planning-based approach can be slow and computationally intensive, as the model has to constantly evaluate intermediate reasoning states. And the human annotation approach is costly and difficult to scale up for training large language models. [https://aimodels.fyi/papers/arxiv/can-only-llms-do-reasoning-potential-small]

## Technical Explanation

In this paper, the researchers propose a new framework to address these issues. They collect reasoning trajectories, rank them according to synthesized process rewards, and then train the model on the ranked trajectories with "Direct Preference Optimization" (DPO), a technique that learns directly from pairs of preferred and dispreferred outputs. This allows the model to learn planning-based reasoning without the search latency of the traditional planning-based approach, and without the need for expensive human annotations.
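The idea of turning reward-ranked trajectories into a DPO training signal can be sketched as follows. This is a minimal illustration and not the authors' implementation: the helper names `build_preference_pairs` and `dpo_loss`, the `margin` threshold, the `beta` value, and the example rewards and log-probabilities are all assumptions made for the sketch. The loss itself is the standard DPO objective for a single preference pair, computed from the log-probabilities that the policy and a frozen reference model assign to each trajectory.

```python
import math
from itertools import combinations


def build_preference_pairs(trajectories, margin=0.0):
    """Turn (text, reward) trajectories into (chosen, rejected) pairs.

    A pair is kept only when the reward gap exceeds `margin`
    (a hypothetical threshold, not taken from the paper).
    """
    pairs = []
    for (text_a, r_a), (text_b, r_b) in combinations(trajectories, 2):
        if abs(r_a - r_b) <= margin:
            continue  # rewards too close to express a preference
        pairs.append((text_a, text_b) if r_a > r_b else (text_b, text_a))
    return pairs


def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the trained policy and
    the frozen reference model. `beta` scales the implicit reward.
    """
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(logits)) == softplus(-logits)
    return _softplus(-logits)


# Illustrative trajectories with made-up process rewards.
trajectories = [("trajectory A", 0.9), ("trajectory B", 0.4), ("trajectory C", 0.4)]
pairs = build_preference_pairs(trajectories)

# Placeholder log-probabilities; in practice these come from the models.
loss = dpo_loss(policy_chosen_logp=-1.0, policy_rejected_logp=-3.0,
                ref_chosen_logp=-2.0, ref_rejected_logp=-2.0)
```

Trajectories B and C receive the same reward here, so no pair is formed between them. The loss falls below its starting value of log 2 once the policy assigns relatively more probability to the chosen trajectory than the reference model does, which is the direction training pushes in.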

The researchers evaluated their framework on challenging logical reasoning benchmarks and found that their 7B-parameter model was able to outperform strong counterparts like the GPT-3.5-Turbo model. [https://aimodels.fyi/papers/arxiv/can-small-language-models-help-large-language]

## Critical Analysis

The researchers have presented a promising approach to improving the reliability and faithfulness of large language models' reasoning process. By utilizing DPO to learn from ranked trajectories, they were able to bypass some of the key challenges of traditional planning-based and annotation-based methods.

However, the paper does not delve into the specific details of how the process rewards were synthesized, which could be an important factor in the model's performance. Additionally, the researchers only evaluated their framework on logical reasoning tasks, so it's unclear how well it would generalize to other types of complex reasoning.

Further research could explore the broader applicability of this approach, as well as investigate the sensitivity of the framework to the quality and composition of the training data. Ultimately, this work represents an important step towards more reliable and trustworthy language models for complex reasoning tasks.

## Conclusion

This paper proposes a new framework for learning planning-based reasoning in large language models, using a technique called Direct Preference Optimization. By learning from ranked trajectories with synthesized process rewards, the researchers were able to improve the reliability and faithfulness of the models' reasoning process, while avoiding the high computational cost and scalability challenges of traditional approaches.

The results on logical reasoning benchmarks are promising, showing that the researchers' 7B-parameter model can outperform strong counterparts such as GPT-3.5-Turbo. This work is a step towards large language models that reason more reliably on complex tasks, although its applicability beyond logical reasoning remains to be tested.