
How AI Learns from Human Preferences


Too Long; Didn't Read

RLHF is a method for training AI models from human feedback. It involves three phases: 1) supervised fine-tuning on labeled data, 2) collecting human preference data and learning a reward model, and 3) optimizing the model against that learned reward. The Bradley-Terry model is commonly used to model human preferences.

Authors:

(1) Rafael Rafailov, Stanford University (equal contribution; more junior authors listed earlier);
(2) Archit Sharma, Stanford University (equal contribution; more junior authors listed earlier);
(3) Eric Mitchell, Stanford University (equal contribution; more junior authors listed earlier);
(4) Stefano Ermon, CZ Biohub;
(5) Christopher D. Manning, Stanford University;
(6) Chelsea Finn, Stanford University.

Abstract and 1. Introduction

2 Related Work

3 Preliminaries

4 Direct Preference Optimization

5 Theoretical Analysis of DPO

6 Experiments

7 Discussion, Acknowledgements, and References

Author Contributions


A Mathematical Derivations

A.1 Deriving the Optimum of the KL-Constrained Reward Maximization Objective

A.2 Deriving the DPO Objective Under the Bradley-Terry Model

A.3 Deriving the DPO Objective Under the Plackett-Luce Model

A.4 Deriving the Gradient of the DPO Objective and A.5 Proof of Lemma 1 and 2

A.6 Proof of Theorem 1


B DPO Implementation Details and Hyperparameters


C Further Details on the Experimental Set-Up and C.1 IMDb Sentiment Experiment and Baseline Details

C.2 GPT-4 prompts for computing summarization and dialogue win rates

C.3 Unlikelihood baseline


D Additional Empirical Results

D.1 Performance of Best of N baseline for Various N and D.2 Sample Responses and GPT-4 Judgments

D.3 Human study details

3 Preliminaries

We review the RLHF pipeline of Ziegler et al. (and later [38, 1, 26]). It usually includes three phases: 1) supervised fine-tuning (SFT); 2) preference sampling and reward learning; and 3) RL optimization.
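As a rough illustration of phase 2 (reward learning), the sketch below assumes a scalar reward model that scores a preferred and a dispreferred completion for the same prompt. Under the Bradley-Terry model, the probability that the preferred completion wins is the sigmoid of the score difference, and the reward model is fit by minimizing the negative log-likelihood of the observed preferences. The function names and toy scores are illustrative, not taken from the paper.

```python
import math

def bradley_terry_prob(reward_chosen: float, reward_rejected: float) -> float:
    """P(chosen > rejected) under the Bradley-Terry model: sigmoid(r_w - r_l)."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

def reward_model_loss(score_pairs) -> float:
    """Average negative log-likelihood of human preferences over a batch of
    (reward_chosen, reward_rejected) pairs produced by a reward model."""
    nll = 0.0
    for r_w, r_l in score_pairs:
        nll -= math.log(bradley_terry_prob(r_w, r_l))
    return nll / len(score_pairs)

# Hypothetical reward-model scores for three preference pairs.
pairs = [(1.3, 0.2), (0.8, 1.1), (2.0, -0.5)]
print(reward_model_loss(pairs))  # lower loss pushes r(x, y_w) above r(x, y_l)
```

In the full pipeline, phase 3 then maximizes this learned reward with an RL algorithm under a KL constraint toward the SFT model (the objective whose optimum is derived in Appendix A.1).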




This paper is under CC BY-NC-ND 4.0 DEED license.

