Authors:
(1) Nathan Lambert, Allen Institute for AI;
(2) Roberto Calandra, TU Dresden.
Acknowledgments and References
This work was partly supported by the German Research Foundation (DFG, Deutsche Forschungsgemeinschaft) as part of Germany’s Excellence Strategy – EXC 2050/1 – Project ID 390696704 – Cluster of Excellence “Centre for Tactile Internet with Human-in-the-Loop” (CeTI) of Technische Universität Dresden, and by Bundesministerium für Bildung und Forschung (BMBF) and German Academic Exchange Service (DAAD) in project 57616814 (SECAI, School of Embedded and Composite AI).
Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Henighan, T., . . . others (2021). A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861.
Baheti, A., Lu, X., Brahman, F., Bras, R. L., Sap, M., & Riedl, M. (2023). Improving language models with advantage-based offline policy gradients. arXiv preprint arXiv:2305.14718.
Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., . . . others (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
Casper, S., Davies, X., Shi, C., Gilbert, T. K., Scheurer, J., Rando, J., . . . others (2023). Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217.
Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30.
Chua, K., Calandra, R., McAllister, R., & Levine, S. (2018). Deep reinforcement learning in a handful of trials using probabilistic dynamics models. Advances in neural information processing systems, 31.
Coste, T., Anwar, U., Kirk, R., & Krueger, D. (2023). Reward model ensembles help mitigate overoptimization.
Cui, G., Yuan, L., Ding, N., Yao, G., Zhu, W., Ni, Y., . . . Sun, M. (2023). UltraFeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377.
Deng, H., & Raffel, C. (2023). Reward-augmented decoding: Efficient controlled text generation with a unidirectional reward model. arXiv preprint arXiv:2310.09520.
Ethayarajh, K., Choi, Y., & Swayamdipta, S. (2022, 17–23 Jul). Understanding dataset difficulty with V-usable information. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, & S. Sabato (Eds.), Proceedings of the 39th international conference on machine learning (Vol. 162, pp. 5988–6008). PMLR.
Feng, X., Wan, Z., Wen, M., Wen, Y., Zhang, W., & Wang, J. (2023). AlphaZero-like tree-search can guide large language model decoding and training. arXiv preprint arXiv:2309.17179.
Fernandes, P., Madaan, A., Liu, E., Farinhas, A., Martins, P. H., Bertsch, A., . . . others (2023). Bridging the gap: A survey on integrating (human) feedback for natural language generation. arXiv preprint arXiv:2305.00955.
Gao, L., Schulman, J., & Hilton, J. (2022). Scaling laws for reward model overoptimization. arXiv preprint arXiv:2210.10760.
Gilbert, T. K., Dean, S., Zick, T., & Lambert, N. (2022). Choices, risks, and reward reports: Charting public policy for reinforcement learning systems. arXiv preprint arXiv:2202.05716.
Glaese, A., McAleese, N., Trębacz, M., Aslanides, J., Firoiu, V., Ewalds, T., . . . others (2022). Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375.
Janner, M., Li, Q., & Levine, S. (2021). Offline reinforcement learning as one big sequence modeling problem. Advances in neural information processing systems, 34, 1273–1286.
Kiela, D., Thrush, T., Ethayarajh, K., & Singh, A. (2023). Plotting progress in AI. Contextual AI Blog. (https://contextual.ai/blog/plotting-progress)
Kirk, R., Mediratta, I., Nalmpantis, C., Luketina, J., Hambro, E., Grefenstette, E., & Raileanu, R. (2023). Understanding the effects of RLHF on LLM generalisation and diversity.
Knox, W. B., Hatgis-Kessell, S., Adalgeirsson, S. O., Booth, S., Dragan, A., Stone, P., & Niekum, S. (2023). Learning optimal advantage from preferences and mistaking it for reward.
Knox, W. B., & Stone, P. (2008). TAMER: Training an agent manually via evaluative reinforcement. In 2008 7th IEEE international conference on development and learning (pp. 292–297).
Lambert, N., Amos, B., Yadan, O., & Calandra, R. (2020). Objective mismatch in model-based reinforcement learning. In Learning for dynamics and control (pp. 761–770).
Lambert, N., Gilbert, T. K., & Zick, T. (2023). Entangled preferences: The history and risks of reinforcement learning and human feedback.
Lambert, N., Pister, K., & Calandra, R. (2022). Investigating compounding prediction errors in learned dynamics models. arXiv preprint arXiv:2203.09637.
Lambert, N., Tunstall, L., Rajani, N., & Thrush, T. (2023). HuggingFace H4 Stack Exchange preference dataset. Retrieved from https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences
Lambert, N., Wilcox, A., Zhang, H., Pister, K. S., & Calandra, R. (2021). Learning accurate long-term dynamics for model-based reinforcement learning. In 2021 60th IEEE conference on decision and control (CDC) (pp. 2880–2887).
Leike, J., Krueger, D., Everitt, T., Martic, M., Maini, V., & Legg, S. (2018). Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871.
Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., . . . Hashimoto, T. B. (2023). AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval. GitHub.
Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., . . . others (2022). Holistic evaluation of language models. arXiv preprint arXiv:2211.09110.
This paper is under CC 4.0 license.