We hypothesize that alignment can be a simple process where the model learns the style or format for interacting with users, to expose the knowledge and capabilities that were already acquired during pretraining.
Alignment is not about learning new knowledge; it is about learning the way to interact with users.
Abstract
- Large language models are trained in two stages:
- (1) unsupervised pretraining from raw text, to learn general-purpose representations, and
- (2) large scale instruction tuning and reinforcement learning, to better align to end tasks and user preferences.
- We measure the relative importance of these two stages by training LIMA, a 65B parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling.
- LIMA demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples in the training data, including complex queries that range from planning trip itineraries to speculating about alternate history.
- Moreover, the model tends to generalize well to unseen tasks that did not appear in the training data.
- In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases; this statistic is as high as 58% when compared to Bard and 65% versus DaVinci003, which was trained with human feedback.
- Taken together, these results strongly suggest that almost all knowledge in large language models is learned during pretraining, and only limited instruction tuning data is necessary to teach models to produce high quality output.
Introduction
- We hypothesize that alignment can be a simple process where the model learns the style or format for interacting with users, to expose the knowledge and capabilities that were already acquired during pretraining.
- To test this hypothesis, we curate 1,000 examples that approximate real user prompts and high-quality responses. We select 750 top questions and answers from community forums, such as Stack Exchange and wikiHow, sampling for quality and diversity.
- In addition, we manually write 250 examples of prompts and responses, while optimizing for task diversity and emphasizing a uniform response style in the spirit of an AI assistant.
- Finally, we train LIMA, a pretrained 65B-parameter LLaMa model [Touvron et al., 2023] fine-tuned on this set of 1,000 demonstrations.
- The rest of the introduction mostly reports that performance was strong.
Alignment Data
- We define the Superficial Alignment Hypothesis: a model's knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users.
- To that end, we collect a dataset of 1,000 prompts and responses, where the outputs (responses) are stylistically aligned with each other, but the inputs (prompts) are diverse.
- Specifically, we seek outputs in the style of a helpful AI assistant.
- We curate such examples from a variety of sources, primarily split into community Q&A forums and manually authored examples. We also collect a test set of 300 prompts and a development set of 50.
- Prompts are sampled from as many domains as possible, keeping only high-quality answers; answers written in the first person ("I", "my") or containing links, images, etc. are filtered out (a rough filtering sketch follows at the end of this section).
- We also include 13 training prompts with some degree of toxicity or malevolence. We carefully write responses that partially or fully reject the command.
- We also take 50 natural language generation tasks from Super-Natural Instructions, such as summarization, paraphrasing, and style transfer, and pick a single random example from each one, lightly editing them to match the assistant style.
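A rough illustration of the filtering heuristics noted above; this is my own sketch, not the paper's code, and the function name and exact regexes are assumptions:

```python
import re

def keep_answer(text: str) -> bool:
    """Illustrative quality filter: drop answers written in the first person
    or containing links/images, roughly matching the heuristics noted above."""
    if re.search(r"\b(I|my)\b", text):                       # first-person voice ("I", "my")
        return False
    if re.search(r"https?://|\[[^\]]*\]\([^)]*\)", text):    # plain or markdown links
        return False
    if re.search(r"<img\b|!\[", text):                       # embedded images (HTML or markdown)
        return False
    return True

# Usage: answers = [a for a in raw_answers if keep_answer(a)]
```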
Training
- 65B LLaMa
- A special EOT (end-of-turn) token is added at the end of each utterance to differentiate speakers and to mark where generation should stop.
- 15 epochs with AdamW; the learning rate decays linearly from 1e-5 to 1e-6; batch size 32; max sequence length 2048 (see the sketch after this list).
- One notable deviation from the norm is the use of residual dropout: dropout is applied over residual connections, with the rate rising linearly from 0.0 at the bottom layer to 0.3 at the top layer.
- The best checkpoint, which usually falls between epochs 5 and 10, is selected using the held-out development set.
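A minimal sketch of this recipe using the Hugging Face transformers API; this is my own reconstruction, not the authors' code. The model id, the EOT string, and the data loading are assumptions, and the residual-dropout schedule is only shown as a helper because applying it would require patching the model internals.

```python
# Minimal sketch of a LIMA-style fine-tuning setup (assumed APIs and placeholders).
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

MODEL_NAME = "huggyllama/llama-65b"  # assumed checkpoint id for the 65B LLaMa model
EOT = "<EOT>"                        # assumed literal string for the end-of-turn token
MAX_SEQ_LEN = 2048                   # sequences longer than this get trimmed at tokenization

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.add_special_tokens({"additional_special_tokens": [EOT]})

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.resize_token_embeddings(len(tokenizer))  # make room for the new EOT token


def format_example(prompt: str, response: str) -> str:
    # Each utterance ends with EOT so the model learns explicit turn boundaries.
    return f"{prompt}{EOT}{response}{EOT}"


def residual_dropout_rate(layer_idx: int, num_layers: int, p_top: float = 0.3) -> float:
    # Residual dropout schedule: 0.0 at the bottom layer, rising linearly to p_top
    # at the top layer. Applying it would require patching the residual connections.
    return p_top * layer_idx / max(num_layers - 1, 1)


args = TrainingArguments(
    output_dir="lima-65b",
    num_train_epochs=15,               # 15 epochs, saving a checkpoint per epoch
    per_device_train_batch_size=32,    # batch size 32 (single-device view; sharding omitted)
    learning_rate=1e-5,                # initial LR; decays linearly (the paper ends at 1e-6,
    lr_scheduler_type="linear",        # approximated here by the standard linear schedule)
    warmup_steps=0,
    weight_decay=0.1,
    optim="adamw_torch",
    save_strategy="epoch",
    bf16=True,
)

# trainer = Trainer(model=model, args=args, train_dataset=..., tokenizer=tokenizer)
# trainer.train()  # afterwards, pick the best checkpoint (epochs 5-10) on the dev set
```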
Why is Less More
- Diversity: more diverse Stack Exchange data yields significantly higher performance.
- Quantity: doubling the amount of training data does not improve response quality (ablation on the 7B model).
Multi-Turn Dialogue
- Even without multi-turn training data, 6/10 test dialogues succeed; when LIMA fails, it usually does so within 3 turns.
- 30 multi-turn dialogues are collected: 10 written by hand, 20 taken from Stack Exchange.
- A new model is then fine-tuned on the 1,000 single-turn examples plus the 30 multi-turn dialogues (a serialization sketch follows this list).
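A minimal sketch of how a multi-turn dialogue might be serialized into one training sequence using the EOT token from the Training section; the EOT string and helper name are assumptions.

```python
EOT = "<EOT>"  # assumed end-of-turn marker, matching the training sketch above

def serialize_dialogue(turns: list[str]) -> str:
    """Join alternating user/assistant utterances, ending each with EOT,
    so turn boundaries are explicit in the fine-tuning sequence."""
    return "".join(turn + EOT for turn in turns)

# Example: a short 3-turn exchange becomes a single training sequence.
sequence = serialize_dialogue([
    "How long should I boil an egg?",        # user
    "About 7 minutes for a firm yolk.",      # assistant
    "And for a soft-boiled egg?",            # user
])
```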
Discussion
- Primarily, the mental effort in constructing such examples is significant and difficult to scale up.
- Secondly, LIMA is not as robust as product-grade models; while LIMA typically generates good responses, an unlucky sample during decoding or an adversarial prompt can often lead to a weak response.