Finally, an open-source LLM from Google to follow LLaMA!!

✍🏼 Abstract

Architecture

Uses a Transformer decoder
Context Length = 8K
Multi-Query Attention: the 7B model uses multi-head attention; the 2B model uses multi-query attention
RoPE Embeddings: rotary positional embeddings in each layer; embeddings are also shared across inputs and outputs to reduce model size
GeGLU Activations
Normalizer Location: normalizes both the input and the output of each transformer sub-layer with RMSNorm (usually only one of the two is normalized)
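
To make the architecture notes above concrete, here is a minimal NumPy sketch of three of the listed pieces: RMSNorm applied to both the input and the output of a sub-layer, a GeGLU feed-forward block, and attention in which `n_kv_heads` switches between multi-head (`n_kv_heads == n_heads`) and multi-query (`n_kv_heads == 1`). All function names, dimensions, and the random weights are assumptions made for illustration; RoPE and embedding sharing are left out, and this is not the actual Gemma implementation.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: rescale by the root-mean-square of the last axis (no mean subtraction)."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def gelu(x):
    """tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def geglu_ffn(x, w_gate, w_up, w_down):
    """GeGLU feed-forward: GELU(x @ w_gate) gates (x @ w_up), then project back."""
    return (gelu(x @ w_gate) * (x @ w_up)) @ w_down

def attention(x, wq, wk, wv, wo, n_heads, n_kv_heads):
    """Causal self-attention; n_kv_heads=1 gives multi-query attention."""
    seq = x.shape[0]
    head_dim = wq.shape[1] // n_heads
    q = (x @ wq).reshape(seq, n_heads, head_dim)
    k = (x @ wk).reshape(seq, n_kv_heads, head_dim)
    v = (x @ wv).reshape(seq, n_kv_heads, head_dim)
    # Share each K/V head across its group of query heads.
    k = np.repeat(k, n_heads // n_kv_heads, axis=1)
    v = np.repeat(v, n_heads // n_kv_heads, axis=1)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(head_dim)
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)  # positions after the query
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    out = np.einsum("hqk,khd->qhd", probs, v).reshape(seq, n_heads * head_dim)
    return out @ wo

def sublayer(x, fn, w_pre, w_post):
    """Residual sub-layer that normalizes both the input and the output of fn."""
    return x + rms_norm(fn(rms_norm(x, w_pre)), w_post)

# Toy sizes, chosen only for the demo.
rng = np.random.default_rng(0)
seq, d_model, n_heads, head_dim, d_ff = 5, 16, 4, 4, 32
x = rng.normal(size=(seq, d_model))
ones = np.ones(d_model)

def make_attn(n_kv_heads):
    wq = rng.normal(size=(d_model, n_heads * head_dim)) * 0.1
    wk = rng.normal(size=(d_model, n_kv_heads * head_dim)) * 0.1
    wv = rng.normal(size=(d_model, n_kv_heads * head_dim)) * 0.1
    wo = rng.normal(size=(n_heads * head_dim, d_model)) * 0.1
    return lambda h: attention(h, wq, wk, wv, wo, n_heads, n_kv_heads)

mha_out = sublayer(x, make_attn(n_kv_heads=n_heads), ones, ones)  # 7B-style attention
mqa_out = sublayer(x, make_attn(n_kv_heads=1), ones, ones)        # 2B-style attention
w_gate, w_up = rng.normal(size=(2, d_model, d_ff)) * 0.1
w_down = rng.normal(size=(d_ff, d_model)) * 0.1
ffn_out = sublayer(mqa_out, lambda h: geglu_ffn(h, w_gate, w_up, w_down), ones, ones)
```

The only difference between the two attention variants in this sketch is the width of the key/value projections: multi-query attention keeps a single K/V head, which shrinks the KV cache at inference time.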

Filter personal information and sensitive data
Heuristic and model-based classifiers to remove harmful or low-quality content.
Filter all evaluation sets out of the pre-training data
We stage training to alter the corpus mixture throughout training to increase the weight of relevant, high-quality data towards the end of training
- So does this mean they used more relevant, higher-quality data toward the end of training?
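
One way to read "staging" the corpus mixture is as a schedule over per-source sampling weights. The sketch below is purely hypothetical: the source names, the weights, and the "hold, then linearly shift over the last 20% of steps" schedule are all assumptions for illustration, since the notes above do not say how the mixture was actually changed.

```python
import random

def mixture_weights(step, total_steps, start, end, switch_frac=0.8):
    """Return normalized per-source sampling weights at a training step.

    Hypothetical schedule: hold `start` until `switch_frac` of training,
    then linearly interpolate toward `end` by the final step.
    """
    progress = step / total_steps
    if progress <= switch_frac:
        t = 0.0
    else:
        t = (progress - switch_frac) / (1.0 - switch_frac)
    raw = {name: (1 - t) * start[name] + t * end[name] for name in start}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}

def sample_source(weights, rng):
    """Pick which data source the next training example is drawn from."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

# Made-up sources and weights; "curated" stands in for high-quality data.
start = {"web": 0.7, "code": 0.2, "curated": 0.1}
end = {"web": 0.3, "code": 0.3, "curated": 0.4}

early = mixture_weights(100, 1000, start, end)
late = mixture_weights(1000, 1000, start, end)
picked = sample_source(late, random.Random(0))
```

Under this reading, the model sees the same sources throughout, but the probability of drawing from the high-quality source rises near the end of training.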

SFT/RLHF on a mix of text-only, English-only synthetic and human-generated prompt-response pairs
- In other words, automatically generated and human-written data were mixed, but the details are not given
Formatting
RLHF
- Uses REINFORCE instead of PPO (plus KL regularization towards the SFT model)
- So does that mean there is no critic either??
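
As a rough illustration of the RLHF note above, here is a sketch of a REINFORCE-style objective with a KL penalty towards the SFT model. Vanilla REINFORCE does not need a learned value function (critic); this sketch uses a simple batch-mean baseline instead, which is an assumption of the example and not something the notes confirm about Gemma. The function name, the `beta` coefficient, and the toy numbers are also made up.

```python
import numpy as np

def reinforce_kl_loss(logp_policy, logp_sft, rewards, beta=0.1):
    """Surrogate loss for REINFORCE with a KL penalty toward the SFT model.

    logp_policy, logp_sft: (batch, seq) per-token log-probs of sampled responses.
    rewards: (batch,) scalar reward-model scores, one per response.
    """
    seq_logp = logp_policy.sum(axis=1)
    # Sample-based estimate of KL(policy || SFT) on each sampled response.
    kl_est = (logp_policy - logp_sft).sum(axis=1)
    shaped = rewards - beta * kl_est
    # Batch-mean baseline instead of a learned critic (assumption of this sketch).
    advantage = shaped - shaped.mean()
    # In an autodiff framework, `advantage` would be treated as a constant
    # and the gradient would flow only through `seq_logp`.
    loss = -(advantage * seq_logp).mean()
    return loss, advantage

rng = np.random.default_rng(0)
logp_policy = np.log(rng.uniform(0.1, 0.9, size=(4, 6)))
logp_sft = logp_policy.copy()  # policy still equal to the SFT model: KL estimate is 0
rewards = np.array([1.0, 0.0, 2.0, 1.0])

loss, advantage = reinforce_kl_loss(logp_policy, logp_sft, rewards)
```

When the policy drifts away from the SFT model, `kl_est` grows and the shaped reward drops, which is what keeps the tuned model close to its starting point.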

Human Evaluation
- Uses 1,000 prompts (creative writing tasks, coding, following instructions)
- 400 safety-related prompts
Automated Benchmark
- physical reasoning, social reasoning, question answering, coding, mathematics, commonsense reasoning, language modeling, reading comprehension, and more
- Evaluated as similarly as possible to allow comparison with Mistral
- ARC, CommonsenseQA, Big Bench Hard, AGI Eval (English-only)
- Gemma is especially strong at math and coding (at or above the level of CodeLLaMA 7B)
Memorization Evaluation
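- Take 10,000 random samples from the corpus, feed in the first 50 tokens, and check whether the next 50 generated tokens match the ground truth
- The left side shows the result when measured in the same way (same data?) as PaLM2. Gemma's training data differs from PaLM2's, so the score can only be low.
- The right side shows the result when evaluated on all of the pretraining data used (for each model?). The score comes out low (around 1%), similar to PaLM.
- Why was this experiment even run?;;;;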
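
The memorization check described above (50-token prompt, exact match on the next 50 tokens) can be sketched as follows. The `generate` callable is a hypothetical stand-in for greedy decoding with a real model, and the toy lookup-table "model" exists only so the example runs end to end.

```python
from typing import Callable, List, Sequence

Generate = Callable[[List[int], int], List[int]]

def is_memorized(tokens: Sequence[int], generate: Generate,
                 prompt_len: int = 50, cont_len: int = 50) -> bool:
    """True if the model reproduces the next `cont_len` tokens exactly."""
    if len(tokens) < prompt_len + cont_len:
        return False  # too short to evaluate
    prompt = list(tokens[:prompt_len])
    truth = list(tokens[prompt_len:prompt_len + cont_len])
    return list(generate(prompt, cont_len)) == truth

def memorization_rate(docs: Sequence[Sequence[int]], generate: Generate) -> float:
    """Fraction of sampled documents whose continuation is reproduced exactly."""
    if not docs:
        return 0.0
    return sum(is_memorized(d, generate) for d in docs) / len(docs)

# Toy stand-in for a model: it has "memorized" doc_a and nothing else.
doc_a = list(range(100))
doc_b = list(range(200, 300))
memory = {tuple(doc_a[:50]): doc_a[50:]}

def toy_generate(prompt: List[int], n: int) -> List[int]:
    return memory.get(tuple(prompt), [0] * n)[:n]

rate = memorization_rate([doc_a, doc_b], toy_generate)
```

In the real evaluation the sample would be 10,000 documents and `generate` would call the model under test; the reported rate is this fraction.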