👨🏽‍💻 Title & Authors (Affiliation)

✍🏼 One-line summary

Trains a 1.1B-parameter Llama 2-style model on 3 trillion tokens
Pretraining Data
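- SlimPajama: a cleaned and deduplicated version of RedPajama (the original RedPajama corpus is about 1.2T tokens)
- Starcoderdata: about 250 billion tokens across 86 programming languages; GitHub issue data is also included
- The two are combined into about 950B tokens and trained for 3 epochs, with the natural-language-to-code ratio set to 7:3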
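The token budget and the 7:3 mix above can be checked with a little arithmetic. The following is a minimal sketch, not the authors' data pipeline: the function names are hypothetical, and it assumes the 7:3 ratio is applied as a sampling probability per training sample, which the notes do not specify.

```python
import random

# Figures taken from the notes above.
UNIQUE_TOKENS = 950e9          # combined SlimPajama + Starcoderdata
EPOCHS = 3
MIX = {"slimpajama": 0.7, "starcoderdata": 0.3}  # natural language : code = 7 : 3


def token_budget(unique_tokens=UNIQUE_TOKENS, epochs=EPOCHS):
    """Total tokens seen during training (unique tokens times epochs)."""
    return unique_tokens * epochs


def per_source_tokens(total, mix=MIX):
    """Split a total token budget across sources according to the mix weights."""
    return {name: total * weight for name, weight in mix.items()}


def sample_sources(n, mix=MIX, seed=0):
    """Draw n source names with probability proportional to the mix weights.

    Assumption: the ratio is enforced by weighted sampling; a real pipeline
    might instead pre-shuffle a fixed-size mixture.
    """
    rng = random.Random(seed)
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


if __name__ == "__main__":
    total = token_budget()
    print(f"total tokens seen: {total:.3e}")
    print(per_source_tokens(total))
```

Three passes over 950B tokens come to 2.85T, which is where the rounded "3 trillion tokens" figure comes from.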
Model: Transformer decoder-based (similar to Llama 2)
- Rotary Positional Embedding
- RMSNorm
- SwiGLU
- Grouped-query Attention
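To make the components listed above more concrete, here is a minimal pure-Python sketch of RMSNorm, a SwiGLU feed-forward block, and the query-to-KV head mapping used in grouped-query attention. It is illustrative only and not TinyLlama's implementation: all function names are hypothetical, the weights are plain nested lists, and rotary embeddings are left out for brevity.

```python
import math


def rms_norm(x, gain, eps=1e-5):
    """RMSNorm: divide by the root-mean-square of x, then scale by a learned gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * g for v, g in zip(x, gain)]


def silu(z):
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down( silu(gate(x)) * up(x) ), without biases."""
    gate = [silu(z) for z in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


def kv_head_for_query(q_head, n_q_heads, n_kv_heads):
    """Grouped-query attention: consecutive query heads share one KV head."""
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


if __name__ == "__main__":
    print(rms_norm([3.0, 4.0], [1.0, 1.0]))
    # Hypothetical head counts, chosen only to show the grouping.
    print([kv_head_for_query(h, 8, 2) for h in range(8)])
```

In grouped-query attention the key and value projections are smaller than the query projection because several query heads reuse the same KV head, which reduces both parameters and KV-cache memory.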
Speed Optimization
- Fully Sharded Data Parallel (FSDP)
- Flash Attention
- xFormers: replaced the fused SwiGLU module from the xFormers repository with the original SwiGLU
- Trains about 1.5 to 2 times faster than Pythia and MPT

|Hidden Size|Intermediate Hidden Size|Context Len|Heads|Layers|Vocab Size|
|:-:|:-:|:-:|:-:|:-:|:-:|
|2048|5632|2048|16|22|32000|
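As a sanity check, the sizes in the table can be turned into a rough parameter count. This is a sketch under stated assumptions rather than an official breakdown: the key/value projection width (`kv_dim=256`), untied input and output embeddings, and bias-free linear layers are not stated in these notes, and `count_params` is a hypothetical helper. The number of query heads does not change the total, since the query and output projections are hidden-by-hidden either way.

```python
def count_params(hidden=2048, intermediate=5632, layers=22, vocab=32000,
                 kv_dim=256, tied_embeddings=False):
    """Estimate parameters of a Llama-style decoder from its config sizes.

    Assumptions (not from the notes): kv_dim is the total width of the
    grouped key/value projections, there are no biases, and each block has
    two RMSNorm gain vectors plus one final norm.
    """
    embed = vocab * hidden
    lm_head = 0 if tied_embeddings else vocab * hidden

    attn = 2 * hidden * hidden      # query and output projections
    attn += 2 * hidden * kv_dim     # key and value projections (grouped-query)
    mlp = 3 * hidden * intermediate  # gate, up, and down projections (SwiGLU)
    norms = 2 * hidden              # pre-attention and pre-MLP RMSNorm gains

    per_layer = attn + mlp + norms
    final_norm = hidden
    return embed + lm_head + layers * per_layer + final_norm


if __name__ == "__main__":
    total = count_params()
    print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
```

Under these assumptions the total lands at about 1.1B, consistent with the model size in the summary; most of it sits in the SwiGLU feed-forward projections.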

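Evaluated with the LM Evaluation Harness
- On HellaSwag, PIQA, ARC, and similar tasks, scores jump abruptly at one point, apparently because multiple EOS tokens had been inserted by mistake during training
Problem-solving evaluation: uses the InstructEval benchmark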
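The EOS issue noted above is the kind of bug a small data check can surface. The sketch below is hypothetical and is not TinyLlama's data loader: it packs tokenized documents into fixed-length blocks with exactly one EOS after each document, then reports the fraction of EOS tokens so that an unexpectedly high value stands out. `EOS_ID = 2` assumes the Llama tokenizer's end-of-sequence id.

```python
EOS_ID = 2  # assumption: EOS id of the Llama tokenizer


def pack_documents(docs, context_len, eos_id=EOS_ID):
    """Concatenate token-id lists with one EOS per document, then cut into blocks.

    The trailing partial block is dropped, as is common in packed pretraining.
    """
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)  # exactly one separator per document
    n_full = len(stream) // context_len
    return [stream[i * context_len:(i + 1) * context_len] for i in range(n_full)]


def eos_fraction(blocks, eos_id=EOS_ID):
    """Share of EOS tokens across all blocks; a spike suggests duplicated EOS."""
    total = sum(len(block) for block in blocks)
    if total == 0:
        return 0.0
    return sum(block.count(eos_id) for block in blocks) / total


if __name__ == "__main__":
    docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
    blocks = pack_documents(docs, context_len=4)
    print(blocks, eos_fraction(blocks))
```

With real data the expected EOS fraction is roughly one over the average document length in tokens, so a much larger value is a quick signal that separators are being inserted more than once.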

If Korean data were available, would further pretraining Polyglot also improve its performance?
Results of trying instruction fine-tuning
- link
- This is the chat model finetuned on top of TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T. We follow HF’s Zephyr’s training recipe. The model was “initially fine-tuned on a variant of the UltraChat dataset, which contains a diverse range of synthetic dialogues generated by ChatGPT. We then further aligned the model with 🤗 TRL’s DPOTrainer on the openbmb/UltraFeedback dataset, which contain 64k prompts and model completions that are ranked by GPT-4.”
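The alignment step quoted above uses DPO (Direct Preference Optimization). As a rough illustration of what such a trainer optimizes, here is the standard per-example DPO loss written in plain Python. It is a sketch of the published formula and not TRL's code: the function name is hypothetical, the inputs are summed log-probabilities of the chosen and rejected completions under the policy and a frozen reference model, and `beta=0.1` is only an illustrative value.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * (chosen_gain - rejected_gain))).

    Each gain is how much more likely the policy makes a completion than the
    reference model does, measured in log-probability.
    """
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # Numerically stable -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: the loss is log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy prefers the chosen completion more than the reference does.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss falls as the policy raises the chosen completion's likelihood relative to the rejected one, with the reference model acting as an anchor so the policy does not drift arbitrarily far.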