[Paper Review] Tiny Time Mixers (TTMs): Fast Pre-trained Models for Enhanced Zero/Few-Shot Forecasting of Multivariate Time Series

Vijay Ekambaram et al., IBM

🔒 Problem

  • Multivariate time series data are scarce compared to vision or text, which makes pre-training difficult
  • There have been attempts to overcome this with LLMs, but those approaches demand relatively large amounts of time and compute (TimeLLM, GPT4TS, etc.)
  • They also often ignore multi-channel correlation
  • Unlike text, TS data are small in volume yet highly diverse, so fine-tuning an LLM directly is prone to overfitting

🔑 Main Ideas

  • Proposes a very small cross-channel architecture
  • The first fast and tiny general pre-trained model (~1M parameters) trained exclusively on public TS datasets
  • To handle diverse datasets, applies adaptive patching, dataset augmentation, and resolution prefix tuning
  • Proposes a multi-level modeling strategy to capture channel correlations and exogenous variables
  • Trained on 244M public samples, built on the TSMixer architecture
  • Amidst the prevalence of pre-trained models demanding significant compute and training time, TTM aims to stay tiny while being fast to pre-train and fine-tune
  • Multi-level Modeling
    • (figure: multi-level TTM architecture)
    • The TTM backbone is adopted from TSMixer.
    • Since TSMixer does not account for multi-resolution data (1-minute, 1-hour, 10-minute, etc.), the following variations are added:
      • TTM decoder: same architecture as the backbone but much smaller (about 10–20% of the backbone size)
      • Forecast head: produces the forecast output
      • Exogenous mixer (optional)
      • The TTM decoder and forecast head together form the TTM head and are trained jointly (see the sketch below)
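The backbone / decoder / forecast-head split can be sketched roughly as follows. This is a minimal illustration under assumed shapes and module names (`TinyMixerStack` and `TTMHead` are hypothetical stand-ins), not the authors' released code.

```python
import torch
import torch.nn as nn

class TinyMixerStack(nn.Module):
    """Hypothetical stand-in for a stack of TSMixer-style blocks."""
    def __init__(self, d_model, num_blocks):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, d_model), nn.GELU())
            for _ in range(num_blocks)
        ])

    def forward(self, x):  # x: (batch, channels, num_patches, d_model)
        for blk in self.blocks:
            x = x + blk(x)  # residual mixing over the embedding dimension
        return x

class TTMHead(nn.Module):
    """Decoder (same architecture as the backbone, far fewer blocks) + forecast head."""
    def __init__(self, num_patches, d_model, forecast_len, decoder_blocks=2):
        super().__init__()
        self.decoder = TinyMixerStack(d_model, decoder_blocks)               # ~10-20% of the backbone
        self.forecast_head = nn.Linear(num_patches * d_model, forecast_len)  # flatten patches -> horizon

    def forward(self, z):  # z: (batch, channels, num_patches, d_model) from the backbone
        z = self.decoder(z)
        b, c, n, d = z.shape
        return self.forecast_head(z.reshape(b, c, n * d))                    # (batch, channels, forecast_len)
```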
  • Pre-processing
    • normalize X to have zero mean and unit standard deviation for each channel
    • denormalize before computing the loss
    • split the series into n non-overlapping patches of length pl per channel (see the sketch below)
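A minimal sketch of this pre-processing, assuming a (batch, seq_len, channels) input layout; the function names and shapes are illustrative, not taken from the paper's code.

```python
import torch

def preprocess(x, patch_len, eps=1e-5):
    """Per-channel standardization followed by non-overlapping patching.

    x: (batch, seq_len, channels) history window.
    Returns patches of shape (batch, channels, num_patches, patch_len) plus the
    per-channel (mean, std) needed to denormalize forecasts before the loss.
    """
    mean = x.mean(dim=1, keepdim=True)            # (batch, 1, channels)
    std = x.std(dim=1, keepdim=True) + eps
    x_norm = (x - mean) / std                     # zero mean, unit std per channel

    b, seq_len, c = x_norm.shape
    n = seq_len // patch_len                      # number of non-overlapping patches
    patches = (x_norm[:, : n * patch_len, :]
               .transpose(1, 2)                   # (batch, channels, seq_len)
               .reshape(b, c, n, patch_len))
    return patches, mean, std

def denormalize(forecast, mean, std):
    """Map model output back to the original scale before computing the loss."""
    return forecast * std + mean                  # forecast: (batch, forecast_len, channels)
```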
  • TTM Methodology
    • pre-training workflow
      • pre-trained in a univariate fashion with independent channels on all the existing datasets (MSE loss); see the sketch below
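One way to read "univariate with independent channels" is to fold every channel into the batch dimension during pre-training. The step below is a sketch under that assumption; `model` and the tensor layout are placeholders, not the released training loop.

```python
import torch
import torch.nn.functional as F

def pretrain_step(model, optimizer, x, y):
    """One channel-independent pre-training step (illustrative).

    x: (batch, seq_len, channels) history, y: (batch, forecast_len, channels) target.
    Each channel is treated as its own univariate series.
    """
    b, _, c = x.shape
    x_uni = x.permute(0, 2, 1).reshape(b * c, -1, 1)   # (batch*channels, seq_len, 1)
    y_uni = y.permute(0, 2, 1).reshape(b * c, -1, 1)

    y_hat = model(x_uni)                               # (batch*channels, forecast_len, 1)
    loss = F.mse_loss(y_hat, y_uni)                    # MSE pre-training objective

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```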
    • Multi-Resolution Pretraining via TTM Backbone
      • The goal of this paper is a model that is extremely small yet still generalizes well. To that end, the following enhancements are added to TSMixer:
      • Adaptive patching (AP)
        • The TTM backbone is crafted with an adaptive patching architecture, where different layers of the backbone operate at varying patch lengths and numbers of patches.
        • Moreover, it helps in scenarios where pre-training data is limited, as adaptive patching quickly generalizes the model across different granularities.
        • (figure: adaptive patching across backbone levels)
        • TTM backbone consists of L levels, each comprising M TTM blocks with identical patch configurations
        • Each TTM block further comprises a patch partition block, a vanilla TSMixer block, and a patch merging block.
        • The patch partition block at level i increases the number of patches by a factor of K_i and reduces the patch dimension by the same factor, reshaping c × n × hf → c × (n·K_i) × (hf/K_i)
        • The output from TSMixer is reshaped back to its original shape in the patch merging block
        • In subsequent layers, the number of patches is halved and the patch dimension doubled. This enables better generalization for small models as we pre-train across multiple datasets (see the sketch below).
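The reshape described above can be written out directly. Below is a minimal sketch where `k` plays the role of K_i and `mixer_block` stands in for the vanilla TSMixer block; the names and shapes are illustrative, not the paper's implementation.

```python
import torch

def patch_partition(z, k):
    """Split each patch embedding into k sub-patches:
    (batch, c, n, hf) -> (batch, c, n*k, hf//k)."""
    b, c, n, hf = z.shape
    assert hf % k == 0, "patch dimension must be divisible by k"
    return z.reshape(b, c, n * k, hf // k)

def patch_merge(z, k):
    """Inverse reshape back to the original patch configuration:
    (batch, c, n*k, hf//k) -> (batch, c, n, hf)."""
    b, c, nk, d = z.shape
    return z.reshape(b, c, nk // k, d * k)

def ttm_level(z, mixer_block, k):
    """One adaptive-patching level: partition, mix at the finer granularity, merge.
    Deeper levels use a smaller k, so the effective number of patches halves and
    the patch dimension doubles level by level."""
    z = patch_partition(z, k)
    z = mixer_block(z)        # any module operating on (batch, c, patches, dim)
    return patch_merge(z, k)
```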