[Paper Review] Tiny Time Mixers (TTMs): Fast Pre-trained Models for Enhanced Zero/Few-Shot Forecasting of Multivariate Time Series

Vijay Ekambaram et al., IBM

🔒 Problem

  • Multivariate time series data are scarce compared to vision or text, which makes pre-training difficult
  • There have been attempts to overcome this with LLMs, but those approaches demand relatively large amounts of time and compute (TimeLLM, GPT4TS, etc.)
  • They also often ignore multi-channel correlation
  • Unlike text, TS data are small in volume yet highly diverse, so fine-tuning an LLM directly is prone to overfitting

🔑 Main Ideas

  • Proposes a very small cross-channel architecture
  • The first fast and tiny general pre-trained model (~1M parameters) trained exclusively on public TS datasets
  • To handle diverse datasets, applies adaptive patching, dataset augmentation, and resolution prefix tuning
  • Proposes a multi-level modeling strategy to capture channel correlations and exogenous variables
  • Trained on 244M public samples, built on the TSMixer architecture
  • Amidst the prevalence of pre-trained models demanding significant compute and training time, TTM aims to stay tiny while being fast to pre-train and fine-tune
  • Multi-level Modeling
    • (figure: multi-level TTM architecture)
    • The TTM backbone is adopted from TSMixer.
    • Since TSMixer does not account for multi-resolution data (1-minute, 1-hour, 10-minute, etc.), the following variations are added:
      • TTM decoder: same architecture as the backbone but much smaller (about 10–20% of the backbone size)
      • Forecast head: produces the forecast output
      • Exogenous mixer (optional)
      • The TTM decoder and forecast head together form the TTM head and are trained jointly (see the sketch below)
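The backbone / decoder / forecast-head split can be sketched roughly as follows. This is a minimal illustration under assumed shapes and module names (`TinyMixerStack` and `TTMHead` are hypothetical stand-ins), not the authors' released code.

```python
import torch
import torch.nn as nn

class TinyMixerStack(nn.Module):
    """Hypothetical stand-in for a stack of TSMixer-style blocks."""
    def __init__(self, d_model, num_blocks):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, d_model), nn.GELU())
            for _ in range(num_blocks)
        ])

    def forward(self, x):  # x: (batch, channels, num_patches, d_model)
        for blk in self.blocks:
            x = x + blk(x)  # residual mixing over the embedding dimension
        return x

class TTMHead(nn.Module):
    """Decoder (same architecture as the backbone, far fewer blocks) + forecast head."""
    def __init__(self, num_patches, d_model, forecast_len, decoder_blocks=2):
        super().__init__()
        self.decoder = TinyMixerStack(d_model, decoder_blocks)               # ~10-20% of the backbone
        self.forecast_head = nn.Linear(num_patches * d_model, forecast_len)  # flatten patches -> horizon

    def forward(self, z):  # z: (batch, channels, num_patches, d_model) from the backbone
        z = self.decoder(z)
        b, c, n, d = z.shape
        return self.forecast_head(z.reshape(b, c, n * d))                    # (batch, channels, forecast_len)
```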
  • Pre-processing
    • normalize X to have zero mean and unit standard deviation for each channel
    • denormalize before computing the loss
    • split the series into n non-overlapping patches of length pl per channel (see the sketch below)
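A minimal sketch of this pre-processing, assuming a (batch, seq_len, channels) input layout; the function names and shapes are illustrative, not taken from the paper's code.

```python
import torch

def preprocess(x, patch_len, eps=1e-5):
    """Per-channel standardization followed by non-overlapping patching.

    x: (batch, seq_len, channels) history window.
    Returns patches of shape (batch, channels, num_patches, patch_len) plus the
    per-channel (mean, std) needed to denormalize forecasts before the loss.
    """
    mean = x.mean(dim=1, keepdim=True)            # (batch, 1, channels)
    std = x.std(dim=1, keepdim=True) + eps
    x_norm = (x - mean) / std                     # zero mean, unit std per channel

    b, seq_len, c = x_norm.shape
    n = seq_len // patch_len                      # number of non-overlapping patches
    patches = (x_norm[:, : n * patch_len, :]
               .transpose(1, 2)                   # (batch, channels, seq_len)
               .reshape(b, c, n, patch_len))
    return patches, mean, std

def denormalize(forecast, mean, std):
    """Map model output back to the original scale before computing the loss."""
    return forecast * std + mean                  # forecast: (batch, forecast_len, channels)
```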
  • TTM Methodology
    • pre-training workflow
      • pre-trained in a univariate fashion with independent channels on all the existing datasets (MSE loss); see the sketch below
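One way to read "univariate with independent channels" is to fold every channel into the batch dimension during pre-training. The step below is a sketch under that assumption; `model` and the tensor layout are placeholders, not the released training loop.

```python
import torch
import torch.nn.functional as F

def pretrain_step(model, optimizer, x, y):
    """One channel-independent pre-training step (illustrative).

    x: (batch, seq_len, channels) history, y: (batch, forecast_len, channels) target.
    Each channel is treated as its own univariate series.
    """
    b, _, c = x.shape
    x_uni = x.permute(0, 2, 1).reshape(b * c, -1, 1)   # (batch*channels, seq_len, 1)
    y_uni = y.permute(0, 2, 1).reshape(b * c, -1, 1)

    y_hat = model(x_uni)                               # (batch*channels, forecast_len, 1)
    loss = F.mse_loss(y_hat, y_uni)                    # MSE pre-training objective

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```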
    • Multi-Resolution Pretraining via TTM Backbone
      • The goal of this paper is a model that is extremely small yet still generalizes well. To that end, the following enhancements are added to TSMixer:
      • Adaptive patching (AP)
        • The TTM backbone is crafted with an adaptive patching architecture, where different layers of the backbone operate at varying patch lengths and numbers of patches.
        • Moreover, it helps in scenarios where pre-training data is limited, as adaptive patching quickly generalizes the model across different granularities.
        • (figure: adaptive patching across backbone levels)
        • TTM backbone consists of L levels, each comprising M TTM blocks with identical patch configurations
        • Each TTM block further comprises a patch partition block, a vanilla TSMixer block, and a patch merging block.
        • The patch partition block at level i increases the number of patches by a factor of K_i and reduces the patch dimension by the same factor, reshaping c × n × hf → c × (n·K_i) × (hf/K_i)
        • The output from TSMixer is reshaped back to its original shape in the patch merging block
        • In subsequent layers, the number of patches is halved and the patch dimension doubled. This enables better generalization for small models as we pre-train across multiple datasets (see the sketch below).
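The reshape described above can be written out directly. Below is a minimal sketch where `k` plays the role of K_i and `mixer_block` stands in for the vanilla TSMixer block; the names and shapes are illustrative, not the paper's implementation.

```python
import torch

def patch_partition(z, k):
    """Split each patch embedding into k sub-patches:
    (batch, c, n, hf) -> (batch, c, n*k, hf//k)."""
    b, c, n, hf = z.shape
    assert hf % k == 0, "patch dimension must be divisible by k"
    return z.reshape(b, c, n * k, hf // k)

def patch_merge(z, k):
    """Inverse reshape back to the original patch configuration:
    (batch, c, n*k, hf//k) -> (batch, c, n, hf)."""
    b, c, nk, d = z.shape
    return z.reshape(b, c, nk // k, d * k)

def ttm_level(z, mixer_block, k):
    """One adaptive-patching level: partition, mix at the finer granularity, merge.
    Deeper levels use a smaller k, so the effective number of patches halves and
    the patch dimension doubles level by level."""
    z = patch_partition(z, k)
    z = mixer_block(z)        # any module operating on (batch, c, patches, dim)
    return patch_merge(z, k)
```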