🔒 Problem
- Multivariate time series data is scarce compared to vision or text, which makes pre-training difficult
- There are attempts to overcome this with LLMs, but such methods are relatively slow and resource-hungry (e.g., Time-LLM, GPT4TS)
- They also often ignore multi-channel correlation
- Unlike text, TS data is small in volume yet highly diverse, so fine-tuning an LLM directly overfits easily
🔑 Main Ideas
- Proposes a very small cross-channel architecture
- First fast and tiny general pre-trained model (<1M parameters) trained exclusively on public TS datasets
- To handle diverse datasets, uses adaptive patching, dataset augmentation, and resolution prefix tuning (resolution prefix tuning is sketched after this list)
- To capture channel correlation and exogenous variables, proposes a multi-level modeling strategy
- Trained on ~244M public samples, built on the TSMixer architecture
- Amidst the prevalence of pre-trained models demanding significant compute and training time, TTM aims to stay fast to pre-train and fine-tune
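As an illustration of resolution prefix tuning, here is a minimal sketch: the sampling resolution is embedded and prepended to the patch sequence as an extra prefix patch. The class name, resolution vocabulary, and shapes are assumptions made for illustration, not the paper's actual code.

```python
import torch
import torch.nn as nn

# Illustrative resolution prefix tuning: embed the sampling resolution as one
# extra "prefix patch" prepended to the patch sequence. Names are hypothetical.
RESOLUTIONS = {"10min": 0, "hourly": 1, "daily": 2, "weekly": 3}

class ResolutionPrefix(nn.Module):
    def __init__(self, hf, n_resolutions=len(RESOLUTIONS)):
        super().__init__()
        self.embed = nn.Embedding(n_resolutions, hf)

    def forward(self, patches, resolution):        # patches: (batch, channels, n, hf)
        b, c, n, hf = patches.shape
        prefix = self.embed(torch.tensor(RESOLUTIONS[resolution]))
        prefix = prefix.view(1, 1, 1, hf).expand(b, c, 1, hf)
        return torch.cat([prefix, patches], dim=2)  # (batch, channels, n+1, hf)

out = ResolutionPrefix(hf=64)(torch.randn(2, 7, 8, 64), "hourly")
print(out.shape)  # torch.Size([2, 7, 9, 64])
```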
- Multi-level Modeling
- The TTM backbone is adopted from TSMixer.
- Since TSMixer does not account for multi-resolution data (1-minute, 1-hour, 10-minute sampling, etc.), the following variations are added
- TTM decoder: same architecture as the backbone but much smaller (10–20% of its size)
- Forecast head: produces the forecast output
- Exogenous mixer (optional)
- The TTM decoder and forecast head form the TTM head and are trained together (a minimal composition sketch follows below)
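A minimal composition sketch of backbone → small decoder → forecast head, assuming hypothetical module names and placeholder internals (the real TTM blocks are TSMixer-based, not the plain MLPs used here).

```python
import torch
import torch.nn as nn

class TTMForecaster(nn.Module):
    """Illustrative composition only: backbone -> small decoder -> forecast head."""
    def __init__(self, n_patches, patch_dim, hf=64, decoder_hf=8, forecast_len=96):
        super().__init__()
        # Backbone stand-in: embeds patches and mixes features.
        self.backbone = nn.Sequential(nn.Linear(patch_dim, hf), nn.GELU(), nn.Linear(hf, hf))
        # Decoder: same structure but ~10-20% of the backbone width.
        self.decoder = nn.Sequential(nn.Linear(hf, decoder_hf), nn.GELU(), nn.Linear(decoder_hf, hf))
        # Forecast head: flattens patch embeddings into the forecast horizon.
        self.head = nn.Linear(n_patches * hf, forecast_len)

    def forward(self, x):                     # x: (batch, channels, n_patches, patch_dim)
        z = self.backbone(x)
        z = self.decoder(z)                   # decoder + head together form the "TTM head"
        z = z.flatten(start_dim=-2)           # (batch, channels, n_patches * hf)
        return self.head(z)                   # (batch, channels, forecast_len)

y = TTMForecaster(n_patches=8, patch_dim=64)(torch.randn(2, 3, 8, 64))
print(y.shape)  # torch.Size([2, 3, 96])
```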
- Pre-processing
- normalize X to have zero mean and unit standard deviation for each channel
- de-normalize the forecasts before computing the loss
- split the input into n non-overlapping patches of length pl (see the sketch below)
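A small sketch of the pre-processing described above, assuming an input of shape (batch, seq_len, channels); the helper names are made up for illustration.

```python
import torch

def preprocess(x, pl):
    """x: (batch, seq_len, channels). Returns non-overlapping patches plus
    per-channel stats so forecasts can be de-normalized before the loss."""
    mean = x.mean(dim=1, keepdim=True)                      # per-channel mean over time
    std = x.std(dim=1, keepdim=True) + 1e-5                 # per-channel std (eps for stability)
    x_norm = (x - mean) / std
    b, sl, c = x_norm.shape
    n = sl // pl                                            # number of non-overlapping patches
    patches = x_norm[:, :n * pl].reshape(b, n, pl, c)       # (batch, n, pl, channels)
    patches = patches.permute(0, 3, 1, 2)                   # (batch, channels, n, pl)
    return patches, mean, std

def denormalize(forecast, mean, std):
    """forecast: (batch, horizon, channels) in normalized space."""
    return forecast * std + mean

patches, mu, sigma = preprocess(torch.randn(4, 512, 7), pl=64)
print(patches.shape)  # torch.Size([4, 7, 8, 64])
```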
- TTM Methodology
- pre-training workflow
- pre-trained in a univariate fashion with independent channels on all the existing datasets (MSE loss); a rough training-step sketch is shown below
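A rough sketch of a channel-independent (univariate) pre-training step with the MSE objective; `model` here is a toy stand-in, not TTM.

```python
import torch
import torch.nn as nn

# Hypothetical pre-training step: each channel is treated as an independent
# univariate series, so the channel dim is folded into the batch dim.
def pretrain_step(model, optimizer, context, target):
    b, c, sl = context.shape                          # (batch, channels, seq_len)
    context = context.reshape(b * c, sl)              # channel-independent view
    target = target.reshape(b * c, -1)
    forecast = model(context)                         # (b*c, horizon)
    loss = nn.functional.mse_loss(forecast, target)   # MSE objective, as in the paper
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy stand-in model just to exercise the step (not the TTM architecture).
model = nn.Linear(512, 96)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
print(pretrain_step(model, opt, torch.randn(4, 7, 512), torch.randn(4, 7, 96)))
```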
- Multi-Resolution Pretraining via TTM Backbone
- The goal is a model that is extremely small yet still generalizes well; to achieve this, the following enhancements are added to TSMixer
- Adaptive patching (AP)
- The TTM backbone is crafted with an adaptive patching architecture where different layers of the backbone operate at varying patch lengths and numbers of patches.
- Moreover, it helps when pre-training data is limited, as adaptive patching quickly generalizes the model across different granularities.
- TTM backbone consists of L levels, each comprising M TTM blocks with identical patch configurations
- Each TTM block consists of a patch partition block, a vanilla TSMixer block, and a patch merging block.
- The patch partition block at level i increases the number of patches by a factor of K_i and reduces the patch dimension by the same factor, reshaping c × n × hf → c × (n·K_i) × (hf/K_i)
- The output of the TSMixer block is reshaped back to its original shape in the patch merging block
- In subsequent layers, the number of patches is halved and the patch dimension doubled. This enables better generalization for small models when pre-training across multiple datasets (see the sketch below).
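A shape-level sketch of adaptive patching, assuming a toy mixer in place of the actual TSMixer block; it only demonstrates the partition/merge reshapes described above.

```python
import torch
import torch.nn as nn

class AdaptivePatchBlock(nn.Module):
    """Sketch of one TTM block at level i: partition -> mixer -> merge.
    The mixer here is a stand-in MLP, not the full TSMixer block."""
    def __init__(self, hf, k):
        super().__init__()
        assert hf % k == 0
        self.k = k
        self.mixer = nn.Sequential(nn.Linear(hf // k, hf // k), nn.GELU())

    def forward(self, z):                     # z: (c, n, hf)
        c, n, hf = z.shape
        # Patch partition: more (shorter) patches -> (c, n*k, hf/k)
        z = z.reshape(c, n, self.k, hf // self.k).reshape(c, n * self.k, hf // self.k)
        z = self.mixer(z)
        # Patch merging: restore the original (c, n, hf) shape
        z = z.reshape(c, n, self.k, hf // self.k).reshape(c, n, hf)
        return z

z = torch.randn(7, 8, 64)                     # c=7 channels, n=8 patches, hf=64
# Deeper levels use a smaller K_i, so the effective number of patches halves
# while the patch dimension doubles (e.g., K = 8 -> 4 -> 2 -> 1).
for k in (8, 4, 2, 1):
    z = AdaptivePatchBlock(hf=64, k=k)(z)
print(z.shape)  # torch.Size([7, 8, 64])
```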