This blog introduces the paper "Triformer: Triangular, Variable-Specific Attentions for Long Sequence Multivariate Time Series Forecasting".
Introduction
Recent studies show that attention is able to capture long-term dependencies better. However, for a time series of $T$ timestamps, canonical self-attention has a quadratic complexity of $O(T^2)$. The authors propose Triformer, which achieves linear complexity by defining a novel Patch Attention: the input series is split into patches, a learnable pseudo timestamp is introduced for each patch, and attention from the timestamps in a patch is computed only to that single pseudo timestamp, making patch attention linear. Then, only the pseudo timestamps are fed into the next layer, so the layer sizes shrink exponentially.
Existing forecasting models often use variable-agnostic parameters, though different variables may exhibit distinct temporal patterns.
Variable-agnostic modeling forces the learned parameters to capture an "average" pattern among all the variables, thus failing to capture the distinct temporal patterns of individual variables and hurting accuracy.
To address this, the authors propose a distinct set of projection matrices for the $i$-th variable, so that the distinct temporal patterns of different variables can be captured. The projection matrices are factorized into variable-agnostic and variable-specific matrices, and the variable-specific matrices are kept very compact, thus avoiding a blow-up in the parameter space and the computational overhead.
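To get a feel for why this factorization keeps the model light-weight, here is a back-of-the-envelope parameter count in Python. The dimensions below are made up for illustration only and are not the paper's settings:

```python
# Hypothetical dimensions, for illustration only (not taken from the paper).
N = 300   # number of variables
d = 64    # size of a full projection matrix W in R^{d x d}
r = 5     # size of the compact variable-specific middle matrix B^(i) in R^{r x r}

# Naive variable-specific modeling: one full d x d matrix per variable.
naive = N * d * d

# Factorized modeling: shared left/right matrices (d x r and r x d),
# plus one tiny r x r middle matrix per variable.
factorized = d * r + r * d + N * r * r

print(naive, factorized)  # 1228800 vs. 8140 parameters
```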
The main contributions of the paper are:
- Propose a novel, efficient attention mechanism, namely Patch Attention, along with its triangular, multi-layer structure to ensure high efficiency.
- Propose a light-weight approach to enable variable-specific model parameters, making it possible to capture distinct temporal patterns from different variables, thus enhancing accuracy.
- Conduct extensive experiments on four public, commonly used multivariate time series datasets from different domains, justifying the design choices and demonstrating that the proposal outperforms state-of-the-art methods.
Related Work
The authors provide an analysis showing that a variable-specific model performs better on long-term time series forecasting tasks.
Preliminaries
A multivariate time series $\mathcal{X} = \langle \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T \rangle$ records the values of $N$ variables over $T$ timestamps, where $\mathbf{x}_t \in \mathbb{R}^N$ denotes the values of all $N$ variables at timestamp $t$.
Triformer
The Triformer is proposed for learning long-term and multi-scale dependencies in multivariate time series.
- Design the Patch Attention with linear complexity as the basic block
- Propose a triangular structure when stacking multiple layers of patch attentions, such that the layer sizes shrink exponentially.
- Propose a light-weight method to enable variable-specific modeling, thus being able to capture distinct temporal patterns from different variables, without compromising efficiency.
Linear Patch Attention
First, the authors break the input time series of length $T$ into $P$ patches, each containing $S$ consecutive timestamps, so that $T = S \cdot P$. Applying full self-attention inside each patch would still cost $O(S^2)$ per patch. To reduce the complexity to linear, a learnable pseudo timestamp is introduced for each patch.
In Triformer, the authors choose to use the attention mechanism to update the pseudo timestamp, where the pseudo timestamp acts as the Query in self-attention. It queries all the "real" timestamps in the patch, thus computing only a single attention score for each real timestamp, giving rise to linear complexity.
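To make the mechanism concrete, here is a minimal PyTorch sketch of one Patch Attention layer. The module name, variable names, and shapes are my own simplification; the paper's implementation additionally includes variable-specific projections and a recurrent connection (discussed below), which are omitted here:

```python
import torch
import torch.nn as nn


class PatchAttention(nn.Module):
    """Each patch's learnable pseudo timestamp attends to the S real timestamps in that patch."""

    def __init__(self, num_patches: int, patch_size: int, d_model: int):
        super().__init__()
        self.patch_size = patch_size
        # One learnable pseudo timestamp per patch (the Queries); shape (P, d_model).
        self.pseudo = nn.Parameter(torch.randn(num_patches, d_model))
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, d_model), with T divisible by the patch size S.
        B, T, D = x.shape
        S = self.patch_size
        P = T // S
        x = x.reshape(B, P, S, D)                       # split into P patches of size S
        q = self.pseudo.view(1, P, 1, D)                # pseudo timestamps as the only queries
        k, v = self.key(x), self.value(x)               # (B, P, S, D)
        # One attention score per real timestamp -> O(S) per patch, O(T) overall.
        scores = (k * q).sum(-1) / D ** 0.5             # (B, P, S)
        weights = torch.softmax(scores, dim=-1)         # attention over the S timestamps in a patch
        out = (weights.unsqueeze(-1) * v).sum(dim=2)    # (B, P, D): one updated pseudo timestamp per patch
        return out
```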
Note that each pseudo timestamp only attends to the timestamps inside its own patch. The reduced complexity of patch attention therefore comes at the cost of a reduced temporal receptive field. This makes it harder to capture relationships among different patches and hence long-term dependencies, thus adversely affecting accuracy. To compensate for the reduced temporal receptive field, a recurrent connection is introduced to connect the outputs (pseudo timestamps) of consecutive patches, so that the temporal information flow across patches is maintained.
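The exact gating of the paper's recurrent connection is not reproduced here; the sketch below only illustrates the idea of threading a recurrent pass through the per-patch outputs, using a standard GRU cell as a stand-in (my assumption, not the paper's exact formulation):

```python
import torch
import torch.nn as nn


class PatchRecurrence(nn.Module):
    """Connect a layer's P pseudo-timestamp outputs with a recurrent pass,
    restoring information flow across patches."""

    def __init__(self, d_model: int):
        super().__init__()
        self.cell = nn.GRUCell(d_model, d_model)  # stand-in for the paper's gated connection

    def forward(self, pseudo: torch.Tensor) -> torch.Tensor:
        # pseudo: (batch, P, d_model) -- one output per patch, in temporal order.
        B, P, D = pseudo.shape
        h = pseudo.new_zeros(B, D)
        outs = []
        for p in range(P):                  # left-to-right over patches
            h = self.cell(pseudo[:, p], h)  # fold patch p's output into the running state
            outs.append(h)
        return torch.stack(outs, dim=1)     # (batch, P, d_model)
```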
Triangular Stacking
In attention-based models, each attention layer typically has the same input size, e.g., the input time series length $T$. When using Patch Attention, a layer with patch size $S$ turns its input into only $P = T / S$ pseudo timestamps, which become the input of the next layer. In a multi-layer Triformer, each layer therefore consists of a different number of patches and thus has a different number of outputs, i.e., pseudo timestamps, with the layer sizes shrinking exponentially and forming a triangular structure.
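As a small numerical illustration of the triangular structure (the input length and patch sizes below are arbitrary, chosen only so that each layer's length divides evenly):

```python
# Layer sizes for an input of length T = 96 with per-layer patch sizes (4, 4, 3).
T = 96
patch_sizes = [4, 4, 3]

length = T
for layer, S in enumerate(patch_sizes, start=1):
    length //= S  # each patch is summarized by a single pseudo timestamp
    print(f"layer {layer}: patch size {S} -> {length} pseudo timestamps")
# layer 1: patch size 4 -> 24 pseudo timestamps
# layer 2: patch size 4 -> 6 pseudo timestamps
# layer 3: patch size 3 -> 2 pseudo timestamps
```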
Instead of only using the last pseudo timestamp per layer, all pseudo timestamps of a layer are aggregated into an aggregated output for that layer.
Finally, the aggregated outputs from all layers are connected to the predictor (a fully connected neural network). They represent features from different temporal scales, contributing different temporal views, and provide multiple short paths for gradient feedback, thus easing the learning process.
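Below is a minimal sketch of how the per-layer aggregates could feed the fully connected predictor; the mean-then-linear aggregation used here is a placeholder, not necessarily the paper's exact aggregation function:

```python
import torch
import torch.nn as nn


class MultiScalePredictor(nn.Module):
    """Aggregate each layer's pseudo timestamps and predict F future steps from the concatenation."""

    def __init__(self, num_layers: int, d_model: int, horizon: int):
        super().__init__()
        self.agg = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_layers))
        self.predictor = nn.Linear(num_layers * d_model, horizon)

    def forward(self, layer_outputs):
        # layer_outputs: list of tensors, one per layer, each (batch, P_l, d_model) with P_l shrinking.
        aggregates = [proj(out.mean(dim=1))                    # (batch, d_model) per layer
                      for proj, out in zip(self.agg, layer_outputs)]
        return self.predictor(torch.cat(aggregates, dim=-1))  # (batch, horizon)
```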
Variable-Specific Modeling
Each projection matrix used in patch attention is factorized: for the $i$-th variable, the projection matrix is written as the product of a left matrix, a compact variable-specific middle matrix $\mathbf{B}^{(i)}$, and a right matrix. The left and right matrices are variable-agnostic and thus shared by all variables, while different variables have their own middle matrices $\mathbf{B}^{(i)}$, which are kept very small. Instead of learning each $\mathbf{B}^{(i)}$ directly, a light-weight generator produces it from a learned embedding (memory) of the $i$-th variable, keeping the extra parameter and computation overhead low.
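A rough PyTorch sketch of the idea, with made-up dimensions and a simple linear generator (the paper's generator and exact factorization details may differ):

```python
import torch
import torch.nn as nn


class VariableSpecificProjection(nn.Module):
    """Factorized projection W^(i) = L @ B^(i) @ R, with B^(i) generated per variable."""

    def __init__(self, num_vars: int, d_model: int, rank: int, mem_dim: int = 16):
        super().__init__()
        self.left = nn.Parameter(torch.randn(d_model, rank))        # variable-agnostic, shared
        self.right = nn.Parameter(torch.randn(rank, d_model))       # variable-agnostic, shared
        self.memory = nn.Parameter(torch.randn(num_vars, mem_dim))  # per-variable embedding
        self.generator = nn.Linear(mem_dim, rank * rank)            # produces the compact middle matrix
        self.rank = rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_vars, d_model) -- one feature vector per variable.
        B_i = self.generator(self.memory).view(-1, self.rank, self.rank)  # (num_vars, r, r)
        # Assemble W^(i) = left @ B^(i) @ right per variable: (num_vars, d_model, d_model).
        W = torch.einsum('dr,nrs,se->nde', self.left, B_i, self.right)
        return torch.einsum('bnd,nde->bne', x, W)
```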
Experiments
The paper conducts extensive experiments on four public, commonly used multivariate time series datasets from different domains; the results justify the design choices and show that Triformer outperforms state-of-the-art methods.
If you like this blog or find it useful, you are welcome to comment on it. You are also welcome to share this blog so that more people can benefit from it. If any images used in this blog infringe your copyright, please contact the author to have them deleted. Thank you!