Triformer: Triangular, Variable-Specific Attentions for Long Sequence Multivariate Time Series Forecasting

An introduction to Triformer

Posted by Ccloud on 2023-02-08

This blog introduces the paper Triformer (IJCAI 2022).

Introduction

  • Recent studies show that attention is able to capture long-term dependencies well. However, for a time series of $H$ timestamps, canonical self-attention has a quadratic complexity of $O(H^2)$.

  • The authors propose Triformer, which achieves overall linear complexity by defining a novel Patch Attention that is linear within each patch.

  • A pseudo timestamp is introduced for each patch, and the timestamps in a patch attend only to this single pseudo timestamp, making patch attention linear. Then only the pseudo timestamps are fed into the next layer, so the layer sizes shrink exponentially.

  • Existing forecasting models often use variable-agnostic parameters, though different variables may exhibit distinct temporal patterns.
    (Figure: variable)

    Variable-agnostic modeling forces the learned parameters to capture an “average” pattern among all the variables, thus failing to capture the distinct temporal patterns of individual variables and hurting accuracy.

  • A distinct set of matrices is proposed for the $i$-th variable to capture distinct temporal patterns for different variables.

  • Factorize the projection matrices into variable-agnostic and variable-specific matrices, and make the variable-specific matrices very compact, thus avoiding a blow-up in the parameter space and computation overhead.

  • The main contributions of the paper:

    1. Propose a novel, efficient attention mechanism, namely Patch Attention, along with its triangular, multi-layer structure to ensure high efficiency.
    2. Propose a light-weight approach to enable variable-specific model parameters, making it possible to capture distinct temporal patterns from different variables and thus enhancing accuracy.
    3. Conduct extensive experiments on four public, commonly used multivariate time series datasets from different domains, justifying the design choices and demonstrating that the proposal outperforms state-of-the-art methods.

(Figure: category)

The authors provide an analysis to show that variable-specific models perform better on long-term time series forecasting tasks.

Preliminaries

A multivariate time series $\mathcal{X} = \langle x_1, x_2, \ldots, x_H \rangle$ records the values of $N$ variables over $H$ timestamps.

$x_t \in \mathbb{R}^N$ denotes the values of all variables at timestamp $t$, and $x_t^{(i)}$ is the value of the $i$-th variable at timestamp $t$. The problem can be formulated as

$$\hat{x}_{H+1}, \ldots, \hat{x}_{H+F} = \mathcal{F}_\theta(x_1, \ldots, x_H),$$

where $\theta$ denotes the learnable parameters of the forecasting model and $\hat{x}_t$ is the predicted values at timestamp $t$.
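
To make the notation concrete, here is a minimal, shape-only sketch in PyTorch; the variable names and the naive placeholder forecaster are illustrative assumptions, not part of the paper.

```python
import torch

N, H, F = 8, 96, 24        # number of variables, input length, forecast horizon (example values)
x = torch.randn(H, N)      # x[t] holds the values of all N variables at timestamp t

def forecast_stub(x_in: torch.Tensor) -> torch.Tensor:
    """Placeholder for the forecasting model F_theta: maps H observed steps to F predicted steps."""
    # Naive per-variable mean baseline, used here only to fix the input/output shapes.
    return x_in.mean(dim=0, keepdim=True).repeat(F, 1)

x_hat = forecast_stub(x)   # shape (F, N): predictions for timestamps H+1, ..., H+F
assert x_hat.shape == (F, N)
```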

Triformer

The Triformer is proposed for learning long-term and multi-scale dependencies in multivariate time series.

(Figure: triangle)

  1. Design the Patch Attention with linear complexity as the basic building block.
  2. Propose a triangular structure when stacking multiple layers of patch attention, such that the layer sizes shrink exponentially.
  3. Propose a light-weight method to enable variable-specific modeling, thus being able to capture distinct temporal patterns from different variables without compromising efficiency.

Linear Patch Attention

First, the authors break the input time series of length $H$ into $P = H/S$ patches along the temporal dimension, where $S$ is the patch size. Attention scores are then computed within each patch in a different way.

To reduce the complexity to linear, a learnable pseudo timestamp $T_p$ is introduced for the $p$-th patch.

(Figure: patch_attention)

In Triformer, the authors choose to use the attention mechanism to update the pseudo timestamp, where the pseudo timestamp serves as the query in self-attention. It queries all the “real” timestamps in the patch, so only a single attention score is computed for each real timestamp, giving rise to linear complexity.

Note that each variable $i$ has its own pseudo timestamp $T_p^{(i)}$, thus being variable-specific.

The reduced complexity of $O(H)$ comes at a price: the temporal receptive field of each timestamp is reduced to the patch size $S$. In contrast, it is $H$, covering all timestamps, in canonical attention.

This makes it harder to capture relationships among different patches as well as long-term dependencies, thus adversely affecting accuracy. To compensate for the reduced temporal receptive field, the authors introduce a recurrent connection that links the outputs of the patches, such that the temporal information flow is maintained.

The recurrent connection uses learned gate parameters: an element-wise product $\odot$ combines a $\tanh$-activated candidate with a sigmoid gate $\sigma$ that controls the ratio of information passed to the next pseudo timestamp.
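
To make this concrete, below is a minimal single-head sketch of a patch attention layer with the recurrent connection. The learnable pseudo timestamp acts as the only query per patch, so each real timestamp contributes exactly one attention score, and a sigmoid/tanh gate carries information from one patch to the next. Module names, shapes, and the exact gate form are my assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class PatchAttention(nn.Module):
    """Simplified patch attention with a recurrent connection (sketch, assumed form).

    Input:  (B, H, d)  -- H timestamps, split into P = H // S patches of size S.
    Output: (B, P, d)  -- one updated pseudo timestamp per patch.
    """
    def __init__(self, d: int, patch_size: int):
        super().__init__()
        self.S = patch_size
        self.pseudo = nn.Parameter(torch.randn(d))   # learnable pseudo timestamp (used as the query)
        self.key = nn.Linear(d, d)
        self.value = nn.Linear(d, d)
        self.gate = nn.Linear(d, d)                  # recurrent gate (assumed parameterization)
        self.cand = nn.Linear(d, d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, H, d = x.shape
        P = H // self.S
        x = x[:, :P * self.S].reshape(B, P, self.S, d)      # (B, P, S, d)

        # The pseudo timestamp queries the S real timestamps of its patch:
        # one score per real timestamp, hence linear cost in H overall.
        k, v = self.key(x), self.value(x)                   # (B, P, S, d)
        scores = (k @ self.pseudo) / d ** 0.5               # (B, P, S)
        attn = scores.softmax(dim=-1).unsqueeze(-1)         # (B, P, S, 1)
        t = (attn * v).sum(dim=2)                           # (B, P, d): updated pseudo timestamps

        # Recurrent connection across patches keeps temporal information flowing
        # despite the receptive field of each timestamp being limited to its patch.
        outs, prev = [], torch.zeros(B, d, device=x.device)
        for p in range(P):
            g = torch.sigmoid(self.gate(t[:, p]))           # ratio of information passed on
            prev = g * torch.tanh(self.cand(t[:, p])) + (1 - g) * prev
            outs.append(prev)
        return torch.stack(outs, dim=1)                     # (B, P, d)
```

With, e.g., $H = 96$ and $S = 4$, one such layer returns 24 pseudo timestamps, which become the shorter input to the next layer.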

Triangular Stacking

In attention-based models, each attention layer has the same input size, e.g., the input time series length $H$. In traditional self-attention, the input and output have the same shape.

When using Patch Attentions (PAs), only the pseudo timestamps from the patches are fed to the next layer, which shrinks the layer sizes exponentially. The size of the $(l+1)$-th layer is only $1/S_l$ of the size of the $l$-th layer, where $S_l$ is the patch size of the $l$-th layer.

In a multi-layer Triformer, each layer consists of a different number of patches and thus has a different number of outputs, i.e., pseudo timestamps.

Instead of only using the last pseudo timestamp per layer, all pseudo timestamps of a layer are aggregated into a single output. The aggregated output at the $l$-th layer is defined as

$$O^{(l)} = \mathcal{N}\big(T_1^{(l)}, \ldots, T_{P_l}^{(l)}\big),$$

where $\mathcal{N}$ is a neural network and $T_p^{(l)}$ denotes the pseudo timestamp for patch $p$ at the $l$-th layer.

Finally, the aggregated outputs from all layers are connected to the predictor (a fully connected neural network). They represent features from different temporal scales, contributing different temporal views, and they provide multiple short paths for gradient feedback, thus easing the learning process.
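
As a rough sketch of the triangular stacking, assuming the PatchAttention module from the previous sketch: each layer consumes the pseudo timestamps of the layer below, so its length shrinks by that layer's patch size, and every layer also contributes an aggregated output that is concatenated and passed to the final predictor. The mean-then-linear aggregation and the linear predictor are simplifications of my own, standing in for the paper's aggregation network and fully connected predictor.

```python
import torch
import torch.nn as nn

# Assumes the PatchAttention sketch defined above.

class TriangularStack(nn.Module):
    """Sketch of the triangular stack: layer l+1 has 1/S_l the length of layer l."""
    def __init__(self, d: int, horizon: int, n_vars: int, patch_sizes=(4, 4, 2)):
        super().__init__()
        self.layers = nn.ModuleList([PatchAttention(d, s) for s in patch_sizes])
        self.aggs = nn.ModuleList([nn.Linear(d, d) for _ in patch_sizes])   # per-layer aggregation (simplified)
        self.predictor = nn.Linear(len(patch_sizes) * d, horizon * n_vars)
        self.horizon, self.n_vars = horizon, n_vars

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, d); with H = 96 and patch sizes (4, 4, 2), the lengths shrink 96 -> 24 -> 6 -> 3.
        agg_outputs = []
        for pa, agg in zip(self.layers, self.aggs):
            x = pa(x)                               # only pseudo timestamps are passed upward
            agg_outputs.append(agg(x.mean(dim=1)))  # aggregate all pseudo timestamps of this layer
        out = self.predictor(torch.cat(agg_outputs, dim=-1))
        return out.view(-1, self.horizon, self.n_vars)   # (B, F, N)
```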

Variable-Specific Modeling

(Figure: Variable_Specific)

Introduce a $d$-dimensional memory vector $m^{(i)}$ for each variable $i$. The memory is randomly initialized and learnable. This makes the method purely data-driven and able to learn the most prominent characteristics of each variable.

The projection matrix of the $i$-th variable is factorized into three matrices. The left and right matrices are variable-agnostic, thus being shared across all variables. Different variables have their own middle matrices, thus making the middle matrices variable-specific. The middle matrices are kept compact to make the method light-weight.

A generator (a one-layer neural network) is used to generate the variable-specific middle matrix from the memory $m^{(i)}$. Learning a full projection matrix per variable directly requires $O(N \cdot d^2)$ parameters. In contrast, when using a generator, it requires $O(N \cdot d)$ for the memories plus a small shared overhead for the generator, reducing the parameter count from $O(N \cdot d^2)$ to $O(N \cdot d)$. The query matrix is not needed, since Patch Attention does not use it and employs the pseudo timestamp instead.
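
A minimal sketch of this light-weight variable-specific projection, under assumed notation: each variable $i$ keeps a compact memory $m^{(i)}$, a shared one-layer generator turns it into a small middle matrix, and shared left/right matrices expand it into the full per-variable projection. The symbol names (L, Theta, R), the inner dimension $a$, and the initialization are illustrative choices, not the paper's exact notation.

```python
import torch
import torch.nn as nn

class VariableSpecificProjection(nn.Module):
    """Factorized per-variable projection W_i = L @ Theta_i @ R (sketch, assumed notation).

    Shared:       L (d x a), R (a x d), a one-layer generator (a -> a*a).
    Per-variable: memory m_i (a,), from which Theta_i (a x a) is generated on the fly.
    """
    def __init__(self, n_vars: int, d: int, a: int):
        super().__init__()
        self.L = nn.Parameter(torch.randn(d, a) * 0.02)      # variable-agnostic left matrix
        self.R = nn.Parameter(torch.randn(a, d) * 0.02)      # variable-agnostic right matrix
        self.memory = nn.Parameter(torch.randn(n_vars, a))   # one compact, learnable memory per variable
        self.generator = nn.Linear(a, a * a)                 # shared 1-layer generator: m_i -> Theta_i
        self.a = a

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, d) -- one d-dimensional feature vector per variable.
        B, N, d = x.shape
        theta = self.generator(self.memory).view(N, self.a, self.a)  # (N, a, a), variable-specific
        W = torch.einsum('da,nab,be->nde', self.L, theta, self.R)    # (N, d, d) per-variable projections
        return torch.einsum('bnd,nde->bne', x, W)                    # apply W_i to the i-th variable

# Parameter count (illustrative): full per-variable matrices need N*d*d parameters,
# whereas memories plus the shared generator need N*a + (a + 1)*a*a, much smaller when a << d.
```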

Experiments

(Figure: overall_accuracy)

The paper evaluates Triformer on four public multivariate time series datasets and reports that it outperforms state-of-the-art methods in overall accuracy.

