This blog introduces the paper Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting.
Introduction
Conventional time-series forecasting models perform well on short-term prediction tasks. However, increasingly long sequences strain the models' prediction capacity, and this trend is holding back research on LSTF (long sequence time-series forecasting).
The major challenge for LSTF is to enhance the prediction capacity to meet the increasingly long sequence demand, which requires:
- extraordinary long-range alignment ability;
- efficient operations on long sequence inputs and outputs.
The contributions of this paper are summarized as follows:
- We propose Informer to successfully enhance the prediction capacity in the LSTF problem, which validates the Transformer-like model’s potential value to capture individual long-range dependency between long sequence time-series outputs and inputs.
- We propose ProbSparse self-attention mechanism to efficiently replace the canonical self-attention. It achieves $O(L \log L)$ time complexity and memory usage on dependency alignments.
- We propose self-attention distilling operation to privilege dominating attention scores in J-stacking layers and sharply reduce the total space complexity to $O((2-\epsilon)L \log L)$, which helps receiving long sequence input.
- We propose generative style decoder to acquire long sequence output with only one forward step needed, simultaneously avoiding cumulative error spreading during the inference phase.
Preliminary
The LSTF problem definition: under the rolling forecasting setting with a fixed-size window, the input is $\mathcal{X}^t = \{x_1^t, \ldots, x_{L_x}^t \mid x_i^t \in \mathbb{R}^{d_x}\}$, and the output is to predict the corresponding sequence $\mathcal{Y}^t = \{y_1^t, \ldots, y_{L_y}^t \mid y_i^t \in \mathbb{R}^{d_y}\}$. The LSTF problem encourages a longer output length $L_y$ than previous works, and the feature dimension is not limited to the univariate case ($d_y \ge 1$).
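As a quick illustration of the shapes involved, here is a minimal sketch; the concrete lengths (96 input steps, 336 output steps) and the feature count 7 are only assumed example values, not prescribed by the paper.

```python
import torch

# Hypothetical example sizes: L_x = 96 input steps, L_y = 336 output steps,
# d_x = d_y = 7 features (an ETT-like multivariate series is assumed here).
L_x, L_y, d_x, d_y = 96, 336, 7, 7

X_t = torch.randn(L_x, d_x)   # input sequence  {x_1^t, ..., x_{L_x}^t}
Y_t = torch.randn(L_y, d_y)   # target sequence {y_1^t, ..., y_{L_y}^t}

# LSTF asks the model to map X_t to a prediction of Y_t where L_y is long.
print(X_t.shape, Y_t.shape)   # torch.Size([96, 7]) torch.Size([336, 7])
```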
Methodology
Efficient Self-attention Mechanism
The canonical self-attention in Transformer is defined based on the tuple inputs, i.e., query, key and value, which performs the scaled dot-product as $\mathcal{A}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{Softmax}(\mathbf{Q}\mathbf{K}^\top / \sqrt{d})\mathbf{V}$, where $\mathbf{Q} \in \mathbb{R}^{L_Q \times d}$, $\mathbf{K} \in \mathbb{R}^{L_K \times d}$, $\mathbf{V} \in \mathbb{R}^{L_V \times d}$ and $d$ is the input dimension.
Let $q_i$, $k_i$, $v_i$ stand for the $i$-th row in $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$ respectively; the $i$-th query's attention is then defined as a kernel smoother in a probability form:
$$\mathcal{A}(q_i, \mathbf{K}, \mathbf{V}) = \sum_j \frac{k(q_i, k_j)}{\sum_l k(q_i, k_l)} v_j = \mathbb{E}_{p(k_j \mid q_i)}[v_j],$$
where $p(k_j \mid q_i) = k(q_i, k_j) / \sum_l k(q_i, k_l)$ and $k(q_i, k_j)$ selects the asymmetric exponential kernel $\exp(q_i k_j^\top / \sqrt{d})$. This requires the quadratic number of dot-product computations and $O(L_Q L_K)$ memory usage, which is the major drawback when enhancing prediction capacity.
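For reference, a minimal single-head PyTorch sketch of the canonical scaled dot-product attention described above (batch and head dimensions omitted; this is not the paper's implementation, just an illustration of the formula):

```python
import math
import torch

def canonical_attention(Q, K, V):
    """Canonical scaled dot-product attention: Softmax(Q K^T / sqrt(d)) V.

    Q: (L_Q, d), K: (L_K, d), V: (L_K, d_v)  -- single head, no batch dim.
    """
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)   # (L_Q, L_K) dot-products
    p = torch.softmax(scores, dim=-1)                 # p(k_j | q_i) per row
    return p @ V                                      # kernel-smoother form: E_p[v_j]
```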
Some previous attempts have revealed that the distribution of self-attention probability has potential sparsity, but they are limited to theoretical analysis following heuristic methods and tackle each multi-head self-attention with the same strategy, which narrows their further improvement.
The authors perform a qualitative assessment of the learned attention patterns of the canonical self-attention in the Transformer and find that the "sparse" self-attention score forms a long tail distribution.
That means only a few dot-product pairs contribute to the major attention and others generate trivial attention.
Query Sparsity Measurement
The dominant dot-product pairs encourage the corresponding query's attention probability distribution away from the uniform distribution.
If $p(k_j \mid q_i)$ is close to the uniform distribution $q(k_j \mid q_i) = 1/L_K$, the self-attention becomes a trivial sum of the values $\mathbf{V}$ and is redundant to the residential input. Naturally, the "likeness" between distributions $p$ and $q$ can be used to distinguish the "important" queries, measured through the Kullback-Leibler divergence
$$KL(q \,\|\, p) = \ln \sum_{l=1}^{L_K} e^{q_i k_l^\top / \sqrt{d}} - \frac{1}{L_K} \sum_{j=1}^{L_K} \frac{q_i k_j^\top}{\sqrt{d}} - \ln L_K.$$
Dropping the constant, the $i$-th query's sparsity measurement is defined as
$$M(q_i, \mathbf{K}) = \ln \sum_{j=1}^{L_K} e^{q_i k_j^\top / \sqrt{d}} - \frac{1}{L_K} \sum_{j=1}^{L_K} \frac{q_i k_j^\top}{\sqrt{d}}.$$
(The first term is the Log-Sum-Exp (LSE) of $q_i$ on all the keys, and the second term is the arithmetic mean.)
If the $i$-th query gains a larger $M(q_i, \mathbf{K})$, its attention probability $p$ is more "diverse" and has a high chance to contain the dominant dot-product pairs in the header field of the long tail self-attention distribution.
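A small sketch of the exact sparsity measurement $M(q_i, \mathbf{K})$ as written above, computed for every query at once; note this is still the quadratic-cost version, before the approximation introduced in the next subsection:

```python
import math
import torch

def sparsity_measurement(Q, K):
    """M(q_i, K) = LSE_j(q_i k_j^T / sqrt(d)) - mean_j(q_i k_j^T / sqrt(d)).

    Q: (L_Q, d), K: (L_K, d). Returns a vector of length L_Q.
    Computing all L_Q x L_K dot-products costs O(L_Q * L_K).
    """
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)   # (L_Q, L_K)
    lse = torch.logsumexp(scores, dim=-1)             # first term: Log-Sum-Exp
    mean = scores.mean(dim=-1)                        # second term: arithmetic mean
    return lse - mean
```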
ProbSparse Self-attention
Propose the ProbSparse self-attention by allowing each key to only attend to the $u$ dominant queries:
$$\mathcal{A}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{Softmax}\!\left(\frac{\overline{\mathbf{Q}}\mathbf{K}^\top}{\sqrt{d}}\right)\mathbf{V},$$
where $\overline{\mathbf{Q}}$ is a sparse matrix of the same size as $\mathbf{Q}$ that only contains the Top-$u$ queries under the sparsity measurement $M(q, \mathbf{K})$. Controlled by a constant sampling factor $c$, we set $u = c \cdot \ln L_Q$.
Propose an empirical approximation for the efficient acquisition of the query sparsity measurement, since traversing all the queries for $M(q_i, \mathbf{K})$ requires calculating each dot-product pair, i.e., quadratic $O(L_Q L_K)$ cost, and the LSE operation has a potential numerical-stability issue.
Lemma 1. For each query $q_i \in \mathbb{R}^d$ and $k_j \in \mathbb{R}^d$ in the keys set $\mathbf{K}$, we have the bound $\ln L_K \le M(q_i, \mathbf{K}) \le \max_j \{q_i k_j^\top / \sqrt{d}\} - \frac{1}{L_K}\sum_{j=1}^{L_K} \{q_i k_j^\top / \sqrt{d}\} + \ln L_K$. When $q_i \in \mathbf{K}$, it also holds.
From Lemma 1, the max-mean measurement is defined as
$$\overline{M}(q_i, \mathbf{K}) = \max_j \left\{\frac{q_i k_j^\top}{\sqrt{d}}\right\} - \frac{1}{L_K}\sum_{j=1}^{L_K} \frac{q_i k_j^\top}{\sqrt{d}}.$$
Under the long tail distribution, we only need to randomly sample $U = L_K \ln L_Q$ dot-product pairs to calculate $\overline{M}(q_i, \mathbf{K})$, i.e., filling the other pairs with zero, and then select the sparse Top-$u$ from them as $\overline{\mathbf{Q}}$. The max-operator in $\overline{M}(q_i, \mathbf{K})$ is less sensitive to zero values and is numerically stable.
In practice, the input lengths of queries and keys are typically equivalent in the self-attention computation, i.e., $L_Q = L_K = L$, such that the total ProbSparse self-attention time complexity and space complexity are $O(L \ln L)$.
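Putting the pieces together, below is a simplified single-head sketch of the ProbSparse selection: sample a few keys per query (so the total number of sampled dot-products stays around $L \ln L$), score queries with the max-mean measurement $\overline{M}$, keep the Top-$u$ queries, and let the remaining queries fall back to the mean of $\mathbf{V}$. The official Informer implementation handles batching, multiple heads and the causal decoder mask; the function name and the default factor c=5 here are assumptions for illustration only.

```python
import math
import torch

def probsparse_attention(Q, K, V, c=5):
    """Simplified ProbSparse self-attention (single head, no batch, no causal mask)."""
    L_Q, d = Q.shape
    L_K = K.shape[0]
    U_part = min(L_K, int(math.ceil(c * math.log(L_K))))  # sampled keys per query
    u = min(L_Q, int(math.ceil(c * math.log(L_Q))))       # number of "active" queries

    # Step 1: sampled dot-products and max-mean measurement M_bar(q_i, K)
    idx = torch.randint(0, L_K, (L_Q, U_part))                      # random key indices
    sampled = (Q.unsqueeze(1) * K[idx]).sum(-1) / math.sqrt(d)      # (L_Q, U_part)
    M_bar = sampled.max(dim=-1).values - sampled.mean(dim=-1)       # (L_Q,)

    # Step 2: only the Top-u queries get the full attention
    top = M_bar.topk(u).indices                                     # (u,)
    scores = Q[top] @ K.transpose(-2, -1) / math.sqrt(d)            # (u, L_K)
    active = torch.softmax(scores, dim=-1) @ V                      # (u, d_v)

    # Step 3: "lazy" queries take the mean of V as their output
    out = V.mean(dim=0, keepdim=True).expand(L_Q, -1).clone()
    out[top] = active
    return out
```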
Encoder: Allowing for Processing Longer Sequential Inputs under the Memory Usage Limitation
Self-attention Distilling
As a natural consequence of the ProbSparse self-attention mechanism, the encoder's feature map has redundant combinations of value $\mathbf{V}$. The distilling operation privileges the superior ones with dominating features and makes a focused self-attention feature map in the next layer. The distilling procedure forwards from the $j$-th layer into the $(j+1)$-th layer as
$$\mathbf{X}_{j+1}^t = \mathrm{MaxPool}\left(\mathrm{ELU}\left(\mathrm{Conv1d}\left([\mathbf{X}_j^t]_{AB}\right)\right)\right),$$
where $[\cdot]_{AB}$ represents the attention block, containing the Multi-head ProbSparse self-attention and the essential operations; $\mathrm{Conv1d}(\cdot)$ performs a 1-D convolutional filter (kernel width = 3) on the time dimension with the $\mathrm{ELU}(\cdot)$ activation function, and a max-pooling layer with stride 2 down-samples $\mathbf{X}^t$ to its half slice after stacking a layer.
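A minimal sketch of one distilling step between attention blocks, following the Conv1d → ELU → MaxPool (stride 2) composition given above. The attention block itself is omitted, and the layer size d_model = 512 is an assumed example; details such as padding mode and normalization in the official code may differ.

```python
import torch
import torch.nn as nn

class DistillingLayer(nn.Module):
    """Halves the time dimension between encoder attention blocks:
    x_{j+1} = MaxPool(ELU(Conv1d([x_j]_AB))); the attention block [.]_AB is omitted here.
    """
    def __init__(self, d_model=512):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.act = nn.ELU()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):                 # x: (batch, L, d_model)
        x = x.transpose(1, 2)             # Conv1d expects (batch, channels, L)
        x = self.pool(self.act(self.conv(x)))
        return x.transpose(1, 2)          # (batch, ceil(L/2), d_model)

x = torch.randn(2, 96, 512)
print(DistillingLayer()(x).shape)         # torch.Size([2, 48, 512])
```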
Decoder: Generating Long Sequential Outputs Through One Forward Procedure
Use a standard decoder structure from the Transformer, composed of a stack of two identical multi-head attention layers. Instead of step-by-step dynamic decoding, generative inference is employed: the decoder is fed the known start token together with a placeholder for the target sequence and predicts the whole long output in one forward step, which alleviates the speed plunge in long prediction.
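A small sketch of how the decoder input could be assembled for generative inference: a start-token slice of the known history concatenated with a zero placeholder for the targets. The lengths L_token = 48 and L_y = 24 are assumed example values, and the real model also adds time-stamp embeddings, which are omitted here.

```python
import torch

def build_decoder_input(x_enc, L_token=48, L_y=24):
    """Concat(X_token, X_0): the last L_token known steps plus L_y zero placeholders.

    x_enc: (batch, L_x, d) known input sequence.
    Returns: (batch, L_token + L_y, d) decoder input, predicted in one forward pass.
    """
    x_token = x_enc[:, -L_token:, :]                       # start token from history
    x_zero = torch.zeros(x_enc.size(0), L_y, x_enc.size(-1),
                         device=x_enc.device, dtype=x_enc.dtype)
    return torch.cat([x_token, x_zero], dim=1)

x_enc = torch.randn(2, 96, 7)
print(build_decoder_input(x_enc).shape)                    # torch.Size([2, 72, 7])
```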
Experiments
The paper conducts extensive experiments on large-scale real-world time-series datasets, and the results show that Informer significantly outperforms existing methods on the LSTF problem, providing a new solution to it.
If you like this blog or find it useful, you are welcome to comment on it. You are also welcome to share this blog so that more people can join the discussion. If any images used in this blog infringe your copyright, please contact the author to delete them. Thank you!