This blog introduces the paper Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting.
Introduction
Conventional time-series forecasting models perform well on short-term prediction tasks. However, increasingly long sequences strain the models' prediction capacity, and this trend is holding back research on LSTF (long sequence time-series forecasting).
The major challenge for LSTF is to enhance the prediction capacity to meet the increasingly long sequence demand, which requires:
- extraordinary long-range alignment ability;
- efficient operations on long sequence inputs and outputs.
The contributions of this paper are summarized as follows:
- We propose Informer to successfully enhance the prediction capacity in the LSTF problem, which validates the Transformer-like model’s potential value to capture individual long-range dependency between long sequence time-series outputs and inputs.
- We propose ProbSparse self-attention mechanism to efficiently replace the canonical self-attention. It achieves $O(L \log L)$ time complexity and memory usage on dependency alignments.
- We propose self-attention distilling operation to privilege dominating attention scores in J-stacking layers and sharply reduce the total space complexity to $O((2-\epsilon)L \log L)$, which helps receiving long sequence input.
- We propose generative style decoder to acquire long sequence output with only one forward step needed, simultaneously avoiding cumulative error spreading during the inference phase.
Preliminary
The LSTF problem definition: under the rolling forecasting setting with a fixed-size window, the input is $\mathcal{X}^t = \{x_1^t, \ldots, x_{L_x}^t \mid x_i^t \in \mathbb{R}^{d_x}\}$, and the output is to predict the corresponding sequence $\mathcal{Y}^t = \{y_1^t, \ldots, y_{L_y}^t \mid y_i^t \in \mathbb{R}^{d_y}\}$. The LSTF problem encourages a longer output length $L_y$ than previous works, and the feature dimension is not limited to the univariate case ($d_y \ge 1$).
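As a quick illustration of the shapes involved, here is a minimal sketch; the concrete lengths (96 input steps, 336 output steps) and the feature count 7 are only assumed example values, not prescribed by the paper.

```python
import torch

# Hypothetical example sizes: L_x = 96 input steps, L_y = 336 output steps,
# d_x = d_y = 7 features (an ETT-like multivariate series is assumed here).
L_x, L_y, d_x, d_y = 96, 336, 7, 7

X_t = torch.randn(L_x, d_x)   # input sequence  {x_1^t, ..., x_{L_x}^t}
Y_t = torch.randn(L_y, d_y)   # target sequence {y_1^t, ..., y_{L_y}^t}

# LSTF asks the model to map X_t to a prediction of Y_t where L_y is long.
print(X_t.shape, Y_t.shape)   # torch.Size([96, 7]) torch.Size([336, 7])
```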
Methodology
Efficient Self-attention Mechanism
The canonical self-attention in Transformer is defined based on the tuple inputs, i.e., query, key and value, which performs the scaled dot-product as $\mathcal{A}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{Softmax}(\mathbf{Q}\mathbf{K}^\top / \sqrt{d})\mathbf{V}$, where $\mathbf{Q} \in \mathbb{R}^{L_Q \times d}$, $\mathbf{K} \in \mathbb{R}^{L_K \times d}$, $\mathbf{V} \in \mathbb{R}^{L_V \times d}$ and $d$ is the input dimension.
Let $q_i$, $k_i$, $v_i$ stand for the $i$-th row in $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$ respectively; the $i$-th query's attention is then defined as a kernel smoother in a probability form:
$$\mathcal{A}(q_i, \mathbf{K}, \mathbf{V}) = \sum_j \frac{k(q_i, k_j)}{\sum_l k(q_i, k_l)} v_j = \mathbb{E}_{p(k_j \mid q_i)}[v_j],$$
where $p(k_j \mid q_i) = k(q_i, k_j) / \sum_l k(q_i, k_l)$ and $k(q_i, k_j)$ selects the asymmetric exponential kernel $\exp(q_i k_j^\top / \sqrt{d})$. This requires the quadratic number of dot-product computations and $O(L_Q L_K)$ memory usage, which is the major drawback when enhancing prediction capacity.
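For reference, a minimal single-head PyTorch sketch of the canonical scaled dot-product attention described above (batch and head dimensions omitted; this is not the paper's implementation, just an illustration of the formula):

```python
import math
import torch

def canonical_attention(Q, K, V):
    """Canonical scaled dot-product attention: Softmax(Q K^T / sqrt(d)) V.

    Q: (L_Q, d), K: (L_K, d), V: (L_K, d_v)  -- single head, no batch dim.
    """
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)   # (L_Q, L_K) dot-products
    p = torch.softmax(scores, dim=-1)                 # p(k_j | q_i) per row
    return p @ V                                      # kernel-smoother form: E_p[v_j]
```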
Some previous attempts have revealed that the distribution of self-attention probability has potential sparsity, but they are limited to theoretical analysis following heuristic methods and tackle each multi-head self-attention with the same strategy, which narrows their further improvement.
The authors perform a qualitative assessment of the learned attention patterns of the canonical self-attention in the Transformer and find that the "sparse" self-attention score forms a long tail distribution.
That means only a few dot-product pairs contribute to the major attention and others generate trivial attention.
Query Sparsity Measurement
The dominant dot-product pairs encourage the corresponding query's attention probability distribution away from the uniform distribution.
If $p(k_j \mid q_i)$ is close to the uniform distribution $q(k_j \mid q_i) = 1/L_K$, the self-attention becomes a trivial sum of the values $\mathbf{V}$ and is redundant to the residential input. Naturally, the "likeness" between distributions $p$ and $q$ can be used to distinguish the "important" queries, measured through the Kullback-Leibler divergence
$$KL(q \,\|\, p) = \ln \sum_{l=1}^{L_K} e^{q_i k_l^\top / \sqrt{d}} - \frac{1}{L_K} \sum_{j=1}^{L_K} \frac{q_i k_j^\top}{\sqrt{d}} - \ln L_K.$$
Dropping the constant, the $i$-th query's sparsity measurement is defined as
$$M(q_i, \mathbf{K}) = \ln \sum_{j=1}^{L_K} e^{q_i k_j^\top / \sqrt{d}} - \frac{1}{L_K} \sum_{j=1}^{L_K} \frac{q_i k_j^\top}{\sqrt{d}}.$$
(The first term is the Log-Sum-Exp (LSE) of $q_i$ on all the keys, and the second term is the arithmetic mean.)
If the $i$-th query gains a larger $M(q_i, \mathbf{K})$, its attention probability $p$ is more "diverse" and has a high chance to contain the dominant dot-product pairs in the header field of the long tail self-attention distribution.
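A small sketch of the exact sparsity measurement $M(q_i, \mathbf{K})$ as written above, computed for every query at once; note this is still the quadratic-cost version, before the approximation introduced in the next subsection:

```python
import math
import torch

def sparsity_measurement(Q, K):
    """M(q_i, K) = LSE_j(q_i k_j^T / sqrt(d)) - mean_j(q_i k_j^T / sqrt(d)).

    Q: (L_Q, d), K: (L_K, d). Returns a vector of length L_Q.
    Computing all L_Q x L_K dot-products costs O(L_Q * L_K).
    """
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)   # (L_Q, L_K)
    lse = torch.logsumexp(scores, dim=-1)             # first term: Log-Sum-Exp
    mean = scores.mean(dim=-1)                        # second term: arithmetic mean
    return lse - mean
```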
ProbSparse Self-attention
Propose the ProbSparse self-attention by allowing each key to only attend to the $u$ dominant queries:
$$\mathcal{A}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{Softmax}\!\left(\frac{\overline{\mathbf{Q}}\mathbf{K}^\top}{\sqrt{d}}\right)\mathbf{V},$$
where $\overline{\mathbf{Q}}$ is a sparse matrix of the same size as $\mathbf{Q}$ that only contains the Top-$u$ queries under the sparsity measurement $M(q, \mathbf{K})$. Controlled by a constant sampling factor $c$, we set $u = c \cdot \ln L_Q$.
Propose an empirical approximation for the efficient acquisition of the query sparsity measurement, since traversing all the queries for $M(q_i, \mathbf{K})$ requires calculating each dot-product pair, i.e., quadratic $O(L_Q L_K)$ cost, and the LSE operation has a potential numerical-stability issue.
Lemma 1. For each query $q_i \in \mathbb{R}^d$ and $k_j \in \mathbb{R}^d$ in the keys set $\mathbf{K}$, we have the bound $\ln L_K \le M(q_i, \mathbf{K}) \le \max_j \{q_i k_j^\top / \sqrt{d}\} - \frac{1}{L_K}\sum_{j=1}^{L_K} \{q_i k_j^\top / \sqrt{d}\} + \ln L_K$. When $q_i \in \mathbf{K}$, it also holds.
From Lemma 1, the max-mean measurement is defined as
$$\overline{M}(q_i, \mathbf{K}) = \max_j \left\{\frac{q_i k_j^\top}{\sqrt{d}}\right\} - \frac{1}{L_K}\sum_{j=1}^{L_K} \frac{q_i k_j^\top}{\sqrt{d}}.$$
Under the long tail distribution, we only need to randomly sample $U = L_K \ln L_Q$ dot-product pairs to calculate $\overline{M}(q_i, \mathbf{K})$, i.e., filling the other pairs with zero, and then select the sparse Top-$u$ from them as $\overline{\mathbf{Q}}$. The max-operator in $\overline{M}(q_i, \mathbf{K})$ is less sensitive to zero values and is numerically stable.
In practice, the input lengths of queries and keys are typically equivalent in the self-attention computation, i.e., $L_Q = L_K = L$, such that the total ProbSparse self-attention time complexity and space complexity are $O(L \ln L)$.
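Putting the pieces together, below is a simplified single-head sketch of the ProbSparse selection: sample a few keys per query (so the total number of sampled dot-products stays around $L \ln L$), score queries with the max-mean measurement $\overline{M}$, keep the Top-$u$ queries, and let the remaining queries fall back to the mean of $\mathbf{V}$. The official Informer implementation handles batching, multiple heads and the causal decoder mask; the function name and the default factor c=5 here are assumptions for illustration only.

```python
import math
import torch

def probsparse_attention(Q, K, V, c=5):
    """Simplified ProbSparse self-attention (single head, no batch, no causal mask)."""
    L_Q, d = Q.shape
    L_K = K.shape[0]
    U_part = min(L_K, int(math.ceil(c * math.log(L_K))))  # sampled keys per query
    u = min(L_Q, int(math.ceil(c * math.log(L_Q))))       # number of "active" queries

    # Step 1: sampled dot-products and max-mean measurement M_bar(q_i, K)
    idx = torch.randint(0, L_K, (L_Q, U_part))                      # random key indices
    sampled = (Q.unsqueeze(1) * K[idx]).sum(-1) / math.sqrt(d)      # (L_Q, U_part)
    M_bar = sampled.max(dim=-1).values - sampled.mean(dim=-1)       # (L_Q,)

    # Step 2: only the Top-u queries get the full attention
    top = M_bar.topk(u).indices                                     # (u,)
    scores = Q[top] @ K.transpose(-2, -1) / math.sqrt(d)            # (u, L_K)
    active = torch.softmax(scores, dim=-1) @ V                      # (u, d_v)

    # Step 3: "lazy" queries take the mean of V as their output
    out = V.mean(dim=0, keepdim=True).expand(L_Q, -1).clone()
    out[top] = active
    return out
```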
Encoder: Allowing for Processing Longer Sequential Inputs under the Memory Usage Limitation
Self-attention Distilling
As a natural consequence of the ProbSparse self-attention mechanism, the encoder's feature map has redundant combinations of value $\mathbf{V}$. The distilling operation privileges the superior ones with dominating features and makes a focused self-attention feature map in the next layer. The distilling procedure forwards from the $j$-th layer into the $(j+1)$-th layer as
$$\mathbf{X}_{j+1}^t = \mathrm{MaxPool}\left(\mathrm{ELU}\left(\mathrm{Conv1d}\left([\mathbf{X}_j^t]_{AB}\right)\right)\right),$$
where $[\cdot]_{AB}$ represents the attention block, containing the Multi-head ProbSparse self-attention and the essential operations; $\mathrm{Conv1d}(\cdot)$ performs a 1-D convolutional filter (kernel width = 3) on the time dimension with the $\mathrm{ELU}(\cdot)$ activation function, and a max-pooling layer with stride 2 down-samples $\mathbf{X}^t$ to its half slice after stacking a layer.
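A minimal sketch of one distilling step between attention blocks, following the Conv1d → ELU → MaxPool (stride 2) composition given above. The attention block itself is omitted, and the layer size d_model = 512 is an assumed example; details such as padding mode and normalization in the official code may differ.

```python
import torch
import torch.nn as nn

class DistillingLayer(nn.Module):
    """Halves the time dimension between encoder attention blocks:
    x_{j+1} = MaxPool(ELU(Conv1d([x_j]_AB))); the attention block [.]_AB is omitted here.
    """
    def __init__(self, d_model=512):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.act = nn.ELU()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):                 # x: (batch, L, d_model)
        x = x.transpose(1, 2)             # Conv1d expects (batch, channels, L)
        x = self.pool(self.act(self.conv(x)))
        return x.transpose(1, 2)          # (batch, ceil(L/2), d_model)

x = torch.randn(2, 96, 512)
print(DistillingLayer()(x).shape)         # torch.Size([2, 48, 512])
```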
Decoder: Generating Long Sequential Outputs Through One Forward Procedure
Use a standard decoder structure from the Transformer, composed of a stack of two identical multi-head attention layers. Instead of step-by-step dynamic decoding, generative inference is employed: the decoder is fed the known start token together with a placeholder for the target sequence and predicts the whole long output in one forward step, which alleviates the speed plunge in long prediction.
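A small sketch of how the decoder input could be assembled for generative inference: a start-token slice of the known history concatenated with a zero placeholder for the targets. The lengths L_token = 48 and L_y = 24 are assumed example values, and the real model also adds time-stamp embeddings, which are omitted here.

```python
import torch

def build_decoder_input(x_enc, L_token=48, L_y=24):
    """Concat(X_token, X_0): the last L_token known steps plus L_y zero placeholders.

    x_enc: (batch, L_x, d) known input sequence.
    Returns: (batch, L_token + L_y, d) decoder input, predicted in one forward pass.
    """
    x_token = x_enc[:, -L_token:, :]                       # start token from history
    x_zero = torch.zeros(x_enc.size(0), L_y, x_enc.size(-1),
                         device=x_enc.device, dtype=x_enc.dtype)
    return torch.cat([x_token, x_zero], dim=1)

x_enc = torch.randn(2, 96, 7)
print(build_decoder_input(x_enc).shape)                    # torch.Size([2, 72, 7])
```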
Experiments
The paper conducts extensive experiments on large-scale real-world time-series datasets, and the results show that Informer significantly outperforms existing methods on the LSTF problem, providing a new solution to it.
If you like this blog or find it useful, you are welcome to comment on it. You are also welcome to share this blog so that more people can join the discussion. If any images used in this blog infringe your copyright, please contact the author to delete them. Thank you!