This blog introduces the paper ST-GRAT.
Introduction
- A method for predicting traffic speed should not only find spatial-temporal dependencies among roads but also understand how these dependencies change over time and influence other traffic conditions
- Existing models assume fixed spatial dependencies among roads without considering dynamically changing traffic conditions
- To address this issue, recent models use multi-head attention to model spatial dependencies, but they neglect the overall graph structure information
- While existing attention-based models can replace GCN-based spatial modeling, they share a drawback: when deciding when and where to attend, they do not consider the information embedded in the graph structure, such as distances, connectivity, and flow directions between nodes. In contrast, ST-GRAT introduces a novel spatial attention mechanism that takes all of this graph structure information into account.
- In this work, a novel spatio-temporal graph attention network (ST-GRAT) is proposed, which both dynamically adjusts attention weights and incorporates the graph structure information
Proposed Approach
Problem Setting for Traffic Speed Forecasting
Define the sensor graph as $\mathcal{G} = (\mathcal{V}, \mathcal{E}, A)$, where $\mathcal{V}$ is the set of $N$ sensor nodes, $\mathcal{E}$ is the set of edges, and $A \in \mathbb{R}^{N \times N}$ is a weighted adjacency matrix encoding the proximity between nodes (e.g., road-network distance).
Denote the graph signal observed at time $t$ as $X^{(t)} \in \mathbb{R}^{N \times P}$, where $P$ is the number of features per node. The forecasting task is to learn a function $f(\cdot)$ that maps $T'$ historical graph signals to the speeds of the next $T$ time steps:
$$\left[X^{(t-T'+1)}, \dots, X^{(t)}; \mathcal{G}\right] \xrightarrow{f(\cdot)} \left[X^{(t+1)}, \dots, X^{(t+T)}\right]$$
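As a rough illustration of this problem setting (the feature count and horizons below are typical choices for these datasets, not values taken from the paper), the input and output tensors might be shaped as follows:

```python
import numpy as np

N, P = 207, 2          # e.g., 207 sensors (METR-LA) with speed + time-of-day features
T_in, T_out = 12, 12   # predict the next hour from the previous hour of 5-minute readings

A = np.random.rand(N, N)               # weighted adjacency matrix of the sensor graph
X_past = np.random.rand(T_in, N, P)    # historical graph signals X^(t-T'+1), ..., X^(t)
Y_future = np.random.rand(T_out, N)    # target speeds X^(t+1), ..., X^(t+T)

# The forecasting model learns a mapping f: (X_past, G) -> Y_future.
print(X_past.shape, "->", Y_future.shape)
```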
ST-GRAT adopts an encoder-decoder architecture:
Encoder Architecture
A single encoder layer consists of three sequential sub-layers to capture both temporal and spatial information: the spatial attention layer, the temporal attention layer, and the point-wise feed-forward neural networks.
The spatial attention layer attends to neighbor nodes spatially correlated with the center node at each time step, while the temporal attention layer attends to each node individually and focuses on different time steps of a given input sequence. The point-wise feed-forward networks create high-level features that integrate the information from the two attention layers; they consist of two sequential fully connected layers with a GELU activation function.
The encoder uses skip-connections that bypass each sub-layer, and applies layer normalization and dropout after each sub-layer to improve generalization. The encoder is a stack of one embedding layer and four (L=4) identical encoder layers, and it transforms the spatial and temporal dependencies of an input signal into a hidden representation vector that the attention layers in the decoder attend to.
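As a minimal PyTorch sketch of this layer structure (module names are mine, and the two attention sub-layers are plain placeholders rather than ST-GRAT's actual spatial attention with sentinels and a diffusion prior), one encoder layer could be organized like this:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=64, n_heads=4, d_ff=128, dropout=0.1):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])
        self.drop = nn.Dropout(dropout)

    def forward(self, x):  # x: (batch, T, N, d_model)
        B, T, N, D = x.shape
        # Spatial attention: attend across the N nodes at each time step.
        s = x.reshape(B * T, N, D)
        s = self.spatial_attn(s, s, s)[0].reshape(B, T, N, D)
        x = self.norms[0](x + self.drop(s))
        # Temporal attention: attend across the T time steps of each node.
        t = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        t = self.temporal_attn(t, t, t)[0].reshape(B, N, T, D).permute(0, 2, 1, 3)
        x = self.norms[1](x + self.drop(t))
        # Point-wise feed-forward network with GELU.
        return self.norms[2](x + self.drop(self.ffn(x)))
```

The full encoder would stack four such layers on top of the embedding layer.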
Embedding Layer
To incorporate the proximity information, the embedding layer in ST-GRAT takes a pre-trained node-embedding vector generated by LINE.
The embedding layer also performs positional embedding to capture the order of the input sequence, using the positional encoding scheme of the Transformer, together with residual skip connections so that the embedded information does not vanish as the network gets deeper.
Each node-embedding result is concatenated with the node features, and the concatenated features are then projected onto a higher-dimensional model space.
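A minimal sketch of this embedding step, assuming a pre-trained LINE embedding matrix is already available (names and shapes are illustrative, not the authors' code):

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positions(T, d_model):
    """Standard Transformer positional encoding, shape (T, d_model); assumes even d_model."""
    pos = torch.arange(T, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(T, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class InputEmbedding(nn.Module):
    def __init__(self, node_emb, feat_dim, d_model):
        super().__init__()
        # node_emb: (N, emb_dim) pre-trained LINE embeddings, kept frozen here.
        self.node_emb = nn.Parameter(node_emb, requires_grad=False)
        self.proj = nn.Linear(node_emb.shape[1] + feat_dim, d_model)

    def forward(self, x):  # x: (batch, T, N, feat_dim)
        B, T, N, _ = x.shape
        e = self.node_emb.expand(B, T, N, -1)             # broadcast node embedding
        h = self.proj(torch.cat([x, e], dim=-1))          # concatenate, project to d_model
        return h + sinusoidal_positions(T, h.shape[-1]).view(1, T, 1, -1)
```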
Spatial Attention
The spatial attention mechanism consists of multiple inflow attention heads (odd head indices) and outflow attention heads (even head indices), because the model takes the flow directions of the road network into account.
The hidden states of the encoder at a given time step can be denoted as $X = \{x_1, \dots, x_N\}$, where $x_i$ is the representation of node $i$.
Fig. 2 shows how the multi-head attention mechanism assigns different tasks to the odd- and even-indexed heads. In the first attention head, node a attends to its inflow neighbors, nodes b and d. Each pair of inflow and outflow heads covers the full spatial dependencies of a node, and using multiple such pairs lets the model extract as many features from different aspects as possible.
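As an illustration of how inflow and outflow neighbor sets could be derived from a directed, weighted adjacency matrix (a sketch under my own edge-direction convention, not the paper's code):

```python
import numpy as np

def flow_masks(A, include_self=True):
    """A[u, v] > 0 means traffic can flow from sensor u to sensor v.
    Returns boolean masks of shape (N, N): mask[v, u] is True if the head
    centered on node v is allowed to attend to node u."""
    inflow = A.T > 0    # u -> v edges: u is an inflow neighbor of v
    outflow = A > 0     # v -> u edges: u is an outflow neighbor of v
    if include_self:
        np.fill_diagonal(inflow, True)
        np.fill_diagonal(outflow, True)
    return inflow, outflow

A = np.array([[0, 1, 0],
              [0, 0, 2],
              [3, 0, 0]], dtype=float)
inflow, outflow = flow_masks(A)
print(inflow[0])   # nodes whose traffic flows into node 0
```

Odd-indexed heads would then attend over the inflow mask and even-indexed heads over the outflow mask.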
The self-attention mechanism in the Transformer constrains the attention weights to sum to one, so a node must distribute attention over its neighbors even when two connected nodes have no actual spatial correlation at the moment. To prevent such unnecessary attention, the authors add spatial sentinel key and value vectors, which are linear transformations of the query node, and modify the attention computation as follows:
$$\mathrm{head}(x_v) = \alpha_{v,s}\,(x_v W^{SV}) + \sum_{u \in \mathcal{N}(v)} \alpha_{vu}\,(x_u W^{V}),$$
where $\mathcal{N}(v)$ is the set of inflow (or outflow) neighbors of node $v$ and $\alpha_{v,s}$ is the weight assigned to the sentinel. The attention weights are normalized jointly over the neighbors and the sentinel,
$$\alpha_{vu} = \frac{\exp(e_{vu})}{\exp(e_{v,s}) + \sum_{u' \in \mathcal{N}(v)} \exp(e_{vu'})},$$
where $e_{vu}$ is the energy logit between nodes $v$ and $u$, $e_{v,s}$ is the sentinel energy logit, and where parameter matrices $W^{Q}$, $W^{K}$, $W^{V}$, $W^{SK}$, and $W^{SV}$ are the learned projections for the queries, keys, values, and the sentinel key and value.
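Putting the formulas above together, a single spatial attention head with a sentinel could be sketched as follows (the variable names and the additive use of the prior are my assumptions, not the authors' implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentinelSpatialHead(nn.Module):
    def __init__(self, d_model, d_head):
        super().__init__()
        self.q = nn.Linear(d_model, d_head)
        self.k = nn.Linear(d_model, d_head)
        self.v = nn.Linear(d_model, d_head)
        self.sk = nn.Linear(d_model, d_head)   # sentinel key
        self.sv = nn.Linear(d_model, d_head)   # sentinel value
        self.scale = d_head ** 0.5

    def forward(self, x, neighbor_mask, prior=None):
        # x: (N, d_model); neighbor_mask: (N, N) boolean; prior: (N, N) or None
        q, k, v = self.q(x), self.k(x), self.v(x)
        e = q @ k.T / self.scale                            # energy logits e_{vu}
        if prior is not None:
            e = e + prior                                   # diffusion prior (assumed additive)
        e = e.masked_fill(~neighbor_mask, float("-inf"))    # attend only to neighbors
        e_s = (q * self.sk(x)).sum(-1, keepdim=True) / self.scale   # sentinel logits e_{v,s}
        alpha = F.softmax(torch.cat([e, e_s], dim=-1), dim=-1)      # joint normalization
        return alpha[:, :-1] @ v + alpha[:, -1:] * self.sv(x)       # (N, d_head)

head = SentinelSpatialHead(d_model=64, d_head=16)
x = torch.randn(207, 64)
mask = torch.rand(207, 207) > 0.9            # hypothetical sparse neighbor mask
print(head(x, mask).shape)                   # torch.Size([207, 16])
```

Because the softmax runs over the neighbors and the sentinel together, attention mass not spent on neighbors flows to the node's own sentinel value.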
To explicitly provide edge information, additional prior knowledge, a diffusion prior derived from the graph structure (distances, connectivity, and flow directions), is added to the energy logits:
$$e_{vu} = \frac{(x_v W^{Q})(x_u W^{K})^{\top}}{\sqrt{d}} + \mathcal{P}_{vu},$$
where $d$ is the head dimension and $\mathcal{P}_{vu}$ is the diffusion-prior term for the edge between $u$ and $v$, computed from the random-walk transition matrices of the road graph (inflow heads use the incoming direction and outflow heads the outgoing direction).
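The paper defines the diffusion prior precisely; as a rough sketch of the general idea, a truncated random-walk diffusion matrix in the spirit of DCRNN's diffusion process could be computed like this (the exact form used in ST-GRAT may differ):

```python
import numpy as np

def diffusion_prior(A, k_steps=2):
    """A: (N, N) weighted adjacency, A[u, v] = weight of edge u -> v.
    Returns prior[v, u] summarizing how strongly node u reaches node v
    within k_steps random-walk steps."""
    deg = A.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                      # guard against isolated nodes
    trans = A / deg                          # row-normalized transition matrix
    prior = np.zeros_like(A)
    walk = np.eye(A.shape[0])
    for _ in range(k_steps):
        walk = walk @ trans                  # k-step transition probabilities
        prior = prior + walk
    return prior.T                           # index as prior[v, u]

A = np.array([[0, 1, 0],
              [0, 0, 2],
              [3, 0, 0]], dtype=float)
print(diffusion_prior(A, k_steps=2))
```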
The sentinel energy logit excludes the prior knowledge and uses the sentinel key vector instead; it is defined as
$$e_{v,s} = \frac{(x_v W^{Q})(x_v W^{SK})^{\top}}{\sqrt{d}},$$
where $W^{SK}$ is the sentinel key projection. As a result, if the neighbors of a node carry little relevant information, most of the attention weight falls on the sentinel and the node largely keeps its own representation; otherwise, the attention mass is distributed over the informative neighbors.
After computing the output features of all inflow and outflow heads, they are concatenated and linearly projected to form the output of the spatial attention layer:
$$\mathrm{SpatialAttn}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_H)\, W^{O},$$
where $H$ is the number of heads and $W^{O}$ is a learned output projection matrix.
Temporal Attention
- Temporal attention does not use the sentinel vectors or the diffusion prior
- Temporal attention attends to important time steps of each node, while spatial attention attends to important nodes at each time step
- It uses the multi-head attention mechanism proposed in the Transformer
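Since temporal attention is plain Transformer multi-head attention applied along the time axis independently for every node, a minimal sketch (shapes and names are illustrative) looks like:

```python
import torch
import torch.nn as nn

def temporal_attention(x, mha):
    """x: (batch, T, N, d_model); mha: nn.MultiheadAttention with batch_first=True."""
    B, T, N, D = x.shape
    h = x.permute(0, 2, 1, 3).reshape(B * N, T, D)   # one sequence of T steps per node
    h, _ = mha(h, h, h)                              # attend over time steps only
    return h.reshape(B, N, T, D).permute(0, 2, 1, 3)

mha = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(8, 12, 207, 64)                      # (batch, T=12 steps, N=207 nodes, d)
print(temporal_attention(x, mha).shape)              # torch.Size([8, 12, 207, 64])
```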
Decoder Architecture
Designed like the decoder in the Transformer, the decoder consists of an embedding layer and four sub-layers: the spatial attention layer, two temporal attention layers, and the feed-forward neural networks. Layer normalization is applied after each sub-layer. The masked attention layer masks future time-step inputs so that attention is restricted to present and past information. The encoder-decoder attention layer then extracts features by combining the encoded vectors from the encoder with the output of the masked attention layer: the encoder representations are used as both keys and values, while the query features of each node come from the masked self-attention output.
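A small sketch of the causal mask behind the masked temporal attention, using PyTorch's standard multi-head attention (illustrative only):

```python
import torch
import torch.nn as nn

T = 6
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # True = blocked
print(causal_mask.int())

mha = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
x = torch.randn(2, T, 32)                        # (batch, T, d_model), one node's sequence
out, _ = mha(x, x, x, attn_mask=causal_mask)     # each step attends only to itself and the past
print(out.shape)                                 # torch.Size([2, 6, 32])
```

In the encoder-decoder attention that follows, the encoder outputs would serve as the keys and values while the masked self-attention output provides the queries.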
Experiment
The model is evaluated on two real-world traffic datasets, METR-LA and PEMS-BAY.
The paper: "ST-GRAT: A Novel Spatio-temporal Graph Attention Networks for Accurately Forecasting Dynamically Changing Road Speed" (CIKM 2020).
If you find this blog useful, you are welcome to comment on it. You are also welcome to share it so that more people can benefit. If any images used in this blog infringe your copyright, please contact the author to have them removed. Thank you!