ST-GRAT: A Novel Spatio-temporal Graph Attention Network for Accurately Forecasting Dynamically Changing Road Speed

An introduction to ST-GRAT

Posted by Ccloud on 2023-02-02

This blog introduces the paper ST-GRAT.

Introduction

  • A method for predicting traffic speed should not only find spatio-temporal dependencies among roads but also understand how these dependencies change over time and influence other traffic conditions.
  • Existing models assume fixed spatial dependencies among roads without considering dynamically changing traffic conditions.
  • To handle this issue, recent models utilize multi-head attention to model spatial dependencies, but they neglect the overall graph structure information.
  • While existing attention-based models can replace GCNN-based spatial modeling, they share a drawback: they do not consider the information embedded in the graph structure (such as distances, connectivity, and flow directions between nodes) when deciding when and where to attend. Compared to previous models, ST-GRAT has a novel spatial attention mechanism that can consider all of this graph structure information.
  • In this work, a novel spatio-temporal graph attention network (ST-GRAT) is proposed that both dynamically adjusts attention weights and incorporates the graph structure information.

Proposed approach

Problem Setting for Traffic Speed Forecasting

Define the sensor graph as $\mathcal{G} = (\mathcal{V}, \mathcal{E}, A)$, where $\mathcal{V}$ is the set of all the different sensor nodes ($|\mathcal{V}| = N$), $\mathcal{E}$ is the set of edges, and $A \in \mathbb{R}^{N \times N}$ is a weighted adjacency matrix. $A$ includes two types of proximity information in the road network: connectivity and edge weights.

Denote $X^{(t)} \in \mathbb{R}^{N \times 2}$ as the input feature matrix at time $t$, where $N$ is the number of nodes and 2 is the number of features (velocity and the timestamp). The problem is to learn a mapping $f$ that predicts the speeds of the next $T'$ time steps $(X^{(t+1)}, \dots, X^{(t+T')})$, given the previous input speeds in a sequence $(X^{(t-T+1)}, \dots, X^{(t)})$ and graph $\mathcal{G}$, i.e.,

$$[X^{(t-T+1)}, \dots, X^{(t)}; \mathcal{G}] \xrightarrow{\;f\;} [X^{(t+1)}, \dots, X^{(t+T')}]$$
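To make the shapes concrete, here is a minimal sketch of the input/output tensors, assuming PyTorch and hypothetical sizes (the `forecast` function is only a naive placeholder standing in for the learned mapping $f$, not the model itself):

```python
import torch

# Hypothetical sizes: N sensor nodes, 2 features (speed, timestamp),
# T past steps in, T' future steps out. 207 is the METR-LA sensor count.
N, T, T_out = 207, 12, 12
past = torch.randn(T, N, 2)        # X^(t-T+1), ..., X^(t)
adj = torch.rand(N, N)             # weighted adjacency matrix A

def forecast(past_seq, adj_matrix):
    """Stand-in for the learned mapping f; a trained ST-GRAT model replaces this."""
    last_speed = past_seq[-1:, :, :1]       # most recent speeds, shape (1, N, 1)
    return last_speed.repeat(T_out, 1, 1)   # naive baseline: repeat the last speed

future = forecast(past, adj)       # (T', N, 1): predicted speeds
print(past.shape, "->", future.shape)
```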

ST-GRAT is modeled with an encoder-decoder architecture:

(Figure: the overall architecture of ST-GRAT)

Encoder Architecture

A single encoder layer consists of three sequential sub-layers to capture both temporal and spatial information: the spatial attention layer, the temporal attention layer, and the point-wise feed-forward neural networks.

The spatial attention layer attends to neighbor nodes spatially correlated with the center node at each time step, while the temporal attention layer attends to individual nodes and focuses on different time steps of a given input sequence. The point-wise feed-forward networks create high-level features that integrate information from the two attention layers. These layers consist of two sequential fully connected networks with a GELU activation function.

The encoder has skip-connections to bypass the sub-layers, and layer normalization and dropout are applied after each sub-layer to improve generalization performance. The encoder architecture is a stack of one embedding layer and four (L=4) identical encoder layers; it transforms the spatial and temporal dependencies of an input signal into a hidden representation vector, which is used by the attention layers in the decoder.
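The following is a minimal PyTorch sketch of this layer ordering, using plain multi-head attention as a stand-in for the paper's specialized spatial attention (the module names, sizes, and the 4x FFN width are my own assumptions for illustration):

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Sketch of one ST-GRAT encoder layer: spatial attention -> temporal
    attention -> point-wise feed-forward, each wrapped with a residual
    connection, dropout, and layer normalization (details simplified)."""

    def __init__(self, d_model=64, n_heads=4, dropout=0.1):
        super().__init__()
        # Plain multi-head attention stands in for the paper's spatial
        # attention (which additionally uses sentinels and a diffusion prior).
        self.spatial_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(               # two FC layers with GELU
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                       # x: (T, N, d_model)
        s, _ = self.spatial_attn(x, x, x)       # batch=T: attend over nodes per step
        x = self.norms[0](x + self.drop(s))
        t = x.transpose(0, 1)                   # (N, T, d): attend over time per node
        t, _ = self.temporal_attn(t, t, t)
        x = self.norms[1](x + self.drop(t.transpose(0, 1)))
        return self.norms[2](x + self.drop(self.ffn(x)))

# Usage: out = EncoderLayer()(torch.randn(12, 207, 64))
```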

Embedding Layer

To incorporate the proximity information, the embedding layer in ST-GRAT takes a pre-trained node-embedding vector generated by LINE.

The embedding layer also performs positional embedding to capture the order of the input sequence, using the positional encoding scheme of the Transformer, and residual skip connections are used to prevent the positional information from vanishing as the depth increases.

Each node-embedding result is concatenated with the node features, the concatenated features are projected onto $\mathbb{R}^{d_{model}}$, and finally the positional encoding vector is added to each time step.
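A small sketch of how such an embedding layer could look, assuming PyTorch; the class name, dimensions, and the use of a plain tensor for the pre-trained LINE embeddings are assumptions for illustration:

```python
import math
import torch
import torch.nn as nn

def sinusoidal_encoding(T, d_model):
    """Standard Transformer positional encoding, shape (T, 1, d_model)."""
    pos = torch.arange(T, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(T, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe.unsqueeze(1)

class EmbeddingLayer(nn.Module):
    """Sketch: concatenate per-node features with a (pre-trained) node
    embedding, project to d_model, then add positional encoding."""
    def __init__(self, node_emb, n_feats=2, d_model=64):
        super().__init__()
        self.node_emb = node_emb                      # (N, emb_dim), e.g. from LINE
        self.proj = nn.Linear(n_feats + node_emb.shape[1], d_model)

    def forward(self, x):                             # x: (T, N, n_feats)
        T, N, _ = x.shape
        emb = self.node_emb.unsqueeze(0).expand(T, -1, -1)
        h = self.proj(torch.cat([x, emb], dim=-1))    # (T, N, d_model)
        return h + sinusoidal_encoding(T, h.shape[-1])

# Usage: h = EmbeddingLayer(torch.randn(207, 16))(torch.randn(12, 207, 2))
```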

Spatial Attention

(Figure 2: the spatial attention mechanism of ST-GRAT)

The spatial attention mechanism consists of multiple inflow attention heads (odd head indices) and outflow attention heads (even head indices) because the model considers the direction of flow in a road network.

The hidden state of the encoder can be denoted as $X = \{x_1, \dots, x_N\}$, where $x_i \in \mathbb{R}^{d_{model}}$ is the hidden state of the $i$-th node. The set consisting of the $i$-th node and its neighbor nodes is denoted as $\mathcal{N}_i$. The dimensions of the query, key, and value vectors are defined as $d_k = d_{model} / H$, where $H$ is the number of attention heads in multi-head self-attention.

Figure 2 shows that the multi-head attention mechanism distinguishes the even and odd head indices by assigning them different tasks. In the first attention head, node a and its inflow neighbors, node b and node d, are taken into account. Every pair of attention heads extracts the full spatial dependencies, and the multi-head mechanism ensures that features are extracted from as many different aspects as possible.

The self-attention mechanism in the Transformer has the constraint that the sum of the attention weights has to be one. However, two nodes may have no spatial correlation even though they are connected in the graph. To prevent such unnecessary attention, the authors add spatial sentinel key and value vectors, which are linear transformations of the query node, and modify the attention output as follows:

$$\hat{x}_{h,i} = \beta_{h,i}\,(x_i W_h^{Sv}) + \sum_{j \in \mathcal{N}_i} \alpha_{h,i,j}\,(x_j W_h^{V})$$

where $\hat{x}_{h,i}$ is the output feature of the $i$-th node in the $h$-th attention head, and $W_h^{Sv}$ and $W_h^{V}$ indicate the linear transformation matrices of the sentinel value vector and the value vector of spatial attention, respectively. The attention coefficients are computed as

$$\alpha_{h,i,j} = \frac{\exp(e_{h,i,j})}{\exp(s_{h,i}) + \sum_{k \in \mathcal{N}_i} \exp(e_{h,i,k})}, \qquad \beta_{h,i} = 1 - \sum_{j \in \mathcal{N}_i} \alpha_{h,i,j}$$

where $e_{h,i,j}$ indicates the energy logits and $s_{h,i}$ represents the sentinel energy logit. The energy logits take the scaled dot-product form

$$e_{h,i,j} = \frac{(x_i W_h^{Q})(x_j W_h^{K})^{\top}}{\sqrt{d_k}}$$

where the parameter matrices $W_h^{Q} \in \mathbb{R}^{d_{model} \times d_k}$ and $W_h^{K} \in \mathbb{R}^{d_{model} \times d_k}$ are the linear transformation matrices of the query vector and the key vector.

To explicitly provide edge information, additional prior knowledge called the diffusion prior is introduced. It comes from the diffusion process on a graph and distinguishes inflow from outflow:

$$P_{h,i,j} = \sum_{k=1}^{K} \theta_{h,k}\,\big(\tilde{A}^{k}\big)_{i,j}, \qquad \tilde{A} = \begin{cases} D_O^{-1} A & \text{(outflow heads)} \\ D_I^{-1} A^{\top} & \text{(inflow heads)} \end{cases}$$

where $K$ is the number of truncation steps of the diffusion process, $A$ is the adjacency matrix while $D_O^{-1}A$ and $D_I^{-1}A^{\top}$ are the transition matrices, and $\theta_{h,k}$ is the weight of the diffusion process at step $k$ in the $h$-th attention head.
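As an illustration, a truncated diffusion prior of this kind could be computed roughly as follows; the exact form in the paper (including how the weighted powers are combined with the energy logits) is not reproduced here, so treat this as an assumed sketch:

```python
import torch

def diffusion_prior(A, theta, inflow=False):
    """Sketch (assumed form) of a truncated diffusion prior: a weighted sum of
    powers of the transition matrix, D_O^{-1} A for outflow heads or
    D_I^{-1} A^T for inflow heads; theta holds one weight per diffusion step."""
    A = A.T if inflow else A
    trans = A / A.sum(dim=1, keepdim=True).clamp(min=1e-8)  # row-normalized transition
    prior = torch.zeros_like(A)
    power = torch.eye(A.shape[0])
    for w in theta:                                          # K truncation steps
        power = power @ trans
        prior = prior + w * power
    return prior                                             # (N, N) prior weights

# Example: a 3-step prior for an outflow head with hand-picked weights.
A = torch.rand(5, 5)
P_out = diffusion_prior(A, theta=torch.tensor([0.5, 0.3, 0.2]))
```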

The sentinel energy logit excludes the prior knowledge and uses a sentinel key vector instead; it is defined as follows:

$$s_{h,i} = \frac{(x_i W_h^{Q})(x_i W_h^{Sk})^{\top}}{\sqrt{d_k}}$$

where $W_h^{Sk} \in \mathbb{R}^{d_{model} \times d_k}$ is a linear transformation matrix of the sentinel key vector.

Consequently, if $s_{h,i}$ is higher than the energy logits $e_{h,i,j}$, the model assigns less attention to the nodes in $\mathcal{N}_i$.

After computing the output features of each attention head, they are concatenated and projected as

$$\hat{x}_i = \mathrm{Concat}\big(\hat{x}_{1,i}, \dots, \hat{x}_{H,i}\big)\, W^{O}$$

where $W^{O} \in \mathbb{R}^{d_{model} \times d_{model}}$ is the projection layer. The projection layer helps the model combine various aspects of the spatial correlation features and the outputs of the inflow and outflow attention heads.
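To make the sentinel mechanism concrete, here is a single-head sketch in PyTorch; the function name, the dense neighbor mask, and the way the optional prior is simply added to the energy logits are my assumptions, not the paper's exact formulation:

```python
import torch

def sentinel_spatial_attention(x, Wq, Wk, Wv, Wsk, Wsv, neighbor_mask, prior=None):
    """Single-head sketch of spatial attention with a sentinel. The sentinel
    key/value are linear transforms of the query node itself, so attention
    mass can fall back to the node when no neighbor is informative.
    `prior` (N, N) is an optional bias on the energy logits (assumed usage)."""
    d_k = Wq.shape[1]
    q, k, v = x @ Wq, x @ Wk, x @ Wv                        # (N, d_k) each
    e = (q @ k.T) / d_k ** 0.5                              # energy logits e_{i,j}
    if prior is not None:
        e = e + prior
    e = e.masked_fill(~neighbor_mask, float("-inf"))        # keep only neighbors
    s = (q * (x @ Wsk)).sum(-1, keepdim=True) / d_k ** 0.5  # sentinel logit s_i
    alpha = torch.softmax(torch.cat([e, s], dim=-1), dim=-1)
    attn, beta = alpha[:, :-1], alpha[:, -1:]               # neighbor vs sentinel mass
    return attn @ v + beta * (x @ Wsv)                      # output feature per node

# Example wiring (random weights, just for shape checking):
N, d_model, d_k = 5, 16, 8
x = torch.randn(N, d_model)
W = [torch.randn(d_model, d_k) for _ in range(5)]
mask = torch.rand(N, N) > 0.5
out = sentinel_spatial_attention(x, *W, neighbor_mask=mask)
```

In a real model these weight matrices would be learnable parameters, and inflow and outflow heads would differ only in which neighbor mask and transition matrix they use.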

Temporal Attention

  1. Temporal attention does not use the sentinel vectors or the diffusion prior.
  2. Temporal attention attends to important time steps of each node, while spatial attention attends to important nodes at each time step.

Temporal attention uses the multi-head attention mechanism proposed in the Transformer.
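In code, temporal attention amounts to running standard multi-head attention along the time axis of each node's own sequence, for example:

```python
import torch
import torch.nn as nn

# Temporal attention sketch: per node, attend over the T time steps of its own
# sequence (no sentinel, no diffusion prior), using standard multi-head attention.
T, N, d_model = 12, 207, 64
x = torch.randn(T, N, d_model)
mha = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
per_node = x.transpose(0, 1)           # (N, T, d_model): each node is a "batch" item
out, attn_w = mha(per_node, per_node, per_node)
out = out.transpose(0, 1)              # back to (T, N, d_model)
```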

Decoder Architecture

The decoder is designed like the decoder in the Transformer. It consists of the embedding layer and four other sub-layers: the spatial attention layer, two temporal attention layers (a masked attention layer and an encoder-decoder attention layer), and the feed-forward neural networks. After each sub-layer, layer normalization is applied. The masked attention layer masks future time-step inputs to restrict attention to present and past information. The encoder-decoder attention layer extracts features by using the encoded vectors from both the encoder and the masked attention layer: the encoded vectors from the encoder are used as both the key and the value, while the query features of each node are passed on from the masked self-attention layer.
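A rough sketch of the decoder-side attention flow, assuming PyTorch's built-in multi-head attention; sizes and variable names are hypothetical:

```python
import torch
import torch.nn as nn

# Decoder-side attention sketch: a causal mask restricts each decoding step to
# present/past positions, and encoder-decoder attention takes its key/value
# from the encoder output while the query comes from the masked self-attention.
T_dec, T_enc, N, d_model = 12, 12, 207, 64
dec_in = torch.randn(N, T_dec, d_model)          # per-node decoder sequence
enc_out = torch.randn(N, T_enc, d_model)         # per-node encoder memory

causal = torch.triu(torch.ones(T_dec, T_dec, dtype=torch.bool), diagonal=1)
self_attn = nn.MultiheadAttention(d_model, 4, batch_first=True)
cross_attn = nn.MultiheadAttention(d_model, 4, batch_first=True)

q, _ = self_attn(dec_in, dec_in, dec_in, attn_mask=causal)   # masked self-attention
out, _ = cross_attn(q, enc_out, enc_out)                     # K, V from the encoder
```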

Experiment

The model is evaluated on the METR-LA and PEMS-BAY datasets.

(Tables/figures: experimental results on METR-LA and PEMS-BAY)

The paper


If you like this blog or find it useful, you are welcome to comment on it. You are also welcome to share this blog so that more people can benefit from it. If the images used in the blog infringe your copyright, please contact the author to delete them. Thank you!