This blog introduces the paper: AutoST: Efficient Neural Architecture Search for Spatio-Temporal Prediction.
Introduction
How to find the optimal neural architecture for the various scenarios in cities is still an unsettled problem:
- The spatio-temporal correlation differs from location to location.
- The spatio-temporal correlation is complex, including spatial dependency between regions and temporal correlation among timestamps.
- The spatio-temporal correlation is heterogeneous across different tasks.
Two important aspects are neglected in existing approaches.
First, different areas may prefer different spatial ranges, while the size of the convolution kernel, which models the range of neighborhood dependency, is usually fixed and empirically set.
To verify this, the optimal architecture found by neural architecture search (NAS) is selected and the search process is analyzed. Defining a convolution score for each architecture at every iteration (the weighted kernel sizes summed over all layers), Beijing achieves a slightly higher convolution score than Guiyang.
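A rough sketch of how such a convolution score could be computed (the per-layer softmax normalization is an assumption; the blog only says weighted kernel sizes are summed over all layers):

```python
import torch

def convolution_score(arch_weights, kernel_sizes=(3, 5, 3, 5)):
    """Sum the architecture-weighted kernel sizes over all layers.

    arch_weights: one tensor per layer, holding the architecture weights
    of the convolution candidates in that layer.
    kernel_sizes: kernel size of each candidate, here ordered as
    (Std_conv_3, Std_conv_5, Sep_conv_3, Sep_conv_5).
    """
    sizes = torch.tensor(kernel_sizes, dtype=torch.float32)
    score = 0.0
    for alpha in arch_weights:
        probs = torch.softmax(alpha, dim=0)  # normalize weights per layer
        score += float((probs * sizes).sum())
    return score
```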
Second, current approaches usually use residual networks to aggregate the features of adjacent layers, failing to fuse low- and high-level features. The more skip connections survive at the search stage, the more important the low-level features are. It turns out that different cities have distinct architecture preferences, and core cities usually pay more attention to the global spatial dependency (shown in Fig1(c)).
The authors propose a general network called AutoST, shown in Fig2.
The traditional process of ST prediction consists of three components:
- an initial ST feature extractor, which constructs ST features from raw ST data
- a feature learning network
- an external fusion and prediction module, which fuses the external factors with flow information and then predicts urban flows in the future
The contributions can be summarized in three aspects:
- Propose a novel model named AutoST for spatio-temporal prediction, which introduces the neural architecture search technique to dynamically capture various-range spatial correlations and to fuse multi-level features. In addition, AutoST is oriented to ST data rather than to specific application scenarios, so it can be easily applied to a series of deep models.
- Design an efficient and effective search space including two basic modules: i) a mix convolution block at each layer to capture various-range spatial correlations; ii) a mix skip connection block among layers to dynamically fuse multi-level features. Specifically, the mix convolution block consists of multiple convolution kernels, where a larger kernel size indicates longer-range correlations. Besides, the mix skip connection block includes a connect cell and a do-not-connect cell to dynamically learn to fuse low- and high-level features.
- Perform extensive experiments on four real-world spatio-temporal datasets ranging from taxi flow to crowd flow. The experimental results demonstrate that AutoST can significantly improve spatio-temporal prediction, and the proposed spatio-temporal NAS method is more efficient than existing NAS methods.
Preliminaries
DEFINITION 1. Spatial-Temporal Prediction: We partition a city into an $I \times J$ grid map based on longitude and latitude, where each grid denotes a region. There are many types of measures in a region for different ST applications, such as crowd flows, bike rent and return, and taxi pick-ups and drop-offs. The urban information at time $t$ can then be denoted as a tensor $X_t \in \mathbb{R}^{I \times J \times C}$, where $C$ is the number of measurement types.
DEFINITION 2. External Features: We denote the external factors (e.g., weather and holiday events) at time $t$ as $E_t$.
Problem Statement: Given the historically observed citywide urban flows $\{X_t \mid t = 0, \dots, n-1\}$ and external features, predict the flow $X_n$ for all locations at the next timestamp.
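For concreteness, a minimal sketch of the tensor shapes involved (variable names and sizes are illustrative, not from the paper):

```python
import numpy as np

I, J = 32, 32   # grid resolution: the city partitioned into I x J regions
C = 2           # measurement types per region, e.g., inflow and outflow
T = 48          # number of observed timestamps

# Historical citywide observations: one C-channel "image" per timestamp.
X = np.random.rand(T, C, I, J).astype(np.float32)

# External features per timestamp (weather, holiday flags, ...).
E = np.random.rand(T, 8).astype(np.float32)

# Goal: predict the flow map for all I*J regions at the next timestamp.
target_shape = (C, I, J)
```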
Methodology
Spatio-Temporal Search Space
Distinct from a neural network with a fixed architecture, a NAS network is composed of three modules, from simple to complex:
- a candidate cell module, which defines the search unit.
- an operation block module, which performs a weighted sum over all possible operations to make the search space continuous.
- a NAS network module, which consists of a series of mix operations.
Candidate Cells
The city is divided into a grid map where each grid denotes a region, and these grids make up an image. The candidate cells are selected based on the following three considerations:
- Since nearby regions may affect each other in ST prediction, the convolution operation is important. Convolutions with different kernel sizes model the spatial dependency at different ranges.
- Local correlation captured by low-level convolutions and global correlation encoded by high-level convolutions are both important for city-wide flow forecasting. Therefore, skip connections, which fuse multi-level correlations, are necessary.
- Distinct from common CNNs in the image domain, the ST prediction task does not need the pooling operation, because pooling would probably cause information loss.
Cells:
- Convolution operations: standard convolution with kernel size 3 (Std_conv_3), standard convolution with kernel size 5 (Std_conv_5), separable convolution with kernel size 3 (Sep_conv_3), and separable convolution with kernel size 5 (Sep_conv_5).
- Skip connection operations: a don't-connect (none) operation and a connect (identity) operation.
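A minimal PyTorch sketch of these six candidate cells (implementing separable convolution as depthwise + pointwise is an assumption about details the summary does not specify):

```python
import torch.nn as nn

class Zero(nn.Module):
    """The 'none' cell: don't connect (outputs zeros of the same shape)."""
    def forward(self, x):
        return x * 0.0

def sep_conv(channels, k):
    # Separable convolution: depthwise conv followed by a 1x1 pointwise conv.
    return nn.Sequential(
        nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels),
        nn.Conv2d(channels, channels, 1),
    )

def candidate_cells(channels):
    """The six candidate cells: four convolutions plus the two skip choices."""
    return nn.ModuleDict({
        "std_conv_3": nn.Conv2d(channels, channels, 3, padding=1),
        "std_conv_5": nn.Conv2d(channels, channels, 5, padding=2),
        "sep_conv_3": sep_conv(channels, 3),
        "sep_conv_5": sep_conv(channels, 5),
        "identity":   nn.Identity(),  # connect
        "none":       Zero(),         # don't connect
    })
```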
Operation Blocks
Gradient-based search strategies usually calculate a weighted sum over the outputs of all basic operations to avoid a discrete choice of the basic operation unit.
In DARTS, the authors simply add the softmax-weighted outputs of the eight basic cells as the final output of one operation block, which can be formulated as follows:
$$\bar{o}(x) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o)}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'})}\, o(x),$$
where $\mathcal{O}$ is the set of candidate operations and $\alpha_o$ is the architecture weight of operation $o$.
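In code, this softmax-weighted mixture looks roughly like the following sketch (not the authors' implementation):

```python
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    """DARTS-style mixed operation: softmax-weighted sum over all candidates."""
    def __init__(self, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)
        self.alpha = nn.Parameter(torch.zeros(len(ops)))  # architecture parameters

    def forward(self, x):
        weights = torch.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))
```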
However, calculating all basic convolution operations at each layer may cause large memory consumption.
To solve this memory inefficiency, the basic cells are grouped into two categories, defining two types of mix operations:
- a mix convolution block, represented as a blue arrow, which calculates the weighted sum of all convolution outputs;
- a mix connection block, represented as a green arrow, which multiplies the output of each layer by its connection probability.
The architecture parameters weight the candidate operations and are learned jointly with the network weights. The number of skip connections increases as more epochs are iterated, which is caused by the inherent unfair competition among operations. Therefore, the authors relax the choice of operations to be independent, so that each operation has an equal opportunity to develop its strength: they apply a sigmoid activation instead of softmax to generate the architecture weights, whose values then lie in (0, 1) independently for each operation.
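Putting the two grouped blocks together with sigmoid-activated architecture weights, a sketch might look like this (the exact parameterization is an assumption):

```python
import torch
import torch.nn as nn

def _sep_conv(c, k):
    # Separable convolution: depthwise followed by pointwise.
    return nn.Sequential(
        nn.Conv2d(c, c, k, padding=k // 2, groups=c),
        nn.Conv2d(c, c, 1),
    )

class MixConv(nn.Module):
    """Mix convolution block (blue arrow): independently weighted sum
    of the four convolution candidates."""
    def __init__(self, c):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv2d(c, c, 3, padding=1),   # Std_conv_3
            nn.Conv2d(c, c, 5, padding=2),   # Std_conv_5
            _sep_conv(c, 3),                 # Sep_conv_3
            _sep_conv(c, 5),                 # Sep_conv_5
        ])
        self.alpha = nn.Parameter(torch.zeros(4))

    def forward(self, x):
        # Sigmoid instead of softmax: weights are independent, so operations
        # do not compete unfairly (which would inflate skip connections).
        w = torch.sigmoid(self.alpha)
        return sum(wi * conv(x) for wi, conv in zip(w, self.convs))

class MixConnection(nn.Module):
    """Mix skip connection block (green arrow): scales a lower layer's
    output by its learned connection probability (identity vs. none)."""
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return torch.sigmoid(self.beta) * x
```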
NAS Network
DARTS divides the network into two parts: the inner network, whose architecture is learned by NAS, and the outer network, which has a fixed architecture.
However, the fixed architecture of the outer network may limit performance. To make the network more efficient, the search space is constrained according to the characteristics of ST prediction in the following two aspects:
- There is only one convolution between adjacent layers, to capture large-range spatial dependency.
- There is no extra feature transformation when fusing multi-level features, so the outputs of previous layers are simply added to the current layer.
The output of layer $l$ can then be written as
$$x_l = \text{MixConv}\Big(x_{l-1} + \sum_{k=0}^{l-2} \sigma(\beta_{k,l})\, x_k\Big),$$
where $x_k$ is the output of layer $k$, $\text{MixConv}(\cdot)$ is the mix convolution block, and $\sigma(\beta_{k,l})$ is the learned connection probability from layer $k$ to layer $l$.
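Building on the MixConv and MixConnection sketches above, the constrained network could be assembled roughly as follows (layer count and indexing are illustrative):

```python
import torch.nn as nn

class AutoSTBackbone(nn.Module):
    """Stack of mix convolution blocks where each layer may additionally
    fuse, by simple addition, the output of every earlier layer, gated by
    a learned connection probability. Reuses MixConv / MixConnection from
    the sketch above."""
    def __init__(self, channels, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([MixConv(channels) for _ in range(num_layers)])
        # One connection gate per (earlier output -> later layer) pair.
        self.gates = nn.ModuleDict({
            f"{k}->{l}": MixConnection()
            for l in range(num_layers) for k in range(l)
        })

    def forward(self, x):
        outputs = [x]  # outputs[k] is the output feeding layer k
        for l, layer in enumerate(self.layers):
            h = outputs[-1]  # direct input from the previous layer
            for k in range(l):  # gated skips from all earlier outputs
                h = h + self.gates[f"{k}->{l}"](outputs[k])
            outputs.append(layer(h))
        return outputs[-1]
```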
The proposed network admits the following three advantages:
- it is more effective than a fixed architecture, since existing networks are a subset of AutoST
- it releases experts from hand-refining networks
- it is more efficient than recent neural architecture search methods, because the search space is carefully designed and constrained for the ST prediction task
AutoST for Spatio-Temporal Prediction
We evaluate AutoST on three popular spatio-temporal prediction models (STResNet, ST-3DNet and DeepSTN) to verify its efficiency and generality.
We follow the same CPT (closeness, period and trend) paradigm and take DeepSTN as an example to explain AutoST for spatio-temporal prediction.
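For reference, a sketch of how CPT inputs are typically assembled in this line of work (the offsets below are common defaults, not necessarily the paper's settings):

```python
import numpy as np

def build_cpt_inputs(X, t, lc=3, lp=1, lt=1, period=48, trend=48 * 7):
    """Gather closeness/period/trend frames for predicting timestamp t.

    X: array of shape (T, C, I, J). period=48 assumes half-hour intervals
    (one day); trend spans one week. t must be large enough that all
    referenced frames exist.
    """
    closeness = X[t - lc:t]                                    # most recent frames
    period_f  = np.stack([X[t - i * period] for i in range(1, lp + 1)])
    trend_f   = np.stack([X[t - i * trend] for i in range(1, lt + 1)])
    return closeness, period_f, trend_f
```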
Algorithm and Optimization
Experiments
The paper reports extensive experiments on four real-world spatio-temporal datasets, where AutoST consistently improves prediction accuracy and searches more efficiently than existing NAS methods.
If you like this blog or find it useful, you are welcome to comment on it. You are also welcome to share this blog, so that more people can participate. If any images used in the blog infringe your copyright, please contact the author to delete them. Thank you!