AutoCTS+: Joint Neural Architecture and Hyperparameter Search for Correlated Time Series Forecasting

An introduction to AutoCTS+

Posted by Ccloud on 2023-02-16
Estimated Reading Time 12 Minutes
Words 1.9k In Total

This blog introduces the paper AutoCTS+.

Introduction

  • Leveraging the powerful feature extraction capabilities of deep learning models, expert-designed ST-blocks have been proposed to capture spatio-temporal dependencies to enable accurate forecasting.

  • A more recent approach automates the design of effective ST-blocks using a supernet-subnet framework. The goal is to learn the operator-associated weights, upon which an optimal subnet is obtained by picking the operator with the highest weight between every two nodes.

  • The existing automated CTS forecasting methods above still suffer from three main limitations:

    • Lack of support for joint search for architectures and hyperparameters. Existing methods rely on predefined hyperparameters, including architectural hyperparameters (e.g., the number of latent representations, i.e., nodes, in an ST-block and the size of a latent representation) and training hyperparameters (e.g., the dropout rate). They are thus only semi-automated.
    • Poor scalability. The entire supernet must reside in memory during training, which may cause memory overflow in large-scale CTS settings. A supernet consumes far more memory than a single subnet, which limits the scalability of neural architecture search.
    • One-time use. Existing automated CTS forecasting methods train a supernet from scratch for each specific dataset, which is costly.
  • SEARCH, a Scalable and Efficient joint ARChitecture and Hyperparameter search framework, is proposed to address the limitations above.

    • Propose a novel search space for correlated time series forecasting to facilitate joint search for architectures and hyperparameter settings.
    • A memory-efficient Architecture-Hyperparameter Comparator (AHC) is proposed to rank arch-hyper candidates. It relies on an easy-to-obtain proxy metric to generate pseudo-labels and is trained in a denoising manner, thereby improving search efficiency.
    • Propose a transfer method that is able to quickly adapt a trained AHC to a new dataset, thus significantly improving the AHC training efficiency on new datasets.
    • Extensive experiments on six benchmark datasets show that SEARCH is able to efficiently find better-performing CTS forecasting models compared to state-of-the-art manual and automatic methods.

(Figure: cmp)

Preliminaries

Problem Settings

The single-step CTS forecasting:

$$\hat{X}_{t+Q} = \mathcal{F}\big(X_{t-P+1}, X_{t-P+2}, \ldots, X_{t}\big)$$

The multi-step CTS forecasting:

$$\big(\hat{X}_{t+1}, \hat{X}_{t+2}, \ldots, \hat{X}_{t+Q}\big) = \mathcal{F}\big(X_{t-P+1}, X_{t-P+2}, \ldots, X_{t}\big)$$

where $X_t \in \mathbb{R}^{N \times F}$ denotes the feature vectors of all $N$ time series at timestamp $t$, $P$ is the number of observed timestamps, and $Q$ is the forecasting horizon. The combined data is $\mathcal{D} = (\mathcal{X}, G)$, where $G = (V, E, A)$: $V$ denotes the sensors, $E$ denotes the correlation relationships between sensors, and $A$ is the adjacency matrix capturing the strengths of the relationships between time series.

The goal is to automatically build an optimal ST-block from a predefined joint architecture-hyperparameter search space that minimizes the forecasting error on a validation dataset $\mathcal{D}_{val}$.
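To make the setting concrete, here is a tiny sketch of the shapes involved, using illustrative sizes (N time series, F features, P observed steps, Q forecast steps) and a placeholder `model`; none of these values come from the paper.

```python
import numpy as np

# Illustrative sizes only: N series, F features, P observed steps, Q forecast steps.
N, F, P, Q = 207, 2, 12, 12
X_window = np.random.randn(P, N, F)   # observed window <X_{t-P+1}, ..., X_t>

def multi_step_forecast(model, window):
    """Multi-step CTS forecasting: map a P-step window to all Q future steps."""
    return model(window)               # expected output shape: (Q, N, F)

def single_step_forecast(model, window):
    """Single-step CTS forecasting: predict only the Q-th future timestamp."""
    return model(window)               # expected output shape: (N, F)
```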

Forecasting Models

The common framework of manually designed neural CTS forecasting models has three components: an input module, an ST-backbone and an output module.

(Figure: CTS_model, the common framework of CTS forecasting models)

The specific types of S/T-operators in ST-blocks and their connections are critical to the success of a CTS forecasting model.

Existing Automated Methods

Existing automated methods (e.g., AutoCTS) consider the design of both ST-blocks and the ST-backbone, and relax the discrete operator choices using the continuous relaxation proposed in DARTS so that the model can be trained by gradient descent.

Joint Search Space

The joint search space considers two aspects of ST-blocks:

  1. the architecture, including operators and their connections
  2. the hyperparameters, including architecture-related structural hyperparameters (e.g., the hidden dimension) and optimization-related training hyperparameters (e.g., the dropout rate).

Architecture Search Space

Candidate operators

Obtain a candidate operator set composed of the following five operators:

The “identity” operator to support skip-connections between nodes.

Two T-operators:

  1. The Gated Dilated Causal Convolution used to capture short-term temporal dependencies.
  2. Informer (INF-T) excels at learning long-term temporal dependencies.

Two S-operators:

  1. The Diffusion Graph Convolution Network is effective at capturing static spatial correlations.
  2. Informer (INF-S) is effective at discovering dynamic spatial correlations.

The set can accommodate additional operators via the following steps:

  1. Include the new operator in the candidate operator set.

  2. Sample some arch-hypers that include the new operator and use them to generate additional clean and noisy samples to retrain the AHC.

  3. The clean and noisy samples collected before can be reused when retraining the AHC, and AHC training itself is quite efficient.

Topological connections

Consider the possible topological connections among the operators within an ST-block.

An ST-block can be represented as a directed acyclic graph (DAG) with a fixed number of nodes, where each node $h_i$ represents a feature representation and each edge $(h_i, h_j)$ represents an operator $o$. The following rules are used to generate candidate ST-blocks (a small code sketch follows the rules):

  1. There is at most one edge from node $h_i$ to node $h_j$, and no edge is allowed from node $h_j$ to node $h_i$, where $i < j$.
  2. The operator on an edge is selected from the chosen candidate set, including the identity operator.
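Below is a minimal sketch of generating and validating candidate ST-block DAGs under these two rules; the operator names and the four-node block size are illustrative assumptions, not the paper's exact configuration.

```python
import itertools
import random

# Illustrative assumptions: operator names and a 4-node block.
CANDIDATE_OPS = ["identity", "gated_dcc", "inf_t", "dgcn", "inf_s"]
NUM_NODES = 4

def random_st_block(num_nodes=NUM_NODES):
    """Rule 2: the operator on each forward edge is drawn from the candidate set."""
    return {(i, j): random.choice(CANDIDATE_OPS)
            for i, j in itertools.combinations(range(num_nodes), 2)}

def is_valid(edges, num_nodes=NUM_NODES):
    """Rule 1: only forward edges i -> j with i < j, at most one per node pair."""
    return all(0 <= i < j < num_nodes for (i, j) in edges)

block = random_st_block()
print(is_valid(block), list(block.items())[:3])
```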

(Figure: ST_block, an ST-block represented as a DAG)

Hyperparameter Search Space

Consider two kinds of hyperparameters:

  1. Structural hyperparameters
  2. Training hyperparameters

(Figure: hyper, the hyperparameter search space)

The framework can easily include additional hyperparameters as well as expanded ranges of values for existing hyperparameters.

Structural hyperparameters relate to the specific structure of an ST-block.

One structural hyperparameter denotes the output mode: either take the last node as the output or take the sum of all nodes as the output.

Training hyperparameters include, e.g., the dropout rate.
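As an illustration only (the names and value ranges below are assumptions, not the paper's exact table), the hyperparameter search space can be thought of as a set of discrete choices:

```python
import random

# Illustrative names and value ranges only; the paper's actual table differs.
HYPER_SEARCH_SPACE = {
    # structural hyperparameters (shape of the ST-block)
    "hidden_dim":   [32, 48, 64],
    "num_nodes":    [4, 5, 6],                      # latent representations per block
    "output_mode":  ["last_node", "sum_of_nodes"],
    # training hyperparameters (optimization-related)
    "dropout":      [0.1, 0.3, 0.5],
    "learning_rate": [1e-3, 3e-3],
}

def sample_hyper(space=HYPER_SEARCH_SPACE):
    """Draw one hyperparameter setting from the hyperparameter search space."""
    return {name: random.choice(values) for name, values in space.items()}

print(sample_hyper())
```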

Encoding of the Joint Search Space

Combine the Architecture and Hyperparameter search spaces to construct a joint search space to support the search for an optimal arch-hyper.

Design the joint search space as a joint dual DAG:

  1. Convert the original DAG of an architecture in the architecture search space into its dual graph where nodes represent operators and edges represent information flow.
  2. Add a new "Hyper" node, representing the hyperparameter setting of the architecture, to the dual DAG and connect it to all other nodes, so that a single DAG can represent a complete ST-block containing both the candidate architecture and the hyperparameters.

(Figure: arch_hyper, an arch-hyper represented as a joint dual DAG)

Use an adjacency matrix $A$ and a feature matrix $Z$ to encode an arch-hyper graph.

$A$ reflects the topology information of the dual DAG, where the binary value of an entry indicates whether there is information flow between two nodes. Self-connections are also added to all nodes.

$Z$ contains the operator information of each node.

For the "Hyper" node, the original feature is an $m$-dimensional vector from the hyperparameter search space. Min-max normalization is applied to the original feature of the "Hyper" node, and the normalized feature is then converted into a $d$-dimensional embedding:

$$z_{\mathrm{hyper}} = h_{\mathrm{hyper}} W_1$$

where $h_{\mathrm{hyper}} \in \mathbb{R}^{m}$ is the normalized feature vector of the "Hyper" node, $W_1 \in \mathbb{R}^{m \times d}$ is a learnable matrix, and $z_{\mathrm{hyper}} \in \mathbb{R}^{d}$ is the embedding of the "Hyper" node.

For the other (operator) nodes, first embed each operator with a one-hot encoding and then introduce a learnable matrix that converts the one-hot encodings of all operator nodes into an embedding matrix:

$$Z_{\mathrm{op}} = H_{\mathrm{op}} W_2$$

where $H_{\mathrm{op}}$ and $Z_{\mathrm{op}}$ are the one-hot encodings and the transformed embedding matrix of the operator nodes, $W_2 \in \mathbb{R}^{K \times d}$ is the learnable matrix, and $K$ is the number of candidate operator types in the architecture search space.

This way, each arch-hyper in the joint search space can be encoded as an adjacency matrix $A$ and a feature matrix $Z$.

$W_1$ and $W_2$ are learned together with the model parameters of the AHC.
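A minimal PyTorch-style sketch of this encoding step; the dimensions (m hyperparameters, K operator types, embedding size d), the class name, and the use of bias-free linear layers for $W_1$ and $W_2$ are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ArchHyperEncoder(nn.Module):
    """Sketch of building the feature matrix Z: one row per operator node plus
    one row for the 'Hyper' node (dimensions m, K, d are illustrative)."""

    def __init__(self, m_hyper=7, num_op_types=5, d=64):
        super().__init__()
        self.W1 = nn.Linear(m_hyper, d, bias=False)       # embeds the Hyper node
        self.W2 = nn.Linear(num_op_types, d, bias=False)  # embeds operator one-hots

    def forward(self, hyper_feat, op_one_hot):
        # hyper_feat: (m,) min-max normalized hyperparameter vector
        # op_one_hot: (n_ops, K) one-hot operator encodings
        z_hyper = self.W1(hyper_feat).unsqueeze(0)        # (1, d)
        z_ops = self.W2(op_one_hot)                       # (n_ops, d)
        return torch.cat([z_ops, z_hyper], dim=0)         # feature matrix Z

encoder = ArchHyperEncoder()
ops = torch.eye(5)[torch.randint(0, 5, (6,))]             # 6 operator nodes, 5 op types
Z = encoder(torch.rand(7), ops)
print(Z.shape)                                            # torch.Size([7, 64])
```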

Architecture-Hyperparameter Comparator

Propose an Architecture-Hyperparameter Comparator (AHC) to compare and rank arch-hypers, so that one can avoid fully training and evaluating all candidates.

Many existing studies propose an accuracy estimator (a neural network) trained on a large number of $(ah, y)$ pairs, where $ah$ is an arch-hyper and $y$ is the validation accuracy of the corresponding fully trained model. Collecting such labels is very time-consuming. Instead of comparing absolute accuracies, the relative comparison result of two arch-hypers is sufficient to obtain their ranking.

Use a comparator to predict the relative accuracy relation of two candidate arch-hypers.

Given $n$ measured $(ah, y)$ pairs, training samples for the comparator are built in the form $(ah_i, ah_j, y_{ij})$ by pairing every two of the $n$ pairs, where $y_{ij}$ is a binary value indicating which arch-hyper has the higher accuracy; this yields on the order of $n^2$ samples and thus alleviates the need for a large number of measured arch-hypers.

A well-trained AHC can then rank all arch-hypers from the joint search space.
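The pairing idea is easy to express in code. Below is a small sketch (the names are illustrative) showing how n evaluated arch-hypers already yield roughly n squared over two comparator training samples.

```python
from itertools import combinations

def make_comparison_samples(evaluated):
    """evaluated: list of (arch_hyper, accuracy) pairs. Returns pairwise samples
    (ah_i, ah_j, label) with label = 1 if ah_i is the more accurate one."""
    return [(ah_i, ah_j, 1 if acc_i > acc_j else 0)
            for (ah_i, acc_i), (ah_j, acc_j) in combinations(evaluated, 2)]

# e.g., 100 evaluated arch-hypers already yield 100 * 99 / 2 = 4950 comparator samples
demo = [(f"ah_{k}", 0.80 + 0.001 * k) for k in range(100)]
print(len(make_comparison_samples(demo)))   # 4950
```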

(Figure: AHC, the architecture-hyperparameter comparator)

Considering the powerful ability of the Graph Isomorphism Network (GIN) to distinguish between graphs, GINs are leveraged to encode an arch-hyper as a compact continuous embedding:

$$Z^{(k)} = \mathrm{MLP}^{(k)}\Big(\big(1 + \epsilon^{(k)}\big) \cdot Z^{(k-1)} + A Z^{(k-1)}\Big), \quad k = 1, \ldots, L$$

where $L$ is the number of GIN layers, $Z^{(0)} = Z$, $\epsilon^{(k)}$ is a trainable bias, and MLP is a multi-layer perceptron.

Use the output of the last GIN layer, $Z^{(L)}$, to represent an arch-hyper.

GIN was proposed in the paper "How Powerful are Graph Neural Networks?"; its discriminative power is closely related to the Weisfeiler-Lehman isomorphism test.
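For concreteness, here is a simplified dense-matrix sketch of the GIN update above; the mean pooling at the end, the layer sizes, and the class name are assumptions rather than the paper's exact AHC implementation.

```python
import torch
import torch.nn as nn

class SimpleGIN(nn.Module):
    """Dense GIN sketch: Z^(k) = MLP^(k)((1 + eps^(k)) * Z^(k-1) + A @ Z^(k-1))."""

    def __init__(self, d=64, num_layers=3):
        super().__init__()
        self.eps = nn.Parameter(torch.zeros(num_layers))  # trainable bias per layer
        self.mlps = nn.ModuleList(
            nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
            for _ in range(num_layers)
        )

    def forward(self, A, Z):
        for k, mlp in enumerate(self.mlps):
            Z = mlp((1 + self.eps[k]) * Z + A @ Z)
        return Z.mean(dim=0)   # simple pooling into one embedding (an assumption here)

gin = SimpleGIN()
A = torch.eye(7)               # adjacency with self-connections (illustrative)
print(gin(A, torch.rand(7, 64)).shape)   # torch.Size([64])
```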

Training AHC with Noisy Proxies

Design a computation-efficient proxy metric so that a large number of noisy training samples can be obtained cheaply.

Proxy Performance Metric

Naively using proxy metrics from the CV domain can create severely mislabeled training samples, which are too noisy to yield a reliable AHC.

Propose a computation-efficient proxy metric specifically for CTS forecasting that performs better than metrics from the CV domain (e.g., the number of parameters (nparam), Synflow, Snip, and the NTK score).

During experiments, the author notices that if the validation accuracy of an arch-hyper is higher than another in the first few epochs, then its final validation accuracy is very likely to be higher as well.

So the authors propose to train an arch-hyper for only $e$ epochs and use the validation accuracy of the under-trained model as a proxy.

The early-validation proxy:

$$\tilde{y}_{ah} = \mathrm{acc}_{\mathrm{val}}\!\left(f_{ah}^{(e)}\right)$$

where $f_{ah}^{(e)}$ is the CTS forecasting model under arch-hyper setting $ah$ after only $e$ epochs of training.
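A sketch of this early-validation proxy, where `build_model`, `train_one_epoch`, and `evaluate` stand in for one's own CTS forecasting pipeline and `e` is the small number of training epochs.

```python
def early_validation_proxy(build_model, arch_hyper, train_loader, val_loader,
                           train_one_epoch, evaluate, e=3):
    """Noisy proxy score: validation accuracy of a model trained for only e epochs.
    build_model, train_one_epoch, and evaluate are placeholders for the user's
    own CTS forecasting pipeline; e is the (small) number of training epochs."""
    model = build_model(arch_hyper)
    for _ in range(e):
        train_one_epoch(model, train_loader)
    return evaluate(model, val_loader)   # e.g., negative validation MAE
```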

(Figure: exp1)

(Figure: proxy_metric)

Training AHC with denoising algorithms

To minimize the impact of training with noisy samples, the authors fully train a few additional arch-hypers to obtain clean training samples of the form $(ah_i, ah_j, y_{ij})$ and then train the AHC with both noisy and clean samples in a denoising manner.

A commonly used denoising training method, fine-tuning, is adopted:

  1. Randomly sample a large set of arch-hypers from the joint search space and use the proposed proxy metric to obtain a proxy score for each arch-hyper.
  2. Pair up these arch-hypers to produce noisy samples of the form $(ah_i, ah_j, \tilde{y}_{ij})$.
  3. Randomly sample a much smaller set of arch-hypers and train them to convergence to obtain the true validation accuracy of each arch-hyper.
  4. Also pair up these arch-hypers to produce clean samples of the form $(ah_i, ah_j, y_{ij})$.

The processes of collecting noisy and clean samples are independent and can be highly parallelized. After collecting the samples, the AHC is first warmed up for a number of epochs using the noisy samples and then fine-tuned with the clean samples until convergence.
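A minimal sketch of this warm-up-then-fine-tune schedule, assuming the AHC exposes a callable `ahc(ah_i, ah_j)` that returns the logit that the first arch-hyper outperforms the second; the loss choice and epoch counts below are illustrative.

```python
import torch
import torch.nn as nn

def train_ahc(ahc, noisy_pairs, clean_pairs,
              warmup_epochs=10, finetune_epochs=50, lr=1e-3):
    """Warm up on cheap, noisy (proxy-labeled) pairs, then fine-tune on a few clean
    pairs. ahc(ah_i, ah_j) is assumed to return the logit that ah_i beats ah_j."""
    optimizer = torch.optim.Adam(ahc.parameters(), lr=lr)
    bce = nn.BCEWithLogitsLoss()

    def run(pairs, epochs):
        for _ in range(epochs):
            for ah_i, ah_j, label in pairs:
                loss = bce(ahc(ah_i, ah_j), torch.tensor(float(label)))
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

    run(noisy_pairs, warmup_epochs)    # warm-up phase on noisy samples
    run(clean_pairs, finetune_epochs)  # fine-tuning phase on clean samples
```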

Search Strategy and AHC Transfer

Search Strategy

Since the joint search space is enormous, it is inefficient to compare all candidate arch-hypers with the AHC to find the optimal one.

The issue is addressed with the following steps:

  1. Remove the arch-hypers that lack either spatial or temporal operators, so that only candidates containing both are kept.

  2. Use a heuristic approach, e.g., an evolutionary algorithm, to find the best arch-hyper in the shrunk joint search space.

  3. First sample a set of arch-hypers and pair them up to produce comparison pairs of the form $(ah_i, ah_j)$; a descending ranking of the arch-hypers is then easily obtained from the comparative results given by the trained AHC.

  4. Select the top arch-hypers from this ranking as the initial population. When generating new offspring in each evolution step, each arch-hyper undergoes crossover and mutation with probabilities $p_c$ and $p_m$, respectively. The offspring are added to the population, and the learned AHC is used to compare arch-hypers in the population and remove inferior ones, keeping the population size constant.

  5. Choose the top-$k$ arch-hypers from the final population, collect their exact forecasting accuracy, and pick the one with the highest accuracy as the final searched ST-block. (A rough code sketch of this procedure follows the list.)
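For illustration, the following rough sketch shows one way such an AHC-guided evolutionary loop could look; `sample_arch_hyper`, `mutate`, and `crossover` are placeholders for one's own search space, and the ranking-by-pairwise-wins heuristic is a simplification of the procedure described above, not the paper's exact algorithm.

```python
import random

def ahc_guided_evolution(ahc_better, sample_arch_hyper, mutate, crossover,
                         pop_size=50, iterations=100, p_c=0.5, p_m=0.5, top_k=5):
    """Rough sketch of AHC-guided evolutionary search. ahc_better(a, b) should
    return True when the trained AHC ranks arch-hyper a above b; sample_arch_hyper,
    mutate, and crossover are placeholders for the user's own search space."""

    def wins(candidate, pool):
        # Rank candidates by how many pairwise AHC comparisons they win.
        return sum(ahc_better(candidate, other) for other in pool if other is not candidate)

    population = [sample_arch_hyper() for _ in range(pop_size)]
    for _ in range(iterations):
        offspring = []
        for parent in population:
            if random.random() < p_c:
                offspring.append(crossover(parent, random.choice(population)))
            if random.random() < p_m:
                offspring.append(mutate(parent))
        pool = population + offspring
        scored = sorted(((wins(ah, pool), idx, ah) for idx, ah in enumerate(pool)),
                        key=lambda t: (t[0], t[1]), reverse=True)
        population = [ah for _, _, ah in scored[:pop_size]]   # keep the population size fixed

    # The top-k survivors would then be trained fully; the most accurate one is kept.
    return population[:top_k]
```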

The paper's evolutionary search algorithm is shown in the search_algorithm figure below.

Transfer a well-trained AHC

Given an AHC that has been well trained on a source dataset (using a set of noisy samples and clean samples), a well-performing AHC for a new dataset can be obtained with far fewer arch-hyper samples: the trained AHC is fine-tuned on a much smaller number of noisy and clean samples collected on the new dataset, which significantly reduces the AHC training cost on new datasets.
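A small sketch of this transfer step, reusing a training routine like the warm-up/fine-tune loop sketched earlier; `train_fn` and the copying strategy are placeholders, not the paper's exact implementation.

```python
import copy

def transfer_ahc(source_ahc, target_noisy_pairs, target_clean_pairs, train_fn):
    """Sketch of AHC transfer: start from the weights learned on the source dataset
    and fine-tune on far fewer target-dataset samples. train_fn is a placeholder
    for a routine like the warm-up/fine-tune loop sketched earlier."""
    target_ahc = copy.deepcopy(source_ahc)   # keep the source AHC intact
    train_fn(target_ahc, target_noisy_pairs, target_clean_pairs)
    return target_ahc
```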

(Figure: search_algorithm, pseudocode of the search procedure)

Experiments

(Figures: exp2 to exp6, experimental results)

The paper


If you like this blog or find it useful, you are welcome to comment on it. You are also welcome to share this blog so that more people can join the discussion. If the images used in the blog infringe your copyright, please contact the author to delete them. Thank you!