This blog introduces the paper AutoCTS+ (Joint Neural Architecture and Hyperparameter Search for Correlated Time Series Forecasting).
Introduction
Leveraging the powerful feature extraction capabilities of deep learning, researchers have proposed expert-designed ST-blocks that capture spatio-temporal dependencies and enable accurate forecasting.
A more recent approach is to automate the design of effective ST-blocks using a supernet-subnet formulation: the operator-associated weights of the supernet are learned, and an optimal subnet is then obtained by picking the operator with the highest weight between every two nodes. However, the existing automated CTS forecasting methods still suffer from three main limitations:
- Lack of support for joint search for architectures and hyperparameters. They rely on predefined hyperparameters, including architectural hyperparameters (e.g., the number of latent representations, i.e., nodes, in an ST-block and the size of a latent representation) and training hyperparameters (e.g., the dropout rate), and are thus only semi-automated.
- Poor scalability. The entire supernet must reside in memory during training, and since a supernet consumes far more memory than a single subnet, this may cause memory overflow in large-scale CTS settings and limits the scalability of neural architecture search.
- One-time use. Existing automated CTS forecasting methods train a supernet for each specific dataset from scratch, which is costly.
To address these limitations, the paper proposes SEARCH, a Scalable and Efficient joint ARChitecture and Hyperparameter search framework. Its contributions are:
- Propose a novel search space for correlated time series forecasting to facilitate joint search for architectures and hyperparameter settings.
- Propose a memory-efficient Architecture-Hyperparameter Comparator (AHC) to rank arch-hyper candidates, together with an easy-to-obtain proxy metric that generates pseudo-labels for training the AHC in a denoising manner, thereby improving search efficiency.
- Propose a transfer method that is able to quickly adapt a trained AHC to a new dataset, thus significantly improving the AHC training efficiency on new datasets.
- Extensive experiments on six benchmark datasets show that SEARCH is able to efficiently find better-performing CTS forecasting models compared to state-of-the-art manual and automatic methods.
Preliminaries
Problem Settings
Single-step CTS forecasting: given the past $P$ observations of $N$ correlated time series, predict a single future observation at horizon $h$:
$$\hat{x}_{t+h} = \mathcal{F}(x_{t-P+1}, \dots, x_t)$$
Multi-step CTS forecasting: predict the next $Q$ future observations:
$$\langle \hat{x}_{t+1}, \dots, \hat{x}_{t+Q} \rangle = \mathcal{F}(x_{t-P+1}, \dots, x_t)$$
where $x_i$ denotes the observations of the $N$ time series at timestamp $i$ and $\mathcal{F}$ is the forecasting model.
The goal is to automatically build an optimal ST-block so that the forecasting model instantiated with it achieves the best accuracy.
Forecasting Models
The common framework of manually designed neural CTS forecasting models has three components: an input module, an ST-backbone and an output module.
The specific types of S/T-operators in ST-blocks and their connections are critical to the success of a CTS forecasting model.
Existing Automated Methods
Existing automated methods (e.g., AutoCTS) automate the design of ST-blocks and the ST-backbone by adopting the differentiable search method proposed in DARTS, so that the architecture search can be performed by gradient descent.
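To make this concrete, here is a minimal sketch (not the paper's code) of the DARTS-style continuous relaxation: every supernet edge mixes all candidate operators with softmax-normalized architecture weights, so those weights can be learned by gradient descent. The candidate operators and dimensions below are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One supernet edge: a softmax-weighted sum of all candidate operators (DARTS-style)."""
    def __init__(self, candidate_ops):
        super().__init__()
        self.ops = nn.ModuleList(candidate_ops)
        # One architecture weight (alpha) per candidate operator on this edge.
        self.alpha = nn.Parameter(torch.zeros(len(candidate_ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# Illustrative candidate set: identity plus two simple stand-ins for S/T-operators.
dim = 32
edge = MixedOp([nn.Identity(), nn.Linear(dim, dim), nn.Sequential(nn.Linear(dim, dim), nn.ReLU())])
x = torch.randn(8, dim)
out = edge(x)  # alphas receive gradients; the op with the largest alpha is kept in the final subnet
```

After supernet training, discretizing each edge to its highest-weight operator yields the searched subnet; this is exactly the step where the whole supernet must be kept in memory, which motivates the scalability concern above.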
SCALABLE AND EFFICIENT JOINT SEARCH
Joint Search Space
The joint search space considers two aspects of ST-blocks:
- the architecture, including operators and their connections
- the hyperparameters, including architecture-related structural hyperparameters (e.g., the hidden dimension) and optimization-related training hyperparameters (e.g., the dropout rate).
Architecture Search Space
Candidate operators
Obtain a candidate operator set consisting of the following operators. The "identity" operator is included to support skip-connections between nodes.
Two T-operators:
- The Gated Dilated Causal Convolution is used to capture short-term temporal dependencies.
- Informer (INF-T) excels at learning long-term temporal dependencies.
Two S-operators:
- The Diffusion Graph Convolution Network is effective at capturing static spatial correlations.
- Informer (INF-S) is effective at discovering dynamic spatial correlations.
The candidate operator set can easily be extended: when a promising new S/T-operator emerges, include it in the candidate operator set, sample some arch-hypers that contain the new operator, and use them to generate additional clean and noisy samples to retrain the AHC. The clean and noisy samples collected before can be reused when retraining the AHC, and AHC training is quite efficient.
Topological connections
Consider the possible topological connections among the operators within an ST-block.
An ST-block can be represented as a directed acyclic graph (DAG), where nodes denote latent representations and edges denote operators, subject to two constraints:
- There is at most one edge from an earlier node to a later node, and no edge is allowed in the reverse direction.
- The operator on an edge is selected from the chosen candidate operator set, including the identity operator.
Hyperparameter Search Space
Consider two kinds of hyperparameters:
- Structural hyperparameters
- Training hyperparameters
The framework can easily include additional hyperparameters as well as expanded ranges of values for existing hyperparameters.
Structural hyperparameters relate to the specific structure of an ST-block, e.g., the number of nodes in an ST-block and the hidden dimension.
Training hyperparameters control the optimization process and include, e.g., the dropout rate.
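For illustration, the two kinds of hyperparameters could be declared as candidate-value lists like the following; the names and value ranges are made up for this example and are not the paper's exact settings.

```python
import random

# Illustrative joint hyperparameter space (names/values are examples, not the paper's settings).
structural_hyperparams = {
    "num_nodes_per_st_block": [3, 4, 5],   # number of latent representations in an ST-block
    "hidden_dim":             [32, 64, 128],
}
training_hyperparams = {
    "dropout_rate": [0.1, 0.3, 0.5],
}

def sample_hyper(space, rng):
    """Pick one candidate value per hyperparameter to form a concrete setting."""
    return {name: rng.choice(values) for name, values in space.items()}

rng = random.Random(0)
hyper = {**sample_hyper(structural_hyperparams, rng), **sample_hyper(training_hyperparams, rng)}
```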
Encoding of the Joint Search Space
Combine the Architecture and Hyperparameter search spaces to construct a joint search space to support the search for an optimal arch-hyper.
Design the joint search space as a joint dual DAG:
- Convert the original DAG of an architecture in the architecture search space into its dual graph, where nodes represent operators and edges represent information flow.
- Add a new "Hyper" node, representing the hyperparameter setting of the architecture, to the dual DAG and connect it to all other nodes. In this way, a single DAG can represent a complete ST-block containing both the candidate architecture and the hyperparameters.
Use an adjacency matrix to represent the topological connections of the dual DAG and a feature matrix to represent its nodes.
For the "Hyper" node, the original feature is a vector that encodes the chosen value of each hyperparameter.
For the other nodes, the original feature is a one-hot vector indicating the type of the operator.
This way, each arch-hyper in the joint search space can be encoded as an adjacency matrix together with a node feature matrix, which serve as the input to the AHC.
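Below is a minimal sketch of how such an encoding might look in code. The operator vocabulary, the numeric encoding of the hyperparameter setting, and the helper names are illustrative assumptions rather than the paper's exact scheme.

```python
import numpy as np

# Illustrative operator vocabulary for the five candidate operators (names are placeholders).
OPS = ["identity", "gdcc", "inf_t", "dgcn", "inf_s"]

def encode_arch_hyper(edges, node_ops, hyper_vec, num_op_nodes):
    """Encode one arch-hyper as (A, X) for the dual DAG.

    edges        : list of (src, dst) pairs between operator nodes of the dual DAG
    node_ops     : operator name for each operator node
    hyper_vec    : numeric vector describing the chosen hyperparameter setting
    num_op_nodes : number of operator nodes; the "Hyper" node gets the last index
    """
    n = num_op_nodes + 1                      # +1 for the "Hyper" node
    A = np.zeros((n, n), dtype=np.float32)
    for s, d in edges:
        A[s, d] = 1.0
    A[num_op_nodes, :num_op_nodes] = 1.0      # "Hyper" node connects to all operator nodes

    feat_dim = len(OPS) + len(hyper_vec)
    X = np.zeros((n, feat_dim), dtype=np.float32)
    for i, op in enumerate(node_ops):
        X[i, OPS.index(op)] = 1.0             # one-hot operator type for operator nodes
    X[num_op_nodes, len(OPS):] = hyper_vec    # hyperparameter features for the "Hyper" node
    return A, X

# Tiny example: 3 operator nodes, hyperparameter setting = (hidden_dim, dropout) as raw numbers.
A, X = encode_arch_hyper(edges=[(0, 1), (1, 2)],
                         node_ops=["gdcc", "dgcn", "identity"],
                         hyper_vec=np.array([64.0, 0.3], dtype=np.float32),
                         num_op_nodes=3)
```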
Architecture-Hyperparameter Comparator
Propose an Architecture-Hyperparameter Comparator (AHC) to compare and rank the arch-hypers, so that one can avoid fully training and evaluating every candidate.
Many existing studies propose an accuracy estimator (a neural network) trained on a large number of (architecture, accuracy) samples, which are very expensive to collect.
Instead, a comparator only needs to capture the relative accuracy relation between two candidate arch-hypers.
Given a pair of arch-hypers, the AHC predicts a binary label indicating whether the first achieves higher validation accuracy than the second.
A well-trained AHC can thus rank all candidate arch-hypers through pairwise comparisons.
Considering the powerful ability of the Graph Isomorphism Network (GIN) to distinguish non-isomorphic graphs, leverage GINs to encode each arch-hyper as a compact continuous embedding.
Each GIN layer updates node embeddings as
$$h_v^{(k)} = \mathrm{MLP}^{(k)}\Big((1+\epsilon^{(k)}) \cdot h_v^{(k-1)} + \sum_{u \in \mathcal{N}(v)} h_u^{(k-1)}\Big),$$
where $h_v^{(k)}$ is the embedding of node $v$ at the $k$-th layer, $\mathcal{N}(v)$ is the set of neighbors of $v$, and $\epsilon^{(k)}$ is a learnable scalar.
Use the resulting embeddings of the two arch-hypers as input to a fully-connected layer that outputs the comparison probability.
GIN is proposed in the following paper “How powerful are Graph Neural Networks?”
The Weisfeiler-Lehman isomorphism test is shown as follows:
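Below is a compact sketch of a GIN-based comparator in PyTorch, assuming the (A, X) encoding above. The dense-adjacency GIN layer, the layer sizes, and the way the two graph embeddings are combined are illustrative choices, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class GINLayer(nn.Module):
    """One GIN layer with a dense adjacency: h' = MLP((1 + eps) * h + sum of in-neighbor features)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.eps = nn.Parameter(torch.zeros(1))
        self.mlp = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim))

    def forward(self, A, X):
        agg = A.transpose(-1, -2) @ X          # sum features of in-neighbors
        return self.mlp((1 + self.eps) * X + agg)

class AHC(nn.Module):
    """Architecture-Hyperparameter Comparator: predicts P(arch-hyper i is better than arch-hyper j)."""
    def __init__(self, feat_dim, hidden=64):
        super().__init__()
        self.gin1 = GINLayer(feat_dim, hidden)
        self.gin2 = GINLayer(hidden, hidden)
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def embed(self, A, X):
        h = self.gin2(A, self.gin1(A, X))
        return h.mean(dim=-2)                  # graph-level embedding: mean over nodes

    def forward(self, Ai, Xi, Aj, Xj):
        zi, zj = self.embed(Ai, Xi), self.embed(Aj, Xj)
        return torch.sigmoid(self.head(torch.cat([zi, zj], dim=-1)))

# Usage: A and X come from encode_arch_hyper above, converted to torch tensors.
# ahc = AHC(feat_dim=7); p = ahc(A1, X1, A2, X2)  # probability that arch-hyper 1 beats arch-hyper 2
```

Because only the small comparator network and one pair of graph encodings need to be in memory at a time, this avoids holding an entire supernet in memory, which is where the memory efficiency comes from.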
Training AHC with Noisy Proxies
Design a computation-efficient proxy metric to generate pseudo-labels for training the AHC.
Proxy Performance Metric
Naively using proxy metrics from the CV domain can create severely mislabeled training samples.
Propose a computation-efficient proxy metric specifically for CTS forecasting, which performs better than metrics from the CV domain (e.g., the number of parameters (nparam), Synflow, Snip, and the NTK score).
During experiments, the authors notice that if the validation accuracy of an arch-hyper is higher than that of another in the first few epochs, then its final validation accuracy is very likely to be higher as well.
So the authors propose to train each arch-hyper for only the first few epochs and to use its validation accuracy at that point as its proxy score, called the early-validation proxy, where a higher proxy score indicates a likely higher final accuracy.
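A small sketch of the early-validation proxy under stated assumptions: `build_model`, `train_one_epoch`, and `evaluate` are hypothetical caller-supplied helpers standing in for a full CTS training pipeline, and the number of early epochs is an illustrative choice.

```python
def early_validation_proxy(arch_hyper, build_model, train_one_epoch, evaluate,
                           train_loader, val_loader, num_early_epochs=5):
    """Train an arch-hyper for only a few epochs; its validation accuracy becomes its proxy score.

    build_model / train_one_epoch / evaluate are caller-supplied (hypothetical) helpers wrapping
    model construction, one epoch of training, and validation-accuracy computation.
    """
    model = build_model(arch_hyper)              # instantiate the forecasting model from the arch-hyper
    for _ in range(num_early_epochs):            # far fewer epochs than full training
        train_one_epoch(model, train_loader)
    return evaluate(model, val_loader)           # higher = better; used as a noisy pseudo-label source
```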
Training AHC with Denoising Algorithms
To minimize the impact of training with noisy samples, the authors fully train a few additional arch-hypers to obtain clean training samples whose comparison labels are computed from the fully-trained validation accuracies.
Adopt a commonly used denoising training method: pre-train on the noisy samples, then fine-tune on the clean samples.
- Randomly sample a large number of arch-hypers from the joint search space and use the proposed proxy metric to obtain a proxy score for each arch-hyper.
- Pair up these arch-hypers to produce noisy samples, where each pseudo-label indicates which of the two arch-hypers has the higher proxy score.
- Randomly sample a small number of arch-hypers and train them completely to obtain the validation accuracy of each arch-hyper.
- Also pair up these arch-hypers to produce clean samples labeled by their fully-trained validation accuracies.
The processes of collecting noisy and clean samples are independent and can be highly parallelized. After collecting the noisy and clean samples, the authors first warm up the AHC on the noisy samples and then fine-tune it on the clean samples.
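Put together, the pairing and the warm-up/fine-tune scheme could look roughly like the following, reusing the AHC sketched earlier; the pairing rule, loss, optimizer, and epoch counts are illustrative, not the paper's exact settings.

```python
import torch
import torch.nn as nn

def pair_up(scored):
    """Turn [(A, X, score), ...] into comparison samples (Ai, Xi, Aj, Xj, label)."""
    samples = []
    for i in range(len(scored)):
        for j in range(i + 1, len(scored)):
            Ai, Xi, si = scored[i]
            Aj, Xj, sj = scored[j]
            samples.append((Ai, Xi, Aj, Xj, torch.tensor([1.0 if si > sj else 0.0])))
    return samples

def train_ahc(ahc, noisy_samples, clean_samples, warmup_epochs=50, finetune_epochs=20, lr=1e-3):
    """Denoising scheme: warm up on cheap proxy-labeled (noisy) pairs, then fine-tune on clean pairs."""
    loss_fn = nn.BCELoss()
    opt = torch.optim.Adam(ahc.parameters(), lr=lr)
    for samples, epochs in [(noisy_samples, warmup_epochs), (clean_samples, finetune_epochs)]:
        for _ in range(epochs):
            for Ai, Xi, Aj, Xj, y in samples:
                opt.zero_grad()
                loss = loss_fn(ahc(Ai, Xi, Aj, Xj), y)
                loss.backward()
                opt.step()
    return ahc
```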
Search Strategy and AHC Transfer
Search Strategy
Since the joint search space is enormous, it is inefficient to compare all candidate arch-hypers with the AHC to find the optimal one.
Address the issue with the following steps:
Remove the arch-hypers that lack spatial operators or temporal operators, since both are needed to capture spatio-temporal dependencies.
Apply a heuristic approach, e.g., an evolutionary algorithm, to find the best arch-hyper in the shrunken joint search space.
First, sample a set of arch-hypers and pair them up to produce comparison pairs; a descending ranking of these arch-hypers is then easily obtained from the comparative performance determined by the trained AHC. Select the top-ranked arch-hypers as the initial population. In each evolution step, each arch-hyper generates new offspring with given crossover and mutation probabilities. The offspring are added to the population, and the learned AHC is used to compare the arch-hypers in the population and remove inferior ones so that the population size stays fixed. Finally, choose the top-ranked arch-hypers from the final population, collect their exact forecasting accuracy by fully training them, and pick the one with the highest accuracy as the final searched ST-block.
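Before the paper's own pseudo-code below, here is a condensed sketch of this AHC-guided evolutionary loop; the win-counting ranking, the caller-supplied crossover/mutation helpers, and the population sizes are simplifications rather than the exact procedure.

```python
import random

def ahc_rank(ahc, population):
    """Rank arch-hypers by number of pairwise AHC 'wins' (a simple stand-in for the paper's ranking)."""
    wins = [0] * len(population)
    for i, (Ai, Xi) in enumerate(population):
        for j, (Aj, Xj) in enumerate(population):
            if i != j and float(ahc(Ai, Xi, Aj, Xj)) > 0.5:
                wins[i] += 1
    return [population[k] for k in sorted(range(len(population)), key=lambda k: -wins[k])]

def evolve(ahc, sample_fn, crossover_fn, mutate_fn,
           init_size=100, pop_size=50, steps=20, p_cross=0.5, p_mut=0.3, top_k=5):
    """AHC-guided evolutionary search; sample_fn/crossover_fn/mutate_fn are caller-supplied helpers."""
    population = ahc_rank(ahc, [sample_fn() for _ in range(init_size)])[:pop_size]
    for _ in range(steps):
        offspring = []
        for parent in population:
            if random.random() < p_cross:
                offspring.append(crossover_fn(parent, random.choice(population)))
            if random.random() < p_mut:
                offspring.append(mutate_fn(parent))
        population = ahc_rank(ahc, population + offspring)[:pop_size]   # keep the fittest
    return ahc_rank(ahc, population)[:top_k]   # top-k are then fully trained to pick the final ST-block
```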
The evolutionary algorithm is shown as follows:
Transfer a well-trained AHC
Given an AHC that is well trained on a source dataset, the goal is to quickly adapt it to a new dataset rather than training a new AHC from scratch, so that the cost of collecting training samples and training the AHC on the new dataset is significantly reduced.
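One plausible way to realize such a transfer, sketched under the assumption that adaptation amounts to fine-tuning the pre-trained AHC on a small number of clean samples collected on the new dataset and reusing the `train_ahc` sketch from above (the paper's actual transfer method may differ in its details):

```python
import copy

def transfer_ahc(pretrained_ahc, new_clean_samples, finetune_epochs=10, lr=1e-4):
    """Adapt a pre-trained AHC to a new dataset by fine-tuning on a few clean samples from it.

    new_clean_samples are comparison pairs collected on the new dataset,
    in the same format as used by train_ahc above. Simplified illustration only.
    """
    ahc = copy.deepcopy(pretrained_ahc)      # keep the source-dataset AHC intact
    return train_ahc(ahc, noisy_samples=[], clean_samples=new_clean_samples,
                     warmup_epochs=0, finetune_epochs=finetune_epochs, lr=lr)
```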
Experiments
The paper evaluates SEARCH on six benchmark datasets, showing that it finds better-performing CTS forecasting models than state-of-the-art manual and automated methods; please refer to the original paper for the detailed results.
If you like this blog or find it useful, you are welcome to comment on it. You are also welcome to share it so that more people can benefit. If any images used in this blog infringe your copyright, please contact the author to have them deleted. Thank you!