This blog introduces the paper DARTS: Differentiable Architecture Search.
Introduction
Conventional architecture search algorithms are computationally demanding despite their remarkable performance. An inherent cause of inefficiency for the dominant approaches, based on reinforcement learning (RL), evolution, MCTS, SMBO, or Bayesian optimization, is that architecture search is treated as a black-box optimization problem over a discrete domain, which requires a large number of architecture evaluations.
In this work, the authors approach the problem from a different angle and propose a method for efficient architecture search called DARTS (Differentiable ARchiTecture Search). Instead of searching over a discrete set of candidate architectures, they relax the search space to be continuous, so that the architecture can be optimized with respect to its validation set performance by gradient descent. DARTS achieves performance competitive with the state of the art using orders of magnitude less computation. It is also simpler than many existing approaches, as it does not involve controllers, hypernetworks, or performance predictors.
DARTS is able to learn high-performance architecture building blocks with complex graph topologies within a rich search space. It is not restricted to any specific architecture family and is applicable to both convolutional and recurrent networks.
The contributions of this paper can be summarized as follows:
- Introduce a novel algorithm for differentiable network architecture search based on bilevel optimization, which is applicable to both convolutional and recurrent architectures.
- Achieve highly competitive results on CIFAR-10 and outperform the state of the art on PTB.
- Achieve remarkable efficiency improvement (reducing the cost of architecture discovery to a few GPU days), which the authors attribute to the use of gradient-based optimization as opposed to non-differentiable search techniques.
- Show that the architectures learned by DARTS on CIFAR-10 and PTB are transferable to ImageNet and WikiText-2, respectively.
Differentiable Architecture Search
Search Space
DARTS searches for a computation cell as the building block of the final architecture. The learned cell can either be stacked to form a convolutional network or recursively connected to form a recurrent network.
A cell is a directed acyclic graph consisting of an ordered sequence of $N$ nodes. Each node $x^{(i)}$ is a latent representation (e.g. a feature map in convolutional networks), and each directed edge $(i, j)$ is associated with some operation $o^{(i,j)}$ that transforms $x^{(i)}$.
- For convolutional cells, the input nodes are defined as the cell outputs in the previous two layers.
- For recurrent cells, these are defined as the input at the current step and the state carried from the previous step. The output of the cell is obtained by applying a reduction operation (e.g. concatenation) to all the intermediate nodes.
Each intermediate node is computed based on all of its predecessors:

$$x^{(j)} = \sum_{i<j} o^{(i,j)}\big(x^{(i)}\big)$$

A special zero operation is also included to indicate a lack of connection between two nodes. The task of learning the cell therefore reduces to learning the operations on its edges.
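As a rough illustration (not the authors' code), the sketch below shows how a discrete cell computes its intermediate nodes: each node sums one operation applied to each of its predecessors, and the zero operation stands in for a missing edge. The helper names (`cell_forward`, `edge_ops`, `num_intermediate`) are illustrative assumptions.

```python
import torch

def zero(x):
    # "no connection" between two nodes
    return torch.zeros_like(x)

def identity(x):
    # skip connection
    return x

def cell_forward(inputs, edge_ops, num_intermediate):
    """inputs: the two cell input nodes; edge_ops[(i, j)]: the operation on edge i -> j."""
    states = list(inputs)
    for j in range(len(inputs), len(inputs) + num_intermediate):
        # each intermediate node aggregates all of its predecessors
        states.append(sum(edge_ops[(i, j)](states[i]) for i in range(j)))
    # cell output: concatenation of the intermediate nodes
    return torch.cat(states[len(inputs):], dim=1)
```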
Continuous Relaxation and Optimization
Let $\mathcal{O}$ be a set of candidate operations (e.g. convolution, max pooling, zero), where each operation represents some function $o(\cdot)$ to be applied to $x^{(i)}$. To make the search space continuous, relax the categorical choice of a particular operation to a softmax over all possible operations:

$$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'}^{(i,j)})}\, o(x)$$

where the operation mixing weights for a pair of nodes $(i, j)$ are parameterized by a vector $\alpha^{(i,j)}$ of dimension $|\mathcal{O}|$.
At the end of the search, a discrete architecture can be obtained by replacing each mixed operation $\bar{o}^{(i,j)}$ with the most likely operation, i.e. $o^{(i,j)} = \arg\max_{o \in \mathcal{O}} \alpha_o^{(i,j)}$.
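A minimal sketch of the relaxation, assuming PyTorch and an illustrative `MixedOp` module (not the authors' implementation): each edge holds one architecture parameter per candidate operation and outputs the softmax-weighted mixture of all candidates; discretization keeps the single most likely operation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    def __init__(self, candidate_ops):
        # candidate_ops: list of nn.Module candidates (conv, pooling, zero, ...)
        super().__init__()
        self.ops = nn.ModuleList(candidate_ops)
        # alpha^{(i,j)}: one mixing weight per candidate operation
        self.alpha = nn.Parameter(torch.zeros(len(candidate_ops)))

    def forward(self, x):
        # softmax-weighted mixture of all candidate operations
        weights = F.softmax(self.alpha, dim=-1)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

    def derive(self):
        # discretization: keep only the most likely operation on this edge
        return self.ops[self.alpha.argmax().item()]
```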
After relaxation, the goal is to jointly learn the architecture $\alpha$ and the weights $w$ within all the mixed operations.
Denote by $\mathcal{L}_{train}$ and $\mathcal{L}_{val}$ the training and validation loss, respectively. Both losses are determined not only by the architecture $\alpha$, but also by the weights $w$ in the network.
The goal of architecture search is to find $\alpha^*$ that minimizes the validation loss $\mathcal{L}_{val}(w^*, \alpha^*)$, where the weights $w^*$ associated with the architecture are obtained by minimizing the training loss, $w^* = \arg\min_w \mathcal{L}_{train}(w, \alpha^*)$.
This implies a bilevel optimization problem with $\alpha$ as the upper-level variable and $w$ as the lower-level variable:

$$\min_{\alpha} \; \mathcal{L}_{val}\big(w^*(\alpha), \alpha\big) \quad \text{s.t.} \quad w^*(\alpha) = \arg\min_w \mathcal{L}_{train}(w, \alpha)$$
Approximate Architecture Gradient
Propose a simple approximation scheme as follows:

$$\nabla_{\alpha} \mathcal{L}_{val}\big(w^*(\alpha), \alpha\big) \approx \nabla_{\alpha} \mathcal{L}_{val}\big(w - \xi \nabla_w \mathcal{L}_{train}(w, \alpha), \alpha\big)$$

where $w$ denotes the current weights maintained by the algorithm and $\xi$ is the learning rate for a step of inner optimization.
The idea is to approximate $w^*(\alpha)$ by adapting $w$ using only a single training step, without solving the inner optimization completely by training until convergence. Applying the chain rule to the approximate architecture gradient yields

$$\nabla_{\alpha} \mathcal{L}_{val}(w', \alpha) - \xi \nabla^2_{\alpha, w} \mathcal{L}_{train}(w, \alpha)\, \nabla_{w'} \mathcal{L}_{val}(w', \alpha)$$

where $w' = w - \xi \nabla_w \mathcal{L}_{train}(w, \alpha)$ denotes the weights after one step of gradient descent.
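A minimal sketch of the resulting alternating update, shown here in its simpler first-order form ($\xi = 0$): architecture parameters are updated on validation batches and weights on training batches. The function name and the optimizer setup (one optimizer per parameter group) are assumptions, not the authors' implementation.

```python
import torch

def search_step(model, w_optimizer, alpha_optimizer,
                train_batch, val_batch, loss_fn):
    # 1) update the architecture parameters alpha on the validation loss
    alpha_optimizer.zero_grad()
    x_val, y_val = val_batch
    loss_fn(model(x_val), y_val).backward()
    alpha_optimizer.step()

    # 2) update the network weights w on the training loss
    w_optimizer.zero_grad()
    x_train, y_train = train_batch
    loss_fn(model(x_train), y_train).backward()
    w_optimizer.step()
```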
The expression above contains an expensive matrix-vector product in its second term. The complexity can be substantially reduced using the finite difference approximation:

$$\nabla^2_{\alpha, w} \mathcal{L}_{train}(w, \alpha)\, \nabla_{w'} \mathcal{L}_{val}(w', \alpha) \approx \frac{\nabla_{\alpha} \mathcal{L}_{train}(w^+, \alpha) - \nabla_{\alpha} \mathcal{L}_{train}(w^-, \alpha)}{2\epsilon}$$

where $\epsilon$ is a small scalar and $w^{\pm} = w \pm \epsilon \nabla_{w'} \mathcal{L}_{val}(w', \alpha)$.
Evaluating the finite difference requires only two forward passes for the weights and two backward passes for $\alpha$, reducing the complexity from $O(|\alpha||w|)$ to $O(|\alpha| + |w|)$.
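A sketch of the finite-difference estimate, assuming PyTorch: the weights are perturbed by $\pm\epsilon$ in the direction of the validation gradient, and the training-loss gradients with respect to $\alpha$ at the two perturbed points are differenced. The function name is illustrative, and `eps` is a fixed constant here for simplicity.

```python
import torch

def hessian_vector_product(train_loss_fn, weights, alphas, val_grads, eps=1e-2):
    """train_loss_fn: callable that recomputes L_train with the current weights.
    weights / alphas: lists of parameters; val_grads: grad_{w'} L_val(w', alpha)."""
    # w+ = w + eps * grad_{w'} L_val
    with torch.no_grad():
        for w, g in zip(weights, val_grads):
            w.add_(eps * g)
    grads_pos = torch.autograd.grad(train_loss_fn(), alphas)

    # w- = w - eps * grad_{w'} L_val
    with torch.no_grad():
        for w, g in zip(weights, val_grads):
            w.sub_(2 * eps * g)
    grads_neg = torch.autograd.grad(train_loss_fn(), alphas)

    # restore the original weights
    with torch.no_grad():
        for w, g in zip(weights, val_grads):
            w.add_(eps * g)

    # central difference estimate of the second-order term
    return [(gp - gn) / (2 * eps) for gp, gn in zip(grads_pos, grads_neg)]
```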
Deriving Discrete Architectures
Retain the top-$k$ strongest operations (from distinct nodes) among all non-zero candidate operations collected from all previous nodes, where the strength of an operation is defined as $\frac{\exp(\alpha_o^{(i,j)})}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'}^{(i,j)})}$.
To make the derived architecture comparable with those in the existing work, use $k = 2$ for convolutional cells and $k = 1$ for recurrent cells.
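A sketch of this discretization for a single intermediate node, assuming the architecture weights of its incoming edges are available as 1-D tensors; the `zero` label and the helper name `derive_node` are illustrative.

```python
import torch
import torch.nn.functional as F

def derive_node(alphas, op_names, k):
    """alphas: list of alpha vectors, one per incoming edge; returns (edge, op) pairs."""
    strengths = []
    for i, a in enumerate(alphas):
        probs = F.softmax(a, dim=-1)
        # rank each edge by its strongest non-zero operation
        best = max((p.item(), name) for p, name in zip(probs, op_names)
                   if name != "zero")
        strengths.append((best[0], i, best[1]))
    # retain the k strongest incoming edges, each with its best operation
    return [(i, name) for _, i, name in sorted(strengths, reverse=True)[:k]]
```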
Experiments and Results
Experiments on CIFAR-10 and PTB consist of two stages: architecture search and architecture evaluation.
- Search for the cell architectures using DARTS and determine the best cells based on their validation performance.
- Use these cells to construct larger architectures, and test their performance.
- Investigate the transferability of the best cells learned on CIFAR-10 and PTB by evaluating them on ImageNet and WikiText-2 (WT2), respectively.
Architecture Search
Architecture Evaluation
Conclusion
We presented DARTS, a simple yet efficient architecture search algorithm for both convolutional and recurrent networks. By searching in a continuous space, DARTS is able to match or outperform the state-of-the-art non-differentiable architecture search methods on image classification and language modeling tasks with remarkable efficiency improvement by several orders of magnitude. There are many interesting directions to improve DARTS further. For example, the current method may suffer from discrepancies between the continuous architecture encoding and the derived discrete architecture. This could be alleviated, e.g., by annealing the softmax temperature (with a suitable schedule) to enforce one-hot selection. It would also be interesting to investigate performance-aware architecture derivation schemes based on the shared parameters learned during the search process.
The paper: DARTS: Differentiable Architecture Search.
If you like this blog or find it useful, you are welcome to comment on it. You are also welcome to share this blog so that more people can read it. If any images used in this blog infringe your copyright, please contact the author to have them removed. Thank you!