- Links
- “Patches Are All You Need?”, Anonymous 2021
- “Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?”, Tang et al 2021
- “ConvMLP: Hierarchical Convolutional MLPs for Vision”, Li et al 2021
- “Sparse-MLP: A Fully-MLP Architecture with Conditional Computation”, Lou et al 2021
- “RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision?”, Tatsunami & Taki 2021
- “CycleMLP: A MLP-like Architecture for Dense Prediction”, Chen et al 2021
- “AS-MLP: An Axial Shifted MLP Architecture for Vision”, Lian et al 2021
- “Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition”, Hou et al 2021
- “Towards Biologically Plausible Convolutional Networks”, Pogodin et al 2021
- “Regularization is all you Need: Simple Neural Nets can Excel on Tabular Data”, Kadra et al 2021
- “Adapting the Function Approximation Architecture in Online Reinforcement Learning”, Martin & Modayil 2021
- “PairConnect: A Compute-Efficient MLP Alternative to Attention”, Xu et al 2021
- “MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis”, Tae et al 2021
- “When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations”, Chen et al 2021
- “Luna: Linear Unified Nested Attention”, Ma et al 2021
- “Container: Context Aggregation Network”, Gao et al 2021
- “SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers”, Xie et al 2021
- “Less is More: Pay Less Attention in Vision Transformers”, Pan et al 2021
- “MixerGAN: An MLP-Based Architecture for Unpaired Image-to-Image Translation”, Cazenavette & Guevara 2021
- “Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks”, Guo et al 2021
- “RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition”, Ding et al 2021
- “Sifting out the features by pruning: Are convolutional networks the winning lottery ticket of fully connected ones?”, Pellegrini & Biroli 2021
- “Revisiting Simple Neural Probabilistic Language Models”, Sun & Iyyer 2021
- “KiloNeRF: Speeding up Neural Radiance Fields with Thousands of Tiny MLPs”, Reiser et al 2021
- “Neural Geometric Level of Detail: Real-time Rendering with Implicit 3D Shapes”, Takikawa et al 2021
- “Image Generators with Conditionally-Independent Pixel Synthesis”, Anokhin et al 2020
- “Fourier Neural Operator for Parametric Partial Differential Equations”, Li et al 2020
- “The Deep Bootstrap Framework: Good Online Learners are Good Offline Generalizers”, Nakkiran et al 2020
- “An Attention Free Transformer”, Anonymous 2020
- “Towards Learning Convolutions from Scratch”, Neyshabur 2020
- “Implicit Neural Representations with Periodic Activation Functions”, Sitzmann et al 2020
- “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis”, Mildenhall et al 2020
- “ReZero is All You Need: Fast Convergence at Large Depth”, Bachlechner et al 2020
- “Gesticulator: A framework for semantically-aware speech-driven gesture generation”, Kucherenko et al 2020
- “Finding the Needle in the Haystack with Convolutions: on the benefits of architectural bias”, d'Ascoli et al 2019
- “MoGlow: Probabilistic and controllable motion synthesis using normalising flows”, Henter et al 2019
- “Fixup Initialization: Residual Learning Without Normalization”, Zhang et al 2019
- “SwitchNet: a neural network model for forward and inverse scattering problems”, Khoo & Ying 2018
- “Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science”, Mocanu et al 2018
- “Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks”, Xiao et al 2018
- “NAIS-Net: Stable Deep Networks from Non-Autonomous Differential Equations”, Ciccone et al 2018
- “Sharp Models on Dull Hardware: Fast and Accurate Neural Machine Translation Decoding on the CPU”, Devlin 2017
- “The Shattered Gradients Problem: If resnets are the answer, then what is the question?”, Balduzzi et al 2017
- “Skip Connections Eliminate Singularities”, Orhan & Pitkow 2017
- “Do Deep Convolutional Nets Really Need to be Deep and Convolutional?”, Urban et al 2016
- “How far can we go without convolution: Improving fully-connected networks”, Lin et al 2015
- “Tensorizing Neural Networks”, Novikov et al 2015
- “Deep Neural Networks for Large Vocabulary Handwritten Text Recognition”, Bluche 2015
- “Do Deep Nets Really Need to be Deep?”, Ba & Caruana 2013
- “Network In Network”, Lin et al 2013
- “Deep Big Multilayer Perceptrons for Digit Recognition”, Cireşan et al 2012
- “Extraction de séquences numériques dans des documents manuscrits quelconques”, Chatelain 2006
- “Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis”, Simard et al 2003

- Miscellaneous

# Links

## “Patches Are All You Need?”, Anonymous 2021

“Patches Are All You Need?”, (2021-10-05):

Although convolutional networks have been the dominant architecture for vision tasks for many years, recent experiments have shown that Transformer-based models, most notably the Vision Transformer (ViT), may exceed their performance in some settings. However, due to the quadratic runtime of the self-attention layers in Transformers, ViTs require the use of patch embeddings, which group together small regions of the image into single input features, in order to be applied to larger image sizes. This raises a question: Is the performance of ViTs due to the inherently-more-powerful Transformer architecture, or is it at least partly due to using patches as the input representation? In this paper, we present some evidence for the latter: specifically, we propose the ConvMixer, an extremely simple model that is similar in spirit to the ViT and the even-more-basic MLP-Mixer in that it operates directly on patches as input, separates the mixing of spatial and channel dimensions, and maintains equal size and resolution throughout the network. In contrast, however, the ConvMixer uses only standard convolutions to achieve the mixing steps. Despite its simplicity, we show that the ConvMixer outperforms the ViT, MLP-Mixer, and some of their variants for similar parameter counts and data set sizes, in addition to outperforming classical vision models such as the ResNet. Our code is available at https://github.com/tmp-iclr/convmixer.
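The mechanism is simple enough to sketch in a few lines. The following NumPy rendering is an illustrative simplification of a ConvMixer block, not the authors' code: it substitutes ReLU for GELU and omits BatchNorm, but keeps the two key steps, depthwise convolution for spatial mixing (with a residual connection) followed by a pointwise (1×1) convolution for channel mixing, with size and resolution unchanged throughout:

```python
import numpy as np

def depthwise_conv(x, k):
    """Per-channel 2D convolution with 'same' zero padding.
    x: (C, H, W) patch embeddings; k: (C, kh, kw), one kernel per channel."""
    C, H, W = x.shape
    kh, kw = k.shape[1:]
    xp = np.pad(x, ((0, 0), (kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(xp[c, i:i + kh, j:j + kw] * k[c])
    return out

def convmixer_block(x, dw_kernel, pw_weight):
    """One ConvMixer block (sketch): spatial mixing with a residual,
    then channel mixing via a pointwise convolution. ReLU stands in for GELU."""
    x = x + np.maximum(depthwise_conv(x, dw_kernel), 0)           # spatial mixing
    return np.maximum(np.einsum('dc,chw->dhw', pw_weight, x), 0)  # channel mixing
```

Both operations preserve the (channels, height, width) shape, matching the paper's "equal size and resolution throughout the network".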

[

Keywords: computer vision, vision transformer, mixer, patch embeddings, convolution, convolutional neural network]

## “Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?”, Tang et al 2021

“Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?”, (2021-09-12):

Transformers have sprung up in the field of computer vision. In this work, we explore whether the core self-attention module in the Transformer is the key to achieving excellent performance in image recognition. To this end, we build an attention-free network called sMLPNet based on existing MLP-based vision models. Specifically, we replace the MLP module in the token-mixing step with a novel sparse MLP (sMLP) module. For 2D image tokens, sMLP applies 1D MLPs along the axial directions, with parameters shared among rows or columns. Through sparse connectivity and weight sharing, the sMLP module significantly reduces the number of model parameters and the computational complexity, avoiding the common overfitting problem that plagues the performance of MLP-like models. When trained only on the ImageNet-1K dataset, the proposed sMLPNet achieves 81.9% top-1 accuracy with only 24M parameters, which is much better than most CNNs and vision Transformers under the same model size constraint. When scaled up to 66M parameters, sMLPNet achieves 83.4% top-1 accuracy, on par with the state-of-the-art Swin Transformer. The success of sMLPNet suggests that the self-attention mechanism is not necessarily a silver bullet in computer vision. Code will be made publicly available.
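The axial weight-sharing scheme is easy to make concrete. In this hedged NumPy sketch (the branch fusion here is a plain average, whereas the paper learns the combination), an H×W token grid is mixed with only H² + W² spatial parameters instead of the (H·W)² a full spatial MLP would need:

```python
import numpy as np

def smlp_token_mixing(x, w_h, w_w):
    """Sparse MLP token mixing (sketch). x: (C, H, W) tokens;
    w_h: (H, H) mixes along height, shared across all columns and channels;
    w_w: (W, W) mixes along width, shared across all rows and channels."""
    mix_h = np.einsum('ij,cjw->ciw', w_h, x)  # 1D MLP along the vertical axis
    mix_w = np.einsum('ij,chj->chi', w_w, x)  # 1D MLP along the horizontal axis
    return (x + mix_h + mix_w) / 3.0          # fuse identity + two axial branches
```

For a 14×14 token grid this is 2 × 196 = 392 spatial weights versus 196² = 38,416 for a dense token-mixing MLP, which is the parameter saving the abstract describes.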

## “ConvMLP: Hierarchical Convolutional MLPs for Vision”, Li et al 2021

“ConvMLP: Hierarchical Convolutional MLPs for Vision”, (2021-09-09):

MLP-based architectures, which consist of a sequence of consecutive multi-layer perceptron blocks, have recently been found to reach comparable results to convolutional and transformer-based methods. However, most adopt spatial MLPs which take fixed dimension inputs, therefore making it difficult to apply them to downstream tasks, such as object detection and semantic segmentation. Moreover, single-stage designs further limit performance in other computer vision tasks and fully connected layers bear heavy computation. To tackle these problems, we propose ConvMLP: a hierarchical Convolutional MLP for visual recognition, which is a light-weight, stage-wise, co-design of convolution layers, and MLPs. In particular, ConvMLP-S achieves 76.8% top-1 accuracy on ImageNet-1k with 9M parameters and 2.4G MACs (15% and 19% of MLP-Mixer-B/16, respectively). Experiments on object detection and semantic segmentation further show that visual representation learned by ConvMLP can be seamlessly transferred and achieve competitive results with fewer parameters. Our code and pre-trained models are publicly available at https://github.com/SHI-Labs/Convolutional-MLPs.

## “Sparse-MLP: A Fully-MLP Architecture with Conditional Computation”, Lou et al 2021

“Sparse-MLP: A Fully-MLP Architecture with Conditional Computation”, (2021-09-05):

Mixture of Experts (MoE) with sparse conditional computation has been proven to be an effective architecture for scaling attention-based models to more parameters with comparable computation cost. In this paper, we propose Sparse-MLP, scaling the recent MLP-Mixer model with sparse MoE layers, to achieve a more computation-efficient architecture. We replace a subset of dense MLP blocks in the MLP-Mixer model with Sparse blocks. In each Sparse block, we apply two stages of MoE layers: one with MLP experts mixing information within channels along the image patch dimension, one with MLP experts mixing information within patches along the channel dimension. Besides, to reduce computational cost in routing and improve expert capacity, we design Re-represent layers in each Sparse block. These layers re-scale image representations by two simple but effective linear transformations. By pre-training on ImageNet-1k with the MoCo v3 algorithm, our models can outperform dense MLP models with comparable parameters and less computational cost on several downstream image classification tasks.
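The conditional-computation idea, in which each token is processed by only one of several MLP experts chosen by a learned router, can be sketched as follows. This is a generic top-1 MoE layer in NumPy under assumed shapes, not the paper's implementation (which adds capacity constraints and the Re-represent layers described above):

```python
import numpy as np

def moe_layer(tokens, gate_w, experts):
    """Top-1 mixture-of-experts layer (sketch). tokens: (N, D);
    gate_w: (D, E) router weights; experts: list of (W1, W2) two-layer MLPs."""
    logits = tokens @ gate_w                    # (N, E) routing scores
    choice = logits.argmax(axis=1)              # each token picks one expert
    gates = np.exp(logits - logits.max(axis=1, keepdims=True))
    gates = gates / gates.sum(axis=1, keepdims=True)  # softmax gate values
    out = np.zeros_like(tokens)
    for e, (w1, w2) in enumerate(experts):
        idx = np.where(choice == e)[0]          # tokens routed to expert e
        if len(idx):
            h = np.maximum(tokens[idx] @ w1, 0)  # expert MLP (ReLU)
            out[idx] = gates[idx, e, None] * (h @ w2)
    return out
```

Only the selected expert runs per token, so parameter count scales with the number of experts while per-token compute stays roughly constant, which is the sense in which MoE scaling is "computation-efficient".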

## “RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision?”, Tatsunami & Taki 2021

“RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision?”, (2021-08-09):

For the past ten years, CNNs have reigned supreme in the world of computer vision, but recently, the Transformer has been on the rise. However, the quadratic computational cost of self-attention has become a severe problem in practice. There has been much research on architectures without CNNs and self-attention in this context. In particular, MLP-Mixer is a simple architecture designed using MLPs that achieves accuracy comparable to the Vision Transformer. However, the only inductive bias in this architecture is the embedding of tokens. Thus, there is still a possibility to build a non-convolutional inductive bias into the architecture itself, and we built in an inductive bias using two simple ideas. One way is to divide the token-mixing block vertically and horizontally. Another is to make spatial correlations denser among some channels of token-mixing. With this approach, we were able to improve the accuracy of the MLP-Mixer while reducing its parameters and computational complexity. Compared to other MLP-based models, the proposed model, named RaftMLP, has a good balance of computational complexity, number of parameters, and actual memory usage. In addition, our work indicates that MLP-based models have the potential to replace CNNs by adopting inductive bias. The source code (in PyTorch) is available at https://github.com/okojoalg/raft-mlp.

## “CycleMLP: A MLP-like Architecture for Dense Prediction”, Chen et al 2021

“CycleMLP: A MLP-like Architecture for Dense Prediction”, (2021-07-21):

This paper presents a simple MLP-like architecture, CycleMLP, which is a versatile backbone for visual recognition and dense predictions, unlike modern MLP architectures, e.g., MLP-Mixer, ResMLP, and gMLP, whose architectures are correlated to image size and thus are infeasible in object detection and segmentation. CycleMLP has two advantages compared to modern approaches. (1) It can cope with various image sizes. (2) It achieves linear computational complexity to image size by using local windows. In contrast, previous MLPs have quadratic computations because of their fully spatial connections. We build a family of models that surpass existing MLPs and achieve a comparable accuracy (83.2%) on ImageNet-1K classification compared to the state-of-the-art Transformer such as Swin Transformer (83.3%) but using fewer parameters and FLOPs. We expand the MLP-like models’ applicability, making them a versatile backbone for dense prediction tasks. CycleMLP aims to provide a competitive baseline on object detection, instance segmentation, and semantic segmentation for MLP models. In particular, CycleMLP achieves 45.1 mIoU on ADE20K val, comparable to Swin (45.2 mIoU). Code is available at https://github.com/ShoufaChen/CycleMLP.

## “AS-MLP: An Axial Shifted MLP Architecture for Vision”, Lian et al 2021

“AS-MLP: An Axial Shifted MLP Architecture for Vision”, (2021-07-18):

An Axial Shifted MLP architecture (AS-MLP) is proposed in this paper. Different from MLP-Mixer, where the global spatial feature is encoded for the information flow through matrix transposition and one token-mixing MLP, we pay more attention to local feature communication. By axially shifting channels of the feature map, AS-MLP is able to obtain information flow from different axial directions, which captures the local dependencies. Such an operation enables us to utilize a pure MLP architecture to achieve the same local receptive field as a CNN-like architecture. We can also design the receptive field size and dilation of blocks of AS-MLP, etc., just like designing those of convolution kernels. With the proposed AS-MLP architecture, our model obtains 83.3% Top-1 accuracy with 88M parameters and 15.2 GFLOPs on the ImageNet-1K dataset. Such a simple yet effective architecture outperforms all MLP-based architectures and achieves competitive performance compared to the transformer-based architectures (e.g., Swin Transformer) even with slightly lower FLOPs. In addition, AS-MLP is also the first MLP-based architecture to be applied to downstream tasks (e.g., object detection and semantic segmentation). The experimental results are also impressive. Our proposed AS-MLP obtains 51.5 mAP on the COCO validation set and 49.5 MS mIoU on the ADE20K dataset, which is competitive compared to the transformer-based architectures. Code is available at https://github.com/svip-lab/AS-MLP.
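The axial shift itself is a cheap, parameter-free operation: channel groups are displaced by different offsets along one spatial axis, so that a following per-pixel channel MLP mixes information from neighbouring positions. A hedged NumPy sketch (using wrap-around `np.roll` for brevity, where the paper zero-pads at the borders):

```python
import numpy as np

def axial_shift(x, axis, shift_size=3):
    """Shift channel groups by offsets -1, 0, +1 (for shift_size=3) along one
    spatial axis. x: (C, H, W); axis=1 shifts vertically, axis=2 horizontally."""
    C = x.shape[0]
    shifts = np.arange(shift_size) - shift_size // 2   # e.g. [-1, 0, 1]
    out = np.zeros_like(x)
    for g, s in enumerate(shifts):
        sl = slice(g * C // shift_size, (g + 1) * C // shift_size)
        out[sl] = np.roll(x[sl], s, axis=axis)         # paper zero-pads instead
    return out
```

After shifting along both axes, each position's channel vector contains features from its axial neighbours, giving a 1×1 channel MLP a cross-shaped local receptive field, much like a small convolution kernel.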

## “Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition”, Hou et al 2021

“Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition”, (2021-06-23):

In this paper, we present Vision Permutator, a conceptually simple and data efficient MLP-like architecture for visual recognition. By realizing the importance of the positional information carried by 2D feature representations, unlike recent MLP-like models that encode the spatial information along the flattened spatial dimensions, Vision Permutator separately encodes the feature representations along the height and width dimensions with linear projections. This allows Vision Permutator to capture long-range dependencies along one spatial direction and meanwhile preserve precise positional information along the other direction. The resulting position-sensitive outputs are then aggregated in a mutually complementing manner to form expressive representations of the objects of interest. We show that our Vision Permutators are formidable competitors to convolutional neural networks (CNNs) and vision transformers. Without the dependence on spatial convolutions or attention mechanisms, Vision Permutator achieves 81.5% top-1 accuracy on ImageNet without extra large-scale training data (e.g., ImageNet-22k) using only 25M learnable parameters, which is much better than most CNNs and vision transformers under the same model size constraint. When scaling up to 88M, it attains 83.2% top-1 accuracy. We hope this work could encourage research on rethinking the way of encoding spatial information and facilitate the development of MLP-like models. Code is available at https://github.com/Andrew-Qibin/VisionPermutator.

## “Towards Biologically Plausible Convolutional Networks”, Pogodin et al 2021

“Towards Biologically Plausible Convolutional Networks”, (2021-06-22):

Convolutional networks are ubiquitous in deep learning. They are particularly useful for images, as they reduce the number of parameters, reduce training time, and increase accuracy. However, as a model of the brain they are seriously problematic, since they require weight sharing—something real neurons simply cannot do. Consequently, while neurons in the brain can be locally connected (one of the features of convolutional networks), they cannot be convolutional. Locally connected but non-convolutional networks, however, significantly underperform convolutional ones. This is troublesome for studies that use convolutional networks to explain activity in the visual system. Here we study plausible alternatives to weight sharing that aim at the same regularization principle, which is to make each neuron within a pool react similarly to identical inputs. The most natural way to do that is by showing the network multiple translations of the same image, akin to saccades in animal vision. However, this approach requires many translations, and doesn’t remove the performance gap. We propose instead to add lateral connectivity to a locally connected network, and allow learning via Hebbian plasticity. This requires the network to pause occasionally for a sleep-like phase of “weight sharing”. This method enables locally connected networks to achieve nearly convolutional performance on ImageNet, thus supporting convolutional networks as a model of the visual stream.

## “Regularization is all you Need: Simple Neural Nets can Excel on Tabular Data”, Kadra et al 2021

“Regularization is all you Need: Simple Neural Nets can Excel on Tabular Data”, (2021-06-21):

Tabular datasets are the last “unconquered castle” for deep learning, with traditional ML methods like Gradient-Boosted Decision Trees still performing strongly even against recent specialized neural architectures. In this paper, we hypothesize that the key to boosting the performance of neural networks lies in rethinking the joint and simultaneous application of a large set of modern regularization techniques. As a result, we propose regularizing plain Multilayer Perceptron (MLP) networks by searching for the optimal combination/cocktail of 13 regularization techniques for each dataset using a joint optimization over the decision on which regularizers to apply and their subsidiary hyperparameters. We empirically assess the impact of these regularization cocktails for MLPs on a large-scale empirical study comprising 40 tabular datasets and demonstrate that (i) well-regularized plain MLPs significantly outperform recent state-of-the-art specialized neural network architectures, and (ii) they even outperform strong traditional ML methods, such as XGBoost.

## “Adapting the Function Approximation Architecture in Online Reinforcement Learning”, Martin & Modayil 2021

“Adapting the Function Approximation Architecture in Online Reinforcement Learning”, (2021-06-17):

The performance of a reinforcement learning (RL) system depends on the computational architecture used to approximate a value function. Deep learning methods provide both optimization techniques and architectures for approximating nonlinear functions from noisy, high-dimensional observations. However, prevailing optimization techniques are not designed for strictly-incremental online updates. Nor are standard architectures designed for observations with an a priori unknown structure: for example, light sensors randomly dispersed in space. This paper proposes an online RL prediction algorithm with an adaptive architecture that efficiently finds useful nonlinear features. The algorithm is evaluated in a spatial domain with high-dimensional, stochastic observations. The algorithm outperforms non-adaptive baseline architectures and approaches the performance of an architecture given side-channel information. These results are a step towards scalable RL algorithms for more general problems, where the observation structure is not available.

## “PairConnect: A Compute-Efficient MLP Alternative to Attention”, Xu et al 2021

“PairConnect: A Compute-Efficient MLP Alternative to Attention”, (2021-06-15):

Transformer models have demonstrated superior performance in natural language processing. The dot-product self-attention in the Transformer allows us to model interactions between words. However, this modeling comes with significant computational overhead. In this work, we revisit the memory-compute trade-off associated with the Transformer, particularly multi-head attention, and show a memory-heavy but significantly more compute-efficient alternative to the Transformer. Our proposal, denoted PairConnect, is a multilayer perceptron (MLP) that models the pairwise interaction between words by explicit pairwise word embeddings. As a result, PairConnect substitutes the self-attention dot product with a simple embedding lookup. We show mathematically that despite being an MLP, our compute-efficient PairConnect is strictly more expressive than the Transformer. Our experiments on language modeling tasks suggest that PairConnect can achieve comparable results to the Transformer while reducing the computational cost associated with inference significantly.
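The core trade-off, spending memory on an explicit pairwise table to avoid attention's dot products, can be sketched like so. This NumPy toy (mean-pooling over pairs; the aggregation and table factorization in the paper differ) shows how a layer's "attention" reduces to embedding lookups:

```python
import numpy as np

def pairconnect_layer(token_ids, pair_emb):
    """PairConnect-style mixing (sketch). token_ids: (L,) integer tokens;
    pair_emb: (V, V, D) learned table holding one vector per (word_i, word_j)
    pair -- memory-heavy, but inference needs no dot products at all."""
    L = len(token_ids)
    out = np.stack([
        pair_emb[token_ids[i], token_ids].mean(axis=0)  # look up & pool pairs (i, j)
        for i in range(L)
    ])
    return out  # (L, D)
```

The table grows as O(V²·D) in vocabulary size V, which is the "memory-heavy" half of the trade-off; inference is pure indexing and pooling.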

## “MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis”, Tae et al 2021

“MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis”, (2021-06-15):

Recent developments in deep learning have significantly improved the quality of synthesized singing voice audio. However, prominent neural singing voice synthesis systems suffer from slow inference speed due to their autoregressive design. Inspired by MLP-Mixer, a novel architecture introduced in the vision literature for attention-free image classification, we propose MLP Singer, a parallel Korean singing voice synthesis system. To the best of our knowledge, this is the first work that uses an entirely MLP-based architecture for voice synthesis. Listening tests demonstrate that MLP Singer outperforms a larger autoregressive GAN-based system, both in terms of audio quality and synthesis speed. In particular, MLP Singer achieves a real-time factor of up to 200 and 3400 on CPUs and GPUs respectively, enabling order-of-magnitude faster generation in both environments.

## “When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations”, Chen et al 2021

“When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations”, (2021-06-03):

Vision Transformers (ViTs) and MLPs signal further efforts on replacing hand-wired features or inductive biases with general-purpose neural architectures. Existing works empower the models by massive data, such as large-scale pretraining and/or repeated strong data augmentations, and still report optimization-related problems (e.g., sensitivity to initialization and learning rate). Hence, this paper investigates ViTs and MLP-Mixers from the lens of loss geometry, intending to improve the models’ data efficiency at training and generalization at inference. Visualization and Hessian reveal extremely sharp local minima of converged models. By promoting smoothness with a recently proposed sharpness-aware optimizer, we substantially improve the accuracy and robustness of ViTs and MLP-Mixers on various tasks spanning supervised, adversarial, contrastive, and transfer learning (e.g., +5.3% and +11.0% top-1 accuracy on ImageNet for ViT-B/16 and Mixer-B/16, respectively, with the simple Inception-style preprocessing). We show that the improved smoothness is attributable to sparser active neurons in the first few layers. The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pretraining or strong data augmentations. They also possess more perceptive attention maps.

## “Luna: Linear Unified Nested Attention”, Ma et al 2021

“Luna: Linear Unified Nested Attention”, (2021-06-03):

The quadratic computational and memory complexities of the Transformer’s attention mechanism have limited its scalability for modeling long sequences. In this paper, we propose Luna, a linear unified nested attention mechanism that approximates softmax attention with two nested linear attention functions, yielding only linear (as opposed to quadratic) time and space complexity. Specifically, with the first attention function, Luna packs the input sequence into a sequence of fixed length. Then, the packed sequence is unpacked using the second attention function. As compared to a more traditional attention mechanism, Luna introduces an additional sequence with a fixed length as input and an additional corresponding output, which allows Luna to perform attention operations linearly, while also storing adequate contextual information. We perform extensive evaluations on three benchmarks of sequence modeling tasks: long-context sequence modeling, neural machine translation, and masked language modeling for large-scale pretraining. Competitive or even better experimental results demonstrate both the effectiveness and efficiency of Luna compared to a variety of strong baseline methods.
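The pack-and-unpack structure can be written out directly. A minimal NumPy sketch (single head, no learned projections or normalization, so purely illustrative): with an extra sequence `p` of fixed length k, both attention steps cost O(n·k) rather than O(n²):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def luna_attention(x, p):
    """Nested linear attention (sketch). x: (n, d) input; p: (k, d) fixed-length
    extra sequence. Pack: p attends over x. Unpack: x attends over the pack."""
    d = x.shape[1]
    packed = softmax(p @ x.T / np.sqrt(d)) @ x          # (k, d) summary of x
    out = softmax(x @ packed.T / np.sqrt(d)) @ packed   # (n, d) output
    return out, packed  # 'packed' is the extra output carried to the next layer
```

Because k is fixed, the intermediate attention matrices are (k, n) and (n, k), never (n, n), which is where the linear complexity comes from.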

## “Container: Context Aggregation Network”, Gao et al 2021

“Container: Context Aggregation Network”, (2021-06-02):

Convolutional neural networks (CNNs) are ubiquitous in computer vision, with a myriad of effective and efficient variations. Recently, Transformers—originally introduced in natural language processing—have been increasingly adopted in computer vision. While early adopters continue to employ CNN backbones, the latest networks are end-to-end CNN-free Transformer solutions. A recent surprising finding shows that a simple MLP-based solution without any traditional convolutional or Transformer components can produce effective visual representations.

While CNNs, Transformers and MLP-Mixers may be considered as completely disparate architectures, we provide a unified view showing that they are in fact special cases of a more general method to aggregate spatial context in a neural network stack. We present the CONTAINER (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation that can exploit long-range interactions à la Transformers while still exploiting the inductive bias of the local convolution operation, leading to the faster convergence speeds often seen in CNNs. In contrast to Transformer-based methods that do not scale well to downstream tasks that rely on larger input image resolutions, our efficient network, named CONTAINER-LIGHT, can be employed in object detection and instance segmentation networks such as DETR, RetinaNet and Mask-RCNN to obtain an impressive detection mAP of 38.9, 43.8, 45.1 and mask mAP of 41.3, providing large improvements of 6.6, 7.3, 6.9 and 6.6 points respectively, compared to a ResNet-50 backbone with a comparable compute and parameter size.

Our method also achieves promising results on self-supervised learning compared to DeiT on the DINO framework.

## “SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers”, Xie et al 2021

“SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers”, (2021-05-31):

We present SegFormer, a simple, efficient yet powerful semantic segmentation framework which unifies Transformers with lightweight multilayer perceptron (MLP) decoders. SegFormer has two appealing features: 1) SegFormer comprises a novel hierarchically structured Transformer encoder which outputs multiscale features. It does not need positional encoding, thereby avoiding the interpolation of positional codes which leads to decreased performance when the testing resolution differs from training. 2) SegFormer avoids complex decoders. The proposed MLP decoder aggregates information from different layers, thus combining both local attention and global attention to render powerful representations. We show that this simple and lightweight design is the key to efficient segmentation on Transformers. We scale our approach up to obtain a series of models from SegFormer-B0 to SegFormer-B5, reaching significantly better performance and efficiency than previous counterparts. For example, SegFormer-B4 achieves 50.3% mIoU on ADE20K with 64M parameters, being 5× smaller and 2.2% better than the previous best method. Our best model, SegFormer-B5, achieves 84.0% mIoU on Cityscapes validation set and shows excellent zero-shot robustness on Cityscapes-C. Code will be released at: github.com/NVlabs/SegFormer.

## “Less is More: Pay Less Attention in Vision Transformers”, Pan et al 2021

“Less is More: Pay Less Attention in Vision Transformers”, (2021-05-29):

Transformers have become one of the dominant architectures in deep learning, particularly as a powerful alternative to convolutional neural networks (CNNs) in computer vision. However, Transformer training and inference in previous works can be prohibitively expensive due to the quadratic complexity of self-attention over a long sequence of representations, especially for high-resolution dense prediction tasks. To this end, we present a novel Less attention vIsion Transformer (LIT), building upon the fact that convolutions, fully-connected (FC) layers, and self-attentions have almost equivalent mathematical expressions for processing image patch sequences. Specifically, we propose a hierarchical Transformer where we use pure multi-layer perceptrons (MLPs) to encode rich local patterns in the early stages while applying self-attention modules to capture longer dependencies in deeper layers. Moreover, we further propose a learned deformable token merging module to adaptively fuse informative patches in a non-uniform manner. The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation, serving as a strong backbone for many vision tasks. Code is available at: https://github.com/MonashAI/LIT

## “Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks”, Guo et al 2021

“Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks”, (2021-05-05):

Attention mechanisms, especially self-attention, have played an increasingly important role in deep feature representation for visual tasks. Self-attention updates the feature at each position by computing a weighted sum of features using pair-wise affinities across all positions to capture the long-range dependency within a single sample. However, self-attention has quadratic complexity and ignores potential correlation between different samples. This paper proposes a novel attention mechanism which we call external attention, based on two external, small, learnable, shared memories, which can be implemented easily by simply using two cascaded linear layers and two normalization layers; it conveniently replaces self-attention in existing popular architectures. External attention has linear complexity and implicitly considers the correlations between all data samples. We further incorporate the multi-head mechanism into external attention to provide an all-MLP architecture, external attention MLP (EAMLP), for image classification. Extensive experiments on image classification, object detection, semantic segmentation, instance segmentation, image generation, and point cloud analysis reveal that our method provides results comparable or superior to the self-attention mechanism and some of its variants, with much lower computational and memory costs.
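Since the whole mechanism is "two cascaded linear layers and two normalization layers", it fits in a few lines. A hedged NumPy sketch with assumed shapes (the paper's double normalization is approximated here by a softmax over memory slots followed by an l1 normalization over tokens):

```python
import numpy as np

def external_attention(x, mk, mv):
    """External attention (sketch). x: (N, D) token features; mk, mv: (S, D)
    small learnable external memories shared across all samples. Cost is
    linear in N, versus quadratic for self-attention."""
    attn = x @ mk.T                                         # (N, S) affinities
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn = attn / attn.sum(axis=1, keepdims=True)           # softmax over slots
    attn = attn / (attn.sum(axis=0, keepdims=True) + 1e-9)  # double normalization
    return attn @ mv                                        # (N, D) output
```

Because `mk` and `mv` are trained parameters shared across the whole dataset rather than per-sample keys and values, the memories can implicitly capture correlations between samples, as the abstract notes.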

## “RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition”, Ding et al 2021

“RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition”, (2021-05-05; backlinks):

We propose RepMLP, a multi-layer-perceptron-style neural network building block for image recognition, which is composed of a series of fully-connected (FC) layers. Compared to convolutional layers, FC layers are more efficient, better at modeling the long-range dependencies and positional patterns, but worse at capturing the local structures, hence usually less favored for image recognition. We propose a structural re-parameterization technique that adds local prior into an FC to make it powerful for image recognition. Specifically, we construct convolutional layers inside a RepMLP during training and merge them into the FC for inference. On CIFAR, a simple pure-MLP model shows performance very close to CNN. By inserting RepMLP in traditional CNN, we improve ResNets by 1.8% accuracy on ImageNet, 2.9% for face recognition, and 2.3% mIoU on Cityscapes with lower FLOPs. Our intriguing findings highlight that combining the global representational capacity and positional perception of FC with the local prior of convolution can improve the performance of neural networks with faster speed on both tasks with translation invariance (e.g., semantic segmentation) and those with aligned images and positional patterns (e.g., face recognition). The code and models are available at https://github.com/DingXiaoH/RepMLP.
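The merging trick rests on linearity: a convolution is a linear map, so its exact FC equivalent can be recovered by pushing each basis image through it and stacking the results as columns. A toy single-channel NumPy sketch of that idea (not the authors' code, which handles multi-channel group-wise layers; names are illustrative):

```python
import numpy as np

def conv2d_same(img, k):
    # Naive single-channel 2D cross-correlation with zero padding,
    # stride 1 -- stands in for a real conv layer.
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = (padded[i:i+kh, j:j+kw] * k).sum()
    return out

def conv_as_fc(k, h, w):
    # Build the (h*w, h*w) FC weight matrix equivalent to the conv by
    # pushing each one-hot basis image through it -- the structural
    # re-parameterization idea: conv branches trained alongside an FC
    # can be folded into the FC weight for inference.
    W = np.zeros((h * w, h * w))
    for idx in range(h * w):
        basis = np.zeros(h * w)
        basis[idx] = 1.0
        W[:, idx] = conv2d_same(basis.reshape(h, w), k).ravel()
    return W

rng = np.random.default_rng(1)
h = w = 6
kernel = rng.normal(size=(3, 3))
img = rng.normal(size=(h, w))
W = conv_as_fc(kernel, h, w)
# The merged FC reproduces the conv exactly:
assert np.allclose(W @ img.ravel(), conv2d_same(img, kernel).ravel())
```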

## “Sifting out the features by pruning: Are convolutional networks the winning lottery ticket of fully connected ones?”, Pellegrini & Biroli 2021

“Sifting out the features by pruning: Are convolutional networks the winning lottery ticket of fully connected ones?”, (2021-04-27; backlinks):

Pruning methods can considerably reduce the size of artificial neural networks without harming their performance. In some cases, they can even uncover sub-networks that, when trained in isolation, match or surpass the test accuracy of their dense counterparts. Here we study the inductive bias that pruning imprints in such “winning lottery tickets”. Focusing on visual tasks, we analyze the architecture resulting from iterative magnitude pruning of a simple fully connected network (FCN). We show that the surviving node connectivity is local in input space, and organized in patterns reminiscent of the ones found in convolutional networks (CNN). We investigate the role played by data and tasks in shaping the architecture of pruned sub-networks. Our results show that the winning lottery tickets of FCNs display the key features of CNNs. The ability of such automatic network-simplifying procedure to recover the key features “hand-crafted” in the design of CNNs suggests interesting applications to other datasets and tasks, in order to discover new and efficient architectural inductive biases.

## “Revisiting Simple Neural Probabilistic Language Models”, Sun & Iyyer 2021

“Revisiting Simple Neural Probabilistic Language Models”, (2021-04-08; backlinks):

Recent progress in language modeling has been driven not only by advances in neural architectures, but also through hardware and optimization improvements. In this paper, we revisit the neural probabilistic language model (NPLM), which simply concatenates word embeddings within a fixed window and passes the result through a feed-forward network to predict the next word. When scaled up to modern hardware, this model (despite its many limitations) performs much better than expected on word-level language model benchmarks. Our analysis reveals that the NPLM achieves lower perplexity than a baseline Transformer with short input contexts but struggles to handle long-term dependencies. Inspired by this result, we modify the Transformer by replacing its first self-attention layer with the NPLM’s local concatenation layer, which results in small but consistent perplexity decreases across three word-level language modeling datasets.

## “KiloNeRF: Speeding up Neural Radiance Fields with Thousands of Tiny MLPs”, Reiser et al 2021

“KiloNeRF: Speeding up Neural Radiance Fields with Thousands of Tiny MLPs”, (2021-03-25; backlinks):

NeRF synthesizes novel views of a scene with unprecedented quality by fitting a neural radiance field to RGB images. However, NeRF requires querying a deep Multi-Layer Perceptron (MLP) millions of times, leading to slow rendering times, even on modern GPUs. In this paper, we demonstrate that significant speed-ups are possible by utilizing thousands of tiny MLPs instead of one single large MLP. In our setting, each individual MLP only needs to represent parts of the scene, thus smaller and faster-to-evaluate MLPs can be used. By combining this divide-and-conquer strategy with further optimizations, rendering is accelerated by two orders of magnitude compared to the original NeRF model without incurring high storage costs. Further, using teacher-student distillation for training, we show that this speed-up can be achieved without sacrificing visual quality.

## “Neural Geometric Level of Detail: Real-time Rendering with Implicit 3D Shapes”, Takikawa et al 2021

“Neural Geometric Level of Detail: Real-time Rendering with Implicit 3D Shapes”, (2021-01-26; backlinks):

Neural signed distance functions (SDFs) are emerging as an effective representation for 3D shapes. State-of-the-art methods typically encode the SDF with a large, fixed-size neural network to approximate complex shapes with implicit surfaces. Rendering with these large networks is, however, computationally expensive since it requires many forward passes through the network for every pixel, making these representations impractical for real-time graphics. We introduce an efficient neural representation that, for the first time, enables real-time rendering of high-fidelity neural SDFs, while achieving state-of-the-art geometry reconstruction quality. We represent implicit surfaces using an octree-based feature volume which adaptively fits shapes with multiple discrete levels of detail (LODs), and enables continuous LOD with SDF interpolation. We further develop an efficient algorithm to directly render our novel neural SDF representation in real-time by querying only the necessary LODs with sparse octree traversal. We show that our representation is 2–3 orders of magnitude more efficient in terms of rendering speed compared to previous works. Furthermore, it produces state-of-the-art reconstruction quality for complex shapes under both 3D geometric and 2D image-space metrics.

## “Image Generators with Conditionally-Independent Pixel Synthesis”, Anokhin et al 2020

“Image Generators with Conditionally-Independent Pixel Synthesis”, (2020-11-27; backlinks):

Existing image generator networks rely heavily on spatial convolutions and, optionally, self-attention blocks in order to gradually synthesize images in a coarse-to-fine manner. Here, we present a new architecture for image generators, where the color value at each pixel is computed independently given the value of a random latent vector and the coordinate of that pixel. No spatial convolutions or similar operations that propagate information across pixels are involved during the synthesis. We analyze the modeling capabilities of such generators when trained in an adversarial fashion, and observe the new generators to achieve similar generation quality to state-of-the-art convolutional generators. We also investigate several interesting properties unique to the new architecture.

## “Fourier Neural Operator for Parametric Partial Differential Equations”, Li et al 2020

“Fourier Neural Operator for Parametric Partial Differential Equations”, (2020-10-18; backlinks):

The classical development of neural networks has primarily focused on learning mappings between finite-dimensional Euclidean spaces. Recently, this has been generalized to neural operators that learn mappings between function spaces. For partial differential equations (PDEs), neural operators directly learn the mapping from any functional parametric dependence to the solution. Thus, they learn an entire family of PDEs, in contrast to classical methods which solve one instance of the equation. In this work, we formulate a new neural operator by parameterizing the integral kernel directly in Fourier space, allowing for an expressive and efficient architecture. We perform experiments on Burgers’ equation, Darcy flow, and Navier-Stokes equation. The Fourier neural operator is the first ML-based method to successfully model turbulent flows with zero-shot super-resolution. It is up to three orders of magnitude faster compared to traditional PDE solvers. Additionally, it achieves superior accuracy compared to previous learning-based solvers under fixed resolution.
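The heart of the architecture is the Fourier layer. A minimal 1D, single-channel NumPy sketch (the full model stacks such layers with pointwise linear maps and activations and learns per-channel weight tensors; `spectral_conv_1d` and its arguments are illustrative names):

```python
import numpy as np

def spectral_conv_1d(x, R, modes):
    # One Fourier layer of an FNO: transform to Fourier space, multiply
    # the lowest `modes` frequencies by learned complex weights R, zero
    # the rest, transform back. Because the weights live on frequencies
    # rather than grid points, the same layer can be evaluated on a finer
    # grid at test time (zero-shot super-resolution).
    xf = np.fft.rfft(x)
    out = np.zeros_like(xf)
    out[:modes] = xf[:modes] * R
    return np.fft.irfft(out, n=x.size)
```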

## “The Deep Bootstrap Framework: Good Online Learners are Good Offline Generalizers”, Nakkiran et al 2020

“The Deep Bootstrap Framework: Good Online Learners are Good Offline Generalizers”, (2020-10-16; backlinks):

We propose a new framework for reasoning about generalization in deep learning. The core idea is to couple the Real World, where optimizers take stochastic gradient steps on the empirical loss, to an Ideal World, where optimizers take steps on the population loss. This leads to an alternate decomposition of test error into: (1) the Ideal World test error plus (2) the gap between the two worlds. If the gap (2) is universally small, this reduces the problem of generalization in offline learning to the problem of optimization in online learning. We then give empirical evidence that this gap between worlds can be small in realistic deep learning settings, in particular supervised image classification. For example, CNNs generalize better than MLPs on image distributions in the Real World, but this is “because” they optimize faster on the population loss in the Ideal World. This suggests our framework is a useful tool for understanding generalization in deep learning, and lays a foundation for future research in the area.

## “An Attention Free Transformer”, Anonymous 2020

“An Attention Free Transformer”, (2020-09-28; backlinks):

We propose an efficient Transformer that eliminates attention.

We introduce Attention Free Transformer (AFT), an efficient variant of Transformers that eliminates the need for spatial attention. AFT offers great simplicity compared with standard Transformers, where the multi-head attention operation is replaced with the composition of element-wise multiplications/divisions and global/local pooling. We provide several variants of AFT along with simple yet efficient implementations that are supported by main stream deep learning libraries. We show that, surprisingly, we are able to train AFT effectively on challenging benchmarks, and also to match or surpass the standard Transformer counterparts.
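The simplest variant (AFT-simple) can be written in a few lines. A NumPy sketch (names are illustrative; the full AFT adds learned position biases and local windows):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def aft_simple(Q, K, V):
    # No T x T attention matrix: keys are softmax-normalized over the
    # sequence (per feature), pooled against the values, and the result
    # gated element-wise by sigmoid(Q) -- O(T*d) rather than O(T^2).
    w = np.exp(K - K.max(axis=0, keepdims=True))
    w /= w.sum(axis=0, keepdims=True)        # softmax over the T positions
    pooled = (w * V).sum(axis=0)             # (d,) global weighted pooling
    return sigmoid(Q) * pooled               # (T, d) element-wise gating
```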

[Keywords: Transformers, attention, efficient]

## “Towards Learning Convolutions from Scratch”, Neyshabur 2020

“Towards Learning Convolutions from Scratch”, (2020-07-27; backlinks):

Convolution is one of the most essential components of architectures used in computer vision. As machine learning moves towards reducing the expert bias and learning it from data, a natural next step seems to be learning convolution-like structures from scratch. This, however, has proven elusive. For example, current state-of-the-art architecture search algorithms use convolution as one of the existing modules rather than learning it from data.

In an attempt to understand the inductive bias that gives rise to convolutions, we investigate minimum description length as a guiding principle and show that in some settings, it can indeed be indicative of the performance of architectures.

To find architectures with small description length, we propose β-LASSO, a simple variant of LASSO algorithm that, when applied on fully-connected networks for image classification tasks, learns architectures with local connections and achieves state-of-the-art accuracies for training fully-connected nets on CIFAR-10 (85.19%), CIFAR-100 (59.56%) and SVHN (94.07%) bridging the gap between fully-connected and convolutional nets.

## “Implicit Neural Representations with Periodic Activation Functions”, Sitzmann et al 2020

“Implicit Neural Representations with Periodic Activation Functions”, (2020-06-17; backlinks):

Implicitly defined, continuous, differentiable signal representations parameterized by neural networks have emerged as a powerful paradigm, offering many possible benefits over conventional representations. However, current network architectures for such implicit neural representations are incapable of modeling signals with fine detail, and fail to represent a signal’s spatial and temporal derivatives, despite the fact that these are essential to many physical signals defined implicitly as the solution to partial differential equations. We propose to leverage periodic activation functions for implicit neural representations and demonstrate that these networks, dubbed sinusoidal representation networks or Sirens, are ideally suited for representing complex natural signals and their derivatives. We analyze Siren activation statistics to propose a principled initialization scheme and demonstrate the representation of images, wavefields, video, sound, and their derivatives. Further, we show how Sirens can be leveraged to solve challenging boundary value problems, such as particular Eikonal equations (yielding signed distance functions), the Poisson equation, and the Helmholtz and wave equations. Lastly, we combine Sirens with hypernetworks to learn priors over the space of Siren functions.
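The proposed initialization is concrete enough to sketch. A NumPy version of a Siren-style network, with ω₀ = 30 as in the paper (biases omitted for brevity; function names are illustrative):

```python
import numpy as np

def siren_init(fan_in, fan_out, omega0=30.0, first=False, rng=None):
    # The paper's principled scheme: first layer U(-1/fan_in, 1/fan_in),
    # hidden layers U(-sqrt(6/fan_in)/omega0, +sqrt(6/fan_in)/omega0),
    # chosen so pre-activations stay well-distributed through sine layers.
    if rng is None:
        rng = np.random.default_rng()
    bound = 1.0 / fan_in if first else np.sqrt(6.0 / fan_in) / omega0
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

def siren_forward(x, weights, omega0=30.0):
    # x: (N, d_in) coordinates -> (N, d_out) signal values; sine
    # activations on hidden layers, plain linear output layer.
    h = x
    for W in weights[:-1]:
        h = np.sin(omega0 * (h @ W))
    return h @ weights[-1]
```

Because the representation is a smooth composition of sines, its spatial derivatives (needed for the Eikonal and wave-equation losses) exist in closed form.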

## “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis”, Mildenhall et al 2020

“NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis”, (2020-03-19; backlinks):

We present a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. Our algorithm represents a scene using a fully-connected (non-convolutional) deep network, whose input is a single continuous 5D coordinate (spatial location (x, y, z) and viewing direction (θ, φ)) and whose output is the volume density and view-dependent emitted radiance at that spatial location. We synthesize views by querying 5D coordinates along camera rays and use classic volume rendering techniques to project the output colors and densities into an image. Because volume rendering is naturally differentiable, the only input required to optimize our representation is a set of images with known camera poses. We describe how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrate results that outperform prior work on neural rendering and view synthesis. View synthesis results are best viewed as videos, so we urge readers to view our supplementary video for convincing comparisons.
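The volume-rendering step the abstract refers to is standard quadrature along each camera ray. A NumPy sketch for a single ray (function name is illustrative):

```python
import numpy as np

def volume_render(sigmas, colors, deltas):
    # Classic volume-rendering quadrature used by NeRF: given densities
    # sigma_i, colors c_i, and segment lengths delta_i sampled along one
    # ray, composite to a single RGB value.
    # alpha_i = 1 - exp(-sigma_i * delta_i); T_i = prod_{j<i} (1 - alpha_j).
    alphas = 1.0 - np.exp(-sigmas * deltas)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas                  # (n_samples,) per-sample weights
    return (weights[:, None] * colors).sum(axis=0)
```

Every operation here is differentiable, which is why posed RGB images alone suffice as supervision.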

## “ReZero is All You Need: Fast Convergence at Large Depth”, Bachlechner et al 2020

“ReZero is All You Need: Fast Convergence at Large Depth”, (2020-03-10; backlinks):

Deep networks often suffer from vanishing or exploding gradients due to inefficient signal propagation, leading to long training times or convergence difficulties. Various architecture designs, sophisticated residual-style networks, and initialization schemes have been shown to improve deep signal propagation. Recently, Pennington et al used free probability theory to show that dynamical isometry plays an integral role in efficient deep learning. We show that the simplest architecture change of gating each residual connection using a single zero-initialized parameter satisfies initial dynamical isometry and outperforms more complex approaches. Although much simpler than its predecessors, this gate enables training thousands of fully connected layers with fast convergence and better test performance for ResNets trained on CIFAR-10. We apply this technique to language modeling and find that we can easily train 120-layer Transformers. When applied to 12 layer Transformers, it converges 56% faster on enwiki8.
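The change is a one-liner per block: gate the residual branch with a scalar initialized to zero. A NumPy sketch (using a ReLU MLP as the residual branch purely for illustration):

```python
import numpy as np

class ReZeroBlock:
    # y = x + alpha * F(x), with the single learned gate alpha = 0 at
    # init: every block starts as the identity map, which satisfies
    # initial dynamical isometry and lets very deep stacks train.
    def __init__(self, dim, rng):
        self.W = rng.normal(scale=dim ** -0.5, size=(dim, dim))
        self.alpha = 0.0          # the one extra parameter per block
    def __call__(self, x):
        return x + self.alpha * np.maximum(x @ self.W, 0.0)
```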

## “Gesticulator: A framework for semantically-aware speech-driven gesture generation”, Kucherenko et al 2020

“Gesticulator: A framework for semantically-aware speech-driven gesture generation”, (2020-01-25; backlinks):

During speech, people spontaneously gesticulate, which plays a key role in conveying information. Similarly, realistic co-speech gestures are crucial to enable natural and smooth interactions with social agents. Current end-to-end co-speech gesture generation systems use a single modality for representing speech: either audio or text. These systems are therefore confined to producing either acoustically-linked beat gestures or semantically-linked gesticulation (e.g., raising a hand when saying “high”): they cannot appropriately learn to generate both gesture types. We present a model designed to produce arbitrary beat and semantic gestures together. Our deep-learning based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output. The resulting gestures can be applied to both virtual agents and humanoid robots. Subjective and objective evaluations confirm the success of our approach. The code and video are available at the project page https://svito-zar.github.io/gesticulator .

## “Finding the Needle in the Haystack with Convolutions: on the benefits of architectural bias”, d'Ascoli et al 2019 {#d’ascoli-et-al-2019-section}

“Finding the Needle in the Haystack with Convolutions: on the benefits of architectural bias”, (2019-06-16; backlinks):

Despite the phenomenal success of deep neural networks in a broad range of learning tasks, there is a lack of theory to understand the way they work. In particular, Convolutional Neural Networks (CNNs) are known to perform much better than Fully-Connected Networks (FCNs) on spatially structured data: the architectural structure of CNNs benefits from prior knowledge on the features of the data, for instance their translation invariance. The aim of this work is to understand this fact through the lens of dynamics in the loss landscape.

We introduce a method that maps a CNN to its equivalent FCN (denoted as eFCN). Such an embedding enables the comparison of CNN and FCN training dynamics directly in the FCN space. We use this method to test a new training protocol, which consists in training a CNN, embedding it to FCN space at a certain “relax time”, then resuming the training in FCN space. We observe that for all relax times, the deviation from the CNN subspace is small, and the final performance reached by the eFCN is higher than that reachable by a standard FCN of same architecture. More surprisingly, for some intermediate relax times, the eFCN outperforms the CNN it stemmed from, by combining the prior information of the CNN and the expressivity of the FCN in a complementary way. The practical interest of our protocol is limited by the very large size of the highly sparse eFCN. However, it offers interesting insights into the persistence of architectural bias under stochastic gradient dynamics. It shows the existence of some rare basins in the FCN loss landscape associated with very good generalization. These can only be accessed thanks to the CNN prior, which helps navigate the landscape during the early stages of optimization.

## “MoGlow: Probabilistic and controllable motion synthesis using normalising flows”, Henter et al 2019

“MoGlow: Probabilistic and controllable motion synthesis using normalising flows”, (2019-05-16; backlinks):

Data-driven modelling and synthesis of motion is an active research area with applications that include animation, games, and social robotics. This paper introduces a new class of probabilistic, generative, and controllable motion-data models based on normalising flows. Models of this kind can describe highly complex distributions, yet can be trained efficiently using exact maximum likelihood, unlike GANs or VAEs. Our proposed model is autoregressive and uses LSTMs to enable arbitrarily long time-dependencies. Importantly, it is also causal, meaning that each pose in the output sequence is generated without access to poses or control inputs from future time steps; this absence of algorithmic latency is important for interactive applications with real-time motion control. The approach can in principle be applied to any type of motion since it does not make restrictive, task-specific assumptions regarding the motion or the character morphology. We evaluate the models on motion-capture datasets of human and quadruped locomotion. Objective and subjective results show that randomly-sampled motion from the proposed method outperforms task-agnostic baselines and attains a motion quality close to recorded motion capture.

## “Fixup Initialization: Residual Learning Without Normalization”, Zhang et al 2019

“Fixup Initialization: Residual Learning Without Normalization”, (2019-01-27; backlinks):

Normalization layers are a staple in state-of-the-art deep neural network architectures. They are widely believed to stabilize training, enable higher learning rate, accelerate convergence and improve generalization, though the reason for their effectiveness is still an active research topic. In this work, we challenge the commonly-held beliefs by showing that none of the perceived benefits is unique to normalization. Specifically, we propose fixed-update initialization (Fixup), an initialization motivated by solving the exploding and vanishing gradient problem at the beginning of training via properly rescaling a standard initialization. We find training residual networks with Fixup to be as stable as training with normalization—even for networks with 10,000 layers. Furthermore, with proper regularization, Fixup enables residual networks without normalization to achieve state-of-the-art performance in image classification and machine translation.
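The rescaling at the core of Fixup can be sketched for one residual branch. A NumPy sketch (He-style base initialization assumed for illustration; the paper additionally adds scalar biases and multipliers, omitted here):

```python
import numpy as np

def fixup_branch_init(dims, num_blocks, rng):
    # Fixup for one residual branch of m layers: standard init rescaled
    # by num_blocks^(-1/(2m-2)), and the branch's final layer
    # zero-initialized so that every residual block starts as the
    # identity -- no normalization layers needed for stable training.
    m = len(dims) - 1
    scale = num_blocks ** (-1.0 / (2 * m - 2)) if m > 1 else 1.0
    Ws = [rng.normal(scale=np.sqrt(2.0 / dims[i]), size=(dims[i], dims[i + 1])) * scale
          for i in range(m - 1)]
    Ws.append(np.zeros((dims[m - 1], dims[m])))  # zero-init last layer
    return Ws
```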

## “SwitchNet: a neural network model for forward and inverse scattering problems”, Khoo & Ying 2018

“SwitchNet: a neural network model for forward and inverse scattering problems”, (2018-10-23; backlinks):

We propose a novel neural network architecture, SwitchNet, for solving the wave equation based inverse scattering problems via providing maps between the scatterers and the scattered field (and vice versa). The main difficulty of using a neural network for this problem is that a scatterer has a global impact on the scattered wave field, rendering typical convolutional neural network with local connections inapplicable. While it is possible to deal with such a problem using a fully connected network, the number of parameters grows quadratically with the size of the input and output data. By leveraging the inherent low-rank structure of the scattering problems and introducing a novel switching layer with sparse connections, the SwitchNet architecture uses much fewer parameters and facilitates the training process. Numerical experiments show promising accuracy in learning the forward and inverse maps between the scatterers and the scattered wave field.

## “Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science”, Mocanu et al 2018

“Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science”, (2018-06-19; backlinks):

Through the success of deep learning in various domains, artificial neural networks are currently among the most used artificial intelligence methods. Taking inspiration from the network properties of biological neural networks (e.g., sparsity, scale-freeness), we argue that (contrary to general practice) artificial neural networks, too, should not have fully-connected layers.

Here we propose sparse evolutionary training of artificial neural networks, an algorithm which evolves an initial sparse topology (Erdős-Rényi random graph) of 2 consecutive layers of neurons into a scale-free topology, during learning. Our method replaces artificial neural networks’ fully-connected layers with sparse ones before training, reducing quadratically the number of parameters, with no decrease in accuracy. We demonstrate our claims on restricted Boltzmann machines, multi-layer perceptrons, and convolutional neural networks for unsupervised and supervised learning on 15 datasets.

Our approach has the potential to enable artificial neural networks to scale up beyond what is currently possible.
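One evolution step of the algorithm amounts to magnitude pruning plus random regrowth at constant sparsity. A NumPy sketch (names are illustrative and details such as the regrowth weight distribution are simplified):

```python
import numpy as np

def set_evolve(W, mask, zeta, rng):
    # One prune-and-regrow step of Sparse Evolutionary Training: drop
    # the fraction zeta of surviving connections with the smallest
    # magnitude, then regrow the same number at random empty positions,
    # keeping the total number of connections constant.
    alive = np.flatnonzero(mask)
    n_prune = int(zeta * alive.size)
    order = np.argsort(np.abs(W.ravel()[alive]))
    pruned = alive[order[:n_prune]]
    mask.ravel()[pruned] = False
    W.ravel()[pruned] = 0.0
    empty = np.flatnonzero(~mask.ravel())
    grown = rng.choice(empty, size=n_prune, replace=False)
    mask.ravel()[grown] = True
    W.ravel()[grown] = rng.normal(scale=0.01, size=n_prune)
    return W, mask
```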

## “Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks”, Xiao et al 2018

“Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks”, (2018-06-14; backlinks):

In recent years, state-of-the-art methods in computer vision have utilized increasingly deep convolutional neural network architectures (CNNs), with some of the most successful models employing hundreds or even thousands of layers. A variety of pathologies such as vanishing/exploding gradients make training such deep networks challenging. While residual connections and batch normalization do enable training at these depths, it has remained unclear whether such specialized architecture designs are truly necessary to train deep CNNs. In this work, we demonstrate that it is possible to train vanilla CNNs with ten thousand layers or more simply by using an appropriate initialization scheme. We derive this initialization scheme theoretically by developing a mean field theory for signal propagation and by characterizing the conditions for dynamical isometry, the equilibration of singular values of the input-output Jacobian matrix. These conditions require that the convolution operator be an orthogonal transformation in the sense that it is norm-preserving. We present an algorithm for generating such random initial orthogonal convolution kernels and demonstrate empirically that they enable efficient training of extremely deep architectures.
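The resulting "delta-orthogonal" initialization is easy to construct: place an orthogonal matrix at the kernel's spatial center and zeros elsewhere. A NumPy sketch (kernel layout `(k, k, c_in, c_out)` chosen for illustration):

```python
import numpy as np

def delta_orthogonal_kernel(c, k, rng):
    # A k x k conv kernel over c channels that is zero everywhere except
    # an orthogonal c x c matrix at the spatial center, making the
    # convolution a norm-preserving (orthogonal) map -- the condition
    # that enables training 10,000-layer vanilla CNNs.
    H = rng.normal(size=(c, c))
    Q, R = np.linalg.qr(H)
    Q *= np.sign(np.diag(R))        # fix the QR sign ambiguity
    kernel = np.zeros((k, k, c, c))
    kernel[k // 2, k // 2] = Q
    return kernel
```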

## “NAIS-Net: Stable Deep Networks from Non-Autonomous Differential Equations”, Ciccone et al 2018

“NAIS-Net: Stable Deep Networks from Non-Autonomous Differential Equations”, (2018-04-19; backlinks):

This paper introduces Non-Autonomous Input-Output Stable Network (NAIS-Net), a very deep architecture where each stacked processing block is derived from a time-invariant non-autonomous dynamical system. Non-autonomy is implemented by skip connections from the block input to each of the unrolled processing stages and allows stability to be enforced so that blocks can be unrolled adaptively to a pattern-dependent processing depth. NAIS-Net induces non-trivial, Lipschitz input-output maps, even for an infinite unroll length. We prove that the network is globally asymptotically stable so that for every initial condition there is exactly one input-dependent equilibrium assuming tanh units, and incrementally stable for ReLU units. An efficient implementation that enforces the stability under derived conditions for both fully-connected and convolutional layers is also presented. Experimental results show how NAIS-Net exhibits stability in practice, yielding a substantial reduction in generalization gap compared to ResNets.

## “Sharp Models on Dull Hardware: Fast and Accurate Neural Machine Translation Decoding on the CPU”, Devlin 2017

“Sharp Models on Dull Hardware: Fast and Accurate Neural Machine Translation Decoding on the CPU”, (2017-05-04; backlinks):

Attentional sequence-to-sequence models have become the new standard for machine translation, but one challenge of such models is a substantial increase in training and decoding cost compared to phrase-based systems. Here, we focus on efficient decoding, with a goal of achieving accuracy close to the state-of-the-art in neural machine translation (NMT), while achieving CPU decoding speed/throughput close to that of a phrasal decoder.

We approach this problem from two angles: First, we describe several techniques for speeding up an NMT beam search decoder, which obtain a 4.4× speedup over a very efficient baseline decoder without changing the decoder output. Second, we propose a simple but powerful network architecture which uses an RNN (GRU/LSTM) layer at bottom, followed by a series of stacked fully-connected layers applied at every timestep. This architecture achieves similar accuracy to a deep recurrent model, at a small fraction of the training and decoding cost. By combining these techniques, our best system achieves a very competitive accuracy of 38.3 BLEU on WMT English-French NewsTest2014, while decoding at 100 words/sec on single-threaded CPU. We believe this is the best published accuracy/speed trade-off of an NMT system.

## “The Shattered Gradients Problem: If resnets are the answer, then what is the question?”, Balduzzi et al 2017

“The Shattered Gradients Problem: If resnets are the answer, then what is the question?”, (2017-02-28; backlinks):

A long-standing obstacle to progress in deep learning is the problem of vanishing and exploding gradients. Although the problem has largely been overcome via carefully constructed initializations and batch normalization, architectures incorporating skip-connections such as highway and resnets perform much better than standard feedforward architectures despite well-chosen initialization and batch normalization. In this paper, we identify the shattered gradients problem. Specifically, we show that the correlation between gradients in standard feedforward networks decays exponentially with depth resulting in gradients that resemble white noise whereas, in contrast, the gradients in architectures with skip-connections are far more resistant to shattering, decaying sublinearly. Detailed empirical evidence is presented in support of the analysis, on both fully-connected networks and convnets. Finally, we present a new “looks linear” (LL) initialization that prevents shattering, with preliminary experiments showing the new initialization allows training very deep networks without the addition of skip-connections.
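The “looks linear” idea pairs each weight row with its negation so that, with a concatenated ReLU, the layer is exactly linear at initialization; gradients therefore cannot shatter, yet the mirrored weights are free to drift apart during training. A NumPy sketch demonstrating this (names are illustrative):

```python
import numpy as np

def crelu(x):
    # Concatenated ReLU: [relu(x), relu(-x)] loses no information.
    return np.concatenate([np.maximum(x, 0), np.maximum(-x, 0)], axis=-1)

def ll_layer(x, W):
    # Mirror each row of W with its negation. With CReLU the layer
    # computes W@relu(x) - W@relu(-x) = W@x at init: exactly linear.
    W_mirror = np.concatenate([W, -W], axis=1)   # (d_out, 2*d_in)
    return crelu(x) @ W_mirror.T
```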

## “Skip Connections Eliminate Singularities”, Orhan & Pitkow 2017

“Skip Connections Eliminate Singularities”, (2017-01-31; backlinks):

Skip connections made the training of very deep networks possible and have become an indispensable component in a variety of neural architectures. A completely satisfactory explanation for their success remains elusive. Here, we present a novel explanation for the benefits of skip connections in training very deep networks. The difficulty of training deep networks is partly due to the singularities caused by the non-identifiability of the model. Several such singularities have been identified in previous works: (i) overlap singularities caused by the permutation symmetry of nodes in a given layer, (ii) elimination singularities corresponding to the elimination, i.e. consistent deactivation, of nodes, (iii) singularities generated by the linear dependence of the nodes. These singularities cause degenerate manifolds in the loss landscape that slow down learning. We argue that skip connections eliminate these singularities by breaking the permutation symmetry of nodes, by reducing the possibility of node elimination and by making the nodes less linearly dependent. Moreover, for typical initializations, skip connections move the network away from the “ghosts” of these singularities and sculpt the landscape around them to alleviate the learning slow-down. These hypotheses are supported by evidence from simplified models, as well as from experiments with deep networks trained on real-world datasets.

## “Do Deep Convolutional Nets Really Need to be Deep and Convolutional?”, Urban et al 2016

“Do Deep Convolutional Nets Really Need to be Deep and Convolutional?”, (2016-03-17; backlinks):

Yes, they do. This paper provides the first empirical demonstration that deep convolutional models really need to be both deep and convolutional, even when trained with methods such as distillation that allow small or shallow models of high accuracy to be trained.

Although previous research showed that shallow feed-forward nets sometimes can learn the complex functions previously learned by deep nets while using the same number of parameters as the deep models they mimic, in this paper we demonstrate that the same methods cannot be used to train accurate models on CIFAR-10 unless the student models contain multiple layers of convolution. Although the student models do not have to be as deep as the teacher model they mimic, the students need multiple convolutional layers to learn functions of comparable accuracy as the deep convolutional teacher.

…

Figure 1 summarizes the results in Table 2 for student models of different depth, number of convolutional layers, and number of parameters when trained to mimic the ensemble teacher model. Student models trained on the ensemble logits are able to achieve accuracies previously unseen on CIFAR-10 for models with so few layers. Also, it is clear that there is a huge gap between the convolutional student models at the top of the figure, and the non-convolutional student models at the bottom of the figure: the most accurate student MLP has accuracy less than 75%, while the least accurate convolutional student model with the same number of parameters but only one convolutional layer has accuracy above 87%. And the accuracy of the convolutional student models increases further as more layers of convolution are added. Interestingly, the most accurate student MLPs with no convolutional layers have only 2 or 3 hidden layers; the student MLPs with 4 or 5 hidden layers are not as accurate. Comparing the student MLP with only one hidden layer (bottom of the graph) to the student CNN with 1 convolutional layer clearly suggests that convolution is critical for this problem even when models are trained via distillation, and that it is very unlikely that a shallow non-convolutional model with 100 million parameters or less could ever achieve accuracy comparable to a convolutional model. It appears that if convolution is critical for teacher models trained on the original 0/1 hard targets, it is likely to be critical for student models trained to mimic these teacher models. Adding depth to the student MLPs without adding convolution does not substantially close this “convolutional gap”.
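The mimic-training setup these results come from can be sketched as regression onto the teacher's logits rather than classification against 0/1 hard targets. A toy NumPy sketch with a fabricated linear “teacher” standing in for a trained ensemble; all sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: in the paper the targets are the *logits* of a
# trained ensemble; here a random linear map fabricates such logits purely so
# the mimic-training loop below can run end-to-end.
X = rng.normal(size=(256, 20))
teacher_logits = X @ rng.normal(size=(20, 10))

# One-hidden-layer MLP student regressing the teacher's logits
# (L2 loss on logits, not cross-entropy on hard labels).
W1 = rng.normal(0, 0.1, size=(20, 64))
W2 = rng.normal(0, 0.1, size=(64, 10))

def student(x):
    h = np.maximum(x @ W1, 0)                  # ReLU hidden layer
    return h @ W2, h

def mse():
    return float(np.mean((student(X)[0] - teacher_logits) ** 2))

mse_before = mse()
lr = 1e-2
for _ in range(500):
    logits, h = student(X)
    err = (logits - teacher_logits) / len(X)   # gradient of mean squared error
    gW2 = h.T @ err
    gW1 = X.T @ (err @ W2.T * (h > 0))         # backprop through the ReLU
    W2 -= lr * gW2
    W1 -= lr * gW1
mse_after = mse()
```

The logits carry the teacher's full real-valued confidence structure, which is what lets the student learn a richer target than the hard labels provide.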

## “How far can we go without convolution: Improving fully-connected networks”, Lin et al 2015

“How far can we go without convolution: Improving fully-connected networks”, (2015-11-09; backlinks):

We propose ways to improve the performance of fully connected networks. We found that two approaches in particular have a strong effect on performance: linear bottleneck layers and unsupervised pre-training using autoencoders without hidden unit biases. We show how both approaches can be related to improving gradient flow and reducing sparsity in the network. We show that a fully connected network can yield approximately 70% classification accuracy on the permutation-invariant CIFAR-10 task, which is much higher than the current state-of-the-art. By adding deformations to the training data, the fully connected network achieves 78% accuracy, which is just 10% short of a decent convolutional network.
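The linear-bottleneck idea amounts to interleaving narrow, purely linear layers (no activation) between the wide ReLU layers. An illustrative NumPy forward pass; the layer widths are hypothetical, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(d_in, d_out):
    return rng.normal(0, np.sqrt(2.0 / d_in), size=(d_in, d_out))

# Wide ReLU layers separated by a narrow, purely linear "bottleneck"
# (no nonlinearity, no bias), which the paper relates to better gradient
# flow and reduced sparsity in fully-connected networks.
W_wide1  = layer(3072, 1000)   # 32x32x3 CIFAR-10 input, flattened
W_bottle = layer(1000, 100)    # linear bottleneck: note no ReLU here
W_wide2  = layer(100, 1000)
W_out    = layer(1000, 10)

def forward(x):
    h = np.maximum(x @ W_wide1, 0)   # ReLU
    h = h @ W_bottle                 # linear bottleneck, no activation
    h = np.maximum(h @ W_wide2, 0)   # ReLU
    return h @ W_out                 # class logits

x = rng.normal(size=(2, 3072))
logits = forward(x)
```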

## “Tensorizing Neural Networks”, Novikov et al 2015

“Tensorizing Neural Networks”, (2015-09-22; backlinks):

Deep neural networks currently demonstrate state-of-the-art performance in several domains. At the same time, models of this class are very demanding in terms of computational resources. In particular, a large amount of memory is required by commonly used fully-connected layers, making it hard to use the models on low-end devices and stopping the further increase of the model size. In this paper we convert the dense weight matrices of the fully-connected layers to the Tensor Train format such that the number of parameters is reduced by a huge factor and at the same time the expressive power of the layer is preserved. In particular, for the Very Deep VGG networks we report the compression factor of the dense weight matrix of a fully-connected layer up to 200000 times leading to the compression factor of the whole network up to 7 times.
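A toy 2-core version of the TT-matrix format shows both halves of the claim: the factored matrix-vector product reproduces the dense one exactly, while storing far fewer parameters. An illustrative NumPy sketch; the sizes and TT-rank are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-core TT-matrix: a 64x256 weight matrix whose row index factors as
# (i1, i2) in 8x8 and whose column index factors as (j1, j2) in 16x16:
#     W[(i1,i2),(j1,j2)] = G1[i1, j1, :] @ G2[:, i2, j2]
a1, a2, b1, b2, r = 8, 8, 16, 16, 4
G1 = rng.normal(size=(a1, b1, r))
G2 = rng.normal(size=(r, a2, b2))

# Dense reconstruction, used here only to check the factored matvec:
W = np.einsum('aur,rbv->abuv', G1, G2).reshape(a1 * a2, b1 * b2)

def tt_matvec(x):
    # Apply W to x core by core, never materializing the 64x256 matrix.
    X = x.reshape(b1, b2)
    z = np.einsum('rbv,uv->rbu', G2, X)            # contract the j2 index
    return np.einsum('aur,rbu->ab', G1, z).reshape(a1 * a2)

params_dense = W.size              # 16,384 weights in the dense matrix
params_tt = G1.size + G2.size      # 1,024 weights in the cores: 16x smaller
x = rng.normal(size=b1 * b2)
```

The real format chains many such cores, so the compression factor grows multiplicatively with the number of index factors, which is how the paper reaches factors in the hundreds of thousands for VGG's largest fully-connected layer.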

## “Deep Neural Networks for Large Vocabulary Handwritten Text Recognition”, Bluche 2015

`2015-bluche.pdf`

: “Deep Neural Networks for Large Vocabulary Handwritten Text Recognition”, (2015-05-13; backlinks):

The automatic transcription of text in handwritten documents has many applications, from automatic document processing, to indexing and document understanding.

One of the most popular approaches nowadays consists in scanning the text line image with a sliding window, from which features are extracted, and modeled by Hidden Markov Models (HMMs). Associated with neural networks, such as Multi-Layer Perceptrons (MLPs) or Long Short-Term Memory Recurrent Neural Networks (LSTM-RNNs), and with a language model, these models yield good transcriptions. On the other hand, in many machine learning applications, including speech recognition and computer vision, deep neural networks consisting of several hidden layers recently produced a large reduction of error rates.

In this thesis, we have conducted a thorough study of different aspects of optical models based on deep neural networks in the hybrid neural network / HMM scheme, in order to better understand and evaluate their relative importance.

- First, we show that deep neural networks produce consistent and large improvements over networks with one or 2 hidden layers, independently of the kind of neural network, MLP or RNN, and of input, handcrafted features or pixels.
- Then, we show that deep neural networks with pixel inputs compete with those using handcrafted features, and that depth plays an important role in the reduction of the performance gap between the 2 kinds of inputs, supporting the idea that deep neural networks effectively build hierarchical and relevant representations of their inputs, and that features are automatically learnt on the way.
- Despite the dominance of LSTM-RNNs in the recent literature of handwriting recognition, we show that deep MLPs achieve comparable results. Moreover, we evaluated different training criteria. With sequence-discriminative training, we report similar improvements for MLP/HMMs as those observed in speech recognition.
- We also show how the Connectionist Temporal Classification framework is especially suited to RNNs.
- Finally, the novel dropout technique to regularize neural networks was recently applied to LSTM-RNNs. We tested its effect at different positions in LSTM-RNNs, thus extending previous works, and we show that its relative position to the recurrent connections is important.

We conducted the experiments on 3 public databases, representing 2 languages (English and French) and 2 epochs, using different kinds of neural network inputs: handcrafted features and pixels. We validated our approach by taking part in the HTRtS contest in 2014.

The results of the final systems presented in this thesis, namely MLPs and RNNs, with handcrafted feature or pixel inputs, are comparable to the state-of-the-art on Rimes and IAM. Moreover, the combination of these systems outperformed all published results on the considered databases.

[Keywords: pattern recognition, Hidden Markov Models, neural networks, handwriting recognition]

## “Do Deep Nets Really Need to be Deep?”, Ba & Caruana 2013

“Do Deep Nets Really Need to be Deep?”, (2013-12-21; backlinks):

Currently, deep neural networks are the state of the art on problems such as speech recognition and computer vision. In this extended abstract, we show that shallow feed-forward networks can learn the complex functions previously learned by deep nets and achieve accuracies previously only achievable with deep models. Moreover, in some cases the shallow neural nets can learn these deep functions using a total number of parameters similar to the original deep model. We evaluate our method on the TIMIT phoneme recognition task and are able to train shallow fully-connected nets that perform similarly to complex, well-engineered, deep convolutional architectures. Our success in training shallow neural nets to mimic deeper models suggests that there probably exist better algorithms for training shallow feed-forward nets than those currently available.

## “Network In Network”, Lin et al 2013

“Network In Network”, (2013-12-16; backlinks):

We propose a novel deep network structure called “Network In Network” (NIN) to enhance model discriminability for local patches within the receptive field. The conventional convolutional layer uses linear filters followed by a nonlinear activation function to scan the input. Instead, we build micro neural networks with more complex structures to abstract the data within the receptive field. We instantiate the micro neural network with a multilayer perceptron, which is a potent function approximator. The feature maps are obtained by sliding the micro networks over the input in a similar manner as CNN; they are then fed into the next layer. Deep NIN can be implemented by stacking multiple of the above described structures. With enhanced local modeling via the micro network, we are able to utilize global average pooling over feature maps in the classification layer, which is easier to interpret and less prone to overfitting than traditional fully connected layers. We demonstrated the state-of-the-art classification performances with NIN on CIFAR-10 and CIFAR-100, and reasonable performances on SVHN and MNIST datasets.
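The “micro network” reduces, in its simplest form, to 1×1 convolutions (the same small MLP applied independently at every spatial location), followed by global average pooling in place of a fully-connected classifier. An illustrative NumPy sketch, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlpconv_1x1(x, W, b):
    # A 1x1 convolution is a per-pixel fully-connected layer:
    # the same tiny MLP slides over every spatial location.
    # x: (H, W, C_in); W: (C_in, C_out)
    return np.maximum(x @ W + b, 0)

H, Wd, C = 8, 8, 16
x = rng.normal(size=(H, Wd, C))
W1, b1 = rng.normal(0, 0.1, size=(C, 32)), np.zeros(32)
W2, b2 = rng.normal(0, 0.1, size=(32, 10)), np.zeros(10)  # 10 classes

h = mlpconv_1x1(x, W1, b1)   # micro-MLP layer 1
h = mlpconv_1x1(h, W2, b2)   # micro-MLP layer 2: one feature map per class

# Global average pooling: one scalar per feature map, no FC layer on top.
class_scores = h.mean(axis=(0, 1))   # shape (10,)
```

Because each class score is the spatial average of one feature map, the map itself acts as a per-class confidence map, which is the interpretability benefit the abstract alludes to.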

## “Deep Big Multilayer Perceptrons for Digit Recognition”, Cireşan et al 2012

`2012-ciresan.pdf`

: “Deep Big Multilayer Perceptrons for Digit Recognition”, (2012; backlinks):

The competitive MNIST handwritten digit recognition benchmark has a long history of broken records since 1998. The most recent advancement by others dates back 8 years (error rate 0.4%).

Good old on-line backpropagation for plain multi-layer perceptrons yields a very low 0.35% error rate on the MNIST handwritten digits benchmark with a single MLP, and 0.31% with a committee of 7 MLPs.

All we need to achieve this until-2011-best-result are many hidden layers, many neurons per layer, numerous deformed training images to avoid overfitting, and graphics cards to greatly speed up learning.

[Keywords: neural network, multilayer perceptron, GPU, training set deformations, MNIST, committee, backpropagation]

Note: This work combines 3 previously published papers [1, 2, 3].…In recent decades the amount of raw computing power per Euro has grown by a factor of 100–1000 per decade. Our results show that this ongoing hardware progress may be more important than advances in algorithms and software (although the future will belong to methods combining the best of both worlds). Current graphics cards (GPUs) are already more than 50× faster than standard microprocessors when it comes to training big and deep neural networks by the ancient algorithm, online backpropagation (weight update rates up to 7.5×10^9/s, and more than 10^15 per trained network). On the competitive MNIST handwriting benchmark, single-precision floating-point GPU-based neural nets surpass all previously reported results, including those obtained by much more complex methods involving specialized architectures, unsupervised pre-training, combinations of machine learning classifiers, etc. Training sets of sufficient size to avoid overfitting are obtained by appropriately deforming images. Of course, the approach is not limited to handwriting, and obviously holds great promise for many visual and other pattern recognition problems.

## “Extraction de séquences numériques dans des documents manuscrits quelconques”, Chatelain 2006

`2006-chatelain.pdf`

: “Extraction de séquences numériques dans des documents manuscrits quelconques”, (2006-12-05; backlinks):

Within the framework of the automatic processing of incoming mail documents, we present in this thesis the conception and development of a numerical field extraction system in weakly constrained handwritten documents.

Although the recognition of isolated handwritten entities can be considered as a partially solved problem, the extraction of information in images of complex and free-layout documents is still a challenge. This problem requires the implementation of both handwriting recognition and information extraction methods inspired by approaches developed within the field of information extraction in electronic documents.

Our contribution consists of the conception and implementation of 2 different strategies: the first extends classical handwriting recognition methods, while the second is inspired by approaches used within the field of information extraction in electronic documents.

The results obtained on a real handwritten mail database show that our second approach is substantially better.

Finally, a complete, generic and efficient system is produced, answering one of the emergent perspectives in the field of the automatic reading of handwritten documents: the extraction of complex information in images of documents. [Text of paper is in French.]

## “Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis”, Simard et al 2003

`2003-simard.pdf#microsoft`

: “Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis”, (2003):

Neural networks are a powerful technology for classification of visual inputs arising from documents. However, there is a confusing plethora of different neural network methods that are used in the literature and in industry.

This paper describes a set of concrete best practices that document analysis researchers can use to get good results with neural networks.

The most important practice is getting a training set as large as possible: we expand the training set by adding a new form of distorted data.
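Their elastic-distortion augmentation can be approximated in a few lines: a random per-pixel displacement field, Gaussian-smoothed and scaled, used to resample the image. A rough NumPy sketch with nearest-neighbour sampling (the paper uses bilinear interpolation; α = 34, σ = 4 are the values it reports as best for MNIST):

```python
import numpy as np

rng = np.random.default_rng(0)

def elastic_distort(img, alpha=34.0, sigma=4.0):
    # Elastic distortion: random displacement field, smoothed with a
    # Gaussian of width sigma, scaled by alpha, used to warp the image.
    h, w = img.shape
    dx = rng.uniform(-1, 1, size=(h, w))
    dy = rng.uniform(-1, 1, size=(h, w))
    # Separable Gaussian smoothing via convolution with a 1-D kernel.
    k = np.arange(-3 * int(sigma), 3 * int(sigma) + 1)
    g = np.exp(-k**2 / (2 * sigma**2))
    g /= g.sum()
    def smooth(f):
        f = np.apply_along_axis(lambda r: np.convolve(r, g, mode='same'), 1, f)
        return np.apply_along_axis(lambda r: np.convolve(r, g, mode='same'), 0, f)
    dx, dy = alpha * smooth(dx), alpha * smooth(dy)
    # Sample the image at the displaced coordinates (nearest neighbour here).
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    ys = np.clip((ys + dy).round().astype(int), 0, h - 1)
    xs = np.clip((xs + dx).round().astype(int), 0, w - 1)
    return img[ys, xs]

img = rng.random((28, 28))      # stand-in for an MNIST digit
warped = elastic_distort(img)
```

Each call produces a fresh plausible variant of the digit, so the effective training set grows without bound, which is the “most important practice” of the paper.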

The next most important practice is that convolutional neural networks are better suited for visual document tasks than fully connected networks. We propose that a simple “do-it-yourself” implementation of convolution with a flexible architecture is suitable for many visual document problems. This simple convolutional neural network does not require complex methods, such as momentum, weight decay, structure-dependent learning rates, averaging layers, tangent prop, or even fine-tuning the architecture.

The end result is a very simple yet general architecture which can yield state-of-the-art performance for document analysis.

We illustrate our claims on the MNIST set of English digit images.