Model Zoo Practice – AlexNet

I recently started reproducing papers with a group of people on WeChat. It's a good way to master the classic models in detail by reading the papers.
For the first week, we decided to work on AlexNet. Here are my study notes on AlexNet. They skip the basics and focus on the points that are important to me. Some of the images could not be copied directly from my OneNote page.

I actually planned to do a short showcase during the weekend paper seminar.


The authors trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in ImageNet into 1000 different classes

The neural network:

  • Five convolution layers
  • Some of the conv layers are followed by max-pooling layers
  • 3 fully-connected layers
  • Final 1000-way softmax

To make training faster:

  • Use non-saturating neurons (ReLU)
  • GPU implementation of the convolution operation

To reduce overfitting in the fully-connected layer:

  • Dropout
  • Other tricks


Large dataset:

  • LabelMe: fully-segmented images
  • ImageNet: over 15 million labeled high-resolution images in over 22,000 categories

The size of the network made overfitting a significant problem

The depth is important

Top-1 & Top-5

  • Top-1: the class with the highest predicted probability is the correct label
  • Top-5: the correct label is among the five classes with the highest predicted probabilities
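
To make the two metrics concrete for myself, here is a minimal sketch in plain Python. The function names and the toy scores are made up for illustration; they are not from the paper.

```python
def top_k_correct(scores, true_label, k):
    """Return True if true_label is among the k highest-scoring classes."""
    # Sort class indices by score, highest first.
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return true_label in ranked[:k]


def top_k_error(all_scores, labels, k):
    """Fraction of samples whose true label is NOT in the top-k predictions."""
    wrong = sum(
        not top_k_correct(scores, label, k)
        for scores, label in zip(all_scores, labels)
    )
    return wrong / len(labels)


# Toy example with 6 classes and 2 samples (made-up scores).
scores = [
    [0.05, 0.10, 0.50, 0.20, 0.10, 0.05],  # true label 2: top-1 hit
    [0.30, 0.25, 0.20, 0.15, 0.06, 0.04],  # true label 3: top-1 miss, top-5 hit
]
labels = [2, 3]
print(top_k_error(scores, labels, k=1))  # 0.5
print(top_k_error(scores, labels, k=5))  # 0.0
```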

ImageNet consists of variable-resolution images, while the system requires a constant input dimensionality

  • Down-sample the images to a fixed resolution of 256×256
  • Given a rectangular image, first rescale it so that the shorter side has length 256, then crop out the central 256×256 patch from the resulting image
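
The rescale-then-crop step is mostly arithmetic, so here is a small sketch of just the geometry. It only computes sizes and a crop box and does no actual image resampling. The function name and the rounding choice are my own assumptions.

```python
def resize_then_center_crop_box(width, height, target=256):
    """Return the rescaled (width, height) and the central crop box.

    The shorter side is scaled to `target`; the crop box is
    (left, top, right, bottom) in the rescaled image.
    """
    scale = target / min(width, height)
    # max() guards against floating-point rounding landing just below target.
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)


# A 500x375 landscape image: height becomes 256, width about 341.
size, box = resize_then_center_crop_box(500, 375)
print(size)  # (341, 256)
print(box)   # (42, 0, 298, 256)
```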

**The Architecture**

Five convolutional and three fully connected layers

f is non-saturating iff

$$\left(\left|\lim_{z \to -\infty} f(z)\right| = +\infty\right) \lor \left(\left|\lim_{z \to +\infty} f(z)\right| = +\infty\right)$$

f is saturating iff f is not non-saturating.
These definitions are not specific to convolutional neural networks.

ReLU nonlinearity

In terms of training time with gradient descent, saturating nonlinearities (tanh, sigmoid) are much slower than the non-saturating nonlinearity (ReLU)

Comparison between tanh/sigmoid and ReLU

  • Sigmoid & tanh: when the output is close to the limits of its range, the gradient is close to 0. During backpropagation the gradient passed back is then close to 0 as well, so the weights get almost no update.
  • ReLU has no upper limit. But if the learning rate is too large, a large gradient update may kill the neuron during training (its gradient becomes 0). The neuron can no longer affect the result.
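
A quick numeric check of the saturation argument, as a sketch: it evaluates the analytic derivatives of the three activations at a few inputs. The function names are mine, and the ReLU derivative at 0 is taken as 0 by convention.

```python
import math


def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2


def relu_grad(z):
    # Convention assumed here: derivative is 0 for z <= 0.
    return 1.0 if z > 0 else 0.0


# Far from zero, sigmoid and tanh gradients vanish; ReLU stays at 1 for z > 0.
for z in (-10.0, 0.5, 10.0):
    print(z, sigmoid_grad(z), tanh_grad(z), relu_grad(z))
```

The z = -10 row also shows the "dead neuron" case: once a ReLU's input is stuck on the negative side, its gradient is exactly 0.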

Training on multiple GPUs

  • Memory limitation at that time
  • Put half of the kernels on each GPU
  • The GPUs communicate only in certain layers

Local response normalization:

  • ReLUs have the desirable property that they do not require input normalization to prevent them from saturating
  • The authors still find that a specific local normalization scheme aids generalization
  • The 2015 paper Very Deep Convolutional Networks for Large-Scale Image Recognition reports that LRN is basically of no use.
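
My notes above do not write the scheme out, so here is a sketch of it as I understand it from the paper: each activation is divided by a term built from the squared activations of n neighbouring kernel maps at the same spatial position. The defaults k=2, n=5, alpha=1e-4, beta=0.75 are the values reported in the paper; the function name and the list-based layout are my own.

```python
def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize across channels at a single spatial position.

    `a` is a list of activations, one per kernel map (channel).
    """
    num_channels = len(a)
    half = n // 2
    out = []
    for i in range(num_channels):
        # Neighbouring channels, clipped at the first and last kernel map.
        lo = max(0, i - half)
        hi = min(num_channels - 1, i + half)
        sq_sum = sum(a[j] ** 2 for j in range(lo, hi + 1))
        out.append(a[i] / (k + alpha * sq_sum) ** beta)
    return out


# Toy activations for 6 kernel maps at one pixel (made-up numbers).
print(local_response_norm([1.0, 50.0, 2.0, 0.0, 3.0, 100.0]))
```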

Overlapping pooling:

  • stride < kernel size
  • Reduce overfitting
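
To see what "stride < kernel size" means, here is a 1-D toy version of max pooling. The paper uses kernel size 3 with stride 2; the helper name and the input row are made up.

```python
def max_pool_1d(values, kernel, stride):
    """Slide a window of `kernel` elements with step `stride`; keep each max."""
    return [
        max(values[start:start + kernel])
        for start in range(0, len(values) - kernel + 1, stride)
    ]


row = [1, 5, 2, 8, 3, 7, 9]
print(max_pool_1d(row, kernel=2, stride=2))  # non-overlapping: [5, 8, 7]
print(max_pool_1d(row, kernel=3, stride=2))  # overlapping:     [5, 8, 9]
```

With kernel 3 and stride 2, neighbouring windows share one element, so the value 9 at the edge is picked up even though the non-overlapping version drops it.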

Overall Arch:

  • The first 5 are conv layers and the remaining 3 are fully-connected layers
  • The output of the last layer is fed to a 1000-way softmax
  • The kernels are split across the different GPUs
  • Response normalization layers follow the first and second conv layers
  • Max-pooling layers follow both normalization layers as well as the fifth conv layer
  • ReLU is applied to the output of every conv layer and fully-connected layer

Other Tips:

  • Size of input: usually it should be a multiple of 2 (n × 2)
  • Input size will be 227 after data processing
  • The size of the feature map: floor((img_size - filter_size) / stride) + 1 = new_feature_size, or with padding, floor((img_size - filter_size + 2 * pad) / stride) + 1 = new_feature_size
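
Applying the formula layer by layer is a good sanity check. The sketch below walks a 227-pixel input through the conv and pooling stack. Kernel sizes and strides follow the paper (11×11 stride 4, then 5×5 and 3×3 convs, 3×3 pooling with stride 2); the padding values are an assumption taken from common re-implementations, not something stated in these notes.

```python
def out_size(in_size, kernel, stride, pad=0):
    """floor((in_size - kernel + 2 * pad) / stride) + 1"""
    return (in_size - kernel + 2 * pad) // stride + 1


# (name, kernel, stride, pad); pads are assumed, see the note above.
layers = [
    ("conv1", 11, 4, 0),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool5", 3, 2, 0),
]

size = 227
sizes = {}
for name, kernel, stride, pad in layers:
    size = out_size(size, kernel, stride, pad)
    sizes[name] = size
    print(name, size)
# 227 -> 55 -> 27 -> 27 -> 13 -> 13 -> 13 -> 13 -> 6
```

This also hints at why 227 shows up instead of 224: (224 - 11) / 4 is not an integer, while (227 - 11) / 4 + 1 gives exactly 55.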

Reduce overfitting

Data augmentation:

  • Extract random 224×224 patches (and their horizontal reflections) from the 256×256 images and train the network on these extracted patches. At test time, the network makes a prediction by extracting five 224×224 patches (the four corner patches and the center patch) as well as their horizontal reflections, and averaging the predictions over the ten patches
  • Alter the intensities of the RGB channels in training images: perform PCA on the set of RGB pixel values, then perturb the principal components with Gaussian noise (mean 0, standard deviation 0.1), which amounts to transforming color and illumination. This lowered the error rate by a further 1%.
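
The crop logic can be sketched without any image library by only computing where the patches sit. Each box below is (left, top, flip), where flip marks the horizontal reflection. The function names and the tuple layout are my own; the PCA colour step is not covered here.

```python
import random


def random_crop_box(image_size=256, crop=224, rng=random):
    """Training: pick a random crop origin and a coin flip for mirroring."""
    left = rng.randint(0, image_size - crop)
    top = rng.randint(0, image_size - crop)
    flip = rng.random() < 0.5
    return left, top, flip


def ten_crop_boxes(image_size=256, crop=224):
    """Test: four corners plus center, each with and without horizontal flip."""
    m = image_size - crop
    origins = [(0, 0), (m, 0), (0, m), (m, m), (m // 2, m // 2)]
    return [(left, top, flip) for flip in (False, True) for left, top in origins]


print(random_crop_box(rng=random.Random(0)))
for box in ten_crop_boxes():
    print(box)
```

At test time the network's predictions on the ten patches would then be averaged.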


Dropout: during training, the output of each hidden neuron is set to zero with probability 0.5. At test time we use all the neurons but multiply their outputs by 0.5
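
Why multiply by 0.5? A small sketch, assuming each neuron is dropped independently with probability 0.5 during training (as in the paper): averaged over many random masks, the training-time output matches the scaled test-time output. The function names are mine.

```python
import random


def dropout_train(activations, p_drop=0.5, rng=random):
    """Training: zero each activation independently with probability p_drop."""
    return [0.0 if rng.random() < p_drop else a for a in activations]


def dropout_test(activations, p_drop=0.5):
    """Test: keep every neuron, scale outputs by the keep probability."""
    return [a * (1.0 - p_drop) for a in activations]


rng = random.Random(0)
acts = [2.0, 4.0]
trials = 10000
mean = [0.0, 0.0]
for _ in range(trials):
    out = dropout_train(acts, rng=rng)
    mean = [m + o / trials for m, o in zip(mean, out)]

print(mean)                # close to [1.0, 2.0]
print(dropout_test(acts))  # [1.0, 2.0]
```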

**Details of Learning**

Train the model with SGD; momentum is an extension of plain SGD, and weight decay acts as regularization
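
As far as I can tell, the paper's update rule is v ← 0.9·v − 0.0005·ε·w − ε·grad, followed by w ← w + v, where ε is the learning rate (0.01 initially). Here is a scalar sketch of that rule; the function name and the toy quadratic objective are my own.

```python
def sgd_momentum_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One update: velocity accumulates the gradient plus a weight-decay pull."""
    v_new = momentum * v - weight_decay * lr * w - lr * grad
    return w + v_new, v_new


# Toy objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, v = 0.0, 0.0
for _ in range(200):
    grad = 2.0 * (w - 3.0)
    w, v = sgd_momentum_step(w, v, grad)
print(w)  # ends near 3.0, pulled very slightly toward 0 by weight decay
```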

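Initialize the weights in each layer from a zero-mean Gaussian distribution with standard deviation 0.01

Initialize the biases with the constant 1 (in the paper this applies to the second, fourth, and fifth conv layers and the fully-connected hidden layers; the remaining biases start at 0)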
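
A tiny sketch of this initialization using only the standard library. The helper names and the 100×100 layer shape are made up for illustration.

```python
import random


def init_weights(n_out, n_in, std=0.01, rng=random):
    """Draw every weight from a zero-mean Gaussian with the given std."""
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]


def init_biases(n_out, value):
    """Constant bias initialization."""
    return [value] * n_out


rng = random.Random(0)
W = init_weights(100, 100, rng=rng)
b = init_biases(100, 1.0)

# Empirical check: mean should be near 0 and std near 0.01.
flat = [x for row in W for x in row]
mean = sum(flat) / len(flat)
std = (sum((x - mean) ** 2 for x in flat) / len(flat)) ** 0.5
print(round(mean, 4), round(std, 4))
```

A bias of 1 gives the ReLUs positive inputs early in training, so they start out active instead of dead.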


Use equal learning rate for all layers

Divide the learning rate by 10 when the validation error rate stops improving at the current learning rate
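
This heuristic can be sketched as a small plateau detector. The `patience` parameter (how many non-improving epochs to wait) is my own assumption, since the authors adjusted the rate by hand; the validation errors are made-up numbers.

```python
def lr_schedule(val_errors, lr=0.01, factor=10.0, patience=2):
    """Replay per-epoch validation errors; return the lr used in each epoch."""
    best = float("inf")
    stale = 0
    history = []
    for err in val_errors:
        history.append(lr)
        if err < best:
            best = err
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                # Validation error stopped improving: divide the rate by 10.
                lr /= factor
                stale = 0
    return history


val_errors = [0.60, 0.50, 0.45, 0.45, 0.46, 0.40, 0.40, 0.41]
history = lr_schedule(val_errors)
print(history)  # the rate drops after the fifth epoch
```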


Use the validation and test error rates to compare with other models

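Another way to probe the network's visual knowledge is to consider the feature activations induced by an image at the last hidden layer, and to compare images by the L2 (Euclidean) distance between their activation vectors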
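
Reading "L2" as the Euclidean distance between activation vectors, a minimal retrieval sketch looks like this. The real vectors would come from the last hidden layer (4096 dimensions in the paper); the 3-dimensional vectors and image names below are invented toy data.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest(query, gallery):
    """Name of the gallery image whose features are closest to the query."""
    return min(gallery, key=lambda name: l2_distance(query, gallery[name]))


# Toy feature vectors (hypothetical).
gallery = {
    "cat_a": [0.9, 0.1, 0.0],
    "cat_b": [0.8, 0.2, 0.1],
    "truck": [0.0, 0.1, 0.9],
}
query = [0.9, 0.1, 0.05]
print(nearest(query, gallery))  # cat_a
```

Images whose activation vectors are close in this sense are the ones the network treats as similar, even when their pixels look quite different.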