Empirical Studies on Capsule Network Representation and Improvements
Published:
Capsule networks is a novel approach showing promising results on SmallNorb and MNIST. Here we reproduce and build upon the impressive results shown by Sara Sabour et al. We experiment on the Capsule Network architecture by visualizing exactly what the capsules on different layers represents, what information they store about 3D objects in an image, and try to improve its classification results on CIFAR10 and SmallNorb with various methods including some tricks with reconstruction loss. Further, We present a deconvolution-based reconstruction module that reduces the number of learnable parameters by 80% from the fully-connected module presented by Sara Sabour et al.
Benchmarks
Our baseline model is the same as the original paper, but is only trained for 113 epochs on MNIST, and we did not use a 7-model ensemble for CIFAR10 as did in the paper.
Model | MNIST | SmallNORB | CIFAR10 |
---|---|---|---|
Sabour et al. | 99.75% | 97.3% | 89.40% |
Baseline | 99.73% | 91.5% | 72.59% |
Experiments
We introduced a deconvolution-based reconstructions module, and experimented with Batch normalization and different network topologies.
Deconvolution-based Reconstruction
The baseline model has 1.4M parameters in the fully connected decoder, while our deconvolution-based reconstruction module recudes the number of learnable parameters by 80% down to 0.25M.
Here is an comparison between the two reconstruction modules after training for 25 epochs on MNIST, where RLoss is the SSE reconstruction loss, and MLoss is the margin loss.
Model | RLoss | MLoss | Accuracy |
---|---|---|---|
FC | 21.62 | 0.0058 | 99.51% |
FC w/ BN | 13.12 | 0.0054 | 99.54% |
DeConv | 10.87 | 0.0050 | 99.54% |
DeConv w/ BN | 9.52 | 0.0044 | 99.55% |
Visualization
Reconstructions
Here are the reconstruction results for SmallNORB and CIFAR10, after training for 186 epochs and 86 epochs respectively.
Robustness to Affine Transformations
We visualized how the network recognizes a rotated MNIST image when only trained on unmodified MNIST data. We present an image of number 2 as an example. The network is confident about the result when the image is just slightly rotated, but as the image is further rotated, it starts to confuse the image with other numbers. For example, it is very confident about the image being number 7 at a certain angle, and reconstructs a number 7 that aligns pretty well with the input. Due to its special topological features, the input number 2 is still recognized by the network when rotated by 180°.
Primary Capsules Reconstructions
We used a pre-trained network to train a reconstruction module for Primary Capsules. By scaling these capsules by its routing coefficients to the classified object, we were able to visualize reconstructions from Primary Capsules. Each row is reconstructed from a single capsule, and the routing coefficient is increased from left to right.
Credit
This was a research project done in collaboration with Håkon Hukkelås at University of California San Diego.