Many modern computer vision algorithms rely on the 3x3 convolutional layer and, while they achieve excellent accuracy, are very computationally intensive. This project looks at the effects of using a Box Convolutional layer that can learn its size and location, and thus reduce the number of parameters in the model.
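To make the idea of a box filter concrete, here is a minimal sketch of one common way to evaluate one: build an integral image (summed-area table) once, then read any box sum with four lookups regardless of the box size. This is an illustration under assumptions, not the project's layer: the function names are hypothetical, the box edges are plain integers, and a learnable layer would need real-valued edges, which are omitted here.

```python
def integral_image(img):
    """Return an (H+1) x (W+1) summed-area table with a zero top row and left column."""
    h, w = len(img), len(img[0])
    sat = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += img[y][x]
            # Sum of everything above this row plus the running sum of this row.
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def box_sum(sat, y_min, x_min, y_max, x_max):
    """Sum of pixels in rows y_min..y_max-1 and columns x_min..x_max-1 (four lookups)."""
    return (sat[y_max][x_max] - sat[y_min][x_max]
            - sat[y_max][x_min] + sat[y_min][x_min])


# Toy 3x4 single-channel image (illustrative values only).
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
sat = integral_image(image)
whole = box_sum(sat, 0, 0, 3, 4)  # every pixel
inner = box_sum(sat, 1, 1, 3, 3)  # the 2x2 block 6, 7, 10, 11
print(whole, inner)
```

The box here is described by four numbers (two corners), which is why a box filter can cover a large area without adding weights.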
We extended an idea introduced in this paper to see whether near state-of-the-art model architectures for image classification, segmentation and object detection could achieve similar accuracy with far fewer parameters.
We trained ResNet on Tiny ImageNet, and PSPNet and SSD on VOC 2012, as the baselines for classification, segmentation and detection respectively. We then replaced first a single 3x3 convolutional layer with the custom box convolutional layer, and then all 3x3 layers, to see how this affected accuracy and computation.
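As a rough illustration of why swapping a 3x3 layer can shrink a model, the sketch below counts the parameters of a standard 3x3 convolution and of a hypothetical box block. The box block layout is an assumption made for this example (one four-parameter box per input channel, followed by a 1x1 convolution to mix channels), and the channel counts are arbitrary; none of this is taken from the project's code or results.

```python
def conv3x3_params(c_in, c_out, bias=True):
    """Parameters in a standard 3x3 convolution: 9 weights per channel pair."""
    return c_in * c_out * 9 + (c_out if bias else 0)


def box_block_params(c_in, c_out, boxes_per_channel=1, bias=True):
    """Parameters in an ASSUMED box block (hypothetical layout for illustration).

    Each box is described by 4 values (its two corners), applied per input
    channel, and a 1x1 convolution then mixes the box outputs into c_out channels.
    """
    box_outputs = c_in * boxes_per_channel
    box_params = box_outputs * 4
    pointwise_params = box_outputs * c_out + (c_out if bias else 0)
    return box_params + pointwise_params


# Arbitrary example width, not a measurement from the trained models.
c_in, c_out = 64, 64
standard = conv3x3_params(c_in, c_out)
boxed = box_block_params(c_in, c_out)
print(f"3x3 conv: {standard} parameters")
print(f"box block: {boxed} parameters")
```

The saving for a whole network depends on how many layers are replaced and on the layers that stay unchanged, so per-layer counts like these should not be read as model-level figures.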
Loss and accuracy metrics looked similar across the board. For classification, the box version had a slightly higher epoch time but better accuracy. BoxSSD was 5% faster per epoch and used ~40% of the parameters.