These notes cover in detail the development of the algorithms required to implement the ResNet CNN architecture in the JPEG transform domain. The resulting architecture can perform inference and learning directly on JPEG-compressed images, which do not need to be decompressed before being fed into the network.
Although skipping the decompression step saves time up front, that is not considered the main contribution of the work. The quantized transform coefficients stored in a JPEG file are highly sparse, and CNNs mostly perform additions and multiplications. Many of these operations on JPEG data should therefore be no-ops, greatly speeding up processing across the entire network. Furthermore, sparse data can be stored in much less space than dense data, so this should permit larger batch sizes and therefore more accurate gradients, increasing the accuracy of the network. Finally, JPEG is by far the most popular image file compression scheme due to its high compression ratio, so this method should find wide applicability. For example, the ImageNet data set and challenge consist entirely of JPEG images.
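To make the sparsity claim concrete, the following is a minimal sketch (not code from these notes) that applies the 8x8 DCT and quantization steps of JPEG to a single synthetic block and counts the resulting zeros. The function names `dct_matrix` and `quantized_dct`, the gradient test block, and the choice of the example luminance quantization table from Annex K of the JPEG standard are all assumptions made for illustration; real images and other quality settings will give different sparsity levels.

```python
import numpy as np

# Example luminance quantization table from Annex K of the JPEG standard
# (assumed here for illustration; encoders may scale or replace it).
Q_LUMA = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99],
], dtype=float)


def dct_matrix(n=8):
    """Return the orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n).reshape(-1, 1)  # frequency index (rows)
    i = np.arange(n).reshape(1, -1)  # sample index (columns)
    d = np.sqrt(2.0 / n) * np.cos((2 * i + 1) * k * np.pi / (2 * n))
    d[0, :] = np.sqrt(1.0 / n)       # DC row uses a different scale factor
    return d


def quantized_dct(block, q=Q_LUMA):
    """Level-shift an 8x8 pixel block, apply the 2-D DCT, and quantize."""
    d = dct_matrix(8)
    coeffs = d @ (block - 128.0) @ d.T
    return np.round(coeffs / q).astype(int)


# Hypothetical test block: a smooth diagonal gradient, typical of flat image regions.
rows, cols = np.meshgrid(np.arange(8), np.arange(8), indexing="ij")
block = 100.0 + 4.0 * rows + 2.0 * cols

quantized = quantized_dct(block)
zero_fraction = np.mean(quantized == 0)
print(f"zero coefficients: {zero_fraction:.0%}")
```

Every zero coefficient is a multiply or add that a sparse implementation could skip, which is the effect the paragraph above relies on. Smooth blocks like this one are close to a best case; heavily textured blocks retain more nonzero coefficients.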
The notes are separated by topic and the individual components of ResNet are developed in isolation.
Background
Convolutions
Nonlinearity and Utilities
End-to-End
Appendix
Ehrlich, Max, and Larry S. Davis. "Deep residual learning in the JPEG transform domain." In Proceedings of the IEEE International Conference on Computer Vision, pp. 3484-3493. 2019.
© 2018 Max Ehrlich