Substantiation of the backpropagation technique via the Hamilton—Pontryagin formalism for training nonconvex nonsmooth neural networks

Norkin, VI
Dopov. Nac. akad. nauk Ukr. 2019, 12:19-26
Section: Information Science and Cybernetics
Language: English

The paper notes the similarity between stochastic optimal control of discrete dynamical systems and the training of multilayer neural networks. It focuses on contemporary deep networks with nonconvex nonsmooth loss and activation functions. Machine learning problems are treated as nonconvex nonsmooth stochastic optimization problems. So-called generalized differentiable functions are used as a model of nonsmooth nonconvex dependences. A method for calculating stochastic generalized gradients of the learning quality functional for such systems is substantiated on the basis of the Hamilton-Pontryagin formalism. This method extends the well-known “backpropagation” machine learning technique to nonconvex nonsmooth networks. Stochastic generalized gradient learning algorithms are extended to the training of nonconvex nonsmooth neural networks.
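The link between backpropagation and the costate (adjoint) recursion of optimal control can be illustrated with a minimal sketch. It is not the paper's algorithm or notation: the scalar layer map x_{k+1} = relu(w_k * x_k), the loss |x_N - y|, and all function names are assumptions made for illustration. The values chosen at the kinks (0 for ReLU at zero, 0 for the absolute value at zero) are likewise one admissible choice among several. At points where the network is differentiable, the result coincides with the ordinary gradient.

```python
def relu(z):
    return z if z > 0.0 else 0.0

def relu_grad(z):
    # One element of the generalized gradient of ReLU.
    # Assumption: at the kink z == 0 we pick 0.0 (any value in [0, 1] is admissible).
    return 1.0 if z > 0.0 else 0.0

def abs_grad(r):
    # One element of the generalized gradient of |r|.
    # Assumption: at the kink r == 0 we pick 0.0 (any value in [-1, 1] is admissible).
    if r > 0.0:
        return 1.0
    if r < 0.0:
        return -1.0
    return 0.0

def forward(weights, x0):
    # Discrete dynamical system: x_{k+1} = relu(w_k * x_k), k = 0..N-1.
    xs = [x0]
    for w in weights:
        xs.append(relu(w * xs[-1]))
    return xs

def loss(weights, x0, y):
    # Nonsmooth terminal cost |x_N - y|.
    return abs(forward(weights, x0)[-1] - y)

def costate_gradient(weights, x0, y):
    # Backward (costate) pass with Hamiltonian H_k = psi_{k+1} * relu(w_k * x_k).
    xs = forward(weights, x0)
    psi = abs_grad(xs[-1] - y)          # terminal condition psi_N
    grads = [0.0] * len(weights)
    for k in reversed(range(len(weights))):
        slope = relu_grad(weights[k] * xs[k])
        grads[k] = psi * slope * xs[k]  # dH_k/dw_k
        psi = psi * slope * weights[k]  # psi_k = dH_k/dx_k
    return grads

if __name__ == "__main__":
    print(costate_gradient([0.5, 2.0, 1.5], 1.0, 0.2))
```

The backward loop is the familiar backpropagation pass; reading `psi` as the costate variable is what ties it to the Hamilton-Pontryagin formalism named in the abstract.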

Keywords: deep learning, machine learning, multilayer neural networks, nonsmooth nonconvex optimization, stochastic generalized gradient, stochastic optimization
