Let me try to answer. There is an old joke about theory and practice: Theory is when one knows everything but nothing works. Practice is when everything works but nobody knows why. The universal approximation theorem was established around 1989-1990 (Hornik et al., 1989; Cybenko, 1989; Hornik et al., 1990): a two-layer neural network with sigmoid activations can be a universal approximator. The theorem says that there exists a network large enough to reach any accuracy we want, but it says nothing about how large that network has to be. First, the core point from the Deep Learning book (Ian Goodfellow et al.): "In summary, a feedforward network with a single layer is sufficient to represent any function, but the layer may be infeasibly large and may fail to learn and generalize correctly. In many circumstances, using deeper models can reduce the number of units required to represent the desired function and can reduce the amount of generalization error." Moreover, universal approximators are nothing new, but many of them are useless for machine learning; one example people bring up is the sum-of-indicator-bumps construction.
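To make the "exists, but we don't know how large" point concrete, here is a toy sketch of my own in PyTorch (not from the cited papers): the same one-hidden-layer sigmoid architecture, at two arbitrary widths, fit to sin(x). The theorem only guarantees that some sufficiently wide network reaches any accuracy; all sizes and training settings below are illustrative choices.

```python
# Toy illustration: fit a one-hidden-layer sigmoid network to sin(x)
# on [-pi, pi]. The widths and training schedule are arbitrary; the
# theorem only promises that *some* wide-enough network works.
import math
import torch

torch.manual_seed(0)
x = torch.linspace(-math.pi, math.pi, 512).unsqueeze(1)
y = torch.sin(x)

for width in (4, 64):  # "large enough" is all the theorem guarantees
    net = torch.nn.Sequential(
        torch.nn.Linear(1, width),
        torch.nn.Sigmoid(),
        torch.nn.Linear(width, 1),
    )
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(3000):  # full-batch gradient descent on MSE
        opt.zero_grad()
        loss = torch.mean((net(x) - y) ** 2)
        loss.backward()
        opt.step()
    print(f"width={width:3d}  final MSE={loss.item():.2e}")
```

On this toy target the wider network typically reaches a much lower error, which matches the "exists, but may be infeasibly large" reading of the theorem.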
Stanford's CS231n also notes that deeper models are easier for current optimization algorithms to learn: "Neural Networks work well in practice because they compactly express nice, smooth functions that fit well with the statistical properties of data we encounter in practice, and are also easy to learn using our optimization algorithms (e.g. gradient descent). Similarly, the fact that deeper networks (with multiple hidden layers) can work better than a single-hidden-layer network is an empirical observation, despite the fact that their representational power is equal." Somewhat counter-intuitively, the course goes on to argue that bigger neural networks are easier to train to good solutions: "The subtle reason behind this is that smaller networks are harder to train with local methods such as Gradient Descent: It's clear that their loss functions have relatively few local minima, but it turns out that many of these minima are easier to converge to, and that they are bad (i.e. with high loss). Conversely, bigger neural networks contain significantly more local minima, but these minima turn out to be much better in terms of their actual loss. Since Neural Networks are non-convex, it is hard to study these properties mathematically, but some attempts to understand these objective functions have been made, e.g. in a recent paper, The Loss Surfaces of Multilayer Networks." Interestingly, if we first train a deep model and then train a shallow model to mimic it, the shallow model can sometimes perform even better than the deep one, whereas training the same shallow model directly on the original data performs much worse. The paper Do Deep Nets Really Need to be Deep?【2】 reports detailed experiments on this.
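The mimic-learning recipe from that paper is simple enough to sketch. Below is a toy version in PyTorch; the data, layer sizes, and training schedule are placeholders of my own, not the paper's setup. The key idea it illustrates is that the shallow student is trained with an L2 loss on the deep teacher's logits rather than on the hard labels.

```python
# Toy sketch of mimic learning (cf. "Do Deep Nets Really Need to be
# Deep?"): train a deep teacher on labels, then train a shallow student
# to regress on the teacher's logits. All data/sizes are placeholders.
import torch

torch.manual_seed(0)
X = torch.randn(2048, 20)                  # toy inputs
labels = (X[:, :5].sum(dim=1) > 0).long()  # toy binary labels

teacher = torch.nn.Sequential(             # "deep" teacher
    torch.nn.Linear(20, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 2),
)
student = torch.nn.Sequential(             # "shallow" student: one hidden layer
    torch.nn.Linear(20, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, 2),
)

# 1) Train the teacher on the hard labels.
opt = torch.optim.Adam(teacher.parameters(), lr=1e-3)
for _ in range(500):
    opt.zero_grad()
    torch.nn.functional.cross_entropy(teacher(X), labels).backward()
    opt.step()

# 2) Train the student to mimic the teacher's logits (L2 on logits).
with torch.no_grad():
    teacher_logits = teacher(X)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for _ in range(500):
    opt.zero_grad()
    torch.nn.functional.mse_loss(student(X), teacher_logits).backward()
    opt.step()

acc = (student(X).argmax(dim=1) == labels).float().mean()
print(f"student accuracy on toy data: {acc:.3f}")
```

Regressing on logits rather than class labels is what lets the student see the teacher's full (soft) decision surface, which is the mechanism the paper credits for the shallow model's surprisingly good performance.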
To sum up: being able to approximate any function is not, by itself, enough; the key question is whether our existing optimization algorithms can actually learn that function.
References:
【1】CS231n Convolutional Neural Networks for Visual Recognition
【2】Do Deep Nets Really Need to be Deep? https://arxiv.org/pdf/1312.6184.pdf
【3】FitNets: Hints for Thin Deep Nets, https://arxiv.org/pdf/1412.6550.pdf
【4】Yuan Yang (袁洋): 神经网络有什么理论支持? (What theoretical support do neural networks have?)