Final

David Wallace Croft

Neural Net Mathematics
Richard M. Golden, Ph.D.
U.T. Dallas

2005-04-28


Problem 1

  1. Objective Function

    c(w) = μ1 * ( o1 - r1 )^2 + μ2 * ( o2 - r2 )^2

  2. Response

    rk = r( w , sk ) = exp( - wT * sk )

  3. Weight Update Rule

    w(t+1) = w(t) + Δw

  4. Gradient Descent

    Δw = - η * ∇c(w)

  5. Error

    ek = ok - rk

  6. Error Squared

    fk = ek^2

  7. Exponent

    gk = - wT * sk

  8. Vectors

    μ = [ μ1 μ2 ]T , e = [ e1 e2 ]T , f = [ f1 f2 ]T , g = [ g1 g2 ]T , o = [ o1 o2 ]T , r = [ r1 r2 ]T

  9. Stimulus Matrix

    S = [ s1 s2 ]

  10. Objective Function Revised

    c(w) = μT * f = μT * ( e ⊙ e ) , where ⊙ denotes the elementwise (Hadamard) product

  11. Chain Rule

    ∇c(w) = c'(w)T = [ c'(f) * f'(e) * e'(r) * r'(g) * g'(w) ]T

  12. First Term

    c'(f) = ( μ T * f ) ' = μT

  13. Second Term

    f'(e) = ( e ⊙ e )' = 2 * diag(e)

  14. Third Term

    e'(r) = ( o - r ) ' = -I

  15. Fourth Term

    r'(g) = ( [ exp( g1 ) exp( g2 ) ]T )'(g) = diag( [ exp( g1 ) exp( g2 ) ] ) = diag( r )

  16. Fifth Term

    g'(w) = ( - [ s1 s2 ]T * w )' = - [ s1 s2 ]T = - ST

  17. Combined Terms

    c'(w) = [ μT ] * [ 2 * diag(e) ] * [ -I ] * [ diag( r ) ] * [ - ST ]

  18. Reduce

    c'(w) = 2 * μT * diag( e ) * diag( r ) * ST = 2 * ( μ ⊙ e ⊙ r )T * ST

  19. Transpose

    ∇c(w) = c'(w)T = 2 * S * ( μ ⊙ e ⊙ r )

  20. Simplify

    ∇c(w) = 2 * [ s1 s2 ] * [ μ1 * e1 * r1 μ2 * e2 * r2 ]T

  21. Simplify More

    ∇c(w) = 2 * [ μ1 * ( o1 - r1 ) * r1 * s1 + μ2 * ( o2 - r2 ) * r2 * s2 ]

  22. Weight Update Rule Final

    w(t+1) = w(t) - 2 * η * [ μ1 * ( o1 - r1 ) * r1 * s1 + μ2 * ( o2 - r2 ) * r2 * s2 ]
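As a sanity check (not part of the assignment), the closed-form gradient in step 21 can be compared against a central finite-difference approximation of c(w). The sketch below is plain Python; all numeric values for w, μ, o, and the stimuli are invented test data.

```python
import math

# Numerical sanity check of the gradient in step 21:
#   grad c(w) = 2 * [ mu1*(o1 - r1)*r1*s1 + mu2*(o2 - r2)*r2*s2 ]
# All numeric values below are invented test data.

def response(w, s):
    # r_k = exp(-w^T s_k), from step 2
    return math.exp(-sum(wi * si for wi, si in zip(w, s)))

def cost(w, mu, o, S):
    # c(w) = sum_k mu_k * (o_k - r_k)^2, from step 1
    return sum(m * (ok - response(w, s)) ** 2
               for m, ok, s in zip(mu, o, S))

def analytic_gradient(w, mu, o, S):
    g = [0.0] * len(w)
    for m, ok, s in zip(mu, o, S):
        r = response(w, s)
        coeff = 2.0 * m * (ok - r) * r
        for i, si in enumerate(s):
            g[i] += coeff * si
    return g

def numeric_gradient(w, mu, o, S, h=1e-6):
    # Central finite differences of c(w)
    g = []
    for i in range(len(w)):
        wp = list(w); wp[i] += h
        wm = list(w); wm[i] -= h
        g.append((cost(wp, mu, o, S) - cost(wm, mu, o, S)) / (2 * h))
    return g

w = [0.3, -0.2]
mu = [1.0, 0.5]
o = [0.8, 0.4]
S = [[1.0, 2.0], [0.5, -1.0]]  # columns s_1, s_2 of the stimulus matrix

ga = analytic_gradient(w, mu, o, S)
gn = numeric_gradient(w, mu, o, S)
assert all(abs(a - b) < 1e-5 for a, b in zip(ga, gn))
```

The two gradients agree to finite-difference accuracy, which supports the sign conventions carried through steps 11 to 21.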

Problem 2

  1. Newton-Raphson Descent

    w(t+1) = w(t) - η * [ ∇²c(w) ]^-1 * ∇c(w)

  2. Hessian (Derivation)

    ∇²c(w) = ∂²c(w) / ∂w∂wT = ∂c'(w) / ∂w = ∂[ ∇c(w) ]T / ∂w

  3. From Problem 1

    c'(w) = 2 * ( μ ⊙ e ⊙ r )T * ST

  4. Define y

    y = μ ⊙ e ⊙ r

  5. Define F

    F = ST

  6. Derivative of c in terms of y and F

    c'(w) = 2 * y T * F

  7. Hessian in terms of y and F

    ∇²c(w) = c''(w) = 2 * ( yT * F )'

  8. Identity from Marlow p216

    ( yT * F )' = ( yT ⊗ I ) * F'(w) + FT * y'(w) , where ⊗ denotes the Kronecker product

  9. Zero term

    F'( w) = 0

  10. Apply identity

    ∇²c(w) = 2 * FT * y'(w)

  11. Chain rule

    y'( w) = y' (r) * r' (g) * g' (w)

  12. Function of r

    y'(r) = ( μ ⊙ e ⊙ r )'(r) = ( μ ⊙ ( o - r ) ⊙ r )'(r) = diag( μ ⊙ [ o - 2 * r ] )

  13. From Problem 1

    r'(g) = diag( r ) , g'(w) = - ST

  14. Combined terms

    y'(w) = diag( μ ⊙ [ o - 2 * r ] ) * diag( r ) * ( - ST ) = - diag( μ ⊙ [ o - 2 * r ] ⊙ r ) * ST

  15. Hessian

    ∇²c(w) = 2 * FT * ( - diag( μ ⊙ [ o - 2 * r ] ⊙ r ) * ST ) = -2 * S * diag( μ ⊙ [ o - 2 * r ] ⊙ r ) * ST

  16. Newton-Raphson Descent requires the inverse of the Hessian. The inverse of the Hessian exists if the Hessian is positive definite [Haykin p151]. The Hessian is positive definite if the input signal vectors (s) span the d-dimensional real vector space [Golden p367].
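The Hessian in step 15 can also be checked numerically. The plain-Python sketch below (invented test data, reusing the Problem 1 setup) compares the closed form -2 * S * diag( μ ⊙ [ o - 2r ] ⊙ r ) * ST against central finite differences of the Problem 1 gradient.

```python
import math

# Numerical check of the Hessian in step 15:
#   H = -2 * S * diag( mu*(o - 2r)*r ) * S^T
# against central finite differences of the Problem 1 gradient.
# All numeric values are invented test data.

def response(w, s):
    return math.exp(-sum(wi * si for wi, si in zip(w, s)))

def gradient(w, mu, o, S):
    # grad c(w) = 2 * sum_k mu_k*(o_k - r_k)*r_k*s_k  (Problem 1, step 21)
    g = [0.0] * len(w)
    for m, ok, s in zip(mu, o, S):
        r = response(w, s)
        c = 2.0 * m * (ok - r) * r
        for i, si in enumerate(s):
            g[i] += c * si
    return g

def analytic_hessian(w, mu, o, S):
    d = len(w)
    H = [[0.0] * d for _ in range(d)]
    for m, ok, s in zip(mu, o, S):
        r = response(w, s)
        c = -2.0 * m * (ok - 2.0 * r) * r
        for i in range(d):
            for j in range(d):
                H[i][j] += c * s[i] * s[j]
    return H

def numeric_hessian(w, mu, o, S, h=1e-6):
    # Finite differences of the analytic gradient, one column at a time.
    d = len(w)
    H = [[0.0] * d for _ in range(d)]
    for j in range(d):
        wp = list(w); wp[j] += h
        wm = list(w); wm[j] -= h
        gp, gm = gradient(wp, mu, o, S), gradient(wm, mu, o, S)
        for i in range(d):
            H[i][j] = (gp[i] - gm[i]) / (2 * h)
    return H

w = [0.3, -0.2]
mu = [1.0, 0.5]
o = [0.8, 0.4]
S = [[1.0, 2.0], [0.5, -1.0]]  # s_1, s_2

Ha = analytic_hessian(w, mu, o, S)
Hn = numeric_hessian(w, mu, o, S)
assert all(abs(a - b) < 1e-4
           for ra, rb in zip(Ha, Hn) for a, b in zip(ra, rb))
```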

Problem 3

  1. Probability Space

    ( Ω x , F x , μx ) , μx ( Ω x ) = 1

  2. Dominating Measure

    μx = μs * μo * μμ

  3. Joint Density

    ps,o*( s~t , o~t ) = p( ot | st ) * p( st )

  4. Measurable Function

    g( x~ ) = - μ~t * log( ps,o*( s~t , o~t ) )

  5. Expectation

    E{ g( x~ ) } = ∫ (x ∈ Ωx) g( x ) * px*( x ) dμx( x )

  6. Summations and Riemann Integrals

    E{ g( x~ ) } = Σ (s ∈ Ωs) Σ (μ ∈ {0,1}) ∫ (over o) - μt * log( ps,o*( st , ot ) ) * ps,o*( st , ot ) * pμ do

px* is a measurable function because it is piecewise continuous on R^d.
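To make step 6 concrete, the expectation can be evaluated numerically for a toy probability model. The densities below (a two-point stimulus space, a uniform and a linear conditional density on [0, 1], and a Bernoulli mask) are invented placeholders, since the assignment leaves the model unspecified; only the summation-plus-Riemann-integral structure is from the derivation.

```python
import math

# Toy numerical evaluation of step 6:
#   E{g} = sum over s in Omega_s, mu in {0,1}, integral over o of
#          -mu * log(p(s,o)) * p(s,o) * p_mu  do
# The densities below are invented placeholders for illustration.

omega_s = [0, 1]            # toy finite stimulus space
p_s = {0: 0.4, 1: 0.6}      # invented stimulus probabilities p(s)
p_mu = {0: 0.2, 1: 0.8}     # invented mask probabilities P(mu)

def p_o_given_s(o, s):
    # Invented conditional densities on [0, 1]:
    # uniform for s = 0, linear (2*o) for s = 1; each integrates to 1.
    return 1.0 if s == 0 else 2.0 * o

def expectation(n_grid=10000):
    # Midpoint Riemann sum over o, exact sums over s and mu.
    total = 0.0
    do = 1.0 / n_grid
    for s in omega_s:
        for mu in (0, 1):
            for k in range(n_grid):
                o = (k + 0.5) * do
                p_so = p_o_given_s(o, s) * p_s[s]
                if p_so > 0.0:
                    total += -mu * math.log(p_so) * p_so * p_mu[mu] * do
    return total
```

For this toy model the mu = 0 branch contributes nothing, and the result reduces to 0.8 times the cross-entropy integral over the joint density.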

Problem 4

  1. Objective Function

    ln( ρ ) = (1/n) * Σ t=1..n [ μt * ( ot - rt )^2 + λ * ‖ht‖^2 ]

  2. Squashing Function

    σ( x ) = 2 / ( 1 + e^-x ) - 1 = ( 1 - e^-x ) / ( 1 + e^-x ) = tanh( x/2 )

  3. Derivative of Squashing Function

    σ'( x ) = 2 * [ - ( 1 + e^-x )^-2 ] * [ - e^-x ] = (1/2) * 4 * e^-x / ( 1 + e^-x )^2 = (1/2) * [ ( 1 + e^-x )^2 - ( 1 - e^-x )^2 ] / ( 1 + e^-x )^2 = (1/2) * ( 1 - σ( x )^2 )

  4. Weight Update Rule for Output Layer

    w(k+1) = w(k) - η * ∇ln( w )

  5. Weighted Sum to Output Neuron

    at = wT * ht

  6. Output Neuron Response

    rt = σ ( at )

  7. Error

    et = ot - rt

  8. Error Squared

    ft = et^2

  9. Observable Error Squared

    gt = μt * ft

  10. Chain Rule

    gt ' (w) = gt ' (ft) * ft ' (et) * et ' (rt) * rt ' (at) * at ' (w)

  11. First Term

    gt ' (ft) = μt

  12. Second Term

    ft ' (et) = 2 * et

  13. Third Term

    et ' (rt) = -1

  14. Fourth Term

    rt'(at) = (1/2) * ( 1 - rt^2 )

  15. Fifth Term

    at ' (w) = htT

  16. Combined Terms

    gt'(w) = μt * 2 * et * (-1) * (1/2) * ( 1 - rt^2 ) * htT = - μt * ( ot - rt ) * ( 1 - rt^2 ) * htT

  17. Weight Gradient for Output Layer

    ∇ln( w ) = (1/n) * Σ t=1..n [ - μt * ( ot - rt ) * ( 1 - rt^2 ) * ht ]

  18. Weight Update Rule for Output Layer Complete

    w(k+1) = w(k) + η * (1/n) * Σ t=1..n [ μt * ( ot - rt ) * ( 1 - rt^2 ) * ht ]

  19. Hidden Layer Weight Vector

    v = vec ( V )

  20. Weight Update Rule for Hidden Layer

    v(k+1) = v(k) - η * ∇ln( v )

  21. Weighted Sum to Hidden Layer

    bt = V * st

  22. Hidden Layer

    ht = σ ( bt )

  23. Normalization Term

    ct = λ * ‖ht‖^2 = λ * htT * ht

  24. Error Minimization Term plus Normalization Term

    dt = μt * ( ot - rt )^2 + λ * ‖ht‖^2 = gt + ct

  25. Chain Rule

    dt ' (v) = gt ' (v) + ct ' (v)

  26. Chain Rule for Error Minimization Term

    gt ' (v) = gt ' (ft) * ft ' (et) * et ' (rt) * rt ' (at) * at ' (ht) * ht ' (bt) * bt ' (v)

  27. Fifth Term

    at'(ht) = wT

  28. Sixth Term

    ht'(bt) = (1/2) * diag( 1 - ht ⊙ ht )

  29. Seventh Term

    bt'(v) = I ⊗ stT

  30. Combined Terms for Error Minimization

    gt'(v) = μt * 2 * et * (-1) * (1/2) * ( 1 - rt^2 ) * wT * (1/2) * diag( 1 - ht ⊙ ht ) * [ I ⊗ stT ]

  31. Reduce

    gt'(v) = - (1/2) * μt * ( ot - rt ) * ( 1 - rt^2 ) * [ w ⊙ ( 1 - ht ⊙ ht ) ]T ⊗ stT

  32. Chain Rule for Normalization Term

    ct ' (v) = ct ' (ht) * ht ' (bt) * bt ' (v)

  33. First Term for Normalization

    ct ' (ht) = 2 * λ * htT

  34. Combined Terms for Normalization

    ct'(v) = 2 * λ * htT * (1/2) * diag( 1 - ht ⊙ ht ) * [ I ⊗ stT ]

  35. Reduce

    ct'(v) = λ * [ ht ⊙ ( 1 - ht ⊙ ht ) ]T ⊗ stT

  36. Derivative of Both Terms

    dt'(v) = - (1/2) * μt * ( ot - rt ) * ( 1 - rt^2 ) * [ w ⊙ ( 1 - ht ⊙ ht ) ]T ⊗ stT + λ * [ ht ⊙ ( 1 - ht ⊙ ht ) ]T ⊗ stT

  37. Reduce

    dt'(v) = ( [ - (1/2) * μt * ( ot - rt ) * ( 1 - rt^2 ) * w + λ * ht ]T ⊙ [ 1 - ht ⊙ ht ]T ) ⊗ stT

  38. Weight Gradient for Hidden Layer

    ∇ln( v ) = (1/n) * Σ t=1..n [ ( [ - (1/2) * μt * ( ot - rt ) * ( 1 - rt^2 ) * w + λ * ht ] ⊙ [ 1 - ht ⊙ ht ] ) ⊗ st ]

  39. Weight Update Rule for Hidden Layer Complete

    v(k+1) = v(k) + η * (1/n) * Σ t=1..n [ ( [ (1/2) * μt * ( ot - rt ) * ( 1 - rt^2 ) * w - λ * ht ] ⊙ [ 1 - ht ⊙ ht ] ) ⊗ st ]
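The output-layer gradient (step 17) and the hidden-layer gradient (step 38) can both be sanity-checked numerically against the objective in step 1. The plain-Python sketch below uses a two-unit hidden layer and invented weights, stimuli, and targets; it is an illustration, not part of the assignment.

```python
import math

# Numerical check of the gradients in steps 17 and 38 against
# central finite differences of the objective l_n. Network sizes,
# weights, and data below are invented test values.

def sigma(x):
    # Squashing function from step 2: sigma(x) = 2/(1 + e^-x) - 1
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

def forward(V, w, s):
    # Hidden layer h = sigma(V*s), output r = sigma(w^T h)
    h = [sigma(sum(vij * sj for vij, sj in zip(row, s))) for row in V]
    r = sigma(sum(wi * hi for wi, hi in zip(w, h)))
    return h, r

def l_n(V, w, data, lam):
    # Objective from step 1: (1/n) sum_t [mu_t*(o_t - r_t)^2 + lam*||h_t||^2]
    total = 0.0
    for mu, s, o in data:
        h, r = forward(V, w, s)
        total += mu * (o - r) ** 2 + lam * sum(hi * hi for hi in h)
    return total / len(data)

def grad_w(V, w, data, lam):
    # Step 17: grad l_n(w) = (1/n) sum_t [-mu_t*(o_t - r_t)*(1 - r_t^2)*h_t]
    g = [0.0] * len(w)
    for mu, s, o in data:
        h, r = forward(V, w, s)
        c = -mu * (o - r) * (1.0 - r * r) / len(data)
        for i, hi in enumerate(h):
            g[i] += c * hi
    return g

def grad_V(V, w, data, lam):
    # Step 38 per-pattern gradient, entry (i, j):
    #   ([-(1/2)*mu*(o - r)*(1 - r^2)*w_i + lam*h_i] * (1 - h_i^2)) * s_j
    G = [[0.0] * len(V[0]) for _ in V]
    for mu, s, o in data:
        h, r = forward(V, w, s)
        for i in range(len(V)):
            u = (-0.5 * mu * (o - r) * (1.0 - r * r) * w[i]
                 + lam * h[i]) * (1.0 - h[i] * h[i])
            for j in range(len(s)):
                G[i][j] += u * s[j] / len(data)
    return G

# Invented test data: two hidden units, two inputs, two training patterns.
V = [[0.1, -0.3], [0.2, 0.4]]
w = [0.5, -0.6]
data = [(1.0, [1.0, 0.5], 0.7), (1.0, [-0.4, 0.9], 0.2)]
lam = 0.01
eps = 1e-6

# Finite-difference check of the output-layer gradient (step 17).
for i in range(len(w)):
    wp = list(w); wp[i] += eps
    wm = list(w); wm[i] -= eps
    fd = (l_n(V, wp, data, lam) - l_n(V, wm, data, lam)) / (2 * eps)
    assert abs(fd - grad_w(V, w, data, lam)[i]) < 1e-5

# Finite-difference check of the hidden-layer gradient (step 38).
for i in range(len(V)):
    for j in range(len(V[0])):
        Vp = [row[:] for row in V]; Vp[i][j] += eps
        Vm = [row[:] for row in V]; Vm[i][j] -= eps
        fd = (l_n(Vp, w, data, lam) - l_n(Vm, w, data, lam)) / (2 * eps)
        assert abs(fd - grad_V(V, w, data, lam)[i][j]) < 1e-5
```

The check treats v as the row-major flattening of V, which matches the I ⊗ stT convention used in step 29.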

Problem 5

The first part of the objective function minimizes the mean square error between observable desired responses and actual responses. The second part of the objective function minimizes the number of hidden units in the network. This reduces the effects of overfitting and improves generalization performance [Golden pp105-6].

Yes. "If f is twice-differentiable at every [vector x which is an element of] D then f is twice-differentiable on D and is a twice-differentiable function" [Marlow p198]. For any input in the domain, the second derivative of the objective function exists.

Yes. The objective function is measurable because it is continuous.

Problem 6

In my personal opinion, the most important strength of probability theory and expectation as tools for making rational inferences in environments characterized by uncertainty is that they rest on a solid axiomatic footing compared to more subjective methods such as fuzzy logic. With fuzzy logic, quantifying the fuzzy membership values of the symbols can be somewhat arbitrary. I imagine these fuzzy values must be manually tweaked until the fuzzy logic roughly matches what could have been determined more straightforwardly using expected risk.

The most important limitation is the inability to make fuzzy decisions based on fuzzy classifications. By a "fuzzy classification", I mean varying degrees of membership in a fuzzy set versus all-or-nothing membership in a crisp set [Golden p248]. For example, a probabilistic decision to eat food that might be poisoned will be based on the consequences. I might decide to avoid the food if the risk is high even if the probability is low. A fuzzy decision would provide the alternative option of eating just a little if the classification is determined to be "mostly safe".

Generalization is the ability to make classification decisions for stimuli not previously seen based on similarity to previous stimuli and their classifications. The problem with this definition is that "similarity" is vague. An "appropriate generalization" for a given probability space should be based on minimizing the expected risk [Golden p276].


References

  • Golden, Richard M., Mathematical Methods for Neural Network Analysis and Design, MIT Press, 1996.
  • Haykin, Simon, Neural Networks: A Comprehensive Foundation, 2nd Ed., Prentice Hall, 1999.
  • Marlow, W. H., Mathematics for Operations Research, Dover Publications, 1978.