Final

David Wallace Croft

Neural Net Mathematics
Richard M. Golden, Ph.D.
U.T. Dallas

2005-04-28


Problem 1

  1. Objective Function

    c(w) = μ1 * ( o1 - r1 )^2 + μ2 * ( o2 - r2 )^2

  2. Response

    rk = r( w , sk ) = exp( - wT * sk )

  3. Weight Update Rule

    w(t+1) = w(t) + Δw

  4. Gradient Descent

    Δw = - η * ∇c(w)

  5. Error

    ek = ok - rk

  6. Error Squared

    fk = ek^2

  7. Exponent

    gk = - wT * sk

  8. Vectors

    μ = [ μ1 μ2 ]T , e = [ e1 e2 ]T , f = [ f1 f2 ]T , g = [ g1 g2 ]T , o = [ o1 o2 ]T , r = [ r1 r2 ]T

  9. Stimulus Matrix

    S = [ s1 s2 ]

  10. Objective Function Revised

    c(w) = μT * f = μT * ( e ⊙ e ) , where ⊙ denotes the elementwise (Hadamard) product

  11. Chain Rule

    ∇c(w) = c'(w)T = [ c'(f) * f'(e) * e'(r) * r'(g) * g'(w) ]T

  12. First Term

    c'(f) = ( μ T * f ) ' = μT

  13. Second Term

    f'(e) = ( e ⊙ e )' = 2 * diag(e)

  14. Third Term

    e'(r) = ( o - r ) ' = -I

  15. Fourth Term

    r'(g) = ( [ exp( g1 ) exp( g2 ) ]T )'(g) = diag( [ exp( g1 ) exp( g2 ) ] ) = diag( r )

  16. Fifth Term

    g'(w) = ( - [ s1 s2 ]T * w )' = - [ s1 s2 ]T = - ST

  17. Combined Terms

    c'(w) = [ μT ] * [ 2 * diag(e) ] * [ -I ] * [ diag( r ) ] * [ - ST ]

  18. Reduce

    c'(w) = 2 * μT * diag( e ) * diag( r ) * ST = 2 * ( μ ⊙ e ⊙ r )T * ST

  19. Transpose

    ∇c(w) = c'(w)T = 2 * S * ( μ ⊙ e ⊙ r )

  20. Simplify

    ∇c(w) = 2 * [ s1 s2 ] * [ μ1 * e1 * r1 μ2 * e2 * r2 ]T

  21. Simplify More

    ∇c(w) = 2 * [ μ1 * ( o1 - r1 ) * r1 * s1 + μ2 * ( o2 - r2 ) * r2 * s2 ]

  22. Weight Update Rule Final

    w(t+1) = w(t) - 2 * η * [ μ1 * ( o1 - r1 ) * r1 * s1 + μ2 * ( o2 - r2 ) * r2 * s2 ]
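As a sanity check (not part of the assignment), the closed-form gradient in step 21 can be compared against a central finite-difference approximation of c(w). The sketch below is plain Python; all numeric values for w, μ, o, and the stimuli are invented test data.

```python
import math

# Numerical sanity check of the gradient in step 21:
#   grad c(w) = 2 * [ mu1*(o1 - r1)*r1*s1 + mu2*(o2 - r2)*r2*s2 ]
# All numeric values below are invented test data.

def response(w, s):
    # r_k = exp(-w^T s_k), from step 2
    return math.exp(-sum(wi * si for wi, si in zip(w, s)))

def cost(w, mu, o, S):
    # c(w) = sum_k mu_k * (o_k - r_k)^2, from step 1
    return sum(m * (ok - response(w, s)) ** 2
               for m, ok, s in zip(mu, o, S))

def analytic_gradient(w, mu, o, S):
    g = [0.0] * len(w)
    for m, ok, s in zip(mu, o, S):
        r = response(w, s)
        coeff = 2.0 * m * (ok - r) * r
        for i, si in enumerate(s):
            g[i] += coeff * si
    return g

def numeric_gradient(w, mu, o, S, h=1e-6):
    # Central finite differences of c(w)
    g = []
    for i in range(len(w)):
        wp = list(w); wp[i] += h
        wm = list(w); wm[i] -= h
        g.append((cost(wp, mu, o, S) - cost(wm, mu, o, S)) / (2 * h))
    return g

w = [0.3, -0.2]
mu = [1.0, 0.5]
o = [0.8, 0.4]
S = [[1.0, 2.0], [0.5, -1.0]]  # columns s_1, s_2 of the stimulus matrix

ga = analytic_gradient(w, mu, o, S)
gn = numeric_gradient(w, mu, o, S)
assert all(abs(a - b) < 1e-5 for a, b in zip(ga, gn))
```

The two gradients agree to finite-difference accuracy, which supports the sign conventions carried through steps 11 to 21.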

Problem 2

  1. Newton-Raphson Descent

    w(t+1) = w(t) - η * [ ∇²c(w) ]^-1 * ∇c(w)

  2. Hessian (Derivation)

    ∇²c(w) = ∂²c(w) / ∂w∂wT = ∂c'(w) / ∂w = ∂[ ∇c(w) ]T / ∂w

  3. From Problem 1

    c'(w) = 2 * ( μ ⊙ e ⊙ r )T * ST

  4. Define y

    y = μ ⊙ e ⊙ r

  5. Define F

    F = ST

  6. Derivative of c in terms of y and F

    c'(w) = 2 * y T * F

  7. Hessian in terms of y and F

    ∇²c(w) = c''(w) = 2 * ( yT * F )'

  8. Identity from Marlow p216

    ( yT * F )' = ( yT ⊗ I ) * F'(w) + FT * y'(w) , where ⊗ denotes the Kronecker product

  9. Zero term

    F'( w) = 0

  10. Apply identity

    ∇²c(w) = 2 * FT * y'(w)

  11. Chain rule

    y'( w) = y' (r) * r' (g) * g' (w)

  12. Function of r

    y'(r) = ( μ ⊙ e ⊙ r )'(r) = ( μ ⊙ ( o - r ) ⊙ r )'(r) = diag( μ ⊙ [ o - 2 * r ] )

  13. From Problem 1

    r'(g) = diag( r ) , g'(w) = - ST

  14. Combined terms

    y'(w) = diag( μ ⊙ [ o - 2 * r ] ) * diag( r ) * ( - ST ) = - diag( μ ⊙ [ o - 2 * r ] ⊙ r ) * ST

  15. Hessian

    ∇²c(w) = 2 * FT * ( - diag( μ ⊙ [ o - 2 * r ] ⊙ r ) * ST ) = -2 * S * diag( μ ⊙ [ o - 2 * r ] ⊙ r ) * ST

  16. Newton-Raphson Descent requires the inverse of the Hessian. The inverse of the Hessian exists if the Hessian is positive definite [Haykin p151]. The Hessian is positive definite if the input signal vectors (s) span the d-dimensional real vector space [Golden p367].
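The Hessian in step 15 can also be checked numerically. The plain-Python sketch below (invented test data, reusing the Problem 1 setup) compares the closed form -2 * S * diag( μ ⊙ [ o - 2r ] ⊙ r ) * ST against central finite differences of the Problem 1 gradient.

```python
import math

# Numerical check of the Hessian in step 15:
#   H = -2 * S * diag( mu*(o - 2r)*r ) * S^T
# against central finite differences of the Problem 1 gradient.
# All numeric values are invented test data.

def response(w, s):
    return math.exp(-sum(wi * si for wi, si in zip(w, s)))

def gradient(w, mu, o, S):
    # grad c(w) = 2 * sum_k mu_k*(o_k - r_k)*r_k*s_k  (Problem 1, step 21)
    g = [0.0] * len(w)
    for m, ok, s in zip(mu, o, S):
        r = response(w, s)
        c = 2.0 * m * (ok - r) * r
        for i, si in enumerate(s):
            g[i] += c * si
    return g

def analytic_hessian(w, mu, o, S):
    d = len(w)
    H = [[0.0] * d for _ in range(d)]
    for m, ok, s in zip(mu, o, S):
        r = response(w, s)
        c = -2.0 * m * (ok - 2.0 * r) * r
        for i in range(d):
            for j in range(d):
                H[i][j] += c * s[i] * s[j]
    return H

def numeric_hessian(w, mu, o, S, h=1e-6):
    # Finite differences of the analytic gradient, one column at a time.
    d = len(w)
    H = [[0.0] * d for _ in range(d)]
    for j in range(d):
        wp = list(w); wp[j] += h
        wm = list(w); wm[j] -= h
        gp, gm = gradient(wp, mu, o, S), gradient(wm, mu, o, S)
        for i in range(d):
            H[i][j] = (gp[i] - gm[i]) / (2 * h)
    return H

w = [0.3, -0.2]
mu = [1.0, 0.5]
o = [0.8, 0.4]
S = [[1.0, 2.0], [0.5, -1.0]]  # s_1, s_2

Ha = analytic_hessian(w, mu, o, S)
Hn = numeric_hessian(w, mu, o, S)
assert all(abs(a - b) < 1e-4
           for ra, rb in zip(Ha, Hn) for a, b in zip(ra, rb))
```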

Problem 3

  1. Probability Space

    ( Ω x , F x , μx ) , μx ( Ω x ) = 1

  2. Dominating Measure

    μx = μs * μo * μμ

  3. Joint Density

    ps,o*( s~t , o~t ) = p( ot | st ) * p( st )

  4. Measurable Function

    g( x~ ) = - μ~t * log( ps,o*( s~t , o~t ) )

  5. Expectation

    E{ g( x~ ) } = ∫ (x ∈ Ωx) g( x ) * px*( x ) dμx( x )

  6. Summations and Riemann Integrals

    E{ g( x~ ) } = Σ (s ∈ Ωs) Σ (μ ∈ {0,1}) ∫ (over o) - μt * log( ps,o*( st , ot ) ) * ps,o*( st , ot ) * pμ do

px* is a measurable function because it is piecewise continuous on R^d.
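To make step 6 concrete, the expectation can be evaluated numerically for a toy probability model. The densities below (a two-point stimulus space, a uniform and a linear conditional density on [0, 1], and a Bernoulli mask) are invented placeholders, since the assignment leaves the model unspecified; only the summation-plus-Riemann-integral structure is from the derivation.

```python
import math

# Toy numerical evaluation of step 6:
#   E{g} = sum over s in Omega_s, mu in {0,1}, integral over o of
#          -mu * log(p(s,o)) * p(s,o) * p_mu  do
# The densities below are invented placeholders for illustration.

omega_s = [0, 1]            # toy finite stimulus space
p_s = {0: 0.4, 1: 0.6}      # invented stimulus probabilities p(s)
p_mu = {0: 0.2, 1: 0.8}     # invented mask probabilities P(mu)

def p_o_given_s(o, s):
    # Invented conditional densities on [0, 1]:
    # uniform for s = 0, linear (2*o) for s = 1; each integrates to 1.
    return 1.0 if s == 0 else 2.0 * o

def expectation(n_grid=10000):
    # Midpoint Riemann sum over o, exact sums over s and mu.
    total = 0.0
    do = 1.0 / n_grid
    for s in omega_s:
        for mu in (0, 1):
            for k in range(n_grid):
                o = (k + 0.5) * do
                p_so = p_o_given_s(o, s) * p_s[s]
                if p_so > 0.0:
                    total += -mu * math.log(p_so) * p_so * p_mu[mu] * do
    return total
```

For this toy model the mu = 0 branch contributes nothing, and the result reduces to 0.8 times the cross-entropy integral over the joint density.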

Problem 4

  1. Objective Function

    ln( ρ ) = (1/n) * Σ t=1..n [ μt * ( ot - rt )^2 + λ * ‖ht‖^2 ]

  2. Squashing Function

    σ( x ) = 2 / ( 1 + e^-x ) - 1 = ( 1 - e^-x ) / ( 1 + e^-x ) = tanh( x/2 )

  3. Derivative of Squashing Function

    σ'( x ) = 2 * [ - ( 1 + e^-x )^-2 ] * [ - e^-x ] = (1/2) * 4 * e^-x / ( 1 + e^-x )^2 = (1/2) * [ ( 1 + e^-x )^2 - ( 1 - e^-x )^2 ] / ( 1 + e^-x )^2 = (1/2) * ( 1 - σ( x )^2 )

  4. Weight Update Rule for Output Layer

    w(k+1) = w(k) - η * ∇ln( w )

  5. Weighted Sum to Output Neuron

    at = wT * ht

  6. Output Neuron Response

    rt = σ ( at )

  7. Error

    et = ot - rt

  8. Error Squared

    ft = et^2

  9. Observable Error Squared

    gt = μt * ft

  10. Chain Rule

    gt ' (w) = gt ' (ft) * ft ' (et) * et ' (rt) * rt ' (at) * at ' (w)

  11. First Term

    gt ' (ft) = μt

  12. Second Term

    ft ' (et) = 2 * et

  13. Third Term

    et ' (rt) = -1

  14. Fourth Term

    rt'(at) = (1/2) * ( 1 - rt^2 )

  15. Fifth Term

    at ' (w) = htT

  16. Combined Terms

    gt'(w) = μt * 2 * et * (-1) * (1/2) * ( 1 - rt^2 ) * htT = - μt * ( ot - rt ) * ( 1 - rt^2 ) * htT

  17. Weight Gradient for Output Layer

    ∇ln( w ) = (1/n) * Σ t=1..n [ - μt * ( ot - rt ) * ( 1 - rt^2 ) * ht ]

  18. Weight Update Rule for Output Layer Complete

    w(k+1) = w(k) + η * (1/n) * Σ t=1..n [ μt * ( ot - rt ) * ( 1 - rt^2 ) * ht ]

  19. Hidden Layer Weight Vector

    v = vec ( V )

  20. Weight Update Rule for Hidden Layer

    v(k+1) = v(k) - η * ∇ln( v )

  21. Weighted Sum to Hidden Layer

    bt = V * st

  22. Hidden Layer

    ht = σ ( bt )

  23. Normalization Term

    ct = λ * ‖ht‖^2 = λ * htT * ht

  24. Error Minimization Term plus Normalization Term

    dt = μt * ( ot - rt )^2 + λ * ‖ht‖^2 = gt + ct

  25. Chain Rule

    dt ' (v) = gt ' (v) + ct ' (v)

  26. Chain Rule for Error Minimization Term

    gt ' (v) = gt ' (ft) * ft ' (et) * et ' (rt) * rt ' (at) * at ' (ht) * ht ' (bt) * bt ' (v)

  27. Fifth Term

    at'(ht) = wT

  28. Sixth Term

    ht'(bt) = (1/2) * diag( 1 - ht ⊙ ht )

  29. Seventh Term

    bt'(v) = I ⊗ stT

  30. Combined Terms for Error Minimization

    gt'(v) = μt * 2 * et * (-1) * (1/2) * ( 1 - rt^2 ) * wT * (1/2) * diag( 1 - ht ⊙ ht ) * [ I ⊗ stT ]

  31. Reduce

    gt'(v) = - (1/2) * μt * ( ot - rt ) * ( 1 - rt^2 ) * [ w ⊙ ( 1 - ht ⊙ ht ) ]T ⊗ stT

  32. Chain Rule for Normalization Term

    ct ' (v) = ct ' (ht) * ht ' (bt) * bt ' (v)

  33. First Term for Normalization

    ct ' (ht) = 2 * λ * htT

  34. Combined Terms for Normalization

    ct'(v) = 2 * λ * htT * (1/2) * diag( 1 - ht ⊙ ht ) * [ I ⊗ stT ]

  35. Reduce

    ct'(v) = λ * [ ht ⊙ ( 1 - ht ⊙ ht ) ]T ⊗ stT

  36. Derivative of Both Terms

    dt'(v) = - (1/2) * μt * ( ot - rt ) * ( 1 - rt^2 ) * [ w ⊙ ( 1 - ht ⊙ ht ) ]T ⊗ stT + λ * [ ht ⊙ ( 1 - ht ⊙ ht ) ]T ⊗ stT

  37. Reduce

    dt'(v) = ( [ - (1/2) * μt * ( ot - rt ) * ( 1 - rt^2 ) * w + λ * ht ]T ⊙ [ 1 - ht ⊙ ht ]T ) ⊗ stT

  38. Weight Gradient for Hidden Layer

    ∇ln( v ) = (1/n) * Σ t=1..n [ ( [ - (1/2) * μt * ( ot - rt ) * ( 1 - rt^2 ) * w + λ * ht ] ⊙ [ 1 - ht ⊙ ht ] ) ⊗ st ]

  39. Weight Update Rule for Hidden Layer Complete

    v(k+1) = v(k) + η * (1/n) * Σ t=1..n [ ( [ (1/2) * μt * ( ot - rt ) * ( 1 - rt^2 ) * w - λ * ht ] ⊙ [ 1 - ht ⊙ ht ] ) ⊗ st ]
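The output-layer gradient (step 17) and the hidden-layer gradient (step 38) can both be sanity-checked numerically against the objective in step 1. The plain-Python sketch below uses a two-unit hidden layer and invented weights, stimuli, and targets; it is an illustration, not part of the assignment.

```python
import math

# Numerical check of the gradients in steps 17 and 38 against
# central finite differences of the objective l_n. Network sizes,
# weights, and data below are invented test values.

def sigma(x):
    # Squashing function from step 2: sigma(x) = 2/(1 + e^-x) - 1
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

def forward(V, w, s):
    # Hidden layer h = sigma(V*s), output r = sigma(w^T h)
    h = [sigma(sum(vij * sj for vij, sj in zip(row, s))) for row in V]
    r = sigma(sum(wi * hi for wi, hi in zip(w, h)))
    return h, r

def l_n(V, w, data, lam):
    # Objective from step 1: (1/n) sum_t [mu_t*(o_t - r_t)^2 + lam*||h_t||^2]
    total = 0.0
    for mu, s, o in data:
        h, r = forward(V, w, s)
        total += mu * (o - r) ** 2 + lam * sum(hi * hi for hi in h)
    return total / len(data)

def grad_w(V, w, data, lam):
    # Step 17: grad l_n(w) = (1/n) sum_t [-mu_t*(o_t - r_t)*(1 - r_t^2)*h_t]
    g = [0.0] * len(w)
    for mu, s, o in data:
        h, r = forward(V, w, s)
        c = -mu * (o - r) * (1.0 - r * r) / len(data)
        for i, hi in enumerate(h):
            g[i] += c * hi
    return g

def grad_V(V, w, data, lam):
    # Step 38 per-pattern gradient, entry (i, j):
    #   ([-(1/2)*mu*(o - r)*(1 - r^2)*w_i + lam*h_i] * (1 - h_i^2)) * s_j
    G = [[0.0] * len(V[0]) for _ in V]
    for mu, s, o in data:
        h, r = forward(V, w, s)
        for i in range(len(V)):
            u = (-0.5 * mu * (o - r) * (1.0 - r * r) * w[i]
                 + lam * h[i]) * (1.0 - h[i] * h[i])
            for j in range(len(s)):
                G[i][j] += u * s[j] / len(data)
    return G

# Invented test data: two hidden units, two inputs, two training patterns.
V = [[0.1, -0.3], [0.2, 0.4]]
w = [0.5, -0.6]
data = [(1.0, [1.0, 0.5], 0.7), (1.0, [-0.4, 0.9], 0.2)]
lam = 0.01
eps = 1e-6

# Finite-difference check of the output-layer gradient (step 17).
for i in range(len(w)):
    wp = list(w); wp[i] += eps
    wm = list(w); wm[i] -= eps
    fd = (l_n(V, wp, data, lam) - l_n(V, wm, data, lam)) / (2 * eps)
    assert abs(fd - grad_w(V, w, data, lam)[i]) < 1e-5

# Finite-difference check of the hidden-layer gradient (step 38).
for i in range(len(V)):
    for j in range(len(V[0])):
        Vp = [row[:] for row in V]; Vp[i][j] += eps
        Vm = [row[:] for row in V]; Vm[i][j] -= eps
        fd = (l_n(Vp, w, data, lam) - l_n(Vm, w, data, lam)) / (2 * eps)
        assert abs(fd - grad_V(V, w, data, lam)[i][j]) < 1e-5
```

The check treats v as the row-major flattening of V, which matches the I ⊗ stT convention used in step 29.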

Problem 5

The first part of the objective function minimizes the mean square error between observable desired responses and actual responses. The second part of the objective function minimizes the number of hidden units in the network. This reduces the effects of overfitting and improves generalization performance [Golden pp105-6].

Yes. "If f is twice-differentiable at every [vector x which is an element of] D then f is twice-differentiable on D and is a twice-differentiable function" [Marlow p198]. For any input in the domain, the second derivative of the objective function exists.

Yes. The objective function is measurable because it is continuous.

Problem 6

In my personal opinion, the most important strength of probability theory and expectation as tools for making rational inferences in environments characterized by uncertainty is that they rest on a solid axiomatic footing compared to more subjective methods such as fuzzy logic. With fuzzy logic, quantifying the fuzzy membership values of the symbols can be somewhat arbitrary. I imagine these fuzzy values must be manually tweaked until the fuzzy logic roughly matches what could have been determined more straightforwardly using expected risk.

The most important limitation is the inability to make fuzzy decisions based on fuzzy classifications. By a "fuzzy classification", I mean varying degrees of membership in a fuzzy set versus all-or-nothing membership in a crisp set [Golden p248]. For example, a probabilistic decision to eat food that might be poisoned will be based on the consequences. I might decide to avoid the food if the risk is high even if the probability is low. A fuzzy decision would provide the alternative option of eating just a little if the classification is determined to be "mostly safe".

Generalization is the ability to make classification decisions for stimuli not previously seen based on similarity to previous stimuli and their classifications. The problem with this definition is that "similarity" is vague. An "appropriate generalization" for a given probability space should be based on minimizing the expected risk [Golden p276].


References

  • Golden, Richard M., Mathematical Methods for Neural Network Analysis and Design, MIT Press, 1996.
  • Haykin, Simon, Neural Networks: A Comprehensive Foundation, 2nd Ed., Prentice Hall, 1999.
  • Marlow, W. H., Mathematics for Operations Research, Dover Publications, 1978.