Midterm

David Wallace Croft

Neural Net Mathematics
Richard M. Golden, Ph.D.
U.T. Dallas

2005-02-28

Problem 1

  1. Objective Function

    c(w) = \mu_1 (o_1 - r_1)^2 + \mu_2 (o_2 - r_2)^2

  2. Response

    r_k = w^T s_k

  3. Weight Update Rule

    w^{(t+1)} = w^{(t)} + \Delta w

  4. Gradient Descent

    \Delta w = -\eta \, \nabla c(w)

  5. Error

    e_k = o_k - r_k

  6. Error Squared

    f_k = e_k^2

  7. Vectors

    \mu = [\mu_1 \ \mu_2]^T, \quad e = [e_1 \ e_2]^T, \quad f = [f_1 \ f_2]^T, \quad o = [o_1 \ o_2]^T, \quad r = [r_1 \ r_2]^T

  8. Objective Function Revised

    c(w) = \mu^T f = \mu^T (e \odot e)

  9. Chain Rule

    \nabla c(w) = c'(w)^T = [\, c'(f) \, f'(e) \, e'(r) \, r'(w) \,]^T

  10. First Term

    c'(f) = (\mu^T f)' = \mu^T

  11. Second Term

    f'(e) = (e \odot e)' = 2 \, \mathrm{diag}(e)

  12. Third Term

    e'(r) = (o - r)' = -I

  13. Fourth Term

    r'(w) = \left( \begin{bmatrix} s_1^T \\ s_2^T \end{bmatrix} w \right)' = \begin{bmatrix} s_1^T \\ s_2^T \end{bmatrix}

  14. Combined Terms

    \nabla c(w) = \left[ \mu^T \cdot 2 \, \mathrm{diag}(e) \cdot (-I) \cdot \begin{bmatrix} s_1^T \\ s_2^T \end{bmatrix} \right]^T

  15. Reduce

    \nabla c(w) = \left[ -2 \, [\mu_1 e_1 \ \ \mu_2 e_2] \begin{bmatrix} s_1^T \\ s_2^T \end{bmatrix} \right]^T

  16. Transpose

    \nabla c(w) = -2 \, [s_1 \ \ s_2] \begin{bmatrix} \mu_1 e_1 \\ \mu_2 e_2 \end{bmatrix}

  17. Reduce More

    \nabla c(w) = -2 \, [\, \mu_1 (o_1 - r_1) s_1 + \mu_2 (o_2 - r_2) s_2 \,]

  18. Weight Update Rule Final

    w^{(t+1)} = w^{(t)} + 2 \eta \, [\, \mu_1 (o_1 - r_1) s_1 + \mu_2 (o_2 - r_2) s_2 \,]
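
  19. Numerical Sketch

    An illustrative check of steps 8 and 18, sketched in Python with NumPy: it evaluates the
    objective in the vector form c(w) = μᵀ(e ⊙ e) and iterates the final weight update rule.
    The particular values of s1, s2, o1, o2, μ, and η below are made-up examples, not values
    from the exam.

    import numpy as np

    # Made-up example data: two input patterns, desired responses, and selection weights.
    s1, s2 = np.array([1.0, 2.0, 3.0]), np.array([0.5, -1.0, 2.0])
    o = np.array([1.0, -1.0])
    mu = np.array([1.0, 1.0])
    eta = 0.01                                       # learning rate (assumed)
    w = np.zeros(3)                                  # initial weight vector

    for _ in range(2000):
        r = np.array([w @ s1, w @ s2])               # r_k = w^T s_k
        e = o - r                                    # e_k = o_k - r_k
        c = mu @ (e * e)                             # c(w) = mu^T (e ⊙ e), step 8
        # Step 18: w <- w + 2*eta*( mu_1*e_1*s_1 + mu_2*e_2*s_2 )
        w = w + 2.0 * eta * (mu[0] * e[0] * s1 + mu[1] * e[1] * s2)

    print(w, c)                                      # c approaches zero for this example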

Problem 2

  1. Taylor Series

    f(t + \Delta t) = \sum_{k=0}^{\infty} \frac{(\Delta t)^k}{k!} f^{(k)}(t)

  2. Taylor Series for a Function of a Vector

    c(w + \Delta w) = c(w) + \nabla c(w)^T \Delta w + \tfrac{1}{2} \Delta w^T \, \nabla^2 c(w) \, \Delta w + \ldots

  3. Function Delta

    \Delta c(w) = c(w + \Delta w) - c(w) \approx \nabla c(w)^T \Delta w + \tfrac{1}{2} \Delta w^T \, \nabla^2 c(w) \, \Delta w

  4. Minimize the Function Delta

    (\Delta c)'(\Delta w) \approx \nabla c(w)^T + \Delta w^T \, \nabla^2 c(w) = 0

  5. Solve for the Weight Delta, Step 1

    \Delta w^T \, \nabla^2 c(w) = -\nabla c(w)^T

  6. Solve for the Weight Delta, Step 2

    \nabla^2 c(w) \, \Delta w = -\nabla c(w)

  7. Solve for the Weight Delta, Step 3

    \Delta w = -\left[ \nabla^2 c(w) \right]^{-1} \nabla c(w)

  8. Newton-Raphson Descent

    w^{(t+1)} = w^{(t)} - \eta \left[ \nabla^2 c(w) \right]^{-1} \nabla c(w)

  9. Hessian (Derivation)

    \nabla^2 c(w) = \frac{\partial^2 c(w)}{\partial w \, \partial w} = \frac{\partial \, c'(w)}{\partial w} = \frac{\partial \, [\nabla c(w)]^T}{\partial w}

  10. From Problem 1

    \nabla c(w) = \left[ -2 \, [\mu_1 e_1 \ \ \mu_2 e_2] \begin{bmatrix} s_1^T \\ s_2^T \end{bmatrix} \right]^T

  11. Define y

    y = [\mu_1 e_1 \ \ \mu_2 e_2]^T

  12. Define F

    F = \begin{bmatrix} s_1^T \\ s_2^T \end{bmatrix}

  13. Derivative of c in terms of y and F

    c'(w) = -2 \, y^T F

  14. Identity from Marlow p216

    (y^T F)'(w) = (y^T \otimes I_s) \, F'(w) + F^T y'(w)

  15. Hessian in terms of y and F

    \nabla^2 c(w) = -2 \, F^T \, y'(e) \, e'(r) \, r'(w)

  16. Hessian (more)

    \nabla^2 c(w) = -2 \, F^T \mathrm{diag}(\mu) \, (-I) \, F

  17. Hessian (more 2)

    \nabla^2 c(w) = 2 \, F^T \mathrm{diag}(\mu) \, F

  18. Hessian (more 3)

    \nabla^2 c(w) = 2 \, [s_1 \ \ s_2] \, \mathrm{diag}(\mu) \, [s_1 \ \ s_2]^T

  19. Newton-Raphson Descent requires the inverse of the Hessian. The inverse of the Hessian exists if the Hessian is positive definite [Haykin p151]. The Hessian is positive definite if the input signal vectors (s) span the d-dimensional real vector space [Golden p367].
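
  20. Numerical Sketch

    A minimal sketch, in Python with NumPy, of steps 8 and 16-19: it forms the Hessian
    2 [s1 s2] diag(μ) [s1 s2]ᵀ, tests positive definiteness with a Cholesky factorization, and
    takes one Newton-Raphson step using a linear solve rather than an explicit matrix inverse.
    The example values of s1, s2, o, and μ are assumptions for illustration.

    import numpy as np

    # Made-up example data; with d = 2 the two signal vectors below span R^2.
    s1, s2 = np.array([1.0, 0.5]), np.array([0.0, 2.0])
    o = np.array([1.0, -1.0])
    mu = np.array([1.0, 1.0])
    eta = 1.0
    w = np.zeros(2)

    S = np.column_stack([s1, s2])                    # [s1 s2], shape (d, 2)
    H = 2.0 * S @ np.diag(mu) @ S.T                  # Hessian 2 [s1 s2] diag(mu) [s1 s2]^T

    try:
        np.linalg.cholesky(H)                        # succeeds only if H is positive definite
        r = S.T @ w                                  # responses r_k = w^T s_k
        grad = -2.0 * S @ (mu * (o - r))             # gradient of c(w) from Problem 1, step 16
        w = w - eta * np.linalg.solve(H, grad)       # Newton-Raphson step (step 8)
        print(w, S.T @ w)                            # one full step fits o exactly here
    except np.linalg.LinAlgError:
        print("Hessian not positive definite; Newton-Raphson step undefined.")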

Problem 3

  1. Input

    h_t = [\, s_t^T, \ o_{t-1}, \ 1 \,]^T

  2. Response at time t

    r_t = e^{-\| w - h_t \|^2}

  3. Objective Function

    l_n(w) = \frac{1}{n} \sum_{t=1}^{n} \left[ \mu_t (o_t - r_t)^2 + \lambda (r_t - r_{t-1})^2 \right]

  4. Semantic Interpretation

    The first part of the objective function penalizes the mean squared error between the
    observable desired responses and the actual responses. The second part penalizes the
    difference between the current response and the immediately preceding response; it acts as a
    temporal smoothing term that suppresses high-frequency components of the response sequence
    (see the numerical sketch at the end of this problem).

  5. Observable Error

    m_t = \mu_t (o_t - r_t)

  6. Observable Errors Over Observation Time

    m = [\, m_1 \ m_2 \ \ldots \ m_n \,]^T = \mu \odot (o - r)

  7. Zero or One Squared

    \mu_t^2 = \mu_t

  8. Square Error

    a = m^T m

  9. Change in Response

    c_t = r_t - r_{t-1}

  10. Changes in Response Over Observation Time

    c = [\, c_1 \ c_2 \ \ldots \ c_n \,]^T

  11. Sum of Square of Changes

    b = c^T c

  12. Objective Function revised

    l_n(w) = \frac{1}{n} \, [\, a + \lambda b \,]

  13. Derivative

    l_n'(w) = \frac{1}{n} \, [\, a'(w) + \lambda \, b'(w) \,]

  14. Difference

    f_t = w - h_t

  15. Negative Distance

    k_t = -f_t^T f_t

  16. Derivative of the First Part

    a'(w) = a'(m) \, m'(r) \, r'(k) \, k'(f) \, f'(w)

  17. First Term

    a'(m) = 2 \, m^T = 2 \, [\mu \odot (o - r)]^T

  18. Diagonal Matrices

    m'(r) \, r'(k) = -\mathrm{diag}(\mu) \, \mathrm{diag}(r) = -\mathrm{diag}(\mu \odot r)

  19. First Three Terms

    a'(m) \, m'(r) \, r'(k) = -2 \, m^T \mathrm{diag}(\mu \odot r) = -2 \, [\mu \odot (o - r) \odot \mu \odot r]^T = -2 \, [\mu \odot (o - r) \odot r]^T

  20. Vector f

    f = \mathrm{vec}(F) = \mathrm{vec}([\, f_1 \ f_2 \ \ldots \ f_n \,]) = [\, f_1^T \ f_2^T \ \ldots \ f_n^T \,]^T

  21. Fourth Term

    k'(f) = -2 \begin{bmatrix}
        f_1^T & 0^T & \cdots & 0^T \\
        0^T & f_2^T & \cdots & 0^T \\
        \vdots & \vdots & \ddots & \vdots \\
        0^T & 0^T & \cdots & f_n^T
    \end{bmatrix}_{n \times n(d+2)}

  22. Fifth Term

    f'(w) = \begin{bmatrix} I_{d+2} \\ I_{d+2} \\ \vdots \\ I_{d+2} \end{bmatrix}_{n(d+2) \times (d+2)} = 1_n \otimes I_{d+2}

  23. Fourth and Fifth Terms Combined

    k'(f) \, f'(w) = -2 \begin{bmatrix} f_1^T \\ f_2^T \\ \vdots \\ f_n^T \end{bmatrix}_{n \times (d+2)}

  24. First Part

    a'(w) = a'(m) \, m'(r) \, r'(k) \, k'(f) \, f'(w) = 4 \, [\mu \odot (o - r) \odot r]^T \begin{bmatrix} f_1^T \\ f_2^T \\ \vdots \\ f_n^T \end{bmatrix}_{n \times (d+2)}

  25. Derivative of the Second Part

    b'(w) = b'(c) \, c'(r) \, r'(k) \, k'(f) \, f'(w)

  26. Second Part, First Term

    b'(c) = 2 \, c^T

  27. Temporal Smoothing Term

    c'(r) = \begin{bmatrix}
        1 & 0 & 0 & \cdots & 0 & 0 \\
        -1 & 1 & 0 & \cdots & 0 & 0 \\
        0 & -1 & 1 & \cdots & 0 & 0 \\
        \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
        0 & 0 & 0 & \cdots & -1 & 1
    \end{bmatrix}

  28. Two Terms

    c'(r) \, r'(k) = \begin{bmatrix}
        1 & 0 & 0 & \cdots & 0 & 0 \\
        -1 & 1 & 0 & \cdots & 0 & 0 \\
        0 & -1 & 1 & \cdots & 0 & 0 \\
        \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
        0 & 0 & 0 & \cdots & -1 & 1
    \end{bmatrix} \mathrm{diag}(r) = \begin{bmatrix}
        r_1 & 0 & 0 & \cdots & 0 & 0 \\
        -r_1 & r_2 & 0 & \cdots & 0 & 0 \\
        0 & -r_2 & r_3 & \cdots & 0 & 0 \\
        \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
        0 & 0 & 0 & \cdots & -r_{n-1} & r_n
    \end{bmatrix}

  29. Second Part, All Terms

    b'(c) \, c'(r) \, r'(k) \, k'(f) \, f'(w) = 2 \, c^T \begin{bmatrix}
        r_1 & 0 & \cdots & 0 & 0 \\
        -r_1 & r_2 & \cdots & 0 & 0 \\
        \vdots & \vdots & \ddots & \vdots & \vdots \\
        0 & 0 & \cdots & -r_{n-1} & r_n
    \end{bmatrix} \cdot \left( -2 \begin{bmatrix} f_1^T \\ f_2^T \\ \vdots \\ f_n^T \end{bmatrix}_{n \times (d+2)} \right)

  30. Second Part

    b'(c) \, c'(r) \, r'(k) \, k'(f) \, f'(w) = -4 \, c^T \begin{bmatrix}
        r_1 & 0 & \cdots & 0 & 0 \\
        -r_1 & r_2 & \cdots & 0 & 0 \\
        \vdots & \vdots & \ddots & \vdots & \vdots \\
        0 & 0 & \cdots & -r_{n-1} & r_n
    \end{bmatrix} \begin{bmatrix} f_1^T \\ f_2^T \\ \vdots \\ f_n^T \end{bmatrix}_{n \times (d+2)}

  31. Both Parts

    l_n'(w) = \frac{4}{n} \, [\mu \odot (o - r) \odot r]^T \begin{bmatrix} f_1^T \\ f_2^T \\ \vdots \\ f_n^T \end{bmatrix}_{n \times (d+2)}
        - \frac{4 \lambda}{n} \, c^T \begin{bmatrix}
        r_1 & 0 & \cdots & 0 & 0 \\
        -r_1 & r_2 & \cdots & 0 & 0 \\
        \vdots & \vdots & \ddots & \vdots & \vdots \\
        0 & 0 & \cdots & -r_{n-1} & r_n
    \end{bmatrix} \begin{bmatrix} f_1^T \\ f_2^T \\ \vdots \\ f_n^T \end{bmatrix}_{n \times (d+2)}

  32. Both Parts Again

    l_n'(w) = \frac{4}{n} \left[ [\mu \odot (o - r) \odot r]^T - \lambda \, c^T \begin{bmatrix}
        r_1 & 0 & \cdots & 0 & 0 \\
        -r_1 & r_2 & \cdots & 0 & 0 \\
        \vdots & \vdots & \ddots & \vdots & \vdots \\
        0 & 0 & \cdots & -r_{n-1} & r_n
    \end{bmatrix} \right] \begin{bmatrix} f_1^T \\ f_2^T \\ \vdots \\ f_n^T \end{bmatrix}_{n \times (d+2)}

  33. Gradient is the Transpose

    \nabla l_n(w) = \frac{4}{n} \, [\, w - h_1 \ \ w - h_2 \ \cdots \ \ w - h_n \,]_{(d+2) \times n} \left[ \mu \odot (o - r) \odot r - \lambda \begin{bmatrix}
        r_1 (-r_0 + 2 r_1 - r_2) \\
        r_2 (-r_1 + 2 r_2 - r_3) \\
        \vdots \\
        r_{n-1} (-r_{n-2} + 2 r_{n-1} - r_n) \\
        r_n (r_n - r_{n-1})
    \end{bmatrix} \right]

  34. Weight Update Rule

    w^{(t+1)} = w^{(t)} - \eta \, \nabla l_n(w)
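
  35. Numerical Sketch

    A minimal Python/NumPy sketch of the objective in step 3 together with a central-difference
    approximation that can be compared against the analytic gradient in step 33. The example
    data, the choice of λ, and the initial values r0 = 0 and o0 = 0 are assumptions for
    illustration only.

    import numpy as np

    def l_n(w, s, o, mu, lam, r0=0.0, o0=0.0):
        """Objective of step 3, computed directly from the definition."""
        n = len(o)
        total, r_prev, o_prev = 0.0, r0, o0
        for t in range(n):
            h = np.concatenate([s[t], [o_prev, 1.0]])    # h_t = [s_t^T, o_{t-1}, 1]^T
            r = np.exp(-np.sum((w - h) ** 2))            # r_t = exp(-||w - h_t||^2)
            total += mu[t] * (o[t] - r) ** 2 + lam * (r - r_prev) ** 2
            r_prev, o_prev = r, o[t]
        return total / n

    def numerical_gradient(fn, w, eps=1e-6):
        """Central-difference approximation of the gradient of fn at w."""
        g = np.zeros_like(w)
        for i in range(len(w)):
            dw = np.zeros_like(w)
            dw[i] = eps
            g[i] = (fn(w + dw) - fn(w - dw)) / (2.0 * eps)
        return g

    # Made-up example data: n = 4 observations, d = 2 signal components.
    rng = np.random.default_rng(0)
    s, o = rng.normal(size=(4, 2)), rng.normal(size=4)
    mu = np.array([1.0, 0.0, 1.0, 1.0])                  # 0/1 observability flags
    w = rng.normal(size=4)                               # d + 2 = 4 weights
    print(numerical_gradient(lambda v: l_n(v, s, o, mu, lam=0.1), w))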

Problem 4

Yes. "If f is twice-differentiable at every [vector x which is an element of] D then f is twice-differentiable on D and is a twice-differentiable function" [Marlow p198]. The objective function is a finite sum of squares of differences of infinitely differentiable functions of w, so its second derivative exists for every input in the domain.

Problem 5

  1. Newton-Raphson Descent

    w^{(t+1)} = w^{(t)} - \eta \left[ \nabla^2 l_n(w) \right]^{-1} \nabla l_n(w)

  2. Hessian

    \nabla^2 l_n(w) = \frac{\partial \, \nabla l_n(w)}{\partial w}

  3. Identity from Marlow p216

    (F_{(d+2) \times n} \, z)'(w) = F \, z'(w) + (I_{d+2} \otimes z^T) \, F'(w)

  4. Define z

    z = \mu \odot (o - r) \odot r

  5. Derivative of z with Respect to r

    z'(r) = \mathrm{diag}(\mu \odot (o - 2 r))

  6. Derivative of z with Respect to w

    z'(w) = z'(r) \, r'(k) \, k'(f) \, f'(w) = \mathrm{diag}(\mu \odot (o - 2 r)) \, \mathrm{diag}(r) \, (-2 F^T) = -2 \, \mathrm{diag}(\mu \odot (o - 2 r) \odot r) \, F^T

  7. Identity from Marlow p211

    F'(w) = \sum_{t=1}^{n} \left( f_t'(w) \otimes e_t \right)

  8. Derivative of f at time t

    f_t'(w) = I_{d+2}

  9. Derivative of F

    F'(w) = [\, I_{d+2} \otimes 1_n \,]_{(d+2) n \times (d+2)}

  10. Combining Terms

    (I_{d+2} \otimes z^T) \, F'(w) = (I_{d+2} \otimes z^T) \, [\, I_{d+2} \otimes 1_n \,]_{(d+2) n \times (d+2)} = \left( \sum_{t=1}^{n} z_t \right) I_{d+2}

  11. First Part

    (F_{(d+2) \times n} \, z)'(w) = -2 \, F \, \mathrm{diag}(\mu \odot (o - 2 r) \odot r) \, F^T + \left( \sum_{t=1}^{n} \mu_t (o_t - r_t) \, r_t \right) I_{d+2}

  12. Define p

    p = \begin{bmatrix}
        r_1 (-r_0 + 2 r_1 - r_2) \\
        r_2 (-r_1 + 2 r_2 - r_3) \\
        \vdots \\
        r_{n-1} (-r_{n-2} + 2 r_{n-1} - r_n) \\
        r_n (r_n - r_{n-1})
    \end{bmatrix}

  13. Derivative of p with respect to r

    p'(r) = \begin{bmatrix}
        -r_0 + 4 r_1 - r_2 & -r_1 & 0 & \cdots & 0 & 0 \\
        -r_2 & -r_1 + 4 r_2 - r_3 & -r_2 & \cdots & 0 & 0 \\
        \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
        0 & 0 & \cdots & -r_{n-1} & -r_{n-2} + 4 r_{n-1} - r_n & -r_{n-1} \\
        0 & 0 & \cdots & 0 & -r_n & -r_{n-1} + 2 r_n
    \end{bmatrix}

  14. Derivative of F times p

    (F_{(d+2) \times n} \, p)'(w) = F \, p'(w) + (I_{d+2} \otimes p^T) \, F'(w)

  15. Second Part

    (F_{(d+2) \times n} \, p)'(w) = -2 \, F \, p'(r) \, \mathrm{diag}(r) \, F^T + \left( \sum_{t=1}^{n} p_t \right) I_{d+2}

  16. Hessian from Both Parts

    \nabla^2 l_n(w) = \frac{4}{n} \left( \left[ -2 \, F \, \mathrm{diag}(\mu \odot (o - 2 r) \odot r) \, F^T + \left( \sum_{t=1}^{n} \mu_t (o_t - r_t) \, r_t \right) I_{d+2} \right]
        - \lambda \left[ -2 \, F \, p'(r) \, \mathrm{diag}(r) \, F^T + \left( \sum_{t=1}^{n} p_t \right) I_{d+2} \right] \right)
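
  17. Numerical Sketch

    A minimal Python/NumPy check of the Kronecker-product identity in step 10, plus a
    finite-difference Hessian that can be compared against the analytic Hessian in step 16. The
    sizes and data are arbitrary assumptions; l_n refers to the objective function sketched at
    the end of Problem 3.

    import numpy as np

    # Step 10 with d + 2 = 3 and n = 4: (I ⊗ z^T)(I ⊗ 1_n) = (Σ z_t) I.
    z = np.array([0.3, -1.2, 0.5, 2.0])
    I = np.eye(3)
    lhs = np.kron(I, z.reshape(1, -1)) @ np.kron(I, np.ones((4, 1)))
    print(np.allclose(lhs, z.sum() * I))                 # True

    def numerical_hessian(fn, w, eps=1e-4):
        """Second-difference Hessian of fn at w, for spot-checking step 16."""
        d = len(w)
        H = np.zeros((d, d))
        for i in range(d):
            for j in range(d):
                ei, ej = np.zeros(d), np.zeros(d)
                ei[i], ej[j] = eps, eps
                H[i, j] = (fn(w + ei + ej) - fn(w + ei - ej)
                           - fn(w - ei + ej) + fn(w - ei - ej)) / (4.0 * eps ** 2)
        return H

    # Example: H = numerical_hessian(lambda v: l_n(v, s, o, mu, lam=0.1), w)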

Problem 6

  1. Gradient Descent Weight Update Rule

    w^{(t+1)} = w^{(t)} - \eta \, Q(w)
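
  2. Numerical Sketch

    A minimal Python sketch of the update rule in step 1, treating Q(w) as a caller-supplied
    search-direction function; the learning rate and iteration count are assumed example
    parameters.

    import numpy as np

    def descend(w, Q, eta=0.01, steps=100):
        """Iterate the step-1 update w <- w - eta * Q(w) for a caller-supplied Q."""
        w = np.asarray(w, dtype=float)
        for _ in range(steps):
            w = w - eta * Q(w)
        return w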

References

  1. Golden, Richard M., Mathematical Methods for Neural Network Analysis and Design, MIT Press, 1996.
  2. Haykin, Simon, Neural Networks: A Comprehensive Foundation, 2nd Ed., Prentice Hall, 1999.
  3. Marlow, W. H., Mathematics for Operations Research, Dover Publications, 1978.