Question: Show that if we use the loss function $L(o)$ in Exercise 9, then the loss-to-node gradient can be computed for the final layer $h_t$ as follows:

$$\frac{\partial L(o)}{\partial h_t} = U^T \frac{\partial L(o)}{\partial o}$$

The updates in earlier layers remain similar to Exercise 9, except that each $o$ is replaced by $L(o)$. What is the size of each matrix $\frac{\partial L(o)}{\partial h_p}$?
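A quick numerical check can make the final-layer identity concrete. The sketch below is not from the exercise. It assumes the output is linear in the last hidden layer, o = U h_t (which the factor U^T suggests, though Exercise 9 is not reproduced here), and it uses a squared-error loss against a made-up target y purely for illustration. The sizes m and k and all function names are hypothetical.

```python
import numpy as np

def loss(o, y):
    # Hypothetical squared-error loss, chosen only for illustration.
    return 0.5 * np.sum((o - y) ** 2)

def grad_wrt_h(U, h, y):
    # Chain rule under the assumption o = U h:
    # dL/dh = U^T dL/do, and dL/do = o - y for this particular loss.
    o = U @ h
    return U.T @ (o - y)

def numeric_grad_wrt_h(U, h, y, eps=1e-6):
    # Central finite differences, perturbing one coordinate of h at a time.
    g = np.zeros_like(h)
    for i in range(h.size):
        step = np.zeros_like(h)
        step[i] = eps
        g[i] = (loss(U @ (h + step), y) - loss(U @ (h - step), y)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
m, k = 3, 5                      # assumed sizes: o has m entries, h_t has k
U = rng.standard_normal((m, k))  # assumed output weight matrix, m x k
h = rng.standard_normal(k)
y = rng.standard_normal(m)

analytic = grad_wrt_h(U, h, y)
numeric = numeric_grad_wrt_h(U, h, y)

print(analytic.shape)                              # same shape as h
print(np.allclose(analytic, numeric, atol=1e-5))   # analytic vs. finite differences
```

Under these assumptions, and taking L to be scalar-valued, the gradient with respect to h_t has one entry per component of h_t. The same reasoning suggests that each ∂L(o)/∂h_p is a column vector with as many entries as h_p, rather than a full matrix as in the node-to-node case.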
