Question:

As you know, in deep neural networks for classification we often use a softmax activation function
in the layer at the top of the network (final layer) and a cross-entropy loss as a cost function of
the network. In this problem, you will gain additional mathematical insight into the value of using
softmax and cross-entropy loss together in this manner. Think of the computational graph that
includes the above (in the context of back-propagation), and consider the graph node combining
both of the above operations. In this problem, you will show that the combined gradient, i.e., the
gradient of softmax followed by cross-entropy loss, has a simple algebraic form: a difference between
two vectors (tensors). This leads to a desirable behavior of the gradient during training. (You are
strongly encouraged to think about what that desirable behavior is, and why the particular
expression you will derive in this problem leads to such behavior.) As a consequence, in practice,
the combination of softmax and cross-entropy loss works well.
Recall that any node of a computational graph may have associated with it both a variable the node
will output (we will refer to it as "operand") and an operation (e.g., add, matmul, etc.).
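As a hypothetical illustration of this convention (the Node class below is an illustrative sketch, not the API of any particular framework):

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    """A computational-graph node: an operation plus the operand it outputs."""
    op: Callable                 # the operation, e.g. add, matmul, softmax
    inputs: List["Node"] = field(default_factory=list)
    operand: object = None       # the value this node outputs on the forward pass

    def forward(self):
        # Evaluate this node's operation on its inputs' operands.
        self.operand = self.op(*[n.operand for n in self.inputs])
        return self.operand

# Hypothetical usage: a node applying softmax to another node's output.
# softmax_node = Node(op=softmax_fn, inputs=[logits_node])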
Suppose the input to the combined (softmax and loss) operation is $y$, and let $q = \sigma(y)$, where
$\sigma(y) = \operatorname{softmax}(y)$. Let the cross-entropy loss function be $L = H(p, q)$, where the vector $p$
represents the true probability distribution for a given training exemplar.
Derive the expression for the gradient $\nabla_y L$ of this combined operation (softmax and cross-entropy).
As mentioned above, you may expect the result to have a simple algebraic form, namely a difference
between two vectors. Show all steps of your derivation.
Hints: Recall that, by the definition of cross-entropy, $H(p, q) = -\sum_i p_i \log q_i$. Since $q$ is the softmax
output, you can use the softmax definition to calculate $\log q_i$, which you will need for calculating $L$. When you
calculate $L$, keep in mind that $\sum_i p_i = 1$ (obviously, since $p$ is a probability distribution).
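
A sketch of the derivation, assuming the standard softmax definition $\sigma(y)_i = e^{y_i} / \sum_k e^{y_k}$ (the symbol $\sigma$ follows the notation above):

$$\log q_i = \log \frac{e^{y_i}}{\sum_k e^{y_k}} = y_i - \log \sum_k e^{y_k}.$$

Substituting into the cross-entropy and using $\sum_i p_i = 1$:

$$L = -\sum_i p_i \log q_i = -\sum_i p_i y_i + \Big(\sum_i p_i\Big) \log \sum_k e^{y_k} = -\sum_i p_i y_i + \log \sum_k e^{y_k}.$$

Differentiating with respect to a component $y_j$:

$$\frac{\partial L}{\partial y_j} = -p_j + \frac{e^{y_j}}{\sum_k e^{y_k}} = q_j - p_j,$$

so in vector form

$$\nabla_y L = q - p,$$

the difference between the predicted distribution $q$ and the true distribution $p$.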

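To sanity-check the result, here is a minimal NumPy sketch (all function names are illustrative, not part of the original question) that compares the analytic gradient $q - p$ against a central finite-difference approximation:

import numpy as np

def softmax(y):
    """Numerically stable softmax: subtract the max before exponentiating."""
    z = np.exp(y - y.max())
    return z / z.sum()

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log q_i."""
    return -np.sum(p * np.log(q))

def loss(y, p):
    """Combined operation: softmax followed by cross-entropy loss."""
    return cross_entropy(p, softmax(y))

# A hypothetical example: 4 logits and a one-hot true distribution p.
rng = np.random.default_rng(0)
y = rng.normal(size=4)
p = np.zeros(4)
p[2] = 1.0

# Analytic gradient from the derivation: grad_y L = q - p.
analytic = softmax(y) - p

# Central finite-difference approximation of the same gradient.
eps = 1e-6
numeric = np.zeros_like(y)
for j in range(len(y)):
    e = np.zeros_like(y)
    e[j] = eps
    numeric[j] = (loss(y + e, p) - loss(y - e, p)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # expected: True

If the final line prints True, the finite differences agree with $q - p$ to numerical precision.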