Question:

As you know, in deep neural networks for classification we often use a softmax activation function
in the layer at the top of the network (final layer) and a cross-entropy loss as a cost function of
the network. In this problem, you will gain additional mathematical insight into the value of using
softmax and cross-entropy loss together in this manner. Think of the computational graph that
includes the above (in the context of back-propagation), and consider the graph node combining
both of the above operations. In this problem, you will show that the combined gradient, i.e., the
gradient of softmax followed by cross-entropy loss, has a simple algebraic form: a difference between
two vectors (tensors). This leads to a desirable behavior of the gradient during training. (You are
strongly encouraged to think about what that desirable behavior is, and why the particular
expression you will derive in this problem leads to such behavior.) As a consequence, in practice,
the combination of softmax and cross-entropy loss works well.
Recall that any node of a computational graph may have associated with it both a variable the node
will output (we will refer to it as "operand") and an operation (e.g., add, matmul, etc.).
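As a hypothetical illustration of this convention (the Node class below is an illustrative sketch, not the API of any particular framework):

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    """A computational-graph node: an operation plus the operand it outputs."""
    op: Callable                 # the operation, e.g. add, matmul, softmax
    inputs: List["Node"] = field(default_factory=list)
    operand: object = None       # the value this node outputs on the forward pass

    def forward(self):
        # Evaluate this node's operation on its inputs' operands.
        self.operand = self.op(*[n.operand for n in self.inputs])
        return self.operand

# Hypothetical usage: a node applying softmax to another node's output.
# softmax_node = Node(op=softmax_fn, inputs=[logits_node])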
Suppose the input to the combined (softmax and loss) operation is $y$, and let $q = \sigma(y)$, where
$\sigma(y) = \operatorname{softmax}(y)$. Let the cross-entropy loss function be $L = H(p, q)$, where the vector $p$
represents the true probability distribution for a given training exemplar.
Derive the expression for the gradient $\nabla_y L$ of this combined operation (softmax and cross-entropy).
As mentioned above, you may expect the result to have a simple algebraic form, namely a difference
between two vectors. Show all steps of your derivation.
Hints: Recall that, by the definition of cross-entropy, $H(p, q) = -\sum_i p_i \log q_i$. Since $q$ is the softmax
output, you can use the softmax definition to calculate $\log q_i$, which you will need for calculating $L$. When you
calculate $L$, keep in mind that $\sum_i p_i = 1$ (obviously, since $p$ is a probability distribution).
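
A sketch of the derivation, assuming the standard softmax definition $\sigma(y)_i = e^{y_i} / \sum_k e^{y_k}$ (the symbol $\sigma$ follows the notation above):

$$\log q_i = \log \frac{e^{y_i}}{\sum_k e^{y_k}} = y_i - \log \sum_k e^{y_k}.$$

Substituting into the cross-entropy and using $\sum_i p_i = 1$:

$$L = -\sum_i p_i \log q_i = -\sum_i p_i y_i + \Big(\sum_i p_i\Big) \log \sum_k e^{y_k} = -\sum_i p_i y_i + \log \sum_k e^{y_k}.$$

Differentiating with respect to a component $y_j$:

$$\frac{\partial L}{\partial y_j} = -p_j + \frac{e^{y_j}}{\sum_k e^{y_k}} = q_j - p_j,$$

so in vector form

$$\nabla_y L = q - p,$$

the difference between the predicted distribution $q$ and the true distribution $p$.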

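To sanity-check the result, here is a minimal NumPy sketch (all function names are illustrative, not part of the original question) that compares the analytic gradient $q - p$ against a central finite-difference approximation:

import numpy as np

def softmax(y):
    """Numerically stable softmax: subtract the max before exponentiating."""
    z = np.exp(y - y.max())
    return z / z.sum()

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log q_i."""
    return -np.sum(p * np.log(q))

def loss(y, p):
    """Combined operation: softmax followed by cross-entropy loss."""
    return cross_entropy(p, softmax(y))

# A hypothetical example: 4 logits and a one-hot true distribution p.
rng = np.random.default_rng(0)
y = rng.normal(size=4)
p = np.zeros(4)
p[2] = 1.0

# Analytic gradient from the derivation: grad_y L = q - p.
analytic = softmax(y) - p

# Central finite-difference approximation of the same gradient.
eps = 1e-6
numeric = np.zeros_like(y)
for j in range(len(y)):
    e = np.zeros_like(y)
    e[j] = eps
    numeric[j] = (loss(y + e, p) - loss(y - e, p)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # expected: True

If the final line prints True, the finite differences agree with $q - p$ to numerical precision.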