Questions and Answers of Statistical Techniques in Business
Show that the softmax function\[ \operatorname{softmax}: \boldsymbol{z} \mapsto \frac{\exp (\boldsymbol{z})}{\sum_{k} \exp \left(z_{k}\right)} \]satisfies the invariance property \(\operatorname{softmax}(\boldsymbol{z}+c \mathbf{1})=\operatorname{softmax}(\boldsymbol{z})\) for any constant \(c\).
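A quick numerical check of this translation invariance (a minimal sketch, not the requested proof):
import numpy as np

def softmax(z):
    # Shifting every component by a constant is exactly the invariance being illustrated.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0])
print(np.allclose(softmax(z), softmax(z + 5.0)))   # True for any constant shift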
Projection pursuit is a network with one hidden layer that can be written as:\[ g(\boldsymbol{x})=S\left(\boldsymbol{\omega}^{\top} \boldsymbol{x}\right) \]where \(S\) is a univariate smoothing
Suppose that in the stochastic gradient descent method we wish to repeatedly draw minibatches of size \(N\) from \(\tau_{n}\) where we assume that \(N \times m=n\) for some large integer
Denote the pdf of the \(\mathscr{N}(\mathbf{0}, \boldsymbol{\Sigma})\) distribution by \(\varphi_{\boldsymbol{\Sigma}}(\cdot)\), and let\[ \mathscr{D}\left(\boldsymbol{\mu}_{0}, \boldsymbol{\Sigma}_{0} \mid
Suppose that we wish to compute the inverse and log-determinant of the matrix\[ \mathbf{I}_{n}+\mathbf{U} \mathbf{U}^{\top} \]where \(\mathbf{U}\) is an \(n \times h\) matrix with \(h \ll n\). Show
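Presumably the intended route is via the Woodbury identity and Sylvester's determinant theorem, which reduce both computations to an \(h \times h\) problem; stated here as a reminder, with the notation above:\[ \left(\mathbf{I}_{n}+\mathbf{U} \mathbf{U}^{\top}\right)^{-1}=\mathbf{I}_{n}-\mathbf{U}\left(\mathbf{I}_{h}+\mathbf{U}^{\top} \mathbf{U}\right)^{-1} \mathbf{U}^{\top}, \qquad \ln \operatorname{det}\left(\mathbf{I}_{n}+\mathbf{U} \mathbf{U}^{\top}\right)=\ln \operatorname{det}\left(\mathbf{I}_{h}+\mathbf{U}^{\top} \mathbf{U}\right). \]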
Suppose that\[ \mathbf{U}=\left[\boldsymbol{u}_{0}, \boldsymbol{u}_{1}, \ldots, \boldsymbol{u}_{h-1}\right] \]where all \(\boldsymbol{u} \in \mathbb{R}^{n}\) are column vectors and we have computed
Suppose that \(\mathbf{U} \in \mathbb{R}^{n \times h}\) has its \(k\)-th column \(\boldsymbol{v}\) replaced with \(\boldsymbol{w}\), giving the updated \(\mathbf{U}\). (a) If \(\boldsymbol{e} \in \mathbb{R}^{h}\) denotes the
Equation (9.7) gives the rank-two \(\mathrm{BFGS}\) update of the inverse Hessian \(\mathbf{C}_{t}\) to \(\mathbf{C}_{t+1}\). Instead of using a rank-two update, we can consider a rank-one update, in
Show that the BFGS formula (B.23) can be written as:\[ \mathbf{C} \leftarrow\left(\mathbf{I}-v \boldsymbol{g} \boldsymbol{\delta}^{\top}\right)^{\top} \mathbf{C}\left(\mathbf{I}-v \boldsymbol{g}
Show that the BFGS formula (B.23) is the solution to the constrained optimization problem:\[ \mathbf{C}_{\mathrm{BFGS}}=\underset{\mathbf{A} \text { subject to } \mathbf{A}
Consider again the logistic regression model in Exercise 5.18, which used iterative reweighted least squares for training the learner. Repeat all the computations, but this time using the
Download the seeds_dataset.txt data set from the book's GitHub site, which contains 210 independent examples. The categorical output (response) here is the type of wheat grain: Kama, Rosa, and
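A possible starting point in Python (assuming the file is whitespace-delimited with the seven features followed by the class label 1-3 in the last column; both the layout and the scikit-learn baseline below are assumptions, not the book's code):
import numpy as np
from sklearn.linear_model import LogisticRegression

data = np.genfromtxt('seeds_dataset.txt')          # assumed whitespace/tab-delimited
X, y = data[:, :7], data[:, 7].astype(int) - 1     # 7 features; labels 1,2,3 mapped to 0,1,2
clf = LogisticRegression(max_iter=1000).fit(X, y)  # multinomial logistic (multi-logit) baseline
print(clf.score(X, y))                             # training accuracy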
In Exercise 12 above, we train the multi-logit classifier using a weight matrix \(\mathbf{W} \in \mathbb{R}^{3 \times 7}\) and bias vector \(\boldsymbol{b} \in \mathbb{R}^{3}\). Repeat the
Consider again Example 9.4, where we used a softmax output function \(S_{L}\) in conjunction with the cross-entropy loss: \(C(\boldsymbol{\theta})=-\ln g_{y+1}(\boldsymbol{x} \mid
Derive the formula (B.25) for a diagonal Hessian update in a quasi-Newton method for minimization. In other words, given a current minimizer \(\boldsymbol{x}_{t}\) of \(f(\boldsymbol{x})\), a
Consider again the Python implementation of the polynomial regression in Section 9.5.1, where the stochastic gradient descent was used for training.Using the polynomial regression data set, implement
Consider again the PyTorch code in Section 9.5.2. Repeat all the computations, but this time using the momentum method for training the network. Comment on which method is preferable: the
Let \(0 \leqslant w \leqslant 1\). Show that the solution to the convex optimization problem\[ \begin{gather*} \min _{p_{1}, \ldots, p_{n}} \sum_{i=1}^{n} p_{i}^{2} \tag{7.28}\\ \text { subject
Derive the formulas (7.14) by minimizing the cross-entropy training loss:\[ -\frac{1}{n} \sum_{i=1}^{n} \ln g\left(\boldsymbol{x}_{i}, y_{i} \mid \boldsymbol{\theta}\right) \]where
Adapt the code in Example 7.2 to plot the estimated decision boundary instead of the true one in Figure 7.3. Compare the true and estimated decision boundaries.
Recall from equation (7.16) that the decision boundaries of the multi-logit classifier are linear, and that the pre-classifier can be written as a conditional pdf of the form:\[ g(y \mid \mathbf{W},
Consider a binary classification problem where the response \(Y\) takes values in \(\{-1,1\}\). Show that the optimal prediction function for the hinge loss \(\operatorname{Loss}(y, \tilde{y})=(1-y
In Example 4.12, we applied a principal component analysis (PCA) to the iris data, but refrained from classifying the flowers based on their feature vectors \(\boldsymbol{x}\). Implement a 1-nearest
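One possible sketch using scikit-learn (assuming, as a guess at Example 4.12, that the first two principal components are used):
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
Z = PCA(n_components=2).fit_transform(X)           # project onto the first two principal components
knn = KNeighborsClassifier(n_neighbors=1).fit(Z, y)
print((knn.predict(Z) == y).mean())                # training accuracy of the 1-NN classifier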
Figure 7.13 displays two groups of data points, given in Table 7.8. The convex hulls have also been plotted. It is possible to separate the two classes of points via a straight line. In fact, many
In Example 7.6 we used the feature map \(\boldsymbol{\phi}(\boldsymbol{x})=\left[x_{1}, x_{2}, x_{1}^{2}+x_{2}^{2}\right]^{\top}\) to classify the points. An easier way is to map the points into
Let \(Y \in\{0,1\}\) be a response variable and let \(h(\boldsymbol{x})\) be the regression function\[ h(\boldsymbol{x}):=\mathbb{E}[Y \mid \boldsymbol{X}=\boldsymbol{x}]=\mathbb{P}[Y=1 \mid
The purpose of this exercise is to derive the dual program (7.21) from the primal program (7.20). The starting point is to introduce a vector of auxiliary variables \(\xi:=\left[\xi_{1}, \ldots,
Consider SVM classification as illustrated in Figure 7.7. The goal of this exercise is to classify the training points \(\left\{\left(\boldsymbol{x}_{i}, y_{i}\right)\right\}\) based on the value of
A well-known data set is the MNIST handwritten digit database, containing many thousands of digitized numbers (from 0 to 9), each described by a \(28 \times 28\) matrix of gray scales. A similar
Download the winequality-red.csv data set from UCI's wine-quality website. The response here is the wine quality (from 0 to 10) as specified by a wine "expert" and the explanatory variables are
Consider the credit approval data set crx.data from UCI's credit approval website. The data set is concerned with credit card applications. The last column in the data set indicates whether the
Consider a synthetic data set that was generated in the following fashion. The explanatory variable follows a standard normal distribution. The response label is 0 if the explanatory variable is
Consider the digits data set from Exercise 12. In this exercise, we would like to train a binary classifier for the identification of digit 8. (a) Divide the data such that the first 1000 rows are used
Repeat Exercise 16 with the original MNIST data set. Use the first 60,000 rows as the train set and the remaining 10,000 rows as the test set. The original data set can be obtained using the
For the breast cancer data in Section 7.8, investigate and discuss whether accuracy is the relevant metric to use or if other metrics discussed in Section 7.2 are more appropriate.
Show that any training set \(\tau=\left\{\left(\boldsymbol{x}_{i}, y_{i}\right), i=1, \ldots, n\right\}\) can be fitted via a tree with zero training loss.
Suppose during the construction of a decision tree we wish to specify a constant regional prediction function \(g^{w}\) on the region \(\mathscr{R}_{w}\), based on the training data in
Using the program from Section 8.2.4, write a basic implementation of a decision tree for a binary classification problem. Implement the misclassification, Gini index, and entropy impurity criteria
Suppose in the decision tree of Example 8.1, there are 3 blue and 2 red data points in a certain tree region. Calculate the misclassification impurity, the Gini impurity, and the entropy impurity.
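As a reminder, for a region with class proportions \(p_{1}, \ldots, p_{c}\) the three impurities are commonly defined as (the book's definitions may scale these or use a different logarithm base)\[ \text{misclassification: } 1-\max_{k} p_{k}, \qquad \text{Gini: } 1-\sum_{k} p_{k}^{2}, \qquad \text{entropy: } -\sum_{k} p_{k} \log_{2} p_{k}, \]with \(p_{\text{blue}}=3/5\) and \(p_{\text{red}}=2/5\) here.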
Consider the procedure of finding the best splitting rule for a categorical variable with \(k\) labels from Section 8.3.4. Show that one needs to consider \(2^{k}\) subsets of \(\{1, \ldots\),
Reproduce Figure 8.6 using the following classification data.
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=5000, n_features=10, centers=3, random_state=10, cluster_std=10)
Prove (8.13); that is, show that\[ \sum_{w \in \mathscr{W}} \sum_{i=1}^{n} \mathbb{1}\left\{\boldsymbol{x}_{i} \in \mathscr{R}_{w}\right\} \operatorname{Loss}\left(y_{i}, g^{w}\left(\boldsymbol{x}_{i}\right)\right)=n\, \ell_{\tau}(g). \]
Suppose \(\tau\) is a training set with \(n\) elements and \(\tau^{*}\), also of size \(n\), is obtained from \(\tau\) by bootstrapping; that is, resampling with replacement. Show that for large \(n,
Prove Equation (8.17).
Consider the following training/test split of the data. Construct a random forest regressor and identify the optimal subset size \(m\) in the sense of the \(R^{2}\) score (see Remark 8.3). import
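The exercise's own code is cut off above; a possible sketch, assuming scikit-learn's RandomForestRegressor and that the subset size \(m\) corresponds to its max_features parameter (the make_regression data below is only a stand-in for the split given in the exercise):
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Stand-in data; replace with the training/test split given in the exercise.
X, y = make_regression(n_samples=500, n_features=10, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for m in range(1, X_train.shape[1] + 1):
    rf = RandomForestRegressor(max_features=m, random_state=0).fit(X_train, y_train)
    scores[m] = rf.score(X_test, y_test)      # .score returns the R^2 coefficient
best_m = max(scores, key=scores.get)
print(best_m, scores[best_m])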
Explain why bagging decision trees are a special case of random forests.
Show that (8.28) holds.
Consider the following classification data and module imports:
Let \(\mathscr{G}\) be an RKHS with reproducing kernel \(\kappa\). Show that \(\kappa\) is a positive semidefinite function.
Show that a reproducing kernel, if it exists, is unique.
Let \(\mathscr{G}\) be a Hilbert space of functions \(g: \mathscr{X} \rightarrow \mathbb{R}\). Recall that the evaluation functional is the map \(\delta_{x}: g \mapsto g(\boldsymbol{x})\) for a given
Let \(\mathscr{G}_{0}\) be the pre-RKHS constructed in the proof of Theorem 6.2. Thus, \(g \in \mathscr{G}_{0}\) is of the form \(g=\sum_{i=1}^{n} \alpha_{i} \kappa_{x_{i}}\)
Continuing Exercise 4, let \(\left(f_{n}\right)\) be a Cauchy sequence in \(\mathscr{G}_{0}\) such that \(\left|f_{n}(\boldsymbol{x})\right| \rightarrow 0\) for all \(\boldsymbol{x}\). Show that
Continuing Exercises 4 and 5, to show that the inner product (6.14) is well defined, a number of facts have to be checked. (a) Verify that the limit converges. (b) Verify that the limit is independent
Exercises 4-6 show that \(\mathscr{G}\) defined in the proof of Theorem 6.2 is an inner product space. It remains to prove that \(\mathscr{G}\) is an RKHS. This requires us to prove that the inner
If \(\kappa_{1}\) and \(\kappa_{2}\) are kernels on \(\mathscr{X}\) and \(\mathscr{Y}\), then \(\kappa_{+},\left((\boldsymbol{x}, \boldsymbol{y}),\left(\boldsymbol{x}^{\prime},
An RKHS enjoys the following desirable smoothness property: if \(\left(g_{n}\right)\) is a sequence belonging to RKHS \(\mathscr{G}\) on \(\mathscr{X}\), and \(\left\|g_{n}-g\right\|_{\mathscr{G}}
Let \(\boldsymbol{X}\) be an \(\mathbb{R}^{d}\)-valued random variable that is symmetric about the origin (that is, \(\boldsymbol{X}\) and \(-\boldsymbol{X}\) are identically distributed). Denote by
Suppose an RKHS \(\mathscr{G}\) of functions \(\mathscr{X} \rightarrow \mathbb{R}\) (with kernel \(\kappa\)) is invariant under a group \(\mathscr{T}\) of transformations \(T:
Given two Hilbert spaces \(\mathscr{H}\) and \(\mathscr{G}\), we call a mapping \(A: \mathscr{H} \rightarrow \mathscr{G}\) a Hilbert space isomorphism if it is (i) a linear map; that is, \(A(a f+b
Let \(\mathbf{X}\) be an \(n \times p\) model matrix. Show that \(\mathbf{X}^{\top} \mathbf{X}+n \gamma \mathbf{I}_{p}\) is invertible for \(\gamma>0\).
As Example 6.8 clearly illustrates, the pdf of a random variable that is symmetric about the origin is not in general a valid reproducing kernel. Take two such iid random variables \(X\) and
For the smoothing cubic spline of Section 6.6, show that \(\kappa(x, u)=\frac{\max \{x, u\} \min \{x, u\}^{2}}{2}-\frac{\min \{x, u\}^{3}}{6}\).
Let \(\mathbf{X}\) be an \(n \times p\) model matrix and let \(\boldsymbol{u} \in \mathbb{R}^{p}\) be the unit-length vector with \(k\)-th entry equal to one
Use Algorithm 6.8.1 from Exercise 16 to write Python code that computes the ridge regression coefficient \(\boldsymbol{\beta}\) in (6.5) and use it to replicate the results in Figure 6.1. The
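Algorithm 6.8.1 itself is not reproduced here; as a cross-check, the coefficient in (6.5) can also be obtained by a direct solve of the ridge normal equations (assuming the penalty enters as \(n\gamma\), as in the earlier exercise on the invertibility of \(\mathbf{X}^{\top} \mathbf{X}+n \gamma \mathbf{I}_{p}\)):
import numpy as np

def ridge_coef(X, y, gamma):
    # Solve (X^T X + n*gamma*I_p) beta = X^T y without forming an explicit inverse.
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * gamma * np.eye(p), X.T @ y)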
Consider Example 2.10 with \(\mathbf{D}=\operatorname{diag}\left(\lambda_{1}, \ldots, \lambda_{p}\right)\) for some nonnegative vector \(\lambda \in \mathbb{R}^{p}\), so that twice the negative
(Exercise 18 continued.) Consider again Example 2.10 with \(\mathbf{D}=\operatorname{diag}\left(\lambda_{1}, \ldots, \lambda_{p}\right)\) for some nonnegative model-selection parameter \(\lambda \in
In this exercise we explore how the early stopping of the gradient descent iterations (see Example B.10),\[ \boldsymbol{x}_{t+1}=\boldsymbol{x}_{t}-\alpha \nabla f\left(\boldsymbol{x}_{t}\right),
Following his mentor Francis Galton, the mathematician/statistician Karl Pearson conducted comprehensive studies comparing hereditary traits between members of the same family. Figure 5.10 depicts
For the simple linear regression model, show that the values for \(\widehat{\beta_{1}}\) and \(\widehat{\beta_{0}}\) that solve the equations (5.9) are:\[ \begin{gather*}
Edwin Hubble discovered that the universe is expanding. If \(v\) is a galaxy's recession velocity (relative to any other galaxy) and \(d\) is its distance (from that same galaxy), Hubble's law states
The multiple linear regression model (5.6) can be viewed as a first-order approximation of the general model\[ \begin{equation*} Y=g(\boldsymbol{x})+\varepsilon \tag{5.42} \end{equation*} \]where
Table 5.6 shows data from an agricultural experiment where crop yield was measured for two levels of pesticide and three levels of fertilizer. There are three responses for each combination. (a)
Show that for the birthweight data in Section 5.6.6.2 there is no significant decrease in birthweight for smoking mothers. [Hint: create a new variable nonsmoke \(=1\)-smoke, which reverses the
Prove (5.37) and (5.38).
In the Tobit regression model with normally distributed errors, the response is modeled as:\[ Y_{i}= \begin{cases} Z_{i}, & \text{if } Z_{i}>u_{i}, \\ u_{i}, & \text{if } Z_{i} \leqslant u_{i}, \end{cases} \]
Download data set WomenWage.csv from the book's website. This data set is a tidied-up version of the women's wages data set from [91]. The first column of the data (hours) is the response variable
Let \(\mathbf{P}\) be a projection matrix. Show that the diagonal elements of \(\mathbf{P}\) all lie in the interval \([0,1]\). In particular, for \(\mathbf{P}=\mathbf{X} \mathbf{X}^{+}\) in Theorem 5.1, the
Consider the linear model \(\boldsymbol{Y}=\mathbf{X} \boldsymbol{\beta}+\varepsilon\) in (5.8), with \(\mathbf{X}\) being the \(n \times p\) model matrix and \(\boldsymbol{\varepsilon}\) having
Take the linear model \(\boldsymbol{Y}=\mathbf{X} \boldsymbol{\beta}+\varepsilon\), where \(\mathbf{X}\) is an \(n \times p\) model matrix, \(\varepsilon=\mathbf{0}\), and \(\mathbb{C o
Consider a normal linear model \(\boldsymbol{Y}=\mathbf{X} \boldsymbol{\beta}+\varepsilon\), where \(\mathbf{X}\) is an \(n \times p\) model matrix and \(\varepsilon \sim \mathscr{N}\left(\mathbf{0},
Using the notation from Exercises 11-13, Cook's distance for observation \(i\) is defined as\[ D_{i}:=\frac{\left\|\widehat{\boldsymbol{Y}}-\widehat{\boldsymbol{Y}}^{(i)}\right\|^{2}}{p S^{2}} \]It measures the
Prove that if we add an additional feature to the general linear model, then \(R^{2}\), the coefficient of determination, is necessarily non-decreasing in value and hence cannot be used to compare
Let \(\boldsymbol{X}:=\left[X_{1}, \ldots, X_{n}\right]^{\top}\) and \(\boldsymbol{\mu}:=\left[\mu_{1}, \ldots, \mu_{n}\right]^{\top}\). In the fundamental Theorem C.9, we use the fact that if \(X_{i}
Carry out a logistic regression analysis on a (partial) wine data set classification problem. The data can be loaded using the following code.The model matrix has three features, including the
Consider again Example 5.10, where we train the learner via the Newton iteration (5.39). If \(\mathbf{X}^{\top}:=\left[\boldsymbol{x}_{1}, \ldots, \boldsymbol{x}_{n}\right]\) defines the matrix of predictors and
In multi-output linear regression, the response variable is a real-valued vector of dimension, say, \(m\). Similar to (5.8), the model can be written in matrix notation:\[ \mathbf{Y}=\mathbf{X
This exercise is to show that the Fisher information matrix \(\mathbf{F}(\boldsymbol{\theta})\) in (4.8) is equal to the matrix \(\mathbf{H}(\boldsymbol{\theta})\) in (4.9), in the special case where
Plot the mixture of \(\mathscr{N}(0,1), \mathscr{U}(0,1)\), and \(\operatorname{Exp}(1)\) distributions, with weights \(w_{1}=w_{2}=w_{3}=1 / 3\).
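A minimal plotting sketch with SciPy and Matplotlib (the plotting range is arbitrary):
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.linspace(-3, 4, 500)
# Equal-weight mixture of the N(0,1), U(0,1) and Exp(1) densities.
f = (stats.norm.pdf(x) + stats.uniform.pdf(x, 0, 1) + stats.expon.pdf(x)) / 3
plt.plot(x, f)
plt.show()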
Denote the pdfs in Exercise 2 by \(f_{1}, f_{2}, f_{3}\), respectively. Suppose that \(X\) is simulated via the two-step procedure: First, draw \(Z\) from \(\{1,2,3\}\), then draw
Simulate an iid training set of size 100 from the Gamma \((2.3,0.5)\) distribution, and implement the Fisher scoring method in Example 4.1 to find the maximum likelihood estimate. Plot the true and
Let \(\mathscr{T}=\left\{\boldsymbol{X}_{1}, \ldots, \boldsymbol{X}_{n}\right\}\) be iid data from a pdf \(g(\boldsymbol{x} \mid \boldsymbol{\theta})\) with Fisher matrix
Figure 4.15 shows a Gaussian KDE with bandwidth \(\sigma=0.2\) on the points \(-0.5,0,0.2,0.9\), and 1.5. Reproduce the plot in Python. Using the same bandwidth, plot also the KDE for the same data,
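A minimal sketch of the Gaussian KDE on the given points with bandwidth \(\sigma=0.2\) (the rest of the exercise is cut off above):
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

pts = np.array([-0.5, 0.0, 0.2, 0.9, 1.5])
sigma = 0.2                                    # bandwidth
x = np.linspace(-2, 3, 500)
# Gaussian KDE: average of normal densities centred at the data points.
kde = norm.pdf(x[:, None], loc=pts, scale=sigma).mean(axis=1)
plt.plot(x, kde)
plt.show()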
For fixed \(x^{\prime}\), the Gaussian kernel function is the solution to Fourier's heat equation \[ \frac{\partial}{\partial t} f(x \mid t)=\frac{1}{2} \frac{\partial^{2}}{\partial x^{2}} f(x \mid
Show that the Ward linkage given in (4.41) is equal to\[ d_{\text{Ward}}(\mathscr{I}, \mathscr{J})=\frac{|\mathscr{I}|\,|\mathscr{J}|}{|\mathscr{I}|+|\mathscr{J}|}\left\|\overline{\boldsymbol{x}}_{\mathscr{I}}-\overline{\boldsymbol{x}}_{\mathscr{J}}\right\|^{2}. \]
Carry out the agglomerative hierarchical clustering of Example 4.8 via the linkage method from scipy.cluster.hierarchy. Show that the linkage matrices are the same. Give a scatterplot of the data,
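A skeleton using scipy.cluster.hierarchy (the random data and the 'average' method below are placeholders; substitute the data and linkage type of Example 4.8):
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(30, 2))   # stand-in for the data of Example 4.8
Z = linkage(X, method='average')                    # linkage matrix to compare with Example 4.8
labels = fcluster(Z, t=3, criterion='maxclust')     # cut the dendrogram into three clusters
plt.scatter(X[:, 0], X[:, 1], c=labels)             # scatterplot of the data coloured by cluster
plt.show()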
Suppose that we have the data \(\tau_{n}=\left\{x_{1}, \ldots, x_{n}\right\}\) in \(\mathbb{R}\) and decide to train the two-component Gaussian mixture model\[ g(x \mid
A \(d\)-dimensional normal random vector \(\boldsymbol{X} \sim \mathscr{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\) can be defined via an affine transformation,
A generalization of both the gamma and inverse-gamma distribution is the generalized inverse-gamma distribution, which has density\[\begin{equation*}f(s)=\frac{(a / b)^{p / 2}}{2 K_{p}(\sqrt{a b})}
In Exercise 11 we viewed the multivariate Student \(\mathbf{t}_{\alpha}\) distribution as a scale-mixture of the \(\mathscr{N}\left(\mathbf{0}, \mathbf{I}_{d}\right)\) distribution. In this exercise,