4 Under-Parameterization and Over-Parameterization
In the previous section, we had more data points than features in our data, i.e., we were looking at N>100. This tends to be the ideal situation, since we need to find an unknown weight for each feature, and this gives us enough information to determine each weight (similar to how two data points are enough to find the slope and intercept, the two unknowns, of a line).
Sometimes, however, we may have fewer data points than we have features - this makes it difficult to determine how the underlying model should depend on each feature. We just don't have enough data. In the following problems, consider a training data set of size N=50 and a test data set of size N=50.
Problem 8: Let A be a matrix of random values, with k rows and 101 columns, where each entry is sampled from a N(0,1) distribution. Note that for any input vector x, Ax will be a vector of k values. We could then consider performing linear regression on the data points (Ax, y) rather than (x, y). Note that if k < 50, this transformed data set will have fewer input features than we have data points in our data set, and thus we restore linear regression to working order.
Plot, over k from 1 to 50, the testing error when, for a given k, you pick a random A to transform the input vectors by, then do linear regression on the result. You'll need to repeat the experiment with a number of different choices of A, for each k, to get a good plot. What do you notice? Does this seem to be a reasonable trend?
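Below is a minimal sketch of one way to set up this experiment in Python. The problem reuses the data set from the previous section, which is not reproduced here, so this sketch synthesizes a stand-in with 101 input features; the variable names (X_train, w_true, etc.), the noise level, and the number of trials per k are illustrative assumptions, not part of the problem statement.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Stand-in for the data from the previous section (an assumption):
# 101 features, N = 50 training points and N = 50 test points.
d = 101
w_true = rng.standard_normal(d)                  # hypothetical true weights
X_train = rng.standard_normal((50, d))
X_test = rng.standard_normal((50, d))
y_train = X_train @ w_true + 0.5 * rng.standard_normal(50)
y_test = X_test @ w_true + 0.5 * rng.standard_normal(50)

def avg_test_error(k, n_trials=20):
    """Average test MSE over n_trials random projections A of shape (k, 101)."""
    errs = []
    for _ in range(n_trials):
        A = rng.standard_normal((k, d))           # entries ~ N(0, 1)
        Z_tr, Z_te = X_train @ A.T, X_test @ A.T  # transformed inputs Ax
        w, *_ = np.linalg.lstsq(Z_tr, y_train, rcond=None)  # least squares fit
        errs.append(np.mean((Z_te @ w - y_test) ** 2))
    return np.mean(errs)

ks = list(range(1, 51))
plt.plot(ks, [avg_test_error(k) for k in ks])
plt.xlabel("k (number of transformed features)")
plt.ylabel("average test MSE")
plt.show()
```

Averaging over several draws of A matters here: a single random projection can be unusually good or bad, and the trend over k only becomes visible once that randomness is averaged out.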
Problem 9: Notice that there's nothing stopping us from continuing to increase k. This puts us in a regime of over-parameterization (we have more features in our data than data points), and in fact increasing over-parameterization, if we were bold enough to take k>100. One possible solution is, when performing linear regression on the transformed Ax data, to do ridge regression, introducing the ridge penalty λ into the loss we are minimizing.
Continue the experiment, for k = 50, 51, 52, ..., 200, plotting the resulting testing error (averaged over multiple choices of A). How did you choose a good value of λ? (Note that the number of weights we need to find changes with k - should this influence λ?) What do you notice?
Bonus: Why does this happen?
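Continuing the sketch above (it reuses the synthetic data, d, and rng already defined), one illustrative way to run the Problem 9 experiment is shown below. It solves ridge regression in closed form, w = (Z^T Z + λI)^{-1} Z^T y, and fixes a single λ for illustration, even though the problem asks you to choose it yourself and consider whether it should scale with k.

```python
def avg_ridge_test_error(k, lam, n_trials=20):
    """Average test MSE over random A, with ridge penalty lam on the weights."""
    errs = []
    for _ in range(n_trials):
        A = rng.standard_normal((k, d))
        Z_tr, Z_te = X_train @ A.T, X_test @ A.T
        # Closed-form ridge solution: (Z^T Z + lam * I) w = Z^T y
        w = np.linalg.solve(Z_tr.T @ Z_tr + lam * np.eye(k), Z_tr.T @ y_train)
        errs.append(np.mean((Z_te @ w - y_test) ** 2))
    return np.mean(errs)

ks = list(range(50, 201))
lam = 1.0  # illustrative fixed penalty; the problem asks how to choose this
plt.plot(ks, [avg_ridge_test_error(k, lam) for k in ks])
plt.xlabel("k (number of transformed features)")
plt.ylabel("average test MSE")
plt.show()
```

Note that for k > 50 the Gram matrix Z^T Z is singular (more weights than data points), which is exactly why the λI term is needed to keep the solve well posed.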