Chapter 1: The Learning Problem

Exercises

1.3

(a) If \(\mathbf{x}(t)\) is misclassified, we either have \(y(t) = 1\) and \(\text{sign}(\mathbf{w}^T(t)\mathbf{x}(t)) < 0\), or \(y(t) = -1\) and \(\text{sign}(\mathbf{w}^T(t)\mathbf{x}(t)) > 0\). In either case, \(y(t)\mathbf{w}^T(t)\mathbf{x}(t) < 0\).

(b) Substituting the update rule \(\mathbf{w}(t + 1) = \mathbf{w}(t) + y(t)\mathbf{x}(t)\), we have

\[ y(t)\mathbf{w}^T(t + 1)\mathbf{x}(t) = y(t)\mathbf{w}^T(t)\mathbf{x}(t) + y^2(t)||\mathbf{x}(t)||^2 > y(t)\mathbf{w}^T(t)\mathbf{x}(t), \]

where we use the fact that \(y^2(t)||\mathbf{x}(t)||^2\) is strictly positive.

(c) Since a correctly classified data point has \(y(t)\mathbf{w}^T(t)\mathbf{x}(t) > 0\), it makes sense to move the weight vector so that this quantity increases on a misclassified point, which is exactly what the update in (b) does.

1.4

Check out my Jupyter notebook here.
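The notebook itself isn't reproduced here, but a minimal sketch of the exercise (generate linearly separable data, run the perceptron learning algorithm) might look like the following; the target weights and data sizes are illustrative choices of mine, not necessarily the ones in the notebook:

```python
import numpy as np

def pla(X, y, max_iters=10_000):
    """Perceptron learning algorithm; returns final weights and update count."""
    w = np.zeros(X.shape[1])
    for t in range(max_iters):
        mis = np.where(np.sign(X @ w) != y)[0]   # indices of misclassified points
        if len(mis) == 0:
            return w, t
        i = mis[0]                               # pick a misclassified point
        w = w + y[i] * X[i]                      # update: w <- w + y(t) x(t)
    return w, max_iters

# Linearly separable toy data with a known (hypothetical) target.
rng = np.random.default_rng(0)
w_star = np.array([0.5, 1.0, -1.0])              # assumed target weights
X = np.hstack([np.ones((20, 1)), rng.uniform(-1, 1, size=(20, 2))])
y = np.sign(X @ w_star)

w, updates = pla(X, y)
assert np.all(np.sign(X @ w) == y)               # PLA converges on separable data
```

Because the data are separable by construction, the loop is guaranteed to terminate (Problem 1.3 below proves a bound on the number of updates).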

1.13

(a) \(h\) makes an error in approximating \(y\) when it

  • incorrectly approximates \(f(\mathbf{x})\) with probability \(\mu\), and \(y = f(\mathbf{x})\) with probability \(\lambda\).
  • correctly approximates \(f(\mathbf{x})\) with probability \(1 - \mu\), and \(y \neq f(\mathbf{x})\) with probability \(1 - \lambda\).

Therefore \(\mathbb{P}[h(\mathbf{x}) \neq y] = \mu\lambda + (1 - \mu)(1 - \lambda)\).

(b) Intuitively, a value of \(\lambda = 0.5\) would make learning completely infeasible, since predicting the value of \(y\) is essentially identical to guessing the result of a fair coin flip. To justify this, we plug \(\lambda = 0.5\) into our result from (a) to show

\[ \mathbb{P}[h(\mathbf{x}) \neq y] = \frac{1}{2}\mu + \frac{1}{2}(1 - \mu) = \frac{1}{2}, \]

which is independent of \(\mu\): how well \(h\) approximates \(f\) has no effect on its performance on \(y\).
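A quick Monte Carlo check of the formula from (a); the particular values \(\mu = 0.3\) and \(\lambda = 0.8\) are my own illustrative choices:

```python
import numpy as np

# Check P[h(x) != y] = mu*lambda + (1 - mu)(1 - lambda) by simulation.
rng = np.random.default_rng(0)
mu, lam, n = 0.3, 0.8, 200_000

h_wrong = rng.random(n) < mu          # h(x) != f(x) with probability mu
y_eq_f = rng.random(n) < lam          # y == f(x) with probability lambda
# h errs on y exactly when it is wrong about f and y agrees with f,
# or right about f and y disagrees with f: h_wrong == y_eq_f.
err = np.mean(h_wrong == y_eq_f)

expected = mu * lam + (1 - mu) * (1 - lam)
assert abs(err - expected) < 0.01
```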

Problems

1.1

We denote \(A\) as the event where the first ball we choose is black, and \(B\) as the event where the second ball is black. The quantity we want to compute is \(\mathbb{P}[B | A]\), which we can do using the definition of conditional probability:

\[ \mathbb{P}[B \mid A] = \frac{\mathbb{P}[B \cap A]}{\mathbb{P}[A]}. \]

Assuming there is an equal chance that we pick either bag, then \(\mathbb{P}[B \cap A] = 0.5\), since just one of the bags has two black balls. To calculate \(\mathbb{P}[A]\), we notice that if we pick the bag with two black balls (with probability \(0.5\)), then \(A\) occurs with probability 1; otherwise, we pick the second bag with probability \(0.5\), and then \(A\) occurs with probability \(0.5\) (assuming that we pick the balls uniformly at random). Therefore, our final calculation becomes

\[ \mathbb{P}[B \mid A] = \frac{1/2}{(1/2)(1) + (1/2)(1/2)} = \frac{1/2}{3/4} = \frac{2}{3}. \]
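As a sanity check, we can simulate the two-bag experiment directly; the bag contents follow the problem statement, while the trial count is arbitrary:

```python
import random

# Two bags: one with two black balls, one with one black and one white.
# Estimate P[second black | first black], which should be close to 2/3.
random.seed(0)
bags = [["black", "black"], ["black", "white"]]
first_black = both_black = 0
for _ in range(100_000):
    bag = random.choice(bags)[:]   # pick a bag uniformly at random
    random.shuffle(bag)            # draw order within the bag is uniform
    if bag[0] == "black":
        first_black += 1
        if bag[1] == "black":
            both_black += 1

ratio = both_black / first_black
assert abs(ratio - 2/3) < 0.01     # agrees with P[B | A] = 2/3
```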

1.3

(a) Since \(\mathbf{w}^* \) separates the data, we know that for \(n = 1, \ldots, N\),

\[ y_n = \text{sign}({\mathbf{w}^* }^T\mathbf{x}_n), \]

so it is immediately clear that \(y_n({\mathbf{w}^* }^T\mathbf{x}_n) > 0\) for \(n = 1, \ldots, N\). We conclude that \(\rho = \min_{1 \leq n \leq N}y_n({\mathbf{w}^* }^T\mathbf{x}_n) > 0\).

(b) We can prove the first statement directly. Using our update rule \(\mathbf{w}(t) = \mathbf{w}(t - 1) + y(t - 1)\mathbf{x}(t - 1)\), we have

\[ \mathbf{w}^T(t)\mathbf{w}^* = \mathbf{w}^T(t - 1)\mathbf{w}^* + y(t - 1)\mathbf{x}^T(t - 1)\mathbf{w}^* \geq \mathbf{w}^T(t - 1)\mathbf{w}^* + \rho, \]

where the inequality comes from the fact that \(\rho \leq y(t - 1)(\mathbf{x}^T(t - 1)\mathbf{w}^* )\).

We prove the second statement by induction. For \(t = 1\), note that

\[ \mathbf{w}^T(1)\mathbf{w}^* = \mathbf{w}^T(0)\mathbf{w}^* + y(0)\mathbf{x}^T(0)\mathbf{w}^* \geq \rho, \]

since \(\mathbf{w}(0) = \mathbf{0}\). For the inductive step, we assume the inequality holds for \(t - 1\), which allows us to write

\[ \mathbf{w}^T(t)\mathbf{w}^* \geq \mathbf{w}^T(t - 1)\mathbf{w}^* + \rho \geq (t - 1)\rho + \rho = t\rho. \]

(c) We again use our update rule for \(\mathbf{w}(t)\) to write

\[ ||\mathbf{w}(t)||^2 = ||\mathbf{w}(t - 1)||^2 + 2y(t - 1)\mathbf{w}^T(t - 1)\mathbf{x}(t - 1) + ||\mathbf{x}(t - 1)||^2 \leq ||\mathbf{w}(t - 1)||^2 + ||\mathbf{x}(t - 1)||^2, \]

where we use the hint (the updated point was misclassified, so \(y(t - 1)\mathbf{w}^T(t - 1)\mathbf{x}(t - 1) \leq 0\)) and the fact that \(y^2(t - 1) = 1\).

(d) For \(t = 1\), we have

\[ ||\mathbf{w}(1)||^2 \leq ||\mathbf{w}(0)||^2 + ||\mathbf{x}(0)||^2 = ||\mathbf{x}(0)||^2 \leq R^2. \]

Assuming the inequality holds for \(t - 1\), it follows that

\[ ||\mathbf{w}(t)||^2 \leq ||\mathbf{w}(t - 1)||^2 + ||\mathbf{x}(t - 1)||^2 \leq (t - 1)R^2 + R^2 = tR^2. \]

(e) Combining the results from (b) and (d) and rearranging gives us

\[ \frac{\mathbf{w}^T(t)\mathbf{w}^* }{||\mathbf{w}(t)||} \geq \frac{t\rho}{\sqrt{t}R} = \sqrt{t}\,\frac{\rho}{R}. \]

We can use the Cauchy-Schwarz inequality to show \(\frac{\mathbf{w}^T(t)\mathbf{w}^* }{||\mathbf{w}(t)||} \leq ||\mathbf{w}^* ||\), so we conclude that

\[ \sqrt{t}\,\frac{\rho}{R} \leq ||\mathbf{w}^* || \implies t \leq \frac{R^2||\mathbf{w}^* ||^2}{\rho^2}. \]
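This bound can be checked empirically. The sketch below runs PLA on a randomly generated separable data set (the target weights and sample size are assumptions for illustration) and verifies that the number of updates never exceeds \(R^2||\mathbf{w}^* ||^2/\rho^2\):

```python
import numpy as np

def pla_updates(X, y, max_updates=100_000):
    """Run PLA on separable data; return the number of updates made."""
    w = np.zeros(X.shape[1])
    for updates in range(max_updates):
        mis = np.where(np.sign(X @ w) != y)[0]
        if len(mis) == 0:
            return updates
        i = mis[0]
        w = w + y[i] * X[i]
    return max_updates

rng = np.random.default_rng(1)
w_star = np.array([0.2, 1.0, 1.0])                 # assumed separating weights
X = np.hstack([np.ones((50, 1)), rng.uniform(-1, 1, size=(50, 2))])
y = np.sign(X @ w_star)

R = np.max(np.linalg.norm(X, axis=1))              # R = max_n ||x_n||
rho = np.min(y * (X @ w_star))                     # rho = min_n y_n w*^T x_n
bound = (R**2 * np.linalg.norm(w_star)**2) / rho**2

n_updates = pla_updates(X, y)
assert rho > 0
assert n_updates <= bound                          # bound from part (e) holds
```

Note that the bound can be very loose when \(\rho\) is small, i.e., when some point lies close to the separating boundary.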

1.8

(a) (Note: this inequality is known as Markov’s inequality).

We denote the density function of \(t\) as \(f\), which gives us

\[ \mathbb{P}[t \geq \alpha] = \int_\alpha^\infty f(t)\,dt. \]

Since \(t \geq \alpha \implies \frac{t}{\alpha} \geq 1\), we have

\[ \int_\alpha^\infty f(t)\,dt \leq \int_\alpha^\infty \frac{t}{\alpha}f(t)\,dt. \]

Finally, since \(\frac{t}{\alpha}f(t) \geq 0\), it follows that

\[ \int_\alpha^\infty \frac{t}{\alpha}f(t)\,dt \leq \int_0^\infty \frac{t}{\alpha}f(t)\,dt = \frac{\mathbb{E}(t)}{\alpha}. \]

Altogether, we have shown

\[ \mathbb{P}[t \geq \alpha] \leq \frac{\mathbb{E}(t)}{\alpha}. \]

(b) We can prove this inequality by utilizing Markov’s inequality. Note that \((t - \mu)^2\) is a non-negative random variable with

\[ \mathbb{E}[(t - \mu)^2] = \sigma^2, \]

so Markov’s inequality tells us

\[ \mathbb{P}[(t - \mu)^2 \geq \alpha] \leq \frac{\sigma^2}{\alpha}. \]

(c) We can again prove this inequality using Markov’s inequality, since linearity of expectation tells us

\[ \mathbb{E}(u) = \frac{1}{N}\sum_{n = 1}^N \mathbb{E}(u_n) = \mu. \]

Remember that independent random variables are also uncorrelated, so taking the variance of \(u\) gives us

\[ \text{Var}(u) = \frac{1}{N^2}\sum_{n = 1}^N \text{Var}(u_n) = \frac{\sigma^2}{N}. \]

We then apply Markov’s inequality to the non-negative random variable \((u - \mu)^2\) as we did in (b), and the inequality \(\mathbb{P}[(u - \mu)^2 \geq \alpha] \leq \frac{\sigma^2}{N\alpha}\) immediately follows.
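To see the inequality from (c) in action, here is a small numerical check using the sample mean of \(N\) iid uniform\((0, 1)\) variables; the distribution and the values of \(N\) and \(\alpha\) are chosen purely for illustration:

```python
import numpy as np

# Empirical check of P[(u - mu)^2 >= alpha] <= sigma^2 / (N * alpha)
# for u the mean of N iid uniform(0, 1) samples.
rng = np.random.default_rng(0)
N, alpha, trials = 25, 0.01, 100_000
mu, var = 0.5, 1 / 12                 # mean and variance of uniform(0, 1)

u = rng.random((trials, N)).mean(axis=1)
lhs = np.mean((u - mu) ** 2 >= alpha)  # empirical tail probability
rhs = var / (N * alpha)                # Chebyshev-style bound

assert lhs <= rhs
```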

1.9

(a) Note that because \(s > 0\) and the exponential function is monotonically increasing, we have

\[ \mathbb{P}[t \geq \alpha] = \mathbb{P}[e^{st} \geq e^{s\alpha}]. \]

Applying Markov’s inequality to this probability gives us

\[ \mathbb{P}[t \geq \alpha] \leq \frac{\mathbb{E}(e^{st})}{e^{s\alpha}} = e^{-s\alpha}T(s). \]

(b) We apply a similar transformation as in part (a) and expand the definition of \(u = \frac{1}{N}\sum_{n = 1}^N u_n\) to write

\[ \mathbb{P}[u \geq \alpha] = \mathbb{P}[e^{sNu} \geq e^{sN\alpha}] = \mathbb{P}\left[\prod_{n = 1}^N e^{su_n} \geq e^{sN\alpha}\right]. \]

Applying Markov’s inequality gives us the result:

\[ \mathbb{P}[u \geq \alpha] \leq e^{-sN\alpha}\,\mathbb{E}\left[\prod_{n = 1}^N e^{su_n}\right] = e^{-sN\alpha}\prod_{n = 1}^N \mathbb{E}(e^{su_n}) = \left(e^{-s\alpha}U(s)\right)^N. \]

Note that we can push the expectation operator into the product because the \(u_n\) are iid.

(c) We use the definition of expectation to evaluate

\[ U(s) = \mathbb{E}(e^{su_n}) = \frac{1}{2}e^{0} + \frac{1}{2}e^{s} = \frac{1 + e^s}{2}. \]

To minimize \(e^{-s\alpha}U(s)\), we take the derivative with respect to \(s\), set it equal to 0, and solve:

\[ \frac{d}{ds}\left[e^{-s\alpha}\,\frac{1 + e^s}{2}\right] = \frac{e^{-s\alpha}}{2}\left((1 - \alpha)e^s - \alpha\right) = 0 \implies s = \ln\frac{\alpha}{1 - \alpha}. \]

We plug in this value of \(s\) to minimize \(e^{-s\alpha}U(s)\) and then simplify to make (d) easier:

\[ e^{-s\alpha}U(s) = \left(\frac{\alpha}{1 - \alpha}\right)^{-\alpha}\frac{1}{2(1 - \alpha)} = 2^{-1}\alpha^{-\alpha}(1 - \alpha)^{-(1 - \alpha)}. \]

(d) We let \(\alpha = \mathbb{E}(u) + \epsilon = \frac{1}{2} + \epsilon\), so that \(1 - \alpha = \frac{1}{2} - \epsilon\). Note this is valid since \(0 < \epsilon < \frac{1}{2} \implies 0 < \alpha < 1 \). Substituting this into our result from (c) and using the exponent-log trick gives us

\[ e^{-s\alpha}U(s) = 2^{-1}\left(\tfrac{1}{2} + \epsilon\right)^{-(\frac{1}{2} + \epsilon)}\left(\tfrac{1}{2} - \epsilon\right)^{-(\frac{1}{2} - \epsilon)} = 2^{-\beta}, \]

where \(\beta\) is defined in the problem statement. Therefore from (b), since the inequality holds for the minimizing value of \(s\), we can conclude

\[ \mathbb{P}\left[u \geq \mathbb{E}(u) + \epsilon\right] \leq 2^{-\beta N}. \]

By taking derivatives, one can show that the minimum value of \(\beta\) on \([0, \frac{1}{2})\) occurs at \(\epsilon = 0\), where \(\beta = 0\); since \(\beta\) is strictly increasing for \(\epsilon > 0\), we conclude \(\beta > 0\) for \(0 < \epsilon < \frac{1}{2}\), finishing the proof.
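We can also verify numerically that the bound \(2^{-\beta N}\) dominates the exact binomial tail probability for fair coin flips; the particular \(N\) and \(\epsilon\) below are arbitrary choices:

```python
import math

# Chernoff bound P[u >= 1/2 + eps] <= 2^(-beta*N) vs. the exact binomial
# tail for N fair coin flips, where u is the fraction of heads.
N, eps = 100, 0.1
beta = (1
        + (0.5 + eps) * math.log2(0.5 + eps)
        + (0.5 - eps) * math.log2(0.5 - eps))

# Exact tail: P[#heads >= N*(1/2 + eps)] for a fair coin.
k_min = math.ceil(N * (0.5 + eps))
exact_tail = sum(math.comb(N, k) for k in range(k_min, N + 1)) / 2**N

assert beta > 0
assert exact_tail <= 2 ** (-beta * N)
```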

1.10

(a) To simplify, we assume that all the points in \(\mathcal{X}\) are unique. Since \(h\) makes an error whenever \(\mathbf{x} = \mathbf{x}_k\) and \(k\) is even, the off-training-set error is the fraction of even indices among \(k = N + 1, \ldots, N + M\), which implies

\[ E_{\text{off}}(h, f) = \frac{1}{M}\left(\left\lfloor\frac{N + M}{2}\right\rfloor - \left\lfloor\frac{N}{2}\right\rfloor\right) \approx \frac{1}{2}. \]

(b) If \(f\) generates \(\mathcal{D}\), then its output on the first \(N\) points are fixed. There are \(2^M\) possible outputs on the remaining \(M\) points, so we conclude that there are \(2^M\) such \(f\).

(c) There are \(\binom{M}{k}\) ways for \(f\) and \(h\) to disagree on \(k\) points, which is the same as saying there are \(\binom{M}{k}\) choices of \(f\) with \(E_{\text{off}}(h, f) = \frac{k}{M}\). To check our work, we can use the binomial theorem to show

\[ \sum_{k = 0}^M \binom{M}{k} = (1 + 1)^M = 2^M, \]

which matches the total number of possible \(f\) from (b).

(d) If every \(f\) is equally likely, then the definition of expectation says

\[ \mathbb{E}_f\left[E_{\text{off}}(h, f)\right] = \sum_{k = 0}^M \frac{k}{M}\binom{M}{k}\frac{1}{2^M}. \]

To simplify even further, we can use the identity \(\sum_{k = 0}^M k\binom{M}{k} = M2^{M - 1}\) to show

\[ \mathbb{E}_f\left[E_{\text{off}}(h, f)\right] = \frac{1}{M2^M}\cdot M2^{M - 1} = \frac{1}{2}. \]

(e) We note that the expression for the expected off-training-set error does not depend on the hypothesis \(h\), so we can conclude that no matter what hypotheses \(A_1\) and \(A_2\) choose, the expected error will be the same.
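For a small \(M\) we can enumerate every possible \(f\) and confirm that the average off-training-set error is exactly \(\frac{1}{2}\); the particular \(h\) below is an arbitrary choice:

```python
from itertools import product

# Average E_off(h, f) over all 2^M choices of f on M off-training points.
M = 6
h = [1, -1, 1, -1, 1, 1]          # an arbitrary fixed hypothesis (assumed)

total = sum(
    sum(h_k != f_k for h_k, f_k in zip(h, f))   # disagreements with this f
    for f in product([1, -1], repeat=M)
)
avg_e_off = total / (M * 2**M)    # average error per point, over all f
assert avg_e_off == 0.5           # independent of the choice of h
```

Changing `h` to any other \(\pm 1\) vector leaves the result unchanged, which is exactly the point of part (e).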

1.11

(a) This is a classic example of a single-variable function that we can minimize by taking the derivative: setting \(\frac{d}{dh}E_{\text{in}}(h) = 0\) gives

\[ \sum_{n = 1}^N (h - y_n) = 0 \implies h_{\text{mean}} = \frac{1}{N}\sum_{n = 1}^N y_n, \]

so we conclude that the value that minimizes \(E_{\text{in}}(h)\) is the sample mean.

(b) We can also take the derivative for this form of \(E_{\text{in}}(h)\). Using the identity \(\frac{d}{dx}|x| = \frac{|x|}{x} = \text{sign}(x)\) (valid for \(x \neq 0\)), we have

\[ \frac{d}{dh}E_{\text{in}}(h) \propto \sum_{n = 1}^N \text{sign}(h - y_n). \]

For this quantity to be 0, there must be as many positive values of \(h - y_n\) as negative ones. This means \(h\) should be the median of the \(y_n\) (or, in the case that \(N\) is even, any value between the two middle values).

(c) It is clear that

\[ h_{\text{mean}} = \frac{1}{N}\sum_{n = 1}^N y_n \longrightarrow \infty \quad \text{as} \quad y_N \longrightarrow \infty, \]

i.e., this outlier causes \(h_{\text{mean}}\) to blow up. However, \(h_{\text{med}}\) at most will jump to the next highest value of \(y_n\) or not be affected at all, depending on where \(y_N\) falls in the order of the \(y_n\).
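A tiny numerical illustration of this robustness; the data values are made up:

```python
import numpy as np

# As the outlier y_N grows, the mean tracks it but the median stays put.
base = np.array([1.0, 2.0, 3.0, 4.0])
for y_N in (5.0, 1e3, 1e9):
    y = np.append(base, y_N)
    assert np.median(y) == 3.0       # median is unaffected by the outlier

assert np.mean(np.append(base, 1e9)) > 1e8   # mean blows up with y_N
```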