Gaussian conditioning vs linear regression

Given a dataset $\mathcal{D} = \{ (x_i, y_i) \}_{i=1}^{N}$ of pairs of inputs $x_i \in \mathbb{R}^n$ and outputs $y_i \in \mathbb{R}^m$, one may wish to predict the output $y^*$ corresponding to a new input $x^*$.

Linear regression

One straightforward way to approach this problem is to postulate a linear dependence between $x$ and $y$, say $y = Ax$, and then estimate the parameter matrix $A \in \mathbb{R}^{m \times n}$ from the dataset $\mathcal{D}$ by linear regression,

\begin{equation} \label{lin_reg} \hat{A} = YX^T(XX^T)^{-1}, \end{equation}

where the matrices $X \in \mathbb{R}^{n \times N}$ and $Y \in \mathbb{R}^{m \times N}$ hold the inputs $x_i$ and the outputs $y_i$ as columns. When a new input $x^*$ arrives, the prediction $y^* = \hat{A}x^*$ is computed by multiplying the input $x^*$ with the estimate $\hat{A}$ of the assumed linear mapping $A \colon x \mapsto y$.
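As a quick illustration, here is a minimal NumPy sketch of these steps; the variable names and the toy data are assumptions made for the example, not part of the text.

```python
import numpy as np

# Minimal sketch: least-squares estimate A_hat = Y X^T (X X^T)^{-1}.
# Shapes follow the text: X is (n, N), Y is (m, N), data points as columns.
rng = np.random.default_rng(0)
n, m, N = 3, 2, 100

A_true = rng.normal(size=(m, n))                  # ground-truth map (demo only)
X = rng.normal(size=(n, N))                       # inputs as columns
Y = A_true @ X + 0.1 * rng.normal(size=(m, N))    # noisy outputs

# Solve A_hat (X X^T) = Y X^T instead of forming the inverse explicitly.
A_hat = np.linalg.solve(X @ X.T, X @ Y.T).T

x_star = rng.normal(size=n)                       # new input
y_star = A_hat @ x_star                           # point prediction
```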

Gaussian conditioning

Another approach is to postulate that inputs and outputs are jointly Gaussian,

\begin{equation} \label{joint_gauss} \begin{bmatrix} x \\ y \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}, \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix} \right), \end{equation}

and take $y^*$ to be the mean of the conditional distribution $p(y|x)$ evaluated at $x = x^*$. If both $x$ and $y$ are zero-mean, then the conditional mean is given by

\begin{equation} \label{cond_mean} \mu_{y|x} = \Sigma_{yx} \Sigma_{xx}^{-1} x. \end{equation}
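For concreteness, here is a small NumPy sketch of this conditioning step, assuming the joint covariance is known; the block names mirror the notation above, and the randomly generated covariance is purely illustrative.

```python
import numpy as np

# Sketch of Gaussian conditioning with a known joint covariance of (x, y).
rng = np.random.default_rng(1)
n, m = 3, 2

# Build a random valid (positive-definite) joint covariance for (x, y).
L = rng.normal(size=(n + m, n + m))
Sigma = L @ L.T + 1e-3 * np.eye(n + m)
Sigma_xx = Sigma[:n, :n]
Sigma_yx = Sigma[n:, :n]

x = rng.normal(size=n)                                   # observed input
mu_y_given_x = Sigma_yx @ np.linalg.solve(Sigma_xx, x)   # conditional mean
```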

Substitution of the maximum likelihood estimates of the covariances,

\begin{equation} \label{cov_est} \hat{\Sigma}_{yx} = \frac{1}{N} YX^T, \qquad \hat{\Sigma}_{xx} = \frac{1}{N} XX^T, \end{equation}

into Formula \eqref{cond_mean} for the conditional mean reveals a deep relation between Gaussian conditioning \eqref{cond_mean} and linear regression \eqref{lin_reg}.

Conditioning $\approx$ regression

Namely, Gaussian conditioning performs linear regression under the hood,

\begin{equation} \label{cond_lin_reg} \hat{\mu}_{y|x} = \hat{\Sigma}_{yx} \hat{\Sigma}_{xx}^{-1} x = \tfrac{1}{N} YX^T \left( \tfrac{1}{N} XX^T \right)^{-1} x = YX^T(XX^T)^{-1} x = \hat{A}x. \end{equation}

As a consequence, the output $y^*$ of the linear regression is the maximum likelihood estimate of the conditional mean $\mu_{y|x^*}$, provided inputs and outputs are assumed to be jointly Gaussian. In practice, this means that if you are only interested in a point prediction and do not care about quantifying uncertainty, then linear regression is equivalent to Gaussian conditioning.
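The equivalence is easy to check numerically; the following sketch (on made-up toy data) verifies that plugging the empirical covariances into the conditional-mean formula reproduces the regression prediction.

```python
import numpy as np

# Check: Sigma_hat_yx Sigma_hat_xx^{-1} x equals Y X^T (X X^T)^{-1} x,
# since the 1/N factors in the covariance estimates cancel.
rng = np.random.default_rng(2)
n, m, N = 3, 2, 200

X = rng.normal(size=(n, N))
Y = rng.normal(size=(m, n)) @ X + 0.05 * rng.normal(size=(m, N))

Sigma_hat_xx = X @ X.T / N
Sigma_hat_yx = Y @ X.T / N

A_hat = np.linalg.solve(X @ X.T, X @ Y.T).T        # regression estimate
x_star = rng.normal(size=n)

y_reg = A_hat @ x_star                                          # regression
y_cond = Sigma_hat_yx @ np.linalg.solve(Sigma_hat_xx, x_star)   # conditioning

assert np.allclose(y_reg, y_cond)                  # identical up to round-off
```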

Linear basis function models

The relation between Gaussian conditioning and linear regression straightforwardly extends to linear basis function models of the form $y = A\phi(x)$ where $\phi \colon \mathbb{R}^n \to \mathbb{R}^k$ is a (nonlinear) feature mapping. Indeed, replacing the old dataset $\mathcal{D}$ with a new dataset $\mathcal{D}_{\phi} = \{ (\phi_i, y_i) \}_{i=1}^{N}$ where $\phi_i = \phi(x_i)$, we see that all the same steps can be performed as before if $x_i$ is replaced by $\phi_i$. The underlying assumption, however, slightly changes in this case: the feature vectors $\phi$—and not the inputs $x$ themselves—are assumed to be jointly Gaussian with the outputs $y$.
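As an illustration of this extension, the sketch below runs the same regression on features instead of raw inputs; the particular feature map $\phi$ (stacking $x$ and its elementwise square) is a hypothetical choice made only for the example.

```python
import numpy as np

# Sketch of a linear basis function model y = A phi(x): the regression is
# performed on feature vectors phi(x_i) rather than on the inputs x_i.
def phi(x):
    # illustrative feature map: concatenate x with its elementwise square
    return np.concatenate([x, x**2])

rng = np.random.default_rng(3)
n, m, N = 2, 1, 150
k = 2 * n                                          # feature dimension

X = rng.normal(size=(n, N))
Phi = np.stack([phi(X[:, i]) for i in range(N)], axis=1)   # (k, N) features
Y = rng.normal(size=(m, k)) @ Phi + 0.05 * rng.normal(size=(m, N))

A_hat = np.linalg.solve(Phi @ Phi.T, Phi @ Y.T).T  # regression in feature space
x_star = rng.normal(size=n)
y_star = A_hat @ phi(x_star)                       # prediction via features
```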