Circle definitions
23 Dec 2016
In algebraic geometry, which studies zeros of polynomials, a circle $C_R$ is the set of points $(x, y)$ that satisfy $x^2 + y^2 = R^2$. In differential geometry, which studies curves and surfaces, a circle is the image of the curve $\gamma : t \mapsto (R\cos(2\pi t), R\sin(2\pi t))$ on $[0, 1)$, i.e. $C_R = \gamma([0, 1))$. In machine learning, points of a circle would be viewed as samples from a 2D probability distribution concentrated around the set of points $C_R$. In optimization, a circle would be the set of minimizers of a function. Each representation is good for certain purposes.
The algebraic definition, $x^2 + y^2 = R^2$, is not constructive. It tells you to try every point from $\mathbb{R}^2$ and to accept only those for which the constraint is satisfied. A constraint is like a predicate (a function returning either $0$ or $1$), so one can view this definition as a function on $\mathbb{R}^2$ that returns $1$ on $C_R$ and $0$ otherwise. As you can imagine, it is rather discontinuous. This definition is good for discrimination (think of image classification, for example), because it is easy to check whether a given point belongs to the circle, or whether it is at least close to the circle.
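The predicate view can be sketched in a few lines of Python. This is only an illustration: the function name `on_circle` and the tolerance parameter `tol` are assumptions introduced here, with `tol` standing in for the "at least close to the circle" check.

```python
import math


def on_circle(x, y, R=1.0, tol=0.0):
    """Return 1 if (x, y) lies on the circle of radius R (up to tol), else 0.

    `tol` is a hypothetical slack parameter: tol=0 is the exact constraint,
    tol>0 accepts points within distance tol of the circle.
    """
    # Distance from the point to the circle is |r - R|, where r is the
    # distance from the point to the origin.
    residual = abs(math.hypot(x, y) - R)
    return 1 if residual <= tol else 0
```

With `tol=0` the function is the discontinuous indicator of $C_R$; in floating-point arithmetic a small positive `tol` is usually the more practical choice.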
The parametric definition, $C_R = \gamma([0, 1))$, is constructive. It allows you to enumerate (in some sense) all points that belong to the circle (although there are of course uncountably many of them). This is good for generation. Imagine you had a parametric model for pictures of cats; then you could generate new cat pictures just by calling $\gamma$ with a different parameter value.
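Generation from the parametric curve can be sketched as follows. The function `gamma` mirrors the curve $\gamma$ from the text; the helper `circle_points` and its evenly spaced choice of parameters are assumptions made for illustration.

```python
import math


def gamma(t, R=1.0):
    """The curve t -> (R cos(2 pi t), R sin(2 pi t)) for t in [0, 1)."""
    angle = 2 * math.pi * t
    return (R * math.cos(angle), R * math.sin(angle))


def circle_points(n, R=1.0):
    """Generate n points on the circle by evaluating gamma at t = k / n."""
    return [gamma(k / n, R) for k in range(n)]
```

Every call to `gamma` with a new `t` yields a new point on the circle, which is the sense in which a parametric model generates.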
Let’s distribute some Gaussian probability mass along the circle. You can imagine that the circle becomes thicker. We can take, for example, a density that is uniform in the angle $\phi$ and Gaussian in the radius $r$,
$$p(r, \phi) \propto \exp\left(-\frac{(r - R)^2}{2\sigma^2}\right)$$
on $r \geq 0$, $0 \leq \phi < 2\pi$, with $R \gg \sigma$. The latter requirement is needed to ensure that the density is close to zero at the origin. Once we have a distribution, we can sample from it (at least in theory). That is the approach of many machine learning algorithms: they attempt to find the distribution of the data and then sample from it to generate new data. It is not an easy problem at all, even for a circle. The distribution can also be used to discriminate using maximum likelihood.
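For this particular thickened circle, sampling is straightforward. The sketch below assumes the angle is uniform on $[0, 2\pi)$ and the radius is Gaussian around $R$ with standard deviation $\sigma$, restricted to $r \geq 0$ by rejection; the function name `sample_noisy_circle` and its parameters are introduced here for illustration only.

```python
import math
import random


def sample_noisy_circle(n, R=1.0, sigma=0.05, seed=None):
    """Draw n points whose radius is Gaussian around R and angle is uniform.

    Assumes R >> sigma, so negative radii are rare; they are rejected to
    respect the constraint r >= 0.
    """
    rng = random.Random(seed)
    points = []
    while len(points) < n:
        r = rng.gauss(R, sigma)
        if r < 0:
            continue  # reject samples outside the domain r >= 0
        phi = rng.uniform(0.0, 2 * math.pi)
        points.append((r * math.cos(phi), r * math.sin(phi)))
    return points
```

The hard part mentioned in the text is the reverse direction: recovering such a distribution from data, rather than sampling from one that is already known.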
Let’s define the circle of radius $R$ as the set of minimizers of the Mexican hat function with $\sigma = R/2$. Convince yourself that the set of points so defined is indeed $C_R$. You might be wondering how one is supposed to use such a description of a circle. Note that many powerful gradient descent algorithms have been developed over the years, and we can use them to generate points on the circle by running the optimization from different initial conditions. We could also do a simple kind of discrimination by comparing cost function values.
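To make this concrete, assume the usual 2D Mexican hat form $f(x, y) = (1 - u)\,e^{-u}$ with $u = (x^2 + y^2)/(2\sigma^2)$, dropping the positive normalization constant (this form is an assumption, since the text does not write the function out). Then $df/du = (u - 2)\,e^{-u}$ vanishes at $u = 2$, i.e. at $x^2 + y^2 = 4\sigma^2 = R^2$ when $\sigma = R/2$. A minimal sketch of generating points by plain gradient descent follows; the function names, the step size, and the range of starting radii are illustrative choices.

```python
import math
import random


def mexican_hat(x, y, sigma):
    """Assumed cost: (1 - u) exp(-u), u = (x^2 + y^2) / (2 sigma^2)."""
    u = (x * x + y * y) / (2 * sigma ** 2)
    return (1 - u) * math.exp(-u)


def mexican_hat_grad(x, y, sigma):
    """Gradient of mexican_hat: (u - 2) exp(-u) / sigma^2 times (x, y)."""
    u = (x * x + y * y) / (2 * sigma ** 2)
    c = (u - 2) * math.exp(-u) / sigma ** 2
    return c * x, c * y


def descend_to_circle(x, y, R=1.0, step=0.2, iters=500):
    """Run plain gradient descent on the Mexican hat with sigma = R / 2."""
    sigma = R / 2
    lr = step * R ** 2  # scale the learning rate so behavior does not depend on R
    for _ in range(iters):
        gx, gy = mexican_hat_grad(x, y, sigma)
        x, y = x - lr * gx, y - lr * gy
    return x, y


def generate_points(n, R=1.0, seed=None):
    """Generate n points on the circle from random initial conditions."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        # Start away from the origin (a stationary point of the cost) and
        # not too far out, where the gradient is exponentially small.
        r0 = rng.uniform(0.3 * R, 1.5 * R)
        phi0 = rng.uniform(0.0, 2 * math.pi)
        points.append(descend_to_circle(r0 * math.cos(phi0), r0 * math.sin(phi0), R))
    return points
```

Because the gradient is radial, each run keeps its initial angle and only adjusts the radius, so the spread of the generated points along the circle comes entirely from the initial conditions. For discrimination, `mexican_hat` attains its minimum value $-e^{-2}$ exactly on the circle, so comparing cost values against that number indicates how far a point is from it.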
There are many ideas reminiscent of the parametric versus constraint definitions. As already pointed out, generative and discriminative models are the same concepts with a bit of uncertainty added. Actor-critic algorithms can also be viewed in this way: the actor generates something using a parametric model and the critic filters it through the constraint. Similar ideas have been mentioned by Misha Gromov under the names ergobrain and egomind. With a bit of a stretch, one can frame Freud's subconscious and conscious, and Systems 1 and 2 of Tversky and Kahneman, as generators and discriminators from generative adversarial networks, which are again about actors and critics, parameters and constraints. Finally, there is the classical division of mathematics into algebra and geometry (very nicely described by Poincaré). So, this duality appears to be pretty deep indeed.