At webrepublic, I tried this morning to write a technical blog post about an upcoming presentation that we are going to give, but we rejected it as too technical. I was planning a second part on the perceptron algorithm, but that will be postponed, maybe until the weekend. I think there is already enough material on the Internet, especially videos, but no easily accessible article (or series of articles) that can serve as a gentle introduction and motivate someone to explore this exciting field.
We assume that the reader knows some maths (end of high school) or likes to look up terms that they do not understand.
We often hear about Big Data and Machine Learning. In this blog post, we will skip the introductions and the reasons why it is important, and focus directly on some examples of Machine Learning and how it works. We begin with maximum likelihood estimators and focus on an example using a Gaussian distribution. We limit ourselves to intuitive definitions of the terms that we use. More formal definitions can easily be found on the web.
Assume that we have some measurements of the height of several people. We know that each measurement follows a known distribution, but we don’t know the exact values of the distribution parameters (the mean and the variance, in the case of a normal distribution). We will call this scenario a parametric model, since we have already specified a family of functions that may have generated our data and that differ only in the values of their parameters. Our goal is to specify which one of these functions is the most suitable to describe our data. We call the function that we finally pick the estimator. An estimator can provide a prediction about how future data from the same source may look.
Let us have another look at our data. We may see an array of n = 10 measurements similar to X = [157, 169, 172, 160, 170, 171, 167, 176, 180, 171]. We probably want to pick the estimator that maximizes the likelihood of X, that is, out of all possible parameter values, choose those that make X least surprising. Assume that we have worked hard and we have found an estimator that describes how one measurement was produced, as specified by our goal. What can we tell about the probability of all these measurements together? Well, not much yet. We have to know how they affect each other. To keep things simple, we will make the additional assumption that the measurements are independent. That means that, once we know the distribution from which they are drawn, one measurement tells us nothing more about the next measurement than we already know. Now, since the events are independent, we can find the probability of X by multiplying the probabilities of each measurement x_i. Then we will pick the values of the parameters that maximize P(X).
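The multiplication step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the normal density from the standard library's `statistics.NormalDist` (for continuous measurements we multiply density values rather than probabilities), and the two candidate parameter pairs are made up for the comparison.

```python
import math
from statistics import NormalDist

X = [157, 169, 172, 160, 170, 171, 167, 176, 180, 171]

def likelihood(data, mu, sigma):
    """Product of the normal densities of independent measurements."""
    dist = NormalDist(mu, sigma)
    return math.prod(dist.pdf(x) for x in data)

# Two hypothetical candidates: one centred near the data, one far from it.
likelihood_a = likelihood(X, mu=169.0, sigma=6.5)
likelihood_b = likelihood(X, mu=150.0, sigma=6.5)

# The candidate that makes X less surprising has the larger likelihood.
print(likelihood_a > likelihood_b)  # True
```

The absolute values are tiny numbers and not very informative on their own; what matters is how the candidates compare with each other.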
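We look up the definition of the normal distribution:
$$ p(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $$
where μ is the mean value and σ² is the variance. It may look scary but it is actually pretty simple. We should not judge a formula by its length.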
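According to our previous discussion, the joint probability p(X), that is, the probability that X occurs if we take n measurements, is:
$$ p(X \mid \mu, \sigma) = \prod_{i=1}^{n} p(x_i \mid \mu, \sigma) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right) $$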
We just wrote in a concise way what the probability of occurrence of X is, given the parameters μ, σ. Our goal is to find the values of μ, σ that maximize p(X|μ, σ). We take the logarithm of p(X|μ, σ) and, using the property that:
log(a · b) = log a + log b
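we convert the product to a sum. Let’s see it in action:
$$ \log p(X \mid \mu, \sigma) = \sum_{i=1}^{n} \log p(x_i \mid \mu, \sigma) = -n \log\left(\sigma\sqrt{2\pi}\right) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i-\mu)^2 $$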
The quantity log p(X|μ, σ) is called the log-likelihood of the data. Maximizing the log-likelihood is equivalent to maximizing the likelihood of our data, since the function log is strictly increasing.
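A small numerical check of this equivalence, as a sketch rather than a proof: we fix σ at an arbitrary value (6.5, chosen only for illustration), try a grid of candidate values for μ, and see which candidate wins under each criterion. The helper names below are hypothetical.

```python
import math

X = [157, 169, 172, 160, 170, 171, 167, 176, 180, 171]

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def likelihood(data, mu, sigma):
    """Product of the densities of independent measurements."""
    return math.prod(normal_pdf(x, mu, sigma) for x in data)

def log_likelihood(data, mu, sigma):
    """Sum of the log-densities, using log(a * b) = log a + log b."""
    return sum(math.log(normal_pdf(x, mu, sigma)) for x in data)

SIGMA = 6.5  # assumed fixed for this illustration
candidates = [160 + 0.1 * k for k in range(201)]  # mu from 160.0 to 180.0

best_by_likelihood = max(candidates, key=lambda mu: likelihood(X, mu, SIGMA))
best_by_log_likelihood = max(candidates, key=lambda mu: log_likelihood(X, mu, SIGMA))

print(best_by_likelihood == best_by_log_likelihood)  # True
print(round(best_by_likelihood, 1))                  # 169.3
```

Both criteria select the same candidate, which is what we expect from a strictly increasing transformation.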
In our setting, we have already specified the value of X and we try to find the values of μ, σ. In other words, the likelihood is a function of the parameters μ, σ and not of the data X. In the following section, we determine the values of μ, σ that maximize the log-likelihood. Assuming that everything else is fixed, we consider the likelihood as a function of μ. Under some conditions, a function has reached its maximum value when it stops increasing. A function stops increasing at a point if the slope of the tangent line at that point is 0. We can formalize this concept using the first derivative and finding the values of μ that make it 0:
$$ \frac{\partial}{\partial \mu} \log p(X \mid \mu, \sigma) = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) = 0 $$
That holds for:
$$ \mu = \frac{1}{n} \sum_{i=1}^{n} x_i $$
Following a similar procedure for σ, we can find that:
$$ \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 $$
Applying these formulas to our data X, we get μ = 169.3 and σ = 6.45, values which are very close to the true values that generated this sequence.
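The numbers above can be reproduced with a short Python sketch. It assumes the usual closed-form maximum likelihood estimates for a normal distribution: the sample mean for μ, and the square root of the average squared deviation (dividing by n, not n - 1) for σ. The variable names are hypothetical.

```python
import math
from statistics import pstdev

X = [157, 169, 172, 160, 170, 171, 167, 176, 180, 171]
n = len(X)

# Maximum likelihood estimate of the mean: the sample average.
mu_hat = sum(X) / n

# Maximum likelihood estimate of the standard deviation: divide by n, not n - 1.
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in X) / n)

print(round(mu_hat, 1))     # 169.3
print(round(sigma_hat, 2))  # 6.45

# The standard library's population standard deviation uses the same 1/n formula.
print(math.isclose(sigma_hat, pstdev(X)))  # True
```

Note that dividing by n - 1 instead (as `statistics.stdev` does) would give a slightly larger value; that is a different estimator from the maximum likelihood one discussed here.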
We just described the main idea behind the maximum likelihood estimator.
Where to go from here: