When Charles Darwin was deciding whether or not to propose to his cousin, Emma Wedgwood, he took a rather interesting approach: he drew up a list of pros and cons. As pros he listed things like children, companionship, and the charms of music and female chit-chat; as cons he listed things like a terrible loss of time, the burden of visiting relatives, and having less money to spend on books. The list went on for quite a while, and by a very small margin he decided to marry, signing off “Quod erat demonstrandum”, Latin for “which was to be demonstrated”; in other words, it had been proved necessary to marry. When it comes to thinking, we tend to assume that more is better: you will make a better decision the more pros and cons you list. You will spend just one more day thinking about that email before you send it; you have got to think up that perfect one-liner to ask your crush out.
However, algorithms in machine learning tell us that more information is not always better. In fact, placing too much value on unnecessary information can lead to wrong predictions and results. Take an example that Darwin might have found useful: a study conducted on German couples. The data points in Figure 1 represent their life satisfaction during the first 10 years of marriage.
Figure 1: Life satisfaction during first 10 years of marriage
Let’s say we want a computer program to predict what the couples’ life satisfaction will be after those 10 years. The simplest prediction we could make takes just one factor into account: time. This is called a one-factor model, and it produces a simple straight line on a chart, as in Figure 2. It captures the basic trend but misses a lot of the data points, and if we follow the line for long enough, it says that couples become infinitely miserable the longer they stay married, which does not sound quite right.
Figure 2: One-factor model
To capture a more complex, but slightly more accurate, relationship between time and happiness, we could take two factors into account: for example, time and time squared. This two-factor prediction (Figure 3) passes closer to more of the data points, and the trend seems to align better with what psychologists say: there is a slight comedown after the honeymoon bliss, but life satisfaction more or less levels out over time.
Figure 3: Two-factor model
So, if using two factors is more accurate than using one, adding even more information should lead to an even more accurate prediction, right? Well, let’s try it with nine factors (Figure 4). The model does a good job of matching the current data, as it passes through every point, but it misses the overall trend. It shows that couples are deeply miserable right up until their wedding day, that their marriage is then a series of severe ups and downs, and that after 10 years they drop into a sudden, deep depression.
Figure 4: Nine-factor model
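To see the same effect in code, here is a minimal sketch that fits one-, two- and nine-factor models and asks each for a prediction one year past the data. It assumes that "factors" means powers of time (so the nine-factor model is a degree-9 polynomial), and the satisfaction scores are invented for illustration; they are not the values from the German study. The helper name fit_and_predict is likewise just a label chosen here.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical satisfaction scores (0-10 scale) for years 1..10 of marriage.
# These numbers are made up for illustration; they are NOT the study's data.
years = np.arange(1, 11, dtype=float)
scores = np.array([7.8, 7.3, 7.2, 6.9, 7.1, 6.8, 7.0, 6.9, 6.7, 6.9])


def fit_and_predict(degree, year):
    """Least-squares polynomial fit; returns (training error, prediction)."""
    model = Polynomial.fit(years, scores, degree)
    train_mse = float(np.mean((model(years) - scores) ** 2))
    return train_mse, float(model(year))


# Assumption: one, two and nine "factors" = polynomial degrees 1, 2 and 9.
results = {d: fit_and_predict(d, 11.0) for d in (1, 2, 9)}

for d, (mse, pred) in results.items():
    print(f"{d}-factor model: training error = {mse:.4f}, "
          f"predicted score in year 11 = {pred:.1f}")
```

With these made-up scores, the straight line and the quadratic both predict a year-11 score of around 7. The nine-factor curve matches every training point almost perfectly, yet predicts a score far below zero, which is not even on the scale.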
In practice, obscuring data, adding random noise, or withholding information often makes for better predictions. By placing too much emphasis on each individual data point, we lose sight of what is really important: the trend. This problem is called overfitting, and it tells us that sometimes the less we think, the better off we are. Overfitting, and the wrong predictions that come with it, usually results from placing too much emphasis on what we can measure while forgetting what actually matters.
So, what can we do about overfitting? In machine learning there is a technique called regularisation, which can be thought of as a kind of complexity penalty. For instance, one method of regularisation is to scale down the weights of all the factors until most of them are exactly zero. This means that only the most important few have any say in the final decision. In Darwin’s situation, this might have looked something like crossing out the less important stuff, like children and companionship, and keeping only the stuff that really mattered, like not being able to spend as much on books.
Speaking of Darwin, right after he decided by proof to marry, he immediately started fretting about when to marry. He wrote yet another list of pros and cons, considering things like happiness, awkwardness, expenses, and always having wanted to travel in a hot air balloon (not sure whether that was a pro or a con), but by the end of the page he resolved: “Never mind, trust to chance”. In other words, he had his own method of regularising. He married Emma Wedgwood, and they lived happily ever after, until he died of congestive heart failure at 73.
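For readers who want to see the complexity penalty described above in action, here is a minimal sketch of one way to "scale down the weights until most of them are exactly zero". It uses soft-thresholding, the shrinkage step behind L1 (lasso) regularisation, which is an assumption about the method the text has in mind. The factor names, their weights and the penalty value are all hypothetical, chosen only to echo Darwin’s list.

```python
import numpy as np

# Hypothetical factors and weights, invented for illustration only.
factors = ["books", "children", "companionship", "loss_of_time", "relatives"]
weights = np.array([2.5, -0.3, 0.1, -1.8, 0.05])


def soft_threshold(w, penalty):
    """Shrink every weight towards zero by `penalty`.

    Weights whose magnitude is at most `penalty` become exactly zero;
    the rest keep their sign but lose `penalty` in magnitude.
    """
    return np.where(np.abs(w) > penalty, w - np.sign(w) * penalty, 0.0)


penalty = 0.5  # assumed strength of the complexity penalty
shrunk = soft_threshold(weights, penalty)

# Only the factors that survive the penalty still have a say.
kept = [name for name, w in zip(factors, shrunk) if w != 0.0]
print(kept)
```

A larger penalty zeroes out more factors and a smaller one keeps more, so the penalty acts as a dial for how much complexity the model is allowed.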