Names change, but ideas usually don’t. How is today’s ‘data science’ different from yesterday’s statistics, mathematics and probability?

Actually, it’s not very different. If it seems changed it’s only because the ground reality has changed. Yesterday we had data scarcity, today we have a data glut (“big data”). Yesterday we had our models, and were seeking data to validate them. Today we have data, and seek models to explain what this data is telling.

Can we find associations in our data? If there’s association, can we identify a pattern? If there are multiple patterns, can we identify which are the most likely? If we can identify the most likely pattern, can we abstract it to a universal reality? That’s essentially the data science game today.

**Correlation**

Have you ever wondered why the staple food in most of India is dal-chaval or dal-roti? Why does almost everyone eat the two together? Why not just dal followed by just chaval?

The most likely reason is that the nutritive benefit when eaten *together* is more than the benefit when eaten *separately*. Or think of why doctors prescribe combination drug therapies, or think back to the film Abhimaan (1973) in which Amitabh Bachchan and Jaya Bhaduri discovered that singing together created harmony, while singing separately created discord. Being together can offer a greater benefit than being apart.

Of course, togetherness could also harm more. Attempting a combination of two business strategies could hurt more than using any individual strategy. Or partnering Inzamam ul Haq on the cricket field could restrict two runs to a single, or, even more likely, result in a run out!

In data science, we use the correlation coefficient to measure the degree of linear association or togetherness. A correlation coefficient of +1 indicates the strongest possible positive association, while a value of -1 corresponds to the negative extreme. In general, a high positive or negative value is an indicator of greater association.

The availability of big data now allows us to use the correlation coefficient to more easily confirm *suspected* associations, or discover *hidden* associations. Typically, the data set is a spreadsheet, e.g., supermarket data with customers as rows, and every item of merchandise sold as a column. With today’s number-crunching capability, it is possible to compute the correlation coefficient between every pair of columns in the spreadsheet. So, while we can compute the correlation coefficient to confirm that beer cans and paper napkins are positively correlated (could be a dinner party), we could also unearth a hidden correlation between beer cans and baby diapers.
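This all-pairs computation is a one-liner in practice. Here is a minimal sketch in Python, with a made-up basket table (the product columns and counts are purely illustrative, not real supermarket data):

```python
import pandas as pd

# Toy basket data: each row is a customer, each column the count of an
# item bought. The numbers are invented for illustration only.
data = pd.DataFrame({
    "beer_cans":     [6, 0, 4, 8, 1, 5, 0, 7],
    "paper_napkins": [5, 1, 3, 7, 0, 4, 1, 6],
    "baby_diapers":  [4, 0, 3, 6, 1, 4, 0, 5],
    "toothpaste":    [1, 2, 1, 0, 2, 1, 2, 0],
})

# Correlation coefficient between every pair of columns at once.
corr = data.corr()
print(corr.round(2))
```

Scanning the resulting matrix for unexpectedly high off-diagonal entries is exactly how a hidden beer-and-diapers association would surface.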

Why would beer cans and baby diapers be correlated? Perhaps there’s no valid reason, perhaps there’s some common factor that we don’t know about (this has triggered the ‘correlation-is-not-causation’ discussion). But today’s supermarket owner is unlikely to ponder over such imponderables; he’ll just direct his staff to place baby diapers next to beer cans and hope that it leads to better sales!

**Regression**

If two variables *X* and *Y* have a high correlation coefficient, it means that there is a strong degree of linear dependence between them. This opens up an interesting possibility: why not use the value of *X* to predict the likely value of *Y*? The prospect becomes even more enticing when it is easy to obtain *X*, but very hard (or expensive) to obtain *Y*.

To illustrate, let us consider the height (*X*) and weight (*Y*) data of 150 male students in a class. The correlation coefficient between *X* and *Y* is found to be 0.88. Suppose a new student joins. We can measure his height with a tape, but we don’t have a weighing scale to obtain his weight. Is it possible to predict his weight?

Let us first plot this data on a scatter diagram (see below); every blue dot on the plot corresponds to the height-weight of one student. The plot looks like a dense maze of blue dots. Is there some ‘togetherness’ between the dots? There *is* (remember the correlation is 0.88?), but it isn’t *complete* togetherness (because, then, all the dots would’ve aligned on a single line).

To predict the new student’s weight, our best bet is to draw a straight line cutting right through the middle of the maze. Once we have this line, we can use it to read off the weight of the new student on the Y-axis, corresponding to his measured height plotted on the X-axis.

How should we draw this line? The picture offers two alternatives: the blue line and the orange line. Which of the two is better? The one that is ‘middler’ through the maze is better. Let us drop down (or send up) a vertical ‘blue offset’ from every dot on to the blue line, and, likewise, a vertical ‘orange offset’ from every dot on to the orange line (note that if the dot is on the line, the corresponding offset has zero length). Now sum the *squares* of the lengths of all the blue offsets, and likewise of all the orange offsets. The line with a *smaller* sum is the better line!

X: height; Y: weight

Notice that the blue and orange lines vary only in terms of their ‘slope’ and ‘shift’, and there is an infinity of such lines. The line with the lowest sum of squared vertical offsets is the ‘best’ possible line. We call this the *regression line* to predict *Y* using *X*; and it will look like:

*a*_{1} *X* + *a*_{2}, with *a*_{1} and *a*_{2} being the slope and shift values of this best line. This is the underlying idea in the famed least-squares method.
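The slope and shift of this best line can be computed directly. A short Python sketch, using synthetic stand-ins for the 150 height-weight pairs (the numbers below are generated, not the class data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for 150 height (cm) / weight (kg) pairs:
# roughly linear relationship plus noise, for illustration only.
height = rng.uniform(155, 190, 150)
weight = 0.9 * height - 90 + rng.normal(0, 5, 150)

# Least-squares fit: slope a1 and shift a2 of the best line.
a1, a2 = np.polyfit(height, weight, deg=1)

# Predict the new student's weight from his measured height.
new_height = 172.0
predicted_weight = a1 * new_height + a2
print(f"slope={a1:.2f}, shift={a2:.2f}, "
      f"predicted weight={predicted_weight:.1f} kg")
```

`np.polyfit` with `deg=1` is exactly the least-squares line; reading the prediction off the line is then a single multiply-and-add.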

**Bivariate to multivariate**

Let us see how we can apply the same idea to the (harder) problem of predicting the likely marks (*Y*) that a student might get in the final exam. The number of hours studied (*X*_{1}) seems to be a reasonable predictor. But if we compute the correlation coefficient between *Y* and *X*_{1}, using sample data, we’ll probably find that it is just about 0.5. That’s not enough, so we might want to consider another predictor variable. How about the intelligence quotient (IQ) of the student (*X*_{2})? If we check, we might find that the correlation between *Y* and *X*_{2} too is about 0.5.

Why not, then, consider *both* these predictors? Instead of looking at just the simple correlation between *Y* and *X*, why not look at the multiple correlation between *Y* and both *X*_{1} and *X*_{2}? If we calculate this multiple correlation, we’ll find that it is about 0.8.
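The multiple correlation is nothing mysterious: it is the simple correlation between *Y* and the best least-squares prediction of *Y* from *X*_{1} and *X*_{2} together. A sketch with invented numbers for hours studied, IQ and marks, chosen so the correlations come out roughly like those quoted above:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Hypothetical data: hours studied (x1), IQ (x2), exam marks (y).
x1 = rng.normal(20, 5, n)
x2 = rng.normal(100, 10, n)
y = 1.5 * x1 + 0.8 * x2 + rng.normal(0, 10, n)

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

# Simple correlations of y with each predictor alone.
r1, r2 = corr(y, x1), corr(y, x2)

# Multiple correlation: correlation between y and its least-squares
# prediction from x1 and x2 together (column of ones = the shift term).
X = np.column_stack([x1, x2, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
R = corr(y, X @ coef)
print(f"r(y,x1)={r1:.2f}, r(y,x2)={r2:.2f}, multiple R={R:.2f}")
```

Two middling predictors combining into one strong prediction is precisely the effect described in the text.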

And, now that we are at it, why not also add two more predictors: the quality of the teaching (*X*_{3}), and the student’s emotional quotient (*X*_{4})? If we go through the exercise, we’ll find that the multiple correlation keeps increasing as we keep adding more and more predictors.

However, there’s a price to pay for this greed. If three predictor variables yield a multiple correlation of 0.92, and the next predictor variable makes it 0.93, is it really worth it? Remember too that with every new variable we also increase the computational complexity and errors.

And there’s another, even more troubling, question. Some of the predictor variables could be strongly correlated *among themselves* (this is the problem of multicollinearity). Then the extra variables might actually bring in more noise than value!

How, then, do we decide what’s the optimal number of predictor variables? We use an elegant construct called the *adjusted multiple correlation*. As we keep adding more and more predictor variables to the pot (we add the most correlated predictor first, then the second most correlated predictor and so on …), we reach a point where the addition of the next predictor *diminishes the adjusted multiple correlation* even though the multiple correlation itself keeps rising. That’s the point to stop!
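A minimal sketch of this stopping rule, stated in terms of the (more commonly tabulated) adjusted R-squared, which is just the square of the adjusted multiple correlation. The data is synthetic, constructed so that only the first three hypothetical predictors actually drive *Y* and the fourth is pure noise:

```python
import numpy as np

def adjusted_r2(y, X):
    """R^2 adjusted for the number of predictors in X (shift term added here)."""
    n, k = X.shape
    A = np.column_stack([X, np.ones(n)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid.var() / y.var()
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

rng = np.random.default_rng(2)
n = 100
# Hypothetical predictors: only the first three actually matter;
# the fourth contributes nothing but noise.
X = rng.normal(size=(n, 4))
y = X[:, 0] + 0.7 * X[:, 1] + 0.4 * X[:, 2] + rng.normal(0, 1, n)

for k in range(1, 5):
    print(f"{k} predictors: adjusted R^2 = {adjusted_r2(y, X[:, :k]):.3f}")
```

The plain R-squared would rise at every step; the adjustment factor (n - 1)/(n - k - 1) charges a small penalty per predictor, so a useless predictor tends to leave the adjusted value flat or falling, which is the signal to stop.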

Let us suppose that this approach determines that the optimal number of predictors is 3. Then the multiple regression line to predict *Y* will look like *a*_{1} *X*_{1} + *a*_{2} *X*_{2} + *a*_{3} *X*_{3} + *a*_{4}, where *a*_{1}, *a*_{2}, *a*_{3}, *a*_{4} are the coefficients based on the least-squares criterion.

Predictions using multiple regression are getting more and more reliable because there’s so much more data these days to validate them. There is this (possibly apocryphal) story of a father suing a supermarket because his teenage daughter was being bombarded with mailers to buy new baby kits. “My daughter isn’t pregnant”, the father kept explaining. “Our multiple regression model indicates a very high probability that she is”, the supermarket insisted. And she was …

As we dive deeper into multivariate statistics we’ll find that this is the real playing arena for data science; indeed, when I look at the contents of a machine learning course today, I can’t help feeling that it is multivariate statistics re-emerging with a new disguise. As the French writer Jean-Baptiste Alphonse Karr remarked long ago: *plus ça change, plus c’est la même chose!*