Today, let’s dust off the *regression analysis* material of CFA Level 1 and see if we can use it in a real world example. We’re going to build a simple **quantitative model** that forecasts future stock price movements!

We’ll construct our model using single variable linear regression. You’ll recognize the equation from high school algebra:

**y = mx + b**

It’s just the equation for a straight line. We take some predictor variable (x), multiply it by some constant (m) and add an offset (b) to make our prediction (about y).

I racked my brains to figure out both what stock we might predict, and what other variable we might use to predict it. I racked for almost **a full 2 minutes** before I remembered this…

There’s a theory out there that says not all market sectors perform well at the same time. For example, the plot suggests what’s happening in Industrials today might give us information about what will happen in Energy in the near future.

That was enough racking for me. I’ve decided **we’ll see if last month’s return from an Industrials ETF can be used to forecast this month’s return of an Energy ETF**. This baby better work.

Yahoo Finance has historical ETF price information for the two ready to download in CSV format. I’m going to use the “Adjusted Close” column which includes dividends just because I’m paranoid about some ex-dividend price gap movement throwing a wrench in the works.

The first step is just to merge the two ETFs’ CSV files from Yahoo into one spreadsheet. Then I convert their monthly *price* columns into monthly *returns* like so…

I used a 5 year history: from April 2008 to April 2013.

And now, we **do not** compute the slope and intercept by hand, **nor** do we use the spreadsheet SLOPE and INTERCEPT functions. No my friends, we stand on the shoulders of giants and use the LINEST function. You’ll understand why soon.

LINEST takes 4 arguments:

- arg1: our Y data

arg2: our X data

arg3: set to TRUE to use y=mx + b, otherwise it uses y=mx

arg4: set to TRUE to get a boatload of stats returned about our model

So type that into a single cell, press return…

… and get multiple cells of output!

Let’s add labels to the left and right so that we know what LINEST gave us:

We got our **m**. We got our **b**. So at the end of June for example, the model predicts:

We got the standard error (**SE**) of our coefficients and model. If you crunch the numbers I think you’ll find the **F**-statistic shows we’re in “statistically significant” territory. Now there’s a secret phrase I learned in the South, to use when a Southerner invites you to an event but you’d prefer not to go.

*“Well ya know ah’d looooove to!*

…but I just cain’t.”

…but I just cain’t.”

Well I’d love to perform a Breusch-Pagan test for conditional heteroskedasticity for you too right here. But I just can’t.

Assuming no red flags there, let’s zero in on the value returned in cell B68. Our model’s R^{2} is 10.3%. In plain English, this means:

Look, the monthly returns of the Energy ETF are obviously all over the map, but our model explains 10% of the variation in those monthly returns.

Now 10% sounds pathetically inaccurate. It is! Our model is **highly inaccurate** by itself. But the word on the street is that this may indeed be an interesting lead for someone working in quantitative trading – and in fact even a model with an R^{2} as low as 5% won’t necessarily be thrown in the trash can. Why?

I’ve never worked at a quant hedge fund, but after doing a few Kaggle competitions and learning to love the random forest machine-learning algorithm in particular, I suspect that it’s due to the use of *ensemble methods*.

**It’s counterintuitive, but true, that you often get a more accurate prediction by averaging the outputs of multiple lousy predictors that have each only seen parts of the historical data rather than using one master 1000-variable model that has been fine-tuned with Swiss precision over all the data.**

The latter simply suffers too much from overfitting. But our simple model, taking its place among a **band of brothers** (lousy brothers too, but from diverse viewpoints) could earn its keep in helping predict the coming monthly return of XLE. Or at least its direction.