Running a solid correlation
This article isn’t about how to crunch the statistic—any stat book or Excel help can tell you how to do that. Instead, it’s my usual theme: understanding what you’re working with and making sure you’ve got something you can count on for your business decisions.
- Correlation coefficient
- Cause and effect
- Sample quality
- How important is it really?
- Alternate analysis
- Reporting results
Even before running the correlation, always chart the results. Your brain is the best analysis software ever made, and it will spot many oddities in graphical form that can be masked when the data are reduced to a single value. See also What's your executive dashboard hiding? and the Wikipedia article on correlation, which shows varied scatterplots sharing the same coefficients.
A “perfect” correlation is 1 or -1, with 0 being no statistical relationship. On occasion, you’ll get nice strong correlations like these two:
A client whose survey came up with a weak correlation asked if I had a rule of thumb for reporting results, a question nearly guaranteed to produce an “It’s more than just the number” spiel from me. Whether your correlation comes out at 0.96 or 0.56, it’s always worthwhile to review the reliability of your inputs before making major moves on a statistical output. And remember, your perspective on those inputs is often quite different at the end of a project than at the beginning.
With any multivariate analysis, the goal is to explain the greatest amount of cause and effect with the fewest variables, a principle known as parsimony, close cousin to Occam’s Razor. Occasionally this leads to oversimplification, but in the real world of humans and complex interactions, the more likely problem is overlooking or working around a hole in the interactions:
- Researchers often have blind spots about what’s truly causing an effect, and the entire study may be conducted without ever asking that key question
- Sometimes a factor is too sensitive to ask of respondents
- The true cause may be something you can ask of respondents, but not something you can act on for practical purposes such as media buying
In any of those cases, what you may end up with in your analysis, regardless of the strength of your correlation value, is a calculation along this indirect lower route:
This isn’t wrong so much as precarious, especially if you’re dealing with an indirect chain a couple of Effects long, or where the Faux Cause is fed by several root Causes. If you’re clear on what you’re correlating, and temper your business decisions with the lower reliability, you’re fine. But if you’re on the lower path and think you’re correlating against a root Cause, that’s where you can get into trouble. Not only will you be counting on those results for more than they can deliver; as long as you believe you already know the Cause, you won’t be looking for more puzzle pieces and explanations.
As always, the more representative your sample is and the more replies you have, the more reliable your statistics are. See Real world sampling for a crib sheet.
Statistics are one of those “Just because we can, doesn’t mean we should” areas. Numbers have a facade of precision and reliability that makes them attractive, and the endless variants (especially when you start digging at this level) can give new meaning to Analysis Paralysis, keeping your team from focusing on what’s really important to your business.
Before you worry too much about a correlation coefficient that comes up at a moderate 0.75 instead of a strong 0.90, step back and look at how much the application of that particular Cause/Effect relationship is going to impact your operations. Is it a bit of understanding or is it a key driver of a multi-million dollar product launch? The statistic is always a tool; your business need is what drives the decision. With that context restored, you can decide whether the 0.75 is simply something to watch in future research while you move on to more important factors, or critical enough to clarify now with an additional survey.
If on the scatterplot you think you see a pattern (versus a uniform blob), but the correlation coefficient isn’t coming up with much, here are two alternate approaches.
- Chi Square
Straightforward to calculate, and in consolidating the segments you may bring a picture of the relationships into focus at the statistical reliability you're hoping for. See Understanding Chi Square.
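As a feel for how straightforward the calculation is, here is a minimal hand computation of the Chi Square statistic on a hypothetical 2x2 table (the counts and labels are invented for illustration):

```python
# Hypothetical survey counts: rows = saw the ad / didn't, cols = bought / didn't
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Chi Square: sum of (observed - expected)^2 / expected over every cell
chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_square += (obs - expected) ** 2 / expected

print(round(chi_square, 2))  # 16.67 for these made-up counts
```

You would then compare the statistic against a Chi Square table at the appropriate degrees of freedom to judge significance.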
- Cluster analysis
Not straightforward to calculate, but a powerful tool if you are a statistician or have one at hand. If this is of interest, I’d start with the Wikipedia article Cluster analysis (in marketing) and then move on to your favorite cluster-supporting statistics software (SPSS, SAS, etc.) rather than attempting to read the assorted formula-laden pages Google will bring up.
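Just to give a feel for what that software is doing under the hood, here is a toy k-means sketch on one-dimensional invented scores; real cluster analysis in SPSS or SAS handles many variables at once and far more robustly, so treat this only as intuition:

```python
# Toy k-means with k=2 on made-up 1-D scores, purely for intuition
data = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centers = [min(data), max(data)]  # naive initialization at the extremes

for _ in range(10):  # a few passes are plenty for this tiny example
    # Assign each point to its nearest center
    clusters = [[], []]
    for point in data:
        nearest = min(range(2), key=lambda i: abs(point - centers[i]))
        clusters[nearest].append(point)
    # Move each center to the mean of its cluster
    centers = [sum(c) / len(c) for c in clusters]

print([round(c, 2) for c in sorted(centers)])  # two well-separated group centers
```

The algorithm alternates between assigning points to the nearest center and re-averaging the centers until the groups settle, which is why it can surface structure a single correlation coefficient misses.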
As I told my client with the 0.56 correlation, I think the best research is more about relationships and drivers than tallies. If you have an intriguing but weak relationship, mentioning it as something to be explored and/or watched in the future is not only valid, it adds depth.
A few suggestions:
- If the weak/dubious results would distract from your more substantive findings or potentially cause confusion, I’d leave them out of the main report. You could still publish them as “further reading” for those who are curious.
- Remember that some of the report readers won’t have been in the meetings where you hashed out the dubious reliability of certain charts or explained all the stats (or their bodies were present but attached to a Blackberry, or it’s been 9 months). Be sure you have captions or commentary explaining why certain figures/stats are less reliable than the majority of the results. In many cases, that forgetful reader will be yourself, and your overworked gray matter will be thankful for the extra notes!
- Make it as obvious as you can to someone skimming the document what’s in this “curiosity” category, such as with sidebars, boxed out figures, or a separate section at the end of the report.
Corporate Performance Resources