What is a red flag for bad statistical data?
There are a few ways to think about the question “What is a red flag for bad statistical data?” One way to restate it is:
“How do I locate bad data points in a dataset that is generally reliable?” In other words, we want to know which “red flags” help us spot incorrect data. Here are a few steps that I use (note that they aren’t necessarily definitive because life (and data) is complicated):
- Look for data that is out of range, or extremely unlikely, in a real-world sense. Suppose you are doing an analysis on university undergraduate students. You would expect most students to be in the standard age range of roughly 17 to 30. You may find people who are outside that range (e.g. a 16-year-old or a 60-year-old), but you wouldn’t expect very many. These cases might have to be checked. If you find a student who is 3 years old or 103 years old, you know something is wrong. It could be a data entry error, or it could be that your data (say from a big administrative database) is defaulting to some odd value, which has created an impossible age. So, you want to find and correct those, or at least flag them in some way.
- Look for data that is logically impossible. For example, you might find cases that are listed as male, yet are included in a study on survival rates for cervical cancer. Now, this may be an interpretation issue (e.g. the variable is not clear about whether it refers to “gender identity” or “sex assigned at birth”), but it might just be incorrect data. So, those should be looked into, if possible. That said, you have to be careful about making assumptions: for example, men can get breast cancer.
- Look for data that is contradictory. In a survey that allows multiple responses, you might have someone claim that they are both “atheist” and “Christian”. That might be a reasonable response (e.g. they were raised as a Christian but now consider themselves to be an atheist), but it might indicate a problem in your dataset if it shows up a lot. Similarly, some people might tick every religion on a long list (or every race or nationality). Are they spoofing the survey, or is that how they really feel? (i.e. is it bad data or valid data?)
- Look for outliers that seem unreasonable. For example, if a histogram shows a single point sitting many standard deviations away from the rest of the data, it might well be bad data.
- Once you have done your analysis (let’s say a regression analysis), you should check for points that have high influence on the results. There are diagnostics for this (e.g. DFFITS or DFBETAS). If you find a point that is hugely influential, it might indicate an error in the data, or it could be valid but be throwing off your analysis, as a very atypical real-world case. You might exclude that point (it depends on the purpose of the analysis and other things; it’s a judgement call).
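- You might also want to look for fraudulent data. Benford’s law is a good example of a way to look for that: in many naturally occurring sets of numbers the leading digit is 1 about 30% of the time, so figures whose leading digits depart sharply from that pattern may deserve a closer look.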
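The range and outlier checks in the list above can be sketched in a few lines of Python. Treat this as a rough illustration rather than a definitive method: the function names, the age bounds (10 and 100) and the sample ages are all made up for the example. I have also swapped in a median-based "modified z-score" (with the commonly quoted cutoff of 3.5) in place of plain standard deviations, because in a small sample one wild value inflates the standard deviation and can hide itself.

```python
from statistics import median


def flag_out_of_range(values, low, high):
    """Return the positions of values outside the plausible range [low, high]."""
    return [i for i, v in enumerate(values) if not (low <= v <= high)]


def flag_robust_outliers(values, threshold=3.5):
    """Return the positions of values whose modified z-score exceeds threshold.

    The modified z-score is 0.6745 * (value - median) / MAD, where MAD is the
    median absolute deviation from the median.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the values equal the median, so the score is undefined.
        return []
    return [i for i, v in enumerate(values)
            if abs(0.6745 * (v - med) / mad) > threshold]


# Hypothetical undergraduate ages, including the cases from the example above.
ages = [18, 19, 19, 20, 21, 22, 22, 24, 3, 103, 60]

# Assumed bounds: nobody under 10 or over 100 is a plausible undergraduate.
impossible = flag_out_of_range(ages, low=10, high=100)

# Unusual enough to deserve a manual check, even if not impossible.
unusual = flag_robust_outliers(ages)

print("impossible:", [ages[i] for i in impossible])
print("check by hand:", [ages[i] for i in unusual])
```

The two lists are kept separate on purpose: an impossible value is almost certainly an error, while an unusual one (like the 60-year-old) may be perfectly valid and only needs a second look.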
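For the influence diagnostics mentioned in the list above, here is a minimal sketch of DFFITS for a regression with a single predictor, written out by hand so the pieces are visible. The data are invented, the function name is hypothetical, and the cutoff of 2 * sqrt(p / n) is only a common rule of thumb; in real work you would normally take these numbers from your statistics package rather than compute them yourself.

```python
from math import sqrt


def dffits_simple(x, y):
    """DFFITS for each point of a least-squares line y = a + b * x."""
    n = len(x)
    p = 2  # parameters estimated: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in resid)

    scores = []
    for xi, e in zip(x, resid):
        h = 1 / n + (xi - x_bar) ** 2 / sxx               # leverage of this point
        s2_del = (sse - e ** 2 / (1 - h)) / (n - p - 1)   # error variance with the point left out
        t = e / sqrt(s2_del * (1 - h))                    # externally studentized residual
        scores.append(t * sqrt(h / (1 - h)))
    return scores


# Made-up data: roughly y = 2x, except for the last point.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 35.0]

scores = dffits_simple(x, y)
cutoff = 2 * sqrt(2 / len(x))  # rule-of-thumb threshold, not a hard limit
flagged = [i for i, d in enumerate(scores) if abs(d) > cutoff]
print("points to inspect:", flagged)
```

A flagged point is only a prompt to go and look at the record. Whether it is an error or a genuine but atypical case is still the judgement call described above.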
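As a rough sketch of the Benford's law check, the snippet below compares the observed share of each leading digit with the share Benford's law predicts, log10(1 + 1/d). The function names are hypothetical, and the test data (powers of 2, which are known to follow the law closely) are just a stand-in for real figures. A mismatch is a reason to look closer, not proof of fraud, and plenty of legitimate data (ages, heights, anything in a narrow range) does not follow Benford's law at all.

```python
from collections import Counter
from math import log10


def benford_expected(d):
    """Share of values expected to start with digit d under Benford's law."""
    return log10(1 + 1 / d)


def leading_digit(value):
    """First significant digit of a nonzero number."""
    # Scientific notation always puts the first significant digit first,
    # e.g. 0.042 -> '4.200000000000000e-02'.
    return int(f"{abs(value):.15e}"[0])


def first_digit_table(values):
    """Map each digit 1-9 to (observed share, expected share)."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    n = len(digits)
    return {d: (counts[d] / n, benford_expected(d)) for d in range(1, 10)}


# Stand-in data: the first 100 powers of 2.
values = [2 ** k for k in range(1, 101)]
table = first_digit_table(values)

for d, (observed, expected) in table.items():
    print(f"{d}: observed {observed:.3f}, expected {expected:.3f}")
```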
This list isn’t exhaustive. Your problem might be greater than just some bad points in an otherwise good dataset. The dataset itself might be questionable. For example:
- You should check with subject matter experts, to see how they feel about the validity of your data (e.g. they might inform you that the source of your data is politically biased or untrustworthy for some other reason in their opinion).
- You should check data processing steps (maybe with an IT expert) to ensure that a programming bug (e.g. a bad SQL statement, like a join that misfired) hasn’t created bad/weird data.
- Then, of course, there is face validity. If your analysis comes up with a truly counter-intuitive result, you might have discovered something really outstanding. Or, you might have discovered a bad dataset. You would want to double-check and triple-check your data and your analysis in these cases.
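To make the "join that misfired" example from the list above concrete, here is a small sketch of one common way it happens: the key you join on is not unique in one table, so rows get silently duplicated. The tables, column names and helper functions are all hypothetical, and the join is hand-rolled only to keep the example self-contained. The two checks are the point: look for duplicate keys before joining, and compare row counts afterwards.

```python
from collections import Counter, defaultdict


def duplicate_keys(rows, key):
    """Return {key value: count} for key values that appear more than once."""
    counts = Counter(row[key] for row in rows)
    return {k: c for k, c in counts.items() if c > 1}


def left_join(left, right, key):
    """Plain left join: every left row, paired with each matching right row."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    joined = []
    for l in left:
        matches = index.get(l[key]) or [{}]  # keep the left row even with no match
        for m in matches:
            joined.append({**l, **m})
    return joined


# Hypothetical tables: student 2 has two address records.
students = [
    {"student_id": 1, "age": 19},
    {"student_id": 2, "age": 21},
    {"student_id": 3, "age": 20},
]
addresses = [
    {"student_id": 1, "city": "A"},
    {"student_id": 2, "city": "B"},
    {"student_id": 2, "city": "C"},
]

dupes = duplicate_keys(addresses, "student_id")
joined = left_join(students, addresses, "student_id")

print("duplicate keys in addresses:", dupes)
print("students:", len(students), "rows after join:", len(joined))
```

If the row count after the join is larger than the number of students, every average or total computed from the joined table counts some students more than once.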