Outside the Box-ers: Outliers in the Real World

Real people don’t use the term “outliers.” Instead they say things like:

“This number is too big.  It can’t be right.”

“If they only earned $300 last year, I don’t care about it.”

“Why are there so many zeros?”

“Since everybody else answered the survey with an 8 or a 9, how come this guy put in all 1’s and 2’s?”

But the statistician might shrug, or suggest a test for outliers. An outlier is defined as ‘having different underlying behavior than the rest of the data’. This definition is not much use in practice because, unless you are doing simulations, you don’t know the underlying behavior, i.e. the distribution, of any one data point. (I hear the Bayesians. Assuming a prior is still an assumption!) We statisticians don’t like to declare outliers too quickly. The ‘too big’ number could be a data entry error, a scale problem, or just a really big number. The low income could be real, an error, or a confusion of household and individual income. And sometimes you get zeros (often standing for ‘No’) or answers to surveys which seem questionable. Statisticians care about outliers from the point of view of how they impact the analysis.

Knowing the structure of the data and identifying problems early is important. Finance people like to say the numbers tell the story, but the truth is that there is a story behind the numbers. Survey research often has a code book that defines the range of answers and the formulas for derived variables. For corporate data from data warehouses, there are often in-house experts who provide documentation, range checks and other guidance. For commercial data, the vendor can provide details such as number of purchases, price and type of item. For example, FICO credit ratings are between 300 and 850. A high credit rating might be an outlier in a group of loan applications with scores under 600, but that does not make it an error. Sometimes we have knowledge about a process that can explain an apparent outlier, for example a worn machine part or a broken sensor.
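A range check of the kind a code book or in-house expert might supply can be sketched in a few lines. Only the 300 to 850 FICO range comes from the text; the function name and the sample scores are made up for illustration.

```python
def out_of_range(values, low, high):
    """Return the values that fall outside the documented range [low, high]."""
    return [v for v in values if not (low <= v <= high)]

# Hypothetical credit scores; 8500 looks like a data entry or scale problem.
scores = [580, 612, 8500, 790, 0, 845]
suspect = out_of_range(scores, 300, 850)
print(suspect)  # [8500, 0]
```

Note that 790 passes the check even though it sits well above the other scores. A range check finds broken numbers, not unusual ones.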

Steps in determining outliers  

Before thinking about what is on the edges, it is important to think of what is in the middle. A good start in thinking about the middle is the median. Medians are common in the news for economic statistics (household income, price of homes, miles traveled on a holiday weekend) because medians are not influenced by the tails of the data the way the classic average is. The median is the middle-ranked value when the number of values is odd and the average of the two middle values when it is even. So for 9 numbers the median is the 5th ranked number. For 10 numbers it is the average of the 5th and 6th: (5th + 6th)/2. Then there are the extremes, the minimum and the maximum. The median of the values from the minimum up to the median is the 1st Fourth. The median of the values from the median up to the maximum is the 3rd Fourth. The difference between the 3rd Fourth and the 1st Fourth is the interquartile range (IQR).
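Here is a minimal sketch of those five numbers in Python. It assumes one particular reading of the Fourths: each is the median of one half of the sorted data, with the middle value counted in both halves when the count is odd. The function name and the sample values are mine, not part of any package.

```python
from statistics import median

def five_number_summary(values):
    """Minimum, 1st Fourth, median, 3rd Fourth, maximum."""
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2   # size of each half; the middle value is shared when n is odd
    lower, upper = data[:half], data[n - half:]
    return data[0], median(lower), median(data), median(upper), data[-1]

sample = [2, 4, 4, 5, 7, 9, 10, 12, 15]   # 9 made-up values
low, q1, med, q3, high = five_number_summary(sample)
print(med)       # 7, the 5th ranked value
print(q3 - q1)   # interquartile range: 10 - 4 = 6
```

Other software may compute the quartiles slightly differently, so small disagreements with this sketch are expected.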

 

[Figure: Interquartile Range (https://www.movedbymetrics.com/wp-content/uploads/2013/11/Interquartile-Range.jpg)]

 

Just from these 5 numbers, plus the range (maximum - minimum) and the interquartile range, we can gain a lot of understanding of the data. John Tukey, the great statistician, came up with the box and whisker plot, now usually called the boxplot, as a convenient way to express this idea visually and display the interquartile range. For all boxplots, the outer sides of the box are the 1st and 3rd Fourths, with a line in the middle at the median. Often the average, i.e. the mean, is also marked within the box, as shown below.

[Figure: Boxplot (https://www.movedbymetrics.com/wp-content/uploads/2013/11/Boxplot.jpg)]

How does this information help you to determine true outliers?

Tukey gave the following rule for outliers: multiply the interquartile range by 1.5. Any value more than this amount above the 3rd Fourth or below the 1st Fourth is an outlier. For example, if the 3rd Fourth is 650 and the interquartile range is 50, then any value over 725 (650 + 75) can be treated as an outlier. In the boxplot above, the blue marks beyond the end of the whisker show outliers.
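The rule is simple enough to write down as a small function. This is only a sketch; the name tukey_fences and the parameter k are my own labels, and the 1st Fourth of 600 is implied by the example (650 minus an IQR of 50) rather than stated.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) cutoffs: q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# 3rd Fourth of 650 and IQR of 50, so the 1st Fourth is 600.
lower, upper = tukey_fences(600, 650)
print(lower, upper)  # 525.0 725.0
```

Anything below the lower cutoff or above the upper cutoff would be flagged.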

The confusing part about boxplots is the whiskers. The whiskers extend out from the box to include all data that is not an outlier. So if there are no outliers, the whiskers end at the minimum and maximum. In the diagram above, the minimum is the end of the lower whisker, and the maximum is the 2nd outlier. The problem is that several varieties of boxplot have emerged. Some variations don’t use the outlier rule and always have the whiskers go to the extremes. Others base the length of the whisker on a spread measurement such as the standard deviation or 1.5 IQR. It is important to know which defaults your software uses. And of course, boxplots can be drawn horizontally or vertically.
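To make the whisker rule concrete, here is a rough sketch of where the whiskers end under the version described first: at the most extreme values that are not outliers. It assumes the quartiles come from Python's statistics.quantiles with the "inclusive" method, which is just one of several conventions, and the data is made up.

```python
from statistics import quantiles

def whisker_ends(values, k=1.5):
    """Return (lower whisker end, upper whisker end, sorted outliers)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    reach = k * (q3 - q1)
    lo_fence, hi_fence = q1 - reach, q3 + reach
    inside = [v for v in values if lo_fence <= v <= hi_fence]
    outliers = sorted(v for v in values if v < lo_fence or v > hi_fence)
    return min(inside), max(inside), outliers

data = [1, 3, 4, 5, 6, 7, 8, 9, 12, 25, 30]   # made-up values
print(whisker_ends(data))  # (1, 12, [25, 30])
```

Here the lower whisker ends at the minimum, while the upper whisker stops at 12 and the two largest values are drawn as separate points.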

Example

Here is an example using the loan data we used in the first blog post (link to post here). For the length of residence data, the boxplot for 245 applicants is as follows:

 

[Figure: Boxplot of Length of Residence (https://www.movedbymetrics.com/wp-content/uploads/2013/11/Boxplot-of-Length-of-Residence.jpg)]

The summary statistics for this variable are:

Min.   1st Qu.   Median   Mean    3rd Qu.   Max.
0.0    4.0       9.0      9.841   14.0      35.0

The numbers were given as integers, so we will assume that 0 means less than 1 year in a home. With the 1.5 IQR rule we identify 3 points as outliers: 30, 33, and 35. With a 3rd quartile of 14 and an IQR of 10, the cutoff is 14 + 15 = 29, so we don’t worry about any numbers 29 or less. While someone in their home for 30 years might not yet be retired, we can safely say that they are somewhat older and more established in the community than the center of the applicant pool.
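The arithmetic for this variable can be checked directly from the summary statistics. This sketch uses only the quartiles reported above and the three flagged values; I don't reproduce the full data set here.

```python
q1, q3 = 4.0, 14.0         # 1st and 3rd quartiles from the summary
iqr = q3 - q1              # 10.0
upper = q3 + 1.5 * iqr     # 29.0
lower = q1 - 1.5 * iqr     # -11.0, below the minimum of 0
flagged = [x for x in (30, 33, 35) if x > upper]
print(lower, upper, flagged)  # -11.0 29.0 [30, 33, 35]
```

Because the lower cutoff is negative and the minimum is 0, no short lengths of residence can be flagged.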

This approach is somewhat more problematic for home equity. Default scientific notation aside, we have 15 points, 6% of the data, that are declared outliers. Their equity exceeds $286,000 and goes up to $870,200. Also, the boxplot doesn’t show the 30 applicants with 0, or perhaps negative, equity, since 0 is the minimum. Someone with $40,000 in home equity is in a different position than someone with 0, but this approach lumps them all into the lowest quarter of the data.

[Figure: Boxplot of Equity in Home of Loan Applicants (https://www.movedbymetrics.com/wp-content/uploads/2013/11/Boxplot-of-Equity-in-Home-of-Loan-Applicants.jpg)]

Boxplots are a great way to get a quick snapshot of the spread of the data and where potential outliers fall. They work well for continuous data with many different values, and they are easy to compare to each other. Once a person gets used to them, they become an invaluable tool. However, I find they require frequent use or people forget how they work. Also, the box itself gives no additional information beyond guiding the eye. There have been attempts to add some shape to the box to give a sense of the distribution within the middle.

The 1.5 IQR rule is a good place to start for outlier detection. However, it is best used with data that is nearly symmetric. By using only the middle half of the data, we ignore the information available from the other half. If the middle of the data is very compact, the IQR is small and many of the larger points will be declared outliers. Sometimes it is good to focus on the extreme outliers, defined as points more than 3 IQR beyond the box. There are newer methods that are better for finding outliers in noisy or skewed data.
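As a rough illustration of the two cutoffs, the sketch below labels a value by how far it sits outside the box. The labels 'mild' and 'extreme' and the function name are my own choices; the quartiles of 4 and 14 are the ones from the length of residence example, and the test values 20 and 50 are made up.

```python
def classify(value, q1, q3):
    """Label a value 'extreme' (beyond 3 IQR), 'mild' (beyond 1.5 IQR) or 'inside'."""
    iqr = q3 - q1
    distance = max(q1 - value, value - q3, 0)   # how far outside the box
    if distance > 3 * iqr:
        return "extreme"
    if distance > 1.5 * iqr:
        return "mild"
    return "inside"

# Quartiles of 4 and 14 (IQR 10): high-side cutoffs at 29 and 44.
labels = [classify(v, 4, 14) for v in (20, 35, 50)]
print(labels)  # ['inside', 'mild', 'extreme']
```

By this measure the maximum of 35 years in the example is an outlier, but not an extreme one.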

With data, it is probably best to screen for a ‘broken number’ first, before declaring it an outlier. Just because a data point seems unappealing or suspect, that doesn’t make it an outlier. Think about whether there was an error in measurement or data entry, or whether the point comes from a different place than the others. Once the data is confirmed as valid, an outlier analysis is valuable.

 


[i] While Tukey used the expression ‘fourths’, I hear ‘quartile’ more often. Are these the same as the 25th and 75th percentiles? There are at least 9 methods for calculating percentiles, where a percentile gives the percentage of data below the value of interest. For example, 90 percent of children are shorter than the child at the 90th percentile for height at a given age. The 25th and 75th percentiles are usually about the same as the 1st and 3rd Fourths.
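To see how much the method can matter, here is a small comparison using two of the conventions built into Python's statistics module. The data is made up, and these two methods are only a sample of the many in use.

```python
from statistics import quantiles

data = [1, 3, 4, 5, 6, 7, 8, 9, 12, 25]   # 10 made-up values
inclusive = quantiles(data, n=4, method="inclusive")
exclusive = quantiles(data, n=4, method="exclusive")
print(inclusive)  # [4.25, 6.5, 8.75]
print(exclusive)  # [3.75, 6.5, 9.75]
```

Taking the median of each half of the same data gives 4 and 9 instead. All three agree on the median and roughly agree on the quartiles, which is usually close enough for spotting outliers.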
