
How to Remove Outliers for Machine Learning

Last Updated on August 18, 2020

When modeling, it is important to clean the data sample to ensure that the observations best represent the problem.

Sometimes a dataset can contain extreme values that are outside the range of what is expected and unlike the other data. These are called outliers, and often machine learning modeling and model skill in general can be improved by understanding and even removing these outlier values.

In this tutorial, you will discover outliers and how to identify and remove them from your machine learning dataset.

After completing this tutorial, you will know:

  • That an outlier is an unlikely observation in a dataset and may have one of many causes.
  • How to use simple univariate statistics like standard deviation and interquartile range to identify and remove outliers from a data sample.
  • How to use an outlier detection model to identify and remove rows from a training dataset in order to lift predictive modeling performance.

Kick-start your project with my new book Data Preparation for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.

Let's get started.

  • Update May/2018: Fixed bug when filtering samples via outlier limits.
  • Update May/2020: Updated to demonstrate on a real dataset.

How to Use Statistics to Identify Outliers in Data
Photo by Jeff Richardson, some rights reserved.
Tutorial Overview

This tutorial is divided into 5 parts; they are:

  1. What are Outliers?
  2. Test Dataset
  3. Standard Deviation Method
  4. Interquartile Range Method
  5. Automatic Outlier Detection

What are Outliers?

An outlier is an observation that is unlike the other observations.

It is rare, or distinct, or does not fit in some way.

We will generally define outliers as samples that are exceptionally far from the mainstream of the data.

— Page 33, Applied Predictive Modeling, 2013.

Outliers can have many causes, such as:

  • Measurement or input error.
  • Data corruption.
  • True outlier observation (e.g. Michael Jordan in basketball).

There is no precise way to define and identify outliers in general because of the specifics of each dataset. Instead, you, or a domain expert, must interpret the raw observations and decide whether a value is an outlier or not.

Even with a thorough understanding of the data, outliers can be hard to define. […] Great care should be taken not to hastily remove or change values, especially if the sample size is small.

— Page 33, Applied Predictive Modeling, 2013.

Nevertheless, we can use statistical methods to identify observations that appear to be rare or unlikely given the available data.

Identifying outliers and bad data in your dataset is probably one of the most difficult parts of data cleanup, and it takes time to get right. Even if you have a deep understanding of statistics and how outliers might affect your data, it's always a topic to explore cautiously.

— Page 167, Data Wrangling with Python, 2016.

This does not mean that the values identified are outliers and should be removed. But the tools described in this tutorial can be helpful in shedding light on rare events that may require a second look.

A good tip is to consider plotting the identified outlier values, perhaps in the context of non-outlier values, to see if there are any systematic relationships or patterns to the outliers. If there are, perhaps they are not outliers and can be explained, or perhaps the outliers themselves can be identified more systematically.

Want to Get Started With Data Preparation?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Test Dataset

Before we look at outlier identification methods, let's define a dataset we can use to test the methods.

We will generate a population of 10,000 random numbers drawn from a Gaussian distribution with a mean of 50 and a standard deviation of 5.

Numbers drawn from a Gaussian distribution will have outliers. That is, by virtue of the distribution itself, there will be a few values that will be a long way from the mean, rare values that we can identify as outliers.

We will use the randn() function to generate random Gaussian values with a mean of 0 and a standard deviation of 1, then multiply the results by our own standard deviation and add the mean to shift the values into the preferred range.

The pseudorandom number generator is seeded to ensure that we get the same sample of numbers each time the code is run.
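The code listing did not survive in this copy; the generation step can be reconstructed from the description above as follows (variable names are my own):

```python
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from numpy import std

# seed the pseudorandom number generator for reproducibility
seed(1)
# generate 10,000 Gaussian values: scale by the desired
# standard deviation (5) and shift by the desired mean (50)
data = 5 * randn(10000) + 50
# summarize the sample
print('mean=%.3f stdv=%.3f' % (mean(data), std(data)))
```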

Running the example generates the sample and then prints the mean and standard deviation. As expected, the values are very close to the expected values.

Standard Deviation Method

If we know that the distribution of values in the sample is Gaussian or Gaussian-like, we can use the standard deviation of the sample as a cut-off for identifying outliers.

The Gaussian distribution has the property that the standard deviation from the mean can be used to reliably summarize the percentage of values in the sample.

For example, within one standard deviation of the mean covers 68% of the data.

So, if the mean is 50 and the standard deviation is 5, as in the test dataset above, then all data in the sample between 45 and 55 will account for about 68% of the data sample. We can cover more of the data sample if we expand the range as follows:

  • 1 Standard Deviation from the Mean: 68%
  • 2 Standard Deviations from the Mean: 95%
  • 3 Standard Deviations from the Mean: 99.7%

A value that falls outside of 3 standard deviations is part of the distribution, but it is an unlikely or rare event at approximately 1 in 370 samples.

Three standard deviations from the mean is a common cut-off in practice for identifying outliers in a Gaussian or Gaussian-like distribution. For smaller samples of data, perhaps a value of 2 standard deviations (95%) can be used, and for larger samples, perhaps a value of 4 standard deviations (99.9%) can be used.

Given mu and sigma, a simple way to identify outliers is to compute a z-score for every x_i, which is defined as the number of standard deviations away x_i is from the mean […] Data values that have a z-score sigma greater than a threshold, for example, of three, are declared to be outliers.

— Page 19, Data Cleaning, 2019.

Let's make this concrete with a worked example.

Sometimes, the data is standardized first (e.g. to a Z-score with zero mean and unit variance) so that the outlier detection can be performed using standard Z-score cut-off values. This is a convenience and is not required in general, and we will perform the calculations in the original scale of the data here to make things clear.

We can calculate the mean and standard deviation of a given sample, then calculate the cut-off for identifying outliers as more than 3 standard deviations from the mean.
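A sketch of this cut-off calculation, reconstructed from the description above (the variable names are my own):

```python
from numpy.random import seed, randn
from numpy import mean, std

# recreate the Gaussian sample from the previous section
seed(1)
data = 5 * randn(10000) + 50
# calculate summary statistics
data_mean, data_std = mean(data), std(data)
# define the cut-off as 3 standard deviations from the mean
cut_off = data_std * 3
lower, upper = data_mean - cut_off, data_mean + cut_off
print('lower=%.3f upper=%.3f' % (lower, upper))
```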

We can then identify outliers as those examples that fall outside of the defined lower and upper limits.

Alternately, we can filter out those values from the sample that are not within the defined limits.

We can put this all together with our sample dataset prepared in the previous section.

The complete example is listed below.
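The listing is missing from this copy; a reconstruction based on the steps described above (exact counts depend on the random sample):

```python
from numpy.random import seed, randn
from numpy import mean, std

# generate the Gaussian test dataset
seed(1)
data = 5 * randn(10000) + 50
# calculate summary statistics
data_mean, data_std = mean(data), std(data)
# define outliers as more than 3 standard deviations from the mean
cut_off = data_std * 3
lower, upper = data_mean - cut_off, data_mean + cut_off
# identify outliers
outliers = [x for x in data if x < lower or x > upper]
print('Identified outliers: %d' % len(outliers))
# alternately, filter out values outside the limits
outliers_removed = [x for x in data if lower <= x <= upper]
print('Non-outlier observations: %d' % len(outliers_removed))
```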

Running the example will first print the number of identified outliers and then the number of observations that are not outliers, demonstrating how to identify and filter out outliers respectively.

So far we have only talked about univariate data with a Gaussian distribution, e.g. a single variable. You can use the same approach if you have multivariate data, e.g. data with multiple variables, each with a different Gaussian distribution.

You can imagine bounds in two dimensions that would define an ellipse if you have two variables. Observations that fall outside of the ellipse would be considered outliers. In three dimensions, this would be an ellipsoid, and so on into higher dimensions.

Alternately, if you knew more about the domain, perhaps an outlier may be identified by exceeding the limits on one or a subset of the data dimensions.

Interquartile Range Method

Not all data is normal or normal enough to treat it as being drawn from a Gaussian distribution.

A good statistic for summarizing a non-Gaussian distribution sample of data is the Interquartile Range, or IQR for short.

The IQR is calculated as the difference between the 75th and the 25th percentiles of the data and defines the box in a box and whisker plot.

Recall that percentiles can be calculated by sorting the observations and selecting values at specific indices. The 50th percentile is the middle value, or the average of the two middle values for an even number of examples. If we had 10,000 samples, then the 50th percentile would be the average of the 5,000th and 5,001st values.

We refer to the percentiles as quartiles ("quart" meaning 4) because the data is divided into four groups via the 25th, 50th and 75th values.

The IQR defines the middle 50% of the data, or the body of the data.

Statistics-based outlier detection techniques assume that the normal data points would appear in high probability regions of a stochastic model, while outliers would occur in the low probability regions of a stochastic model.

— Page 12, Data Cleaning, 2019.

The IQR can be used to identify outliers by defining limits on the sample values that are a factor k of the IQR below the 25th percentile or above the 75th percentile. The common value for the factor k is 1.5. A factor k of 3 or more can be used to identify values that are extreme outliers or "far outs" when described in the context of box and whisker plots.

On a box and whisker plot, these limits are drawn as fences on the whiskers (or the lines) that are drawn from the box. Values that fall outside of these limits are drawn as dots.

We can calculate the percentiles of a dataset using the percentile() NumPy function that takes the dataset and the specification of the desired percentile. The IQR can then be calculated as the difference between the 75th and 25th percentiles.

We can then calculate the cut-off for outliers as 1.5 times the IQR, subtract this cut-off from the 25th percentile, and add it to the 75th percentile to give the actual limits on the data.

We can then use these limits to identify the outlier values.

We can also use the limits to filter out the outliers from the dataset.

We can tie all of this together and demonstrate the procedure on the test dataset.

The complete example is listed below.
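The listing is not preserved here; a reconstruction that ties the steps above together (exact counts depend on the random sample):

```python
from numpy.random import seed, randn
from numpy import percentile

# generate the Gaussian test dataset
seed(1)
data = 5 * randn(10000) + 50
# calculate the interquartile range
q25, q75 = percentile(data, 25), percentile(data, 75)
iqr = q75 - q25
print('Percentiles: 25th=%.3f, 75th=%.3f, IQR=%.3f' % (q25, q75, iqr))
# calculate the outlier cut-off
cut_off = iqr * 1.5
lower, upper = q25 - cut_off, q75 + cut_off
# identify outliers
outliers = [x for x in data if x < lower or x > upper]
print('Identified outliers: %d' % len(outliers))
# remove outliers
outliers_removed = [x for x in data if lower <= x <= upper]
print('Non-outlier observations: %d' % len(outliers_removed))
```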

Running the example first prints the identified 25th and 75th percentiles and the calculated IQR. The number of outliers identified is printed, followed by the number of non-outlier observations.

The approach can be used for multivariate data by calculating the limits on each variable in the dataset in turn, and taking outliers as observations that fall outside of the rectangle or hyper-rectangle.

Automatic Outlier Detection

In machine learning, an approach to tackling the problem of outlier detection is one-class classification.

One-Class Classification, or OCC for short, involves fitting a model on the "normal" data and predicting whether new data is normal or an outlier/anomaly.

A one-class classifier aims at capturing characteristics of training instances, in order to be able to distinguish between them and potential outliers to appear.

— Page 139, Learning from Imbalanced Data Sets, 2018.

A one-class classifier is fit on a training dataset that only has examples from the normal class. Once prepared, the model is used to classify new examples as either normal or not-normal, i.e. outliers or anomalies.

A simple approach to identifying outliers is to locate those examples that are far from the other examples in the feature space.

This can work well for feature spaces with low dimensionality (few features), although it can become less reliable as the number of features is increased, referred to as the curse of dimensionality.

The local outlier factor, or LOF for short, is a technique that attempts to harness the idea of nearest neighbors for outlier detection. Each example is assigned a score of how isolated it is, or how likely it is to be an outlier, based on the size of its local neighborhood. Those examples with the largest score are more likely to be outliers.

We introduce a local outlier (LOF) for each object in the dataset, indicating its degree of outlier-ness.

— LOF: Identifying Density-based Local Outliers, 2000.

The scikit-learn library provides an implementation of this approach in the LocalOutlierFactor class.

We can demonstrate the LocalOutlierFactor method on a predictive modeling dataset.

We will use the Boston housing regression problem that has 13 inputs and one numerical target and requires learning the relationship between suburb characteristics and house prices.

The dataset can be downloaded from here:

  • Boston Housing Dataset (housing.csv)
  • Boston Housing Dataset Details (housing.names)

Looking in the dataset, you should see that all variables are numeric.

No need to download the dataset; we will download it automatically.

First, we can load the dataset as a NumPy array, split it into input and output variables, and then split it into train and test datasets.

The complete example is listed below.
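The listing is missing from this copy; a reconstruction under the assumption that the dataset is fetched from the raw copy in the author's Datasets GitHub repository (the URL is my assumption, not stated in the text):

```python
from pandas import read_csv
from sklearn.model_selection import train_test_split

# location of the dataset (assumed URL for the author's hosted copy)
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
# load the dataset into a NumPy array
df = read_csv(url, header=None)
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```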

Running the example loads the dataset and first reports the total number of rows and columns in the dataset, then the number of examples allocated to the train and test datasets.

It is a regression predictive modeling problem, meaning that we will be predicting a numeric value. All input variables are also numeric.

In this case, we will fit a linear regression algorithm and evaluate model performance by training the model on the training dataset, making a prediction on the test data, and evaluating the predictions using the mean absolute error (MAE).

The complete example of evaluating a linear regression model on the dataset is listed below.
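A reconstruction of the missing listing, again assuming the dataset URL above:

```python
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# load the dataset (assumed URL for the author's hosted copy)
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
data = read_csv(url, header=None).values
X, y = data[:, :-1], data[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# fit the model on the training dataset
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate predictions on the held-out test set
yhat = model.predict(X_test)
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)
```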

Running the example fits and evaluates the model, then reports the MAE.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model achieved a MAE of about 3.417.

Next, we can try removing outliers from the training dataset.

The expectation is that the outliers are causing the linear regression model to learn a biased or skewed understanding of the problem, and that removing these outliers from the training set will allow a more effective model to be learned.

We can achieve this by defining the LocalOutlierFactor model and using it to make a prediction on the training dataset, marking each row in the training dataset as normal (1) or an outlier (-1). We will use the default hyperparameters for the outlier detection model, although it is a good idea to tune the configuration to the specifics of your dataset.

We can then use these predictions to remove all outliers from the training dataset.

We can then fit and evaluate the model as per normal.

The updated example of evaluating a linear regression model with outliers deleted from the training dataset is listed below.
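The updated listing is missing from this copy; a reconstruction of the approach described above (dataset URL assumed, as before; exact row counts and MAE may vary with your scikit-learn version):

```python
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# load the dataset (assumed URL for the author's hosted copy)
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
data = read_csv(url, header=None).values
X, y = data[:, :-1], data[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, y_train.shape)
# identify outliers in the training dataset with default hyperparameters
lof = LocalOutlierFactor()
yhat = lof.fit_predict(X_train)
# select all rows that are not outliers (outliers are marked -1)
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]
print(X_train.shape, y_train.shape)
# fit and evaluate the model as before
model = LinearRegression()
model.fit(X_train, y_train)
yhat = model.predict(X_test)
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)
```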

Running the example fits and evaluates the linear regression model with outliers deleted from the training dataset.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Firstly, we can see that the number of examples in the training dataset has been reduced from 339 to 305, meaning 34 rows containing outliers were identified and deleted.

We can also see a reduction in MAE from about 3.417 with a model fit on the entire training dataset, to about 3.356 with a model fit on the dataset with outliers removed.

The scikit-learn library provides other outlier detection algorithms that can be used in the same manner, such as the IsolationForest algorithm. For more examples of automatic outlier detection, see the tutorial:

  • 4 Automatic Outlier Detection Algorithms in Python

Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

  • Develop your own Gaussian test dataset and plot the outliers and non-outlier values on a histogram.
  • Test out the IQR-based method on a univariate dataset generated with a non-Gaussian distribution.
  • Choose one method and create a function that will filter out outliers for a given dataset with an arbitrary number of dimensions.

If you explore any of these extensions, I'd love to know.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Tutorials

  • How to Identify Outliers in your Data
  • 4 Automatic Outlier Detection Algorithms in Python

Books

  • Applied Predictive Modeling, 2013.
  • Information Cleaning, 2019.
  • Data Wrangling with Python, 2016.

API

  • seed() NumPy API
  • randn() NumPy API
  • mean() NumPy API
  • std() NumPy API
  • percentile() NumPy API

Articles

  • Outlier on Wikipedia
  • Anomaly detection on Wikipedia
  • 68–95–99.7 rule on Wikipedia
  • Interquartile range on Wikipedia
  • Box plot on Wikipedia

Summary

In this tutorial, you discovered outliers and how to identify and remove them from your machine learning dataset.

Specifically, you learned:

  • That an outlier is an unlikely observation in a dataset and may have one of many causes.
  • How to use simple univariate statistics like standard deviation and interquartile range to identify and remove outliers from a data sample.
  • How to use an outlier detection model to identify and remove rows from a training dataset in order to lift predictive modeling performance.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Get a Handle on Modern Data Preparation!

Data Preparation for Machine Learning

Prepare Your Machine Learning Data in Minutes

...with just a few lines of python code

Discover how in my new Ebook:
Data Preparation for Machine Learning

It provides self-study tutorials with full working code on:
Feature Selection, RFE, Data Cleaning, Data Transforms, Scaling, Dimensionality Reduction, and much more...

Bring Modern Data Preparation Techniques to
Your Machine Learning Projects

See What's Inside

Source: https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/
