Thisarticlewas originally published onBuilt Inby Eric Kleppen.
Variance is a powerful statistic used in data analysis andmachine learning.
Calculating variance is easy usingPython.

What is variance?
Variance is a statistic that measures dispersion.
Depending on the statistical tests, uneven variance between samples couldskeworbiasresults.

One of the popularstatistical teststhat applies variance is called the analysis of variance (ANOVA) test.
For example, say you want to analyze whether social media use impacts hours of sleep.
The test can show whether results are explained by group differences or individual differences.

How do you find the variance?
Square each deviation to get a positive number.
Sum the squared values.

Divide the sum of squares by N or n-1.
69.5/6 = 11.583
There we have it!
The variance of our population is 11.583.

Why use n-1 when calculating the sample variance?
Applying n-1 to the formula is calledBessels correction, named after Friedrich Bessel.
When using samples, we need to calculate the estimated variance for the population.

Using n-1 will make the variance estimate larger, overestimating variability in samples, thus reducing biases.
Luckily, Python can easily handle the calculation for very large data.
Next, add logic to calculate the population mean.

After calculating the mean, find the differences from the mean for each value.
you’re free to do this in one line using a list comprehension.
Next, square the differences and sum them.

Lastly, calculate the variance.
Using an If/Else statement, we can utilize the is_sampleparameter.
If is_sampleis true, calculate variance using (n-1).

it’s possible for you to do it in one line of code using Pandas.
Lets load up some data and work through a real example of finding variance.
Lets find the variance for the numeric columns mileage, engine_power and price.

That makes sense because many factors play into the distance a person needs to drive.
By comparison, engine_power has a low variance which indicates the values dont vary widely from the mean.
Variance also impacts which statistical tests can help us make data driven decisions.

For large data sets, we saw how simple it is to calculate variance using Python and Pandas.
Also tagged with





