Thisarticlewas originally published onBuilt Inby Eric Kleppen.

Variance is a powerful statistic used in data analysis andmachine learning.

Calculating variance is easy usingPython.

The guide to find variance using Python

What is variance?

Variance is a statistic that measures dispersion.

Depending on the statistical tests, uneven variance between samples couldskeworbiasresults.

Calculation for finding variance in Python

One of the popularstatistical teststhat applies variance is called the analysis of variance (ANOVA) test.

For example, say you want to analyze whether social media use impacts hours of sleep.

The test can show whether results are explained by group differences or individual differences.

Calculation for finding variance in Python

How do you find the variance?

Square each deviation to get a positive number.

Sum the squared values.

Square each deviation with a positive number

Divide the sum of squares by N or n-1.

69.5/6 = 11.583

There we have it!

The variance of our population is 11.583.

Sum the squared values

Why use n-1 when calculating the sample variance?

Applying n-1 to the formula is calledBessels correction, named after Friedrich Bessel.

When using samples, we need to calculate the estimated variance for the population.

recalculate the variance pretending the values are from a sample

Using n-1 will make the variance estimate larger, overestimating variability in samples, thus reducing biases.

Luckily, Python can easily handle the calculation for very large data.

Next, add logic to calculate the population mean.

Start by defining the function that takes in the two parameters.

After calculating the mean, find the differences from the mean for each value.

you’re free to do this in one line using a list comprehension.

Next, square the differences and sum them.

Next, add logic to calculate the population mean.

Lastly, calculate the variance.

Using an If/Else statement, we can utilize the is_sampleparameter.

If is_sampleis true, calculate variance using (n-1).

find the differences from the mean for each value.

it’s possible for you to do it in one line of code using Pandas.

Lets load up some data and work through a real example of finding variance.

Lets find the variance for the numeric columns mileage, engine_power and price.

Next, square the differences and sum them.

That makes sense because many factors play into the distance a person needs to drive.

By comparison, engine_power has a low variance which indicates the values dont vary widely from the mean.

Variance also impacts which statistical tests can help us make data driven decisions.

Calculate the variance

For large data sets, we saw how simple it is to calculate variance using Python and Pandas.

Also tagged with

How to find the variance in Python

reading the CSV file into a Pandas data frame

We can count the number of rows in the data set and display the first five rows to make sure everything loaded correctly:

Displaying the first rows using bmw_df.head()

Variance for numeric columns in the BMW data frame

Pandas var() function