Descriptive Statistics in Python

Dipankar Medhi
7 min read · Nov 18, 2021

Photo by Carlos Muza on Unsplash

Let us understand descriptive statistics and implement the concepts in Python using the pandas library.

What is Descriptive Statistics?

As the name itself suggests, descriptive statistics briefly describes (summarizes) the features or characteristics of data.

A descriptive statistic (in the count noun sense) is a summary statistic that quantitatively describes or summarizes features from a collection of information, while descriptive statistics (in the mass noun sense) is the process of using and analysing those statistics.

Wikipedia

Types of Descriptive Statistics

There are 3 main types of descriptive statistics, namely

  • Distribution
  • Central Tendency
  • Variability.

Central Tendency:-

It describes the centre or central position of a given dataset. The 3 ways of finding the central tendency are mean, median and mode.

Mean is the most common method used to find the average of a dataset. It represents the sum of all values divided by the total number of values in a given dataset.

Mean = Σx / n

The mean is highly sensitive to outliers: a few extreme values in the calculation can raise or lower it significantly.
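To make the outlier effect concrete, here is a minimal sketch of the mean formula in plain Python. The income values are made up for illustration and are not taken from the dataset used later in the article.

```python
# Hypothetical monthly incomes (illustrative values, not from the article's dataset)
incomes = [20, 22, 25, 27, 30]
incomes_with_outlier = incomes + [500]   # add one extreme value


def mean(values):
    """Mean = sum of all values / number of values."""
    return sum(values) / len(values)


mean_before = mean(incomes)
mean_after = mean(incomes_with_outlier)

print(mean_before)  # 24.8
print(mean_after)   # 104.0
```

A single extreme value moves the mean from about 25 to over 100, even though five of the six values are 30 or less.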

Median is the middle value in a dataset that is arranged in ascending order (from the smallest value to the largest value). If a dataset contains an even number of values, the median of the dataset is the mean of the two middle values.

Median of an odd-numbered data set:-

Median = value at position (n+1)/2 in the sorted data

Median of an even-numbered data set:-

Median = mean of the values at positions n/2 and (n/2)+1 in the sorted data
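The two position rules can be sketched as a small function. The helper name median_by_position is hypothetical and exists only for this illustration; the standard library's statistics.median is used as a cross-check.

```python
from statistics import median


def median_by_position(values):
    """Median via the position rules above (positions are 1-based in the sorted data)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the value at position (n + 1) / 2
        return ordered[(n + 1) // 2 - 1]
    # Even n: mean of the values at positions n/2 and (n/2) + 1
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


odd_data = [7, 1, 5, 3, 9]           # sorted: 1, 3, 5, 7, 9
even_data = [0, 3, 3, 12, 15, 24]    # illustrative values

print(median_by_position(odd_data))   # 5
print(median_by_position(even_data))  # 7.5

# Cross-check against the standard library
print(median(odd_data) == median_by_position(odd_data))
print(median(even_data) == median_by_position(even_data))
```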

Mode is the most frequently occurring value in a dataset. In some cases, a dataset may contain multiple modes, while some datasets may not have any mode at all.

Ordered data set: 0, 3, 3, 12, 15, 24
Mode: 3
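As a quick check of the three cases described above, the sketch below uses statistics.multimode from the standard library (Python 3.8+). Note that when every value occurs exactly once, multimode returns all of the values; this corresponds to what is usually described as a dataset with no mode. The second and third lists are made-up examples.

```python
from statistics import multimode

one_mode = multimode([0, 3, 3, 12, 15, 24])   # the ordered data set from the article
two_modes = multimode([1, 1, 2, 5, 5, 9])     # two values tie for most frequent
all_unique = multimode([4, 8, 15, 16])        # every value occurs exactly once

print(one_mode)    # [3]
print(two_modes)   # [1, 5]
print(all_unique)  # [4, 8, 15, 16]
```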

Variability:-

Variability is the measure of how spread out the data is. The two major measures of variability that we use in data science are Standard deviation and Variance.

Variance is the average of the squared deviations from the mean. It is the square of the standard deviation, so it is expressed in squared units and its values are usually much larger than a typical value of the data set.

Variance formula: σ² = Σ(x − μ)² / N, where μ is the mean and N is the number of values

Standard deviation is the square root of the variance, so it has the same units as the data. It tells us, on average, how far each value lies from the mean. The larger the standard deviation, the more variable the data set is.

Standard deviation formula: σ = √(Σ(x − μ)² / N)
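A small worked example may help here. The sketch assumes the usual definitions: the population variance divides the sum of squared deviations by n, while the sample variance divides by n - 1. The data values are illustrative only.

```python
import math
from statistics import pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative values
n = len(data)
mean = sum(data) / n               # 5.0

squared_deviations = [(x - mean) ** 2 for x in data]

population_variance = sum(squared_deviations) / n     # divide by n
sample_variance = sum(squared_deviations) / (n - 1)   # divide by n - 1
population_std = math.sqrt(population_variance)       # same units as the data

print(population_variance)  # 4.0
print(population_std)       # 2.0

# Cross-check against the standard library implementations
print(math.isclose(pvariance(data), population_variance))
print(math.isclose(variance(data), sample_variance))
```

The sample variance (32 / 7, roughly 4.57) is slightly larger than the population variance because of the smaller divisor.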

Implementing the above measures in Python using the pandas library:-

Here, we use an income dataset with the features Mthly_HH_Income, Mthly_HH_Expense, No_of_Fly_Members, Emi_or_Rent_Amt, Annual_HH_Income and No_of_Earning_Members.

We import the dataset using the pandas read_csv method into a variable df.

import pandas as pd

df = pd.read_csv('data.csv')

The pandas library has a very useful method called describe(), which summarizes a DataFrame by displaying statistics such as the count, mean, standard deviation, minimum, maximum and percentiles of each numerical feature.

df.describe()
pandas describe function output.

Using the median() function we obtain the median of the numerical features.

df.median(numeric_only=True)
median of the features using median() function.

The mode() function outputs the mode of each feature.

df.mode()
mode using mode() function

The var() function returns the variance of each numerical feature. By default, pandas computes the sample variance (dividing by n - 1).

df.var(numeric_only=True)
var() function outputs variance of the features.
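The following sketch shows these calls on a tiny hand-made DataFrame so the results can be checked by eye. The column names mimic the article's dataset, but the values and the City column are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    "Mthly_HH_Income": [10, 20, 20, 30, 70],
    "No_of_Fly_Members": [2, 3, 3, 4, 6],
    "City": ["A", "B", "B", "C", "A"],   # non-numeric column (hypothetical)
})

medians = toy.median(numeric_only=True)   # skips the text column
modes = toy.mode()                        # one row per mode; NaN pads columns with fewer modes

sample_var = toy["Mthly_HH_Income"].var()            # default ddof=1: divide by n - 1
population_var = toy["Mthly_HH_Income"].var(ddof=0)  # divide by n

print(medians)
print(modes)
print(sample_var, population_var)
```

Here the income column has a sample variance of 550 and a population variance of 440, and the City column has two modes (A and B), so mode() returns two rows.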

Correlation between features

Correlation is a bivariate analysis that measures the strength of association between two variables and the direction of the relationship. — statisticssolutions

Correlation between two variables

The correlation of the features can be calculated using the corr() function.

df_corr = df.corr(numeric_only=True)  # numeric_only requires pandas 1.5 or later
df_corr
correlation table of the features/variables

The correlations can be visualized in a more organized and appealing way using a heatmap, for which we use matplotlib and seaborn.

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(15, 15))
sns.heatmap(df_corr, vmin=-1, vmax=+1, annot=True)
Correlation heatmap

We can observe a very strong positive correlation of 0.97 between Annual HH income and Monthly HH income, and a positive correlation of 0.64 between Monthly HH expense and No of family members. This suggests that monthly household expenses tend to increase with the number of family members. Similarly, monthly expenses tend to increase with monthly income.
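To show what a correlation coefficient measures, here is a minimal sketch of the Pearson correlation, which is the default method used by the pandas corr() function. The helper pearson_r and the data values are hypothetical and only serve as an illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y divided by the product of their spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator


# Hypothetical values, not taken from the article's dataset
family_members = [2, 3, 4, 5, 6]
monthly_expense = [10, 14, 15, 21, 25]

r = pearson_r(family_members, monthly_expense)
r_perfect = pearson_r(family_members, [2 * m for m in family_members])

print(round(r, 2))          # 0.98
print(round(r_perfect, 2))  # 1.0
```

A value near +1 means the two variables rise together almost linearly; a value near -1 would mean one falls as the other rises.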

Normal Distribution

The normal distribution is a probability distribution that describes how data values are spread symmetrically around the mean. It is the most important probability distribution in statistics because many real-world quantities approximately follow it, for example the height of a population, shoe size, IQ scores, the average of many die rolls, and many more.

Normal Distribution (Towardsdatascience.com)
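One well-known property of the normal distribution is that roughly 68%, 95% and 99.7% of values fall within one, two and three standard deviations of the mean. The sketch below checks this with statistics.NormalDist from the standard library (Python 3.8+); the mean and standard deviation are hypothetical, and the helper share_within is defined only for this example.

```python
from statistics import NormalDist

# Hypothetical parameters: heights with mean 170 cm and standard deviation 10 cm
heights = NormalDist(mu=170, sigma=10)


def share_within(dist, k):
    """Probability of falling within k standard deviations of the mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower


for k in (1, 2, 3):
    print(k, round(share_within(heights, k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```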

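Plotting the distribution of all the numerical features of the dataset using the seaborn histplot function (distplot, used in older seaborn versions, is deprecated).

# np.object was removed from NumPy, so the numeric columns are selected by dtype instead
numberical_features = df.select_dtypes(include="number")
for i in numberical_features.columns:
    plt.figure(figsize=(9, 7))
    # Histogram with a kernel density estimate overlaid
    sns.histplot(numberical_features[i], kde=True)
    plt.show()

Similarly, the distributions of the remaining variables are plotted.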

Skewness

Skewness is the measure of how much the probability distribution of a random variable deviates from a symmetric distribution such as the normal distribution.

  • Skewness is a measure of the asymmetry of a distribution. Another measure that describes the shape of a distribution is kurtosis.
  • In a normal distribution, the mean and the median coincide, the curve is symmetric about them, and the value of skewness is zero.
  • When the value of the skewness is negative, the tail of the distribution is longer towards the left-hand side of the curve.
  • When the value of the skewness is positive, the tail of the distribution is longer towards the right-hand side of the curve.
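The sign convention in the points above can be checked with a simple moment-based skewness formula (third central moment divided by the second central moment raised to the power 1.5). This is a sketch under that assumption; the pandas skew() function applies an additional small-sample correction, so its values differ slightly for small datasets. The data values are made up.

```python
def skewness(values):
    """Moment-based skewness: third central moment / (second central moment ** 1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]          # long tail on the right
left_skewed = [-x for x in right_skewed]    # mirror image: long tail on the left

print(skewness(symmetric))          # 0.0
print(skewness(right_skewed) > 0)   # True
print(skewness(left_skewed) < 0)    # True
```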

Why is skewness important? Let’s say our data is right-skewed, i.e. the tail of the distribution extends toward the right-hand side. This means that most of the data points have low values and only a few have high values. A model trained on such data will therefore tend to predict low values better than high values.

Checking the skewness of the dataset using the skew() function of the pandas library.

print("Skewness:\n", df.select_dtypes(include="number").skew())
The skewness of the features of the dataset.

We can conclude that all the features are positively skewed, i.e. their tails extend towards the right-hand side of the distribution. The feature No_of_Fly_Members has a skewness value close to 0, so its distribution is close to symmetric.

Effect of Skewness on Mean, Median and Mode

Left(Negative) and Right(Positive) skewed (statisticshowto.com)
  • In a right-skewed distribution, typically Mean > Median > Mode
  • In a left-skewed distribution, typically Mode > Median > Mean
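The ordering can be seen on a small made-up example using the standard library's statistics module. This is only an illustration of the rule of thumb; it does not hold for every skewed dataset.

```python
from statistics import mean, median, mode

right_skewed = [1, 2, 2, 2, 3, 4, 5, 9, 17]    # made-up values with a long right tail
left_skewed = [18 - x for x in right_skewed]   # mirrored values with a long left tail

# Right-skewed: mean 5 > median 3 > mode 2
print(mean(right_skewed), median(right_skewed), mode(right_skewed))

# Left-skewed: mode 16 > median 15 > mean 13
print(mean(left_skewed), median(left_skewed), mode(left_skewed))
```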

QQ Plot

When the quantiles of two variables are plotted against each other, then the plot obtained is known as quantile – quantile plot or qqplot. This plot provides a summary of whether the distributions of two variables are similar or not with respect to the locations. (GeeksforGeeks)

QQ plots are very useful to determine

  • If two populations are of the same distribution.
  • If residuals follow a normal distribution.

In Q-Q plots, we plot the theoretical quantile values against the sample quantile values. The sample quantiles are obtained by sorting the data; a quantile is the value below which a given fraction of the distribution lies.

  • If the points fall approximately on a straight line, the sample is consistent with the theoretical distribution (or the two samples come from the same distribution).
  • If the points deviate systematically from a straight line, the distributions differ.
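The idea behind the plot can be sketched without any plotting library. The code below pairs sorted sample values with standard normal quantiles and measures how straight the resulting points are using their correlation. The helper names, the plotting positions (i - 0.5) / n and the simulated samples are all assumptions made for this illustration; plotting libraries may use slightly different conventions.

```python
import math
import random
from statistics import NormalDist


def qq_points(sample):
    """Pair each sorted sample value with the standard normal quantile at the same rank."""
    ordered = sorted(sample)
    n = len(ordered)
    # Plotting positions (i - 0.5) / n are one common convention (an assumption here)
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theoretical, ordered


def straightness(xs, ys):
    """Pearson correlation of the Q-Q points; values near 1 mean a nearly straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the run is repeatable
normal_sample = [rng.gauss(50, 10) for _ in range(500)]      # simulated normal data
skewed_sample = [rng.expovariate(1.0) for _ in range(500)]   # simulated right-skewed data

r_normal = straightness(*qq_points(normal_sample))
r_skewed = straightness(*qq_points(skewed_sample))
print(r_normal, r_skewed)
```

The normal sample should give a value very close to 1, while the skewed sample gives a noticeably lower value because its Q-Q points bend away from a straight line.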

Plotting a QQ plot in Python:

import scipy.stats as stats
import statsmodels.api as sm

# Let's iterate through every numeric feature and obtain their qqplot
for i in numberical_features.columns:
    # line="s" adds a reference line scaled to the sample's mean and standard deviation
    sm.qqplot(df[i], dist=stats.norm, line="s")
    plt.title("QQ-plot for {}".format(i))
    plt.show()

Box-Cox

Box Cox is a transformation method that transforms non-normal dependent variables in the data to a normal shape. — statisticshowto.com

Let’s apply the stats.boxcox() method from the scipy library to our dataset. Note that Box-Cox requires strictly positive input values.

from scipy import stats

# boxcox returns the transformed data and the lambda that was fitted
fitted_data, fitted_lambda = stats.boxcox(df['Annual_HH_Income'])

fig, ax = plt.subplots(1, 2, figsize=(10, 5))

# kdeplot replaces the deprecated distplot(hist=False, kde=True)
sns.kdeplot(x=df['Annual_HH_Income'], fill=True, linewidth=2,
            label="Non-Normal", color="yellow", ax=ax[0])
sns.kdeplot(x=fitted_data, fill=True, linewidth=2,
            label="Normal", color="orange", ax=ax[1])

# Show a legend on each subplot
ax[0].legend(loc="upper right")
ax[1].legend(loc="upper right")
plt.show()

print(f"Lambda value used for Transformation: {fitted_lambda}")
BoxCox transformation.

Here, the Annual Household income feature, which was not normally distributed, becomes very close to a normal distribution after the transformation. In the same way, we can transform other skewed, strictly positive features towards a normal distribution.
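For reference, the transformation itself is simple: for a chosen lambda it computes (x^lambda - 1) / lambda, and log(x) when lambda is 0. The sketch below implements that formula directly; the helper names and data are hypothetical. When no lambda is supplied, scipy's stats.boxcox picks the lambda that maximizes the log-likelihood, which this sketch does not attempt.

```python
import math


def box_cox(values, lam):
    """Box-Cox transform: (x**lam - 1) / lam for lam != 0, log(x) for lam == 0."""
    if any(x <= 0 for x in values):
        raise ValueError("Box-Cox is only defined for strictly positive values")
    if lam == 0:
        return [math.log(x) for x in values]
    return [(x ** lam - 1) / lam for x in values]


def skewness(values):
    """Moment-based skewness, used here only to compare before and after."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


print(box_cox([1, 4, 9], 0.5))   # [0.0, 2.0, 4.0]

# Made-up right-skewed data: exponentials of evenly spaced numbers
skewed = [math.exp(k) for k in (-2, -1, 0, 1, 2)]
transformed = box_cox(skewed, 0)   # lambda = 0 is the log transform

print(skewness(skewed) > 0)                  # True: long right tail before
print(abs(skewness(transformed)) < 1e-9)     # True: symmetric after
```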

Visit the GitHub repository for the dataset used in this project along with the Jupyter notebook. Feel free to clone the repository and use the notebook as you wish. Thank you for reading 💙 this short article.
