# Descriptive Statistics in Python

Let us understand **descriptive statistics** and implement all the concepts in the **Python** programming language using the **pandas** library.

# What is Descriptive Statistics?

As the name itself suggests, descriptive statistics briefly describes (summarizes) the features or characteristics of data.

A *descriptive statistic* (in the count-noun sense) is a summary statistic that quantitatively describes or summarizes features from a collection of information, while *descriptive statistics* (in the mass-noun sense) is the process of using and analysing those statistics.

# Types of Descriptive Statistics

There are 3 main types of descriptive statistics, namely *Distribution*, *Central Tendency* and *Variability*.

## Central Tendency

It describes the centre or central position of a given dataset. The 3 ways of finding the central tendency are *mean*, *median* and *mode*.

**Mean** is the most common method used to find the average of a dataset. It represents the sum of all values divided by the total number of values in a given dataset.

`Mean = Σx / n`

The mean is highly affected by outliers: a single extreme value in the calculation can significantly increase or decrease it.
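As a quick sketch (using Python's built-in `statistics` module and made-up numbers), one outlier is enough to shift the mean dramatically:

```python
import statistics

values = [2, 3, 5, 6, 4]                 # hypothetical sample
print(statistics.mean(values))            # 4

with_outlier = values + [100]             # a single extreme value added
print(statistics.mean(with_outlier))      # 20 — the mean jumps fivefold
```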

**Median** is the middle value in a dataset that is arranged in ascending order (from the smallest value to the largest value). If a dataset contains an even number of values, the median of the dataset is the mean of the two middle values.

Median of an odd-numbered data set:

`Median = value at position (n+1)/2 of the sorted data`

Median of an even-numbered data set:

`Median = mean of the values at positions n/2 and (n/2)+1`
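For example, with Python's built-in `statistics.median` on toy numbers:

```python
import statistics

odd = [3, 1, 7, 5, 9]             # sorted: 1, 3, 5, 7, 9 -> middle value at position (5+1)/2
print(statistics.median(odd))      # 5

even = [1, 3, 5, 7]                # middle two values are 3 and 5 -> their mean
print(statistics.median(even))     # 4.0
```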

**Mode** is the most frequently occurring value in a dataset. In some cases, a dataset may contain multiple modes, while some datasets may not have any mode at all.

`Ordered data set: 0, 3, 3, 12, 15, 24`

Mode: 3
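The same result can be obtained with the built-in `statistics` module (the `multimode` helper, available since Python 3.8, also handles datasets with several modes):

```python
import statistics

data = [0, 3, 3, 12, 15, 24]
print(statistics.mode(data))                   # 3

# A dataset can have more than one mode:
print(statistics.multimode([1, 1, 2, 2, 3]))   # [1, 2]
```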

## Variability

Variability is the measure of how spread out the data is. The two major measures of variability that we use in data science are *Standard deviation* and *Variance*.

**Variance** is the average of squared deviations from the mean. It is the square of the *standard deviation*, so its units are the square of the units of the original data and it is not on the same scale as typical values in the data set.

**Standard Deviation** is the average amount of variability in your dataset. It tells how far each score lies from the *mean*. The larger the standard deviation, the more variable the data set is.
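A minimal sketch of both measures, using the built-in `statistics` module and hypothetical scores (population versions, dividing by `n`):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]        # hypothetical scores, mean = 5

variance = statistics.pvariance(data)   # average squared deviation from the mean
std_dev  = statistics.pstdev(data)      # square root of the variance

print(variance)   # 4
print(std_dev)    # 2.0
```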

## Implementing the above measures in Python using the **pandas** library

Here, we are using an income dataset with Mthly_HH_Income, Mthly_HH_Expense, No_of_Fly_Members, Emi_or_Rent_Amt, Annual_HH_Income and No_of_Earning_Members as its features.

We import the dataset into a variable `df` using the pandas `read_csv` method (with `import pandas as pd` assumed throughout).

`df = pd.read_csv('data.csv')`

The **pandas** library has a very powerful method called *describe()*, which summarizes a pandas dataframe by displaying the necessary statistical details like count, mean, standard deviation, percentiles, etc.

`df.describe()`

Using the median() function we obtain the median of the numerical features.

`df.median()`

The mode() function outputs the mode of each feature.

`df.mode()`

The var() function obtains the variance of the features of the pandas dataframe.

`df.var()`

# Correlation between features

Correlation is a bivariate analysis that measures the strength of association between two variables and the direction of the relationship. — statisticssolutions

The correlation of the features can be calculated using corr() function.

`df_corr = df.corr()`

`df_corr`

The correlation of the variables can be visualized in a more organized and appealing way using a heatmap, for which we use **matplotlib** and **seaborn**.

```python
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(15, 15))
sns.heatmap(df_corr, vmin=-1, vmax=+1, annot=True)
```

We can observe that there is a very high **positive** correlation of **0.97** between *Annual HH income* and *Monthly HH income*, and a **positive** correlation of **0.64** between *Monthly HH expense* and *No of family members*. Therefore we can conclude that with an increase in the number of family members, the monthly household expenses increase. Similarly, with an increase in monthly income, the monthly expenses increase.
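To see what `corr()` computes, here is a tiny self-contained sketch on hypothetical numbers (not the income dataset), where expenses rise with income:

```python
import pandas as pd

# Hypothetical toy data: expense grows with income
toy = pd.DataFrame({
    "income":  [10, 20, 30, 40],
    "expense": [ 5, 11, 14, 20],
})

r = toy.corr().loc["income", "expense"]   # Pearson correlation by default
print(round(r, 2))   # 0.99 — a strong positive relationship
```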

## Normal Distribution

Normal Distribution is a probability function used in statistics that tells how the data values are distributed. It is the most important probability distribution in statistics because many real-world quantities approximately follow it, for example, the heights of a population, shoe sizes, IQ scores, and many more.
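As a quick illustration (synthetic data, not the income dataset), we can sample hypothetical heights from a normal distribution with NumPy and check the classic rule that roughly 68% of values fall within one standard deviation of the mean:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical heights (cm): normally distributed, mean 170, std dev 10
heights = rng.normal(loc=170, scale=10, size=100_000)

# About 68% of values fall within one standard deviation of the mean
within_one_sd = np.mean(np.abs(heights - 170) < 10)
print(round(within_one_sd, 2))
```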

Plotting the distribution of all the numerical features of the dataset using the seaborn *distplot* function:

```python
import numpy as np

numerical_features = df.select_dtypes(include=[np.number])

for i in numerical_features.columns:
    plt.figure(figsize=(9, 7))
    sns.distplot(numerical_features[i])  # deprecated: use histplot/displot in newer seaborn
    plt.show()
```

Similarly, the rest of the distplots of the remaining variables are plotted respectively.

## Skewness

Skewness is the measure of how much the probability distribution of a random variable deviates from the normal distribution.

- Skewness is a measure of the asymmetry of a distribution. Another measure that describes the shape of a distribution is kurtosis.
- In a normal distribution, the mean divides the curve symmetrically into two equal parts at the median and the value of skewness is zero.
- When the value of the skewness is negative, the tail of the distribution is longer towards the left-hand side of the curve.
- When the value of the skewness is positive, the tail of the distribution is longer towards the right-hand side of the curve.
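These sign conventions can be checked with `scipy.stats.skew` on toy numbers:

```python
from scipy.stats import skew

symmetric    = [1, 2, 3, 4, 5]        # evenly spread around the mean
right_skewed = [1, 1, 2, 2, 3, 10]    # long tail towards the right

print(skew(symmetric))     # 0.0 — perfectly symmetric
print(skew(right_skewed))  # positive — right-skewed
```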

Why is skewness important? Let's say our data is right-skewed, i.e. the tail of the distribution is towards the right-hand side. This means that most of the data points in our dataset have relatively low values. A model trained on such data will therefore perform better at predicting lower values than at predicting higher values.

Checking the skewness of the dataset using the `skew()` function of the pandas library:

`print("Skewness:\n", df.select_dtypes(include=[np.number]).skew())`

We can conclude that all the features are positively skewed, i.e. their tails are towards the right-hand side of the distribution. The feature **No_of_Fly_Members** has a skewness value close to 0, so we can say that it is very close to a normal distribution.

## Effect of Skewness on Mean, Median and Mode

- In case of a *right-skewed* distribution: **Mean > Median > Mode**
- In case of a *left-skewed* distribution: **Mode > Median > Mean**
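The right-skewed ordering can be verified on a small hypothetical sample with a long right tail:

```python
import statistics

right_skewed = [1, 1, 2, 3, 4, 10]        # hypothetical sample with a long right tail

mean   = statistics.mean(right_skewed)     # 3.5
median = statistics.median(right_skewed)   # 2.5
mode   = statistics.mode(right_skewed)     # 1

print(mean > median > mode)  # True
```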

## QQ Plot

When the quantiles of two variables are plotted against each other, the plot obtained is known as a quantile-quantile plot, or QQ plot. This plot provides a summary of whether the distributions of two variables are similar or not with respect to their locations. — GeeksforGeeks

QQ plots are very useful to determine

- If two populations are of the same distribution.
- If residuals follow a normal distribution.

In QQ plots, we plot the **theoretical quantile** values against the **sample quantile** values. Quantiles are obtained by sorting the data; a quantile tells how many values in a distribution lie above or below a certain limit.
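For instance, sample quantiles can be computed directly with NumPy (toy numbers; `np.quantile` sorts and interpolates for us):

```python
import numpy as np

data = np.array([3, 1, 7, 5, 9, 11, 2, 8])

print(np.quantile(data, 0.5))    # 6.0 — the median
print(np.quantile(data, 0.25))   # 2.75 — 25% of values lie at or below this point
```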

- If we get a straight line, the datasets we are comparing come from the same distribution.
- If the plot is not a straight line, they belong to different distributions.

Plotting a QQ plot in Python:

```python
import statsmodels.api as sm
import scipy.stats as stats

# Iterate through every numeric feature and obtain its QQ plot
for i in numerical_features.columns:
    sm.qqplot(df[i], dist=stats.norm)
    plt.title("QQ-plot for {}".format(i))
    plt.show()
```

## Box-Cox

Box Cox is a transformation method that transforms non-normal dependent variables in the data to a normal shape. — statisticshowto.com


Let's apply the stats.boxcox() method from the **scipy** library to our dataset.

```python
from scipy import stats

fitted_data, fitted_lambda = stats.boxcox(df['Annual_HH_Income'])

fig, ax = plt.subplots(1, 2)

sns.distplot(df['Annual_HH_Income'], hist=False, kde=True,
             kde_kws={'shade': True, 'linewidth': 2},
             label="Non-Normal", color="yellow", ax=ax[0])

sns.distplot(fitted_data, hist=False, kde=True,
             kde_kws={'shade': True, 'linewidth': 2},
             label="Normal", color="orange", ax=ax[1])

plt.legend(loc="upper right")
fig.set_figheight(5)
fig.set_figwidth(10)

print(f"Lambda value used for Transformation: {fitted_lambda}")
```

Here, the Annual Household Income feature, which was not normally distributed, became very close to a normal distribution after the transformation. In this way, we can transform skewed features towards a normal distribution.
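The effect can also be checked numerically on synthetic data (a made-up, strongly right-skewed exponential sample, not the income dataset): the skewness should drop to near zero after the Box-Cox transformation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
skewed = rng.exponential(scale=2.0, size=5_000)   # strongly right-skewed sample

transformed, lam = stats.boxcox(skewed)           # requires strictly positive data

# Skewness is far smaller after the transformation
print(stats.skew(skewed) > abs(stats.skew(transformed)))  # True
```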

Visit the **GitHub** repository for the dataset used in this project along with the Jupyter notebook. Feel free to clone the repository and use the notebook as you wish. Thank you for reading 💙 this short article.