# Basic Statistical Analysis

## Univariate dataset

Here we look at a simple (univariate) dataset and see how we can describe it statistically. We work in a Jupyter notebook. First, let us load the Python libraries we will be using:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# some matplotlib configs
%matplotlib inline
plt.rcParams["figure.dpi"] = 150
plt.rcParams["figure.facecolor"] = "white"
```

Let us load our dataset of human heights (in inches). The dataset contains more columns (variables), but we will be looking only at the height.

```python
data = pd.read_csv("../datasets/height-weight.csv", usecols=[1])
data.head()
```

We will see an output like the one below. You can also pass an argument to the head method (e.g., data.head(10)) to display a specific number of rows instead of the default 5.

```text
   Height(Inches)
0        65.78331
1        71.51521
2        69.39874
3        68.21660
4        67.78781
```

We can get a summary of our data using the pandas describe method:

```python
data.describe()
```

```text
       Height(Inches)
count    25000.000000
mean        67.993114
std          1.901679
min         60.278360
25%         66.704397
50%         67.995700
75%         69.272958
max         75.152800
```

Now let us try to understand what these various quantities mean.

#### Count, min and max

Count is the number of data points we have in our dataset. Min and max are the lowest and highest values of data points, respectively.

#### Mean and standard deviation

We have seen previously how they are defined. The mean tells us where our data is centered. The standard deviation tells us how spread out the data is. For approximately normally distributed data, about 68% of the data points fall in the range mean ± std, about 95% fall in the range mean ± 2*std, and about 99.7% fall in the range mean ± 3*std. Loosely speaking, the standard deviation is the typical distance of a data point from the mean.
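The rule of thumb for mean ± k*std can be checked numerically. The sketch below does not read the height file; it assumes a synthetic, normally distributed sample (mean 68, std 1.9, 25000 points) as a stand-in, and the variable names are our own.

```python
import numpy as np

# Synthetic stand-in for the height column (assumption: normal, mean 68, std 1.9).
rng = np.random.default_rng(seed=0)
heights = rng.normal(loc=68.0, scale=1.9, size=25_000)

mean = heights.mean()
std = heights.std(ddof=1)  # sample standard deviation, as pandas describe() reports

# Fraction of points within k standard deviations of the mean.
fractions = {}
for k in (1, 2, 3):
    within = np.abs(heights - mean) <= k * std
    fractions[k] = within.mean()
    print(f"within mean ± {k}*std: {fractions[k]:.3f}")
```

For data that is far from bell-shaped, these fractions can differ noticeably, so the rule should be read as a property of (near) normal data.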

#### Inter quartile ranges

The quantities 25%, 50%, and 75% are the quartiles, also referred to as the first, second, and third quartile. The second quartile is also known as the median. The first quartile is the value below which 25% of the data points fall. The interquartile range (IQR) is defined as (3rd quartile - 1st quartile); in our case (69.272958 - 66.704397) = 2.568561. This also describes how our data points are spread. The range is defined as (max value - min value).
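As a small illustration, the quartiles, interquartile range, and range can be computed directly with pandas. The sketch uses only the five heights shown by data.head() above, so the numbers differ from the full-dataset summary; the variable names are our own.

```python
import pandas as pd

# First five heights shown by data.head() above (inches).
heights = pd.Series([65.78331, 71.51521, 69.39874, 68.21660, 67.78781])

# quantile() uses linear interpolation between data points by default.
q1, median, q3 = heights.quantile([0.25, 0.50, 0.75])
iqr = q3 - q1
value_range = heights.max() - heights.min()

print(f"Q1 = {q1:.5f}, median = {median:.5f}, Q3 = {q3:.5f}")
print(f"IQR = {iqr:.5f}, range = {value_range:.5f}")
```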

Note: For a symmetrical dataset, the mean and median have (approximately) the same value. If the data is right skewed, the mean will be higher than the median. Conversely, if the data is left skewed, the mean will be lower than the median.
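A quick way to see this relationship is to compare mean and median on small toy samples. The three lists below are hypothetical values chosen only to illustrate the three cases; they are not taken from the height dataset.

```python
from statistics import mean, median

# Hypothetical toy samples, chosen only to illustrate the three cases.
samples = {
    "symmetric": [1, 2, 3, 4, 5],
    "right skewed": [1, 2, 2, 3, 3, 3, 4, 20],
    "left skewed": [1, 17, 18, 18, 19, 19, 19, 20],
}

for name, values in samples.items():
    print(f"{name:>12}: mean = {mean(values):.3f}, median = {median(values):.3f}")
```

In the skewed samples a single extreme value pulls the mean toward the tail, while the median barely moves.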

#### Standard score or z-score

The standard score, or z-score, of a specific data point is the number of standard deviations by which it lies above or below the mean:

$z\text{-score} = \frac{\text{value} - \text{mean}}{\text{standard~deviation}}$
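As a worked example, we can plug in the mean and standard deviation reported by data.describe() and the first height shown by data.head(). The variable names are our own.

```python
# Summary values reported by data.describe() above.
mean = 67.993114
std = 1.901679

# First height shown by data.head() (inches).
value = 65.78331

z_score = (value - mean) / std
print(f"z-score = {z_score:.3f}")  # roughly -1.16
```

A negative z-score means the value lies below the mean; here the height is a little more than one standard deviation below it.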

### Histogram plot

A histogram plot gives a good indication of the mean, median, distribution, and range of our dataset. Here we will use seaborn to make plots.

```python
sns.histplot(data, x="Height(Inches)", bins=20)
plt.show()
```

We can infer from the histogram plot that our data is symmetrical with a bell-shaped (normal) distribution, centered around 68 inches, with most values between about 62 and 75 inches. The distribution is unimodal (it has only one peak; if there were two peaks, we would call it bimodal).

For univariate data, we can also plot a smooth distribution curve (kernel density plot) along with the histogram by setting the kde parameter to True.

```python
sns.histplot(data, x="Height(Inches)", bins=20, kde=True)
```

Main aspects of a histogram plot:

• Shape: overall appearance of histogram; could be symmetric, bell shaped, left skewed or right skewed, etc.
• Center: mean or median. In order to find the median, we can draw a vertical separator line such that the area under the curve on the left and right are equal.
• Spread: how the data is distributed. Range, Interquartile Range (IQR), standard deviation, variance.
• Outliers: data points that fall far from the bulk of the data. If the data is, say, right skewed (long tail on the right), there will likely be outliers on the right as well.
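The shape of a distribution can also be summarized with a number. The sketch below uses the pandas skew method, which is close to 0 for symmetric data and positive for right-skewed data. Both samples are synthetic assumptions (a normal sample and an exponential sample), not the height data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Synthetic samples (assumptions, not the height data).
symmetric = pd.Series(rng.normal(loc=68.0, scale=1.9, size=25_000))
right_skewed = pd.Series(rng.exponential(scale=2.0, size=25_000))

for name, s in [("symmetric", symmetric), ("right skewed", right_skewed)]:
    print(
        f"{name:>12}: skew = {s.skew():.2f}, "
        f"mean = {s.mean():.2f}, median = {s.median():.2f}"
    )
```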

### Box plot

```python
sns.boxplot(data=data["Height(Inches)"])
plt.show()
```

The important parts of the box plot are described in the diagram below:
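As a rough numerical companion to the box plot, the sketch below computes the limits within which the whiskers can end. It assumes seaborn's default setting whis=1.5, where each whisker reaches the furthest data point within 1.5 × IQR of the box and points beyond are drawn individually as outliers. The quartiles, min, and max are the values from data.describe(); the variable names are our own.

```python
# Quartiles, minimum and maximum reported by data.describe() above.
q1, q3 = 66.704397, 69.272958
data_min, data_max = 60.278360, 75.152800

iqr = q3 - q1  # height of the box

# Whisker limits under the 1.5 * IQR convention (assumed seaborn default).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"IQR = {iqr:.6f}")
print(f"whisker limits: {lower_fence:.4f} to {upper_fence:.4f}")
print(f"outliers below lower limit: {data_min < lower_fence}")
print(f"outliers above upper limit: {data_max > upper_fence}")
```

Since the minimum and maximum both lie outside these limits, we would expect the plot to show individual outlier points on both sides.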

### Violin plot

```python
sns.violinplot(x=data["Height(Inches)"])
plt.show()
```

### Q-Q plot

Quantile-quantile plot.
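A Q-Q plot compares the sorted sample values with the quantiles we would expect from a reference distribution; if the points fall close to a straight line, the sample is consistent with that distribution. The sketch below only computes the points for a normal Q-Q plot. It assumes a synthetic normal sample as a stand-in for the heights and the plotting positions (i - 0.5)/n, which is one common convention among several.

```python
from statistics import NormalDist

import numpy as np

# Synthetic stand-in for the height column (assumption: normal, mean 68, std 1.9).
rng = np.random.default_rng(seed=0)
heights = rng.normal(loc=68.0, scale=1.9, size=1_000)

# Sample quantiles: simply the sorted data.
sample_quantiles = np.sort(heights)

# Theoretical quantiles of the standard normal at positions (i - 0.5) / n.
n = sample_quantiles.size
probabilities = (np.arange(1, n + 1) - 0.5) / n
standard_normal = NormalDist()
theoretical_quantiles = np.array([standard_normal.inv_cdf(p) for p in probabilities])

# A correlation close to 1 means the points lie close to a straight line.
correlation = np.corrcoef(theoretical_quantiles, sample_quantiles)[0, 1]
print(f"correlation between theoretical and sample quantiles: {correlation:.4f}")
```

Plotting theoretical_quantiles on the x axis against sample_quantiles on the y axis (for example with plt.scatter) gives the Q-Q plot itself.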