Pandas

pandas module helps us to work with tabular data. It is more like a spreadsheet program. We can perform various data manipulation, cleaning, and even visualization using pandas.

Installation

Install pandas package:

pip install pandas

# for Microsoft Excel compatibility
pip install xlrd openpyxl

# for handling missing data
pip install missingno

Dataframe

Dataframe is like a variable that can hold a whole table. It is more than a variable, it is a python object. It has properties and methods. Let us create some tabular data:

import pandas as pd

# Create a python dictionary
data = {'Names':  ['Student 1', 'Student 2', 'Student 3'],
        'Age': [14, 15, 15],
        'Height': [150, 145, 152]
        }

# dataframe form a dictionary
df = pd.DataFrame(data)

We can visualize the table in Jupyter notebook:

df  # this shows rich view

# alternatively print plaintext tables
print(df)

       Names  Age  Height
0  Student 1   14     150
1  Student 2   15     145
2  Student 3   15     152

Each column in a dataframe is a pandas series. We can select columns:

df["Height"]

0    150
1    145
2    152
Name: Height, dtype: int64

Find the shape of the dataframe:

df.shape

(3, 3)

Let us append a new row of data:

df = df.append({'Names' : 'Student 4', \
                'Age' : 16, \
                'Height' : 155}, \
                ignore_index=True)
print(df)

       Names  Age  Height
0  Student 1   14     150
1  Student 2   15     145
2  Student 3   15     152
3  Student 4   16     155

Here shape is a property of the dataframe, while append is a method. General convention is: object.prop to get a property value and object.function() to perform some operation.

Assign a new column:

# add new column
grades = ["A", "B", "C", "B"]
df['grades'] = grades
print(df)

       Names  Age  Height grades
0  Student 1   14     150      A
1  Student 2   15     145      B
2  Student 3   15     152      C
3  Student 4   16     155      B

Insert column in a specific location:

df.pop('grades')  # remove last column
df.insert(1, 'Grades', grades)
print(df)

       Names Grades  Age  Height
0  Student 1      A   14     150
1  Student 2      B   15     145
2  Student 3      C   15     152
3  Student 4      B   16     155

Edit row:

df.loc[1] = ['Student 2', 'A', 15, 147]
print(df)

       Names Grades  Age  Height
0  Student 1      A   14     150
1  Student 2      A   15     147
2  Student 3      C   15     152
3  Student 4      B   16     155

Delete a row:

df.drop(df.index[[2]], inplace=True)

# Drop last row
# df.drop(df.index[[-1]], inplace=True)

print(df)

       Names Grades  Age  Height
0  Student 1      A   14     150
1  Student 2      A   15     147
3  Student 4      B   16     155

Sort by column:

df.sort_values(by="Age", ascending=False)

	Names	Grades	Age	Height
3	Student 4	B	16	155
1	Student 2	A	15	147
0	Student 1	A	14	150

Select columns based on values:

df[df["Height"] >= 150]

	Names	Grades	Age	Height
0	Student 1	A	14	150
3	Student 4	B	16	155

Excel and CSV

Save to .csv or .xlsx format:

# Save as both excel and csv
df.to_csv('student_info.csv', index=False)
df.to_excel('student_info.xlsx', index=False)

Read .csv or .xlsx files:

df_copy_csv = pd.read_csv('student_info.csv')
df_copy_xlsx = pd.read_excel('student_info.xlsx')

Now let us try to load a larger dataset to explore:

url = "https://pranabdas.github.io/drive/datasets/iris/iris.csv"
names = ["sepal_len", "sepal_width", "petal_len", "petal_width", "species"]
df = pd.read_csv(url, header=None, names=names)

First, let us check the size of the dataframe:

df.shape

(150, 5)

We have 150 rows and 5 columns of data. We can see the top of the dataset by:

df.head()

sepal_len	sepal_width	petal_len	petal_width	species
5.1	3.5	1.4	0.2	Iris-setosa
4.9	3.0	1.4	0.2	Iris-setosa
4.7	3.2	1.3	0.2	Iris-setosa
4.6	3.1	1.5	0.2	Iris-setosa
5.0	3.6	1.4	0.2	Iris-setosa

By default this shows first 5 rows, we can see any number of rows by:

df.head(10)

Similarly, we can see the tail of the dataframe as well:

df.tail()

sepal_len	sepal_width	petal_len	petal_width	species
6.7	3.0	5.2	2.3	Iris-virginica
6.3	2.5	5.0	1.9	Iris-virginica
6.5	3.0	5.2	2.0	Iris-virginica
6.2	3.4	5.4	2.3	Iris-virginica
5.9	3.0	5.1	1.8	Iris-virginica

We can also see any arbitrary rows:

df.loc[15:20]

sepal_len	sepal_width	petal_len	petal_width	species
5.7	4.4	1.5	0.4	Iris-setosa
5.4	3.9	1.3	0.4	Iris-setosa
5.1	3.5	1.4	0.3	Iris-setosa
5.7	3.8	1.7	0.3	Iris-setosa
5.1	3.8	1.5	0.3	Iris-setosa
5.4	3.4	1.7	0.2	Iris-setosa

We can also select specific rows and columns:

df.loc[15:20, "sepal_len"]

  5.7
  5.4
  5.1
  5.7
  5.1
  5.4
Name: sepal_len, dtype: float64

Instead of column header names, its index can be used:

df.iloc[15:20, 0:2]


sepal_len	sepal_width
5.7	4.4
5.4	3.9
5.1	3.5
5.7	3.8
5.1	3.8

Notice that for indexing loc and iloc behaves differently.

Datatype

Checking the data types. While importing, pandas might not correctly set the datatype for your data. You can check the datatype by:

df.dtypes

sepal_len      float64
sepal_width    float64
petal_len      float64
petal_width    float64
species         object
dtype: object

We can change a datatype of a column using astype:

df["sepal_len"] = df["sepal_len"].astype(int)
df.dtypes

sepal_len        int64
sepal_width    float64
petal_len      float64
petal_width    float64
species         object
dtype: object

Text manipulation

Say, I want to remove the "Iris-" prefix from the species names:

df["species"] = df["species"].str.replace("Iris-", "")

Remove leading or trailing spaces:

df["species"] = df["species"].str.strip()

Say, I want to make the names uppercase or lowercase or title formatted:

df["species"] = df["species"].str.title()
# df["species"] = df["species"].str.upper()
# df["species"] = df["species"].str.lower()
df.head()


sepal_len	sepal_width	petal_len	petal_width	species
0	5.1	3.5	1.4	0.2	Setosa
1	4.9	3.0	1.4	0.2	Setosa
2	4.7	3.2	1.3	0.2	Setosa
3	4.6	3.1	1.5	0.2	Setosa
4	5.0	3.6	1.4	0.2	Setosa

We create a new pandas series to make some more data manipulation:

names = pd.Series(['Albert Einstein', 'Richard Feynman', 'Issac Newton'])

We want to reorder the first and last names:

names = names.str.split(" ")
names = pd.Series([i[::-1] for i in names])
names = [" ".join(i) for i in names]
print(names)

['Einstein Albert', 'Feynman Richard', 'Newton Issac']

Customizing pandas

pd.set_options('max_rows', 5)
pd.set_options('max_columns', 4)

Notebooks

If you have issues with viewing the notebooks in web browser, please clone or download the repository and open the notebooks in Jupyter notebook.

tip

If you are interested in advanced use of pandas, please check out my machine learning tutorial and notebooks.

Installation​

Dataframe​

Excel and CSV​

Datatype​

Text manipulation​

Customizing pandas​

Notebooks​