Pandas
pandas module helps us to work with tabular data. It is more like a spreadsheet program. We can perform various data manipulation, cleaning, and even visualization using pandas.
Installation
Install pandas package:
pip install pandas
# for Microsoft Excel compatibility
pip install xlrd openpyxl
# for handling missing data
pip install missingno
Dataframe
Dataframe is like a variable that can hold a whole table. It is more than a variable, it is a python object. It has properties and methods. Let us create some tabular data:
import pandas as pd
# Create a python dictionary
data = {'Names': ['Student 1', 'Student 2', 'Student 3'],
'Age': [14, 15, 15],
'Height': [150, 145, 152]
}
# dataframe form a dictionary
df = pd.DataFrame(data)
We can visualize the table in Jupyter notebook:
df # this shows rich view
# alternatively print plaintext tables
print(df)
Names Age Height
0 Student 1 14 150
1 Student 2 15 145
2 Student 3 15 152
Each column in a dataframe is a pandas series. We can select columns:
df["Height"]
0 150
1 145
2 152
Name: Height, dtype: int64
Find the shape of the dataframe:
df.shape
(3, 3)
Let us append a new row of data:
df = df.append({'Names' : 'Student 4', \
'Age' : 16, \
'Height' : 155}, \
ignore_index=True)
print(df)
Names Age Height
0 Student 1 14 150
1 Student 2 15 145
2 Student 3 15 152
3 Student 4 16 155
Here shape
is a property of the dataframe, while append
is a method.
General convention is: object.prop
to get a property value and
object.function()
to perform some operation.
Assign a new column:
# add new column
grades = ["A", "B", "C", "B"]
df['grades'] = grades
print(df)
Names Age Height grades
0 Student 1 14 150 A
1 Student 2 15 145 B
2 Student 3 15 152 C
3 Student 4 16 155 B
Insert column in a specific location:
df.pop('grades') # remove last column
df.insert(1, 'Grades', grades)
print(df)
Names Grades Age Height
0 Student 1 A 14 150
1 Student 2 B 15 145
2 Student 3 C 15 152
3 Student 4 B 16 155
Edit row:
df.loc[1] = ['Student 2', 'A', 15, 147]
print(df)
Names Grades Age Height
0 Student 1 A 14 150
1 Student 2 A 15 147
2 Student 3 C 15 152
3 Student 4 B 16 155
Delete a row:
df.drop(df.index[[2]], inplace=True)
# Drop last row
# df.drop(df.index[[-1]], inplace=True)
print(df)
Names Grades Age Height
0 Student 1 A 14 150
1 Student 2 A 15 147
3 Student 4 B 16 155
Sort by column:
df.sort_values(by="Age", ascending=False)
Names Grades Age Height
3 Student 4 B 16 155
1 Student 2 A 15 147
0 Student 1 A 14 150
Select columns based on values:
df[df["Height"] >= 150]
Names Grades Age Height
0 Student 1 A 14 150
3 Student 4 B 16 155
Excel and CSV
Save to .csv
or .xlsx
format:
# Save as both excel and csv
df.to_csv('student_info.csv', index=False)
df.to_excel('student_info.xlsx', index=False)
Read .csv
or .xlsx
files:
df_copy_csv = pd.read_csv('student_info.csv')
df_copy_xlsx = pd.read_excel('student_info.xlsx')
Now let us try to load a larger dataset to explore:
url = "https://pranabdas.github.io/drive/datasets/iris/iris.csv"
names = ["sepal_len", "sepal_width", "petal_len", "petal_width", "species"]
df = pd.read_csv(url, header=None, names=names)
First, let us check the size of the dataframe:
df.shape
(150, 5)
We have 150 rows and 5 columns of data. We can see the top of the dataset by:
df.head()
sepal_len sepal_width petal_len petal_width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
By default this shows first 5 rows, we can see any number of rows by:
df.head(10)
Similarly, we can see the tail of the dataframe as well:
df.tail()
sepal_len sepal_width petal_len petal_width species
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica
We can also see any arbitrary rows:
df.loc[15:20]
sepal_len sepal_width petal_len petal_width species
15 5.7 4.4 1.5 0.4 Iris-setosa
16 5.4 3.9 1.3 0.4 Iris-setosa
17 5.1 3.5 1.4 0.3 Iris-setosa
18 5.7 3.8 1.7 0.3 Iris-setosa
19 5.1 3.8 1.5 0.3 Iris-setosa
20 5.4 3.4 1.7 0.2 Iris-setosa
We can also select specific rows and columns:
df.loc[15:20, "sepal_len"]
15 5.7
16 5.4
17 5.1
18 5.7
19 5.1
20 5.4
Name: sepal_len, dtype: float64
Instead of column header names, its index can be used:
df.iloc[15:20, 0:2]
sepal_len sepal_width
15 5.7 4.4
16 5.4 3.9
17 5.1 3.5
18 5.7 3.8
19 5.1 3.8
Notice that for indexing loc
and iloc
behaves differently.
Datatype
Checking the data types. While importing, pandas might not correctly set the datatype for your data. You can check the datatype by:
df.dtypes
sepal_len float64
sepal_width float64
petal_len float64
petal_width float64
species object
dtype: object
We can change a datatype of a column using astype
:
df["sepal_len"] = df["sepal_len"].astype(int)
df.dtypes
sepal_len int64
sepal_width float64
petal_len float64
petal_width float64
species object
dtype: object
Text manipulation
Say, I want to remove the "Iris-" prefix from the species names:
df["species"] = df["species"].str.replace("Iris-", "")
Remove leading or trailing spaces:
df["species"] = df["species"].str.strip()
Say, I want to make the names uppercase or lowercase or title formatted:
df["species"] = df["species"].str.title()
# df["species"] = df["species"].str.upper()
# df["species"] = df["species"].str.lower()
df.head()
sepal_len sepal_width petal_len petal_width species
0 5.1 3.5 1.4 0.2 Setosa
1 4.9 3.0 1.4 0.2 Setosa
2 4.7 3.2 1.3 0.2 Setosa
3 4.6 3.1 1.5 0.2 Setosa
4 5.0 3.6 1.4 0.2 Setosa
We create a new pandas series to make some more data manipulation:
names = pd.Series(['Albert Einstein', 'Richard Feynman', 'Issac Newton'])
We want to reorder the first and last names:
names = names.str.split(" ")
names = pd.Series([i[::-1] for i in names])
names = [" ".join(i) for i in names]
print(names)
['Einstein Albert', 'Feynman Richard', 'Newton Issac']
Customizing pandas
pd.set_options('max_rows', 5)
pd.set_options('max_columns', 4)
Notebooks
If you have issues with viewing the notebooks in web browser, please clone or download the repository and open the notebooks in Jupyter notebook.
If you are interested in advanced use of pandas, please check out my machine learning tutorial and notebooks.