Handling duplicated data in Python pandas dataframes
Use the duplicated() and drop_duplicates() methods effectively
When working with datasets, identifying duplicate records is crucial for data quality.
The dataset below is used throughout this post.
import pandas as pd
# Sample data with duplicate rows
data = {'name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'Alice'],
        'age': [25, 30, 25, 35, 0, 25]}
df = pd.DataFrame(data)
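For reference, printing `df` shows the frame below; the row labels 0 to 5 are what the outputs in the rest of the post refer to.
```
print(df)
```
**Output**
```
      name  age
0    Alice   25
1      Bob   30
2    Alice   25
3  Charlie   35
4      Bob    0
5    Alice   25
```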
Finding Duplicated Rows
Use `duplicated()` to flag duplicate rows. By default it leaves the first occurrence unflagged and marks every later occurrence, so filtering with the mask shows only the repeats.
# Find duplicate rows (the first occurrence is not flagged)
duplicates = df[df.duplicated()]
print(duplicates)
**Output**
```
    name  age
2  Alice   25
5  Alice   25
```
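If you only need the number of duplicate rows rather than the rows themselves, summing the boolean mask is a common idiom (shown here on the same sample `df`):
```
# Count duplicate rows: True values in the mask sum to the duplicate count
print(df.duplicated().sum())  # 2 (rows 2 and 5)
```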
`keep` Parameter
Use `duplicated()` with `keep=False` to show all duplicate occurrences.
# Find all duplicate rows
duplicates = df[df.duplicated(keep=False)]
print(duplicates)
**Output**
```
    name  age
0  Alice   25
2  Alice   25
5  Alice   25
```
The method's `keep` option accepts `'first'`, `'last'`, and `False`, and defaults to `'first'`. Since `duplicated()` returns a boolean mask, it can be more intuitive to read `keep` as "drop" instead: `keep='first'` drops the first occurrence from the mask (marks it `False`) rather than 'keeping' it.
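As a quick sketch of the three settings on the sample `df`, the masks differ only in which occurrence of the repeated Alice row is left unflagged:
```
# Boolean masks for each keep setting
print(df.duplicated(keep='first').tolist())  # [False, False, True, False, False, True]
print(df.duplicated(keep='last').tolist())   # [True, False, True, False, False, False]
print(df.duplicated(keep=False).tolist())    # [True, False, True, False, False, True]
```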
`subset` Parameter
Use `duplicated()` with `subset=['x_column']` to flag rows that share the same values in `x_column`.
# Find all duplicate rows in 'age' column
duplicates = df[df.duplicated(keep=False, subset=['age'])]
print(duplicates)
**Output**
```
    name  age
0  Alice   25
2  Alice   25
5  Alice   25
```
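For contrast, restricting the check to the `name` column also flags both Bob rows, even though their ages differ:
```
# Duplicates judged on 'name' alone
duplicates = df[df.duplicated(keep=False, subset=['name'])]
print(duplicates)
```
**Output**
```
    name  age
0  Alice   25
1    Bob   30
2  Alice   25
4    Bob    0
5  Alice   25
```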
Removing Duplicates
To remove duplicates, use `drop_duplicates()`. By default it drops rows whose values match across all columns, keeping the first occurrence of each.
# Remove duplicates
df_clean = df.drop_duplicates()
print(df_clean)
**Output**
```
      name  age
0    Alice   25
1      Bob   30
3  Charlie   35
4      Bob    0
```
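This is equivalent to filtering with the negated mask from `duplicated()`; a minimal sketch:
```
# drop_duplicates() keeps exactly the rows duplicated() does not flag
df_clean = df[~df.duplicated()]
print(df_clean.equals(df.drop_duplicates()))  # True
```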
`subset` Parameter
Use `subset` to remove duplicates based on specific columns while keeping the first occurrence.
# Remove duplicates based on 'name' column
df_clean = df.drop_duplicates(subset=['name'])
print(df_clean)
**Output**
```
      name  age
0    Alice   25
1      Bob   30
3  Charlie   35
```
`keep` Parameter
The `keep` parameter applies here as well. Use `keep=False` if you wish to remove every row that has a duplicate, leaving only rows that were unique to begin with.
# Remove every row whose 'name' appears more than once
df_clean = df.drop_duplicates(subset=['name'], keep=False)
print(df_clean)
**Output**
```
      name  age
3  Charlie   35
```
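As a final variation on the same sample `df`: `keep='last'` retains the last occurrence of each name, and chaining `reset_index(drop=True)` (optional, just a common follow-up) renumbers the surviving rows.
```
# Keep the last occurrence of each name and renumber the index
df_clean = df.drop_duplicates(subset=['name'], keep='last').reset_index(drop=True)
print(df_clean)
```
**Output**
```
      name  age
0  Charlie   35
1      Bob    0
2    Alice   25
```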
End