Handling duplicated data in Python pandas dataframes
Use the duplicated() and drop_duplicates() methods effectively
When working with datasets, identifying duplicate records is crucial for data quality.
The examples throughout this post use the following dataset.
import pandas as pd

# Sample data with duplicates
data = {'name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'Alice'],
        'age': [25, 30, 25, 35, 0, 25]}
df = pd.DataFrame(data)

Finding Duplicated Rows
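By default, duplicated() returns a boolean mask that marks each repeated row as True, except for its first occurrence. Indexing the DataFrame with this mask therefore shows only the later occurrences.
# Find duplicate rows (every occurrence after the first)
duplicates = df[df.duplicated()]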
print(duplicates)
**Output**
```
    name  age
2  Alice   25
5  Alice   25
```

`keep` Parameter
Use duplicated() with keep=False to show every occurrence of a duplicated row, including the first.
# Find every occurrence of each duplicated row
duplicates = df[df.duplicated(keep=False)]
print(duplicates)
**Output**
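```
    name  age
0  Alice   25
2  Alice   25
5  Alice   25
```

The keep parameter accepts 'first', 'last', or False, and defaults to 'first'. Because duplicated() returns a boolean mask in which True marks a duplicate, keep names the occurrence that is left unmarked. When you filter with df[df.duplicated()], that occurrence is the one missing from the result, so in this context it can be more intuitive to read keep as "drop".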
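To make the three keep options concrete, the sketch below prints the mask that duplicated() returns for each of them on the same sample data. It also checks that negating the default mask selects the same rows as drop_duplicates(), which is what the outputs in this post suggest. The names masks and kept_by_mask are illustrative only.

```python
import pandas as pd

# Same sample data as the rest of the post
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'Alice'],
    'age': [25, 30, 25, 35, 0, 25],
})

# One boolean mask per keep option; True marks a row flagged as a duplicate
masks = {keep: df.duplicated(keep=keep).tolist()
         for keep in ('first', 'last', False)}

for keep, mask in masks.items():
    print(f"keep={keep!r}: {mask}")

# Negating the default mask selects the rows that were NOT flagged
kept_by_mask = df[~df.duplicated()]
print(kept_by_mask.equals(df.drop_duplicates()))
```

With this data, keep='first' flags rows 2 and 5, keep='last' flags rows 0 and 2, and keep=False flags rows 0, 2, and 5. The final line prints True.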
`subset` Parameter
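Pass subset=['x_column'] to duplicated() to compare rows on that column only. Rows then count as duplicates whenever their values in 'x_column' match, regardless of the other columns.
# Find every row whose 'age' value appears more than once
duplicates = df[df.duplicated(keep=False, subset=['age'])]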
print(duplicates)
**Output**
```
    name  age
0  Alice   25
2  Alice   25
5  Alice   25
```

Removing Duplicates
To remove duplicates, use drop_duplicates(). By default it treats rows as duplicates only when they match in every column, and it keeps the first occurrence of each.
# Remove duplicates
df_clean = df.drop_duplicates()
print(df_clean)
**Output**
```
      name  age
0    Alice   25
1      Bob   30
3  Charlie   35
4      Bob    0
```

`subset` Parameter
Use subset to remove duplicates based on specific columns while keeping the first occurrence.
# Remove duplicates based on the 'name' column
df_clean = df.drop_duplicates(subset=['name'])
print(df_clean)
**Output**
```
      name  age
0    Alice   25
1      Bob   30
3  Charlie   35
```

`keep` Parameter
The keep parameter works the same way here. Use keep=False to drop every row that has a duplicate, leaving only the rows whose values are unique.
# Remove every row whose 'name' appears more than once
df_clean = df.drop_duplicates(subset=['name'], keep=False)
print(df_clean)
**Output**
```
      name  age
3  Charlie   35
```
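The remaining option, keep='last', can be sketched on the same sample data. With subset=['name'] it should retain the final row for each name, so Bob's surviving row is the one with age 0 rather than 30. The name df_last is only illustrative.

```python
import pandas as pd

# Same sample data as the rest of the post
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'Alice'],
    'age': [25, 30, 25, 35, 0, 25],
})

# Keep the last occurrence of each name instead of the first
df_last = df.drop_duplicates(subset=['name'], keep='last')
print(df_last)
```

On this data the surviving index labels are 3, 4, and 5, because drop_duplicates() preserves the original row order and index labels of the rows it keeps.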
