Handling duplicated data in Python pandas dataframes
Use the duplicated() and drop_duplicates() methods effectively
When working with datasets, identifying duplicate records is crucial for data quality.
The dataset below is used throughout this post.
import pandas as pd
# Sample data with duplicate rows
data = {'name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'Alice'],
        'age': [25, 30, 25, 35, 0, 25]}
df = pd.DataFrame(data)
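For reference, printing `df` shows the frame below; the row labels 0 to 5 are what the outputs in the rest of the post refer to.
```
print(df)
```
**Output**
```
      name  age
0    Alice   25
1      Bob   30
2    Alice   25
3  Charlie   35
4      Bob    0
5    Alice   25
```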
Finding Duplicated Rows
Use `duplicated()` to flag duplicate rows. By default it leaves the first occurrence unflagged and marks every later occurrence, so filtering with the mask shows only the repeats.
# Find duplicate rows (the first occurrence is not flagged)
duplicates = df[df.duplicated()]
print(duplicates)
**Output**
```
    name  age
2  Alice   25
5  Alice   25
```
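If you only need the number of duplicate rows rather than the rows themselves, summing the boolean mask is a common idiom (shown here on the same sample `df`):
```
# Count duplicate rows: True values in the mask sum to the duplicate count
print(df.duplicated().sum())  # 2 (rows 2 and 5)
```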
`keep` Parameter
Use `duplicated()` with `keep=False` to show all duplicate occurrences.
# Find all duplicate rows
duplicates = df[df.duplicated(keep=False)]
print(duplicates)
**Output**
```
    name  age
0  Alice   25
2  Alice   25
5  Alice   25
```
The method's `keep` option accepts `'first'`, `'last'`, and `False`, and defaults to `'first'`. Since `duplicated()` returns a boolean mask, it can be more intuitive to read `keep` as "drop" instead: `keep='first'` drops the first occurrence from the mask (marks it `False`) rather than 'keeping' it.
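As a quick sketch of the three settings on the sample `df`, the masks differ only in which occurrence of the repeated Alice row is left unflagged:
```
# Boolean masks for each keep setting
print(df.duplicated(keep='first').tolist())  # [False, False, True, False, False, True]
print(df.duplicated(keep='last').tolist())   # [True, False, True, False, False, False]
print(df.duplicated(keep=False).tolist())    # [True, False, True, False, False, True]
```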
`subset` Parameter
Use `duplicated()` with `subset=['x_column']` to flag rows that share the same values in `x_column`.
# Find all duplicate rows in 'age' column
duplicates = df[df.duplicated(keep=False, subset=['age'])]
print(duplicates)
**Output**
```
    name  age
0  Alice   25
2  Alice   25
5  Alice   25
```
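For contrast, restricting the check to the `name` column also flags both Bob rows, even though their ages differ:
```
# Duplicates judged on 'name' alone
duplicates = df[df.duplicated(keep=False, subset=['name'])]
print(duplicates)
```
**Output**
```
    name  age
0  Alice   25
1    Bob   30
2  Alice   25
4    Bob    0
5  Alice   25
```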
Removing Duplicates
To remove duplicates, use `drop_duplicates()`. By default it drops rows whose values match across all columns, keeping the first occurrence of each.
# Remove duplicates
df_clean = df.drop_duplicates()
print(df_clean)
**Output**
```
      name  age
0    Alice   25
1      Bob   30
3  Charlie   35
4      Bob    0
```
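This is equivalent to filtering with the negated mask from `duplicated()`; a minimal sketch:
```
# drop_duplicates() keeps exactly the rows duplicated() does not flag
df_clean = df[~df.duplicated()]
print(df_clean.equals(df.drop_duplicates()))  # True
```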
`subset` Parameter
Use `subset` to remove duplicates based on specific columns while keeping the first occurrence.
# Remove duplicates based on 'name' column
df_clean = df.drop_duplicates(subset=['name'])
print(df_clean)
**Output**
```
      name  age
0    Alice   25
1      Bob   30
3  Charlie   35
```
`keep` Parameter
The `keep` parameter applies here as well. Use `keep=False` if you wish to remove every row that has a duplicate, leaving only rows that were unique to begin with.
# Remove every row whose 'name' appears more than once
df_clean = df.drop_duplicates(subset=['name'], keep=False)
print(df_clean)
**Output**
```
      name  age
3  Charlie   35
```
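As a final variation on the same sample `df`: `keep='last'` retains the last occurrence of each name, and chaining `reset_index(drop=True)` (optional, just a common follow-up) renumbers the surviving rows.
```
# Keep the last occurrence of each name and renumber the index
df_clean = df.drop_duplicates(subset=['name'], keep='last').reset_index(drop=True)
print(df_clean)
```
**Output**
```
      name  age
0  Charlie   35
1      Bob    0
2    Alice   25
```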
End