How to dedupe or remove duplicate records in your data?

Deduplicate




You can remove duplicate records from your data using the Deduplicate transform. This can be done in two ways: row-wise and column-wise.

Row-wise

This method removes rows with duplicate data, allowing only unique rows to be present in your dataset.
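The idea behind row-wise deduplication can be sketched in a few lines of Python. This is only an illustration of the concept, not how DataPrep implements it: the function name `dedupe_rows` and the sample rows are hypothetical, and the sketch assumes the first occurrence of each row is the one that is kept.

```python
def dedupe_rows(rows):
    """Keep the first occurrence of each distinct row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # the whole row is the comparison key
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical sample data: the third row is an exact copy of the first.
rows = [
    ["Asia", "Pen", 10],
    ["Europe", "Ink", 5],
    ["Asia", "Pen", 10],
]
unique_rows = dedupe_rows(rows)
print(unique_rows)
```

Only rows that match in every column are treated as duplicates here, which is what distinguishes the row-wise method from the column-wise one.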

To apply row-wise deduplication:

1. Click the Transform menu, click Deduplicate, then select Row-wise

2. When removing duplicate rows, you can enable these options to refine how duplicates are identified:

Ignore case - This option treats uppercase and lowercase characters as the same, so values that differ only in case are considered duplicates.

Ignore whitespace - This option ignores leading spaces, trailing spaces, and multiple spaces between words, so values that differ only in spacing are considered duplicates.

Flag duplicate records - This option flags duplicate records in a new column instead of removing them immediately. For records that have no duplicates, the 'Duplicate Flag' column remains empty. Both the master and the duplicate records are preserved, so you can review, validate, or apply transforms before deciding what to keep or discard. You can filter out the duplicates later using the newly added column.




Note: The sample dataset may not contain any duplicates. You can still apply the rule, and duplicate rows will be removed when the entire dataset is processed during export.

3. A live preview will be shown with the duplicate rows highlighted in red.



4. Click Remove duplicates
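The three options in step 2 can be pictured with the following Python sketch. It is a rough model under stated assumptions, not DataPrep's actual logic: the function names, the flag value "Duplicate", and the choice of the first occurrence as the master record are all hypothetical.

```python
import re


def normalize(value, ignore_case=False, ignore_whitespace=False):
    """Return the form of a value that is used for comparison only."""
    if not isinstance(value, str):
        return value
    if ignore_whitespace:
        # Collapse runs of whitespace and drop leading/trailing spaces.
        value = re.sub(r"\s+", " ", value).strip()
    if ignore_case:
        value = value.casefold()
    return value


def flag_duplicate_rows(rows, ignore_case=False, ignore_whitespace=False):
    """Append a flag column instead of removing rows.

    Assumption: the first occurrence is the master (empty flag) and
    later matches are marked with the hypothetical value "Duplicate".
    """
    seen = set()
    flagged = []
    for row in rows:
        key = tuple(normalize(v, ignore_case, ignore_whitespace) for v in row)
        flag = "Duplicate" if key in seen else ""
        seen.add(key)
        flagged.append(list(row) + [flag])  # original values are untouched
    return flagged


rows = [
    ["Asia", "Blue Pen"],
    ["asia", "  blue   pen "],
    ["Europe", "Ink"],
]
flagged = flag_duplicate_rows(rows, ignore_case=True, ignore_whitespace=True)
strict = flag_duplicate_rows(rows)
```

With both options enabled, the second row is flagged as a duplicate of the first. With both disabled, no row is flagged, because the values differ in case and spacing.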

Column-wise

You can also deduplicate based on one or more selected columns. Use the Deduplicate > Column-wise transform to remove rows based on duplicate values in the selected columns.

In other words, select the columns you want to compare (for example, Region, Address, and Product). Column-wise deduplication removes the rows whose values in the selected columns repeat those of another row.
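As a minimal sketch, column-wise deduplication can be thought of as comparing only the selected columns. The function name `dedupe_by_columns` and the sample records are hypothetical. The sketch assumes that the selected columns are compared in combination and that the first row of each group is kept, which may differ from what the product's automatic method chooses.

```python
def dedupe_by_columns(records, columns):
    """Keep the first record for each distinct combination of `columns`."""
    seen = set()
    kept = []
    for record in records:
        key = tuple(record[c] for c in columns)  # only the selected columns
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept


# Hypothetical sample data: the first two rows share Region and Product
# but differ in Units, so they are duplicates only column-wise.
records = [
    {"Region": "Asia", "Product": "Pen", "Units": 10},
    {"Region": "Asia", "Product": "Pen", "Units": 25},
    {"Region": "Europe", "Product": "Pen", "Units": 7},
]
kept = dedupe_by_columns(records, ["Region", "Product"])
print(kept)
```

Note that a row-wise comparison would keep all three rows, because no two rows match in every column.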

To apply column-wise deduplication:

1. Click the Transform menu, click Deduplicate, then select Column-wise.

2. When removing duplicate rows, you can enable these options to refine how duplicates are identified:

Ignore case - This option treats uppercase and lowercase characters as the same, so values that differ only in case are considered duplicates.

Ignore whitespace - This option ignores leading spaces, trailing spaces, and multiple spaces between words, so values that differ only in spacing are considered duplicates.

Flag duplicate records - This option flags duplicate records in a new column instead of removing them immediately. For records that have no duplicates, the 'Duplicate Flag' column remains empty. Both the master and the duplicate records are preserved, so you can review, validate, or apply transforms before deciding what to keep or discard. You can filter out the duplicates later using the newly added column.




3. Choose one of the two methods to dedupe your dataset based on the selected columns: Automatic deduplication or Manual conditions.

4. When you choose the Automatic deduplication method, DataPrep deduplicates your data for you based on the columns you've selected.



5. When you choose the Manual conditions method, you enter the conditions and expressions to construct the 'if' statements. You can then select which rows to keep or remove within each duplicate cluster when the condition is true.
 

6. The following table lists the available If conditions for all the data types. Click here to know more about data types.

| Text | Numeric | Datetime | Duration | Boolean | List | Map |
| --- | --- | --- | --- | --- | --- | --- |
| contains | = equal to | = equal to | is smallest | is true | has value | has key |
| doesn't contain | != not equal to | != not equal to | is largest | is false | is empty list | is empty map |
| begins with | > more than | is earliest | = equal to | contains | is not empty list | is not empty map |
| ends with | < less than | is latest | != not equal to | doesn't contain | is empty | is empty |
| is | >= more than or equal to | is after | is empty | begins with | is not empty | is not empty |
| is not | <= less than or equal to | is before | is not empty | ends with | use regex | use regex |
| is empty | is smallest | on or after | use regex | is | use patterns | use patterns |
| is not empty | is largest | on or before | use patterns | is not | | |
| use regex | is empty | is empty | | is empty | | |
| use patterns | is not empty | is not empty | | is not empty | | |
| | use regex | use regex | | use regex | | |
| | use patterns | use patterns | | use patterns | | |


7. You can also keep adding more conditions using the AND and OR operators to apply deduplication using a combination of conditions.

Idea
For example, you can write a condition such as "If the mail column contains zoho.com, keep those rows." That is, under Enter conditions to select which rows to keep, specify: if mail contains zoho.com.

8. With the Advanced option, you can insert functions and provide conditions to remove duplicates. 



9. Click the Preview button to see which rows will be removed during the transformation.

10. You can also select multiple columns for deduplication using (+) in Columns to de-duplicate. 
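The Manual conditions method described in the steps above can be sketched as follows. This is a conceptual model, not the product's implementation: the function name `dedupe_with_condition`, the sample records, and the behavior when no row in a cluster satisfies the condition (the cluster is left untouched) are all assumptions made for illustration.

```python
from collections import defaultdict


def dedupe_with_condition(records, columns, keep_if):
    """Within each duplicate cluster, keep only rows where `keep_if` is true."""
    # Group row positions into clusters by the selected columns.
    clusters = defaultdict(list)
    for index, record in enumerate(records):
        clusters[tuple(record[c] for c in columns)].append(index)

    keep = set()
    for indexes in clusters.values():
        if len(indexes) == 1:
            keep.update(indexes)  # a single row is not a duplicate cluster
            continue
        matching = [i for i in indexes if keep_if(records[i])]
        # Assumption: if no row satisfies the condition, keep the cluster as is.
        keep.update(matching or indexes)
    return [r for i, r in enumerate(records) if i in keep]


# Hypothetical sample data with one duplicate cluster on "name".
records = [
    {"name": "Ana", "mail": "ana@example.com"},
    {"name": "Ana", "mail": "ana@zoho.com"},
    {"name": "Raj", "mail": "raj@example.com"},
]

# "If the mail column contains zoho.com, keep those rows."
kept = dedupe_with_condition(
    records, ["name"], lambda r: "zoho.com" in r["mail"]
)
print(kept)
```

Combining conditions with AND or OR corresponds to joining tests with Python's `and` or `or` inside the `keep_if` function.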

FAQs

1. Why does the number of duplicates decrease each time I apply the row-wise Deduplicate transform with the Flag duplicate records option?

Each time you apply the deduplicate transform with the flag option, a new column is added to mark duplicate records. When you apply it again, this flag column is also checked for duplicates. Because all flagged records now have the same flag value, they are grouped together. So the number of duplicates keeps reducing each time you apply it.

Example:
First time → 5 records → 4 duplicates, 1 master
Second time → 3 duplicates, 1 master
Third time → 2 duplicates, 1 master

To avoid this, make sure you filter out and delete the duplicate records before applying the transform again.
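The shrinking counts in the example can be reproduced with a small simulation. It is a sketch under assumptions: each pass appends a new flag column, the first occurrence in a group is the master (empty flag), and repeats get the hypothetical value "Duplicate". The function name `flag_pass` is likewise made up for illustration.

```python
def flag_pass(rows):
    """Append a flag column: '' for a first occurrence, 'Duplicate' for repeats."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(row)  # includes flag columns added by earlier passes
        out.append(list(row) + ["Duplicate" if key in seen else ""])
        seen.add(key)
    return out


# Five identical records, flagged three times in a row without cleaning up.
rows = [["Asia", "Pen"] for _ in range(5)]
duplicates_per_pass = []
for _ in range(3):
    rows = flag_pass(rows)
    duplicates_per_pass.append(sum(1 for row in rows if row[-1] == "Duplicate"))

print(duplicates_per_pass)  # [4, 3, 2]
```

After the first pass, the master row differs from the flagged rows in its flag column, so the next pass no longer groups it with them. Each further pass splits off one more row in the same way, which is why the count drops by one each time.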
