Deduplicate

You can remove duplicate records from your data using the Deduplicate transform. This can be done in two ways: row-wise and column-wise.

Row-wise

This method removes rows with duplicate data, allowing only unique rows to be present in your dataset.

To apply row-wise deduplication:

1. Click the Transform menu, click Deduplicate, and then select Row-wise.

2. You can choose to ignore case and whitespace while removing duplicate rows.

Note: The sample dataset may not contain any duplicates. You can still apply the rule so that duplicate rows are removed when the entire dataset is processed during export.

3. A live preview will be shown with the duplicate rows highlighted in red.



4. Click Remove duplicates
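
As an illustration of what row-wise deduplication does, here is a minimal Python sketch. It is not DataPrep's implementation. The function name dedupe_rows and its parameters are hypothetical, and two behaviours are assumptions: the first occurrence of each duplicate is the one kept, and "ignore whitespace" means ignoring leading and trailing whitespace only.

```python
def dedupe_rows(rows, ignore_case=False, ignore_whitespace=False):
    """Drop rows that duplicate an earlier row (hypothetical helper)."""

    def normalize(value):
        text = str(value)
        if ignore_whitespace:
            # Assumption: only leading/trailing whitespace is ignored.
            text = text.strip()
        if ignore_case:
            text = text.casefold()
        return text

    seen = set()
    unique = []
    for row in rows:
        # Rows are compared on their normalized values.
        key = tuple(normalize(value) for value in row)
        if key not in seen:
            seen.add(key)
            unique.append(row)  # keep the original, un-normalized row
    return unique


rows = [
    ("Asia", "Laptop"),
    ("asia ", "laptop"),
    ("Europe", "Laptop"),
    ("Asia", "Laptop"),
]

exact = dedupe_rows(rows)
relaxed = dedupe_rows(rows, ignore_case=True, ignore_whitespace=True)
```

With exact matching, only the fourth row is dropped. With case and whitespace ignored, the second row is also treated as a duplicate of the first.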

Column-wise

You can also deduplicate based on one or more selected columns. The Deduplicate > Column-wise transform removes rows based on duplicate values in the selected columns.

In other words, select the columns (for example, Region, Address, and Product) that contain repeated entries down the column. Column-wise deduplication removes the rows that have the same entries in the selected columns.
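
The idea can be sketched in a few lines of Python. This is an illustration, not DataPrep's own logic. The helper dedupe_by_columns and the sample records are hypothetical. The sketch assumes that rows count as duplicates when the values in all selected columns match, and that the first row of each group is kept.

```python
def dedupe_by_columns(records, columns):
    """Keep the first record for each combination of values in `columns`."""
    seen = set()
    kept = []
    for record in records:
        # Only the selected columns decide whether a row is a duplicate.
        key = tuple(record[column] for column in columns)
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept


records = [
    {"Region": "East", "Address": "12 Main St", "Product": "Pen", "Qty": 5},
    {"Region": "East", "Address": "12 Main St", "Product": "Pen", "Qty": 9},
    {"Region": "West", "Address": "12 Main St", "Product": "Pen", "Qty": 5},
]

kept = dedupe_by_columns(records, ["Region", "Address", "Product"])
```

The second record is removed even though its Qty differs, because Qty is not one of the selected columns.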

To apply column-wise deduplication:

1. Click the Transform menu, click Deduplicate, and then select Column-wise.

2. You can choose to ignore case and whitespace to find duplicates. 

3. Choose one of two methods to deduplicate your dataset based on the selected columns: Automatic deduplication or Manual conditions.

4. When you choose the Automatic deduplication method, DataPrep deduplicates your data for you based on the columns you've selected.



5. When you choose the Manual conditions method, you enter the conditions and expressions that make up the 'if' statements. You can then select which rows to keep or remove within each duplicate cluster when the condition is true.



6. The following table lists the available If conditions for each data type. Click here to learn more about data types.

| Text | Numeric | Datetime | Duration | Boolean | List | Map |
|---|---|---|---|---|---|---|
| contains | = equal to | = equal to | is smallest | is true | has value | has key |
| doesn't contain | != not equal to | != not equal to | is largest | is false | is empty list | is empty map |
| begins with | > more than | is earliest | = equal to | contains | is not empty list | is not empty map |
| ends with | < less than | is latest | != not equal to | doesn't contain | is cell empty | is cell empty |
| is | >= more than or equal to | is after | is cell empty | begins with | is cell not empty | is cell not empty |
| is not | <= less than or equal to | is before | is cell not empty | ends with | use regex | use regex |
| is cell empty | is smallest | on or after | use regex | is | use patterns | use patterns |
| is cell not empty | is largest | on or before | use patterns | is not | | |
| use regex | is cell empty | is cell empty | | is cell empty | | |
| use patterns | is cell not empty | is cell not empty | | is cell not empty | | |
| | use regex | use regex | | use regex | | |
| | use patterns | use patterns | | use patterns | | |


7. You can add more conditions with the AND and OR operators to deduplicate using a combination of conditions.

For example, you can write a condition such as "If the mail column contains zoho.com, keep those rows", which is entered as:
If mail contains zoho.com

8. With the Advanced option, you can insert functions and provide conditions to remove duplicates.



9. Click the Preview button to see which rows will be removed during the transformation.

10. You can also select multiple columns for deduplication using (+) in Columns to de-duplicate.
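
The Manual conditions method in steps 5 and 7 can be pictured with a short Python sketch. It is a hedged illustration, not DataPrep's implementation. The helper dedupe_with_condition, the sample records, and the plan column are hypothetical. The sketch also assumes that a duplicate cluster in which no row satisfies the condition is left unchanged. The source does not state what DataPrep does in that case.

```python
from collections import defaultdict


def dedupe_with_condition(records, columns, keep_if):
    """Within each duplicate cluster, keep only rows where keep_if is true."""
    # Group row positions into clusters that share values in `columns`.
    clusters = defaultdict(list)
    for index, record in enumerate(records):
        key = tuple(record[column] for column in columns)
        clusters[key].append(index)

    keep = set()
    for indexes in clusters.values():
        if len(indexes) == 1:
            keep.update(indexes)  # a single row is not a duplicate cluster
            continue
        matching = [i for i in indexes if keep_if(records[i])]
        # Assumption: if no row matches, the cluster is left unchanged.
        keep.update(matching or indexes)
    return [record for i, record in enumerate(records) if i in keep]


records = [
    {"name": "Ann", "mail": "ann@zoho.com", "plan": "paid"},
    {"name": "Ann", "mail": "ann@example.com", "plan": "paid"},
    {"name": "Bob", "mail": "bob@example.com", "plan": "free"},
    {"name": "Cy", "mail": "cy@zoho.com", "plan": "free"},
    {"name": "Cy", "mail": "cy@zoho.com", "plan": "paid"},
]

# "If mail contains zoho.com, keep those rows"
by_mail = dedupe_with_condition(
    records, ["name"], lambda r: "zoho.com" in r["mail"]
)

# Two conditions joined with AND
combined = dedupe_with_condition(
    records,
    ["name"],
    lambda r: "zoho.com" in r["mail"] and r["plan"] == "paid",
)
```

With the single condition, both Cy rows survive because both satisfy it. Adding the second condition with AND narrows that cluster to one row.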