Deduplicate

You can remove duplicate records from your data using the Deduplicate transform. This can be done in two ways: row-wise and column-wise.

Row-wise

This method removes rows with duplicate data, allowing only unique rows to be present in your dataset.

To apply row-wise deduplication:

1. Click the Transform menu, click Deduplicate, then select Row-wise.

2. When removing duplicate rows, you can enable these options to refine how duplicates are identified:

Ignore case - This option treats uppercase and lowercase characters as the same, so values that differ only in case are considered duplicates.

Ignore whitespace - This option ignores leading spaces, trailing spaces, and multiple spaces between words, so values that differ only in spacing are considered duplicates.

Flag duplicate records - This option flags duplicate records in a new column instead of removing them immediately. For records that have no duplicates, the 'Duplicate Flag' column remains empty. Both the master and duplicate records are preserved, so you can review, validate, or apply transforms before deciding what to keep or discard, and later filter out the duplicates using the new column.

Note: The sample dataset may not contain any duplicates. You can still apply the rule; duplicate rows are removed when the entire dataset is processed during export.

3. A live preview will be shown with the duplicate rows highlighted in red.

4. Click Remove duplicates.
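
The effect of the options in step 2 can be sketched in a few lines of Python. This is only an illustration of the idea, not how DataPrep implements it: the function names, the "Duplicate" flag value, and the choice of the first occurrence as the master record are all assumptions made for the example.

```python
import re

def normalize(value, ignore_case=False, ignore_whitespace=False):
    """Return the form of a cell value that is used for comparison only."""
    text = str(value)
    if ignore_whitespace:
        # Drop leading/trailing whitespace and collapse repeated whitespace.
        text = re.sub(r"\s+", " ", text.strip())
    if ignore_case:
        text = text.casefold()
    return text

def flag_duplicate_rows(rows, ignore_case=False, ignore_whitespace=False):
    """Return one flag per row: "" for a first occurrence, "Duplicate" for repeats.

    Assumption: the first occurrence is treated as the master record.
    """
    seen = set()
    flags = []
    for row in rows:
        key = tuple(normalize(v, ignore_case, ignore_whitespace) for v in row)
        flags.append("Duplicate" if key in seen else "")
        seen.add(key)
    return flags

rows = [
    ("Alice", "New York"),
    ("alice", "New  York "),   # differs only in case and spacing
    ("Bob", "Chennai"),
    ("Alice", "New York"),     # exact repeat of the first row
]

strict_flags = flag_duplicate_rows(rows)
relaxed_flags = flag_duplicate_rows(rows, ignore_case=True, ignore_whitespace=True)

# Removing duplicates is then a filter on the flag.
unique_rows = [row for row, flag in zip(rows, relaxed_flags) if not flag]
```

With both options off, only the exact repeat is flagged. With both options on, the second row is flagged as well, because it matches the first row once case and spacing are ignored.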

Column-wise

You can also deduplicate based on one or more selected columns. The Deduplicate > Column-wise transform removes rows based on duplicate values present in the selected columns.

In other words, select the columns (for example, Region, Address, Product) that contain repeated entries. Column-wise deduplication removes the rows that have the same entries in all of the selected columns, even if their other columns differ.
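
As a rough sketch of the difference from row-wise deduplication, the Python below compares rows only on the selected columns. The function name, the sample data, and the rule of keeping the first row in each group are assumptions made for illustration; they are not taken from DataPrep.

```python
def dedupe_by_columns(records, columns):
    """Keep the first record for each distinct combination of values in `columns`."""
    seen = set()
    kept = []
    for record in records:
        key = tuple(record[c] for c in columns)
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept

records = [
    {"Order": 1, "Region": "East", "Product": "Pen"},
    {"Order": 2, "Region": "East", "Product": "Pen"},  # same Region and Product as order 1
    {"Order": 3, "Region": "West", "Product": "Pen"},
    {"Order": 4, "Region": "East", "Product": "Ink"},
]

kept = dedupe_by_columns(records, ["Region", "Product"])
kept_orders = [r["Order"] for r in kept]
```

Here every row is unique as a whole because the Order values differ, so a row-wise comparison would remove nothing. Comparing on Region and Product alone treats order 2 as a duplicate of order 1.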

To apply column-wise deduplication:

1. Click the Transform menu, click Deduplicate, then select Column-wise.

2. When removing duplicate rows, you can enable these options to refine how duplicates are identified:

Ignore case - This option treats uppercase and lowercase characters as the same, so values that differ only in case are considered duplicates.

Ignore whitespace - This option ignores leading spaces, trailing spaces, and multiple spaces between words, so values that differ only in spacing are considered duplicates.

Flag duplicate records - This option flags duplicate records in a new column instead of removing them immediately. For records that have no duplicates, the 'Duplicate Flag' column remains empty. Both the master and duplicate records are preserved, so you can review, validate, or apply transforms before deciding what to keep or discard, and later filter out the duplicates using the new column.

3. You can choose one of the two methods to dedupe your dataset based on the selected column: Automatic deduplication or Manual conditions.

4. When you choose the Automatic deduplication method, DataPrep deduplicates your data for you based on the columns you've selected.

5. When you choose the Manual conditions method, you need to enter the conditions and expressions and construct the 'if' statements. You can then select which rows to keep or remove within each duplicate cluster if the condition is true.

6. The following table lists the available If conditions for all the data types. Click here to know more about data types.

Text	Numeric	Datetime	Duration	Boolean	List	Map
contains	= equal to	= equal to	is smallest	is true	has value	has key
doesn't contain	!= not equal to	!= not equal to	is largest	is false	is empty list	is empty map
begins with	> more than	is earliest	= equal to	contains	is not empty list	is not empty map
ends with	< less than	is latest	!= not equal to	doesn't contain	is empty	is empty
is	>= more than or equal to	is after	is empty	begins with	is not empty	is not empty
is not	<= less than or equal to	is before	is not empty	ends with	use regex	use regex
is empty	is smallest	on or after	use regex	is	use patterns	use patterns
is not empty	is largest	on or before	use patterns	is not
use regex	is empty	is empty		is empty
use patterns	is not empty	is not empty		is not empty
	use regex	use regex		use regex
	use patterns	use patterns		use patterns

7. You can also keep adding more conditions using the AND and OR operators to apply deduplication using a combination of conditions.

For example, to keep only the rows whose mail column contains zoho.com, set the option to keep rows and enter the condition: if mail contains zoho.com.

8. With the Advanced option, you can insert functions and provide conditions to remove duplicates.

9. Click the Preview button to see which rows will be removed during the transformation.

10. You can also select multiple columns for deduplication using (+) in Columns to de-duplicate.
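
The Manual conditions method can be pictured as grouping rows into duplicate clusters and then applying the 'if' condition inside each cluster. The Python below is a minimal sketch under stated assumptions: the function name and sample data are hypothetical, and the fallback of keeping the first row when no row in a cluster satisfies the condition is a choice made for this example, not documented DataPrep behavior.

```python
from collections import defaultdict

def dedupe_with_condition(records, columns, keep_if):
    """Within each duplicate cluster, keep only the rows for which keep_if is true."""
    # Group row positions by their values in the selected columns.
    clusters = defaultdict(list)
    for index, record in enumerate(records):
        clusters[tuple(record[c] for c in columns)].append(index)

    keep = set()
    for indexes in clusters.values():
        if len(indexes) == 1:
            keep.update(indexes)  # a single row is not a duplicate cluster
            continue
        matching = [i for i in indexes if keep_if(records[i])]
        # Assumption: if no row satisfies the condition, keep the first row.
        keep.update(matching or indexes[:1])
    return [r for i, r in enumerate(records) if i in keep]

records = [
    {"name": "Asha", "mail": "asha@example.com"},
    {"name": "Asha", "mail": "asha@zoho.com"},
    {"name": "Ravi", "mail": "ravi@example.com"},
    {"name": "Ravi", "mail": "ravi@example.org"},
    {"name": "Mina", "mail": "mina@example.com"},
]

# Condition: keep rows if mail contains zoho.com.
result = dedupe_with_condition(records, ["name"], lambda r: "zoho.com" in r["mail"])
kept_mails = [r["mail"] for r in result]
```

Conditions joined with AND or OR correspond to combining tests inside the `keep_if` function with Python's `and` and `or`.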

FAQs

1. Why does the number of duplicates decrease each time I apply the row-wise Deduplicate transform with the Flag duplicate records option?

Each time you apply the Deduplicate transform with the flag option, a new column is added to mark duplicate records. When you apply it again, this flag column is compared along with the other columns. The master record, whose flag is empty, no longer matches the flagged records. The flagged records all share the same flag value, so they are grouped together and one of them is treated as the new master. So the number of duplicates keeps reducing each time you apply it.

Example:

First time → 5 records → 4 duplicates, 1 master

Second time → 3 duplicates, 1 master

Third time → 2 duplicates, 1 master

To avoid this, make sure you filter out and delete the duplicate records before applying the transform again.
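
The counts in the example above can be reproduced with a small simulation. This is a sketch under assumptions, not DataPrep's implementation: the "Duplicate" flag value and the choice of the first occurrence as the master are invented for illustration. Each pass appends a new flag cell to every row, and the next pass compares the rows including all earlier flag cells.

```python
def flag_duplicates(rows):
    """Append a flag cell to each row: "" for a first occurrence, "Duplicate" for repeats."""
    seen = set()
    flagged = []
    for row in rows:
        flagged.append(row + ("Duplicate" if row in seen else "",))
        seen.add(row)
    return flagged

# Five identical records to start with.
rows = [("Alice", "New York")] * 5

counts = []
for _ in range(3):
    rows = flag_duplicates(rows)
    # Count the rows flagged in this pass (the flag is the last cell).
    counts.append(sum(1 for row in rows if row[-1] == "Duplicate"))
```

After three passes, `counts` holds 4, 3, and 2, which matches the example: on each pass the previous master drops out of the group because its earlier flag cell is empty, and one of the remaining flagged rows becomes the new master.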
