Welcome to another Kaizen week!
Duplicate records are a common problem in growing CRM databases. The same person fills out a web form, attends a trade show, and gets manually added by a rep. Three records, same email, slightly different data each time. Some users in our forum have described spending days merging records manually after a large import. Others have asked whether the merge can be triggered automatically.
In this post, we will build a button-triggered deduplication solution using the
COQL API, the
Merge Records API, and
Deluge Functions. While we use
Leads module in the sample implementation, the same approach works for Contacts, Accounts, Vendors, and custom modules with tweaks.
The solution will:
- Scan the Leads module for records sharing the same email address.
- Score each duplicate group and elect the most complete record as master.
- Carry over the best available field values from duplicates into the master via the merge payload.
- Merge all duplicates into the master in batches.
- Return a summary of groups processed, succeeded, failed, and skipped.
Doesn't Zoho CRM already handle duplicates?
Yes, and it works well for most cases. There are two built-in tools.
- Find and Merge to identify and merge duplicate records from the CRM UI.
- Deduplicate Records performs a bulk duplicate scan across a module and allows users to review and merge duplicate groups.
Both tools cover Leads, Contacts, Accounts, Vendors, Deals, and Custom modules.
The limitation both tools share is that they are UI-driven and admin-initiated. Neither can trigger automatically when duplicates arrive through imports, integrations, or API calls. And neither supports custom logic for deciding which record to keep, or which field values to carry forward when the master has a blank field that a duplicate has filled in.
This is not intended to replace Find and Merge or Deduplicate Records. If those tools meet your requirements, they remain the simplest option. This implementation is intended for scenarios that require automation, custom master-selection logic, or field arbitration. It provides a reusable framework for duplicate discovery, master selection, field arbitration, and merge execution.
The Merge Records API merges up to three records (one master and up to two children) into a single record. The child records are deleted after the merge operation. All related data including Notes, Activities, Emails, and Attachments transfers to the master.
Endpoint: POST {api-domain}/crm/v8/{module_api_name}/{master-record-id}/actions/merge
Supported modules: Leads, Contacts, Accounts, Deals (Potentials), Vendors, and Custom modules.
Scope required:
ZohoCRM.modules.{module_name}.WRITE
or
ZohoCRM.modules.{module_name}.ALL
Things to know before building:
- By default, the master record's field values are retained. To use a value from a child record for a specific field, name it in the child's _fields array in the request body.
- The data array accepts a maximum of two child records per call. For groups with more duplicates, you merge in batches of two.
- If a child has more than 1,000 records in any related list, the merge is scheduled as an async job. The API returns a SCHEDULED response with a job_id. Use the Get Merge Status API to track completion.
- Records that are locked, in an active Approval process, or Closed Deals return NOT_ALLOWED and cannot be merged.
The use case
Zylker Technologies generates leads from a website form, LinkedIn campaigns, trade show badge scans, and manual rep entries. The Email field was not set as unique, so the same prospect accumulated multiple records across these channels, each with slightly different data. The web form captures email but not phone. The trade show scan has a direct-dial number but an abbreviated company name. The LinkedIn entry has a title but no phone.
The team eventually made the Email field unique to prevent new duplicates from being created. However, a significant backlog of duplicate records already existed in the system. For a backlog of hundreds of groups arriving from multiple automated sources, Deduplicate Records was not practical. They needed a repeatable cleanup process that could apply the same master-selection and field-arbitration rules across every duplicate group.
Prerequisites
- A Zoho CRM account with access to Custom Functions and Connections.
- A Zoho OAuth connection named crm_oauth_connection with the scopes:
- ZohoCRM.modules.leads.ALL
- ZohoCRM.coql.READ
Setting up the connection:
- Go to Setup > Developer Hub > Connections
- Click Create Connection and choose Zoho CRM
- Name it crm_oauth_connection (must match the name used in the code)
- Add the scopes above and click Create and Connect.
Note: If this connection expires, the functions will return an UnAuthenticated Connection error. Reauthorize it from the Connections page to resolve this.
Solution overview
Three functions work together.
- button.cleanupAllDuplicates(): the entry point, triggered by the list view button. Calls the other two functions and returns a summary.
- standalone.findDuplicateEmails(): scans the Leads module and returns all email addresses that appear on more than one record.
- standalone.cleanupDuplicateGroup(emailValue): processes a single duplicate group. Fetches the records, elects a master, arbitrates field values, and calls the Merge Records API.
Step 1: Create the list view button
Go to Setup > Modules > Leads > Buttons and create a custom button:
- Label: Clean Up Duplicates
- Where to show: List View
- Action: Custom Function > cleanupAllDuplicates
Step 2: button.cleanupAllDuplicates()
This is the entry point, the function wired to the button. When clicked, it calls findDuplicateEmails() to get the list of emails that appear on more than one Lead record. It then loops through each email, calling cleanupDuplicateGroup() to process that group. At the end, it returns a summary map with the count of groups found, processed, succeeded, failed, and skipped.
A typical result looks like this:
{
"Status": "COMPLETED",
"Duplicate_Groups_Found": 37,
"Duplicate_Groups_Processed": 37,
"Success_Groups": 37,
"Failed_Groups": 0,
"Skipped_Groups": 0
}
Step 3: standalone.findDuplicateEmails()
This function uses
zoho.crm.getRecords method to fetch all Lead records, 200 per page, and builds an in-memory map of email <> count. It then returns the emails with a count greater than one.
The response from getRecords is a list of record maps. For each record, the function reads the
Email field. If the email is already in the map, the count is incremented. If not, it is added with a count of 1. After all pages are processed, only emails with a count greater than 1 are collected into a list and returned as a pipe-separated string, for example:
john@acme.com|jane@corp.com|bob@example.comThe loop stops as soon as a page returns no records, so it only makes as many API calls as there are pages of data. For an org with 160 leads, this is two calls. The loop iterates through a maximum of 50 pages (50 × 200 = 10,000 records). If your org has more than 10,000 leads, records beyond page 50 are never scanned and those duplicates will be missed. See
Adapting for your org size for how to handle this.
Converted leads are excluded from the count. A converted lead cannot be merged via the Leads endpoint and would cause a NOT_ALLOWED error in the merge step.
Step 4: standalone.cleanupDuplicateGroup(emailValue)
This function processes one email group from start to finish.
Fetching the records
A COQL query fetches all leads with the given email, ordered by Created_Time ASC:
select id, Created_Time, Company, Phone, Website, Email from Leads where Email = '<emailValue>' order by Created_Time asc limit 200
Ordering by created time matters for the tiebreak in master selection.
Electing the master
Each record is scored: +10 for Company, +10 for Phone, +10 for Website. The highest scorer becomes the master. On a tie, the oldest record wins. It appears first because of ORDER BY Created_Time ASC, and the loop only replaces the master when it finds a strictly higher score.
This scoring logic is specific to this implementation. You can change it based on what "most complete" means for your org. For example, weighting
Annual_Revenue or
Lead_Score, or simply always preferring the oldest or most recently modified record. See
Customizing the master selection and arbitration logic for examples.
Field arbitration and building the merge payload
The Merge Records API retains the master's field values by default. To pull a specific field value from a child record instead, include that field in the child's _fields array in the request body.
The function first scans all records and tracks which child ID holds the best Phone and Website values when the master is missing them:
phoneSourceId = "";
websiteSourceId = "";
if(ifnull(masterRecord.get("Phone"), "") == "")
{
for each record in records
{
if(record.get("id") != masterId &&
ifnull(record.get("Phone"), "") != "")
{
phoneSourceId = record.get("id");
break;
}
}
}
// same for Website
When building each batch's payload, the function checks whether any child in that batch is one of those source records and adds _fields accordingly:
for each childId in batch
{
childMap = Map();
childMap.put("id", childId);
fieldsToTake = List();
if(childId == phoneSourceId)
{
fieldMap = Map();
fieldMap.put("api_name", "Phone");
fieldsToTake.add(fieldMap);
}
if(childId == websiteSourceId)
{
fieldMap = Map();
fieldMap.put("api_name", "Website");
fieldsToTake.add(fieldMap);
}
if(fieldsToTake.size() > 0) { childMap.put("_fields", fieldsToTake); }
dataList.add(childMap);
}
This produces different payloads depending on the group. When no arbitration is needed (the master already has Phone and Website), the payload is:
{
"merge": [{
"data": [
{ "id": "child_id_1" },
{ "id": "child_id_2" }
]
}]
}
When a child holds a value the master is missing:
{
"merge": [{
"data": [
{
"id": "child_id_1",
"_fields": [
{ "api_name": "Phone" }
]
},
{ "id": "child_id_2" }
]
}]
}
When Phone and Website come from two different children in the same batch:
{
"merge": [{
"data": [
{ "id": "child_id_1", "_fields": [{ "api_name": "Phone" }] },
{ "id": "child_id_2", "_fields": [{ "api_name": "Website" }] }
]
}]
}
The Merge API applies those field values from the child to the master. Once the batch containing the source child is processed, the master has the arbitrated values. Subsequent batches need no _fields at all.
If phoneSourceId and websiteSourceId are both empty because the master already has both values, none of the _fields checks fire and every batch uses the plain payload.
Batching
The Merge API accepts a maximum of two child records per call. Groups with more than two children are batched accordingly, each batch being a separate API call to the same master ID. If a batch fails, groupStatus is set to FAILED and the reason is recorded. The loop continues to process remaining batches. Groups with more than 100 records sharing the same email are skipped for manual review. In our experience, groups of this size are often associated with import or integration issues and may warrant additional validation before merging.
API calls made during a run
Before running this on a large dataset, it helps to have a rough sense of how many API calls will be made. Using the following variables:
R = total number of Lead records in the module
G = number of duplicate email groups found
D = average number of duplicate records per group
The approximate call count is:
Scan calls = ceil(R / 200) + 1
COQL calls = G
Merge calls = G × ceil((D − 1) / 2)
Total API calls ≈ ceil(R / 200) + 1 + G + G × ceil((D − 1) / 2) assuming each duplicate group contains approximately D records.
The scan phase reads 200 records per page and stops on the first empty page. Each duplicate group makes one COQL call to fetch its records, then one merge call per two children. The merge calls dominate for large groups. A group of 17 records, for example, needs 8 merge calls.
Validating with Zylker's data:
We tested the solution across three runs as Zylker's Leads module grew with each import.
|
Run
|
R (leads)
|
G (groups)
|
Avg. D
|
Scan calls
|
COQL calls
|
Merge calls
|
Total calls
|
Time
|
|
1
|
160
|
10
|
17
|
2
|
10
|
80
|
92
|
48s
|
|
2
|
201
|
37
|
~2.4
|
3
|
37
|
26
|
66
|
68s
|
|
3
|
521
|
160
|
~3.2
|
4
|
160
|
242
|
406
|
156s
|
All three runs were completed with zero failures. Execution time depends on API response latency and group sizes and will vary across runs.
Is this production ready?
This implementation works well for orgs with up to a few thousand leads and a manageable number of duplicate groups. For larger datasets, the following limitations apply.
Large datasets
The
findDuplicateEmails() function caps at 50 pages (10,000 records). If your org exceeds this, records beyond page 50 are silently missed and those duplicate groups will never be processed. The button also runs synchronously, so a large number of duplicate groups will eventually hit the function timeout, terminating the run mid-way with no resume mechanism. See A
dapting for your org size for how to address both of these.
Async merges
The Merge API returns SCHEDULED (with status: "success") when a child has more than 1,000 related records. The current code counts this as a success. The master is locked while the background job runs, so clicking the button again before it completes will attempt to re-merge the same children and return NOT_ALLOWED. For orgs where Leads have large activity histories, check explicitly for code == "SCHEDULED", stop processing that group, and poll the
Get Merge Status API before retrying.
Locked or in-approval records
These return NOT_ALLOWED and appear in the Failed_Groups count. Unlock them manually and re-run.
Partial success within a group
If a group fails mid-way (for example, a batch succeeds but the next batch returns NOT_ALLOWED), the children already merged are not rolled back. They are permanently merged into the master. The remaining unmerged children still exist in CRM.
Re-running the button after resolving the underlying issue is safe for most failure types. The COQL query fetches only records that currently exist. Already-merged children are gone, so the function finds only the remaining unmerged children and processes them from there.
The exception is an async merge (SCHEDULED response). In that case the children are not yet deleted and the master is locked. Re-running before the async job completes will return NOT_ALLOWED for that group. Wait for the async merge to complete before re-running.
Run timing
Do not run during an active import. The COQL scan reads the module at that moment. If records are being created simultaneously, a group may be partially scanned and incompletely merged.
Adapting for your org size
Over 10,000 leads
Increase the page limit in the findDuplicateEmails() loop. The current ceiling of 50 pages covers 10,000 records. Extend the loop range to cover your actual record count. For example, 60 pages for 12,000 leads. Alternatively, move to the scheduled architecture described below, which processes records in batches and has no ceiling.
Too many groups (function times out)
Button-triggered functions run synchronously. As the number of duplicate groups grows, execution time may become a constraint. In such cases, consider moving to a scheduler-based architecture:
- Create a custom module called Dedup Queue with fields for the duplicate email address, processing status (Pending / Completed / Failed / Skipped), master record ID, duplicate count, merged count, and failure reason if any.
- Initiate the process with a function that scans for duplicate groups and writes one record per group to the Dedup Queue module with Status: Pending.
- A function scheduled every 15 minutes picks up a small batch of pending records, processes them, and updates each status to Completed, Failed, or Skipped.
- Once all records in the queue are resolved, send a summary email to the admin with the details.
Customizing the master selection and arbitration logic
The current scoring: +10 for Company, +10 for Phone, +10 for Website. Ties go to the oldest record.
To always prefer the oldest record: remove the scoring loop and use records.get(0) directly. The COQL query already orders by Created_Time ASC.
To prefer the most recently modified record: change order by Created_Time asc to order by Modified_Time desc in the COQL query and use records.get(0).
To weight a field higher, for example to favor records from a specific lead source:
if(ifnull(record.get("Lead_Source"), "") == "Cold Call") { score = score + 15; }
To arbitrate additional fields: the current code covers Phone and Website. To also carry over Annual_Revenue, add a source ID variable in the arbitration phase:
revenueSourceId = "";
if(ifnull(masterRecord.get("Annual_Revenue"), "") == "")
{
for each record in records
{
if(record.get("id") != masterId &&
ifnull(record.get("Annual_Revenue"), "") != "")
{
revenueSourceId = record.get("id");
break;
}
}
}
Then add the field check inside the payload building loop:
if(childId == revenueSourceId)
{
fieldMap = Map();
fieldMap.put("api_name", "Annual_Revenue");
fieldsToTake.add(fieldMap);
}
The Merge API will then carry Annual_Revenue from that child onto the master as part of the merge call itself.
Extending to other modules
Change the module name in the COQL query and the merge endpoint URL. For Accounts, use No_of_Employees, Industry, and Annual_Revenue as scoring fields. For custom modules, the match field does not have to be Email. Use whichever unique identifier your module has, such as a License Number or Order ID.
Preventing duplicates going forward
- Make Email a unique field: go to Setup > Modules > Leads > Fields, open the Email field, and enable Unique. Any record with a duplicate email, whether from the UI, an import, or an API call, will be rejected.
- Make Phone a unique field: the same option is available on Phone, useful if leads arrive without email addresses.
- Configure via API: the Duplicate Check Preferences API lets you enable and configure uniqueness checks programmatically, useful for org setup scripts.
- Use Upsert instead of Insert: for integrations creating leads via API, switch to the Upsert Records API with duplicate_check_fields set to Email. An existing record with that email gets updated instead of a new one being created.
Conclusion
In this Kaizen, we built a reusable deduplication framework using COQL, Functions, and the Merge Records API. Along the way, we looked at duplicate discovery, custom master-selection logic, field arbitration, merge execution, performance considerations, and approaches for adapting the solution to different data volumes.
The implementation presented here is intended for historical cleanup scenarios where duplicate groups can be processed within a single function execution. The actual volume that can be processed in one run depends on factors such as the number of duplicate groups, the size of those groups, API response times, and the number of merge operations required. For larger datasets, a queued and scheduled architecture is generally a better fit, allowing duplicate groups to be processed in smaller batches while providing better visibility into progress and error handling.
Have you implemented a different strategy for duplicate management in your CRM? Are there additional master-selection or field-arbitration rules that you'd like to see covered?Would you like to see a follow-up Kaizen covering a production-scale implementation using scheduled processing, queue management, progress tracking, and email notifications?
As always, we'd love to hear your feedback. Let us know in the comments below.