Cluster and Merge






Zoho DataPrep lets you perform fuzzy matching using the Cluster & Merge transform. The transform finds values that express the same thing in different ways and lets you replace them with a single value of your choice.

For example, suppose the country names U.S., U.S.A., and USA all appear in your data. They all refer to the same country, so you can replace all of these variations with one term, USA.

The transform can also be used to correct spelling errors and keep the values in a column consistent. This is particularly helpful when cleaning and preparing data collected from multiple data sources.

To perform Cluster & Merge

1. Right-click the column and select the Cluster & Merge option from the context menu.

2. Choose one of the following language model algorithms to find clusters in your data:
  1. Metaphone: Groups words by pronunciation. This is the default algorithm used to find clusters.
  2. Fingerprint: Used to catch spelling errors in your column data and resolve text mismatches.
  3. n-gram: Also used to catch spelling errors and resolve text mismatches. The 'N' value is the number of consecutive words in each sequence used to find clusters. For example, 'Zoho' is a 1-gram and 'Zoho Corporation' is a 2-gram.

3. The Transform panel shows each recognized cluster as a card.

4. Select the items to be replaced using the checkboxes.

5. Use the text box in each card to enter a new value. This value replaces the selected items in the column.


Note:
i) You can also use the Copy and fill option to fill the text box with the required value.

ii) You can also add a new value to the cluster manually using the Add new data option.



Info: You can deselect a cluster card using the checkbox at the top (bookmarked in blue).
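To make the idea behind clustering and merging concrete, here is a minimal Python sketch of fingerprint-style keying followed by a merge. It illustrates the general technique only and is not Zoho DataPrep's implementation: the function names `fingerprint`, `find_clusters`, and `merge`, the normalization rules, and the sample column are all assumptions made for this example.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a normalized key (assumed rules): lowercase, strip accents
    and punctuation, then sort the unique whitespace-separated tokens."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def find_clusters(values):
    """Group distinct values that share a key; keep groups with 2+ variants."""
    groups = defaultdict(list)
    for value in dict.fromkeys(values):  # distinct values, original order
        groups[fingerprint(value)].append(value)
    return [members for members in groups.values() if len(members) > 1]


def merge(values, selected, replacement):
    """Replace every selected variant with the chosen replacement value."""
    chosen = set(selected)
    return [replacement if v in chosen else v for v in values]


# Hypothetical column data, for illustration only.
column = ["Zoho Corporation", "zoho corporation", "Corporation, Zoho",
          "U.S.A.", "USA", "India"]
clusters = find_clusters(column)
cleaned = merge(column, clusters[0], "Zoho Corporation")
```

Each entry in `clusters` plays the role of one cluster card, and the call to `merge` corresponds to selecting the items in a card and entering a new value. Under these assumed rules, U.S. would not land in the same cluster as U.S.A. and USA, because its key differs; a pronunciation-based method such as metaphone groups values on a different basis.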

To apply filters

You can apply filters along with this transform using the Filters tab.

1. Click the Filters tab.

2. Click the   icon and add the required columns in the Filters section. You can also reorder the filters using the drag-and-drop method.



3. For every column added, you can select one of the following options from the drop-down:
  1. Actual: Filters rows based on the actual values in the column.
  2. Data quality: Filters rows based on the quality of data in the column.
  3. Patterns: Filters rows based on the data patterns in the selected column.
  4. Outliers: Filters rows based on the outliers present in the data of the selected column.
  Note: The filter options displayed depend on the data type of the column added for the filter.

4. When you add more than one filter to the Filters section, a logical operator, AND or OR, appears next to the filters. Click the operator to toggle it between AND and OR.
  1. Using the logical operators, you can combine the conditions and control the order of precedence. The final expression is displayed in the Criteria expression box. Click Edit to alter the default expression, using logical operators and parentheses to specify which condition should be evaluated first. Click Save after making the required changes.
  2. For example, in the expression ((1 OR 2) AND (3 OR 4)), the condition (1 OR 2) is evaluated first and the condition (3 OR 4) is evaluated next. Because the two are joined by the AND operator, the filter is applied only when both conditions are true.
5. You can further drill down to choose specific values based on the filter option selected for each filter in the next section.



For example, in the above screenshot, the Data quality option is selected for the All columns filter in the Filters section. Based on the selection, further options to filter specific values are displayed in the All columns (Data quality) section.

6. You can choose to include or exclude the selected items in the last section.

7. To remove all the filters, use the Clear button.

8. A live preview of the filter transform is shown as you make changes. 

9. Click the Apply button to apply the transform along with the filters.
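The criteria expression described in step 4 can be illustrated with a small Python sketch that evaluates an expression such as ((1 OR 2) AND (3 OR 4)) against the outcome of each numbered filter for one row. This is an illustration under stated assumptions, not Zoho DataPrep's evaluator: the function `evaluate_criteria` and the `outcomes` mapping are hypothetical, and the sketch assumes the conventional rule that AND binds tighter than OR when no parentheses are given.

```python
import re


def evaluate_criteria(expression, outcomes):
    """Evaluate e.g. '((1 OR 2) AND (3 OR 4))' where outcomes maps each
    filter number to True/False for the row being tested."""
    tokens = re.findall(r"\d+|AND|OR|[()]", expression)
    pos = 0

    def parse_or():
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos] == "OR":
            pos += 1
            rhs = parse_and()
            value = value or rhs
        return value

    def parse_and():
        nonlocal pos
        value = parse_atom()
        while pos < len(tokens) and tokens[pos] == "AND":
            pos += 1
            rhs = parse_atom()
            value = value and rhs
        return value

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            # Parenthesized groups are evaluated as a unit.
            value = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if token.isdigit():
            return outcomes[int(token)]
        raise ValueError(f"unexpected token: {token}")

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


# Hypothetical per-row outcomes for filters 1 to 4.
row_kept = evaluate_criteria("((1 OR 2) AND (3 OR 4))",
                             {1: True, 2: False, 3: False, 4: True})
row_dropped = evaluate_criteria("((1 OR 2) AND (3 OR 4))",
                                {1: True, 2: True, 3: False, 4: False})
```

In the first call both parenthesized groups are true, so the row passes the filter; in the second, (3 OR 4) is false, so the AND fails and the row does not pass.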



Limitation

Zoho DataPrep can identify a maximum of 300 clusters from your data.
