What is Anomaly Detection?
Anomaly detection, also known as outlier detection, is the process of identifying records or data points that significantly deviate from the expected pattern or behavior in a dataset. Detecting anomalies helps users identify potential data quality issues, unusual business trends, operational risks, or unexpected spikes and drops in performance.
Zoho DataPrep provides anomaly detection capabilities using statistical techniques such as IQR (Interquartile Range), Z-score, and Percentile-based methods to identify unusual values in the data. These methods can be applied based on the nature and distribution of the dataset, helping users clean, monitor, and analyze data more effectively before downstream processing and analytics.
1. Right-click a column and select the Anomaly Detection option from the context menu.
Note: The Anomaly Detection transform supports all data types except Text, List, Map, Boolean, URL, and Email.
2. Select the Model that you want to use to detect outliers. The available models are grouped into two categories — Statistical models (IQR, Z-score, and Percentile) and Custom range.
Statistical models
IQR - The interquartile range (IQR) is a measure of statistical dispersion that describes the range within which the middle 50% of a data set falls. The Outliers are calculated based on this range.
Learn more
ZScore - The Z-score determines how far a data point is from the mean of the dataset in terms of standard deviations. By default, a data point will be marked as an outlier if the Z-score is more than 3 standard deviations.
Learn more
Percentile - A percentile is a statistical measure that represents the relative position of a particular value within a dataset. Percentiles divide the data into 100 equal parts, and the value at each percentile indicates the percentage of values that are equal to or below a given value. Outliers will be determined based on the percentile value.
Learn more
Custom range
In this method, the upper bound and the lower bound are any particular values from the data. The outliers are calculated based on the thresholds.
Learn more
Info: Range fields support a maximum numeric precision of 38 and scale of 15. Values exceeding this limit will automatically be set to null
3. Provide the required details and click Apply to detect the outliers.
If you choose the IQR model
What is IQR?
The interquartile range (IQR) is a measure of statistical dispersion that describes the range within which the middle 50% of a data set falls. The Outliers are calculated based on this range.
The IQR is calculated as the difference between the third quartile (Q3) and the first quartile (Q1). The default scale value of 1.5 is multiplied by the IQR to determine the upper and lower boundaries (thresholds). The threshold values can be adjusted based on the distribution of the data.
For data with a skewed distribution, the adjusted IQR method will be applied.
1. Under Threshold, enter the values for the Lower limit and Upper limit. The threshold defines the boundary beyond which a value is considered an outlier. The scale value is multiplied by the IQR to determine these boundaries.
Note: The allowed range is 0-2
2. Select the Apply adjusted IQR method when data is skewed check box to automatically apply the adjusted IQR method for skewed data. This checkbox is selected by default. Uncheck it if you want to apply the standard IQR method regardless of skewness.
3. Select the Flag anomaly check box if you want to flag records with anomalies in a new column. For records with anomalies, the Anomaly Flag column will be marked. For records with no anomalies, the column will remain empty. You can use this column to review or filter anomalous records before deciding what action to take.
4. Click Apply.
If you choose the Z-score model
What is Z-score model?
The Z-score determines how far a data point is from the mean of the dataset in terms of standard deviations. By default, a data point will be marked as an outlier if the Z-score is more than 3 standard deviations.
A Z-score of ±3 is often set as a threshold for detecting the outliers. This means that data points falling beyond 3 standard deviations from the mean in either direction (to the left or right) are considered outliers. The thresholds can be adjusted based on the distribution of the data.
For data with a skewed distribution, the modified Z-score method will be applied. The Z-score is calculated based on the Median Absolute Deviation method.
1. Under Number of standard deviations, enter the lower and upper limits. The default values are -3 and +3. Data points that fall beyond these limits will be marked as outliers.
2. Select the Apply modified Z-score method when data is skewed check box to automatically apply the modified Z-score method for skewed data. The modified Z-score is calculated from the median absolute deviation (MAD) multiplied by a constant value of 0.675. Uncheck this option if you want to ignore skewness.
3. Select the Flag anomaly check box if you want to flag records with anomalies in a new column. For records with anomalies, the Anomaly Flag column will be marked. For records with no anomalies, the column will remain empty. You can use this column to review or filter anomalous records before deciding what action to take.
4. Click Apply.
If you choose the Percentile model
What is Percentile model?
A percentile is a statistical measure that represents the relative position of a particular value within a dataset. Percentiles divide the data into 100 equal parts, and the value at each percentile indicates the percentage of values that are equal to or below a given value. Outliers will be determined based on the percentile value.
The 10th and 90th percentiles are set as the default values for the lower and upper limits respectively.
1. Enter the lower and upper percentile limits. Values below the lower limit and above the upper limit will be treated as outliers. For example, a lower limit of 10 means values below the 10th percentile are flagged, and an upper limit of 90 means values above the 90th percentile are flagged.
Note: The allowed values range is 1 to 99. The percentile cannot be decimal value.
2. Select the Flag anomaly check box if you want to flag records with anomalies in a new column. For records with anomalies, the Anomaly Flag column will be marked. For records with no anomalies, the column will remain empty. You can use this column to review or filter anomalous records before deciding what action to take.
If you choose the Custom range option
In this method, the upper bound and the lower bound are any particular values from the data. The outliers are calculated based on the thresholds.
Info: Range fields support a maximum numeric precision of 38 and scale of 15. Values exceeding this limit will automatically be set to null
1. Enter the lower and upper limit values for the custom range.
Note: Please enter absolute values for the lower and upper limits, and ensure the value provided for the lower limit is less than the one provided for the upper limit.
Info: For date columns, the lower and upper limit values displayed in the rule will appear as Unix timestamp values instead of the date format you entered.
2. Provide the required input and click Apply.
3. Select the Flag anomaly check box if you want to flag records with anomalies in a new column. For records with anomalies, the Anomaly Flag column will be marked. For records with no anomalies, the column will remain empty. You can use this column to review or filter anomalous records before deciding what action to take.