Dataset operations consist of combining and transforming datasets to suit your needs. You can perform the following operations on a dataset.
- Deduplicate
- Join and Append
- Pivot and Unpivot
- Formula column
- Derive dataset
- Mark as personal data
All the dataset operations are available in the Transform menu in the top bar.
Duplicate records from your data can be removed using the Deduplicate transformation. This can be done in two ways: row-wise and column-wise.
Row-wise deduplication removes rows containing duplicate data, so that only unique rows remain in your dataset.
To apply row-wise deduplication:
1. Click the Transform menu, click Deduplicate, then select Row-wise.
2. You can choose to ignore case and whitespace while removing duplicate rows.
Note: There could be instances where no duplicates are found in the sample dataset. You can still apply the rule; duplicate rows will be removed when the entire dataset is processed during export.
3. A live preview will be shown with the duplicate rows highlighted in red.
4. Click Remove duplicates.
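DataPrep performs this inside the UI, but the effect of row-wise deduplication with the "ignore case and whitespace" option can be sketched in pandas (the sample data and column names here are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data with a near-duplicate row.
df = pd.DataFrame({
    "name": ["Amy", "amy ", "Bob"],
    "city": ["Austin", "austin", "Boston"],
})

# Normalize case and whitespace before comparing, mirroring the
# "ignore case and whitespace" option, then keep only unique rows.
normalized = df.apply(lambda col: col.str.strip().str.lower())
deduped = df.loc[~normalized.duplicated()]
print(deduped)
```

Note that the original (non-normalized) values are kept in the output; normalization is used only to decide which rows count as duplicates.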
You can also select single or multiple columns and choose to Deduplicate. The operation removes rows based on duplicate values present in the selected columns.
To apply column-wise deduplication:
1. Click the Transform menu, click Deduplicate, then select Column-wise.
2. You can choose to ignore case and whitespace to find duplicates.
3. You can choose one of the two methods to deduplicate your dataset based on the selected column: Automatic deduplication or Manual conditions.
4. When you choose the Automatic deduplication method, DataPrep automatically deduplicates your data based on the columns you've selected.
5. When you choose the Manual conditions method, you need to enter conditions and expressions to construct 'if' statements. You can then select which rows to keep, or remove, within each duplicate cluster when the condition is true.
6. The following table lists available conditions for the if statement.
- contains
- doesn't contain
- begins with
- ends with
- is
- is not
- is empty
- is not empty
- = equal to
- > more than
- < less than
- >= more than or equal to
- <= less than or equal to
Note: You cannot use conditions on columns with array or map data types.
7. You can also keep adding more conditions using the AND and OR operators to apply deduplication using a combination of conditions.
For example, you can write a condition such as "If the mail column contains zoho.com, keep those rows", i.e.,
If mail contains zoho.com
8. With the Advanced option, you can insert functions and provide conditions to remove duplicates.
9. A live preview shows which rows will be removed during the transformation.
10. You can also select multiple columns for deduplication using the (+) icon in Columns to de-duplicate.
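As a rough sketch of what column-wise deduplication with a manual condition does, the following pandas code groups rows into duplicate clusters by a selected column and keeps the rows matching a condition (the data, column names, and fallback behavior here are illustrative assumptions, not DataPrep's actual engine):

```python
import pandas as pd

# Hypothetical data: customer_id determines the duplicate clusters.
df = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "mail": ["a@zoho.com", "a@gmail.com", "b@gmail.com"],
})

# Manual condition: "If mail contains zoho.com, keep those rows".
keep_condition = df["mail"].str.contains("zoho.com")

def pick(group):
    # Within each duplicate cluster, prefer rows matching the condition;
    # fall back to the first row when none match (an assumed tie-break).
    matches = group[keep_condition.loc[group.index]]
    return matches.iloc[[0]] if len(matches) else group.iloc[[0]]

deduped = df.groupby("customer_id", group_keys=False).apply(pick)
print(deduped)
```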
You can join two datasets together using common columns.
For example, consider a dataset which has the purchase details of customers and another dataset with their contact information. These datasets can be combined using a common column such as the customer ID to identify and match the records from both the datasets.
DataPrep offers all four types of joins: inner join, left join, right join, and outer join.
DataPrep automatically calculates and shows the Join potential between the current dataset and the other datasets from your project. The join potential is shown in percentage, making it easier for you to choose a dataset for the Join operation.
To perform a join operation:
1. Click the Transform menu, click Combine, then select Join.
2. Choose whether to create a new dataset with the result of the join operation or update the current dataset with the result.
3. Select the dataset that you want to join your current dataset with.
4. Choose the type of join using the Join type option. You can also change the type of join later from the Join type menu in the bottom-left of the operations bar.
5. If you choose to create a new dataset, enter a name for the new dataset in the New dataset name box in the operations bar after the join pop up closes.
6. Click the Preview button and DataPrep will show a live preview of the join operation along with Join info that displays useful stats.
7. Click Join to join the two datasets.
8. You've now successfully joined the two datasets.
9. If you wish to view the join configuration, you can open the Data source configuration from the Ruleset pane.
10. You can also click Edit join to update the join configuration.
Joins are typically used to combine two different datasets on common columns. Let's look at each type of join with an example.
An inner join combines two datasets using common columns and discards all the unmatched rows.
The inner join is best represented as:
Let us take the example of the following datasets.
Order ID is the common column here. Inner join has combined the rows of the two datasets that shared values in the Order ID column and discarded the unmatched values from both Orders and Customers datasets.
A left join combines the current dataset with another dataset using common columns, and discards all unmatched rows from the other dataset.
Left join can be represented as:
Let's take the example of the following datasets:
Left join has combined the rows of the two datasets that shared values in the Order ID column. The unmatched values were only discarded from the Customers dataset.
A right join combines the current dataset with another dataset using common columns and discards all unmatched rows from the current dataset.
Right join can be represented as:
Let's take the example of the following datasets:
Right join has combined the rows of the two datasets that shared values in the Order ID column. The unmatched values were only discarded from the Orders dataset.
Outer join uses common columns to combine two datasets, including the unmatched rows from both the datasets. An outer join can be represented as:
Let's take the example of the following datasets.
Outer join has combined the rows of the two datasets that shared values in the Order ID column. The unmatched values are kept from both Orders and Customers datasets.
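The four join types above can be sketched in pandas with two small made-up datasets sharing an Order ID column (the values are illustrative, not from the examples in this article):

```python
import pandas as pd

# Hypothetical Orders and Customers datasets with a common Order ID column.
orders = pd.DataFrame({"Order ID": [1, 2, 3], "Amount": [10, 20, 30]})
customers = pd.DataFrame({"Order ID": [2, 3, 4], "Name": ["Bo", "Cy", "Di"]})

# Inner: only matched Order IDs (2, 3).
inner = orders.merge(customers, on="Order ID", how="inner")
# Left: all of orders; unmatched customers rows dropped.
left = orders.merge(customers, on="Order ID", how="left")
# Right: all of customers; unmatched orders rows dropped.
right = orders.merge(customers, on="Order ID", how="right")
# Outer: unmatched rows kept from both datasets.
outer = orders.merge(customers, on="Order ID", how="outer")
```

Columns with no match on the other side are filled with blanks (NaN) in the left, right, and outer results.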
In DataPrep, you can append one dataset to another in two ways. For example, if you need to include the 2015 survey data with your current 2016 dataset, you can append the incoming dataset to your current dataset directly, or create a new dataset from the result of appending the two.
To append two datasets:
1. Click the Transform menu, click Combine, then select Append.
2. Select the dataset to append. Once the datasets are selected, click the Preview button and DataPrep will show a live preview of the operation.
3. It also compares the column names of the two datasets and shows the unmatched columns so you can choose to keep or delete them. For the matched ones, incoming data is added directly below the columns that have matched.
4. In the new dataset, you can open the data source settings from the Ruleset and edit the Append configuration.
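The append behavior described above, where matched columns line up and unmatched columns can be kept or dropped, resembles concatenation in pandas (the survey data below is a made-up example):

```python
import pandas as pd

# Hypothetical 2016 and 2015 survey datasets.
survey_2016 = pd.DataFrame({"respondent": ["r1", "r2"], "score": [4, 5]})
survey_2015 = pd.DataFrame({"respondent": ["r3"], "score": [3],
                            "region": ["west"]})

# Matched columns (respondent, score) line up; the unmatched "region"
# column is kept and filled with NaN for the 2016 rows. Drop it from
# the result instead to discard unmatched columns.
appended = pd.concat([survey_2016, survey_2015], ignore_index=True)
```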
A pivot table distributes data for easy consumption. It spreads out data held in long, winding tables by converting categories into columns. A pivot can be created by selecting the Column, Row, and Data fields.
To apply pivot:
1. Click the Transform menu, then select Pivot.
2. Select the fields that you want to convert to columns in the Columns box. Choose the fields to feature as rows in the Rows box and the fields with data (only numerical values) in the Data box.
3. You can change the function applied to the Data fields from the dropdown in the Data shelf.
- DISTINCT COUNT
4. Click the Preview button and DataPrep will show a live preview of the pivot transformation.
5. When two or more fields are selected in the Columns box, the column names are joined using a dash or hyphen (-). You can double-click to rename the column names in the Preview area.
6. In the new dataset, you can open the data source settings from the Ruleset and edit the Pivot configuration.
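The Column/Row/Data selection above maps naturally onto a pandas pivot table; this sketch uses made-up sales data and SUM as the aggregate function:

```python
import pandas as pd

# Hypothetical long-format sales data.
sales = pd.DataFrame({
    "region": ["East", "East", "West"],
    "year":   [2015, 2016, 2015],
    "amount": [100, 150, 200],
})

# "year" categories become columns (Columns box), "region" stays as
# rows (Rows box), and numeric "amount" (Data box) is aggregated.
pivoted = sales.pivot_table(index="region", columns="year",
                            values="amount", aggfunc="sum")
print(pivoted)
```

Combinations with no data (here, West in 2016) come out blank (NaN) in the pivoted result.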
Unpivot converts columns to rows. Unpivot is useful for condensing data and exporting to analytics software for creating reports and dashboards.
1. Click the Transform menu, then select Unpivot.
2. Select the columns to unpivot and enter the column names for headers and values.
3. Click on the Preview button and DataPrep will show a live preview of the transformation.
4. Enter a name for the new dataset and apply the transformation.
5. In the new dataset, you can open the data source settings from the Ruleset and edit the Unpivot configuration.
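Unpivoting corresponds to a melt in pandas: the selected columns fold back into rows, with the header and value column names you would enter in step 2 (the wide table below is a made-up example):

```python
import pandas as pd

# Hypothetical wide-format data with one column per year.
wide = pd.DataFrame({
    "region": ["East", "West"],
    "2015": [100, 200],
    "2016": [150, 250],
})

# The year columns become rows; "year" holds the old headers and
# "amount" holds the values.
long = wide.melt(id_vars="region", var_name="year", value_name="amount")
print(long)
```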
You can create new columns and customize them using the Formula column option.
To create a formula column:
1. Click on the Transform menu, then select Formula column.
2. You can call the functions in the text field, or use the drop-down menus to select columns and apply functions directly.
3. DataPrep shows a live preview of the changes made to this column when you click on the preview button.
To view the list of all formula column functions available in Zoho DataPrep, click here.
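Conceptually, a formula column derives a new column from existing ones; a minimal pandas equivalent of a concatenation formula might look like this (the column names and formula are illustrative, not DataPrep's function syntax):

```python
import pandas as pd

# Hypothetical dataset with name columns.
df = pd.DataFrame({"first_name": ["Amy", "Bob"],
                   "last_name": ["Lee", "Ng"]})

# A formula column combining two existing columns into a new one.
df["full_name"] = df["first_name"] + " " + df["last_name"]
print(df)
```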
Derived dataset allows you to create a branch of your dataset from the last applied transformation in your dataset. The new dataset will have the most recent state of your data with an empty Ruleset.
To derive from your dataset:
1. Click the Transform menu, then select Derive dataset.
2. Enter a name for your derived dataset under New dataset name.
3. Click Save to create your derived dataset.
4. The data source details in your derived dataset will show that it was derived from your parent dataset.
Mark as personal data
You can mark a column as containing personal data and secure the column data using the Mark as personal data option.
To mark columns with personal data:
1. Click the Transform menu, then select Mark as personal data.
2. Select the columns with personal data in the operation panel.
3. Click Apply.
During export, there are three security measures that can be applied to the columns marked as containing personal data:
1. Data masking
Data masking hides original data with 'x' to protect personal information.
2. Data tokenization
Data tokenization replaces each distinct value in your data with a random value, so the output remains statistically similar to the original data.
3. None
Choosing None still marks the selected columns as personal data. You can then choose whether or not to export these columns later while configuring the export.
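To illustrate the difference between masking and tokenization, here is a rough pandas sketch (the email data is made up, and the token format is an assumption; DataPrep's actual masking and tokenization schemes are not exposed as code):

```python
import secrets

import pandas as pd

# Hypothetical personal-data column with a repeated value.
emails = pd.Series(["amy@zoho.com", "bob@zoho.com", "amy@zoho.com"])

# Masking: hide every character behind 'x'.
masked = emails.str.replace(r".", "x", regex=True)

# Tokenization: map each distinct value to one random token, so equal
# values stay equal and the output keeps the same distinct-value
# structure as the original data.
tokens = {v: secrets.token_hex(4) for v in emails.unique()}
tokenized = emails.map(tokens)
```

Because the token mapping is one-to-one, counts and group-by results on the tokenized column match those on the original column.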