An overview of data preparation in the
Studio
page
is covered in the following sections:
-
Data distribution
-
Data quality
-
Intelligent suggestions
-
Search & filter
-
Topbar
The image above displays the
Studio page
and all the sections associated with it.
You can get a page tour by clicking the
Page tour
option from the
Help menu in the topbar. The page tour gives you a complete walkthrough of the
Studio
page
and makes it easy to understand.
Data distribution
In DataPrep, a histogram is a graphical representation of data distribution and the range of values present in a column.
You can spot outliers and anomalies in the data using it. Selecting a bar or a section of the histogram filters the data within the range.
A detailed version of this histogram is also present under Column details, which appears at the bottom when a column is selected.
You can edit the values present in the histogram and have it changed in the entire column.
You can also sort the values using the
icon.
You can also click the search icon and search values in the histogram using one of the conditions below:
-
contains
-
doesn't contain
-
is
-
is not
-
begins with
-
doesn't begin with
-
ends with
-
doesn't end with
Data quality
DataPrep offers numerous options to measure and improve the quality of your data. Data quality can be assessed from the following areas in the data prep page.
-
Data quality bar
-
Column details section
-
Dataset details section
Data quality bar
A data quality bar represents the quality of data in each column. It splits data quality into valid data, invalid data, and missing values, based on data type of the column.
Green represents valid data, red represents invalid data, and grey represents the missing values.
When you click on a section, DataPrep filters out appropriate rows so that you can easily deal with invalid or missing values in your dataset.
-
Hover over the data quality bar to get a quick look at the data quality of a column.
-
You can also click on the
Show for all columns
option to view the data quality of all individual columns.
Column details section
Column details section shows a data summary of each column with its data type, number of unique values in the column, and number of missing, invalid, and valid entries.
Note: The first 100 rows of the dataset are processed to suggest a data type.
-
The
Column details
are shown in the bottom panel whenever a column is selected.
-
This section has a detailed version of the histogram present at the top of each column. You can edit the values present in the histogram to have it changed in the entire column. You can also sort the values using the
icon.
-
You can also click the search icon and search values in the histogram using one of the conditions below:
-
contains
-
doesn't contain
-
is
-
is not
-
begins with
-
doesn't begin with
-
ends with
-
doesn't end with
-
You can also click the
Show more details
link to see the expanded view of the details of the selected column. Various aspects of the column such as statistics, outliers, unique values, and data patterns are displayed under this section.
Outliers and Anomaly detection
- Zoho DataPrep uses machine learning techniques to identify anomalies in the data. This lets users deal with anomalies in the data pipeline. Users can decide to keep or remove anomalies from their dataset. Here's a quick GIF to demonstrate DataPrep's outlier detection capabilities.
- You can also choose to have the widgets to be shown in the
Show more details
page from the context menu options next to the column name.
Dataset details section
Dataset details reveal data quality for the entire dataset using a donut chart. The number is derived from the collective quality of the individual columns. You will see this section for the first time when a dataset loads onto the data preparation screen, and whenever none of the columns are selected.
Dataset details display the following information.
-
Sample rows
-
Sample strategy (includes Random, Erroneous, Column based, and Initial data samples)
-
Total rows
-
Number of columns
-
Number of data types in the dataset
-
Overall dataset data quality as a donut chart.
The donut chart splits data into a percentage of valid data, invalid data, and missing values. Click on the sections of the donut chart to selectively view valid, invalid, and missing values in your dataset.
Sample strategy
Generating a sample is essential in speeding up the transformations performed. It is performed by taking a sample from the entire data using various strategies. The
Initial sample
strategy is used when the dataset is imported for the first time. You can change the strategy at any point during the data preparation process. Click on the edit icon in the dataset details panel to change the sample strategy.
The different sample strategies available are:
-
Initial sample:
Generated from the first 5 MB data of the imported file.
-
Random sample:
Randomly selected rows from the imported file.
-
Erroneous sample:
Rows containing invalid or missing entries.
-
Column based sample:
Generated based on the distinct values from the selected column.
Intelligent suggestions
DataPrep suggests transforms based on the imported data and makes for effective data preparation. Suggestions are shown when one or multiple columns are selected, it is also shown when a filter is applied.
-
When you click one of the suggested transform, you will be taken to the
Studio panel
with a live preview of the transformation to be applied to your data.
-
You may choose to edit the options and conditions in the operation bar before applying the suggested operation.
Search and filter
Perform search operations and filter data using the
Search and filter
box. The search and filter box is exploratory in nature and provides a way to filter your dataset without applying them as a rule. However, you can choose to apply them as a rule by either keeping or deleting the filtered rows.
You can also select one of the default filter options using the filter icon from the
Search and filter
box:
-
Filter rows with valid values - Only rows with valid data are displayed
-
Filter rows with invalid values - Only rows with invalid data are displayed
-
Filter rows with missing values - Only rows with missing data are displayed
-
Filter rows with missing or invalid values - Only rows with missing or invalid data are displayed
If you want to filter data based on custom filter conditions, you can use the
Advanced option that appears once you filter data.
To search data
-
To search data, enter the value in the
Search and filter
box. The searched keyword will be added as a chip below the box based on the default condition '
contains
'.
-
You can also select the chip and edit the searched keyword and the condition at any time.
-
Once the search result appears, you can choose to Keep or Delete
the filtered rows. The resultant dataset will be displayed with or without the filtered rows based on the selection.
All your searched keywords will be automatically included when you open the Advanced filters
pane.
To filter data
-
To filter dataset by other means, you can simply click on the histogram, data quality bar, or the donut chart.
-
When you filter, a chip appears below the
Search and filter
box. You can select the chip and edit the searched keyword and the associated condition at any time.
-
You can do multiple filters, and a chip would be added for each filter.
1. All filters present will be automatically included when you open the
Advanced filters
pane.
2. This also includes the ones that you add while having the pane open.
3. You can also edit the filters in the advanced filters pane.
Next section covers
Advanced filters
-
The conditions available in the filter are:
-
Contains (default)
-
Doesn't contain
-
Starts with
-
Doesn't start with
-
Ends with
-
Doesn't end with
-
Is
-
Is not
-
Matches regex
-
After filtering, you can choose to apply either
Keep rows
or
Delete rows rule that appears next to the chip.
-
You can also edit the filters added using the
Edit
link.
-
To remove a particular filter, click the close icon that appears when you hover over the chip.
-
To remove all filters at one go, click the
Clear link.
Advanced filters
The Advanced filters option allows you to filter data based on custom conditions applied over one or more columns. The advanced filters is exploratory in nature and provides a way to filter your dataset without applying them as a rule. However, you can choose to apply them as a rule by either keeping or deleting the filtered rows. The Advanced link appears once you filter data.
Click here to know more about filtering data.
1. All filters present will be automatically included when you open the Advanced filters
pane.
2. Including the ones that you add while having the pane open.
3. You can also edit the filters in the Advanced filters pane.
To apply advanced filters
1. Click on the donut chart, histogram or the default filter options and filter data. You can also right-click on a column and select the
Click here to learn more about filtering data.
2. The Advanced link appears above the data grid. Click the Advanced link and pane will slide open to view.
2.
Click the
icon to add columns to the filters. You can also reorder the filters using the drag and drop method.
3. When you add more than one filter to the
Filters
section, the logical operators, AND or OR appear next to the filters. You can click to toggle the logical operator between AND and OR.
4. Using the logical operators, you can combine the conditions and apply logic to determine the rule of precedence. The final expression is displayed in the
Criteria expression
box. You can click
Edit
to alter the default expression using logical operators and parenthesis to specify the precedence or the sequential order as to which condition should be evaluated first. Click
Save
after making the required changes.
For example, In the expression, ((1 OR 2) AND (3 OR 4)), first the condition ( 1 OR 2 ) will be executed and then, the condition ( 3 OR 4 ) will be executed next. Thirdly, since, the AND operator is used, the filter will be applied when both the conditions are true.
5. You can use the
Clear
button to remove all the filters.
6.
For every filter added, you can select one of the following options from the drop-down:
-
Actual
: This option lets you filter rows based on the actual values in the column.
-
Data quality
: This option lets you filter rows based on the quality of data in the column.
-
Patterns
: This option helps you filter rows based on the data patterns in the selected column.
-
Seasonal
: This option helps you filter rows based on the seasonal parameters such as quarter, month, week, etc.
-
Outliers
: This option allows you to filter rows based on the outliers present in the data of the selected column.
The filter options are displayed based on the datatype of the column added for the filter. Click
here
to know more about the filter options.
7. Click the
Filter
button. The number of filters added will be shown in the chip that appears above the data grid.
8.
Advanced filters
filter the data without applying any rule. After when you're satisfied with your filtered data, you can choose to apply either the
Keep rows
or
Delete rows
rule.
Topbar
The topbar in the
Studio page
has a dataset switcher on the left hand side and a menu bar on the right. Click the dropdown next to the dataset name in the top bar to switch to other datasets. This is particularly useful to resume the remaining data prep activities in other datasets instantly.
You can also use this dropdown to take a look at the data quality of other datasets. This helps you prioritize data preparation activities in the workspace.