How to extract values from a text column?

Zoho DataPrep offers options to identify and extract a subset of the data from a column. You can extract very specific portions of the column data using the extract transform.

For example, suppose your column contains a mixture of letters and numbers, but you need only the letters. From a text column with the value "ABC123", you can extract "ABC" into a new column. You could also extract "123", "BC12", or any other portion of the value into a new column.

DataPrep offers the following options to identify and extract text from a column: 
  1. Start and end index
  2. Start index and length
  3. Matching text or pattern
  4. Numbers
  5. Regex
  6. First 'n' characters
  7. Last 'n' characters
  8. Valid values
  9. Invalid values
  10. Email
  11. URL

Extract options

The options listed above are described below.

Start and end index 

Start index - The position from which extraction begins. The default start index is 1.
End index - The position up to which the value is extracted.

Start index and length 

Start index - The position from which extraction begins. The default start index is 1.
Length - The number of characters to extract, counting from the start index.
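
The two index-based options above can be pictured with a minimal Python sketch. This is an illustration only, not DataPrep's implementation: the function names are hypothetical, and treating the end index as inclusive is an assumption that the text above does not confirm.

```python
from typing import Optional


def extract_by_index(value: str, start: int = 1, end: Optional[int] = None) -> str:
    """Return the characters from position `start` through `end` (1-based).

    Assumption: the end index is inclusive.
    """
    if start < 1:
        raise ValueError("start index must be 1 or greater")
    # A 1-based inclusive end equals Python's 0-based exclusive slice end.
    return value[start - 1:end]


def extract_by_length(value: str, start: int = 1, length: int = 0) -> str:
    """Return `length` characters beginning at the 1-based position `start`."""
    if start < 1 or length < 0:
        raise ValueError("start must be >= 1 and length must be >= 0")
    return value[start - 1:start - 1 + length]


print(extract_by_index("ABC123", 1, 3))   # ABC
print(extract_by_length("ABC123", 2, 4))  # BC12
```

Both calls reuse the "ABC123" value from the introduction.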

Matching text or pattern 

Text or pattern to match - Extract the value that matches the given text or pattern.
Starting text or pattern - Extract the value starting from the given text or pattern.
Ending text or pattern - Extract the value ending before the given text or pattern.
Note: If you are not familiar with pattern matching in DataPrep, see the DataPrep documentation on pattern matching.
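
As a rough sketch of how a starting text and an ending text bound the extracted value, the Python function below handles literal text only; DataPrep's own pattern syntax is not modelled. The function name is hypothetical. Including the starting text in the result while excluding the ending text is an assumption, based on the "starting from" and "ending before" wording above.

```python
import re


def extract_between(value, start_text, end_text, ignore_case=False):
    """Return the text from `start_text` up to, but excluding, `end_text`.

    Returns None when either marker is missing.
    Assumption: the starting text itself is part of the result.
    """
    flags = re.DOTALL | (re.IGNORECASE if ignore_case else 0)
    # re.escape treats both markers as literal text, not as regex syntax.
    pattern = re.escape(start_text) + r".*?(?=" + re.escape(end_text) + r")"
    match = re.search(pattern, value, flags)
    return match.group(0) if match else None


print(extract_between("Order: INV-2041 (paid)", "INV", " ("))  # INV-2041
```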

Numbers

Extract the numbers from the text in the column.
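
One simple way to picture this option is as collecting every run of digits in the value. The Python sketch below does only that and is an assumption about the behaviour: it is not confirmed how DataPrep treats signs, decimal points, or thousands separators, so a value such as "3.14" would come out as two separate numbers here.

```python
import re


def extract_numbers(value):
    """Return every run of consecutive digits in `value`, in order."""
    return re.findall(r"\d+", value)


print(extract_numbers("Room 12, Floor 3"))  # ['12', '3']
```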

Regex 

Regex pattern - Enter the regular expression that matches the value you want to extract.
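
To try out a pattern before entering it, a short Python check like the one below can help. It is only a sketch with a hypothetical function name: it uses Python's re module, whose regex dialect may differ in details from the one DataPrep accepts.

```python
import re


def extract_regex(value, pattern):
    """Return the first substring of `value` that matches `pattern`, or None."""
    match = re.search(pattern, value)
    return match.group(0) if match else None


print(extract_regex("Ticket #48213 opened", r"#\d+"))  # #48213
```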

First 'n' characters 

Number of characters to extract - Specify the number of characters to extract from the start of the value. 

Last 'n' characters

Number of characters to extract - Specify the number of characters to extract from the end of the value.
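
The first and last 'n' characters options correspond to simple string slicing. A minimal Python sketch with hypothetical function names, shown for illustration rather than as DataPrep's implementation:

```python
def first_n(value, n):
    """Return the first `n` characters of `value`."""
    if n < 0:
        raise ValueError("n must be 0 or greater")
    return value[:n]


def last_n(value, n):
    """Return the last `n` characters of `value`."""
    if n < 0:
        raise ValueError("n must be 0 or greater")
    # value[-0:] would return the whole string, so n == 0 is handled explicitly.
    return value[-n:] if n > 0 else ""


print(first_n("ABC123", 3))  # ABC
print(last_n("ABC123", 3))   # 123
```

When n is larger than the length of the value, both functions return the whole value.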

Valid values

Extract the valid values from the column.

Invalid values

Extract the invalid values from the column.

Email

Extract the username, domain, or both from an email column. 
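
The username and domain are the parts before and after the "@" sign. The Python sketch below shows that split for simple addresses; the function name is hypothetical, and full email validation is deliberately left out.

```python
def split_email(address):
    """Return (username, domain) for a simple address, or None if malformed."""
    # rpartition splits on the last "@", the separator before the domain.
    username, at, domain = address.strip().rpartition("@")
    if not at or not username or not domain:
        return None
    return username, domain


print(split_email("jane.doe@example.com"))  # ('jane.doe', 'example.com')
```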



URL

Extract components of a URL, such as the domain, port, path, and query parameters.
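
For reference, the same components can be pulled out of a URL with Python's standard urllib.parse module. This is a sketch for illustration only; the dictionary keys are chosen for this example and are not DataPrep's output column names.

```python
from urllib.parse import parse_qs, urlsplit


def split_url(url):
    """Return the main components of `url` as a dictionary."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "domain": parts.hostname,        # lower-cased host without the port
        "port": parts.port,              # None when no port is given explicitly
        "path": parts.path,
        "query": parse_qs(parts.query),  # each parameter maps to a list of values
    }


print(split_url("https://example.com:8443/docs/page?lang=en&tag=a&tag=b"))
```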




To extract data from a column

1. Right-click the text column and select the Extract option from the context menu. 

2. Enter a name in the Base column name field. This name is used for the new column.

3. Choose one of the Extract options and provide the inputs required to extract the specific portion of the value from the selected column.

4. Use the Store output as option to store the extracted value either in a 'column' or as a 'list'.



Notes
Ignore case: Ignore letter case when matching text or a pattern.
Number of matches to extract: Specify how many matches are extracted as columns. The default number is 1.

5. You can apply this transform to multiple columns at the same time. Select the columns using the icon under the Columns to apply section.
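
The Ignore case, Number of matches to extract, and Store output as settings described in the steps above can be sketched in Python as follows. This is an illustration under assumptions, not DataPrep's behaviour: the function names are hypothetical, and the `<base_name>_<i>` column naming is invented for this example.

```python
import re


def extract_matches(value, pattern, count=1, ignore_case=False):
    """Return up to `count` non-overlapping matches of `pattern` as a list."""
    flags = re.IGNORECASE if ignore_case else 0
    matches = [m.group(0) for m in re.finditer(pattern, value, flags)]
    return matches[:count]


def as_columns(matches, base_name, count):
    """Spread matches over `count` columns, padding missing ones with None.

    The `<base_name>_<i>` naming is hypothetical, chosen for this sketch only.
    """
    padded = matches + [None] * (count - len(matches))
    return {f"{base_name}_{i}": m for i, m in enumerate(padded, start=1)}


found = extract_matches("id: ab12, AB34, cd56", r"ab\d+", count=2, ignore_case=True)
print(found)                         # ['ab12', 'AB34']
print(as_columns(found, "code", 2))  # {'code_1': 'ab12', 'code_2': 'AB34'}
```

Keeping the list form corresponds to storing the output as a 'list'; spreading it over numbered columns corresponds to storing it as a 'column'.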

To apply filters

To apply filters along with this transform, use the Filters tab.

Info: The transform is applied only to the filtered rows, not to the whole dataset.

1. Click the Filters tab.

2. Click the icon and add the required columns in the Filters section. You can also reorder the filters by dragging and dropping them.



3. For every column added, you can select one of the following options from the drop-down:
  1. Actual: Filter rows based on the actual values in the column.
  2. Data quality: Filter rows based on the quality of the data in the column.
  3. Patterns: Filter rows based on the data patterns in the selected column.
  4. Seasonal: Filter rows based on seasonal parameters such as quarter, month, and week.
  5. Outliers: Filter rows based on the outliers present in the data of the selected column.
Note: The filter options displayed depend on the data type of the column added to the filter.

4. When you add more than one filter to the Filters section, the logical operator AND or OR appears next to each filter. Click the operator to toggle between AND and OR.
  1. Using the logical operators, you can combine the conditions and set their order of precedence. The final expression is displayed in the Criteria expression box. Click Edit to change the default expression, using logical operators and parentheses to specify which condition is evaluated first. Click Save after making the required changes.
  2. For example, in the expression ((1 OR 2) AND (3 OR 4)), the condition (1 OR 2) is evaluated first and the condition (3 OR 4) next. Because the two results are combined with AND, the filter matches a row only when both conditions are true.
5. In the next section, you can drill down further and choose specific values based on the filter option selected for each filter.



For example, when the Data quality option is selected for the All columns filter in the Filters section, further options to filter specific values are displayed in the All columns (Data quality) section.

6. You can choose to include or exclude the selected items in the last section.

7. To remove all the filters, click the Clear button.

8. A live preview of the filter transform is shown as you make changes. 

9. Click the Apply button to apply the transform along with the filters.
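
As a rough illustration of how a criteria expression such as ((1 OR 2) AND (3 OR 4)) from step 4 decides which rows the transform touches, the Python sketch below evaluates four conditions per row and applies an extract only to rows that pass. The function, column names, and sample data are hypothetical. Leaving the new column empty for rows that do not pass is an assumption, not confirmed DataPrep behaviour.

```python
def apply_to_filtered(rows, conditions, transform, source, target):
    """Run `transform` on `source` only for rows passing ((1 OR 2) AND (3 OR 4)).

    `conditions` is a list of four predicates. Rows that fail the criteria
    get None in the `target` column (an assumption made for this sketch).
    """
    c1, c2, c3, c4 = conditions
    result = []
    for row in rows:
        # The parentheses fix the precedence: each OR group is evaluated
        # first, then the two group results are combined with AND.
        passes = (c1(row) or c2(row)) and (c3(row) or c4(row))
        new_row = dict(row)
        new_row[target] = transform(row[source]) if passes else None
        result.append(new_row)
    return result


rows = [
    {"code": "ABC123", "region": "EU", "qty": 5},
    {"code": "XYZ789", "region": "US", "qty": 0},
]
conditions = [
    lambda r: r["region"] == "EU",          # condition 1
    lambda r: r["region"] == "APAC",        # condition 2
    lambda r: r["qty"] > 0,                 # condition 3
    lambda r: r["code"].startswith("XYZ"),  # condition 4
]
out = apply_to_filtered(rows, conditions, lambda v: v[:3], "code", "prefix")
print([r["prefix"] for r in out])  # ['ABC', None]
```

The second row satisfies condition 4, but it fails both condition 1 and condition 2, so the AND makes the whole expression false and the row is left untouched.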

To sort data

Under the Sort tab, you can sort the data in ascending or descending order based on any column. Choose the column in the Sort by column drop-down, and then choose the sort order.

Info: Sorting under this tab is available only as part of the transform, not as a standalone function. If you only want to sort data, use the Sort transform.



SEE ALSO 
Count values
How to extract date values?
How to extract values from list and map columns?