Language detection

You can detect the language of the text in a selected text column using the Language detection operation, powered by DataPrep's own machine learning engine. For example, if the text value in the selected column is "Hello, World!", the Language detection transform returns 'English'.

 

The transform supports over 70 languages. The supported languages are:

  • Afrikaans (af)

  • Aragonese (an)

  • Arabic (ar)

  • Asturian (ast)

  • Belarusian (be)

  • Breton (br)

  • Catalan (ca)

  • Bulgarian (bg)

  • Bengali (bn)

  • Czech (cs)

  • Welsh (cy)

  • Danish (da)

  • German (de)

  • Greek (el)

  • English (en)

  • Spanish (es)

  • Estonian (et)

  • Basque (eu)

  • Persian (fa)

  • Finnish (fi)

  • French (fr)

  • Irish (ga)

  • Galician (gl)

  • Gujarati (gu)

  • Hebrew (he)

  • Hindi (hi)

  • Croatian (hr)

  • Haitian (ht)

  • Hungarian (hu)

  • Indonesian (id)

  • Icelandic (is)

  • Italian (it)

  • Japanese (ja)

  • Khmer (km)

  • Kannada (kn)

  • Korean (ko)

  • Lithuanian (lt)

  • Latvian (lv)

  • Macedonian (mk)

  • Malayalam (ml)

  • Marathi (mr)

  • Malay (ms)

  • Maltese (mt)

  • Nepali (ne)

  • Dutch (nl)

  • Norwegian (no)

  • Occitan (oc)

  • Punjabi (pa)

  • Polish (pl)

  • Portuguese (pt)

  • Romanian (ro)

  • Russian (ru)

  • Slovak (sk)

  • Slovene (sl)

  • Somali (so)

  • Albanian (sq)

  • Serbian (sr)

  • Swedish (sv)

  • Swahili (sw)

  • Tamil (ta)

  • Telugu (te)

  • Thai (th)

  • Tagalog (tl)

  • Turkish (tr)

  • Ukrainian (uk)

  • Urdu (ur)

  • Vietnamese (vi)

  • Walloon (wa)

  • Yiddish (yi)

  • Simplified Chinese (zh-cn)

  • Traditional Chinese (zh-tw)

To detect languages in a column

1. Right-click the column and select the Language detection transform from the context menu.



2. Enter a name for the resulting column in the New column name section.


3. Select the type of output required. Language name returns the name of the detected language, and Language code returns its code.


4. For example, for English text, selecting Language name gives 'English' as the output, and selecting Language code gives 'en'.


5. DataPrep shows a live preview of the column during the transform. You can click the Preview button at the bottom of the side panel to preview the output column.


6. You can apply this transform to only one column. Click the Apply button to apply this transform.
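
The difference between the two output types can be sketched in a few lines of Python. This is only an illustration of the name/code relationship described in the steps above, not DataPrep's implementation: the `LANGUAGES` table is a small subset of the supported list, and `render_output` is a hypothetical helper.

```python
# Illustrative subset of the supported languages listed earlier (code -> name).
LANGUAGES = {
    "en": "English",
    "fr": "French",
    "de": "German",
    "ja": "Japanese",
    "zh-cn": "Simplified Chinese",
}


def render_output(code: str, output_type: str) -> str:
    """Hypothetical helper: return the value the chosen output type would show."""
    if output_type == "Language name":
        return LANGUAGES[code]
    if output_type == "Language code":
        return code
    raise ValueError(f"unknown output type: {output_type}")


print(render_output("en", "Language name"))  # English
print(render_output("en", "Language code"))  # en
```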


Note: The Language detection transform gives accurate results when the text is 50 characters or longer.
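
If you want to know in advance which values fall below that length, a check along the following lines may help. It is a minimal sketch that assumes the column values are available as a plain Python list. The names `MIN_RELIABLE_LENGTH` and `is_long_enough` are made up for this example and are not part of DataPrep.

```python
# Threshold taken from the note above; the name is hypothetical.
MIN_RELIABLE_LENGTH = 50


def is_long_enough(text: str) -> bool:
    """Return True when the text meets the 50-character guideline."""
    return len(text.strip()) >= MIN_RELIABLE_LENGTH


rows = [
    "Hello, World!",
    "This sentence is comfortably longer than fifty characters in total.",
]

# Print each value with a flag showing whether it meets the guideline.
for value in rows:
    print(is_long_enough(value), value)
```

Shorter values can still be passed to the transform; the check only marks the rows whose detected language deserves a second look.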

To apply filters

To apply filters along with this transform, use the filters functionality.

1. Click the Filters tab.

2. Click the icon and add the required columns in the Filters section. You can also reorder the filters by dragging and dropping them.



3. For every column added, you can select one of the following options from the drop-down:
  1. Actual: This option lets you filter rows based on the actual values in the column. Click here to know more.
  2. Data quality: This option lets you filter rows based on the quality of data in the column. Click here to know more.
  3. Patterns: This option lets you filter rows based on the data patterns in the selected column. Click here to know more.
  4. Outliers: This option lets you filter rows based on the outliers present in the data of the selected column. Click here to know more.
Note: The filter options are displayed based on the data type of the column added for the filter.

4. When you add more than one filter to the Filters section, the logical operators AND or OR appear next to the filters. You can click an operator to toggle it between AND and OR.
  1. Using the logical operators, you can combine the conditions and set their order of precedence. The final expression is displayed in the Criteria expression box. Click Edit to alter the default expression, using logical operators and parentheses to specify which condition should be evaluated first. Click Save after making the required changes.
  2. For example, in the expression ((1 OR 2) AND (3 OR 4)), the condition (1 OR 2) is evaluated first and the condition (3 OR 4) is evaluated next. Because the AND operator joins them, the filter is applied only when both conditions are true.
5. You can further drill down to choose specific values based on the filter option selected for each filter, in the next section.



For example, in the above screenshot, the Data quality option is selected for the All columns filter in the Filters section. Based on the selection, further options to filter specific values are displayed in the All columns (Data quality) section.

6. You can choose to include or exclude the selected items in the last section.

7. To remove all the filters, click the Clear button.

8. A live preview of the filter transform is shown as you make changes.

9. Click the Apply button to apply the transform along with the filters.
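
The criteria expression from step 4 can be read as ordinary boolean logic. The sketch below evaluates ((1 OR 2) AND (3 OR 4)) for a single row in Python, where each argument stands for whether the row satisfies the correspondingly numbered filter. The function names are hypothetical. The ungrouped variant follows Python's own precedence rules (AND binds tighter than OR), which is an assumption for illustration and may not match how DataPrep treats an expression without parentheses.

```python
def grouped(c1: bool, c2: bool, c3: bool, c4: bool) -> bool:
    """Evaluate ((1 OR 2) AND (3 OR 4)) for one row."""
    return (c1 or c2) and (c3 or c4)


def ungrouped(c1: bool, c2: bool, c3: bool, c4: bool) -> bool:
    """Same operators without parentheses; Python reads this as 1 OR (2 AND 3) OR 4."""
    return c1 or c2 and c3 or c4


# A row that satisfies only filter 1.
row = (True, False, False, False)

print(grouped(*row))    # False: neither filter 3 nor filter 4 holds
print(ungrouped(*row))  # True: filter 1 alone is enough
```

The two results differ for the same row, which is why it is worth adding explicit parentheses in the Criteria expression box whenever AND and OR are mixed.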

SEE ALSO
Learn about keyword extraction
Learn about sentiment analysis