Turkish Input Processors Chain

Introduction

An Input Processor (IP) pre-processes inputs so that the Teneo Engine can work with them, for example by normalizing and tokenizing the inputs. Each language supported by the Teneo Platform has a chain of input processors that know how to process that particular language.

IP Chain Setup

The following graph displays the setup of the Turkish Input Processors chain; each Input Processor is described further in the following sections.

The default Input Processors are briefly described below:

  • Turkish Analyzer: performs user input normalization, sentence splitting, tokenization and Part-of-Speech (POS) and morphological annotations.
  • System Annotation: sets a number of annotations based on properties of the user input text.
  • Basic Number Recognizer: recognizes all Arabic numbers and annotates them with a NUMBER annotation and a variable which contains the found number.
  • Language Detector: identifies the language of the input sentence provided and annotates it with the predicted language together with a confidence score.
  • Predict: classifies user inputs based on a machine learning model and annotates the user input with the predicted top intent classes and a confidence score; note that as of Teneo 7.3, deferred intent classification is applied.

Standard Simplifier

The Standard Simplifier is a simplifier implementation with support for configurable character decomposition and normalization as well as character mapping. It executes the following processing steps:

  1. Conversion to lower case, considering the configured language locale.
  2. Optional compatibility simplification: this is Unicode compatibility decomposition (like mapping ² to 2, etc.), with optional exceptions defined by the property excludeFromCompatibilitySimplify. This step is disabled by default, see compatibilitySimplify.
  3. Optional canonical simplification: Unicode canonical decomposition is applied, then by default all combining characters are deleted (exceptions may be defined with the property excludeFromCanonicalSimplify, which lists the characters whose combining characters are left untouched).
  4. Conversion to Unicode composed form.
  5. Optional simplification mapping: character/substring replacements as specified by the simplificationMapping.* properties are applied. No mappings are set by default.

Configuration Properties

Canonical Simplify
Name | Type | Required | Default
canonicalSimplify | true/false | no | true

The canonicalSimplify property enables or disables simplification based on canonical decomposition of Unicode characters; see Unicode normalization forms for more information. An exception list can be defined in excludeFromCanonicalSimplify.

If enabled:

  • First, canonical decomposition is applied, meaning that accented characters are decomposed into the base letter and combining marks (non-spacing mark) for the accent(s).
  • In a second step, all non-spacing marks are deleted, i.e., á becomes a, etc.
  • Finally, canonical composition is applied.

Exclude From Canonical Simplify
Name | Type | Required | Default
excludeFromCanonicalSimplify | string | no | empty

All characters in the string given here will be excluded from the canonical simplification defined above. More precisely, for the character combinations resulting from step one, step two is skipped, so their combining marks are kept; see the steps above.
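
The three steps and the exclusion list can be illustrated with a minimal Python sketch. It relies only on the standard unicodedata module; the function name canonical_simplify and its parameters are hypothetical and are not part of the Teneo API, and the actual implementation may differ in detail.

```python
import unicodedata


def canonical_simplify(text: str, exclude: str = "") -> str:
    """Illustrative sketch: strip accents unless the character is excluded."""
    pieces = []
    # Work on composed characters so that each accented letter is one unit.
    for ch in unicodedata.normalize("NFC", text):
        if ch in exclude:
            # Excluded character: the deletion of marks (step two) is skipped.
            pieces.append(ch)
            continue
        # Step one: canonical decomposition into base letter + combining marks.
        decomposed = unicodedata.normalize("NFD", ch)
        # Step two: delete all non-spacing marks (Unicode category Mn).
        pieces.append(
            "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
        )
    # Step three: canonical composition.
    return unicodedata.normalize("NFC", "".join(pieces))
```

With this sketch, canonical_simplify("çay") yields "cay", while canonical_simplify("çay", exclude="ç") leaves the input unchanged.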

Compatibility Simplify
Name | Type | Required | Default
compatibilitySimplify | true/false | no | false

The compatibilitySimplify property enables or disables simplification based on compatibility decomposition of Unicode characters, for example, ⁵ becomes 5.

Exclude From Compatibility Simplify
Name | Type | Required | Default
excludeFromCompatibilitySimplify | string | no | empty

All characters in the string given here are excluded from the compatibility simplification described above.
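
As a rough illustration, compatibility decomposition with an exclusion list could look like the following Python sketch. The function name compatibility_simplify is hypothetical, and the sketch only assumes that the step corresponds to Unicode NFKD decomposition applied character by character.

```python
import unicodedata


def compatibility_simplify(text: str, exclude: str = "") -> str:
    """Illustrative sketch: apply NFKD to every character not in `exclude`."""
    return "".join(
        ch if ch in exclude else unicodedata.normalize("NFKD", ch)
        for ch in text
    )
```

For example, compatibility_simplify("x²") gives "x2", whereas passing exclude="²" keeps the superscript.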

Simplification Mapping
Name | Type | Required | Default
simplificationMapping.* | Format: simplificationMapping.<n> = <letter(s)>=<replacement> | no | empty

The placeholders in the format are:

  • <n>: number, which must be unique within the simplification mappings of one file
  • <letter(s)>: string, letter(s) to be replaced
  • <replacement>: string, replacement

Custom simplification mapping which is applied after canonical and compatibility simplification.

Be aware that a character for which a custom simplification mapping is defined, for example an accented character, must also be listed under excludeFromCanonicalSimplify if canonical simplification isn't disabled; otherwise the accent is removed before the mapping is applied.
For example: simplificationMapping.1 = ä=ae also requires excludeFromCanonicalSimplify = ...ä...
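
The interaction between the mapping and the exclusion list can be seen in a small Python sketch. The helper names (parse_mappings, strip_accents, simplify) are hypothetical, and applying the mappings in the order of their numbers is an assumption made for this illustration only.

```python
import unicodedata


def parse_mappings(lines):
    """Parse 'simplificationMapping.<n> = <letter(s)>=<replacement>' lines."""
    mappings = {}
    for line in lines:
        key, _, value = line.partition("=")
        key = key.strip()
        if not key.startswith("simplificationMapping."):
            continue
        number = int(key.rsplit(".", 1)[1])
        letters, _, replacement = value.strip().partition("=")
        mappings[number] = (letters, replacement)
    # Assumption: mappings are applied in ascending order of <n>.
    return [mappings[n] for n in sorted(mappings)]


def strip_accents(text, exclude=""):
    """Canonical simplification: drop combining marks unless excluded."""
    kept = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in exclude:
            kept.append(ch)
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            kept.append("".join(c for c in decomposed
                                if not unicodedata.combining(c)))
    return unicodedata.normalize("NFC", "".join(kept))


def simplify(text, mappings, exclude=""):
    """Canonical simplification first, custom mappings afterwards."""
    text = strip_accents(text, exclude)
    for letters, replacement in mappings:
        text = text.replace(letters, replacement)
    return text
```

With the mapping ä=ae, simplify("bär", mappings) returns "bar" because the accent is already gone when the mapping runs, while simplify("bär", mappings, exclude="ä") returns "baer".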

Input Processors

Turkish Analyzer

The Turkish Analyzer is based on Zemberek and performs the following tasks:

  • User input normalization,
  • Sentence splitting,
  • Tokenization, and
  • Part-of-Speech (POS) and morphological annotations.

Normalization, Sentence Splitting and Tokenization

The Turkish Analyzer performs normalization on user inputs and, furthermore, will segment the input into sentences, tokenize and analyze the morphological structure of each token in the context of the sentence.

This means that each sentence will be normalized by the Turkish Analyzer, i.e., the sentence will be lowercased and, in some cases, typos will be fixed. Unlike other Teneo input processors, the API method getOriginal() on a word object will return the normalized form (which might be different from the simplified form) as the normalization happens before the tokenization.
This has direct implications for the exact option: for other languages it works on the ORIGINAL form, but for Turkish it operates on the normalized strings.
The original user input is not modified and can be retrieved with getUserInputText().

A sentence in Turkish is an instance of the TurkishSentence class, which implements the SentenceI interface from the engine-input-processor-api. The method getText() of the class TurkishSentence returns the normalized sentence text. The original sentence text can be retrieved with the method getRawSentence() by a direct caller of the input processor chain, by casting a Sentence to a TurkishSentence. It cannot be accessed via the engine scripting API.

The sentence indices point to the characters in the original user input string. The word indices point to the characters in the sentence, i.e. the normalized sentence string.

POS and Morphological Annotations

The Turkish Analyzer also annotates user inputs with POS and morphological information. Each word will be annotated with its lemma, if available. A lemma annotation contains the POS tag as an annotation variable pos:<string>.

The morphological information will be returned as annotations for the three different types that Zemberek returns with the following suffixes:

  • .POS: primary part-of-speech tag of the entire token
  • .POS/.NER: secondary part-of-speech tag (mix of entities/POS tags) based on the stem of the token
  • .MST: morphosyntactic information based on the morphemes of the token

The MST annotations all have the annotation variable surface=<string> that contains the substring of the surface form of that morpheme in the word, if available.

The table below lists how the tags from Zemberek are mapped to annotations in Teneo; please see here for information related to available ANNOT Language Objects in the Turkish Lexical Resource.

Zemberek Type | Zemberek Tag | Map to annotations
POS | Noun | NN.POS
POS | Adj | ADJ.POS
POS | Adv | ADV.POS
POS | Conj | CC.POS
POS | Interj | INTERJ.POS
POS | Verb | VB.POS
POS | Pron | PRON.POS
POS | Num | NUMERAL.POS
POS | Det | DET.POS
POS | Postp | POST_POSITIVE.POS
POS | Ques | INTERROG.POS
POS | Dup | DUPLICATOR.POS
POS | Punc | PUNCT.POS
POS2 | Demons | DEMOS.POS
POS2 | Time | TIME.NER
POS2 | Quant | QUANTITATIVE.POS
POS2 | Ques | INTERROG.POS
POS2 | Prop | PROPER.POS
POS2 | Pers | PERS.POS
POS2 | Reflex | REFLEXIVE.POS
POS2 | Ord | ORDINAL.POS
POS2 | Card | CARDINAL.POS
POS2 | Percent | PERCENT.NER
POS2 | Ratio | RATIO.NER
POS2 | Range | RANGE.NER
POS2 | Dist | DIST.NER
POS2 | Clock | CLOCK.NER
POS2 | Date | DATE.NER
POS2 | Email | EMAIL.NER
POS2 | Url | URL.NER
POS2 | Mention | MENTION.NER
POS2 | HashTag | HASHTAG.NER
POS2 | Emoticon | EMOTICON.NER
POS2 | RegAbbrv | ABBREVIATION.NER
POS2 | Abbrv | ABBREVIATION.NER
MST | Noun | NN.MST
MST | Adj | ADJ.MST
MST | Adv | ADV.MST
MST | Conj | CC.MST
MST | Interj | INTERJ.MST
MST | Verb | VB.MST
MST | Pron | PRON.MST
MST | Num | NUMERAL.MST
MST | Det | DET.MST
MST | Postp | POST_POSITIVE.MST
MST | Ques | INTERROG.MST
MST | Dup | DUPLICATOR.MST
MST | Punc | PUNCT.MST
MST | A1sg | 1STPERSON.MST,SG.MST
MST | A2sg | 2NDPERSON.MST,SG.MST
MST | A3sg | 3RDPERSON.MST,SG.MST
MST | A1pl | 1STPERSON.MST,PL.MST
MST | A2pl | 2NDPERSON.MST,PL.MST
MST | A3pl | 3RDPERSON.MST,PL.MST
MST | Pnon | NO_POSESSION.MST
MST | P1sg | POSS_1STPERSON.MST,POSS_SG.MST
MST | P2sg | POSS_2NDPERSON.MST,POSS_SG.MST
MST | P3sg | POSS_3RDPERSON.MST,POSS_SG.MST
MST | P1pl | POSS_1STPERSON.MST,POSS_PL.MST
MST | P2pl | POSS_2NDPERSON.MST,POSS_PL.MST
MST | P3pl | POSS_3RDPERSON.MST,POSS_PL.MST
MST | Nom | NOMINATIVE.MST
MST | Dat | DATIVE.MST
MST | Acc | ACCUSATIVE.MST
MST | Abl | ABLATIVE.MST
MST | Loc | LOCATIVE.MST
MST | Ins | INSTRUMENTAL.MST
MST | Gen | GENITIVE.MST
MST | Equ | EQUATIVE.MST
MST | Dim | DIMINUTIVE.MST
MST | Ness | NESS.MST
MST | With | WITH.MST
MST | Without | WITHOUT.MST
MST | Related | RELATED.MST
MST | JustLike | JUST_LIKE.MST
MST | Rel | RELATION.MST
MST | Agt | AGENTIVE.MST
MST | Become | BECOME.MST
MST | Acquire | ACQUIRE.MST
MST | Ly | LY.MST
MST | Caus | CAUSATIVE.MST
MST | Recip | RECIPROCAL.MST
MST | Reflex | REFLEXIVE.MST
MST | Able | ABILITY.MST
MST | Pass | PASSIVE.MST
MST | Inf1 | INFINITIVE1.MST
MST | Inf2 | INFINITIVE2.MST
MST | Inf3 | INFINITIVE3.MST
MST | ActOf | ACT_OF.MST
MST | PastPart | PART_PAST.MST
MST | NarrPart | PART_NARRATIVE.MST
MST | FutPart | PART_FUTURE.MST
MST | PresPart | PART_PRESENT.MST
MST | AorPart | PART_AORIST.MST
MST | NotState | NOT_STATE.MST
MST | FeelLike | FEEL_LIKE.MST
MST | EverSince | EVER_SINCE.MST
MST | Repeat | REPEAT.MST
MST | Almost | ALMOST.MST
MST | Hastily | HASTILY.MST
MST | Stay | STAY.MST
MST | Start | START.MST
MST | AsIf | AS_IF.MST
MST | While | WHILE.MST
MST | When | WHEN.MST
MST | SinceDoingSo | SINCE_DOING_SO.MST
MST | AsLongAs | AS_LONG_AS.MST
MST | ByDoingSo | BY_DOING_SO.MST
MST | Adamantly | ADAMANTLY.MST
MST | AfterDoingSo | AFTER_DOING_SO.MST
MST | WithoutHavingDoneSo | WITHOUT_HAVING_DONE_SO.MST
MST | WithoutBeingAbleToHaveDoneSo | WITHOUT_BEING_ABLE_TO_DO_SO.MST
MST | Zero | ZERO.MST
MST | Cop | COP.MST
MST | Neg | NEGATIVE.MST
MST | Unable | UNABLE.MST
MST | Pres | PRESENT.MST
MST | Past | PAST.MST
MST | Narr | NARRATIVE.MST
MST | Cond | CONDITION.MST
MST | Prog1 | PROGRESSIVE1.MST
MST | Prog2 | PROGRESSIVE2.MST
MST | Aor | AORIST.MST
MST | Fut | FUTURE.MST
MST | Imp | IMPERATIVE.MST
MST | Opt | OPTATIVE.MST
MST | Desr | DESIRE.MST
MST | Neces | NECESSITY.MST

Configuration Properties

Non Word Chars
Name | Type | Required | Default
nonWordChars | string | no | " “ ” . ¡ ! ¿ ? … , ; ‘ ’ ' ´ `

A list of characters that are removed if they are single tokens.
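
A minimal Python sketch of this filtering is shown below. It assumes that the default value is a space-separated list of single characters and that "single token" means a token consisting of exactly one of those characters; the function name drop_non_word_tokens is hypothetical.

```python
# Default nonWordChars, assuming the spaces in the default are only separators.
NON_WORD_CHARS = set("\"“”.¡!¿?…,;‘’'´`")


def drop_non_word_tokens(tokens, non_word_chars=NON_WORD_CHARS):
    """Remove tokens that consist of exactly one non-word character."""
    return [
        token for token in tokens
        if not (len(token) == 1 and token in non_word_chars)
    ]
```

Under these assumptions, the token list ["merhaba", ",", "dünya", "!"] is reduced to ["merhaba", "dünya"].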

There are three types of annotation mapping properties. Their value is of the form:

P1sg=POSSESIVE.MST,1STPERSON.MST,SG.MST

Note that the properties are numbered in the configuration file. For example:

annotationsForMST.1 = A1sg=1STPERSON.MST,SG.MST  
annotationsForMST.2 = A2sg=2NDPERSON.MST,SG.MST

Name | Type | Required | Default
annotationsForPos | string | no | empty

Mapping for the POS tags (primary POS returned from Zemberek API)

Name | Type | Required | Default
annotationsForPos2 | string | no | empty

Mapping for the POS/NER tags (secondary POS returned from Zemberek API)

Name | Type | Required | Default
annotationsForMST | string | no | empty

Mapping for the morphological information tags (morpheme returned from Zemberek API)
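
To make the value format concrete, the following Python sketch turns numbered mapping properties into a lookup table from Zemberek tag to Teneo annotation names. The function name parse_annotation_mappings and the representation of the configuration as a dictionary are assumptions made for illustration; they do not reflect how the Turkish Analyzer reads its configuration internally.

```python
def parse_annotation_mappings(properties, prefix):
    """Collect '<prefix>.<n> = <tag>=<annotation>[,<annotation>...]' entries."""
    mapping = {}
    for key, value in properties.items():
        # Only consider numbered entries of the requested property,
        # e.g. 'annotationsForMST.1' for prefix 'annotationsForMST'.
        if not key.startswith(prefix + "."):
            continue
        tag, _, annotations = value.partition("=")
        mapping[tag.strip()] = [
            name.strip() for name in annotations.split(",") if name.strip()
        ]
    return mapping
```

For the two annotationsForMST entries shown above, the result maps A1sg to 1STPERSON.MST and SG.MST, and A2sg to 2NDPERSON.MST and SG.MST.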

Further, there are the following files that configure the Zemberek normalizer directly:

  • asci-map: list of auto-correct mappings
  • lm.2gram.slm: language model
  • look-up-from-graph: list of auto-correct mappings
  • split: list of words to be split

For more information please visit the project site of Zemberek.

System Annotation

The System Annotation Input Processor performs a simple analysis of the sentence texts to set a number of annotations. The decision algorithms are configurable through various properties.
Further customization is possible by sub-classing this Input Processor and overriding one or more of the following methods: decideBinary, decideBrackets, decideEmpty, decideExclamation, decideNonsense, decideQuestion, decideQuote.

This Input Processor works on the sentences passed in, but does not modify them.

Other Considerations

Extra request parameters read by this input processor: (none)
Processing options read by this input processor: (none)
Annotations this input processor may generate:

Annotation | Description
_BINARY | The input consists only of characters specified by the properties binaryCharacters (at least one of them) and binaryIgnoredCharacters (zero or more of them)
_BRACKETPAIR | The input contains at least one matching pair of the bracket characters specified under the property bracketPairCharacters
_EXCLAMATION | The input contains at least one of the characters specified for the property exclamationMarkCharacters
_EM3 | The input contains three or more characters in a row of the characters specified with exclamationMarkCharacters
_EMPTY | The input contains no text / the sentence text is empty
_NONSENSE | The input probably contains nonsense text as configured with the properties consonants, nonsenseThreshold.absolute and nonsenseThreshold.relative
_QUESTION | The input contains at least one of the characters specified in property questionMarkCharacters
_QT3 | The input contains three or more characters specified in property questionMarkCharacters
_QUOTE | The input contains at least one of the characters specified with property quoteCharacters
_DBLQUOTE | The input contains at least one of the characters specified with property doubleQuoteCharacters

Configuration Properties

Consonants
Name | Type | Required | Default
consonants | string | no | BCÇDFGĞHJKLMNPQRSŞTVWXZ bcçdfgğhjklmnpqrsştvwxz

Contains all letters (upper and lower case) that are considered consonants in the language.
Together with the following two properties for absolute and relative nonsense threshold, the defined consonants are used for detecting probable nonsense inputs like kljljljljjlj.

Nonsense Threshold Absolute
Name | Type | Required | Default
nonsenseThreshold.absolute | Positive integer number | no | 6

For nonsense detection, an input consisting exclusively of the defined number of consonants without any non-consonants is considered nonsense.

Nonsense Threshold Relative
Name | Type | Required | Default
nonsenseThreshold.relative | Positive integer number | no | 10

For nonsense detection, an input containing the defined number of consonants in a row is considered nonsense.
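
One possible reading of the two thresholds is sketched below in Python. The sketch treats both values as minimum counts and ignores the space that separates upper and lower case letters in the default consonants value; both are assumptions, and the function name looks_like_nonsense is hypothetical. The actual decideNonsense logic may differ.

```python
CONSONANTS = "BCÇDFGĞHJKLMNPQRSŞTVWXZbcçdfgğhjklmnpqrsştvwxz"


def looks_like_nonsense(text, consonants=CONSONANTS, absolute=6, relative=10):
    """Illustrative nonsense check based on consonant counts."""
    if not text:
        return False
    # Absolute threshold: the input consists of consonants only and
    # contains at least `absolute` of them.
    if all(ch in consonants for ch in text) and len(text) >= absolute:
        return True
    # Relative threshold: at least `relative` consonants occur in a row.
    run = longest = 0
    for ch in text:
        run = run + 1 if ch in consonants else 0
        longest = max(longest, run)
    return longest >= relative
```

With the default values, the example input kljljljljjlj is flagged, while an ordinary word such as merhaba is not.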

Exclamation Mark Characters
Name | Type | Required | Default
exclamationMarkCharacters | string | no | !

List of characters considered exclamation mark in the language where at least one must occur in the input to set the annotation _EXCLAMATION and a sequence of at least three must occur to set the annotation _EM3.

Question Mark Characters
Name | Type | Required | Default
questionMarkCharacters | string | no | ?

List of characters considered question mark in the language where at least one must occur in the input to set the annotation _QUESTION and a sequence of at least three must occur to set the annotation _QT3.
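
The rules for exclamation and question marks can be sketched in Python as follows. The function name punctuation_annotations is hypothetical, and the sketch assumes that "a sequence of at least three" means three or more configured characters directly in a row.

```python
def punctuation_annotations(text, exclamation_chars="!", question_chars="?"):
    """Return the set of punctuation-related annotations for one input."""
    annotations = set()
    rules = (
        (exclamation_chars, "_EXCLAMATION", "_EM3"),
        (question_chars, "_QUESTION", "_QT3"),
    )
    for chars, single, triple in rules:
        # Find the longest uninterrupted run of the configured characters.
        run = longest = 0
        for ch in text:
            run = run + 1 if ch in chars else 0
            longest = max(longest, run)
        if longest >= 1:
            annotations.add(single)
        if longest >= 3:
            annotations.add(triple)
    return annotations
```

For instance, the input "Gerçekten???" would receive both _QUESTION and _QT3 under these assumptions, whereas "Ne?!" would receive _QUESTION and _EXCLAMATION only.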

Double Quote Characters
Name | Type | Required | Default
doubleQuoteCharacters | string | no | "

List of characters considered double quotes in the language; at least one must occur in the input to set the annotation _DBLQUOTE.

Quote Characters
Name | Type | Required | Default
quoteCharacters | string | no |

List of characters considered quote in the language; at least one must occur in the input to set the annotation _QUOTE.

Binary Characters
Name | Type | Required | Default
binaryCharacters | string | no | 01

List of characters to be recognized as binary, sets the annotation _BINARY.

Binary Ignored Characters
Name | Type | Required | Default
binaryIgnoredCharacters | string | no | ! ? , . - ; : # \r \n \t " '

List of characters additionally allowed in binary inputs.
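
The combination of the two binary properties can be illustrated with a short Python sketch. The function name is_binary is hypothetical; the sketch assumes that the spaces in the default value of binaryIgnoredCharacters are only separators, so the space character itself is not part of the ignored set here.

```python
def is_binary(text, binary_chars="01", ignored_chars="!?,.-;:#\r\n\t\"'"):
    """True if the text has at least one binary character and nothing
    other than binary or ignored characters."""
    seen_binary = False
    for ch in text:
        if ch in binary_chars:
            seen_binary = True
        elif ch not in ignored_chars:
            return False
    return seen_binary
```

Under these assumptions, "01-10!" counts as binary, while "!?" (no binary character) and "0102" (a character outside both lists) do not.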

Bracket Pair Characters
Name | Type | Required | Default
bracketPairCharacters | string | no | ( ) [ ] { }

List of bracketing characters of which at least one pair (opening and closing bracket of the same type) must occur in the input to set the annotation _BRACKETPAIR.
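
A simple way to check for a matching pair is sketched below in Python. The function name has_bracket_pair is hypothetical, and the interpretation that a pair matches when an opening bracket is followed somewhere later by the closing bracket of the same type is an assumption; the engine's decideBrackets method may apply stricter rules.

```python
def has_bracket_pair(text, pairs=("()", "[]", "{}")):
    """True if an opening bracket is later followed by its closing bracket."""
    for opening, closing in pairs:
        start = text.find(opening)
        if start != -1 and text.find(closing, start + 1) != -1:
            return True
    return False
```

With this reading, "(evet)" contains a bracket pair, while ")(" and "(]" do not.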

Special System Annotations

The following two special annotations are set by the Teneo Engine. These special system annotations are not related to individual inputs but rather to whole dialogues and are dependent on the session state.

Annotation | Description
_INIT | Indicates session start, i.e., the first input in a dialogue
_TIMEOUT | Indicates the continuation of a previously timed-out session/dialogue

Basic Number Recognizer

The Basic Number Recognizer identifies all Arabic numbers of the type 123 and 3.14 in the user input and annotates each of them with an annotation associated with a variable which holds the actual numeric value of the number found.
The Basic Number Recognizer is language dependent and each language has its own configuration defining the decimal point characters and the thousands separator character to be ignored.

Annotation | Variable | Description
NUMBER | numericValue | Annotation created for identified Arabic numbers in user inputs

For the annotation and its numeric value variable to be added, a number in the user input must meet the following syntax:

  • It must match the regular expression:

    [,]?[0-9]+([,][0-9]+)*([.][0-9]+)?|[.][0-9]+

  • It must be parseable by Java's BigDecimal to ensure it is a number.

The above syntax provides the following guarantees:

  • The sign is not included in the annotated token
  • The numericValue variable contains a BigDecimal representation of the number.

The decimal marker(s) and the thousands separator(s) can be configured; in the above regex, the dot is used as a decimal marker and the comma as a thousands separator.

Configuration properties

Name | Default
decimalMarkers | ,

The default decimal marker in Turkish is the comma.

Name | Default
charactersToIgnore | .

The default character to ignore is the dot.
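
The recognition rules can be approximated with the following Python sketch, which uses decimal.Decimal as a stand-in for Java's BigDecimal. The function names build_number_pattern and find_numbers are hypothetical, as is the way the configured characters are substituted into the regular expression and removed before parsing; the sketch also assumes that both settings are non-empty.

```python
import re
from decimal import Decimal, InvalidOperation


def build_number_pattern(decimal_markers=".", ignored=","):
    """Build the number regex for the given decimal and ignored characters."""
    d = "[" + re.escape(decimal_markers) + "]"
    t = "[" + re.escape(ignored) + "]"
    return re.compile(f"{t}?[0-9]+({t}[0-9]+)*({d}[0-9]+)?|{d}[0-9]+")


def find_numbers(text, decimal_markers=".", ignored=","):
    """Return (token, numeric value) pairs for the numbers found in text."""
    pattern = build_number_pattern(decimal_markers, ignored)
    results = []
    for match in pattern.finditer(text):
        token = match.group(0)
        # Drop ignored characters and use '.' as the decimal point for parsing.
        normalized = "".join(
            "." if ch in decimal_markers else ch
            for ch in token
            if ch not in ignored
        )
        try:
            results.append((token, Decimal(normalized)))
        except InvalidOperation:
            continue  # not parseable as a number, so no annotation
    return results
```

With the Turkish defaults, find_numbers("fiyat 1.234,56 lira", decimal_markers=",", ignored=".") finds the token 1.234,56 with the numeric value 1234.56. A leading sign, as in -3.14, is not part of the matched token.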

Language Detector

The Language Detector uses a machine learning model to predict the language of a given user input and adds an annotation, as seen in below table, to the input together with a confidence score of the prediction.

Annotation | Variable | Description
<language label>.LANG, e.g., %$DA.LANG | Confidence | Annotation created for the predicted language

The Language Detector can predict the following 45 languages; the language label used to create the annotation name is in brackets:

Arabic (AR), Bulgarian (BG), Bengali (BN), Catalan (CA), Czech (CS), Danish (DA), German (DE), Greek (EL), English (EN), Esperanto (EO), Spanish (ES), Estonian (ET), Basque (EU), Persian (FA), Finnish (FI), French (FR), Hebrew (HE), Hindi (HI), Hungarian (HU), Indonesian-Malay (ID_MS), Icelandic (IS), Italian (IT), Japanese (JA), Korean (KO), Lithuanian (LT), Latvian (LV), Macedonian (MK), Dutch (NL), Norwegian (NO), Polish (PL), Portuguese (PT), Romanian (RO), Russian (RU), Slovak (SK), Slovenian (SL), Serbian-Croatian-Bosnian (SR_HR), Swedish (SV), Tamil (TA), Telugu (TE), Thai (TH), Tagalog (TL), Turkish (TR), Urdu (UR), Vietnamese (VI) and Chinese (ZH).

Serbian, Bosnian and Croatian are treated as one language under the label SR_HR, and Indonesian and Malay are treated as one language under the label ID_MS.

A number of regexes are also in use by the Input Processor, helping the model to not predict a language for fully numerical inputs, URLs or other types of nonsense input.

The Language Detector creates an annotation when the confidence of the prediction is above the threshold of 0.2. For the following languages, however, language annotations are always created (even for predictions below 0.2), since the Language Detector is mostly accurate when predicting them: Arabic, Bengali, Greek, Hebrew, Hindi, Japanese, Korean, Tamil, Telugu, Thai, Chinese, Vietnamese, Persian and Urdu.
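
The threshold rule can be summarized in a few lines of Python. The function name language_annotation and the return format are hypothetical; only the labels, the threshold and the list of always-annotated languages come from the description above.

```python
# Labels of the languages that are annotated regardless of confidence.
ALWAYS_ANNOTATED = {
    "AR", "BN", "EL", "HE", "HI", "JA", "KO",
    "TA", "TE", "TH", "ZH", "VI", "FA", "UR",
}


def language_annotation(label, confidence, threshold=0.2):
    """Return (annotation name, confidence) or None if nothing is annotated."""
    if label in ALWAYS_ANNOTATED or confidence > threshold:
        return (label + ".LANG", confidence)
    return None
```

A Turkish prediction with a confidence of 0.1 would therefore produce no annotation, while a Japanese prediction with a confidence of 0.05 would still be annotated as JA.LANG.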

Predict

The Predict Input Processor makes use of an intent model generated when classes are available in a Teneo Studio solution to annotate user inputs with the defined classes; intent models can be generated either with Teneo Learn or CLU. Note that as of Teneo 7.3, deferred intent classification is applied and annotations are only created by Predict if references to class annotations are found during the input matching process.

When Predict receives a user input, confidence scores are calculated for each class based on the model. Annotations are then created for the most confident class and for each other class that matches the following criteria:

  • the confidence is above the minimum confidence (defaults to 0.01)
  • the confidence is higher than 0.5 times the confidence value of the top class.

For each selected class, an annotation with the scheme <CLASS_NAME>.INTENT is created, with the value of the model's confidence in the class as well as an annotation variable specifying the used classifier (i.e., Learn, CLU or LearnFallback) and an Order variable defining the order of the selected classes (i.e., 0 for the class with the highest confidence score and 4 for the selected class with the lowest confidence score).
A special annotation <CLASS_NAME>.TOP_INTENT is created for the class with the highest confidence score.

Annotation | Variable | Variable | Variable | Description
<CLASS_NAME>.TOP_INTENT | classifier | confidence | | Annotation created for the class with the highest confidence score
<CLASS_NAME>.INTENT | classifier | confidence | Order | Annotation given to each selected class with a maximum of five top classes

The Predict Input Processor creates a maximum of 5 annotations, regardless of how many classes match the criteria.
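
The selection rules can be sketched in Python as follows. The function name select_intents, the dictionary-based annotation format and the class names in the usage example are hypothetical; the default values correspond to the configuration properties described in the next section.

```python
def select_intents(scores, classifier="Learn", min_confidence=0.01,
                   similarity=0.5, max_annotations=5):
    """Pick the classes to annotate from a {class name: confidence} mapping."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return []
    top_confidence = ranked[0][1]
    # The most confident class is always selected.
    selected = [ranked[0]]
    for name, confidence in ranked[1:]:
        if confidence > min_confidence and confidence > similarity * top_confidence:
            selected.append((name, confidence))
    annotations = []
    for order, (name, confidence) in enumerate(selected[:max_annotations]):
        if order == 0:
            annotations.append({"name": name + ".TOP_INTENT",
                                "classifier": classifier,
                                "confidence": confidence})
        annotations.append({"name": name + ".INTENT",
                            "classifier": classifier,
                            "confidence": confidence,
                            "Order": order})
    return annotations
```

For the scores BOOK_FLIGHT 0.7, CANCEL 0.4 and GREETING 0.3, the sketch selects BOOK_FLIGHT (as TOP_INTENT and as INTENT with Order 0) and CANCEL (INTENT with Order 1); GREETING is dropped because 0.3 is below 0.5 x 0.7 = 0.35.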

Configuration Properties

Min Confidence Similarity Distance
Name | Type | Required | Default
minConfidenceSimilarityDistance | float | no | 0.5

Fraction of the top class's confidence that another class must exceed in order to be considered, e.g., if the top class has a confidence of 0.7, classes with a confidence lower than 0.5 x 0.7 = 0.35 will be discarded.

Max Number of Annotations
Name | Type | Required | Default
maxNumberOfAnnotations | int | no | 5

The maximum number of class annotations created for each user input.

Min Confidence Threshold
Name | Type | Required | Default
minConfidenceThreshold | float | no | 0.01

The minimum value of confidence a model must have for a class in order to add it as one of the candidate annotations.

Intent Model File Name
Name | Type | Required | Default
intent.model.file.name | string (filename) | no | inexistent

Name of the file containing the machine learning model; it is usually set automatically by Teneo Studio so no configuration is required.