Modulo:spec-splitter-en

--[[

Specification of the lemma splitter syntax
==========================================

About
-----

The lemma is read from the pagename. The splitter can divide it
into useful fragments and link to those, and optionally the page
can be added to compound (morpheme) categories for them.
A split control parameter and an extra parameter are available.

Split strategies
----------------

The 7 split and processing strategies selectable with
the split control parameter sent to the template are:

- automatic multiword split (default)
- assisted split (assisted automatic multiword split)
- manual split
- simple root split (mortyp:s N + U)
- simple bare root (mortyp M or N)
- large letter split (mortyp M)
- no split

Incoming information
--------------------

- boolean information whether compound categories are desired
- lemma (may NOT be empty)
- split control parameter ("fra=", may be empty)
- extra parameter ("ext=", may be and usually is empty)
- language stuff (code and some variants of language name)
- word class (reduced to 2 questions)

The extra parameter creates extra category includes, but
does NOT affect the actual split.

The language stuff is built into category names, but it does
NOT affect the actual split. The splitter is language-independent.

The word class is reduced to 2 boolean YES-vs-NO questions:
- "Is it KA ie sentence?" (affects category name in some cases)
- "Is it NR ie nonstandalone root?" (affects mortyp for "simple bare
  root" strategy)
Thus the word class is needed but does NOT affect the actual split.

Output
------

- wikitext intended to be sent to screen, usually with
  wikilinks ie [[...|...]]
- list of category names without prefix ("Category:"), without rectangular
  brackets, and without sorting hint text
- list of bool values parallel to the list of category names revealing which
  category includes are to be formed as main page of the category by means
  of the sorting hint "|-"

Overall limits
--------------

- length of the incoming lemma : 1...120 octet:s
- total number of compound categories created : 18 (this is less than 16 + 4)

Limits for the split control parameter
--------------------------------------

- length of the split control parameter : 1...120 octet:s (example: "-")

Limits for the extra parameter
------------------------------

- length of the extra parameter : 2 or 5...120 octet:s (example:
  "&X" or "[U:u]")
- number of extra fragments : min 1 max 4 (unless "&"-syntax is used)

Limits for assisted split
-------------------------

- length of the split control parameter : 2...120 octet:s (example: "%0")
- length of explicitly provided link target : 1...40 octet:s
- number of blocked input boundaries : max 8
- number of accessible output fragments : max 16 (numbered "0"..."F")

Limits for manual split
-----------------------

- length of the split control parameter : 4...120 octet:s (example: "b[o]")
- number of output fragments : min 1 or 2 and max 16 (numbered "0"..."F")

Split control parameter
-----------------------

The split control parameter is evaluated only if splitting is globally
enabled in the module; otherwise it is ignored without error and the
splitter is not called at all.

Base syntax of the split control parameter:

- special value "-" : no split
- sequence of tuning commands, beginning
  with "%" or "#" : assisted split
- special value "$S" : simple root split
- special value "$B" : simple bare root
- special value "$H" : large letter split
- any other content : considered as request for manual split

No split
--------

No split can be achieved by the special value "-". The raw lemma is shown,
there are no links, and no compound (morpheme) category includes are created
automatically, but the extra parameter is still available.

Automatic multiword split and assisted split
--------------------------------------------

If the lemma is multiword then it is automatically split at detected split
boundaries. Such a boundary consists of one or multiple qualifying char:s,
namely space and punctuation (5 char:s: ! , . ; ?). Note that particularly
dash "-" and apo "'" do NOT count as punctuation, thus for example words
"berjalan-jalan" or "o'clock" will remain together (but "there's" will
too, see below). The fragments are linked by default but there are various
options to tune the linking.
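
For illustration, a minimal Lua sketch of the default boundary detection
(a hypothetical helper working on plain ASCII patterns, not the module's
actual code):

    local function autoSplit(lemma)
      local fragments = {}
      -- a boundary is a run of one or more of: space ! , . ; ?
      for fragment in string.gmatch(lemma, "[^ !,.;?]+") do
        table.insert(fragments, fragment)
      end
      return fragments   -- "Yes we can." -> { "Yes", "we", "can" }
    end

Dash "-" and apostrophe "'" are deliberately absent from the qualifying
set, which is why "o'clock" and "berjalan-jalan" stay in one piece.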

Assisted split
--------------

During the split, 2 separate ZERO-based counters are maintained,
and the commands in the split control parameter refer to them.

- input boundary counter : Counts boundaries between incoming words forming
  the lemma. Multiple consecutive qualifying char:s count as one boundary,
  this applies even to leading and trailing position. For example the text
  "Apples, ? bananas and beer." contains 4 boundaries numbered from 0 to 3,
  the string ", ? " (4 char:s) receives index 0. The string "?va?" contains
  2 boundaries.

- output fragment counter : Counts generated fragments. For example
  "pembangkit listrik tenaga surya" contains 3 boundaries (see above) and will
  by default generate 4 fragments. If you disable breaking at boundaries 0 and
  2, then the result will be only 2 fragments "pembangkit listrik" and "tenaga
  surya" instead, numbered 0 and 1.

The counters are referenced by one-digit numbers, "0" to "9", and "A" to "F"
(must be uppercase) for rarely needed indexes "10"..."15", thus actually
hex numbers.
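
A minimal Lua sketch of the two helpers implied above (assumed names,
ASCII only): counting the input boundaries and converting a one-digit
hex index:

    local function countBoundaries(lemma)
      -- consecutive qualifying char:s collapse into ONE boundary,
      -- including leading and trailing runs
      local n = 0
      for _ in string.gmatch(lemma, "[ !,.;?]+") do
        n = n + 1
      end
      return n   -- "Apples, ? bananas and beer." -> 4, "?va?" -> 2
    end

    local function hexIndex(c)
      if not string.match(c, "^[0-9A-F]$") then
        return nil              -- lowercase "a".."f" is rejected
      end
      return tonumber(c, 16)    -- "0".."9" -> 0..9, "A".."F" -> 10..15
    end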

For assisted split the split control parameter contains a sequence of
tuning commands separated by spaces, or even only one command:

- "%" followed by 1...8 ascending hex digits : do not split at listed
          input word boundaries
- "#" followed by a hex digit followed by "N" or "I" or "A" : tune
          at pointed output fragment index
  - "N" do not link the fragment (this blocks the categorization too)
  - "I" convert beginning letter to lowercase ("I" minusklo) for link target
  - "A" convert beginning letter to uppercase ("A" majusklo) for link target
- "#" followed by a hex digit followed by colon ":" followed by
          string : link to that target instead

The "#"-items (ZERO or ONE or more permitted) must be ascending but need not
to be consecutive, and they must follow the single "%"-item if it is present.

For example "%3A #2N #5A #7N #8:test" will:

- avoid breaking at input boundaries 3 and 10
- avoid linking of fragments 2 and 7
- link fragment 5 to target with uppercase letter
- link fragment 8 to "test"

The most common use will be "#0I", fixing the case of the word at the
beginning of a sentence lemma; for example "Yes we can." will then
link to "yes", not to "Yes", besides "we" and "can".

Boundary and fragment positions that are too high are silently ignored,
but other mistakes do result in an error, most notably:

- messing up the order of "%" and "#", ie putting "#" before "%",
  for example "#2N %3A #5A #7N #8:test"
- numbers after "%" are not ascending, for example "%A3 #2N #5A #7N #8:test"
- "#"-items are not ascending, for example "%3A #2N #5A #7N #6:test"
- invalid char:s or missing spaces, for example "%3A #2N#5A #7N #8:test"

A lemma containing more boundaries or fragments than addressable does not
cause an error, but those with index >= 16 can no longer be tuned.
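
A rough Lua sketch of how such a command sequence could be parsed and
validated (hypothetical parser; all errors are reduced to returning nil):

    local function parseTuning(param)
      local blocked, tune = {}, {}     -- boundary indexes / fragment tweaks
      local seenPercent, lastHash = false, -1
      for token in string.gmatch(param, "%S+") do
        local digits = string.match(token, "^%%([0-9A-F]+)$")
        if digits then
          -- at most one "%"-item, before any "#"-item, max 8 digits
          if seenPercent or lastHash >= 0 or #digits > 8 then return nil end
          seenPercent = true
          local prev = -1
          for c in string.gmatch(digits, ".") do
            local i = tonumber(c, 16)
            if i <= prev then return nil end      -- must be ascending
            prev = i
            blocked[i] = true
          end
        else
          local d, rest = string.match(token, "^#([0-9A-F])(.+)$")
          if not d then return nil end            -- invalid char:s etc.
          local i = tonumber(d, 16)
          if i <= lastHash then return nil end    -- "#"-items ascending
          lastHash = i
          if rest == "N" or rest == "I" or rest == "A" then
            tune[i] = rest
          else
            local target = string.match(rest, "^:(.+)$")
            if not target then return nil end
            tune[i] = target                      -- explicit link target
          end
        end
      end
      return blocked, tune
    end

    -- parseTuning("%3A #2N #5A #7N #8:test") returns
    -- blocked = { [3]=true, [10]=true } and
    -- tune = { [2]="N", [5]="A", [7]="N", [8]="test" },
    -- while the four bad examples above all return nil.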

Manual split
------------

The manual split shares some ideas with the "very raw manual split" carried
out with wiki syntax, but there are some crucial differences. Fragments are
enclosed in single rectangular brackets (as opposed to double ones in "very
raw manual split"), slash "/" is the primary field separator instead of wall
"|", and there is a secondary (and early) separator denoted by colon ":".

There is a "sum check" feature ensuring that the visible text (the sum, ie
concatenation, of all fragments) still equals the lemma, otherwise an error
occurs. Moving (ie renaming) a lemma page where manual split is applied will
inevitably generate an error and force the contributor to adjust the split
control parameter.

Besides the usual syntax with one field separator (here slash instead of
wall) there is a syntax with one colon, and a syntax with both one colon and
one slash allowing the morpheme type of the fragment to be specified. Spaces
are permitted in the lemma and to some degree in the predefined split
fragments. It is possible to apply manual split to multiword lemmas, but in
practice this is rarely useful (see above and below, automatic multiword
split or assisted split is much better). If the morpheme type is specified
but the link target is not, then the splitter will construct the link target
(see below). It is permitted to use the plus "+" sign as a fragment separator
(adjacent to a rectangular bracket, except at the very beginning or end of
the split control parameter); it will be visible but is excluded from the
"sum check".

Types of fragments:

- F000 : no brackets, no colon, no slash (visible text no link)
- F200 : 2 brackets, no colon, no slash (combo field target visible text)
- F201 : 2 brackets, no colon, 1 slash (target / visible text)
- F210 : 2 brackets, 1 colon, no slash (mortyp : combo field
           target visible text)
- F211 : 2 brackets, 1 colon, 1 slash (mortyp : target / visible text)
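
A small Lua sketch of how a fragment can be classified into these types
(hypothetical helper; the colon rule follows "Colon evaluation" below):

    local function fragmentType(frag)
      if string.sub(frag, 1, 1) ~= "[" then
        return "F000"                            -- bare visible text
      end
      -- an uppercase letter plus colon right after "[" selects a mortyp
      local hasColon = string.match(frag, "^%[%u:") ~= nil
      local hasSlash = string.find(frag, "/", 1, true) ~= nil
      if hasColon then
        return hasSlash and "F211" or "F210"
      end
      return hasSlash and "F201" or "F200"
    end

    -- fragmentType("blah")           --> "F000"
    -- fragmentType("[blah]")         --> "F200"
    -- fragmentType("[blah/Blah]")    --> "F201"
    -- fragmentType("[I:il]")         --> "F210"
    -- fragmentType("[P:blah-/Blah]") --> "F211"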

Deleted substrings:

Arc brackets are permitted to some degree in the visible text and combo
field for types F200 F201 F210 F211, allowing deleted substrings (usually
single letters, even a single space) to be specified, but they are
prohibited in the target field (left of the slash "/"). Such deleted
substrings are excluded from the "sum check", and with pseudo mortyp "L"
(see below) from categorization, but never from linking.
Spaces are permitted with some restrictions:

- a field may not begin nor end with a space
  ("[U:-are /ar(e)]" is bad)
- a deleted substring may not begin nor end with a space
  ("[M:loep( a)]" is bad)
- deleted spaces are prohibited after "L:" but otherwise
  permitted ("[L:fingr( )]" is bad but "[M:kereta( )api]" is good)

Special magic features of the type F210:

- automatic dash adding for mortyp:s I P U for linking and categorization
- pseudo mortyp "L" (-> "N")

Note that omitting deleted letters for the "sum check" is NOT restricted to
the fragment type F210, and it is performed much earlier during prevalidation
of the split control parameter.
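
A minimal Lua sketch of the "sum check" itself (hypothetical helper): the
right fields of all fragments are concatenated, the arc-bracketed deleted
substrings are dropped, and the result must equal the lemma exactly (plus
signs are never fed in at all):

    local function sumCheck(lemma, rightFields)
      local fed = table.concat(rightFields)
      fed = string.gsub(fed, "%b()", "")   -- drop "(...)" deleted parts
      return fed == lemma
    end

    -- sumCheck("perkeretaapian", { "per", "kereta( )api", "an" }) --> true
    -- sumCheck("perkeretaapian", { "per", "kereta api", "an" })   --> false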

Restrictions:

- at least 2 fragments, or ONE fragment if it contains a slash (ONE fragment
  with colon but without slash is NOT sufficient)
- max 16 fragments
- two fragments of type F000 may not follow each other
- leading and trailing spaces are prohibited inside rectangular brackets (but
  spaces inside lemma parts are possible), ie leading and trailing spaces are
  prohibited except for type F000
- leading and trailing spaces are prohibited inside arc brackets
  if "L" is used with type F210
- empty content of a field (left and right of slash "/") is prohibited

List of 6+1+1+1 selectable morpheme type codes:

C  circumfix           cirkumfikso
I  infix               infikso (-eo-: -o- -et- -ist- | -en-: -fucking-)
M  standalone root     memstara radiko (-eo-: tri dek post | -en-: hole)
N  nonstandalone root  nememstara radiko (-eo-: fer voj | -en-: lingon)
P  prefix              prefikso
U  suffix          sufikso (postfikso, finajxo) (-eo-: -a -j -n | -en-: -ist)
-------
W  word                vorto
-------
L  same as "N" but changes categorization behavior (only in F210, see below)
-------
X  only after "&" in the extra parameter (converted to M plus W and pagename)

These mortyp:s can be used in the split control parameter before the colon
":" with manual split, and in the extra parameter (see below), either after
"&" or in fragments before ":" or "!"; in the extra parameter, however, "L"
is prohibited (leaving C I M N P U W, plus possibly X).

The default mortyp with manual split is none (link but do not put the page
into any compound (morpheme) category) as long as nothing else is specified.
These types are partially ignored if compound categories are not desired
at all for some reason, but affix types I P U still affect the linking.

The letter "L" belongs to same morpheme type as "N" and is categorized
as "N" but changes the categorization behavior of the splitter. It is to be
used in fragment type F210 only together with deleted letters (not spaces)
and causes the long form to be linked but the short form categorized (this
cannot be achieved by any other means with the split control parameter
only, code "N" links and categorizes the long form only, but can with
the extra parameter), and the short form is fed into the "sum check"
(long time earlier). This is useful with Esperanto where most of the
vocabulary consists of nonstandalone roots. Instead of for example
"[N:pomo/pom(o)][N:arbo/arb(o)][U:o]" (complicated and wrongly categorizing
"pomo" instead of intended "pom") we can write "[L:pom(o)][L:arb(o)][U:o]"
resulting in lemma links to "pomo", "arbo" and "-o" and categories
type "N" of "pom", type "N" of "arb", and type "U" of "o".

Colon evaluation:

- a colon is regarded as a control char (and can cause an error) only if:
  - it is preceded by an uppercase letter ("A"..."Z")
  and
  - those 2 char:s are located at the beginning of a fragment and inside [...]
- otherwise it is considered to be an ordinary letter

For example:

- "+[M:crap]" is regarded and valid (although maybe useless)
- "+[A:crap]" is regarded and an error
- "+[m:A:crap]" and "A:crap" is maybe nonsense but ignored
  and not an error against this specification

Examples of legal syntax of fragments:

- "blah"              -- F000 show "blah" no link
- " blah"             -- F000
- " "                 -- F000

- "[blah]"            -- F200 show "blah" link to "blah"

- "[blah/Blah]"       -- F201 show "Blah" link to "blah"
- "[bl ah/Bl ah]"     -- F201 show "Bl ah" link to "bl ah" (inner spaces
                              are legal)

- "[I:il]"            -- F210 show "il" link to "-il-" categorize
                              "-il-" as mortyp "I" and feed only "il"
                              to the "sum check"
- "[M:preter]"        -- F210 show "preter" link to "preter" categorize
                              "preter" as mortyp "M" and feed "preter"
                              to the "sum check"
- "[M:(k)irim]"       -- F210 show "(k)irim" link to "kirim" categorize
                              "kirim" as mortyp "M" and feed only "irim"
                              to the "sum check"
- "[M:kereta( )api]"  -- F210 show "kereta( )api" link to "kereta api"
                              categorize "kereta api" as mortyp "M" and feed
                              only "keretaapi" to the "sum check"
- "[L:kat(o)]"        -- F210 show "kat(o)" link to "kato" categorize only
                              "kat" as mortyp "N" and feed only "kat"
                              to the "sum check"

- "[P:blah-/Blah]"    -- F211 show "Blah" link to "blah-" categorize
                              "blah-" as mortyp "P" and feed "Blah"
                              to the "sum check"

Examples of dubious syntax (not against this specification but not useful):

- "[blah/blah]"       -- F201 unnecessary slash ("target" = "visible text",
                              use "[blah]" instead)
- "[blah-/blah]"      -- F201 show "blah" link to "blah-"
                              ("[P:blah]" is usually better)

- "[N:kat(o)]"        -- F210 show "kat(o)" link to "kato" categorize
                              "kato" as mortyp "N" and feed "kat"
                              to the "sum check" (dubious effect,
                              "[L:kat(o)]" is probably what we intended)

- "[P:blah-/blah]"    -- F211 show "blah" link to "blah-" and select mortyp
                              "P" (unnecessary, "[P:blah]" is sufficient)
- "[N:kato/kat(o)]"   -- F211 show "kat(o)" link to "kato" categorize "kato"
                              as mortyp "N" and feed only "kat" to the
                              "sum check", this does have same effect as
                              "[N:kat(o)]" but not as "[L:kat(o)]" (dubious
                              effect, "[L:kat(o)]" is probably
                              what we intended)
- "[N:kat/kat(o)]"    -- F211 show "kat(o)" link to only "kat" categorize
                              only "kat" as mortyp "N" and feed only "kat"
                              to the "sum check", this does not have
                              same effect as "[L:kat(o)]" (dubious effect,
                              "[L:kat(o)]" is probably what we intended)
- "[M:kirim/(k)irim]" -- F211 show "(k)irim" link to "kirim" categorize
                              "kirim" as mortyp "M" and feed only "irim"
                              to the "sum check", this does have
                              same effect as "[M:(k)irim]" (unnecessary,
                              "[M:(k)irim]" is sufficient)
- "[I:il/il]"         -- F211 show "il" link to "il" categorize
                              "il" as mortyp "I" and feed "il"
                              to the "sum check", no auto-added slashes here
                              (dubious effect, "[I:il]" is probably what
                              we intended)

Examples of illegal syntax of fragments:

- "[[blah/Blah]]"  -- double or multiple brackets
- "[blah/Bl[a]h]"  -- nested or unbalanced brackets
- "[blah|Blah]"    -- illegal genuine wall
- "[blah/Blah ]"   -- illegal space
- "[ blah/Blah]"   -- illegal space
- "[blah/]"        -- illegal empty content (invisible link makes no sense)
- "[/blah]"        -- illegal empty content (use "blah" instead)
- "[M:/blah]"      -- illegal empty content (use "M:blah" instead)
- "[N:kat(o)/kat]" -- arc brackets are illegal in the target field
- "[L:katrol]"     -- "L" used but no arc brackets
- "[L:kat/kat(o)]" -- "L" used with type F211 ie 2 fields
- "[L:kat(r )ol]"  -- "L" used with leading or trailing space
                      inside arc brackets
- "[A:blah-/blah]" -- illegal mortyp (only some selected uppercase
                                      letters are permitted)

Examples of complete syntax of the split control parameter:

lemma "pertidaksamaan"
  -> [C:per-...-an/per][M:tidak][M:sama][C:per-...-an/an]
lemma "perkeretaapian"
  -> [C:per-...-an/per][M:kereta][M:api][C:per-...-an/an]
  -> [C:per-...-an/per][M:kereta( )api][C:per-...-an/an]
  -> [C:per-...-an/per]+[M:kereta( )api]+[C:per-...-an/an]
lemma "mengirim"       -> [P:meN-/meng][M:(k)irim]    (dash "-" is required)
lemma "icke-binaer"    -> [P:icke-][M:binaer]
lemma "kingdom"        -> [M:king][U:dom]
lemma "kingdom"        -> [M:king]+[U:dom]
lemma "hallon"         -> hall[U:on]
lemma "hallon"         -> hall+[U:on]
lemma "God"            -> [god/God]                            (no category)
lemma "tridek"         -> [M:tri][M:dek]
lemma "tridek"         -> [M:tri]+[M:dek]
lemma "tridek"         -> [tri][dek]                           (no category)
lemma "fervojo"        -> [L:fer(o)][L:voj(o)][U:o]
lemma "kungadoeme      -> [M:kung]+a+[M:doeme]

Examples of dubious syntax (not against this specification but not useful):

lemma "perkeretaapian"
  -> [C:per-...-an/per][M:keretaapi][C:per-...-an/an]
     (links to wrong morpheme "keretaapi")
  -> [C:per-...-an/per][M:kereta api/kereta( )api][C:per-...-an/an]
     (unnecessarily complicated)
lemma "mengirim"
  -> [P:meng][N:irim]                               (links to wrong morphemes)
  -> [P:meng][irim]         (links to wrong morphemes, "irim" not categorized)
  -> [P:meN-/men][M:kirim/girim]    (shows one dubious and one wrong morpheme)
  -> [meN-/meng][M:(k)irim]                        ("meN-" is not categorized)
lemma "hallon"
  -> hall +[U:-on/on]                           (junk space precedes the plus)
  -> hall[U:-on/on]                  (unnecessary, "hall[U:on]" is sufficient)
  -> hall+[U:-on/on]                (unnecessary, "hall+[U:on]" is sufficient)
lemma "God"      -> [W:god/God]                              (better use "$B")
lemma "God"      -> [M:god/God]                              (better use "$B")
lemma "dek tri"  -> [dek] [tri]                         (better use automatic)
lemma "dek tri"  -> [W:dek] [W:tri]                     (better use automatic)

Examples of illegal syntax of the split control parameter:

lemma "perkeretaapian"
  -> [C:per-...-an/per][M:kereta api][C:per-...-an/an] (fails the "sum check")
lemma "icke-binaer"
  -> [P:icke][M:binaer]                                (fails the "sum check")
lemma "mengirim"
  -> [P:meN][M:kirim/girim]                            (fails the "sum check")
lemma "hallon"      -> hall++[U:-on/on]                          (double plus)
lemma "hallon"      -> hall+ [U:-on/on]           (junk space beside the plus)
lemma "hallon"      -> +hall[U:-on/on]                          (leading plus)
lemma "hallon"      -> hall[U:-on/on]+                         (trailing plus)
lemma "hallon"      -> ha+ll[U:-on/on]     (plus is not adjacent to a bracket)

Field processing:

The left field (left of the slash but after a possible colon) is for the link
and the category. The right one (right of the slash) is shown on the screen
and fed into the "sum check". If only one field is given (combo field), then
it is assumed to be primarily the right one, and the left one is
auto-generated by copying the right one and removing the brackets but not
the content between them (for example "fer(o)" -> "fero"). If the morpheme
type is specified but the link target (left field) is not, thus the fragment
type is F210, then the splitter will in some cases (I P U) additionally
enhance the left field by adding "-" if it is not yet present (for example
for I: "il" -> "-il-", and for P: "re" -> "re-" and "icke-" -> "icke-").
The strings used for link and category are always exactly the same, except
when pseudo-type "L" is used (note that the simple root split also breaks
this principle). Other rules apply to the right field: the string is
displayed literally, but the bracketed part is removed before it is fed into
the "sum check", allowing deleted letters and even deleted spaces to be
displayed.
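
A Lua sketch of this field processing for the combo field of type F210
(hypothetical helper; pseudo-type "L" and type F211 are handled elsewhere):

    local function linkTarget(mortyp, combo)
      -- remove the arc brackets but keep their content: "fer(o)" -> "fero"
      local target = string.gsub(combo, "[()]", "")
      if mortyp == "P" and string.sub(target, -1) ~= "-" then
        target = target .. "-"                     -- prefix: "re" -> "re-"
      elseif mortyp == "U" and string.sub(target, 1, 1) ~= "-" then
        target = "-" .. target                     -- suffix: "on" -> "-on"
      elseif mortyp == "I" then                    -- infix:  "il" -> "-il-"
        if string.sub(target, 1, 1) ~= "-" then target = "-" .. target end
        if string.sub(target, -1) ~= "-" then target = target .. "-" end
      end
      return target
    end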

Simple root split
-----------------

The syntax "$S" selects the simple root split strategy. It is fully
automatic and cannot be tuned in any way.

The pagename must consist of at least 2 letters and the last one must
be ASCII lowercase "a"..."z", otherwise an error is triggered.

The last letter of the lemma is separated. The remaining root, with its
beginning letter changed to lowercase if needed, is used to brew the category
include with type "N" (nonstandalone) and as the main page of the category.
If the beginning letter was uppercase, then a link to the full lemma with its
beginning letter changed to lowercase is created, otherwise there is no link
(because it would be a self-link). The last letter becomes a dashed suffix,
is linked, and its category include receives the type "U".
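
A simplified ASCII-only Lua sketch of the "$S" strategy (helper name and
return shape are assumptions; the real pagenames are UTF-8):

    local function simpleRootSplit(lemma)
      local last = string.sub(lemma, -1)
      if #lemma < 2 or not string.match(last, "^[a-z]$") then
        error("simple root split: last letter must be ASCII lowercase")
      end
      local root = string.sub(lemma, 1, -2)                 -- "Suno" -> "Sun"
      local lowerRoot = string.gsub(root, "^%u", string.lower)      -- "sun"
      return {
        -- category include of type "N", main page of the category
        rootCategory = lowerRoot,
        -- link only if the case actually changed, otherwise it would be
        -- a self-link ("suno" -> no link, "Suno" -> link to "suno")
        rootLink     = (lowerRoot ~= root) and (lowerRoot .. last) or nil,
        -- the last letter becomes a dashed, linked suffix of type "U"
        suffixLink   = "-" .. last,
      }
    end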

This is intended for Esperanto words built from and representing nonstandalone
roots including proper nouns having the lowercase variant (for example "Suno"
and "suno") (see below under "Examples and selecting the optimal strategy").

This assumes that all Esperanto roots are denoted with lowercase letters,
for example "sun" for "suno" and "Suno" and there is no root "Sun".

For Esperanto proper nouns without any lowercase variant the choice is no
split or manual split, plus, if desired, the extra parameter to categorize
the main root (see below the example word "GXakarto" under "Examples and
selecting the optimal strategy").

Simple bare root
----------------

The syntax "$B" selects the simple bare root strategy. It is fully
automatic and cannot be tuned in any way. But it depends on the word
class, or more precisely whether it is "NR" ie nonstandalone root.

The root with beginning letter changed to lowercase (unless "NR") if needed
is used to brew the category include with type "M" (standalone) or "N"
(nonstandalone, if "NR") and as the main page of the category. If the
beginning letter was uppercase, then a link to the lemma with beginning
letter changed to lowercase is created, otherwise (and with "NR") there
is no link (because it would be a self-link).

This is intended for standalone roots including proper nouns having
the lowercase variant (for example "Sun" and "sun") (see below under
"Examples and selecting the optimal strategy").

Note that "fra=$B" has very same effect as "ext=&M" if the word class is
not "NR" and the lemma does not begin with uppercase, and "fra=$B" has very
same effect as "ext=&N" if the word class is "NR" irrespective letter case.
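
A corresponding Lua sketch of the "$B" strategy (assumed helper; isNR is
the "nonstandalone root?" answer from the word class, see "Incoming
information" above):

    local function simpleBareRoot(lemma, isNR)
      local name = lemma
      if not isNR then
        name = string.gsub(lemma, "^%u", string.lower)      -- "Sun" -> "sun"
      end
      return {
        category = { type = isNR and "N" or "M", name = name, main = true },
        -- link only when the case changed, otherwise it would be a self-link
        link     = (name ~= lemma) and name or nil,
      }
    end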

Large letter split
------------------

The syntax "$H" selects the large letter split strategy. It is fully
automatic and cannot be tuned in any way.

The lemma is split into single letters. This is most useful for, but not
restricted to, Chinese characters. The lemma must not contain punctuation,
spaces, dash "-", apostrophe "'" or so-called combining accents, and it may
be at most 16 characters long. Decorated Latin letters such as Swedish
or Esperanto ones are tolerable but probably not useful to feed into
this split.

The resulting single letters are linked and categorized as mortyp "M". If
another mortyp is needed, then the manual split must be used instead.
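
A Lua sketch of the "$H" strategy (assumed helper; a plain UTF-8 byte
pattern stands in for the wiki's own Unicode utilities, and the checks for
punctuation, spaces and combining accents are omitted):

    local UTF8_CHAR = "[%z\1-\127\194-\244][\128-\191]*"

    local function largeLetterSplit(lemma)
      local letters = {}
      for ch in string.gmatch(lemma, UTF8_CHAR) do
        table.insert(letters, ch)
      end
      if #letters > 16 then
        error("large letter split: at most 16 characters")
      end
      return letters   -- each letter is linked and categorized as "M"
    end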

Words with only one fragment
----------------------------

For standalone words that are also roots (for example "sun") use the
simple bare root strategy with syntax "$B". This will categorize the
page as type "M" and main page of the category (sorting hint "-") but
not link.

In Esperanto it might be desirable to have the lemma (for example "suno")
with its "native" suffix but the category without it, this can be achieved
by the "$S"-syntax and simple root split. This will categorize the page as
type "N" and main page of the category (sorting hint "-") but not link.

This works even for standalone words differing from the root by case
(for example "Sun" or "Suno"). This will categorize in very same way but
link to "sun" or "suno".

Thus both "sun" and "Sun" will be main pages of the category under "-" as
opposed to ordinary words as for example "sunshine" under "S" and
(theoretically) "insunity" under "I".

For Esperanto standalone roots (prepositions, numerals, subordinators, some
adverbs, ...) "$B" is the preferred solution.

For non-Esperanto proper nouns without any lowercase counterpart the choice
is no split, plus, if desired, the extra parameter to categorize the root
(see below the example word "Inverness" under "Examples and selecting the
optimal strategy").

For affix lemmas (for example "meN-", "-kan", "-il-") use no split and the
extra parameter with "&"-syntax. This will categorize the page as selected
type (for example "I") and main page of the category (sorting hint "-").

Syntax of the extra parameter
-----------------------------

The extra parameter is evaluated only if compound (morpheme) categories
are desired (globally enabled in the module and further conditions met);
otherwise it is ignored without error, but the splitter may still be
called and generate links.

The syntax of the extra parameter is similar to the syntax of the split
control parameter requesting manual split, but there are some crucial
differences and restrictions, and 3 enhancements ("!" and "&" and "X").

The difference is that nothing is visible, nothing is linked and there is no
"sum check". The purpose of the extra parameter is to create extra compound
(morpheme) categories. For example with the split control parameter we can
link the lemma "perkeretaapian" to either "kereta api" or "kereta" and "api"
but never both. To get both we need the extra parameter.

The extra parameter consists either of 1 to 4 fragments similar to those
for manual split, or 2 char:s of a special value.

Restrictions for fragments:
- only type F210 (2 brackets, 1 colon/exclam, no slash) is permitted
- morpheme type must be specified, L is prohibited, only C I M N P U W left
- no arc brackets "(" ")", no plus "+", no slash "/"

Enhancement for fragments:
- separator after morpheme type can be not only colon ":" but also
  exclam "!" to request main page in category

Special value:
- char "&" followed by mortyp code (pseudo type "X" permitted here) to add
  page to compound (morpheme) category by pagename and as main page

The char "&" followed by a valid uppercase letter (one of 8, same as for the
fragment syntax plus "X", but obviously not "L") creates one or two extra
compound (morpheme) category includes. The type of the morpheme must be
specified (see the list above), namely one of "&C" "&I" "&M" "&N" "&P" "&U"
"&W" creating one include, and "&X" creating two, namely types "M" and "W".
Note that combinations other than "M" + "W" are obviously useless, and
that pseudo type "L" is prohibited. The lemma is marked as main page of the
category (key "-"). The "&"-syntax is useful for affix lemmas (note that
dashes "-" must come from the lemma itself then, the splitter does not add
any), whereas for standalone and nonstandalone roots the simple bare root
and simple root split strategies are preferable.
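
A short Lua sketch of the "&"-syntax evaluation (hypothetical helper;
"main = true" stands for the sorting hint "-"):

    local function ampersandCategories(extra, pagename)
      local code = string.match(extra, "^&([CIMNPUWX])$")   -- "L" excluded
      if not code then return nil end
      if code == "X" then
        return { { type = "M", name = pagename, main = true },
                 { type = "W", name = pagename, main = true } }
      end
      return { { type = code, name = pagename, main = true } }
    end

    -- ampersandCategories("&I", "-ist-") --> one include of type "I",
    -- lemma "-ist-" as main page of that category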

Difficult cases
---------------

The splitter is designed to automate common cases, save typing work and
minimize the risk of errors. But there are situations where the restrictions
and sanity checks block seemingly easy solutions and create almost
unsolvable challenges.

There are 4 versions of the morpheme that have to be managed, exemplified by
the lemma "perkeretaapian" and its root "keretaapi" and last 2 letters "an":

- morpheme fed into the "sum check" cut from the pagename
  (here "keretaapi" and "an", anything else would fail the "sum check")
- linked morpheme
  (here "kereta api" and "-an", for example "keretaapi" is not
   a valid word or root)
- categorized morpheme
  (here "kereta api" with type "M" and "per-...-an" with type "C",
   for example categorizing "-an" only would be definitely inferior)
- showed morpheme
  (here "kereta( )api" and "an")

Here are the workarounds needed for such cases.

# link and categorize
  - the default and easy case, use "$B", or type F210 ie
    morpheme type + colon + morpheme

# link but do NOT categorize
  - use type F200, omit the morpheme type and colon

# categorize but do NOT link
  - put the morpheme type and morpheme into the extra parameter,
    type F210, and bare text if supposed to be visible into the
    split control parameter, type F000

# do NOT categorize and do NOT link
  - put bare text into the split control parameter, type F000

# link to several alternatives
  - put the main alternative into the split control parameter and
    further one or several alternatives into the extra parameter
  - this is frequently needed for "middle-level" compounds

# link and categorize but with different names
  - use the pseudo-type "L" if useful, otherwise use type F200 in the
    split control parameter and category in the extra parameter
  - the former is needed for all nonstandalone roots in Esperanto, the
    latter for compounds involving proper nouns or assimilations, for
    example "penektomio" and "ekstertero"

Examples and selecting the optimal strategy
-------------------------------------------

# "pembangkit listrik" (-id-) (-en-: "power plant")
  - default automatic split gives perfect result

# "pembangkit listrik tenaga surya" (-id-) (-en-: "solar power plant")
  - default automatic split gives good result
  - assisted split
      "fra=%0"
    gives a maybe better result and costs only 2 char:s

# "pertidaksamaan" (-id-)
  - note that there is most likely no lemma "tidak sama"
  - default automatic split gives no result
  - assisted split does not help either (it is not
    possible to add boundaries, only to block such)
  - use manual split
      "fra=[C:per-...-an/per][M:tidak][M:sama][C:per-...-an/an]"
      "fra=[C:per-...-an/per]+[M:tidak]+[M:sama]+[C:per-...-an/an]"

# "perkeretaapian" (-id-)
  - note that there most likely is a lemma "kereta api" we want to link to
  - default automatic split gives no result
  - tempting solution
      "fra=[C:per-...-an/per][M:kereta api][C:per-...-an/an]"
    is invalid and fails the "sum check"
  - tempting solution
      "fra=[C:per-...-an/per][M:keretaapi][C:per-...-an/an]"
    is not against this specification but links to invalid
    morpheme "keretaapi"
  - tempting solution
      "fra=[C:per-...-an/per][M:kereta api/keretaapi][C:per-...-an/an]"
    is not against this specification and links to morpheme "kereta api"
    but shows invalid morpheme "keretaapi" on the screen
  - possible manual split
      "fra=[C:per-...-an/per][M:kereta][M:api][C:per-...-an/an]"
    links to "kereta" and "api" but not "kereta api"
  - possible manual split
      "fra=[C:per-...-an/per][M:kereta( )api][C:per-...-an/an]"
    links to "kereta api" and shows "kereta( )api" but less nice
  - probably better way to do
      "fra=[C:per-...-an/per]+[M:kereta( )api]+[C:per-...-an/an]"
    links to "kereta api" and shows "kereta( )api" and the extra parameter can
    be used to add categorization for "kereta" and "api" besides "kereta api"

# "mengirim" (-id-) (-en-: "send")
  - we want to link to "meN-" and "kirim"
  - default automatic split gives no result
  - tempting solution
      "fra=[P:meN][M:kirim/girim]"
    is invalid and fails the "sum check"
  - tempting solution
      "fra=[P:meng][N:irim]"
    links to wrong morphemes "meng-" and "irim"
  - tempting solution
      "fra=[P:meng][irim]"
    links to two wrong morphemes and "irim" is not categorized
  - tempting solution
      "fra=[P:meN-/men][M:kirim/girim]"
    shows one dubious ("men") and one wrong ("girim") morpheme
  - tempting solution
      "fra=[meN-/meng][M:(k)irim]"
    "meN-" is not categorized
  - tempting solution
      "fra=[P:meN/meng][M:(k)irim]"
    links to "meN", dash is not added since this is type F211
  - correct way to do
      "fra=[P:meN-/meng][M:(k)irim]"
  - correct way to do and nicer
      "fra=[P:meN-/meng]+[M:(k)irim]"

# "penectomy" (-en-)
  - the root is "penis" but is reduced to "pen"
  - correct way to do
      "fra=[M:pen(is)]+[U:ectomy]"

# "When in a hole, stop digging." (en)
  - default automatic split gives inferior result linking to "When"
  - use assisted split
      "fra=#0I"
    costs 3 char:s

# "When in Rome, do as the Romans do." (en)
  - default automatic split gives inferior result
    linking to "When" and "Romans"
  - assisted split
      "fra=#0I"
    is insufficient as it fixes "When" only but not "Romans"
  - use assisted split
      "fra=#0I #6:Roman"
    costs 12 char:s and links fragment 6 ("Romans") to "Roman"
  - manual split is possible but far more expensive
      "fra=[when/When] [in] [Rome], [do] [as] [the] [Roman/Romans] [do]."
    costs 61 char:s and note that "very raw manual split" would
    be even worse

# "sun" (-en-)
  - there is nothing to split and nothing to link to
  - use simple bare root strategy
    "$B"

# "Sun" (en)
  - we want to link to "sun"
  - use simple bare root strategy
    "$B"

# "Inverness" (en)
  - there is nothing to split and nothing to link to
  - use no split (default)
  - if you badly want to categorize the root "Inverness" (with uppercase
    "I") then use the extra parameter
    "" and "&M"
    no link but category of type "M" and main page in it

# "polvosucxilo" (-eo-)
  - we want to link to "polvo" and "sucxi" but categorize "polv" and "sucx"
  - default automatic split gives no result
  - tempting solution
    "[N:polv(o)]+[I:o]+[N:sucx(i)]+[I:il]+[U:o]"
    categorizes "polvo" and "sucxi", probably not what we intended
  - tempting solution
    "[N:polvo/polv(o)]+[I:o]+[N:sucxi/sucx(i)]+[I:il]+[U:o]"
    categorizes "polvo" and "sucxi", probably not what we intended
    and a bit horrible to type
  - tempting solution
    "[N:polv/polv(o)]+[I:o]+[N:sucx/sucx(i)]+[I:il]+[U:o]"
    links to "polv" and "sucx", probably not what we intended
    and a bit horrible to type
  - correct way to do
    "[L:polv(o)]+[I:o]+[L:sucx(i)]+[I:il]+[U:o]"
    links to "polvo" and "sucxi" but categorizes "polv" and "sucx"
  - note that the same string "o" is used twice with different morpheme types

# "penektomio" (-eo-)
  - the root is "penis(o)" but is reduced to "pen"
  - here the assimilation and nonstandalone root together create a
    dilemma that cannot be solved without the extra parameter
  - correct way to do
      "fra=[peniso/pen(iso)]+[I:ektomi]+[U:o]|ext=[N:penis]"
  - alternative way to do
      "fra=[L:penektomi(o)]+[U:o]|ext=[N:penis][I:ektomi]"

# "suno" (-eo-) (-en-: sun)
  - there is nothing to link to for the main root "sun"
  - most likely there is a page "sun" but it is not a valid Esperanto
    word, it's English and we do not want to link it here (maybe in the
    translation section instead)
  - tempting solution
      "fra=[N:sun(o)]+[U:o]"
    shows "sun(o) + o", tries to link to "suno" (self-link)
    and categorizes "suno", probably not what we intended
  - tempting solution
      "fra=[L:sun(o)]+[U:o]"
    shows "sun(o) + o", tries to link to "suno" (self-link)
  - use simple root split
      "fra=$S"
    shows "sun + o", "sun" is not linked, "o" links to "-o",
    "sun" is categorized as type "N" and main, "-o" as type "U"

# "Suno" (-eo-) (-en-: Sun)
  - we want to link to "suno"
  - use simple root split
      "fra=$S"
    shows "Sun + o", "Sun" links to "suno", "o" links to "-o",
    "Sun" is categorized as type "N" and main, "-o" as type "U"

# "terpomo" (-eo-) (-en-: "potato")
  - very typical compound
  - use manual split
      "fra=[L:ter(o)]+[L:pom(o)]+[U:o]"

"ekstertero" (-eo-) (-en-: "outer space")
  - very tough compound
  - we want to link to "Tero" (-en-: "Earth", "planet Earth"), not to
    "tero" (-en-: "soil", "ground", "earth")
  - the root "ter" is common for "terpomo" and "extertero", making it case
    sensitive would be good for this word, but otherwise cause much more
    trouble than benefit, thus both "terpomo" and "ekstertero" will appear
    at both "tero" and "Tero"
  - tempting solution
    "[L:ekster(a)]+[L:ter(o)]+[U:o]"
    is suboptimal as it links to "tero"
  - tempting solution
    "[L:ekster(a)]+[L:Ter(o)]+[U:o]"
    is invalid and fails the "sum check"
  - correct way to do
    "[L:ekster(a)]+[Tero/ter(o)]+[U:o]" and "[N:ter]"
    links to "Tero", this is a unique tought case where the extra
    parameter is needed to circumvent the restrictions of the splitter

"etfingro" (-eo-)
  - here an infix (originally intended to be located near the end of a
    word and frequently even called "suffix" although never tolerable at
    very end of a word) became a prefix
  - correct way to do
    "[I:et]+[L:fingr(o)]+[U:o]"

"GXakarto" (-eo-) (-en-: "Jakarta", city in Indonesia)
  - the root could be "gxakart" but we do not want to link to it
  - tempting solution
    "gxakart+[U:o]"
    is invalid and fails the "sum check"
  - tempting solution
    "$S"
    shows "GXakart + o", "GXakart" links to "gxakarto" (invalid target)
  - use no split (default)
  - if you badly want to categorize the suffix "-o" then use manual split
    "GXakart+[U:o]"
  - if you badly want to categorize both the root "gxakart" (with lowercase
    "gx") and the suffix "-o" then use manual split and the extra parameter
    "GXakart+[U:o]" and "[N:gxakart]"

"-ist-" (-eo-)
  - there is nothing to split and nothing to link to
  - let the automatic split fail, alternatively use "-" for no split
  - use the extra parameter with "&"-syntax, type is "I"
    "" and "&I"

"-ist" (en, sv)
  - there is nothing to split and nothing to link to
  - let the automatic split fail, alternatively use "-" for no split
  - use the extra parameter with "&"-syntax, type is "U"
    "" and "&U"

"skolbok" (-sv-)
  - default automatic split gives no result
  - tempting solution
    "[M:skola][M:bok]"
    is invalid and fails the "sum check"
  - tempting solution
    "[M:skol][M:bok]"
    links to invalid morpheme "skol"
  - tempting solution
    "[L:skol(a)][M:bok]"
    links as supposed to "skola" but categorizes "skol" as type "N"
  - correct way to do
    "[M:skola/skol(a)][M:bok]"
    "[M:skola/skol(a)]+[M:bok]"
  - pseudo-type "L" is of no use here

"varumaerke" (-sv-)
  - default automatic split gives no result
  - tempting solution
    "[L:var(a)u][M:maerke]"
    links to invalid morpheme "varau", and categorizes "varu" as type
    "N" that was probably not intended either
  - correct way to do
    "[M:vara/var(a)u][M:maerke]"
    "[M:vara/var(a)u]+[M:maerke]"
  - pseudo-type "L" is of no use here

"loeparsko" (-sv-) (-en-: running shoe)
  - here two letters are stolen in the assimilation, one of them in a suffix
  - default automatic split gives no result
  - correct way to do
    "[M:loep(a)]+[U:-are/ar(e)]+[M:sko]"

"#@" (-zh-) (note that chars "#" and "@" represent 2 Chinese letters here)
  - default automatic split gives no result
  - use the large letter split strategy
    "$H"
  - if the morphemes are not standalone then manual split is needed

Limitations and (lack of) further automatization (rationale)
------------------------------------------------------------

The splitter is language-independent (and somewhat script-independent)
and does not know any grammar.

Thus it cannot in the above example "When in Rome, do as the Romans
do." change "Romans" to "Roman" although it looks like a trivial task.
Adding (partial) support for grammar in some language (EN, EO, SV, ...)
would start a never-ending mess and dissatisfaction. Automatic removing
of some "common" affixes (-en-: plural or 3rd person "-s", -eo-: "-j" and
"-n", -sv-: "-ar", ...) would cause false positives and much trouble.

The simple principle is: no grammar and no dictionary.

Similarly, the word "berjalan-jalan" will remain together, like all other
words containing a dash. If you want to split for example the lemma
"user-friendly" you have to use the (optimized) manual split. The
word "o'clock" will remain together and so "there's" will, in order
to split the latter to "there"+"'s" the manual approach is required.

Also, the splitter does not try to guess that the earliest (linkable)
morpheme is a prefix or that the last one is a suffix.

--]]