Modulo:mutf8debug


--[===[

MODULE "MUTF8DEBUG" (debug UTF8 text)

"eo.wiktionary.org/wiki/Modulo:mutf8debug" <!--2020-Dec-25-->
"id.wiktionary.org/wiki/Modul:mutf8debug"

Purpose: allows debugging an incoming UTF8 string (literally submitted or
         generated by a template) by splitting it into isolated chars,
         checking validity of the UTF8 stream and displaying chars and codes,
         or by performing a "hard nowiki" and displaying all text including
         spaces

Utilo: ebligas sencimigi enirantan UTF8 signocxenon (lauxlitere enigitan aux
       generitan far sxablono) per dispecigo farigxante apartaj signoj ...

Manfaat: memungkinkan ...

Syfte: moejliggoer att debugga en inkommande UTF8 straeng (oevergiven ...

Used by templates / Uzata far sxablonoj:
- "debu" (for debugging, see below)

Required submodules / Bezonataj submoduloj:
- none / neniuj

This module accepts parameters sent either to itself (own frame) or
to the caller (caller's frame). If there is a parameter "caller=true"
on the own frame then that own frame is discarded in favor of the
caller's one.

Incoming: - one anonymous obligatory parameter
            - input string (empty is legal but not very useful, 64 KiO max)
          - one anonymous optional parameter
            - output type selection (4 digits, bool or fourstate)
              - octet bloat ("0" or "1")
              - big boxes for single char:s ("0" or "1")
              - hard nowiki ("0" or "1" (no colour) or "2" (coloured)
                or "3" (coloured and split UTF8))
              - UTF8 char bloat ("0" or "1")
            - default is "1101", "0000" is prohibited, "nw" is synonymous
              with "0010", empty main input switches the type to "1000"
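
The selector rules above can be sketched as follows. This is a minimal
illustration only: the function name "parseselector" and the field names of
the returned table are hypothetical and do not exist in this module, which
does the same job inline with help of "lfdec1diglm".

```lua
-- Hypothetical helper (NOT part of this module): decode the 4-digit
-- output type selector, return a table, or "nil" if the selector is invalid.
-- "strsel" may be "nil" (parameter omitted), "booempty" tells whether the
-- main input string is empty.
local function parseselector (strsel, booempty)
  if (strsel == nil) then
    strsel = "1101" -- default
  end
  if (strsel == "nw") then
    strsel = "0010" -- synonym
  end
  if ((strsel == "0000") or (string.len (strsel) ~= 4)) then
    return nil -- prohibited or wrong length
  end
  if (booempty) then
    strsel = "1000" -- empty main input switches the type
  end
  if (not string.match (strsel, "^[01][01][0-3][01]$")) then
    return nil -- some digit out of range
  end
  return {
    octetbloat = (string.sub (strsel, 1, 1) == "1"),
    bigboxes   = (string.sub (strsel, 2, 2) == "1"),
    hardnowiki = tonumber (string.sub (strsel, 3, 3)), -- 0...3
    utf8bloat  = (string.sub (strsel, 4, 4) == "1")
  }
end
```

For example parseselector ("nw", false) gives "hardnowiki" equal to 1 and the
other three fields "false".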

Returned: - large text with complicated wikicode

This module is unbreakable (when called with correct module name
and function name).

Cxi tiu modulo estas nerompebla (kiam vokita kun gxustaj nomo de modulo
kaj nomo de funkcio).

This module is special in that it can seem unused and useless. Do not
delete it just because no pages link to it. Its purpose is not to be linked
from article, lemma, appendix or whatever pages. It is to be used temporarily
when debugging UTF8 text, preferably from the sandbox. With the option
"hard nowiki" it can even be used for documentation and selftest of modules
and templates, and the proxy template "debu" can be considered a
documentation template.

Note that "<nowiki>" does NOT work in wikitext generated by a module. We
must DEC-encode instead. This works for the common problem char:s ":#*='[]"
(there is no problem with "{}"). But DEC-encoding does NOT work for UTF8
multi-octet char:s. So we DEC-encode only some ANSI/ASCII char:s $00...$7F
and let the remaining ones pass unchanged (both for "big boxes" and "hard
nowiki"). Note that DEC-encoding does NOT work for LF either. In the "big
boxes" mode we catch LF separately.
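
A minimal sketch of the DEC-encoding idea, under stated assumptions: the
function name "decencode" is hypothetical, LF is not treated specially here,
and the set of char:s left unencoded (0...9 A...Z a...z) follows what
"lfis62safe" below accepts. Octets from 128 upwards pass unchanged so that
UTF8 multi-octet char:s stay intact.

```lua
-- Hypothetical helper (NOT part of this module): replace every ASCII
-- octet other than 0...9 A...Z a...z by a decimal character reference.
local function decencode (strin)
  local function encone (strc)
    local numc = string.byte (strc)
    if ((numc >= 128) or string.match (strc, "^[0-9A-Za-z]$")) then
      return strc -- UTF8 octet or safe ASCII char, pass unchanged
    end
    return "&#" .. tostring (numc) .. ";"
  end
  -- parentheses drop the second result (substitution count) of "gsub"
  return (string.gsub (strin, ".", encone))
end
```

With this sketch the wikitext "a[b]" turns into "a&#91;b&#93;", which the
wiki parser no longer interprets as a link bracket.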

In text coming from a module some evil stuff (broken encoding, ZERO,
FF/12, ZWSP, LRM, RLM) is replaced by U+$FFFD, whereas other stuff (TAB, CR,
NBSP, BOM) survives.

Color coding of the result in the "big boxes" mode:

1  white         ordinary ANSI/ASCII char
2  light grey    valid 2-octet UTF8 with some exceptions
3  grey          valid 3-octet UTF8 with some exceptions
4  dark grey     valid 4-octet UTF8 (with no exceptions yet)
5  red           code ZERO or invalid UTF8 sequence or empty main input
6  yellow        dubious TAB CR NBSP ZWSP LRM RLM BOM
7  light yellow  invisible LF SPACE
8  light blue    initial (except empty main input) and final UTF8 bloat report

Error <<FATAL in "mutf8debug" : internal error or missing or invalid
parameter>> is NOT included in the above list.

Some interesting UTF8 codepoints:

codepo    codepo      UTF-8                    visible    silly
int HEX   int DEC     encoding                 name       notes
--------  ----------  -----------------------  ---------  ----------------------
   $0000     #00'000                           ZERO
   $0009     #00'009                           TAB
   $000A     #00'010                           LF
   $000D     #00'013                           CR
   $0020     #00'032                           SPACE
   $007F     #00'127                                      inclusive end of 1-oct
   $0080     #00'128  $C2,$80                             begin of 2-oct
   $00A0     #00'160  $C2,$A0                  NBSP       (don't break me)
   $00BF     #00'191  $C2,$BF                             inclusive end of $C2,xx
   $00C0     #00'192  $C3,$80                             begin of $C3,xx
   $00FF     #00'255  $C3,$BF                             inclusive end of $C3,xx
   $0100     #00'256  $C4,$80                             begin of $C4,xx
   $034F     #00'847                                      COMBINING GRAPHEME JOINER
   $0401     #01'025  $D0,$81                             CCCP case delta $50
   $0451     #01'105  $D1,$91                             CCCP case delta $50
   $07FF     #02'047  $DF,$BF                             inclusive end of 2-oct
   $0800     #02'048  $E0,$80,$80                         begin of 3-oct
   $200B     #08'203  $E2,$80,$8B              ZWSP       ZERO WIDTH SPACE
   $200C     #08'204  $E2,$80,$8C                         ZWNJ ZERO WIDTH NON-JOINER
   $200D     #08'205  $E2,$80,$8D                         ZWJ ZERO WIDTH JOINER
   $200E     #08'206  $E2,$80,$8E              LRM        LEFT-TO-RIGHT MARK
   $200F     #08'207  $E2,$80,$8F              RLM        RIGHT-TO-LEFT MARK
   $2060     #08'288  $E2,$81,$A0                         (absurd "WORD JOINER")
   $2068     #08'296  $E2,$81,$A8              FSI        FIRST STRONG ISOLATE
   $20AC     #08'364  $E2,$82,$AC                         EURO (bank robbery)
   $D800     #55'296                                      begin of banned range
   $DFFF     #57'343                                      inclusive end of banned range
   $E000     #57'344                                      begin of legal range again
   $FEFF     #65'279  $EF,$BB,$BF 239,187,191  BOM        (absurd "BOM" Sigi)
   $FFFD     #65'533  $EF,$BF,$BD 239,191,189             REPLACEMENT CHARACTER
   $FFFE     #65'534  $EF,$BF,$BE 239,191,190             invalid (last 2)
   $FFFF     #65'535  $EF,$BF,$BF 239,191,191             invalid (last 2), inclusive end of 3-oct
$01'0000     #65'536  $F0,$90,$80,$80                     begin of 4-oct
$01'0348     #66'376  $F0,$90,$8D,$88                     one of few somewhat known
$0F'FFFF  #1'048'575  $F3,$BF,$BF,$BF                     one Mi almost reached
$10'0000  #1'048'576  $F4,$80,$80,$80                     one Mi reached here and no end yet
$10'FFFE  #1'114'110  $F4,$8F,$BF,$BE                     invalid (last 2)
$10'FFFF  #1'114'111  $F4,$8F,$BF,$BF                     invalid (last 2), inclusive end of unicode
$11'0000  #1'114'112 ($F4,$90,$80,$80)                    invalid (finally out of range)

- UTF8 is defined by "RFC 3629" from 2003-Nov (though it already
  existed before that)
- Absolute unicode range has 17 (seventeen !!!) planes of 65'536 values each,
  total 1'114'112, most of them are unused, plane ZERO is somewhat full, other
  ones are almost or totally empty, official notation: "U+0000..U+10FFFF"
- Codepoint range ZERO to 31 is valid by RFC but useless, same for 127,
  128 to 159, whereas 160 (AKA NBSP) is maybe useful
- Range "U+D800" to "U+DFFF" is invalid by RFC
- UTF8 starting octet can be only $C2 to $DF , $E0 to $EF , $F0 to $F4
  giving a continuous range from $C2 to $F4 of size $33 = #51 values
- UTF8 subsequent octets (1 or 2 or 3) can be only $80 to $BF
  (6 bit:s, 64 possible values)
- The octet values $C0, $C1 and $F5 to $FF may never appear in a UTF8 file
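
The octet rules in the list above can be checked with a small sketch. The
function name "octetclass" and the four class names are hypothetical and
serve illustration only; the module itself uses "lfutf8length" instead.

```lua
-- Hypothetical helper (NOT part of this module): classify one octet
-- (0...255) according to the rules listed above.
local function octetclass (numoct)
  if (numoct < 128) then
    return "ascii" -- $00 to $7F
  end
  if (numoct < 192) then
    return "subsequent" -- $80 to $BF
  end
  if ((numoct >= 194) and (numoct <= 244)) then
    return "starting" -- $C2 to $F4
  end
  return "forbidden" -- $C0, $C1, $F5 to $FF
end
```

Counting over all 256 octet values gives 51 starting octets and 64
subsequent octets, in agreement with the list above, and 13 forbidden ones.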

Abs. char number range |      UTF8 octet sequence       | beginning octet
   (hexadecimal)       |            (binary)            |
-----------------------+--------------------------------+------------------
0000'0000 to 0000'007F | 0xxxxxxx                       | $00        to $7F
0000'0080 to 0000'07FF | 110xxxxx 10xxxxxx              | $C0 -> $C2 to $DF
0000'0800 to 0000'FFFF | 1110xxxx 10xxxxxx 10xxxxxx     | $E0        to $EF
0001'0000 to 0010'FFFF | 11110xxx 10xxxxxx 10xxxxxx ... | $F0 to $F7 -> $F4
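
The bit layout in the table above can be sketched in the encoding direction
too. This is an illustration under assumptions, not code used by this
module (which only decodes): the function name "utf8encode" is hypothetical,
and plain arithmetic replaces bit operations, as in "mathdiv" and "mathmod".

```lua
-- Hypothetical helper (NOT part of this module): encode one codepoint
-- into a UTF8 octet string, return "nil" if out of range or in the
-- banned range $D800 to $DFFF.
local function utf8encode (numcp)
  if ((numcp < 0) or (numcp > 1114111)) then
    return nil -- beyond $10'FFFF
  end
  if ((numcp >= 55296) and (numcp <= 57343)) then
    return nil -- banned range
  end
  if (numcp < 128) then
    return string.char (numcp) -- 0xxxxxxx
  end
  local numlow = 128 + (numcp % 64) -- last octet 10xxxxxx
  if (numcp < 2048) then
    return string.char (192 + math.floor (numcp / 64), numlow)
  end
  local nummid = 128 + (math.floor (numcp / 64) % 64)
  if (numcp < 65536) then
    return string.char (224 + math.floor (numcp / 4096), nummid, numlow)
  end
  local numhig = 128 + (math.floor (numcp / 4096) % 64)
  return string.char (240 + math.floor (numcp / 262144), numhig, nummid, numlow)
end
```

For example the codepoint #8'364 (EURO) gives the octets $E2,$82,$AC and
#65'279 (BOM) gives $EF,$BB,$BF, as in the list of codepoints further above.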

]===]

local utf8debug = {}

------------------------------------------------------------------------

---- CONSTANTS ----

------------------------------------------------------------------------

  -- constant strings (error circumfixes)

  local constrkros  = '&nbsp;#&nbsp;#&nbsp;'       -- lagom -> huge circumfix
  local constrelabg = '<span class="error"><b>'    -- lagom whining begin
  local constrelaen = '</b></span>'                -- lagom whining end

  -- HTML stuff for our tiny table and background around every char

  local constrtabu3  = '<table style="display:inline-block; vertical-align:middle; margin:0.15em; padding:0.15em; border:0.15em solid #000000; text-align:center; background-color:#' -- missing color code and many char:s (only 3 ';">' to close element)
  local constrtabu4  = ';"><tr><td>'
  local constrtabu5  = '</td></tr></table>'
  local constrbkg3   = '<span style="font-size:160%;background-color:#E0A0FF;">&nbsp;'
  local constrbkg4   = '&nbsp;</span>'

  local contabwarna = {'FFFFFF','E8E8E8','D0D0D0','B8B8B8','FF6060','FFFF60','FFFFB0','C8C8FF'}  -- (index 1...8)

  -- constant strings EN vs EO vs ID vs SV

  -- local constrkosong = 'empty string submitted'           -- EN
  local constrkosong = 'malplena signocxeno transdonita'  -- EO
  -- local constrkosong = 'string datang bersifat kosong'    -- ID
  -- local constrkosong = 'inkommen string aer tom'          -- SV

  -- local constrinvalid = 'invalid code sequence'           -- EN
  local constrinvalid = 'nevalida sekvo de kodoj'         -- EO
  -- local constrinvalid = 'rantai kode bersifat invalid'    -- ID
  -- local constrinvalid = 'ogiltig kodsekvens'              -- SV

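  -- NOTE: this message is currently unused, it has a name of its own so that
  -- it does NOT shadow "constrinvalid" (invalid code sequence) above

  -- local constrinvopt = 'invalid optional parameter'           -- EN
  local constrinvopt = 'nevalida opcia parametro'             -- EO
  -- local constrinvopt = 'parameter opsional bersifat invalid'  -- ID
  -- local constrinvopt = 'ogiltig optional parameter'           -- SV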

------------------------------------------------------------------------

---- ORDINARY LOCAL MATH FUNCTIONS ----

------------------------------------------------------------------------

-- Local function MATHDIV

local function mathdiv (xdividend, xdivisor)
  local resultdiv = 0 -- integer DIV operator is lacking in LUA 5.1 :-(
  resultdiv = math.floor (xdividend / xdivisor)
  return resultdiv
end--function mathdiv

-- Local function MATHMOD

local function mathmod (xdividendo, xdivisoro)
  local resultmod = 0 -- MOD operator is "%" (bitwise AND operator is lacking too)
  resultmod = xdividendo % xdivisoro
  return resultmod
end--function mathmod

------------------------------------------------------------------------

-- Local function MATHXOR

-- This sub depends on "MATH FUNCTIONS"\"mathdiv"
-- and "MATH FUNCTIONS"\"mathmod".

local function mathxor (xa, xb)
  local resultxor = 0
  local crap6 = 0
  local crap7 = 0
  local crap8 = 1 -- single bit value 1 -> 2 -> 4 -> 8 ...
  while (true) do
    if ((xa==0) and (xb==0)) then
      break
    end--if
    crap6 = mathmod (xa,2) -- seize remainder before dividing
    crap7 = mathmod (xb,2) -- seize remainder before dividing
    xa    = mathdiv (xa,2)
    xb    = mathdiv (xb,2)
    if (crap6~=crap7) then
      resultxor = resultxor + crap8
    end--if
    crap8 = crap8 * 2
  end--while
  return resultxor
end--function mathxor

------------------------------------------------------------------------

---- ORDINARY LOCAL STRING FUNCTIONS ----

------------------------------------------------------------------------

-- test whether char is an ASCII digit "0"..."9", return bool

local function lftestnum (numkaad)
  local boodigit = false
  boodigit = ((numkaad>=48) and (numkaad<=57))
  return boodigit
end--function lftestnum

------------------------------------------------------------------------

-- test whether char is an ASCII uppercase letter, return bool

local function lftestuc (numkode)
  local booupperc = false
  booupperc = ((numkode>=65) and (numkode<=90))
  return booupperc
end--function lftestuc

------------------------------------------------------------------------

-- test whether char is an ASCII lowercase letter, return bool

local function lftestlc (numcode)
  local boolowerc = false
  boolowerc = ((numcode>=97) and (numcode<=122))
  return boolowerc
end--function lftestlc

------------------------------------------------------------------------

-- Local function LFIS62SAFE

-- Test whether incoming ASCII char is very safe (0...9 A...Z a...z).

-- This sub depends on "STRING FUNCTIONS"\"lftestnum" and
-- "STRING FUNCTIONS"\"lftestuc" and "STRING FUNCTIONS"\"lftestlc".

local function lfis62safe (numcxair)
  local booguud = false
  booguud = lftestnum (numcxair) or lftestuc (numcxair) or lftestlc (numcxair)
  return booguud
end-- function lfis62safe

------------------------------------------------------------------------

---- ORDINARY LOCAL CONVERSION FUNCTIONS ----

------------------------------------------------------------------------

-- Local function LFDEC1DIGLM

-- Convert 1 decimal ASCII digit to UINT8 with inclusive upper limit.

-- Use this for single-digit conversions with range and for pseudo-bool
-- (0,1) and for genuine bool (false,true) via "boosplitit=(numcrap==1)".

local function lfdec1diglm (num1dygyt, num1lim)
  num1dygyt = num1dygyt - 48 -- may become invalid ie negative
  if ((num1dygyt<0) or (num1dygyt>num1lim)) then
    num1dygyt = 255
  end--if
  return num1dygyt
end--function lfdec1diglm

------------------------------------------------------------------------

-- Local function LFUINT8TOHEX

-- Convert UINT8 (0...255) to 2-digit hex

-- This sub depends on "MATH FUNCTIONS"\"mathdiv"
-- and "MATH FUNCTIONS"\"mathmod".

local function lfuint8tohex (numinclow)
  local strheksulo = ''
  local numhajhaj = 0
  numhajhaj = mathdiv (numinclow,16)
  numinclow = mathmod (numinclow,16)
  if (numhajhaj>9) then
    numhajhaj = numhajhaj + 7 -- now 0...9 or 17...22
  end--if
  if (numinclow>9) then
    numinclow = numinclow + 7 -- now 0...9 or 17...22
  end--if
  strheksulo = string.char (numhajhaj+48) .. string.char (numinclow+48)
  return strheksulo
end--function lfuint8tohex

------------------------------------------------------------------------

-- Local function LFUINT32TOHEX

-- Convert UINT32 (0 ... $FFFF'FFFF = #4'294'967'295) to
-- (2 or 4 or 6 or 8)-digit hex

-- We need crass functions "mathdiv" and "mathmod"

local function lfuint32tohex (numincom)
  local strheksulego = ''
  while (true) do
    strheksulego = lfuint8tohex ( mathmod (numincom,256) ) .. strheksulego
    numincom = mathdiv (numincom,256)
    if (numincom==0) then
      break
    end--if
  end--while
  return strheksulego
end--function lfuint32tohex

------------------------------------------------------------------------

-- Local function LFHEXDEC

-- Example output : "$FE=#254" (kept terse, we have to save text width)

-- Depends on "lfuint8tohex"

local function lfhexdec (numkodo)
  local strrezulto = ''
    strrezulto = "$" .. lfuint8tohex (numkodo) .. "=#" .. tostring (numkodo)
  return strrezulto
end--function lfhexdec

------------------------------------------------------------------------

-- Local function LFBUNCH

-- Add digit bunching to raw decimal number string

local function lfbunch (strnomorin)
  local strnomorut = ""
  local numlenn = 0
  local numindeex = 0 -- ZERO-based counts up
  local numcaar = 0 -- char of string
  numlenn = string.len(strnomorin)
  while (true) do
    if (numindeex==numlenn) then
      break
    end--if
    numcaar = string.byte(strnomorin,(numlenn-numindeex),(numlenn-numindeex))
    if ((mathmod(numindeex,3)==0) and (numindeex~=0)) then
      strnomorut = "'" .. strnomorut -- apo
    end--if
    strnomorut = string.char(numcaar) .. strnomorut
    numindeex = numindeex + 1 -- index counts up but we go back
  end--while
  return strnomorut
end--function lfbunch

------------------------------------------------------------------------

---- ORDINARY LOCAL UTF8 FUNCTIONS ----

------------------------------------------------------------------------

-- Local function LFUTF8LENGTH

-- Measure length of a single UTF8 char, return ZERO if invalid

-- Does NOT thoroughly check the validity, looks at 1 octet only

-- Input  : - numbgoctet (beginning octet of a UTF8 char)

-- Output : - numlen1234x (1...4 or ZERO if invalid)

local function lfutf8length (numbgoctet)
  local numlen1234x = 0
    if (numbgoctet<128) then
      numlen1234x = 1 -- $00...$7F -- ANSI/ASCII
    end--if
    if ((numbgoctet>=194) and (numbgoctet<=223)) then
      numlen1234x = 2 -- $C2 to $DF
    end--if
    if ((numbgoctet>=224) and (numbgoctet<=239)) then
      numlen1234x = 3 -- $E0 to $EF
    end--if
    if ((numbgoctet>=240) and (numbgoctet<=244)) then
      numlen1234x = 4 -- $F0 to $F4
    end--if
  return numlen1234x
end--function lfutf8length

------------------------------------------------------------------------

-- Local function LFUTF8DEKO

-- Decode a UTF8 char, return ZERO length if invalid

-- Result is a table: [0] length and [1] codepoint

-- This sub depends on "MATH FUNCTIONS"\"mathxor".

local function lfutf8deko (num0, num1, num2, num3)

  local tabresult = {}
  local numlength = 0 -- preassume invalid
  local numkodepoin = 0 -- preassume invalid

  num1 = mathxor (num1,128) -- XOR 3 of 4
  num2 = mathxor (num2,128) -- XOR 3 of 4
  num3 = mathxor (num3,128) -- XOR 3 of 4

  while (true) do -- fake loop

    if ((num0>193) and (num1>63)) then
      break -- to join mark
    end--if
    if ((num0>223) and (num2>63)) then
      break -- to join mark
    end--if
    if ((num0>239) and (num3>63)) then
      break -- to join mark
    end--if

    if (num0<128) then -- ZERO to $7F
      numkodepoin = num0
      numlength = 1
      break -- to join mark
    end--if

    if ((num0>193) and (num0<224)) then -- $C0 # $C2 to $DF
      numkodepoin = (mathxor(num0,192)) * 64 + num1
      if ((numkodepoin>127) and (numkodepoin<2048)) then
        numlength = 2
      end--if
      break -- to join mark
    end--if

    if ((num0>223) and (num0<240)) then -- $E0 to $EF
      numkodepoin = (mathxor(num0,224)) * 4096 + num1 * 64 + num2
      if (((numkodepoin>2047) and (numkodepoin<55296)) or ((numkodepoin>57343) and (numkodepoin<65536))) then
        numlength = 3
      end--if
      break -- to join mark
    end--if

    if ((num0>239) and (num0<245)) then -- $F0 to $F7 # $F4
      numkodepoin = (mathxor(num0,240)) * 262144 + num1 * 4096 + num2 * 64 + num3
      if ((numkodepoin>65535) and (numkodepoin<1114112)) then
        numlength = 4
      end--if
      break -- to join mark
    end--if

    break -- finally to join mark
  end--while -- fake loop -- join mark

  tabresult [0] = numlength
  tabresult [1] = numkodepoin
  return tabresult

end--function lfutf8deko

------------------------------------------------------------------------

-- Local function LFULTENCODE

-- Our cool module has brewed something with "[["..."]]" and repeated
-- spaces but we want to see plain text for debugging purposes. Thus we
-- dec-encode it, use NBSP to fix spaces and maybe add colour.

-- This helps with "[["..."]]", "["..."]", "*", "#", ":" and with
-- multiple spaces, which are no longer collapsed into one. Note that
-- there is no problem with "{{"..."}}".

-- We must be UTF8-aware. A UTF8 char must be either split in a controlled
-- way, or completely preserved ie not split or encoded at all.

-- Note that this causes BLOAT. The caller is responsible for
-- adding "<big>"..."</big>" if desired.

-- Input  : * "strkrampuj" : string, empty tolerable, but "nil" type is NOT
--          * "boowarrna"  : color enable
--          * "boosplit"   : split UTF8 char:s into hex numbers

-- Output : * "strkood"    : string, empty in worst case

-- This sub depends on "MATH FUNCTIONS"\"mathdiv"
-- and "MATH FUNCTIONS"\"mathmod" and "CONVERSION FUNCTIONS"\"lfuint8tohex"
-- and "STRING FUNCTIONS"\"lfis62safe" and "UTF8 FUNCTIONS"\"lfutf8length".

local function lfultencode (strkrampuj,boowarrna,boosplit)
  local stronechar = ""
  local strkolorr = ""
  local strkood = ""
  local numstrlne = 0
  local numpeekynx = 1 -- ONE-based index
  local numcahr = 0
  local numutf8len = 0
  local numcolour = 0 -- 0,1,2,3 -- red, light green, blue, light grey
  local boonbsp = false
  local boosplnow = false
  numstrlne = string.len (strkrampuj)
  while (true) do -- genuine loop
    if (numpeekynx>numstrlne) then
      break
    end--if
    numcahr = string.byte (strkrampuj,numpeekynx,numpeekynx)
    numpeekynx = numpeekynx + 1 -- ONE-based index
    while (true) do -- fake loop
      if (numcahr==32) then
        if (boonbsp) then
          stronechar = "&nbsp;" -- this prevents space reduction
        else
          stronechar = " "
        end--if
        boonbsp = not boonbsp
        break -- to join mark
      end--if
      if (numcahr>127) then
        boosplnow = boosplit
        numutf8len = lfutf8length (numcahr)
        if (numutf8len==0) then
          boosplnow = true -- forced split for broken UTF8 sequence
        else
          numutf8len = numutf8len - 1 -- more char:s to pick
        end--if
        if ((numpeekynx+numutf8len)>(numstrlne+1)) then
          boosplnow = true -- forced split for broken UTF8 sequence
        end--if
        if (boosplnow) then
          stronechar = "{$" .. lfuint8tohex (numcahr) .. "}"
        else
          stronechar = string.char (numcahr)
          while (true) do
            if (numutf8len==0) then
              break
            end--if
            numcahr = string.byte (strkrampuj,numpeekynx,numpeekynx)
            numpeekynx = numpeekynx + 1
            numutf8len = numutf8len - 1
            stronechar = stronechar .. string.char (numcahr)
          end--while
        end--if
        break -- to join mark
      end--if
      if (lfis62safe(numcahr)) then -- safe ASCII ie 0...9 A...Z a...z
        stronechar = string.char (numcahr) -- do NOT encode safe char:s
        break -- to join mark
      end--if
      stronechar = "&#" .. tostring (numcahr) .. ";"
      break -- finally to join mark
    end--while -- fake loop -- join mark
    if (boowarrna) then
      if (numcolour==0) then
        strkolorr = "FFA0A0" -- red
      end--if
      if (numcolour==1) then
        strkolorr = "D0FFD0" -- light green
      end--if
      if (numcolour==2) then
        strkolorr = "A0A0FF" -- blue
      end--if
      if (numcolour==3) then
        strkolorr = "D0D0D0" -- light grey
      end--if
      numcolour = mathmod ((numcolour+1),4)
      strkood = strkood .. '<span style="background-color:#' .. strkolorr .. ';">' .. stronechar .. '</span>'
    else
      strkood = strkood .. stronechar
    end--if
  end--while
  return strkood
end--function lfultencode
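
--[==[

The space handling in "lfultencode" can be shown in isolation. This is a
minimal sketch only: the function name "fixspaces" is hypothetical, and it
assumes the same rule as above, namely that every second space (counted
over the whole string, adjacent or not) becomes "&nbsp;".

```lua
-- Hypothetical helper (NOT part of this module): alternate between an
-- ordinary space and "&nbsp;" so that the wiki parser and the browser
-- cannot collapse a run of spaces into one.
local function fixspaces (strin)
  local boonbsp = false -- first space stays an ordinary space
  local function swapone ()
    local strout = " "
    if (boonbsp) then
      strout = "&nbsp;"
    end
    boonbsp = not boonbsp
    return strout
  end
  -- parentheses drop the second result (substitution count) of "gsub"
  return (string.gsub (strin, " ", swapone))
end
```

With this sketch "a   b" (3 spaces) turns into "a &nbsp; b".

]==]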

------------------------------------------------------------------------

---- ORDINARY LOCAL HIGH LEVEL FUNCTIONS ----

------------------------------------------------------------------------

-- Local function LFWARNA

-- Convert integer 1...8 (must be valid) to 6 digits hex color.

-- fill the gap between "constrtabu3" and "constrtabu4" always with help of
-- this sub, do NOT put hardcoded color values there

-- we use "contabwarna"

-- 1 white default, 2...4 grey getting darker,
-- 5 red, 6 yellow, 7 light yellow, 8 light blue

local function lfwarna (indexofcolor)
  local strfaerg = ''
  strfaerg = contabwarna [indexofcolor]
  return strfaerg
end--function lfwarna

------------------------------------------------------------------------

-- Local function LFIGATEYELLOW

-- Detect TAB CR NBSP ZWSP LRM RLM BOM -- "yellow class error"

-- ZERO is "red class error" and not included here

local function lfigateyellow (numcodepoint)
  local strnamev = ''
  if (numcodepoint==    9) then
    strnamev = 'TAB'
  end--if
  if (numcodepoint==   13) then
    strnamev = 'CR'
  end--if
  if (numcodepoint==  160) then
    strnamev = 'NBSP'
  end--if
  if (numcodepoint== 8203) then
    strnamev = 'ZWSP'
  end--if
  if (numcodepoint== 8206) then
    strnamev = 'LRM'
  end--if
  if (numcodepoint== 8207) then
    strnamev = 'RLM'
  end--if
  if (numcodepoint==65279) then
    strnamev = 'BOM'
  end--if
  return strnamev
end--function lfigateyellow

------------------------------------------------------------------------

-- Local function LFIGATESPECIAL

-- Detect LF SPACE -- "light yellow class char"

local function lfigatespecial (numcoodepoint)
  local strnme = ''
  if (numcoodepoint==   10) then
    strnme = 'LF'
  end--if
  if (numcoodepoint==   32) then
    strnme = 'SPACE'
  end--if
  return strnme
end--function lfigatespecial

------------------------------------------------------------------------

---- MAIN EXPORTED FUNCTION ----

------------------------------------------------------------------------

function utf8debug.ek (arxframent)

  -- general unknown type

  local vartmp = 0     -- variable without type

  -- special type "args" AKA "arx"

  local arxsomons = 0  -- metaized "args" from our own or caller's "frame"

  -- general "tab"

  local tabutf8dec = {}

  -- general "str"

  local strinc     = ""  -- incoming text
  local strctrl    = ""  -- from optional parameter
  local strnamevil = ""  -- name of a bad char, for example "CR" "ZWSP"
  local strnamechr = ""  -- name of an invisible char, for example "LF" "SPACE"
  local strsngchar = ""  -- one char with "span" background
  local strchrblok = ""  -- prebrewed block with table for one char
  local strret     = ""  -- final output string

  -- general "num"

  local numlongtx = 0  -- length of incoming parameter
  local numlung   = 0  -- temp
  local numindx   = 0
  local numreserv = 0
  local numwarna  = 0
  local numoct    = 0  -- temp some char
  local numodt    = 0  -- temp some char
  local numoet    = 0  -- temp some char
  local numoft    = 0  -- temp some char
  local numutflen = 0
  local numdecode = 0  -- decoded "codepoint" value
  local numchrlen = 0  -- number of UTF8 char:s

  -- general "boo"

  local boocrap   = false
  local boooktblo = false
  local boobigbox = false -- big boxes
  local boohardnw = false -- "true" from "1" or "2" or "3"
  local boohnwcol = false -- "true" from "2" or "3" only
  local boohnwspt = false -- "true" from "3" only
  local booutfblo = false -- report UTF8 char bloat

  ---- GUARD AGAINST INTERNAL ERROR ----

  -- "constrkosong" and "constrinvalid" must be uncommented and assigned

  -- note that reporting of this error must NOT depend on uncommentable strings

  if ((type(constrkosong)~="string") or (type(constrinvalid)~="string")) then
    boocrap = true
  end--if

  ---- GET THE ARX (ONE OF TWO) ----

  if (boocrap==false) then
    arxsomons = arxframent.args -- "args" from our own "frame"
    vartmp = arxsomons ["caller"]
    if (vartmp=="true") then
      arxsomons = arxframent:getParent().args -- "args" from caller's "frame"
    end--if
  end--if

  ---- SEIZE 1 OBLIGATORY ANONYMOUS PARAMETER ----

  -- on success assign "strinc" and "numlongtx" (not to be touched later)

  if (boocrap==false) then
    vartmp = arxsomons [1]
    if (type(vartmp)~="string") then
      boocrap = true -- empty string is legal, missing parameter is NOT legal
    else
      numlongtx = string.len (vartmp)
      if (numlongtx>65536) then
        boocrap = true -- this causes bloat, we can never encode such a big string
      else
        strinc = vartmp
      end--if
    end--if (type(vartmp)~="string") else
  end--if

  ---- SEIZE AND PRECHECK 1 OPTIONAL ANONYMOUS PARAMETER ----

  -- default is "1101", "0000" is prohibited, "nw" is synonymous
  -- with "0010", empty main input switches the type to "1000"

  if (boocrap==false) then
    strctrl = "1101"
    vartmp = arxsomons [2]
    if (type(vartmp)=="string") then
      if (vartmp=="0000") then
        vartmp = "-" -- invalid
      end--if
      if (vartmp=="nw") then
        vartmp = "0010"
      end--if
      numlung = string.len (vartmp)
      if (numlung~=4) then
        boocrap = true
      else
        strctrl = vartmp
      end--if
    end--if (type(vartmp)=="string") then
    if (numlongtx==0) then
      strctrl = "1000" -- empty main input switches the type to "1000"
    end--if
  end--if

  ---- PROCESS OPTIONAL PARAMETER ----

  if (boocrap==false) then
    while (true) do -- fake loop

      numoft = lfdec1diglm(string.byte(strctrl,1,1),1) -- 255 if invalid
      if (numoft==255) then
        boocrap = true
        break -- to join mark
      end--if
      boooktblo = (numoft==1) -- octet bloat

      numoft = lfdec1diglm(string.byte(strctrl,2,2),1) -- 255 if invalid
      if (numoft==255) then
        boocrap = true
        break -- to join mark
      end--if
      boobigbox = (numoft==1) -- big boxes

      numoft = lfdec1diglm(string.byte(strctrl,3,3),3) -- 255 if invalid
      if (numoft==255) then
        boocrap = true
        break -- to join mark
      end--if
      boohardnw = (numoft~=0) -- "true" from "1" or "2" or "3"
      boohnwcol = (numoft>1)  -- "true" from "2" or "3" only
      boohnwspt = (numoft==3) -- "true" from "3" only

      numoft = lfdec1diglm(string.byte(strctrl,4,4),1) -- 255 if invalid
      if (numoft==255) then
        boocrap = true
        break -- to join mark
      end--if
      booutfblo = (numoft==1) -- UTF8 char bloat

      break -- finally to join mark
    end--while -- fake loop -- join mark
  end--if

  ---- WHINE IF YOU MUST ----

  -- note that reporting of this error must NOT depend on uncommentable strings

  if (boocrap) then
    strchrblok = 'FATAL in "mutf8debug" : internal error or missing or invalid parameter'
    strret = constrkros .. constrelabg .. strchrblok .. constrelaen .. constrkros
  end--if

  ---- OCTET BLOAT ----

  if ((boocrap==false) and boooktblo) then

    if (numlongtx==0) then
      numwarna = 5 -- red on empty string (only 5 or 8 here)
      strsngchar = constrkosong
    else
      numwarna = 8 -- light blue (only 5 or 8 here)
      strsngchar = "number of<br>octet:s : " .. lfbunch (tostring (numlongtx) )
    end--if
    strret = constrtabu3 .. lfwarna (numwarna) .. constrtabu4 .. strsngchar .. constrtabu5 .. "<br>"

  end--if

  ---- BIG BOXES ----

  -- we have "strinc" and "numlongtx"

  -- we brew a private table with just one cell for every single char

  if ((boocrap==false) and boobigbox) then

    numindx = 0 -- counts octet:s
    numchrlen = 0 -- counts UTF8 char:s

    while (true) do

      if (numindx>=numlongtx) then
        break
      end--if

        numreserv = numlongtx - numindx -- at least 1
        numoct = string.byte (strinc,(numindx+1),(numindx+1))
        numodt = 0
        numoet = 0
        numoft = 0
        if (numreserv>=2) then
          numodt = string.byte (strinc,(numindx+2),(numindx+2))
        end--if
        if (numreserv>=3) then
          numoet = string.byte (strinc,(numindx+3),(numindx+3))
        end--if
        if (numreserv>=4) then
          numoft = string.byte (strinc,(numindx+4),(numindx+4))
        end--if
        tabutf8dec = lfutf8deko (numoct,numodt,numoet,numoft)
        numutflen = tabutf8dec [0]
        numdecode = tabutf8dec [1]
        strnamevil = '' -- preassume, NOT reporting any name -- yellow
        strnamechr = '' -- preassume, NOT reporting any name -- light yellow
        if (numutflen~=0) then
          strnamevil = lfigateyellow (numdecode) -- re empty string if no hit
          strnamechr = lfigatespecial (numdecode) -- re empty string if no hit
        end--if
        numwarna = numutflen -- preassume, ZERO to 4, ZERO is invalid
        if ((numoct==0) or (numutflen==0)) then
          numwarna = 5 -- red on code ZERO or invalid sequence
          if (numoct==0) then
            strnamevil = "ZERO"
          end--if
        end--if
        if (strnamevil~='') then
          numwarna = 6 -- yellow on TAB CR NBSP ZWSP LRM RLM BOM
        end--if
        if (strnamechr~='') then
          numwarna = 7 -- light yellow on LF SPACE
        end--if
        strchrblok = constrtabu3 .. lfwarna (numwarna) .. constrtabu4 .. "<small>index</small> " .. lfbunch (tostring (numindx) )
        strchrblok = strchrblok .. "<br><small>beg code</small> " .. lfhexdec (numoct)
        if (numutflen==0) then
          strchrblok = strchrblok .. "<br>" .. constrinvalid -- color already set above
        else
          strchrblok = strchrblok .. "<br><small>length</small> " .. tostring (numutflen)
          strsngchar = string.char (numoct) -- maybe we will need it
          if (numutflen>=2) then
            strchrblok = strchrblok .. "<br><small>extra</small> $" .. lfuint8tohex (numodt)
            strsngchar = strsngchar .. string.char (numodt)
            if (numutflen>=3) then
              strchrblok = strchrblok .. ",$" .. lfuint8tohex (numoet)
              strsngchar = strsngchar .. string.char (numoet)
            end--if
            if (numutflen==4) then
              strchrblok = strchrblok .. ",$" .. lfuint8tohex (numoft)
              strsngchar = strsngchar .. string.char (numoft)
            end--if
            strchrblok = strchrblok .. "<br><small>codepoint</small> U+$" .. lfuint32tohex (numdecode)
            strchrblok = strchrblok .. "<br><small>dec</small> #" .. lfbunch (tostring (numdecode) )
          end--if (numutflen>=2) then
          if (strnamevil~='') then
            strchrblok = strchrblok .. "<br>" .. strnamevil -- whine only if reason to
          end--if
          if (strnamechr~='') then
            strchrblok = strchrblok .. "<br>" .. strnamechr -- boast only if reason to
          end--if
          if ((strnamevil..strnamechr)=='') then
            strchrblok = strchrblok .. "<br>" .. constrbkg3 -- begin char background
            if (numutflen==1) then
              strchrblok = strchrblok .. "&#" .. tostring (numoct) .. ";" -- DEC-encode, "strsngchar" is ignored here
            else
              strchrblok = strchrblok .. strsngchar -- let wiki software & browser bother
            end--if
            strchrblok = strchrblok .. constrbkg4 -- close char background
          end--if
        end--if (numutflen==0) else

      strchrblok = strchrblok .. constrtabu5 -- close table
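      if (numutflen==0) then
        numindx = numindx + 1 -- invalid sequence: skip 1 octet, otherwise we would loop forever
      else
        numindx = numindx + numutflen -- ZERO-based index
      end--if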
      numchrlen = numchrlen + 1 -- invalid char:s do count too
      strret = strret .. strchrblok

    end--while

    strret = strret .. "<br>"

  end--if ((boocrap==false) and boobigbox) then

  ---- HARD NOWIKI ----

  -- we have "strinc" and "numlongtx"

  -- boohardnw "true" from "1" or "2" or "3" -- do "hard nowiki"
  -- boohnwcol "true" from "2" or "3" only -- requested colour
  -- boohnwspt "true" from "3" only -- split UTF8 char:s

  if ((boocrap==false) and boohardnw) then

    strret = strret .. "<big>" .. lfultencode (strinc,boohnwcol,boohnwspt) .. "</big><br>"

  end--if

  ---- UTF8 BLOAT ----

  -- "numchrlen" is counted only in the "BIG BOXES" pass above, thus it
  -- remains ZERO here if big boxes are disabled (for example "0011")

  if ((boocrap==false) and booutfblo) then

    strsngchar = "number of UTF8<br>char:s : " .. lfbunch (tostring (numchrlen) )
    strret = strret .. constrtabu3 .. lfwarna (8) .. constrtabu4 .. strsngchar .. constrtabu5 .. "<br>"

  end--if

  ---- RETURN THE JUNK STRING ----

  return strret

end--function

  ---- RETURN THE JUNK LUA TABLE ----

return utf8debug