Modulo:utf8debug

A self-test is available on the page Ŝablono:debu.
--[===[

MODULE "UTF8DEBUG" (debug UTF8 text)

"eo.wiktionary.org/wiki/Modulo:utf8debug" <!--2024-Oct-18-->
"id.wiktionary.org/wiki/Modul:utf8debug"
"sv.wiktionary.org/wiki/Modul:utf8debug"
"eo.wikipedia.org/wiki/Modulo:Utf8debug"
"id.wikipedia.org/wiki/Modul:Utf8debug"

Purpose: allows debugging of an incoming UTF8 string (directly submitted or
         generated by a template) by splitting it into isolated chars,
         checking validity of the UTF8 stream and displaying chars and codes,
         or by performing a "hard nowiki" and displaying complete text
         including spaces and line breaks

Utilo: ebligas sencimigi enirantan UTF8-signocxenon (rekte enigitan aux
       generitan far sxablono) per dispecigo farigxante apartaj signoj,
       kontrolante validecon de la UTF8-vico kaj montrante signojn kaj kodojn,
       aux per efektivigo de "hard nowiki" kaj montrado de kompleta teksto
       inkluzive spacojn kaj liniorompojn

Manfaat: memungkinkan ...

Syfte: moejliggoer att debugga en inkommande UTF8 straeng (direkt oeverlaemnad
       eller ...

Used by templates / Uzata far sxablonoj:
* only "debu" (not to be called from any other place, to be
  used only for debugging, see below)

Required submodules / Bezonataj submoduloj:
* none / neniuj

Required images:
* "File:Return arrow.svg", Public Domain

This module can accept parameters whether sent to itself (own frame) or
to the caller (caller's frame). If there is a parameter "caller=true"
on the own frame then that own frame is discarded in favor of the
caller's one.
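
The frame selection described above can be sketched as follows. This is
only a minimal illustration: the helper name "pickargs" and the mock frame
tables are hypothetical, they merely stand in for the frame object that
Scribunto passes to the module.

```lua
-- Hypothetical helper: return the argument table to use, following
-- the "caller=true" rule described above.
local function pickargs (frame)
  local args = frame.args -- "args" from our own frame
  if (args['caller']=='true') then
    args = frame:getParent().args -- "args" from the caller's frame
  end--if
  return args
end--function pickargs

-- Mock frames for illustration only (a real frame comes from Scribunto)
local mockparent = { args = { 'text from the caller' } }
local mockown = {
  args = { caller = 'true' },
  getParent = function (self) return mockparent end
}
local mockplain = { args = { 'text sent directly' } }
```

With these mocks, "pickargs (mockown)" yields the caller's table and
"pickargs (mockplain)" yields the own table.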

Incoming: * one anonymous and obligatory parameter
            * input string (empty is legal but not very useful,
              missing i.e. "nil" is treated as empty, 64 KiO max)
          * two named and optional parameters
            * "outctl=" output type selection control string (4 digits,
              boolean or fourstate)
              * show octet bloat ("0" or "1")
              * show big boxes for single char:s ("0" or "1")
              * show hard nowiki ("0" or "1" (no color) or "2" (colored)
                or "3" (colored and split UTF8))
              * show UTF8 char bloat ("0" or "1")
              default is "1101", "0000" is prohibited, "nw" is synonymous
              with "0010", empty input switches the type to "1000"
              unless "empsil=1"
            * "empsil=1" makes empty input return an empty string
              instead of the default big red box
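
As an illustration of the "outctl=" rules above, the following sketch turns
a control string into four flags. The function name "parseoutctl" and the
field names of the returned table are hypothetical and do not appear in this
module, which validates the string with "lfivaliumdctlstr" instead.

```lua
-- Hypothetical sketch: parse an "outctl=" control string.
-- Returns a table of flags, or nil if the string is prohibited or malformed.
local function parseoutctl (strctl)
  if (strctl==nil) then
    strctl = '1101' -- default
  end--if
  if (strctl=='nw') then
    strctl = '0010' -- alias
  end--if
  if (strctl=='0000') then
    return nil -- prohibited, nothing would be shown
  end--if
  local d1, d2, d3, d4 = string.match (strctl, '^([01])([01])([0-3])([01])$')
  if (d1==nil) then
    return nil -- wrong length or digit out of range
  end--if
  return {
    octetbloat = (d1=='1'),
    bigboxes   = (d2=='1'),
    hardnowiki = tonumber (d3), -- 0 off, 1 plain, 2 colored, 3 colored and split
    utf8bloat  = (d4=='1')
  }
end--function parseoutctl
```

A single anchored pattern is used here because every position has a fixed
set of legal digits, so length and range are checked in one step.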

Returned: * large text with complicated wikicode, empty possible

This module is unbreakable (when called with correct module name
and function name).

Cxi tiu modulo estas nerompebla (kiam vokita kun gxustaj nomo de modulo
kaj nomo de funkcio).

This module is special in that it can seem unused and useless. Do not
delete it just because no pages transclude it. Its purpose is not to be used
in article, lemma, appendix or whatever pages. It is intended to be used
temporarily when debugging UTF8 text, preferably from the sandbox. With the
option "hard nowiki" it can even be used for documentation and selftest of
modules and templates. Then the proxy template "debu" can be classed as
a documentation template. Still the template "pate" is a better choice
for this purpose.

Note that "<nowiki>" does NOT work in wikitext generated by a module. We
must dec-encode instead. This works for the common problem char:s ":#*='[]"
(there is no problem with curly "{}"). But dec-encoding does NOT work for UTF8
multi-octet char:s. So we dec-encode only some ANSI/ASCII char:s $00...$7F
and let the remaining ones pass unchanged (both for big boxes mode and hard
nowiki mode). Note that dec-encoding does NOT work for LF either. We catch LF
separately in any case, and in the big boxes mode we show its name "LF",
whereas in the hard nowiki mode we show an arrow as image.
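
The dec-encoding idea can be sketched as below. This is a simplified
illustration, not the module's own routine: the function name "decencode" is
hypothetical, and LF, SPACE and control codes get no special treatment here
(the module handles those separately in "lfiultencode").

```lua
-- Hypothetical sketch of dec-encoding: keep 0...9 A...Z a...z, let
-- octets $80...$FF pass unchanged, dec-encode all other ASCII octets.
local function decencode (strin)
  local tabout = {}
  for numpos = 1, string.len (strin) do
    local numoct = string.byte (strin, numpos)
    local boosafe = ((numoct>=48) and (numoct<=57)) or
                    ((numoct>=65) and (numoct<=90)) or
                    ((numoct>=97) and (numoct<=122))
    if (boosafe or (numoct>127)) then
      tabout [numpos] = string.char (numoct) -- copy unchanged
    else
      tabout [numpos] = '&#' .. tostring (numoct) .. ';' -- dec-encode
    end--if
  end--for
  return table.concat (tabout)
end--function decencode
```

Since octets above $7F are copied one by one without change, a multi-octet
UTF8 char survives intact as long as the input was valid.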

In text coming from a module some evil stuff (invalid UTF8 sequence, ZERO,
FF/12, ZWSP, LRM, RLM) is replaced with U+$FFFD by MediaWiki, whereas
other dubious content (TAB, CR, NBSP, BOM) survives.

Color coding of the result in the big boxes mode:

1  white         ordinary ANSI/ASCII char
2  light grey    valid 2-octet UTF8 with some exceptions
3  grey          valid 3-octet UTF8 with some exceptions
4  dark grey     valid 4-octet UTF8 (with no exceptions yet)
5  red           code ZERO or invalid UTF8 sequence (plus empty input
                 not limited to the big boxes mode)
6  yellow        dubious TAB CR NBSP ZWSP LRM RLM BOM
7  light yellow  invisible LF SPACE
8  light blue    initial octet bloat report (blue except on empty
                 input) and final UTF8 bloat report
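
The color classes above can be read as a small decision procedure. The
sketch below rests on an assumption about how the classes combine, namely
that the listed special codepoints take precedence over the length-based
grey levels. The names "tabknown" and "colorindex" are hypothetical. Class 8
belongs to the report boxes and is never returned for a char.

```lua
-- Hypothetical lookup: special codepoints and their color class
local tabknown = {
  [0]=5,                                        -- ZERO -> red
  [9]=6, [13]=6, [160]=6,                       -- TAB CR NBSP -> yellow
  [8203]=6, [8206]=6, [8207]=6, [65279]=6,      -- ZWSP LRM RLM BOM -> yellow
  [10]=7, [32]=7                                -- LF SPACE -> light yellow
}

-- Hypothetical sketch: pick color class 1...7 for one decoded char.
-- "numlength" is the UTF8 length 1...4, or ZERO for an invalid sequence.
local function colorindex (numcodepoint, numlength)
  if (numlength==0) then
    return 5 -- invalid UTF8 sequence -> red
  end--if
  local numknown = tabknown [numcodepoint]
  if (numknown~=nil) then
    return numknown -- one of the exceptions
  end--if
  return numlength -- 1 white, 2...4 grey getting darker
end--function colorindex
```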

Error <<FATAL in "utf8debug" : internal error or invalid
parameter>> is NOT included in the above list, possible causes:
* internal error
* input string too long
* extraneous anonymous parameter
* parameter "outctl=" or "empsil=" bad

Some interesting UTF8 codepoints:

--------  ----------  -----------------------  -------  ----------------------
codepo    codepo      UTF-8                    short    official name and
int HEX   int DEC     encoding                 name     silly notes
--------  ----------  -----------------------  -------  ----------------------
   $0000     #00'000                           ZERO
   $0009     #00'009                           TAB
   $000A     #00'010                           LF
   $000D     #00'013                           CR
   $0020     #00'032                           SPACE
   $007F     #00'127                                    inclusive end of 1-oct
   $0080     #00'128  $C2,$80                           begin of 2-oct
   $00A0     #00'160  $C2,$A0                  NBSP     don't break me
   $00BF     #00'191  $C2,$BF                           inclusive end of $C2,xx
   $00C0     #00'192  $C3,$80                           begin of $C3,xx
   $00FF     #00'255  $C3,$BF                           inclusive end of $C3,xx
   $0100     #00'256  $C4,$80                           begin of $C4,xx
   $0200     #00'512  $C8,$80                           uppercase "A" with something above
   $0300     #00'768  $CC,$80                           strange horizontally misplaced apo
   $034F     #00'847  $CD,$8F                           COMBINING GRAPHEME JOINER
   $0401     #01'025  $D0,$81                           CCCP letter with case delta $50
   $0451     #01'105  $D1,$91                           CCCP letter with case delta $50
   $07FF     #02'047  $DF,$BF                           inclusive end of 2-oct
   $0800     #02'048  $E0,$80,$80                       begin of 3-oct
   $200B     #08'203  $E2,$80,$8B              ZWSP     ZERO WIDTH SPACE
   $200C     #08'204  $E2,$80,$8C                       ZWNJ ZERO WIDTH NON-JOINER
   $200D     #08'205  $E2,$80,$8D                       ZWJ ZERO WIDTH JOINER
   $200E     #08'206  $E2,$80,$8E              LRM      LEFT-TO-RIGHT MARK
   $200F     #08'207  $E2,$80,$8F              RLM      RIGHT-TO-LEFT MARK
   $2060     #08'288  $E2,$81,$A0                       absurd "WORD JOINER"
   $2068     #08'296  $E2,$81,$A8              FSI      FIRST STRONG ISOLATE
   $20AC     #08'364  $E2,$82,$AC                       EURO (bank robbery sign)
   $D7FF     #55'295  $ED,$9F,$BF                       last before banned range
   $D800     #55'296  ($ED,$A0,$80)                     begin of banned range
   $DFFF     #57'343  ($ED,$BF,$BF)                     inclusive end of banned range
   $E000     #57'344  $EE,$80,$80                       begin of legal range again
   $FEFF     #65'279  $EF,$BB,$BF 239,187,191  BOM      absurd "BOM" sigi
   $FFFD     #65'533  $EF,$BF,$BD 239,191,189           REPLACEMENT CHARACTER
   $FFFE     #65'534  $EF,$BF,$BE 239,191,190           invalid (last 2)
   $FFFF     #65'535  $EF,$BF,$BF 239,191,191           invalid (last 2), inclusive end of 3-oct
$01'0000     #65'536  $F0,$90,$80,$80                   begin of 4-oct
$01'0348     #66'376  $F0,$90,$8D,$88                   one of few somewhat known
$0F'FFFF  #1'048'575  $F3,$BF,$BF,$BF                   one Mi almost reached
$10'0000  #1'048'576  $F4,$80,$80,$80                   one Mi reached here and no end yet
$10'FFFE  #1'114'110  $F4,$8F,$BF,$BE                   invalid (last 2)
$10'FFFF  #1'114'111  $F4,$8F,$BF,$BF                   invalid (last 2), inclusive end of unicode
$11'0000  #1'114'112 ($F4,$90,$80,$80)                  invalid (finally out of range)
--------  ----------  -----------------------  -------  ----------------------

* UTF8 is defined by "RFC 3629" from 2003-Nov (although it had already
  existed before that)
* UTF8 sigi AKA BOM : HEX: $EF $BB $BF | DEC: 239 187 191 | ABS: $FEFF
* absolute unicode range has 17 (seventeen !!!) planes of 65'536 values each
* 1'114'112 codepoints in total, most of them unused, plane ZERO is
  somewhat full, other ones are almost or totally empty
* official notation: "U+0000" ... "U+10FFFF"
* codepoint range ZERO to 31 is valid by RFC but mostly useless, same for
  127, range 128 to 159, whereas 160 AKA "&nbsp;" does appear in wikitext
* range "U+D800" to "U+DFFF" is invalid by RFC
* UTF8 starting octet can be only $C2 to $DF , $E0 to $EF , $F0 to $F4
  giving a continuous range from $C2 to $F4 of size $33 = #51 values
* UTF8 subsequent octet:s (1 or 2 or 3) can be only $80 to $BF
  (6 bit:s, 64 possible values)
* octet values $C0, $C1 and $F5 to $FF may never appear in a UTF8 stream

Abs. char number range |      UTF8 octet sequence       | beginning octet
   (hexadecimal)       |            (binary)            |
-----------------------+--------------------------------+------------------
0000'0000 to 0000'007F | 0xxxxxxx                       | $00        to $7F
0000'0080 to 0000'07FF | 110xxxxx 10xxxxxx              | $C0 -> $C2 to $DF
0000'0800 to 0000'FFFF | 1110xxxx 10xxxxxx 10xxxxxx     | $E0        to $EF
0001'0000 to 0010'FFFF | 11110xxx 10xxxxxx 10xxxxxx ... | $F0 to $F7 -> $F4

]===]

local exporttable = {}

------------------------------------------------------------------------

---- CONSTANTS [O] ----

------------------------------------------------------------------------

-- constant strings (error circumfixes)

  local constrelabg = '<span class="error"><b>'  -- lagom whining begin
  local constrelaen = '</b></span>'              -- lagom whining end
  local constrlaxhu = '&nbsp;#&nbsp;#&nbsp;'     -- lagom -> huge circumfix

-- HTML stuff for our tiny table and background around every char

  local constrtabu3  = '<table style="display:inline-block; vertical-align:middle; margin:0.15em; padding:0.15em; border:0.15em solid #000000; text-align:center; background-color:#' -- missing color code and many char:s (only 3 ';">' to close element)
  local constrtabu4  = ';"><tr><td>'
  local constrtabu5  = '</td></tr></table>'
  local constrbkg3   = '<span style="font-size:160%;background-color:#E0A0FF;">&nbsp;'
  local constrbkg4   = '&nbsp;</span>'

  local constrpilen = '[[File:Return arrow.svg|20px|link=]]' -- the file is Public Domain

-- color for "lfiultencode"

local contabempatwarna = {[0]='FFA0A0','D0FFD0','A0A0FF','D0D0D0'} -- red, light green, blue, light grey

-- color for main for big boxes and summary boxes

-- 1 white default, 2...4 grey getting darker,
-- 5 red (also bloat box), 6 yellow, 7 light yellow, 8 light blue

-- fill the gap between "constrtabu3" and "constrtabu4" always with
-- help of this table, do NOT put hardcoded color values there

local contabwar8na = {'FFFFFF','E8E8E8','D0D0D0','B8B8B8','FF6060','FFFF60','FFFFB0','C8C8FF'}  -- (index 1...8)

-- known codepoints

-- invalid sequence or codepoint ZERO -> "R" -> "red class error"
-- TAB CR NBSP ZWSP LRM RLM BOM -> "Y" -> "yellow class error"
-- LF SPACE -> "L" -> "light yellow class char"

local contabcodepoints = {}
contabcodepoints [   -1] = {''      , 'R'} -- pseudo codepoint, name not used -> "constrinvalid"
contabcodepoints [    0] = {'ZERO'  , 'R'}
contabcodepoints [    9] = {'TAB'   , 'Y'}
contabcodepoints [   10] = {'LF'    , 'L'}
contabcodepoints [   13] = {'CR'    , 'Y'}
contabcodepoints [   32] = {'SPACE' , 'L'}
contabcodepoints [  160] = {'NBSP'  , 'Y'}
contabcodepoints [ 8203] = {'ZWSP'  , 'Y'}
contabcodepoints [ 8206] = {'LRM'   , 'Y'}
contabcodepoints [ 8207] = {'RLM'   , 'Y'}
contabcodepoints [65279] = {'BOM'   , 'Y'}

-- constant strings EN vs EO vs ID vs SV

    -- local constrkosong = 'empty string submitted'           -- EN
      local constrkosong = 'malplena signocxeno transdonita'  -- EO
        -- local constrkosong = 'string datang bersifat kosong'    -- ID
          -- local constrkosong = 'inkommen string aer tom'          -- SV

    -- local constrinvalid = 'invalid UTF8 value sequence'         -- EN
      local constrinvalid = 'nevalida sekvo de UTF8-valoroj'      -- EO
        -- local constrinvalid = 'rantai nilai UTF8 bersifat invalid'  -- ID
          -- local constrinvalid = 'ogiltig sekvens av UTF8-vaerden'     -- SV

------------------------------------------------------------------------

---- MATH FUNCTIONS [E] ----

------------------------------------------------------------------------

-- Local function MATHDIV

local function mathdiv (xdividend, xdivisor)
  local resultdiv = 0 -- LUA lacks a DIV operator :-(
  resultdiv = math.floor (xdividend / xdivisor)
  return resultdiv
end--function mathdiv

-- Local function MATHMOD

local function mathmod (xdividendo, xdivisoro)
  local resultmod = 0 -- MOD operator is "%", bitwise AND operator is lacking too
  resultmod = xdividendo % xdivisoro
  return resultmod
end--function mathmod

------------------------------------------------------------------------

-- Local function MATHXOR

-- Depends on functions :
-- [E] mathdiv mathmod

local function mathxor (xa, xb)
  local resultxor = 0
  local crap6 = 0
  local crap7 = 0
  local crap8 = 1 -- single bit value 1 -> 2 -> 4 -> 8 ...
  while true do
    if ((xa==0) and (xb==0)) then
      break -- we have run out of bits on both
    end--if
    crap6 = mathmod (xa,2) -- pick bit before dividing
    crap7 = mathmod (xb,2) -- pick bit before dividing
    xa    = mathdiv (xa,2) -- shift right
    xb    = mathdiv (xb,2) -- shift right
    if (crap6~=crap7) then
      resultxor = resultxor + crap8 -- add one bit rtl only if true
    end--if
    crap8 = crap8 * 2
  end--while
  return resultxor
end--function mathxor

------------------------------------------------------------------------

---- NUMBER CONVERSION FUNCTIONS [N] ----

------------------------------------------------------------------------

-- Local function LFDEC1DIGIT

-- Convert 1 decimal ASCII digit to integer 0...9 (255 if invalid).

local function lfdec1digit (num1digit)
  num1digit = num1digit - 48 -- may become invalid
  if ((num1digit<0) or (num1digit>9)) then
    num1digit = 255
  end--if
  return num1digit
end--function lfdec1digit

------------------------------------------------------------------------

-- Local function LFNUINT8TOHEX

-- Convert UINT8 (0...255) to a 2-digit hex string.

-- Depends on functions :
-- [E] mathdiv mathmod

local function lfnuint8tohex (numinclow)
  local strheksulo = ''
  local numhajhaj = 0
  numhajhaj = mathdiv (numinclow,16)
  numinclow = mathmod (numinclow,16)
  if (numhajhaj>9) then
    numhajhaj = numhajhaj + 7 -- now 0...9 or 17...22
  end--if
  if (numinclow>9) then
    numinclow = numinclow + 7 -- now 0...9 or 17...22
  end--if
  strheksulo = string.char (numhajhaj+48) .. string.char (numinclow+48)
  return strheksulo
end--function lfnuint8tohex

------------------------------------------------------------------------

-- Local function LFUINT32TOHEX

-- Convert UINT32 (0 ... $FFFF'FFFF = #4'294'967'295) to
-- a (2 or 4 or 6 or 8)-digit hex string.

-- Depends on functions :
-- [N] lfnuint8tohex
-- [E] mathdiv mathmod

local function lfuint32tohex (numincom)
  local strheksulego = ''
  while true do
    strheksulego = lfnuint8tohex ( mathmod (numincom,256) ) .. strheksulego
    numincom = mathdiv (numincom,256)
    if (numincom==0) then
      break
    end--if
  end--while
  return strheksulego
end--function lfuint32tohex

------------------------------------------------------------------------

---- LOW LEVEL STRING FUNCTIONS [G] ----

------------------------------------------------------------------------

-- test whether char is an ASCII digit "0"..."9", return boolean

local function lfgtestnum (numkaad)
  local boodigit = false
  boodigit = ((numkaad>=48) and (numkaad<=57))
  return boodigit
end--function lfgtestnum

------------------------------------------------------------------------

-- test whether char is an ASCII uppercase letter, return boolean

local function lfgtestuc (numkode)
  local booupperc = false
  booupperc = ((numkode>=65) and (numkode<=90))
  return booupperc
end--function lfgtestuc

------------------------------------------------------------------------

-- test whether char is an ASCII lowercase letter, return boolean

local function lfgtestlc (numcode)
  local boolowerc = false
  boolowerc = ((numcode>=97) and (numcode<=122))
  return boolowerc
end--function lfgtestlc

------------------------------------------------------------------------

-- Local function LFGIS62SAFE

-- Test whether incoming ASCII char is very safe (0...9 A...Z a...z).

-- Depends on functions :
-- [G] lfgtestnum lfgtestuc lfgtestlc

local function lfgis62safe (numcxair)
  local booguud = false
  booguud = lfgtestnum (numcxair) or lfgtestuc (numcxair) or lfgtestlc (numcxair)
  return booguud
end--function lfgis62safe

------------------------------------------------------------------------

---- SOME FUNCTIONS ---- !!!FIXME!!!

------------------------------------------------------------------------

-- Local function LFHEXDEC

-- Example output : "$FE=#254" (we have to save text width)

-- Depends on "lfnuint8tohex"

local function lfhexdec (numkodo)
  local strrezulto = ''
    strrezulto = "$" .. lfnuint8tohex (numkodo) .. "=#" .. tostring (numkodo)
  return strrezulto
end--function lfhexdec

------------------------------------------------------------------------

-- Local function LFNUMTODECBUN

-- Convert non-negative integer to decimal string with bunching.

-- Depends on functions :
-- [E] mathdiv mathmod

local function lfnumtodecbun (numnomoriin)
  local strnomorut = ''
  local numindeex = 0
  local numcaar = 0
  numnomoriin = math.floor (numnomoriin) -- non-integer numbers suck
  if (numnomoriin<0) then
    numnomoriin = 0 -- negative numbers suck
  end--if
  while true do
    numcaar = mathmod(numnomoriin,10) + 48 -- get digit moving right to left
    numnomoriin = mathdiv(numnomoriin,10)
    if (numindeex==3) then
      strnomorut = "'" .. strnomorut -- ueglstr apo
      numindeex = 0
    end--if
    strnomorut = string.char(numcaar) .. strnomorut -- ueglstr digit
    numindeex = numindeex + 1
    if (numnomoriin==0) then
      break
    end--if
  end--while
  return strnomorut
end--function lfnumtodecbun

------------------------------------------------------------------------

---- UTF8 FUNCTIONS [U] ----

------------------------------------------------------------------------

-- Local function LFULNUTF8CHAR

-- Evaluate length of a single UTF8 char in octet:s.

-- Input  : * numbgoctet  -- beginning octet of a UTF8 char

-- Output : * numlen1234x -- number 1...4 or ZERO if invalid

-- Does NOT thoroughly check the validity, looks at 1 octet only.

local function lfulnutf8char (numbgoctet)
  local numlen1234x = 0
    if (numbgoctet<128) then
      numlen1234x = 1 -- $00...$7F -- ANSI/ASCII
    end--if
    if ((numbgoctet>=194) and (numbgoctet<=223)) then
      numlen1234x = 2 -- $C2 to $DF
    end--if
    if ((numbgoctet>=224) and (numbgoctet<=239)) then
      numlen1234x = 3 -- $E0 to $EF
    end--if
    if ((numbgoctet>=240) and (numbgoctet<=244)) then
      numlen1234x = 4 -- $F0 to $F4
    end--if
  return numlen1234x
end--function lfulnutf8char

------------------------------------------------------------------------

-- Local function LFUTF8DEKO

-- Decode a single UTF8 char, return ZERO length if invalid.

-- Output : * "tabresult" -- LUA table [0] length and [1] codepoint

-- Depends on functions :
-- [E] mathdiv mathmod mathxor

local function lfutf8deko (num0, num1, num2, num3)

  local tabresult = {}
  local numlength = 0 -- preASSume invalid
  local numkodepoin = 0 -- preASSume invalid

  num1 = mathxor (num1,128) -- XOR 3 of 4
  num2 = mathxor (num2,128) -- XOR 3 of 4
  num3 = mathxor (num3,128) -- XOR 3 of 4

  while true do -- fake loop

    if ((num0>193) and (num1>63)) then
      break -- to join mark
    end--if
    if ((num0>223) and (num2>63)) then
      break -- to join mark
    end--if
    if ((num0>239) and (num3>63)) then
      break -- to join mark
    end--if

    if (num0<128) then -- ZERO to $7F
      numkodepoin = num0
      numlength = 1
      break -- to join mark
    end--if

    if ((num0>193) and (num0<224)) then -- $C0 # $C2 to $DF
      numkodepoin = (mathxor(num0,192)) * 64 + num1
      if ((numkodepoin>127) and (numkodepoin<2048)) then
        numlength = 2
      end--if
      break -- to join mark
    end--if

    if ((num0>223) and (num0<240)) then -- $E0 to $EF
      numkodepoin = (mathxor(num0,224)) * 4096 + num1 * 64 + num2
      if (((numkodepoin>2047) and (numkodepoin<55296)) or ((numkodepoin>57343) and (numkodepoin<65536))) then
        numlength = 3
      end--if
      break -- to join mark
    end--if

    if ((num0>239) and (num0<245)) then -- $F0 to $F7 # $F4
      numkodepoin = (mathxor(num0,240)) * 262144 + num1 * 4096 + num2 * 64 + num3
      if ((numkodepoin>65535) and (numkodepoin<1114112)) then
        numlength = 4
      end--if
      break -- to join mark
    end--if

    break -- finally to join mark
  end--while -- fake loop -- join mark

  tabresult [0] = numlength
  tabresult [1] = numkodepoin
  return tabresult

end--function lfutf8deko

------------------------------------------------------------------------

---- HIGH LEVEL STRING FUNCTIONS [I] ----

------------------------------------------------------------------------

-- Local function LFIULTENCODE

-- Generously encode char:s to prevent parsing and show hex if needed, make
-- single chars visible, bypass all wiki parsing and HTML parsing. Our cool
-- module has brewed something with "[["..."]]" and repeated spaces but we
-- want to see plain text for debugging purposes. Thus we dec-encode some
-- char:s, use NBSP to fix spaces, workaround EOL, and maybe add color.

-- Input  : * strkrampuj  : string, empty tolerable, but type "nil" is NOT
--          * nummxwidth  : maximal width of text (20...200, default 80)
--          * boowarrna   : "true" to enable color
--          * boosplitutf : "true" to split UTF8 char:s into hex numbers

-- Output : * strkood     : string, empty in worst case

-- Depends on functions :
-- [U] lfulnutf8char
-- [G] lfgtestnum lfgtestuc lfgtestlc lfgis62safe
-- [N] lfnuint8tohex
-- [E] mathdiv mathmod

-- Depends on constants :
-- * string constrpilen [[File:...]]
-- * table contabempatwarna 0...3

-- This helps with:
-- * "[["..."]]", "["..."]", "*", "#", ":" (note that there is no
--   problem with plain "{{"..."}}")
-- * multiple spaces (they are no longer reduced to one piece due to HTML)
-- * EOL:s (they do not vanish in favor of spaces due to HTML, instead
--   the EOL arrow is shown)
-- * too long lines (they are force-broken)
-- * codes below 32 other than EOL

-- There is also "mw.text.nowiki" with some limitations, most notably
-- about multiple spaces and EOL:s.

-- In order to fix EOL we show the EOL arrow (preceded by space) for every
-- incoming LF, but do a "<br>" only once after multiple subsequent LF:s.

-- We must be UTF8-aware. A UTF8 char must be either split into hex codes,
-- or preserved over its complete length ie not split nor encoded at all.

-- Note that this causes BLOAT. The caller is responsible for
-- adding "<big>"..."</big>" if desired.

local function lfiultencode (strkrampuj,nummxwidth,boowarrna,boosplitutf)

  local stronechar = ''
  local strkolorr = ''
  local strkood = ''
  local numstrlne = 0
  local numpeekynx = 1 -- ONE-based index
  local numcahr = 0
  local numcxxhr = 0
  local numutf8len = 0
  local numaccuwidth = 0 -- accumulated width
  local numcolor = 0 -- 0,1,2,3 -- red, light green, blue, light grey
  local boonbsp = true -- "true" needed for junk lines containing only space
  local boosplnow = false -- allow forced split in some cases
  local boofickpilen = false -- true after LF arrow causes "<br>" later

  if (type(nummxwidth)~='number') then
    nummxwidth = 80
  end--if
  if ((nummxwidth<20) or (nummxwidth>200)) then
    nummxwidth = 80
  end--if
  numstrlne = string.len (strkrampuj)

  while true do -- outer genuine loop

    if (numpeekynx>numstrlne) then
      break
    end--if
    numcahr = string.byte (strkrampuj,numpeekynx,numpeekynx)
    numpeekynx = numpeekynx + 1 -- ONE-based index

    while true do -- inner fake loop
      if (numcahr==10) then
        break -- to join mark -- inner fake loop -- special processing for LF
      end--if
      if (numcahr==32) then
        if (boonbsp) then
          stronechar = '&nbsp;' -- this prevents space reduction
        else
          stronechar = ' '
        end--if
        boonbsp = not boonbsp
        break -- to join mark -- inner fake loop
      end--if
      if (numcahr<32) then
        stronechar = '{$' .. lfnuint8tohex (numcahr) .. '}' -- always as hex
        break -- to join mark -- inner fake loop
      end--if
      if (numcahr>127) then
        boosplnow = boosplitutf
        numutf8len = lfulnutf8char (numcahr)
        if (numutf8len==0) then
          boosplnow = true -- forced split for broken UTF8 sequence
        else
          numutf8len = numutf8len - 1 -- more char:s to pick
        end--if
        if ((numpeekynx+numutf8len)>(numstrlne+1)) then
          boosplnow = true -- forced split for truncated UTF8 sequence
        end--if
        if (boosplnow) then
          stronechar = '{$' .. lfnuint8tohex (numcahr) .. '}'
        else
          stronechar = string.char (numcahr) -- preserve "numcahr" below
          while true do -- deep loop copy UTF8 char
            if (numutf8len==0) then
              break
            end--if
            numcxxhr = string.byte (strkrampuj,numpeekynx,numpeekynx)
            numpeekynx = numpeekynx + 1
            numutf8len = numutf8len - 1
            stronechar = stronechar .. string.char (numcxxhr)
          end--while -- deep loop copy UTF8 char
        end--if
        break -- to join mark
      end--if (numcahr>127) then
      if (lfgis62safe(numcahr)) then -- safe ASCII ie 0...9 A...Z a...z
        stronechar = string.char (numcahr) -- do NOT encode safe char:s
        break -- to join mark
      end--if
      stronechar = '&#' .. tostring (numcahr) .. ';' -- dec-encode some ASCII
      break -- finally to join mark
    end--while -- inner fake loop -- join mark

    if (numcahr==10) then
      if (numaccuwidth>=nummxwidth) then
        strkood = strkood .. '<br>'
        numaccuwidth = 0
        boonbsp = true -- "true" needed for junk lines containing only space
      end--if
      strkood = strkood .. '&nbsp;' .. constrpilen
      numaccuwidth = numaccuwidth + 2 -- counts doubly
      boofickpilen = true
    else
      if (boofickpilen or (numaccuwidth>=nummxwidth)) then
        strkood = strkood .. '<br>'
        numaccuwidth = 0
        boonbsp = true -- "true" needed for junk lines containing only space
      end--if
      if (boowarrna) then
        strkolorr = contabempatwarna [numcolor]
        numcolor = mathmod ((numcolor+1),4) -- index 0...3
        strkood = strkood .. '<span style="background-color:#' .. strkolorr .. ';">' .. stronechar .. '</span>'
      else
        strkood = strkood .. stronechar
      end--if
      numaccuwidth = numaccuwidth + 1
      boofickpilen = false
    end--if (numcahr==10) else

  end--while -- outer genuine loop

  return strkood

end--function lfiultencode

------------------------------------------------------------------------

-- Local function LFIVALIUMDCTLSTR

-- Validate control string against restrictive pattern (dec).

-- Input  : * strresdpat -- restrictive pattern (max 200 char:s)
--          * strctldstr -- incoming suspect

-- Output : * numbadpos -- bad position, or 254 wrong length, or 255 success

-- Depends on functions :
-- [N] lfdec1digit

-- Content of restrictive pattern:
-- * "."                           -- skip check
-- * "-" and "?"                   -- must match literally
-- * digit "1"..."9" ("0" invalid) -- inclusive upper limit (min ZERO)

local function lfivaliumdctlstr (strresdpat, strctldstr)

  local numlenresdpat = 0
  local numldninkom = 0
  local numcomperindex = 0 -- ZERO-based
  local numead2 = 0
  local numead3 = 0
  local numbadpos = 254 -- preASSume guilt (len differ or too long or ...)
  local booddaan = false

  numlenresdpat = string.len(strresdpat)
  numldninkom = string.len(strctldstr)
  if ((numlenresdpat<=200) and (numlenresdpat==numldninkom)) then
    while true do
      if (numcomperindex==numlenresdpat) then
        numbadpos = 255
        break -- success
      end--if
      numead2 = string.byte(strresdpat,(numcomperindex+1),(numcomperindex+1)) -- rest
      numead3 = string.byte(strctldstr,(numcomperindex+1),(numcomperindex+1)) -- susp
      booddaan = false
      if ((numead2==45) or (numead2==63)) then
        if (numead2~=numead3) then
          numbadpos = numcomperindex
          break -- "-" and "?" must match literally
        end--if
        booddaan = true -- position OK
      end--if
      if (numead2==46) then -- skip for dot "."
        booddaan = true -- position OK
      end--if
      if (not booddaan) then
        numead2 = lfdec1digit(numead2) -- rest
        if (numead2>9) then -- limit defined or bad ??
          numbadpos = 254
          break -- bad restrictive pattern
        else
          numead3 = lfdec1digit(numead3) -- susp
          if (numead3>numead2) then
            numbadpos = numcomperindex
            break -- value limit violation
          end--if
        end--if (numead2>9) else
      end--if (not booddaan) then
      numcomperindex = numcomperindex + 1
    end--while
  end--if ((numlenresdpat<=200) and (numlenresdpat==numldninkom)) then

  return numbadpos

end--function lfivaliumdctlstr

------------------------------------------------------------------------

---- VARIABLES [R] ----

------------------------------------------------------------------------

function exporttable.ek (arxframent)

  -- general unknown type

  local vartamp = 0 -- variable without type

  -- special type "args" AKA "arx"

  local arxsomons = 0 -- metaized "args" from our own or caller's "frame"

  -- general "tab"

  local tabutf8dec = {}

  -- general "str"

  local strinctx  = ''  -- incoming text from anon parameter
  local strctrl   = ''  -- from optional parameter "outctl="

  local strmytemp = ''
  local strret    = ''  -- final output string

  -- general "num"

  local numinctx  = 0  -- length of incoming text in octets
  local numchrlen = 0  -- number of UTF8 char:s
  local numtymp   = 0

  -- general "boo"

  local boocrap   = false
  local boopendlf = false -- pending LF between sections

  -- more "boo" from parameters

  local booempsil = false -- from "empsil=1"
  local boooktblo = false -- from "outctl="
  local boobigbox = false -- from "outctl=" show big boxes
  local boohardnw = false -- from "outctl=" fourstate "true" from "1" "2" "3"
  local boohnwcol = false -- from "outctl=" fourstate "true" from "2" "3"
  local boohnwspt = false -- from "outctl=" fourstate "true" from "3" only
  local booutfblo = false -- from "outctl=" show UTF8 char bloat

------------------------------------------------------------------------

---- MAIN [Z] ----

------------------------------------------------------------------------

  ---- GUARD AGAINST INTERNAL ERROR ----

  -- "constrkosong" and "constrinvalid" must be uncommented and assigned

  -- note that reporting of this error may NOT depend on uncommentable strings

  boocrap = ((type(constrkosong)~='string') or (type(constrinvalid)~='string'))

  ---- GET THE ARX (ONE OF TWO) ----

  if (not boocrap) then
    arxsomons = arxframent.args -- "args" from our own "frame"
    vartamp = arxsomons ['caller']
    if (vartamp=='true') then
      arxsomons = arxframent:getParent().args -- "args" from caller's "frame"
    end--if
  end--if
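
  -- Usage sketch (an assumption, the actual wikitext of the template "debu"
  -- is not shown here): with "caller=true" on our own frame, all other
  -- parameters are read from the frame of the calling template, so
  -- the template could contain
  --   {{#invoke:utf8debug|ek|caller=true}}
  -- and a page could then use
  --   {{debu|some text|outctl=nw}}
  -- without "caller=true" the parameters are read from the "#invoke" itself
  --   {{#invoke:utf8debug|ek|some text|outctl=nw}}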

  ---- CHECK ----

  if (not boocrap) then
    if (type(arxsomons[2])=='string') then
      boocrap = true -- too many anonymous parameters
    end--if
  end--if

  ---- SEIZE ONE ANONYMOUS AND OBLIGATORY PARAMETER ----

  -- on success assign "strinctx" and "numinctx" (not to be touched later)

  if (not boocrap) then
    vartamp = arxsomons [1]
    if (type(vartamp)=="string") then
      numinctx = string.len (vartamp)
      if (numinctx>65536) then
        boocrap = true -- too big, we can never encode that much without bloat
      else
        strinctx = vartamp
      end--if
    end--if (type(vartamp)=="string") then
  end--if

  ---- SEIZE AND CHECK NAMED AND OPTIONAL PARAMETER WITH CONTROL STRING ----

  -- default is "1101", "0000" is prohibited, "nw" is synonymous
  -- with "0010", empty input switches the type to "1000" (or to
  -- "0000" if "empsil=1", see "EMPTINESS" below)

  if (not boocrap) then
    do -- scope
      local vartumip = 0
      local numsilur = 0
      strctrl = '1101' -- default
      vartumip = arxsomons ['outctl']
      if (type(vartumip)=='string') then
        if (vartumip=='nw') then -- alias
          vartumip = '0010'
        end--if
        if (vartumip=='0000') then
          boocrap = true
        else
          numsilur = lfivaliumdctlstr ('1131',vartumip) -- 255 is OK
          if (numsilur==255) then
            strctrl = vartumip
          else
            boocrap = true
          end--if
        end--if
      end--if (type(vartumip)=='string') then
    end--do scope
  end--if (not boocrap) then

  ---- SEIZE AND CHECK NAMED AND OPTIONAL PARAMETER WITH BOOLEAN ----

  if (not boocrap) then
    vartamp = arxsomons ['empsil']
    if (type(vartamp)=='string') then
      if (vartamp=='1') then
        booempsil = true
      else
        boocrap = true
      end--if
    end--if
  end--if

  ---- EMPTINESS ----

  if ((not boocrap) and (numinctx==0)) then
    if (booempsil) then
      strctrl = '0000' -- empty input with "empsil=1" switches type to silent "0000"
    else
      strctrl = '1000' -- empty input switches type to "1000"
    end--if
  end--if

  ---- PROCESS CONTROL STRING TO BOOLEANS ----

  if (not boocrap) then
    numtymp = string.byte(strctrl,1,1)
    boooktblo = (numtymp==49) -- show octet bloat
    numtymp = string.byte(strctrl,2,2)
    boobigbox = (numtymp==49) -- big boxes mode
    numtymp = string.byte(strctrl,3,3) -- subtypes of hard nowiki mode
    boohardnw = (numtymp>=49) -- "true" from "1" or "2" or "3"
    boohnwcol = (numtymp>=50) -- "true" from "2" or "3"
    boohnwspt = (numtymp==51) -- "true" from "3" only
    numtymp = string.byte(strctrl,4,4)
    booutfblo = (numtymp==49) -- show UTF8 char bloat
  end--if
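
  -- Worked example (follows from the comparisons above, octet 49 is "1",
  -- 50 is "2", 51 is "3", the value "0031" is assumed to pass the
  -- validation against the pattern "1131"):
  --   "1101" (default) -> boooktblo=true  boobigbox=true  boohardnw=false
  --                       boohnwcol=false boohnwspt=false booutfblo=true
  --   "0010" (ie "nw") -> boooktblo=false boobigbox=false boohardnw=true
  --                       boohnwcol=false boohnwspt=false booutfblo=false
  --   "0031"           -> boooktblo=false boobigbox=false boohardnw=true
  --                       boohnwcol=true  boohnwspt=true  booutfblo=true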

  ---- WHINE IF YOU MUST ----

  -- note that reporting of this error may NOT depend on uncommentable strings

  if (boocrap) then
    strmytemp = 'FATAL in "utf8debug" : internal error or invalid parameter'
    strret = constrlaxhu .. constrelabg .. strmytemp .. constrelaen .. constrlaxhu
  end--if

  ---- SHOW OCTET BLOAT ----

  -- empty input switches type to "1000" ie only "boooktblo" is
  -- true, or to "0000" (invalid from caller)

  if ((not boocrap) and boooktblo) then

    if (numinctx==0) then
      numtymp = 5 -- red on empty string (only 5 or 8 here)
      strmytemp = constrkosong
    else
      numtymp = 8 -- light blue (only 5 or 8 here)
      strmytemp = "number of<br>octet:s : " .. lfnumtodecbun(numinctx)
    end--if

    strret = constrtabu3 .. contabwar8na [numtymp] .. constrtabu4 .. strmytemp .. constrtabu5
    boopendlf = true -- the earliest one, "boopendlf" not assigned above

  end--if

  ---- PROCESS UTF8 AND GENERATE BIG BOXES ----

  -- incoming "strinctx" and "numinctx"

  -- we brew a private HTML table with just one cell for every single char

  -- this is done both for "boobigbox" (use the generated string) and for
  -- "booutfblo" alone (discard the generated string, only "numchrlen"
  -- is needed then)

  numchrlen = 0 -- counts UTF8 char:s, pass to below

  if ((not boocrap) and (boobigbox or booutfblo)) then

    do -- scope

      local varkop     = 0

      local strchname  = ''
      local strchkolr  = ''
      local strsngchar = '' -- one char with "span" background
      local strchrblok = '' -- prebrewed block with table for one char
      local strbunch   = '' -- full report with big boxes

      local numindx    = 0  -- counts octet:s
      local numreserv  = 0
      local numutfone  = 0  -- length of ONE UTF8 char
      local numdecode  = 0  -- decoded "codepoint" value
      local numoct     = 0  -- temp some char
      local numodt     = 0  -- temp some char
      local numoet     = 0  -- temp some char
      local numoft     = 0  -- temp some char
      local numwarna   = 0

      while true do

        if (numindx>=numinctx) then
          break
        end--if

        numreserv = numinctx - numindx -- at least 1
        numoct = string.byte (strinctx,(numindx+1),(numindx+1))
        numodt = 0
        numoet = 0
        numoft = 0
        if (numreserv>=2) then
          numodt = string.byte (strinctx,(numindx+2),(numindx+2))
        end--if
        if (numreserv>=3) then
          numoet = string.byte (strinctx,(numindx+3),(numindx+3))
        end--if
        if (numreserv>=4) then
          numoft = string.byte (strinctx,(numindx+4),(numindx+4))
        end--if

        tabutf8dec = lfutf8deko (numoct,numodt,numoet,numoft)
        numutfone = tabutf8dec [0] -- ZERO invalid or 1...4
        if (numutfone==0) then
          numdecode = -1 -- pseudo codepoint for invalid sequence
        else
          numdecode = tabutf8dec [1] -- have valid codepoint
        end--if
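
        -- Worked example (assuming that "lfutf8deko" follows the standard
        -- UTF8 rules and returns length in [0] and codepoint in [1] as
        -- used above, the function itself is defined elsewhere):
        --   octets $C3,$A9 -> [0] = 2, [1] = $E9 = 233 (small "e" with acute)
        --     since (($C3 - $C0) * 64) + ($A9 - $80) = (3 * 64) + 41 = 233
        --   octet $41      -> [0] = 1, [1] = $41 = 65 (capital "A")
        --   octet $FF      -> [0] = 0 (no valid sequence begins with $FF)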

        varkop = contabcodepoints [numdecode] -- risk for type "nil"
        strchname = ''
        strchkolr = '' -- "R" or "Y" or "L"
        if (type(varkop)=='table') then
          strchname = varkop[1] or ''
          strchkolr = varkop[2] or ''
        end--if
        numwarna = numutfone -- default color index, ZERO invalid or 1...4
        if (strchkolr=='R') then
          numwarna = 5 -- red on code ZERO or invalid sequence
        end--if
        if (strchkolr=='Y') then
          numwarna = 6 -- yellow on TAB CR NBSP ZWSP LRM RLM BOM
        end--if
        if (strchkolr=='L') then
          numwarna = 7 -- light yellow on LF SPACE
        end--if

        strchrblok = constrtabu3 .. contabwar8na [numwarna] .. constrtabu4 .. "<small>index</small> " .. lfnumtodecbun(numindx)
        strchrblok = strchrblok .. "<br><small>beg code</small> " .. lfhexdec (numoct)

        if (numutfone==0) then
          strchrblok = strchrblok .. "<br>" .. constrinvalid -- color already done before
        else
          strchrblok = strchrblok .. "<br><small>length</small> " .. tostring (numutfone)
          strsngchar = string.char (numoct) -- maybe we will need it
          if (numutfone>=2) then
            strchrblok = strchrblok .. "<br><small>extra</small> $" .. lfnuint8tohex (numodt)
            strsngchar = strsngchar .. string.char (numodt)
            if (numutfone>=3) then
              strchrblok = strchrblok .. ",$" .. lfnuint8tohex (numoet)
              strsngchar = strsngchar .. string.char (numoet)
            end--if
            if (numutfone==4) then
              strchrblok = strchrblok .. ",$" .. lfnuint8tohex (numoft)
              strsngchar = strsngchar .. string.char (numoft)
            end--if
            strchrblok = strchrblok .. "<br><small>codepoint</small> U+$" .. lfuint32tohex (numdecode)
            strchrblok = strchrblok .. "<br><small>dec</small> #" .. lfnumtodecbun(numdecode)
          end--if (numutfone>=2) then
          if (strchname~='') then
            strchrblok = strchrblok .. "<br>" .. strchname -- known by name
          else
            strchrblok = strchrblok .. "<br>" .. constrbkg3 -- begin char background
            if (numutfone==1) then
              strchrblok = strchrblok .. "&#" .. tostring (numoct) .. ";" -- dec-encode, "strsngchar" is not used here
            else
              strchrblok = strchrblok .. strsngchar -- let wiki software & browser bother
            end--if
            strchrblok = strchrblok .. constrbkg4 -- close char background
          end--if
        end--if (numutfone==0) else

        strchrblok = strchrblok .. constrtabu5 -- close table

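        if (numutfone==0) then
          numindx = numindx + 1 -- invalid sequence, skip one octet to guarantee progress
        else
          numindx = numindx + numutfone -- ZERO-based index
        end--if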
        numchrlen = numchrlen + 1 -- invalid char:s do count too, the big prey
        if (boobigbox) then
          strbunch = strbunch .. strchrblok -- later use or discard
        end--if

      end--while

      if (boobigbox) then -- else just discard it ;-)
        if (boopendlf) then
          strret = strret .. "<br>"
        end--if
        strret = strret .. strbunch
        boopendlf = true
      end--if

    end--do scope

  end--if ((not boocrap) and (boobigbox or booutfblo)) then

  ---- HARD NOWIKI ----

  -- incoming "strinctx" and "numinctx"

  -- boohardnw "true" from "1" "2" "3" -- do "hard nowiki"
  -- boohnwcol "true" from "2" "3" -- requested color
  -- boohnwspt "true" from "3" only -- split UTF8

  -- restrict the width to 100 char:s (HTML parser breaks on spaces and some
  -- other chars, but unreasonably long words cause trouble, we break at 100)

  if ((not boocrap) and boohardnw) then

    if (boopendlf) then
      strret = strret .. "<br>"
    end--if
    strret = strret .. "<big>" .. lfiultencode (strinctx,100,boohnwcol,boohnwspt) .. "</big>"
    boopendlf = true

  end--if

  ---- UTF8 BLOAT ----

  -- incoming "numchrlen" cannot be ZERO if "booutfblo" is "true"

  if ((not boocrap) and booutfblo) then

    if (boopendlf) then
      strret = strret .. "<br>" -- the last one, "boopendlf" not needed below
    end--if
    strmytemp = "number of UTF8<br>char:s : " .. lfnumtodecbun(numchrlen)
    strret = strret .. constrtabu3 .. contabwar8na [8] .. constrtabu4 .. strmytemp .. constrtabu5

  end--if

  ---- RETURN THE JUNK STRING ----

  return strret

end--function

  ---- RETURN THE JUNK LUA TABLE ----

return exporttable