Modulo:mutf8debug
--[===[
MODULE "MUTF8DEBUG" (debug UTF8 text)
"eo.wiktionary.org/wiki/Modulo:mutf8debug" <!--2020-Dec-25-->
"id.wiktionary.org/wiki/Modul:mutf8debug"
Purpose: allows debugging an incoming UTF8 string (literally submitted or
generated by a template) by splitting it into isolated chars,
checking validity of the UTF8 stream and displaying chars and codes,
or by performing a "hard nowiki" and displaying all text including
spaces
Utilo: ebligas sencimigi enirantan UTF8 signocxenon (lauxlitere enigitan aux
generitan far sxablono) per dispecigo farigxante apartaj signoj ...
Manfaat: memungkinkan ...
Syfte: moejliggoer att debugga en inkommande UTF8 straeng (oevergiven ...
Used by templates / Uzata far sxablonoj:
- "debu" (for debugging, see below)
Required submodules / Bezonataj submoduloj:
- none / neniuj
This module can accept parameters whether sent to itself (own frame) or
to the caller (caller's frame). If there is a parameter "caller=true"
on the own frame then that own frame is discarded in favor of the
caller's one.
Incoming: - one anonymous obligatory parameter
            - input string (empty is legal but not very useful, 64 KiB max)
          - one anonymous optional parameter
            - output type selection (4 digits, bool or fourstate)
              - octet bloat ("0" or "1")
              - big boxes for single char:s ("0" or "1")
              - hard nowiki ("0", "1" (no colour), "2" (coloured)
                or "3" (coloured and split UTF8))
              - UTF8 char bloat ("0" or "1")
            - default is "1101", "0000" is prohibited, "nw" is synonymous
              with "0010", empty main input switches the type to "1000"
Returned: - large text with complicated wikicode
This module is unbreakable (when called with correct module name
and function name).
Cxi tiu modulo estas nerompebla (kiam vokita kun gxustaj nomo de modulo
kaj nomo de funkcio).
This module is special in that it can seem unused and useless. Do not
delete it just because no pages link to it. Its purpose is not to be linked
from article, lemma, appendix or whatever pages. It is to be used temporarily
when debugging UTF8 text, preferably from the sandbox. With the option
"hard nowiki" it can even be used for documentation and selftest of modules
and templates, and the proxy template "debu" can be considered a
documentation template.
Note that "<nowiki>" does NOT work in wikitext generated by a module. We
must DEC-encode instead. This works for the common problem char:s ":#*='[]"
(there is no problem with "{}"). But DEC-encoding does NOT work for UTF8
multi-octet char:s. So we DEC-encode only some ANSI/ASCII char:s $00...$7F
and let the remaining ones pass unchanged (both for "big boxes" and "hard
nowiki"). Note that DEC-encoding does NOT work for LF either. In the "big
boxes" mode we catch LF separately.
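A minimal sketch of the DEC-encoding rule described above. The helper name
"decencode_sketch" is hypothetical and not part of this module. As a
simplification it encodes every ASCII octet outside 0...9 A...Z a...z
(including SPACE and LF, which the module itself treats separately) and
lets all octets above $7F pass unchanged:

```lua
-- Hypothetical helper (assumption, NOT part of this module):
-- DEC-encode risky ASCII octets, keep UTF8 multi-octet char:s intact.
local function decencode_sketch (strin)
  local tabout = {}
  for numpos = 1, string.len (strin) do
    local numoct = string.byte (strin, numpos, numpos)
    local boosafe = ((numoct >= 48) and (numoct <= 57))   -- 0...9
                 or ((numoct >= 65) and (numoct <= 90))   -- A...Z
                 or ((numoct >= 97) and (numoct <= 122))  -- a...z
    if ((numoct > 127) or boosafe) then
      tabout [#tabout + 1] = string.char (numoct) -- pass unchanged
    else
      tabout [#tabout + 1] = "&#" .. tostring (numoct) .. ";" -- DEC-encode
    end
  end
  return table.concat (tabout)
end
```

For example, the wikilink brackets in "[[a]]" come out as
"&#91;&#91;a&#93;&#93;", so the wiki parser no longer sees a link.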
In text coming from a module some evil stuff (broken encoding, ZERO,
FF/12, ZWSP, LRM, RLM) is replaced by U+$FFFD, whereas other stuff (TAB, CR,
NBSP, BOM) survives.
Color coding of the result in the "big boxes" mode:
  1  white         ordinary ANSI/ASCII char
  2  light grey    valid 2-octet UTF8 with some exceptions
  3  grey          valid 3-octet UTF8 with some exceptions
  4  dark grey     valid 4-octet UTF8 (with no exceptions yet)
  5  red           code ZERO or invalid UTF8 sequence or empty main input
  6  yellow        dubious TAB CR NBSP ZWSP LRM RLM BOM
  7  light yellow  invisible LF SPACE
  8  light blue    initial (except empty main input) and final UTF8 bloat report
Error <<FATAL in "mutf8debug" : internal error or missing or invalid
parameter>> is NOT included in the above list.
Some interesting UTF8 codepoints:
codepoint HEX   codepoint DEC   UTF-8 encoding   visible name   silly notes
-------- ---------- ----------------------- --------- ----------------------
$0000 #00'000 ZERO
$0009 #00'009 TAB
$000A #00'010 LF
$000D #00'013 CR
$0020 #00'032 SPACE
$007F #00'127 inclusive end of 1-oct
$0080 #00'128 $C2,$80 begin of 2-oct
$00A0 #00'160 $C2,$A0 NBSP (don't break me)
$00BF #00'191 $C2,$BF inclusive end of $C2,xx
$00C0 #00'192 $C3,$80 begin of $C3,xx
$00FF #00'255 $C3,$BF inclusive end of $C3,xx
$0100 #00'256 $C4,$80 begin of $C4,xx
$034F #00'847 COMBINING GRAPHEME JOINER
$0401 #01'025 $D0,$81 CCCP case delta $50
$0451 #01'105 $D1,$91 CCCP case delta $50
$07FF #02'047 $DF,$BF inclusive end of 2-oct
$0800 #02'048 $E0,$80,$80 begin of 3-oct
$200B #08'203 $E2,$80,$8B ZWSP ZERO WIDTH SPACE
$200C #08'204 $E2,$80,$8C ZWNJ ZERO WIDTH NON-JOINER
$200D #08'205 $E2,$80,$8D ZWJ ZERO WIDTH JOINER
$200E #08'206 $E2,$80,$8E LRM LEFT-TO-RIGHT MARK
$200F #08'207 $E2,$80,$8F RLM RIGHT-TO-LEFT MARK
$2060 #08'288 $E2,$81,$A0 (absurd "WORD JOINER")
$2068 #08'296 $E2,$81,$A8 FSI FIRST STRONG ISOLATE
$20AC #08'364 $E2,$82,$AC EURO (bank robbery)
$D800 #55'296 begin of banned range
$DFFF #57'343 inclusive end of banned range
$E000 #57'344 begin of legal range again
$FEFF #65'279 $EF,$BB,$BF 239,187,191 BOM (absurd "BOM" Sigi)
$FFFD #65'533 $EF,$BF,$BD 239,191,189 REPLACEMENT CHARACTER
$FFFE #65'534 $EF,$BF,$BE 239,191,190 invalid (last 2)
$FFFF #65'535 $EF,$BF,$BF 239,191,191 invalid (last 2), inclusive end of 3-oct
$01'0000 #65'536 $F0,$90,$80,$80 begin of 4-oct
$01'0348 #66'376 $F0,$90,$8D,$88 one of few somewhat known
$0F'FFFF #1'048'575 $F3,$BF,$BF,$BF one Mi almost reached
$10'0000 #1'048'576 $F4,$80,$80,$80 one Mi reached here and no end yet
$10'FFFE #1'114'110 $F4,$8F,$BF,$BE invalid (last 2)
$10'FFFF #1'114'111 $F4,$8F,$BF,$BF invalid (last 2), inclusive end of unicode
$11'0000 #1'114'112 ($F4,$90,$80,$80) invalid (finally out of range)
- UTF8 is defined by "RFC 3629" from 2003-Nov (although it was already in
use before that)
- Absolute unicode range has 17 (seventeen !!!) planes of 65'536 values each,
total 1'114'112, most of them are unused, plane ZERO is somewhat full, other
ones are almost or totally empty, official notation: "U+0000..U+10FFFF"
- Codepoint range ZERO to 31 is valid by RFC but useless, same for 127,
128 to 159, whereas 160 (AKA NBSP) is maybe useful
- Range "U+D800" to "U+DFFF" is invalid by RFC
- UTF8 starting octet can be only $C2 to $DF , $E0 to $EF , $F0 to $F4
giving a continuous range from $C2 to $F4 of size $33 = #51 values
- UTF8 subsequent octet:s (1 or 2 or 3) can be only $80 to $BF
(6 bit:s, 64 possible values)
- The octet values $C0, $C1 and $F5 to $FF may never appear in a UTF8 file
Abs. char number range | UTF8 octet sequence | beginning octet
(hexadecimal) | (binary) |
-----------------------+--------------------------------+------------------
0000'0000 to 0000'007F | 0xxxxxxx | $00 to $7F
0000'0080 to 0000'07FF | 110xxxxx 10xxxxxx | $C0 -> $C2 to $DF
0000'0800 to 0000'FFFF | 1110xxxx 10xxxxxx 10xxxxxx | $E0 to $EF
0001'0000 to 0010'FFFF | 11110xxx 10xxxxxx 10xxxxxx ... | $F0 to $F7 -> $F4
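
The bit layout in the table above can be checked with a small sketch that
encodes a codepoint into its UTF8 octet:s. The helper name
"utf8encode_sketch" is hypothetical and not part of this module. It rejects
only values out of range and the banned range $D800...$DFFF, it does NOT
reject the "invalid (last 2)" codepoints listed further above:

```lua
-- Hypothetical helper (assumption, NOT part of this module):
-- encode one codepoint, return a table of 1...4 octet values or nil.
local function utf8encode_sketch (numcp)
  if ((numcp < 0) or (numcp > 1114111)) then
    return nil -- out of range (above U+10FFFF)
  end
  if ((numcp >= 55296) and (numcp <= 57343)) then
    return nil -- banned range $D800...$DFFF
  end
  if (numcp < 128) then
    return {numcp} -- 0xxxxxxx
  end
  if (numcp < 2048) then -- 110xxxxx 10xxxxxx
    return {192 + math.floor (numcp / 64), 128 + numcp % 64}
  end
  if (numcp < 65536) then -- 1110xxxx 10xxxxxx 10xxxxxx
    return {224 + math.floor (numcp / 4096),
            128 + math.floor (numcp / 64) % 64,
            128 + numcp % 64}
  end
  -- 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  return {240 + math.floor (numcp / 262144),
          128 + math.floor (numcp / 4096) % 64,
          128 + math.floor (numcp / 64) % 64,
          128 + numcp % 64}
end
```

With codepoint $20AC (EURO) this gives 226,130,172 ie $E2,$82,$AC, in
agreement with the list of interesting codepoints.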
]===]
local utf8debug = {}
------------------------------------------------------------------------
---- CONSTANTS ----
------------------------------------------------------------------------
-- constant strings (error circumfixes)
local constrkros = ' # # ' -- lagom -> huge circumfix
local constrelabg = '<span class="error"><b>' -- lagom whining begin
local constrelaen = '</b></span>' -- lagom whining end
-- HTML stuff for our tiny table and background around every char
local constrtabu3 = '<table style="display:inline-block; vertical-align:middle; margin:0.15em; padding:0.15em; border:0.15em solid #000000; text-align:center; background-color:#' -- color code still missing, plus the 3 char:s ';">' needed to close the element
local constrtabu4 = ';"><tr><td>'
local constrtabu5 = '</td></tr></table>'
local constrbkg3 = '<span style="font-size:160%;background-color:#E0A0FF;"> '
local constrbkg4 = ' </span>'
local contabwarna = {}
contabwarna = {'FFFFFF','E8E8E8','D0D0D0','B8B8B8','FF6060','FFFF60','FFFFB0','C8C8FF'} -- (index 1...8)
-- constant strings EN vs EO vs ID vs SV
-- local constrkosong = 'empty string submitted' -- EN
local constrkosong = 'malplena signocxeno transdonita' -- EO
-- local constrkosong = 'string datang bersifat kosong' -- ID
-- local constrkosong = 'inkommen string aer tom' -- SV
-- local constrinvalid = 'invalid code sequence' -- EN
local constrinvalid = 'nevalida sekvo de kodoj' -- EO
-- local constrinvalid = 'rantai kode bersifat invalid' -- ID
-- local constrinvalid = 'ogiltig kodsekvens' -- SV
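-- "invalid optional parameter": currently unused (the FATAL message is
-- hardcoded), renamed so that it does not shadow "constrinvalid" above
-- local constrbadopt = 'invalid optional parameter' -- EN
local constrbadopt = 'nevalida opcia parametro' -- EO
-- local constrbadopt = 'parameter opsional bersifat invalid' -- ID
-- local constrbadopt = 'ogiltig optional parameter' -- SV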
------------------------------------------------------------------------
---- ORDINARY LOCAL MATH FUNCTIONS ----
------------------------------------------------------------------------
-- Local function MATHDIV
local function mathdiv (xdividend, xdivisor)
local resultdiv = 0 -- Lua has no integer DIV operator :-(
resultdiv = math.floor (xdividend / xdivisor)
return resultdiv
end--function mathdiv
-- Local function MATHMOD
local function mathmod (xdividendo, xdivisoro)
local resultmod = 0 -- MOD operator is "%", but a bitwise AND operator is lacking too
resultmod = xdividendo % xdivisoro
return resultmod
end--function mathmod
------------------------------------------------------------------------
-- Local function MATHXOR
-- This sub depends on "MATH FUNCTIONS"\"mathdiv"
-- and "MATH FUNCTIONS"\"mathmod".
local function mathxor (xa, xb)
local resultxor = 0
local crap6 = 0
local crap7 = 0
local crap8 = 1 -- single bit value 1 -> 2 -> 4 -> 8 ...
while (true) do
if ((xa==0) and (xb==0)) then
break
end--if
crap6 = mathmod (xa,2) -- seize remainder before dividing
crap7 = mathmod (xb,2) -- seize remainder before dividing
xa = mathdiv (xa,2)
xb = mathdiv (xb,2)
if (crap6~=crap7) then
resultxor = resultxor + crap8
end--if
crap8 = crap8 * 2
end--while
return resultxor
end--function mathxor
------------------------------------------------------------------------
---- ORDINARY LOCAL STRING FUNCTIONS ----
------------------------------------------------------------------------
-- test whether char is an ASCII digit "0"..."9", return bool
local function lftestnum (numkaad)
local boodigit = false
boodigit = ((numkaad>=48) and (numkaad<=57))
return boodigit
end--function lftestnum
------------------------------------------------------------------------
-- test whether char is an ASCII uppercase letter, return bool
local function lftestuc (numkode)
local booupperc = false
booupperc = ((numkode>=65) and (numkode<=90))
return booupperc
end--function lftestuc
------------------------------------------------------------------------
-- test whether char is an ASCII lowercase letter, return bool
local function lftestlc (numcode)
local boolowerc = false
boolowerc = ((numcode>=97) and (numcode<=122))
return boolowerc
end--function lftestlc
------------------------------------------------------------------------
-- Local function LFIS62SAFE
-- Test whether incoming ASCII char is very safe (0...9 A...Z a...z).
-- This sub depends on "STRING FUNCTIONS"\"lftestnum" and
-- "STRING FUNCTIONS"\"lftestuc" and "STRING FUNCTIONS"\"lftestlc".
local function lfis62safe (numcxair)
local booguud = false
booguud = lftestnum (numcxair) or lftestuc (numcxair) or lftestlc (numcxair)
return booguud
end-- function lfis62safe
------------------------------------------------------------------------
---- ORDINARY LOCAL CONVERSION FUNCTIONS ----
------------------------------------------------------------------------
-- Local function LFDEC1DIGLM
-- Convert 1 decimal ASCII digit to UINT8 with inclusive upper limit.
-- Use this for single-digit conversions with range and for pseudo-bool
-- (0,1) and for genuine bool (false,true) via "boosplitit=(numcrap==1)".
local function lfdec1diglm (num1dygyt, num1lim)
num1dygyt = num1dygyt - 48 -- may become invalid ie negative
if ((num1dygyt<0) or (num1dygyt>num1lim)) then
num1dygyt = 255
end--if
return num1dygyt
end--function lfdec1diglm
------------------------------------------------------------------------
-- Local function LFUINT8TOHEX
-- Convert UINT8 (0...255) to 2-digit hex
-- This sub depends on "MATH FUNCTIONS"\"mathdiv"
-- and "MATH FUNCTIONS"\"mathmod".
local function lfuint8tohex (numinclow)
local strheksulo = ''
local numhajhaj = 0
numhajhaj = mathdiv (numinclow,16)
numinclow = mathmod (numinclow,16)
if (numhajhaj>9) then
numhajhaj = numhajhaj + 7 -- now 0...9 or 17...22
end--if
if (numinclow>9) then
numinclow = numinclow + 7 -- now 0...9 or 17...22
end--if
strheksulo = string.char (numhajhaj+48) .. string.char (numinclow+48)
return strheksulo
end--function lfuint8tohex
------------------------------------------------------------------------
-- Local function LFUINT32TOHEX
-- Convert UINT32 (0 ... $FFFF'FFFF = #4'294'967'295) to
-- (2 or 4 or 6 or 8)-digit hex
-- We need crass functions "mathdiv" and "mathmod"
local function lfuint32tohex (numincom)
local strheksulego = ''
while (true) do
strheksulego = lfuint8tohex ( mathmod (numincom,256) ) .. strheksulego
numincom = mathdiv (numincom,256)
if (numincom==0) then
break
end--if
end--while
return strheksulego
end--function lfuint32tohex
------------------------------------------------------------------------
-- Local function LFHEXDEC
-- Example output : "$FE=#254" (kept compact in order to save text width)
-- Depends on "lfuint8tohex"
local function lfhexdec (numkodo)
local strrezulto = ''
strrezulto = "$" .. lfuint8tohex (numkodo) .. "=#" .. tostring (numkodo)
return strrezulto
end--function lfhexdec
------------------------------------------------------------------------
-- Local function LFBUNCH
-- Add digit bunching to raw decimal number string
local function lfbunch (strnomorin)
local strnomorut = ""
local numlenn = 0
local numindeex = 0 -- ZERO-based counts up
local numcaar = 0 -- char of string
numlenn = string.len(strnomorin)
while (true) do
if (numindeex==numlenn) then
break
end--if
numcaar = string.byte(strnomorin,(numlenn-numindeex),(numlenn-numindeex))
if ((mathmod(numindeex,3)==0) and (numindeex~=0)) then
strnomorut = "'" .. strnomorut -- apo
end--if
strnomorut = string.char(numcaar) .. strnomorut
numindeex = numindeex + 1 -- index counts up but we go back
end--while
return strnomorut
end--function lfbunch
------------------------------------------------------------------------
---- ORDINARY LOCAL UTF8 FUNCTIONS ----
------------------------------------------------------------------------
-- Local function LFUTF8LENGTH
-- Measure length of a single UTF8 char, return ZERO if invalid
-- Does NOT thoroughly check the validity, looks at 1 octet only
-- Input : - numbgoctet (beginning octet of a UTF8 char)
-- Output : - numlen1234x (1...4 or ZERO if invalid)
local function lfutf8length (numbgoctet)
local numlen1234x = 0
if (numbgoctet<128) then
numlen1234x = 1 -- $00...$7F -- ANSI/ASCII
end--if
if ((numbgoctet>=194) and (numbgoctet<=223)) then
numlen1234x = 2 -- $C2 to $DF
end--if
if ((numbgoctet>=224) and (numbgoctet<=239)) then
numlen1234x = 3 -- $E0 to $EF
end--if
if ((numbgoctet>=240) and (numbgoctet<=244)) then
numlen1234x = 4 -- $F0 to $F4
end--if
return numlen1234x
end--function lfutf8length
------------------------------------------------------------------------
-- Local function LFUTF8DEKO
-- Decode a UTF8 char, return ZERO length if invalid
-- Result is a table: [0] length and [1] codepoint
-- This sub depends on "MATH FUNCTIONS"\"mathxor".
local function lfutf8deko (num0, num1, num2, num3)
local tabresult = {}
local numlength = 0 -- preassume invalid
local numkodepoin = 0 -- preassume invalid
num1 = mathxor (num1,128) -- XOR 3 of 4
num2 = mathxor (num2,128) -- XOR 3 of 4
num3 = mathxor (num3,128) -- XOR 3 of 4
while (true) do -- fake loop
if ((num0>193) and (num1>63)) then
break -- to join mark
end--if
if ((num0>223) and (num2>63)) then
break -- to join mark
end--if
if ((num0>239) and (num3>63)) then
break -- to join mark
end--if
if (num0<128) then -- ZERO to $7F
numkodepoin = num0
numlength = 1
break -- to join mark
end--if
if ((num0>193) and (num0<224)) then -- $C0 # $C2 to $DF
numkodepoin = (mathxor(num0,192)) * 64 + num1
if ((numkodepoin>127) and (numkodepoin<2048)) then
numlength = 2
end--if
break -- to join mark
end--if
if ((num0>223) and (num0<240)) then -- $E0 to $EF
numkodepoin = (mathxor(num0,224)) * 4096 + num1 * 64 + num2
if (((numkodepoin>2047) and (numkodepoin<55296)) or ((numkodepoin>57343) and (numkodepoin<65536))) then
numlength = 3
end--if
break -- to join mark
end--if
if ((num0>239) and (num0<245)) then -- $F0 to $F7 # $F4
numkodepoin = (mathxor(num0,240)) * 262144 + num1 * 4096 + num2 * 64 + num3
if ((numkodepoin>65535) and (numkodepoin<1114112)) then
numlength = 4
end--if
break -- to join mark
end--if
break -- finally to join mark
end--while -- fake loop -- join mark
tabresult [0] = numlength
tabresult [1] = numkodepoin
return tabresult
end--function lfutf8deko
------------------------------------------------------------------------
-- Local function LFULTENCODE
-- Our cool module has brewed something with "[["..."]]" and repeated
-- spaces but we want to see plain text for debugging purposes. Thus we
-- dec-encode it, use NBSP to fix spaces and maybe add colour.
-- This helps with "[["..."]]", "["..."]", "*", "#", ":" and with
-- multiple spaces, which are no longer reduced to a single one. Note that
-- there is no problem with "{{"..."}}".
-- We must be UTF8-aware. A UTF8 char must be either split in a controlled
-- way, or completely preserved ie not split or encoded at all.
-- Note that this causes BLOAT. The caller is responsible for
-- adding "<big>"..."</big>" if desired.
-- Input : * "strkrampuj" : string, empty tolerable, but "nil" type is NOT
-- * "boowarrna" : color enable
-- * "boosplit" : split UTF8 char:s into hex numbers
-- Output : * "strkood" : string, empty in worst case
-- This sub depends on "MATH FUNCTIONS"\"mathdiv"
-- and "MATH FUNCTIONS"\"mathmod" and "CONVERSION FUNCTIONS"\"lfuint8tohex"
-- and "STRING FUNCTIONS"\"lfis62safe" and "UTF8 FUNCTIONS"\"lfutf8length".
local function lfultencode (strkrampuj,boowarrna,boosplit)
local stronechar = ""
local strkolorr = ""
local strkood = ""
local numstrlne = 0
local numpeekynx = 1 -- ONE-based index
local numcahr = 0
local numutf8len = 0
local numcolour = 0 -- 0,1,2,3 -- red, light green, blue, light grey
local boonbsp = false
local boosplnow = false
numstrlne = string.len (strkrampuj)
while (true) do -- genuine loop
if (numpeekynx>numstrlne) then
break
end--if
numcahr = string.byte (strkrampuj,numpeekynx,numpeekynx)
numpeekynx = numpeekynx + 1 -- ONE-based index
while (true) do -- fake loop
if (numcahr==32) then
if (boonbsp) then
stronechar = "&#160;" -- NBSP, this prevents space reduction
else
stronechar = " "
end--if
boonbsp = not boonbsp
break -- to join mark
end--if
if (numcahr>127) then
boosplnow = boosplit
numutf8len = lfutf8length (numcahr)
if (numutf8len==0) then
boosplnow = true -- forced split for broken UTF8 sequence
else
numutf8len = numutf8len - 1 -- more char:s to pick
end--if
if ((numpeekynx+numutf8len)>(numstrlne+1)) then
boosplnow = true -- forced split for broken UTF8 sequence
end--if
if (boosplnow) then
stronechar = "{$" .. lfuint8tohex (numcahr) .. "}"
else
stronechar = string.char (numcahr)
while (true) do
if (numutf8len==0) then
break
end--if
numcahr = string.byte (strkrampuj,numpeekynx,numpeekynx)
numpeekynx = numpeekynx + 1
numutf8len = numutf8len - 1
stronechar = stronechar .. string.char (numcahr)
end--while
end--if
break -- to join mark
end--if
if (lfis62safe(numcahr)) then -- safe ASCII ie 0...9 A...Z a...z
stronechar = string.char (numcahr) -- do NOT encode safe char:s
break -- to join mark
end--if
stronechar = "&#" .. tostring (numcahr) .. ";"
break -- finally to join mark
end--while -- fake loop -- join mark
if (boowarrna) then
if (numcolour==0) then
strkolorr = "FFA0A0" -- red
end--if
if (numcolour==1) then
strkolorr = "D0FFD0" -- light green
end--if
if (numcolour==2) then
strkolorr = "A0A0FF" -- blue
end--if
if (numcolour==3) then
strkolorr = "D0D0D0" -- light grey
end--if
numcolour = mathmod ((numcolour+1),4)
strkood = strkood .. '<span style="background-color:#' .. strkolorr .. ';">' .. stronechar .. '</span>'
else
strkood = strkood .. stronechar
end--if
end--while
return strkood
end--function lfultencode
------------------------------------------------------------------------
---- ORDINARY LOCAL HIGH LEVEL FUNCTIONS ----
------------------------------------------------------------------------
-- Local function LFWARNA
-- Convert integer 1...8 (must be valid) to 6 digits hex color.
-- fill the gap between "constrtabu3" and "constrtabu4" always with help of
-- this sub, do NOT put hardcoded color values there
-- we use "contabwarna"
-- 1 white default, 2...4 grey getting darker,
-- 5 red, 6 yellow, 7 light yellow, 8 light blue
local function lfwarna (indexofcolor)
local strfaerg = ''
strfaerg = contabwarna [indexofcolor]
return strfaerg
end--function lfwarna
------------------------------------------------------------------------
-- Local function LFIGATEYELLOW
-- Detect TAB CR NBSP ZWSP LRM RLM BOM -- "yellow class error"
-- ZERO is "red class error" and not included here
local function lfigateyellow (numcodepoint)
local strnamev = ''
if (numcodepoint== 9) then
strnamev = 'TAB'
end--if
if (numcodepoint== 13) then
strnamev = 'CR'
end--if
if (numcodepoint== 160) then
strnamev = 'NBSP'
end--if
if (numcodepoint== 8203) then
strnamev = 'ZWSP'
end--if
if (numcodepoint== 8206) then
strnamev = 'LRM'
end--if
if (numcodepoint== 8207) then
strnamev = 'RLM'
end--if
if (numcodepoint==65279) then
strnamev = 'BOM'
end--if
return strnamev
end--function lfigateyellow
------------------------------------------------------------------------
-- Local function LFIGATESPECIAL
-- Detect LF SPACE -- "light yellow class char"
local function lfigatespecial (numcoodepoint)
local strnme = ''
if (numcoodepoint== 10) then
strnme = 'LF'
end--if
if (numcoodepoint== 32) then
strnme = 'SPACE'
end--if
return strnme
end--function lfigatespecial
------------------------------------------------------------------------
---- MAIN EXPORTED FUNCTION ----
------------------------------------------------------------------------
function utf8debug.ek (arxframent)
-- general unknown type
local vartmp = 0 -- variable without type
-- special type "args" AKA "arx"
local arxsomons = 0 -- metaized "args" from our own or caller's "frame"
-- general "tab"
local tabutf8dec = {}
-- general "str"
local strinc = "" -- incoming text
local strctrl = "" -- from optional parameter
local strnamevil = "" -- name of a bad char, for example "CR" "ZWSP"
local strnamechr = "" -- name of an invisible char, for example "LF" "SPACE"
local strsngchar = "" -- one char with "span" background
local strchrblok = "" -- prebrewed block with table for one char
local strret = "" -- final output string
-- general "num"
local numlongtx = 0 -- length of incoming parameter
local numlung = 0 -- temp
local numindx = 0
local numreserv = 0
local numwarna = 0
local numoct = 0 -- temp some char
local numodt = 0 -- temp some char
local numoet = 0 -- temp some char
local numoft = 0 -- temp some char
local numutflen = 0
local numdecode = 0 -- decoded "codepoint" value
local numchrlen = 0 -- number of UTF8 char:s
-- general "boo"
local boocrap = false
local boooktblo = false
local boobigbox = false -- big boxes
local boohardnw = false -- "true" from "1" or "2" or "3"
local boohnwcol = false -- "true" from "2" or "3" only
local boohnwspt = false -- "true" from "3" only
local booutfblo = false -- report UTF8 char bloat
---- GUARD AGAINST INTERNAL ERROR ----
-- "constrkosong" and "constrinvalid" must be uncommented and assigned
-- note that reporting of this error must NOT depend on uncommentable strings
if ((type(constrkosong)~="string") or (type(constrinvalid)~="string")) then
boocrap = true
end--if
---- GET THE ARX (ONE OF TWO) ----
if (boocrap==false) then
arxsomons = arxframent.args -- "args" from our own "frame"
vartmp = arxsomons ["caller"]
if (vartmp=="true") then
arxsomons = arxframent:getParent().args -- "args" from caller's "frame"
end--if
end--if
---- SEIZE 1 OBLIGATORY ANONYMOUS PARAMETER ----
-- on success assign "strinc" and "numlongtx" (not to be touched later)
if (boocrap==false) then
vartmp = arxsomons [1]
if (type(vartmp)~="string") then
boocrap = true -- empty string is legal, missing parameter is NOT legal
else
numlongtx = string.len (vartmp)
if (numlongtx>65536) then
boocrap = true -- this causes bloat, we can never encode such a big string
else
strinc = vartmp
end--if
end--if (type(vartmp)~="string") else
end--if
---- SEIZE AND PRECHECK 1 OPTIONAL ANONYMOUS PARAMETER ----
-- default is "1101", "0000" is prohibited, "nw" is synonymous
-- with "0010", empty main input switches the type to "1000"
if (boocrap==false) then
strctrl = "1101"
vartmp = arxsomons [2]
if (type(vartmp)=="string") then
if (vartmp=="0000") then
vartmp = "-" -- invalid
end--if
if (vartmp=="nw") then
vartmp = "0010"
end--if
numlung = string.len (vartmp)
if (numlung~=4) then
boocrap = true
else
strctrl = vartmp
end--if
end--if (type(vartmp)=="string") then
if (numlongtx==0) then
strctrl = "1000" -- empty main input switches the type to "1000"
end--if
end--if
---- PROCESS OPTIONAL PARAMETER ----
if (boocrap==false) then
while (true) do -- fake loop
numoft = lfdec1diglm(string.byte(strctrl,1,1),1) -- 255 if invalid
if (numoft==255) then
boocrap = true
break -- to join mark
end--if
boooktblo = (numoft==1) -- octet bloat
numoft = lfdec1diglm(string.byte(strctrl,2,2),1) -- 255 if invalid
if (numoft==255) then
boocrap = true
break -- to join mark
end--if
boobigbox = (numoft==1) -- big boxes
numoft = lfdec1diglm(string.byte(strctrl,3,3),3) -- 255 if invalid
if (numoft==255) then
boocrap = true
break -- to join mark
end--if
boohardnw = (numoft~=0) -- "true" from "1" or "2" or "3"
boohnwcol = (numoft>1) -- "true" from "2" or "3" only
boohnwspt = (numoft==3) -- "true" from "3" only
numoft = lfdec1diglm(string.byte(strctrl,4,4),1) -- 255 if invalid
if (numoft==255) then
boocrap = true
break -- to join mark
end--if
booutfblo = (numoft==1) -- UTF8 char bloat
break -- finally to join mark
end--while -- fake loop -- join mark
end--if
---- WHINE IF YOU MUST ----
-- note that reporting of this error must NOT depend on uncommentable strings
if (boocrap) then
strchrblok = 'FATAL in "mutf8debug" : internal error or missing or invalid parameter'
strret = constrkros .. constrelabg .. strchrblok .. constrelaen .. constrkros
end--if
---- OCTET BLOAT ----
if ((boocrap==false) and boooktblo) then
if (numlongtx==0) then
numwarna = 5 -- red on empty string (only 5 or 8 here)
strsngchar = constrkosong
else
numwarna = 8 -- light blue (only 5 or 8 here)
strsngchar = "number of<br>octet:s : " .. lfbunch (tostring (numlongtx) )
end--if
strret = constrtabu3 .. lfwarna (numwarna) .. constrtabu4 .. strsngchar .. constrtabu5 .. "<br>"
end--if
---- BIG BOXES ----
-- we have "strinc" and "numlongtx"
-- we brew a private table with just one cell for every single char
if ((boocrap==false) and boobigbox) then
numindx = 0 -- counts octet:s
numchrlen = 0 -- counts UTF8 char:s
while (true) do
if (numindx>=numlongtx) then
break
end--if
numreserv = numlongtx - numindx -- at least 1
numoct = string.byte (strinc,(numindx+1),(numindx+1))
numodt = 0
numoet = 0
numoft = 0
if (numreserv>=2) then
numodt = string.byte (strinc,(numindx+2),(numindx+2))
end--if
if (numreserv>=3) then
numoet = string.byte (strinc,(numindx+3),(numindx+3))
end--if
if (numreserv>=4) then
numoft = string.byte (strinc,(numindx+4),(numindx+4))
end--if
tabutf8dec = lfutf8deko (numoct,numodt,numoet,numoft)
numutflen = tabutf8dec [0]
numdecode = tabutf8dec [1]
strnamevil = '' -- preassume, NOT reporting any name -- yellow
strnamechr = '' -- preassume, NOT reporting any name -- light yellow
if (numutflen~=0) then
strnamevil = lfigateyellow (numdecode) -- returns empty string if no hit
strnamechr = lfigatespecial (numdecode) -- returns empty string if no hit
end--if
numwarna = numutflen -- preassume, ZERO to 4, ZERO is invalid
if (strnamevil~='') then
numwarna = 6 -- yellow on TAB CR NBSP ZWSP LRM RLM BOM
end--if
if (strnamechr~='') then
numwarna = 7 -- light yellow on LF SPACE
end--if
if ((numoct==0) or (numutflen==0)) then
numwarna = 5 -- red on code ZERO or invalid sequence, checked last so that the name "ZERO" does not turn it yellow
if (numoct==0) then
strnamevil = "ZERO"
end--if
end--if
strchrblok = constrtabu3 .. lfwarna (numwarna) .. constrtabu4 .. "<small>index</small> " .. lfbunch (tostring (numindx) )
strchrblok = strchrblok .. "<br><small>beg code</small> " .. lfhexdec (numoct)
if (numutflen==0) then
strchrblok = strchrblok .. "<br>" .. constrinvalid -- color already set before
else
strchrblok = strchrblok .. "<br><small>length</small> " .. tostring (numutflen)
strsngchar = string.char (numoct) -- maybe we will need it
if (numutflen>=2) then
strchrblok = strchrblok .. "<br><small>extra</small> $" .. lfuint8tohex (numodt)
strsngchar = strsngchar .. string.char (numodt)
if (numutflen>=3) then
strchrblok = strchrblok .. ",$" .. lfuint8tohex (numoet)
strsngchar = strsngchar .. string.char (numoet)
end--if
if (numutflen==4) then
strchrblok = strchrblok .. ",$" .. lfuint8tohex (numoft)
strsngchar = strsngchar .. string.char (numoft)
end--if
strchrblok = strchrblok .. "<br><small>codepoint</small> U+$" .. lfuint32tohex (numdecode)
strchrblok = strchrblok .. "<br><small>dec</small> #" .. lfbunch (tostring (numdecode) )
end--if (numutflen>=2) then
if (strnamevil~='') then
strchrblok = strchrblok .. "<br>" .. strnamevil -- whine only if reason to
end--if
if (strnamechr~='') then
strchrblok = strchrblok .. "<br>" .. strnamechr -- boast only if reason to
end--if
if ((strnamevil..strnamechr)=='') then
strchrblok = strchrblok .. "<br>" .. constrbkg3 -- begin char background
if (numutflen==1) then
strchrblok = strchrblok .. "&#" .. tostring (numoct) .. ";" -- ignore "strsngchar", DEC-encode instead
else
strchrblok = strchrblok .. strsngchar -- let wiki software & browser bother
end--if
strchrblok = strchrblok .. constrbkg4 -- close char background
end--if
end--if (numutflen==0) else
strchrblok = strchrblok .. constrtabu5 -- close table
numindx = numindx + math.max (numutflen,1) -- ZERO-based index, skip 1 octet on invalid sequence (else endless loop)
numchrlen = numchrlen + 1 -- invalid char:s do count too
strret = strret .. strchrblok
end--while
strret = strret .. "<br>"
end--if ((boocrap==false) and boobigbox) then
---- HARD NOWIKI ----
-- we have "strinc" and "numlongtx"
-- boohardnw "true" from "1" or "2" or "3" -- do "hard nowiki"
-- boohnwcol "true" from "2" or "3" only -- requested colour
-- boohnwspt "true" from "3" only -- requested split of UTF8 char:s
if ((boocrap==false) and boohardnw) then
strret = strret .. "<big>" .. lfultencode (strinc,boohnwcol,boohnwspt) .. "</big><br>"
end--if
---- UTF8 BLOAT ----
-- "numchrlen" is counted in the "BIG BOXES" section only, thus it is
-- still ZERO here if big boxes are off
if ((boocrap==false) and booutfblo) then
strsngchar = "number of UTF8<br>char:s : " .. lfbunch (tostring (numchrlen) )
strret = strret .. constrtabu3 .. lfwarna (8) .. constrtabu4 .. strsngchar .. constrtabu5 .. "<br>"
end--if
---- RETURN THE JUNK STRING ----
return strret
end--function
---- RETURN THE JUNK LUA TABLE ----
return utf8debug