This was supposed to be a comment on Nick Pelling’s blog ciphermysteries, but it turned out too lengthy, so I decided to turn it into a blog post. It is of the “all out” type.
CHARACTERS VS. GLYPHS VS. HIEROGLYPHS
I am wondering about the use of the term “glyph” in the Voynich world. I suspect it is understood slightly differently than in Unicode terminology, i.e. “characters vs. glyphs”. More as in “hiero-glyphs”, I’m afraid.
I think what is “bad” about EVA is that it partially tries to transport meaning in shapes instead of code points, and that it encodes in (extended) ASCII instead of Unicode.
If we look at the decipherment of the Copiale cipher, the simple complication of “tokenising” (substituting) the German bi- and trigraphs “ch”, “tz”, “sch”, and doubles such as “ll”, each with its own homophone set, disrupts statistical language guessing and more.
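To make the effect concrete, here is a minimal sketch of how substituting multigraphs changes the letter statistics. The placeholder symbols, the table and the sample words are my own assumptions for illustration; they are not the Copiale homophones.

```python
from collections import Counter

# Hypothetical substitution table (illustration only): each German
# multigraph is replaced by one placeholder symbol.
MULTIGRAPHS = {"sch": "§", "ch": "¢", "tz": "†", "ll": "‡"}

def tokenise(text):
    """Greedy longest-match split into single letters and multigraph tokens."""
    keys = sorted(MULTIGRAPHS, key=len, reverse=True)  # try "sch" before "ch"
    tokens = []
    i = 0
    while i < len(text):
        for k in keys:
            if text.startswith(k, i):
                tokens.append(MULTIGRAPHS[k])
                i += len(k)
                break
        else:
            # no multigraph starts here, keep the single character
            tokens.append(text[i])
            i += 1
    return tokens

sample = "schatz licht schnell"
plain_counts = Counter(sample.replace(" ", ""))
token_counts = Counter(t for t in tokenise(sample) if t != " ")

print(plain_counts.most_common(3))
print(token_counts.most_common(3))
```

In the plain count “c”, “h” and “l” lead the table; after substitution “c” and “h” vanish completely and a new symbol leads instead. A frequency-based language guess sees a different alphabet, even though nothing but the segmentation has changed.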
Now EVA does exactly that when it introduces special characters which could be mapped otherwise in Unicode (as diacritics, dialectal variants, Tironian notes, Latin and scribal abbreviations, phonotactics [variable spacing] etc.) or be expressed as ligatures. Much worse, the mapping is to high ASCII characters instead of the Unicode Private Use Area. So, for example, we throw an unofficial code point 163 (£, the English pound sign) at the stats where it would eventually need a U+2184, ↄ Latin Small Letter Reversed C, etc. The neglect of ligatures, abbreviations and spacing does its deed.
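A small sketch of why the choice of code point matters to software, using only Python’s standard unicodedata module. The one-entry remapping table is purely hypothetical and not a proposal for an actual EVA-to-Unicode mapping.

```python
import unicodedata

# Hypothetical remapping table, for illustration only: one high-ASCII
# transcription symbol mapped to a Unicode letter.
REMAP = {"\u00a3": "\u2184"}  # £ -> ↄ

def describe(ch):
    """Return code point, general category and character name."""
    return (f"U+{ord(ch):04X}",
            unicodedata.category(ch),
            unicodedata.name(ch, "<unnamed>"))

def remap(text):
    """Replace every character found in REMAP, leave the rest untouched."""
    return "".join(REMAP.get(ch, ch) for ch in text)

pound = describe("\u00a3")       # category Sc: a currency symbol
reversed_c = describe("\u2184")  # category Ll: a lowercase letter
private = describe("\ue000")     # category Co: Private Use Area

print(pound, reversed_c, private)
print(remap("a\u00a3b"))
```

Any tool that classifies by general category (word splitting, letter counting, case folding) will treat the pound sign as a currency symbol and not as a letter, while the reversed c counts as a letter and a Private Use code point at least carries no false meaning.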
Otherwise little-regarded features of medieval crypto techniques of linguistic obfuscation, for example code switching (using Latin, vernacular French, Italian etc. intermittently), and the layering of these “weak” methods certainly pose an obstacle. Some statistical observations, however, are not impeded by the uncertainty of existing transcriptions, such as the fundamental note that there are no capitalizations in the, sorry for the pun, “so-called Voynichese script”.
The Voynichese character set is not an “in situ” creation, meaning it was not invented out of the blue, as this is an almost impossible task for reasons I cannot outline here.
“The script uses many ligatures and has many unique scribal abbreviations, along with many borrowings from Tironian notes” would describe it rather well, while this quote is nicked from the Wikipedia article about Insular Script.
Prescribed practice hasn’t been tried so far. The most difficult problem of “no language, no alphabet” could be digitally tackled with a graphemic transliteration table first, followed by an allographic analysis (comparing the mean distribution of possible variants), encoding spatial positions on a character level, encoding emanation types (e.g. inking density for defining the writing order), ambiguities etc. Of course encoding the imagery, marginalia, physical properties etc. would be part of the VMS ontology, no matter if in TEI P5 or standoff-property style.
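To give a rough idea of what “graphemic first, allographic second” could look like as data, here is a minimal sketch. All field names, labels and sample records are invented for illustration; a real encoding would follow whatever schema is agreed on, and the relative-frequency count only stands in for a proper allographic analysis.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List

@dataclass
class Token:
    folio: str        # hypothetical locator, e.g. "f1r"
    line: int         # line number on the page
    index: int        # position within the line
    grapheme: str     # abstract unit from the transliteration table
    allograph: str    # observed shape variant of that grapheme
    alternatives: List[str] = field(default_factory=list)  # ambiguous readings

def allograph_distribution(tokens):
    """Relative frequency of each allograph, grouped by grapheme."""
    counts = {}
    for t in tokens:
        counts.setdefault(t.grapheme, Counter())[t.allograph] += 1
    return {
        g: {a: n / sum(c.values()) for a, n in c.items()}
        for g, c in counts.items()
    }

# Made-up sample records; "G1" and "G2" are placeholder grapheme labels.
tokens = [
    Token("f1r", 1, 0, "G1", "a"),
    Token("f1r", 1, 1, "G1", "a"),
    Token("f1r", 1, 2, "G1", "b", alternatives=["G2"]),
    Token("f1r", 2, 0, "G2", "a"),
]
distribution = allograph_distribution(tokens)
print(distribution)
```

The point of the separate fields is that position, variant and ambiguity stay queryable on their own level instead of being baked into a single transcription character.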
Multi-level is the keyword, but I realise this is getting much too lengthy for now while not even beginning to outline the task completely. It means tons of work. I would like to avoid a certain proverb I find ghastly, but it is true:
A lot of bathtubs will have to be unplugged.