Understanding the Second-Order Entropies of Voynich Text
by Dennis J. Stallings
May 11, 1998
Abstract
    The anomalous second-order entropies of Voynich text are among its most puzzling features. h1-h2, the difference between the conditional first- and second-order entropies, equals H1-h2, the difference between the first-order absolute entropy and the second-order conditional entropy. h1-h2 (or H1-h2) is a theoretically significant number; it denotes the average information carried by the first character in a digraph about the second one. Therefore it was chosen as a simple measure of what is being sought, although the whole entropy profile of the text samples was also considered.
    Tests show that Voynich text does not have its low h2 measures solely because of a repetitious underlying text, that is, one that often repeats the same words and phrases. Tests also show that the low h2 measures are probably not due to an underlying low-entropy natural language. A verbose cipher, one which substitutes several ciphertext characters for one plaintext character, can produce the entropy profile of Voynich text.
Table of Contents
    Introduction
    Measures of Relative Second-Order Entropy
    Entropies of Voynich Texts
    Verbose Ciphers
    Repetitive Texts
    Schizophrenic Language
    Low-Entropy Natural Languages
        Japanese
        Hawaiian
        Discussion of Phonemic versus Syllabic Notation
            The Size of the Character Set
            The Effect of Word Divisions
            Redundancy
            The Effect of Syllable Divisions
        Final Thoughts on Low-Entropy Natural Languages
    Suggestions for Further Work
    Acknowledgments
    References for Electronic Texts
    Printed References
Introduction
    William Ralph Bennett first applied the entropy concept to the study of the Voynich Manuscript in his Scientific and Engineering Problem Solving with the Computer (Englewood Cliffs: Prentice-Hall, 1976). His book has introduced many people to the VMs.
    The repetitive nature of VMs text is obvious to casual examination. Entropy is one possible numerical measure of a text's repetitiousness. The higher the text's repetitiousness, the lower the second-order entropy (information carried in letter pairs). Bennett noted that only some Polynesian languages have second-order entropies as low as VMs text. Typical ciphers do not have a low second-order entropy either.
    This paper examines other possible reasons for the low second- order entropy of Voynich texts: a verbose cipher or a repetitious underlying text. It also examines the low-entropy natural languages Hawaiian and Japanese for further insight into that hypothesis.
Measures of Relative Second-Order Entropy
    Jacques Guy's MONKEY program was used to calculate second-order entropies. (Note: the bug-free, "sensible" MONKEY on the EVMT Project Home Page was used; the author believes that the version of MONKEY on Garbo as of this writing has bugs.) Note that MONKEY in its present form only takes the first 32,000 characters in a file. Some long texts were divided up into portions so that MONKEY could analyze them separately.


    The conditional entropies were used, as is customary on the Voynich e-mail list. Say that H1 is the absolute first-order entropy and H2 is the absolute second-order entropy. Then h1 and h2 are the first- and second-order conditional entropies. h2 = H2-H1, the entropy of a character conditioned on the character before it. h1 = H1, since it depends on only single characters; thus h1 is not really conditional.
    The following measures were considered:

    h0: zero-order entropy (log2 of the number of different characters)
    h1: first-order conditional or absolute entropy
    h2: second-order conditional entropy
    h1-h2: difference between the conditional first- and second-order entropies, which equals
    H1-h2: the difference between the first-order absolute entropy and the second-order conditional entropy.


    As will be seen, there is a need here to compare systems with very different numbers of characters, to scale the statistics somehow to the size of the character set. h1-h2 or H1-h2 is a theoretically significant number; it denotes the average information carried by the first character in a digraph about the second one. It is perhaps the best single, simple measure of what is being sought.
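These measures can be computed directly from character and digraph frequencies. The sketch below is illustrative only, not Guy's MONKEY program itself; the function name and details are the present editor's invention.

```python
# Illustrative sketch (not MONKEY): Shannon entropies from character and
# digraph counts, yielding h0, h1, h2 = H2 - H1, and h1 - h2.
from collections import Counter
from math import log2

def entropy_profile(text):
    """Return (h0, h1, h2, h1 - h2) for a text sample."""
    chars = Counter(text)                    # single-character counts
    digraphs = Counter(zip(text, text[1:]))  # overlapping character pairs
    n1, n2 = sum(chars.values()), sum(digraphs.values())
    # Absolute entropies H1 (characters) and H2 (digraphs).
    H1 = -sum(c / n1 * log2(c / n1) for c in chars.values())
    H2 = -sum(c / n2 * log2(c / n2) for c in digraphs.values())
    h0 = log2(len(chars))  # log2 of the number of distinct characters
    h1 = H1                # first-order conditional = absolute entropy
    h2 = H2 - H1           # second-order conditional entropy
    return h0, h1, h2, h1 - h2
```

For a maximally predictable text such as "ababab...", h2 falls to about zero: once the first character of a digraph is known, the second carries no new information, so h1-h2 approaches h1.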
    The percentage of the maximum absolute second-order entropy might have been used: one could calculate what fraction of the largest possible H2 each alphabet actually delivers. For digraphs over an alphabet of m characters, H2(max) is:

    log2(m^2)

and %H2(max) is:

    (H2 / log2(m^2)) * 100

    However, the H2(max) depends tremendously on m, the size of the character set chosen. For Voynich text, Currier has 36 characters and Basic Frogguy has 23 characters. Characters that are hardly ever used have little effect on h1 and h2, but could make a tremendous difference in H2(max). Therefore, this measure was not used.
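The sensitivity of H2(max) to the alphabet size can be seen in a one-line calculation (an illustrative sketch; the character counts are those quoted above for Currier and Basic Frogguy):

```python
# Why %H2(max) was rejected: H2(max) = log2(m^2) swings with the alphabet
# size m, so the same text scores very differently under Currier (m = 36)
# and Basic Frogguy (m = 23), even though rarely used characters barely
# move h1 or h2.
from math import log2

def h2_max(m):
    return log2(m ** 2)  # maximum absolute digraph entropy, in bits

# h2_max(36) is about 10.34 bits; h2_max(23) is about 9.05 bits --
# a gap of roughly 1.3 bits created purely by the choice of alphabet.
```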
    To start the discussion, here are some data from the English King James Bible:









Table 1: English King James Bible - 1 Kings

    Passage Beginning at   # ch.   File Size   h0      h1      h2      h1-h2
    1:1                    27      32000       4.755   4.022   3.068   0.953
    8:19                   27      32000       4.755   4.028   3.090   0.939
    15:27                  27      32000       4.755   3.998   3.092   0.906
    Average of three       27      96000       4.755   4.016   3.083   0.933
    The h1-h2 range for different portions of the same text is 0.906-0.953.
    And here are data on the corresponding portions of the Latin Vulgate Bible:







Table 2: Latin Vulgate Bible - 1 Kings

    Passage Beginning at   # ch.   File Size   h0      h1      h2      h1-h2
    1:1                    24      32000       4.585   4.002   3.309   0.692
    8:19                   24      32000       4.585   3.994   3.287   0.707
    15:27                  24      32000       4.585   4.005   3.304   0.700
    Average of three       24      96000       4.585   4.000   3.300   0.700
    The average h1-h2 is 0.700, compared to 0.933 for the English text. This is undoubtedly because English uses more combinations of two or more letters to represent single phonemes than Latin does. The range of h1-h2 for the Latin text, 0.692-0.707, is narrower than for the English text.


    The next table shows the h1-h2 statistic for assorted files in various languages and notations. This shows how the h1-h2 statistic sometimes shows unexpected information. For instance, Hawaiian and Japanese have low h2 values, approaching Voynich text, in phonemic notation. However, the h1-h2 values for Hawaiian and Japanese are far less than Voynich text.





Table 3: h1-h2 Statistics for Selected Texts

    File                                              # ch.   File Size   h0      h1      h2      h1-h2
    Latin - Vulgate Bible, 1 Kings, first 32K         24      32000       4.585   4.002   3.309   0.692
    Hawaiian (Bennett, limited phonemic)              13      15000       3.700   3.200   2.454   0.746
    Hawaiian newspaper (full phonemic)                19      13473       4.248   3.575   2.650   0.925
    English - King James Bible - Genesis, first 32K   27      32000       4.755   3.969   3.020   0.949
    Japanese Tale of Genji - Section 1 (romaji)       22      32000       4.459   3.763   2.677   1.086
    Japanese Tale of Genji - Section 1 (kana)         71      20622       6.150   4.764   3.393   1.370
    Voynich Herbal-B (Currier)                        34      13858       5.087   3.796   2.267   1.529
    Voynich Herbal-B (EVA)                            21      16061       4.392   3.859   2.081   1.778
Entropies of Voynich Texts
    Here are entropy results for samples of Voynich Herbal-A and Herbal-B text. The Herbal-A sample's h1-h2 ranges from 1.479 to 1.945, depending on which transcription alphabet is used; the Herbal-B sample's ranges from 1.529 to 1.897. All of these are far greater than the 0.93 for English and 0.70 for Latin.
    The choice of transcription alphabet also makes an enormous difference. From Currier to Frogguy the range of h1-h2 is 1.5-1.9. The direction is what one would expect: Currier is the most synthetic, while Frogguy is the most analytical, decomposing single Currier characters into several Frogguy characters. Thus Currier Q = Frogguy cqpt.



Table 4: Voynich Texts

    Type of Voynich Text   Transcription Alphabet   # ch.   File Size   h0      h1      h2      h1-h2
    Herbal-A               Currier                  33      9804        5.044   3.792   2.313   1.479
    Herbal-A               FSG                      24      10074       4.585   3.801   2.286   1.515
    Herbal-A               EVA                      21      12218       4.392   3.802   1.990   1.812
    Herbal-A               Frogguy                  21      13479       4.392   3.826   1.882   1.945
    Herbal-B               Currier                  34      13858       5.087   3.796   2.267   1.529
    Herbal-B               FSG                      24      14203       4.585   3.804   2.244   1.560
    Herbal-B               EVA                      21      16061       4.392   3.859   2.081   1.778
    Herbal-B               Frogguy                  21      17909       4.392   3.846   1.949   1.897
    The samples of Voynich text are relatively small. The following statistics on samples of a single known Latin text give some idea of how much difference this might make.









Table 5: Texts from Latin Vulgate Bible, 1 Kings, for Study of the Effect of Sample Size on Entropy Data (Passages All Begin at 1:1)

    Passage Ending at   # ch.   File Size   h0      h1      h2      h1-h2
    2:18                23      8929        4.524   3.994   3.263   0.731
    4:21                24      18623       4.585   3.995   3.298   0.697
    7:17                24      29647       4.585   4.003   3.309   0.694
    It is doubtful whether h1-h2 or any other single measure can tell us all we want. However, the representation system is probably the heart of the issue. The following discussion of verbose ciphers is a case in point.
Verbose Ciphers
    A verbose cipher, one that substitutes several ciphertext characters for one plaintext character, can produce the entropy profile seen in Voynich text. One such system is Cat Latin C, which is applied to Latin plaintext. Vowels and consonants were added roughly in proportion to their occurrence in Latin; this keeps h1 roughly the same as for Latin and for Voynich FSG. The repeated digraphs are what reduce h2 to the desired level. If q is followed by u, it reads as in normal Latin; otherwise it fits one of the consonant patterns, so the scheme is unambiguous. This scheme does produce VMs-like entropies!


    This table shows the Cat Latin verbose cipher:





Table 6: Cat Latin C

    Plaintext   Ciphertext
    a           a
    b           bqbababa
    c           c
    d           dqdede
    e           e
    f           fqfififi
    g           gqgogogo
    h           h
    i           i
    j           jqjajaja
    k           k
    m           mqmememe
    n           nqninini
    o           o
    p           pqpopopo
    qu          qu
    r           rqrarara
    s           sqsesese
    t           tqtititi
    u           u
    v           v
    w           w
    x           xqxoxoxo
    y           y
    z           zqzazaza
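The mapping above can be sketched as a short program. This is a hypothetical reimplementation for illustration, not the program actually used for the paper; note that the table as given has no entry for "l", so unmapped characters are simply passed through.

```python
# A sketch of Cat Latin C encipherment from the table above. The digraph
# "qu" maps to itself; every other plaintext letter maps via the table.
CAT_LATIN_C = {
    'a': 'a', 'b': 'bqbababa', 'c': 'c', 'd': 'dqdede', 'e': 'e',
    'f': 'fqfififi', 'g': 'gqgogogo', 'h': 'h', 'i': 'i', 'j': 'jqjajaja',
    'k': 'k', 'm': 'mqmememe', 'n': 'nqninini', 'o': 'o', 'p': 'pqpopopo',
    'r': 'rqrarara', 's': 'sqsesese', 't': 'tqtititi', 'u': 'u', 'v': 'v',
    'w': 'w', 'x': 'xqxoxoxo', 'y': 'y', 'z': 'zqzazaza',
}

def encipher(plaintext):
    out, i = [], 0
    while i < len(plaintext):
        if plaintext[i:i + 2] == 'qu':   # "qu" passes through unchanged
            out.append('qu')
            i += 2
        else:
            # Unmapped characters (spaces, the missing "l") pass through.
            out.append(CAT_LATIN_C.get(plaintext[i], plaintext[i]))
            i += 1
    return ''.join(out)

print(encipher('et rex david'))
# etqtititi rqrararaexqxoxoxo dqdedeavidqdede
```

Note how quickly the ciphertext swells: a 12-character plaintext becomes 43 characters, which is why the Cat Latin C file in the tables below is roughly 3.5 times the size of its Latin source.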
    For comparison, here are VMs results in FSG, since the size of that character set is closest to Latin's.



Table 7: Verbose Cipher Compared to Voynich Text

    File                                 # ch.   File Size   h0      h1      h2      h1-h2
    Voynich Herbal-A (FSG)               24      10074       4.585   3.801   2.286   1.515
    Voynich Herbal-B (FSG)               24      14203       4.585   3.804   2.244   1.560
    Latin Vulgate, 1 Kings, 1:1 - 2:11   23      8232        4.524   3.996   3.262   0.734
    Above passage, Cat Latin C           23      28754       4.524   3.873   2.278   1.595
    However, it's clear that this is not the same pattern as Voynich text. It might be best to look for patterns subjectively. Here are some text samples.
    The start of the Voynich Herbal-A sample file (f29v, lines 1- 9), in EVA:
kshol qoocph shor pshocph shepchy qoty dy shory
ykcholy qoty chy dy qokchol chor tchy qokchody cheor o
chor chol chy choiin
tshoiin cheor chor o chty qotol sheol shor daiin qoty
otol chol daiin chkaiin shoiin qotchey qotshey daiiin
daiin chkaiin
pchol oiir chol tsho daiin sho teo chy chtshy dair am
okain chan chain cthor dain yk chy daiin cthol
sot chear chl s choly dar

    The beginning of a Hawaiian sample file, from a Hawaiian newspaper, to be discussed later:
    kepakemapa mei puke kepakemapa mei mahalo 'ia ka 'Olelo hawai'i e nA mAka' na ho'Olanani kim ma ka lA o malaki ua noa ka pAka 'o kapi'olani no ke anaina na lAkou ke kuleana 'o ka mAlama 'ana ma ka 'Olelo 'ana aku i ka 'Olelo hawai'i ma laila nO i 'Akoakoa ai ka po'e haumAna ka po'e kumu ka po'e mAkua a me ka po'e hoa o kElA 'ano kEia 'ano o ka 'Olelo hawai'i a ma laila nO ho'i i launa ai ka po'e ma o ka 'Olelo hawai'i kapa 'ia kEia lA hoihoi 'o ka lA 'ohana
    Finally, the beginning of the Latin Vulgate 1 Kings in Cat Latin C:
    etqtititi rqrararaexqxoxoxo dqdedeavidqdede sqseseseenqnininiuerqrararaatqtititi habqbababaebqbababaatqtititique aetqtititiatqtititiisqsesese pqpopopolurqrararaimqmememeosqsesese dqdedeiesqsesese cumqmememeque opqpopopoerqrararairqrararaetqtititiurqrarara vesqsesesetqtititiibqbababausqsesese nqnininionqninini calefqfififiiebqbababaatqtititi dqdedeixqxoxoxoerqrararaunqnininitqtititi erqrararagqgogogoo ei sqseseseerqrararavi ...
    Look at these samples and think about the kind of repetition involved in each case! The "Cat Latin C" verbose cipher is clearly not the same thing as Voynichese.
    Here are the entropy values for these samples:







Table 8: Statistics on Text Samples

    File                                              # ch.   File Size   h0      h1      h2      h1-h2
    Voynich Herbal-A (EVA)                            21      12218       4.392   3.802   1.990   1.812
    Hawaiian newspaper (full phonemic)                19      13473       4.248   3.575   2.650   0.925
    Latin Vulgate, 1 Kings, 1:1 - 2:11, Cat Latin C   23      28754       4.524   3.873   2.278   1.595
    The author's personal opinion is that the rigid internal structure of Voynich text accounts for the low h2 measures. The majority of Voynich "words" follow a paradigm. Robert Firth (Work Note #24) and Jorge Stolfi (Voynich Page) have both identified paradigms. Captain Prescott Currier (Currier's Papers) identified several other kinds of internal structure in Voynich text.
Repetitive Texts
    From time to time, some have suggested that the Voynich Manuscript is simply a very repetitious text. Here is a repetitious magical spell in Old High German:

         eiris sazun idisi             sazun her duoder
         suma hapt heptidun            suma heri lezidun
         suma clubodun                 umbi cuoniouuidi
         insprinc haptbandun           inuar uigandun
         phol ende uuodan              uuorun zi holza
         du uuart demo balderes uolon  sin uuoz birenkit
         thu biguol en sinthgunt       sunna era suister
         thu biguol en friia           uolla era suister
         thu biguol en uuodan          so he uuola conda
         sose benrenki                 sose bluotrenki
         sose lidirenki
         ben zi bena                   bluot zi bluoda
         lid zi geliden                sose gelimida sin
    Merseburger Zaubersprüche (Magic Spells from Merseburg) in Old High German. Note: 'uu' = 'w'.
    An experiment to test this idea is to take samples of known repetitious texts (food recipes, religious texts, catalogs) and compare their second-order entropies with those of known texts that should be less repetitious (prose fiction, essays).
    Note that some long texts were larger than MONKEY's 32,000 character limit; in those cases MONKEY just took the first 32,000 characters. Some long texts were divided up into separate portions that MONKEY could analyze.
    Jacobean English. Ever since its publication, many commentators have noted how repetitious the Book of Mormon is. The Bible itself is, of course, somewhat repetitious. A (relatively) non-repetitious text in Jacobean English is the Essays of Sir Francis Bacon.
    The Book of Mormon appears to be the most repetitious: h1-h2 for its excerpts ranges 0.931-0.980. The King James Bible is next, at 0.904-0.983. The non-repetitious Essays of Francis Bacon range 0.827-0.837. Taking averages, h1-h2 for the most repetitious text versus the least is 0.951 versus 0.831, a difference of 0.120.





Table 9: Jacobean English Texts of Varying Repetition

    File                             # ch.   File Size   h0      h1      h2      h1-h2
    Book of Mormon - 1 Nephi         27      32000       4.755   4.033   3.090   0.942
    Book of Mormon - Alma            27      32000       4.755   4.041   3.109   0.931
    Book of Mormon - Ether           27      32000       4.755   4.009   3.029   0.980
    King James Bible - Genesis       27      32000       4.755   3.969   3.020   0.949
    King James Bible - Joshua        27      32000       4.755   4.012   3.029   0.983
    King James Bible - Acts          27      32000       4.755   4.041   3.137   0.904
    Francis Bacon's Essays, Part 1   27      32000       4.755   4.048   3.220   0.827
    Francis Bacon's Essays, Part 2   27      32000       4.755   4.042   3.214   0.828
    Francis Bacon's Essays, Part 3   27      32000       4.755   4.066   3.229   0.837
    Latin (Late Classical). Samples of the Vulgate Bible and Boethius' Consolations of Philosophy were analyzed. There is little difference in the statistics between the Vulgate Bible and the presumably less repetitious Consolatio Philosophiae.









Table 10: Latin Texts of Varying Repetition

    File                                               # ch.   File Size   h0      h1      h2      h1-h2
    1 Kings, Vulgate, 1:1                              24      32000       4.585   4.002   3.309   0.692
    1 Kings, Vulgate, 8:19                             24      32000       4.585   3.994   3.287   0.707
    1 Kings, Vulgate, 15:27                            24      32000       4.585   4.005   3.304   0.700
    Boethius - Consolatio Philosophiae - Books 3 & 4   25      32000       4.644   3.971   3.272   0.699
    Modern English. Repetitive texts: food recipes (chicken and Cajun), a catalog of technical standards, and a Roman Catholic litany. For a non-repetitious text: a short story, "The Blue Hotel" by Stephen Crane.
    The non-repetitious short story "The Blue Hotel" has an h1-h2 of 0.826, while the repetitious Roman Catholic Litany has an h1-h2 of 0.968. The difference is 0.968 - 0.826 = 0.142. The other texts mostly fall in between, although the presumably repetitious Cajun recipe has an h1-h2 of 0.827, almost identical to the short story.







Table 11: Modern English Texts of Varying Repetition

    File                                                             # ch.   File Size   h0      h1      h2      h1-h2
    Modern English - Roman Catholic litany                           26      9492        4.700   4.071   3.103   0.968
    Modern English - ISO 14000 catalog                               27      6696        4.755   4.076   3.137   0.939
    Modern English - The Blue Hotel by Stephen Crane (short story)   27      32000       4.755   4.073   3.247   0.826
    Modern English - Cajun recipe                                    27      27363       4.755   4.124   3.297   0.827
    Modern English - Chicken recipe                                  27      18461       4.755   4.131   3.193   0.938
    For comparison, here are data for Voynich texts in FSG, which has the character set closest in size to the ordinary Latin alphabet.





Table 12: Voynich Texts in FSG

    Type of Voynich Text   Transcription Alphabet   # ch.   File Size   h0      h1      h2      h1-h2
    Herbal-A               FSG                      24      10074       4.585   3.801   2.286   1.515
    Herbal-B               FSG                      24      14203       4.585   3.804   2.244   1.560
    When one compares the differences due to repetition in English texts (0.968 - 0.826 = 0.142 for modern English and 0.951 - 0.831 = 0.120 for Jacobean English) with the h1-h2 values for Voynich text (1.515 or 1.560 in FSG), it becomes clear that a repetitious underlying format or subject matter could not turn a text in a normal European language into a Voynich text. Thus, Voynich text clearly does not have its low h2 measures solely because of a repetitious underlying text, that is, one that often repeats the same words and phrases.
Schizophrenic Language
    In an important paper that discusses the Voynich Manuscript, Professor Sergio Toresella says that the VMs author had a psychiatric disturbance. In one of the works cited by Toresella in this connection, Creativity by Silvano Arieti, Arieti talks about the distorted language of schizophrenics but not other language phenomena.
    At the Kooks Museum, there is a sample of schizophrenic language. In the Schizophrenic Wing, there is a transcript of flyers by Francis E. Dec, containing two Rants:

    Here is an excerpt from Rant #2:


    "Computer God computerized brain thinking sealed robot operating arm surgery cabinet machine removal of most of the frontal command lobe of the brain, gradually, during lifetime and overnight in all insane asylums after Computer God kosher bosher one month probation period creating helpless, hopeless Computer God Frankenstein Earphone Radio parroting puppet brainless slaves, resulting in millions of hopeless helpless homeless derelicts in all Jerusalem, U.S.A. cities and Soviet slave work camps. Not only the hangman rope deadly gangster parroting puppet scum-on-top know this top medical secret, even worse, deadly gangster Jew disease from deaf Ronnie Reagan to U.S.S.R. Gorbachev know this oy vay Computer God Containment Policy top secret. Eventual brain lobotomization of the entire world population for the Worldwide Deadly Gangster Communist Computer God overall plan, an ideal worldwide population of light-skinned, low hopeless and helpless Jew-mulattos, the communist black wave of the future."
    The samples and discussion of schizophrenic talk in Arieti resemble Francis Dec's, in repeated but disconnected ideas, alliteration, etc.
    MONKEY was run on the two Rants and the results were compared with examples of normal English text:







Table 13: Schizophrenic Rant Compared to Other English Texts

    File                                                             # ch.   File Size   h0      h1      h2      h1-h2
    Schizophrenic rant                                               27      12967       4.755   4.182   3.428   0.755
    King James Bible - Genesis                                       27      32000       4.755   3.969   3.020   0.949
    Francis Bacon's Essays, Part 1                                   27      32000       4.755   4.048   3.220   0.827
    Modern English - Roman Catholic litany                           26      9492        4.700   4.071   3.103   0.968
    Modern English - The Blue Hotel by Stephen Crane (short story)   27      32000       4.755   4.073   3.247   0.826
    The second-order entropy of the schizophrenic rants is definitely higher, and h1-h2 lower, than any of the ordinary texts. As with the repetitive texts, the nature of the text itself would not by itself explain the puzzling nature of VMs text.
Low-Entropy Natural Languages
    One may write Japanese in Latin characters (romaji) or in the syllabic scripts hiragana and katakana (the kana). In romaji, Japanese is a low-entropy language because of a relatively small phonemic inventory and severe phonotactic constraints. A Japanese syllable may begin with zero or one consonant (counting ts, ry, and ky as one consonant), has one vowel, and ends with nothing or -n (although the following syllable's consonant may be doubled). (Japanese also distinguishes some long and short vowels, which complicates this a little.)
    However, the very fact of these severe phonotactic constraints makes only a limited number of syllables possible in Japanese and therefore makes a syllabic script such as kana feasible. One would expect Japanese in kana to have a higher relative h2 (lower h1-h2) than Japanese in romaji.
    Hawaiian has even more severe phonotactic constraints, and thus one ought to be able to write Hawaiian in a syllabic script. In Hawaiian a syllable may begin in zero or one consonant, have only one vowel, and may only end in nothing! Hawaiian has a much more limited phonemic inventory than Japanese. Hawaiian is especially significant because Bennett compared Voynichese to Hawaiian and noted that they had similar second-order entropies. Bennett said that some Polynesian languages are the only natural languages with second-order entropies as low as Voynichese.
    Therefore, in order to gain insight on these issues, Hawaiian and Japanese are compared in syllabic as well as phonemic notation.
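The syllable constraints just described can be sketched as regular expressions over romanized words. This is an illustrative simplification by the editor, not part of the original analysis: it ignores long vowels, doubled consonants, and some Japanese onsets (sh, ch, etc.).

```python
# Rough sketches of the phonotactic constraints described above.
import re

# Hawaiian syllable: optional consonant (including the glottal stop '),
# then exactly one vowel; syllables never end in a consonant.
HAWAIIAN_WORD = re.compile(r"^(?:[hklmnpw']?[aeiou])+$")

# Japanese syllable: optional onset (ts, ry, ky counted as one consonant),
# one vowel, optional final -n.  Simplified inventory.
JAPANESE_WORD = re.compile(r"^(?:(?:ts|ry|ky|[kgsztdnhbpmyrwjf])?[aeiou]n?)+$")
```

Under these patterns, Hawaiian words like "kuleana" and Japanese words like "yamamoto" parse cleanly, while a consonant-cluster word like "strand" cannot be segmented in either language, which is exactly why both languages admit small syllabaries.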
Japanese
    The classic Japanese novel Tale of Genji is written almost entirely in kana. Gabriel Landini kindly adapted this both into romaji and into a kana notation that MONKEY could analyze.









Table 14: Entropies of Japanese in Romaji and Kana

    File                        Orthography   # ch.   File Size   h0      h1      h2      h1-h2
    Tale of Genji - Section 1   Romaji        22      32000       4.459   3.763   2.677   1.086
    Tale of Genji - Section 2   Romaji        20      31505       4.322   3.751   2.627   1.124
    Tale of Genji - Section 3   Romaji        20      29474       4.322   3.749   2.639   1.110
    Tale of Genji - Section 4   Romaji        20      32000       4.322   3.750   2.641   1.109
    Tale of Genji - Section 5   Romaji        20      27064       4.322   3.744   2.630   1.114
    Tale of Genji - Overall     Romaji        22      152043      4.459   3.751   2.643   1.108
    Tale of Genji - Section 1   Kana          71      20622       6.150   4.764   3.393   1.370
    Tale of Genji - Section 2   Kana          71      20622       6.150   4.764   3.393   1.370
    Tale of Genji - Section 3   Kana          70      18574       6.129   4.709   3.410   1.298
    Tale of Genji - Section 4   Kana          70      20386       6.129   4.716   3.464   1.252
    Tale of Genji - Section 5   Kana          70      17096       6.129   4.698   3.362   1.337
    Tale of Genji - Overall     Kana          71      97300       6.150   4.730   3.404   1.326
    As one would expect, the absolute h0, h1, and h2 numbers for kana are much higher than those for romaji. However, the differences for h1-h2 are consistently higher for kana, which one would not expect.
Hawaiian
    Bennett did his Hawaiian study with a limited Hawaiian orthography that did not recognize vowel length or the glottal stop. Therefore, statistics were run both on Hawaiian in limited phonemic and syllabic spellings, with long/short vowels not separated and glottal stop not indicated, and in full phonemic and syllabic notation.


    Hawaiian has the following phonemes:


    Consonants: h k l m n p w ' (glottal stop)
    Vowels: a e i o u A E I O U (capitals mark long vowels)

    Bennett used a "lossy" Hawaiian orthography that did not distinguish the long vowels and did not write the glottal stop (call this Hawaiian limited phonemic). He also had his own Voynich transcription alphabet. Finally, he compared only the absolute h2 values, not relative measures such as h1-h2. It is as good an illustration as any of the problems here.
    Here is a sample of the Hawaiian newspaper text used in this paper for statistics in Bennett's notation:
    ma ka la o malaki ua noa ka paka o kapiolani no ke anaina na lakou ke kuleana o ka malama ana ma ka olelo ana aku i ka olelo hawaii ma laila no i Akoakoa ai ka poe haumana ka


    And here is the same text in full phonemic notation:
    ma ka lA o malaki ua noa ka pAka 'o kapi'olani no ke anaina na lAkou ke kuleana 'o ka mAlama 'ana ma ka 'Olelo 'ana aku i ka 'Olelo hawai'i ma laila nO i 'Akoakoa ai ka po'e haumAna ka
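The reduction from full phonemic notation to Bennett's limited orthography can be sketched in one line, assuming (as in the samples above) that long vowels are written as capitals and the glottal stop as an apostrophe. This helper is the editor's illustration, not the paper's actual preprocessing.

```python
# Reduce full phonemic Hawaiian to Bennett's limited orthography:
# drop the glottal stop (') and merge long vowels (capitals) into
# their short counterparts.
def to_limited(full):
    return full.replace("'", "").lower()

print(to_limited("ma ka lA 'o kapi'olani"))
# ma ka la o kapiolani
```

The transformation is lossy and irreversible: distinct words that differ only in vowel length or a glottal stop collapse together, which is partly why the limited notation shows a lower h1-h2 in Table 15.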


    Here are the entropy values.





Table 15: Entropies of Hawaiian Texts in Different Orthographies

    File                 Orthography        # ch.   File Size   h0      h1      h2      h1-h2
    Hawaiian (Bennett)   limited phonemic   13      15000       3.700   3.200   2.454   0.746
    Hawaiian newspaper   limited phonemic   13      13097       3.700   3.224   2.437   0.787
    Hawaiian newspaper   limited syllabic   39      9533        5.285   3.816   2.929   0.887
    Hawaiian newspaper   full phonemic      19      13473       4.248   3.575   2.650   0.925
    Hawaiian newspaper   full syllabic      77      9160        6.267   4.361   3.162   1.200
    And here are data for Bennett's and this paper's Voynich texts for comparison:







Table 16: Voynich Texts for Comparison with Hawaiian

    Type of Voynich Text   Transcription Alphabet   # ch.   File Size   h0      h1      h2      h1-h2
    Voynich (Bennett)      Bennett                  21      10000       4.392   3.660   2.220   1.440
    Herbal-A               Currier                  33      9804        5.044   3.792   2.313   1.479
    Herbal-A               FSG                      24      10074       4.585   3.801   2.286   1.515
    Herbal-A               EVA                      21      12218       4.392   3.802   1.990   1.812
    Herbal-A               Frogguy                  21      13479       4.392   3.826   1.882   1.945
    Herbal-B               Currier                  34      13858       5.087   3.796   2.267   1.529
    Herbal-B               FSG                      24      14203       4.585   3.804   2.244   1.560
    Herbal-B               EVA                      21      16061       4.392   3.859   2.081   1.778
    Herbal-B               Frogguy                  21      17909       4.392   3.846   1.949   1.897
    Bennett compared his Voynich text in a 21-character transcription to Hawaiian in a 13-character orthography (including the space character). He got h2 values of 2.220 for Voynich text and 2.454 for his Hawaiian text. However, a sample of Hawaiian text in a full phonemic orthography, with 19 characters including spaces, has h2 of 2.650, even higher. A comparison of h1-h2 values shows a dramatic difference between Hawaiian and Japanese on one hand and Voynichese on the other. h1-h2 equals 1.8 for Voynichese in EVA. h1-h2 is 0.746 for Bennett's Hawaiian data, 0.925 for Hawaiian in full phonemic notation, and 1.1 for Japanese romaji. These figures are all very different from Voynichese.
Discussion of Phonemic versus Syllabic Notation
    While perhaps not germane to the Voynich Manuscript problem, it is odd that h1-h2 increases from phonemic to syllabic notation, both for Japanese and Hawaiian. In syllabic notation, given the first character, the second character is more predictable than it is in phonemic notation. This is quite puzzling. How can we explain these results for Hawaiian and Japanese?
The Size of the Character Set
    In going from phonemic to syllabic notation, the text becomes shorter and more information is packed into fewer characters, but that is accomplished by using a larger character set. The syllabic notations use more than three times as many characters as the phonemic ones. The measure h1-h2 was chosen to minimize the effect of character-set size, but it surely does not eliminate that effect entirely.


The Effect of Word Divisions
    Perhaps one loses predictability because the number of space characters in relation to the total is greater for syllabic notation than for phonemic. If that were the case, leaving out the spaces ought to decrease h1-h2 for syllabic notation more than for phonemic notation. MONKEY runs were made leaving out the spaces to test this. However, the h1-h2 results for syllabic notation decrease less than those for phonemic notation do.





Table 17: The Effect of Word Divisions on Statistics for Japanese and Hawaiian

    File                                 Orthography     Spaces Included   # ch.   File Size   h0      h1      h2      h1-h2
    Japanese Tale of Genji - Section 1   Romaji          Yes               22      32000       4.459   3.763   2.677   1.086
    Japanese Tale of Genji - Section 1   Romaji          No                21      26106       4.392   3.803   2.935   0.868
    Japanese Tale of Genji - Section 1   Kana            Yes               71      20622       6.150   4.764   3.393   1.370
    Japanese Tale of Genji - Section 1   Kana            No                70      14051       6.129   5.666   4.330   1.337
    Hawaiian newspaper                   Full Phonemic   Yes               19      13473       4.248   3.575   2.650   0.925
    Hawaiian newspaper                   Full Phonemic   No                18      10433       4.170   3.622   2.935   0.687
    Hawaiian newspaper                   Full Syllabic   Yes               77      9160        6.267   4.361   3.162   1.200
    Hawaiian newspaper                   Full Syllabic   No                76      6120        6.248   5.156   3.982   1.174
Redundancy
    Gabriel Landini, who did graduate studies in Japan, noted that the redundancy of Japanese is only apparent: the language is actually rather ambiguous. In writing the ambiguity is resolved with ideographs (kanji); in speech it is resolved by context and by rigid structures (set phrases and expressions).
    However, Jacques Guy (who holds a doctorate in Polynesian languages and was once fluent in Tahitian) notes that Tahitian, which is similar to Hawaiian, is no more ambiguous than English or French! So redundancy is not likely the explanation.
The Effect of Syllable Divisions
    Could the (relatively) high h1-h2 values for syllabic Hawaiian and Japanese mean that combinations of two syllables (e.g., yama in Japanese, wiki in Hawaiian) are as repetitious and fixed as combinations of phonemes within syllables?
    The phonemic vs. syllabic problem is more complex than this. Take "yamamoto" in romaji and in kana: (ya)(ma)(mo)(to). When analysing the second-order entropy of the romaji, one looks at the distribution of the digraphs "ya", "am", "ma", "mo", "ot", "to", while for the kana it is "(ya)(ma)", "(ma)(mo)", "(mo)(to)". For roughly half of the romaji digraphs, one deals with combinations of letters ("am", "ot") that are never represented in kana. So the second-order entropy of one notation is not strictly comparable with that of the other. The second-order entropy of the romaji text is in principle close in meaning to the first-order entropy of the kana, but only about half of the digraphs correspond to kana.
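    The mismatch between the two digraph inventories can be made concrete with a short Python sketch (the syllable division of "yamamoto" is hard-coded here purely for illustration):

```python
# Digraphs of "yamamoto" as counted under the two notations.
romaji = "yamamoto"
romaji_digraphs = [romaji[i:i + 2] for i in range(len(romaji) - 1)]
# 'am' and 'ot' straddle syllable boundaries and have no kana counterpart.

kana = ["ya", "ma", "mo", "to"]  # the same word in kana-sized units
kana_digraphs = [kana[i] + kana[i + 1] for i in range(len(kana) - 1)]
```

    Counting pairs of kana units is thus closer to a first-order count over syllables than to the letter-pair count made on the romaji.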
    While the differences in statistics between syllabic and phonemic notation are interesting, they are not necessarily relevant to the Voynich Manuscript. They are chiefly of interest for the questions they raise about the use of the entropy concept.


Final Thoughts on Low-Entropy Natural Languages
    Consider again the start of the Herbal-A sample file (f29v, lines 1-9), in EVA:
    kshol qoocph shor pshocph shepchy qoty dy shory
    ykcholy qoty chy dy qokchol chor tchy qokchody cheor o
    chor chol chy choiin
    tshoiin cheor chor o chty qotol sheol shor daiin qoty
    otol chol daiin chkaiin shoiin qotchey qotshey daiiin
    daiin chkaiin
    pchol oiir chol tsho daiin sho teo chy chtshy dair am
    okain chan chain cthor dain yk chy daiin cthol
    sot chear chl s choly dar

    And then the beginning of the Hawaiian newspaper sample file:


    kepakemapa mei puke kepakemapa mei mahalo 'ia ka 'Olelo hawai'i e nA mAka' na ho'Olanani kim ma ka lA o malaki ua noa ka pAka 'o kapi'olani no ke anaina na lAkou ke kuleana 'o ka mAlama 'ana ma ka 'Olelo 'ana aku i ka 'Olelo hawai'i ma laila nO i 'Akoakoa ai ka po'e haumAna ka po'e kumu ka po'e mAkua a me ka po'e hoa o kElA 'ano kEia 'ano o ka 'Olelo hawai'i a ma laila nO ho'i i launa ai ka po'e ma o ka 'Olelo hawai'i kapa 'ia kEia lA hoihoi 'o ka lA 'ohana
    One sees that the low h2's of Hawaiian and Japanese are due to their very strict consonant-vowel alternation. The EVA Voynich sample shows that the consonant-vowel alternation of Voynichese (as determined by the Sukhotin vowel-recognition algorithm) is not as strict.
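    For readers who wish to reproduce the vowel identification, here is a minimal Python sketch of Sukhotin's algorithm in its textbook formulation (this is not the program actually used for the analyses above): letters are repeatedly promoted to vowel status when they are, on balance, adjacent to the other class more often than to their own.

```python
from collections import defaultdict

def sukhotin_vowels(words):
    """Sukhotin's vowel-recognition algorithm on a list of words."""
    # Symmetric adjacency counts between distinct letters.
    adj = defaultdict(lambda: defaultdict(int))
    letters = set()
    for w in words:
        letters.update(w)
        for a, b in zip(w, w[1:]):
            if a != b:
                adj[a][b] += 1
                adj[b][a] += 1
    # Every letter starts as a consonant; its score is its total adjacency.
    sums = {c: sum(adj[c].values()) for c in letters}
    vowels = []
    while sums:
        v = max(sums, key=sums.get)
        if sums[v] <= 0:          # no positive score left: all vowels found
            break
        vowels.append(v)
        del sums[v]
        # Discount adjacencies to the newly declared vowel.
        for c in sums:
            sums[c] -= 2 * adj[v][c]
    return set(vowels)
```

    On a toy input such as ["banana"], the algorithm correctly returns {"a"}.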
    Once again, h1-h2 equals 1.8 for Voynichese in EVA. h1-h2 is 0.746 for Bennett's Hawaiian data, 0.925 for Hawaiian in full phonemic notation, and 1.1 for Japanese romaji. These figures are all very different from Voynichese.
    For these reasons, it seems unlikely that an underlying low-entropy natural language explains the low h2 measures of Voynich text.


Suggestions for Further Work
    The various h2 measures are only crude, partial measures of all the factors that interest us. However, the entropy measure will continue to be useful. It would be nice to have a program that would calculate the entropies of files larger than 32K and calculate higher-order entropies more accurately.
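    As a starting point, the measures used throughout this paper can be computed in a few lines of modern Python with no 32K file-size limit (a sketch, not MONKEY's own code; h2 here is the conditional digraph entropy, as in the tables above):

```python
import math
from collections import Counter

def entropies(text):
    """Return (h0, h1, h2, h1 - h2) in bits per character.

    h0 = log2 of the alphabet size, h1 = single-character entropy,
    h2 = conditional entropy of a character given its predecessor.
    """
    n = len(text)
    singles = Counter(text)
    pairs = Counter(zip(text, text[1:]))
    h0 = math.log2(len(singles))
    h1 = -sum(c / n * math.log2(c / n) for c in singles.values())
    # h2 = H(digraph) - H(first character of digraph)
    m = n - 1
    h_pair = -sum(c / m * math.log2(c / m) for c in pairs.values())
    firsts = Counter(text[:-1])
    h_first = -sum(c / m * math.log2(c / m) for c in firsts.values())
    h2 = h_pair - h_first
    return h0, h1, h2, h1 - h2
```

    For a perfectly alternating text like "abababab", h1 is 1 bit but h2 is 0: each character fully predicts the next, so h1-h2 takes its maximum value. Real texts with strict consonant-vowel alternation approach this situation, which is why it raises h1-h2.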


    The author believes that the "paradigms" and other structural restrictions of Voynichese explain the low h2 measures. Further study of these structural constraints will be most useful.
Acknowledgments
    Many of these ideas and data were previously discussed on the Voynich E-mail list. A special thanks to Gabriel Landini and Rene Zandbergen for their assistance.


References for Electronic Texts
    Voynich Text
        Rene Zandbergen kindly provided samples of Herbal-B and Herbal-A from voynich.now.
        Herbal-B: 26r, 26v, 31r, 31v, 33r, 33v, 34r, 34v, 39r, 39v, 40r, 40v, 41r, 41v, 43r, 43v, 46r, 46v, 48r, 48v, 50r, 50v, 55r, 55v, 57r
   
   
        Selected Herbal-A: 28v, 29r, 29v, 30r, 30v, 32r, 32v, 35r, 35v, 36r, 36v, 37r, 37v, 38r, 38v, 42r, 42v, 44r, 44v, 45r, 45v, 47r, 47v, 49r, 49v
   
   
    Jacobean English
        Book of Mormon
        Bible, KJV
        Sir Francis Bacon, Essays
    Late Classical Latin
        Vulgate Latin Bible (Estragon or Gopher)
        Boethius: Consolatio Philosophiae: Book 3 & Book 4
    Modern English
        Catholic Litany
        ISO Standard Catalog
        "The Blue Hotel", by Stephen Crane
        Chicken Recipe
        Cajun Recipes, Part 1 and Part 2
    Japanese Text
        Gabriel Landini kindly prepared this. The text is the first 4 parts of the Genji monogatari [Tale of Genji, a classic Japanese novel mostly written in hiragana]: 01 Kiritsubo, 02 Hahakigi, 03 Utsusemi, 04 Yugao.
        The "kana" output is not kana, of course, but an arbitrary substitution for kana so that MONKEY could be applied.
   
   
    Hawaiian
        The author prepared the Hawaiian texts. Hawaiian has the following phonemes:
   
   
        Consonants: h k l m n p w '(glottal stop)
        Vowels: a e i o u A E I O U (capitals mark long vowels)
   
        However, the difference between long and short vowels is often not indicated, and the glottal stop is often not written. Both clearly need to be written here, since even with them Hawaiian has a rather limited phonemic inventory.
   
   
        The Hawaiian text came from all the articles in this issue of a Hawaiian newspaper:
     Na Maka o Kana
    Puke 5, Pepa 5
    15 Malaki, 1997
   
        The text was changed to the notation above. All numbers and all English, Japanese, and other foreign words were removed until the character set (the number of characters MONKEY showed) matched the Hawaiian notation. A syllabic script for Hawaiian, using characters that MONKEY recognizes, was devised.
    Schizophrenic Language
        At the Kooks Museum, in the Schizophrenic Wing, there is a transcript of flyers by Francis E. Dec, containing two schizophrenic Rants:
     Francis E. Dec, Esquire
    Transcripts of flyers
   
Printed References
    Arieti, Silvano. Creativity : the magic synthesis. New York : Basic Books, c1976. Library of Congress call number: BF408.A64
   
   
    Bennett, William Ralph. Scientific and Engineering Problem Solving with the Computer. Englewood Cliffs: Prentice-Hall, 1976. [Contains a chapter on VMS.]
    D'Imperio, M. E. The Voynich Manuscript--An Elegant Enigma. National Security Agency, 1978. Aegean Park Press, 1978?
    Toresella, Sergio. ``Gli erbari degli alchimisti.'' [Alchemical herbals.] In Arte farmaceutica e piante medicinali -- erbari, vasi, strumenti e testi dalle raccolte liguri, [Pharmaceutical art and medicinal plants -- herbals, jars, instruments and texts of the Ligurian collections.] Liana Saginati, ed. Pisa: Pacini Editore, 1996, pp.31-70. [Profusely illustrated. Fits the VMS into an ``alchemical herbal'' tradition.]
   
   
Copyright © 1998 by Dennis J. Stallings, all rights reserved.