Appendix 1: Details of the Procedure

In this Appendix we describe the procedure in sufficient detail to enable the reader to repeat the computations precisely. Some motivation for the various definitions is also provided.

In Section A.1, a "raw" measure of distance between words is defined. Section A.2 explains how we normalize this raw measure to correct for factors like the length of a word and its composition (the relative frequency of the letters occurring in it). Section A.3 provides the list of personalities p with their dates p' and explains how the sample of word pairs (w, w') is constructed from this list. Section A.4 identifies the precise text of Genesis that we used. In Section A.5, we define and motivate the four summary statistics P1, P2, P3 and P4. Finally, Section A.6 provides the details of the randomization.

Sections A.1 and A.3 are relatively technical; to gain an understanding of the process, it is perhaps best to read the other parts first. 

A.1 The Distance between Words 

To define the "distance" between words, we must first define the distance between ELS's representing those words; before we can do that, we must define the distance between ELS's in a given array; and before we can do that, we must define the distance between individual letters in the array.

As indicated in Section 1, we think of an array as one long line that spirals down on a cylinder; its row length h is the number of vertical columns. To define the distance between two letters x and x', cut the cylinder along a vertical line between two columns. In the resulting plane each of x and x' has two integer coordinates, and we compute the distance between them as usual, using these coordinates. In general, there are two possible values for this distance, depending on the vertical line that was chosen for cutting the cylinder; if the two values are different, we use the smaller one.
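The cut-and-measure rule just described can be sketched in a few lines of Python. This is a minimal illustration, not the authors' program: it assumes each letter is identified by its integer position in the text, and the function name is hypothetical. Because the text spirals around the cylinder, two letters that are delta positions apart are separated either by (q rows, r columns) or by (q + 1 rows, h - r columns), where q and r are the quotient and remainder of delta divided by h; the sketch returns the smaller of the two squared distances.

```python
def letter_distance_sq(pos_a, pos_b, h):
    """Squared distance between two text positions in an array of row length h.

    Assumption: letters are identified by integer positions in the text.
    """
    rows, cols = divmod(abs(pos_a - pos_b), h)
    # One cut of the cylinder leaves the letters `rows` rows and `cols` columns apart.
    direct = rows ** 2 + cols ** 2
    # The other possible cut separates them by one more row and h - cols columns.
    wrapped = (rows + 1) ** 2 + (h - cols) ** 2
    return min(direct, wrapped)
```

For instance, two letters 5 positions apart are at squared distance 1 when h = 5 (vertical neighbours) and at squared distance 5 when h = 3 (a knight's move).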

Next, we define the distance between fixed ELS's e and e' in a fixed cylindrical array. Set 

f := the distance between consecutive letters of e,
f' := the distance between consecutive letters of e',
l := the minimal distance between a letter of e and one of e',

and define d(e, e') := f^2 + f'^2 + l^2. We call d(e, e') the distance between the ELS's e and e' in the given array; it is small if both fit into a relatively compact area. For example, in Figure 3 we have f = 1, f' = √5, l = √34 and d = 1 + 5 + 34 = 40.
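As a rough illustration of this definition, the following Python sketch computes d(e, e') for one fixed row length h. It assumes that an ELS is given as a triple (n, d, k) of start position, skip and length, so that its letters sit at positions n, n + d, ..., n + (k - 1)d; the function names are hypothetical and the code is not taken from the authors' software.

```python
def letter_distance_sq(pos_a, pos_b, h):
    """Smaller of the two possible squared distances on the cylinder."""
    rows, cols = divmod(abs(pos_a - pos_b), h)
    return min(rows ** 2 + cols ** 2, (rows + 1) ** 2 + (h - cols) ** 2)


def els_positions(els):
    """Text positions of the letters of an ELS given as (n, d, k)."""
    n, d, k = els
    return [n + j * d for j in range(k)]


def els_distance(e, e2, h):
    """d(e, e') = f^2 + f'^2 + l^2 in the array with row length h."""
    # f^2 and f'^2: squared distance between consecutive letters of each ELS.
    f_sq = letter_distance_sq(0, abs(e[1]), h)
    f2_sq = letter_distance_sq(0, abs(e2[1]), h)
    # l^2: minimal squared distance between a letter of e and a letter of e'.
    l_sq = min(
        letter_distance_sq(a, b, h)
        for a in els_positions(e)
        for b in els_positions(e2)
    )
    return f_sq + f2_sq + l_sq
```

With h = 5, the toy sequences (0, 5, 3) and (2, 5, 3) occupy two columns that are two apart, so f^2 = f'^2 = 1, l^2 = 4 and the sketch returns 6.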

Now there are many ways of writing Genesis as a cylindrical array, depending on the row length h. Denote by d_h(e, e') the distance d(e, e') in the array determined by h, and set m_h(e, e') := 1/d_h(e, e'); the larger m_h(e, e') is, the more compact is the configuration consisting of e and e' in the array with row length h. Set e = (n,d,k) (recall that d is the skip) and e' = (n',d',k'). Of particular interest are the row lengths h = h_1, h_2, ..., where h_i is the integer nearest to |d|/i (1/2 is rounded up). Thus when h = h_1 = |d|, e appears as a column of adjacent letters (as in Figure 1); and when h = h_2, e appears either as a column that skips alternate rows (as in Figure 2) or as a straight line of knight's moves (as in Figure 3). In general, the arrays in which e appears relatively compactly are those with row length h_i with i "not too large."
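The row lengths of interest are easy to tabulate. The short Python sketch below (function name hypothetical) uses integer arithmetic so that halves are always rounded up, as the text requires; Python's built-in round() rounds halves to the nearest even integer and would not match this rule. For small skips and larger i the result can be 0; the text does not say how such values are treated, and the sketch simply returns them.

```python
def row_lengths(skip, count=10):
    """Row lengths h_1, ..., h_count: the integer nearest |skip|/i, halves rounded up."""
    a = abs(skip)
    # floor(a/i + 1/2) written with integers only: (2a + i) // (2i)
    return [(2 * a + i) // (2 * i) for i in range(1, count + 1)]
```

For a skip of 5 the first three row lengths are 5, 3 and 2, since 5/2 = 2.5 rounds up to 3 and 5/3 is nearest to 2.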

Define h_i' analogously to h_i, using the skip d' of e'. The above discussion indicates that if there is an array in which the configuration (e,e') is unusually compact, it is likely to be among those whose row length is one of the first 10 h_i or one of the first 10 h_i'. (Here and in the sequel 10 is an arbitrarily selected "moderate" number.) So setting 

s(e, e') := \sum_{i=1}^{10} m_{h_i}(e, e') + \sum_{i=1}^{10} m_{h_i'}(e, e'),

we conclude that s (e, e') is a reasonable measure of the maximal "compactness" of the configuration (e, e') in any array. Equivalently, it is an inverse measure of the minimum distance between e and e'.
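Putting the pieces together, s(e, e') can be sketched as the sum of the reciprocal distances over the first 10 row lengths derived from each skip. The Python below is a minimal sketch under stated assumptions, not the authors' code: an ELS is a triple (n, d, k), all function names are hypothetical, exact fractions are used instead of floating point, and row lengths smaller than 1 are skipped (the text does not say how they are handled).

```python
from fractions import Fraction


def letter_distance_sq(pos_a, pos_b, h):
    """Smaller of the two possible squared distances on the cylinder."""
    rows, cols = divmod(abs(pos_a - pos_b), h)
    return min(rows ** 2 + cols ** 2, (rows + 1) ** 2 + (h - cols) ** 2)


def els_distance(e, e2, h):
    """d_h(e, e') = f^2 + f'^2 + l^2 for ELS's given as (n, d, k)."""
    pos_e = [e[0] + j * e[1] for j in range(e[2])]
    pos_e2 = [e2[0] + j * e2[1] for j in range(e2[2])]
    l_sq = min(letter_distance_sq(a, b, h) for a in pos_e for b in pos_e2)
    return (letter_distance_sq(0, abs(e[1]), h)
            + letter_distance_sq(0, abs(e2[1]), h)
            + l_sq)


def row_lengths(skip, count):
    """Integer nearest |skip|/i for i = 1..count, halves rounded up."""
    a = abs(skip)
    return [(2 * a + i) // (2 * i) for i in range(1, count + 1)]


def compactness(e, e2, count=10):
    """s(e, e'): sum of m_h = 1/d_h over the row lengths h_i and h_i'."""
    total = Fraction(0)
    # Both lists are summed in full, even when they share a row length,
    # mirroring the two sums in the definition.
    for h in row_lengths(e[1], count) + row_lengths(e2[1], count):
        if h < 1:
            continue  # assumption: degenerate row lengths are ignored
        total += Fraction(1, els_distance(e, e2, h))
    return total
```

With count=1 and the toy sequences (0, 5, 3) and (2, 5, 3), both row lengths equal 5, the distance is 6 in each case, and the sketch returns 1/6 + 1/6 = 1/3.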

Next, given a word w, we look for the most "noteworthy" occurrence or occurrences of w as an ELS in G. For this, we choose those ELS's e = (n,d,k) with |d| >= 2 that spell out w for which |d| is minimal over all of G, or at least over large portions of it. Specifically, define the domain of minimality of e as the maximal segment T_e of G that includes e and does not include any other ELS ê = (n̂, d̂, k̂) for w with |d̂| < |d|.

If e' is an ELS for another word w', then T_e ∩ T_e' is called the domain of simultaneous minimality of e and e'; the length of this domain, relative to the whole of G, is the "weight" we assign to the pair (e,e'). Thus we define w(e,e') := l(e, e')/l(G), where l(e, e') is the length of T_e ∩ T_e', and l(G) is the length of G. For any two words w and w', we set 

W(w, w') := \sum w(e, e') s(e, e'),

where the sum is over all ELS's e and e' spelling out w and w', respectively. Very roughly, W (w, w') measures the maximum closeness of the more noteworthy appearances of w and w' as ELS's in Genesis--the closer they are, the larger is W (w, w').
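The weighting and the final sum can be illustrated as follows. This Python sketch assumes that the domains of minimality have already been determined and are supplied as inclusive (first, last) position pairs, and that the compactness measure is passed in as a function; how the domains are found is not shown, and all names are hypothetical.

```python
from fractions import Fraction


def overlap_length(seg_a, seg_b):
    """Number of positions shared by two inclusive segments (first, last)."""
    first = max(seg_a[0], seg_b[0])
    last = min(seg_a[1], seg_b[1])
    return max(0, last - first + 1)


def weight(seg_a, seg_b, text_length):
    """Length of the domain of simultaneous minimality relative to the text."""
    return Fraction(overlap_length(seg_a, seg_b), text_length)


def weighted_sum(els_w, els_w2, text_length, compactness):
    """Sum of weight * compactness over all pairs of ELS's.

    els_w and els_w2 are lists of (els, domain) pairs; compactness is a
    function of two ELS's (assumed to be supplied by the caller).
    """
    total = Fraction(0)
    for e, dom_e in els_w:
        for e2, dom_e2 in els_w2:
            wt = weight(dom_e, dom_e2, text_length)
            if wt:  # pairs with disjoint domains contribute nothing
                total += wt * compactness(e, e2)
    return total
```

In a toy text of 100 letters, domains (0, 49) and (25, 99) share 25 positions, so the pair receives weight 1/4.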

When actually computing W (w, w'), the sizes of the lists of ELS's for w and w' may be impractically large (especially for short words). It is clear from the definition of the domain of minimality that ELS's for w and w' with relatively large skips will contribute very little to the value of W (w, w') due to their small weight. Hence, to reduce the amount of computation, we restrict the skip for w beforehand to the range |d| <= D(w), where D(w) is chosen so that the expected number of ELS's for w is 10. This expected number equals the product of the relative frequencies (within Genesis) of the letters constituting w, multiplied by the total number of all equidistant letter sequences with 2 <= |d| <= D. [The latter is given by the formula (D-1)(2L-(k-1)(D+2)), where L is the length of the text and k is the number of letters in w.] The same restriction applies also to w' with a corresponding bound D(w'). Abusing our notation somewhat, we continue to denote this modified function by W (w, w'). 
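The counting formula and the choice of the bound can be sketched in Python. The formula follows from summing 2(L - (k-1)d) start positions over d = 2, ..., D (both signs of the skip). The text does not say how D is rounded; the sketch below assumes, purely for illustration, that D is the smallest bound at which the expected number reaches the target. The function names and the frequency table passed in are hypothetical.

```python
from fractions import Fraction


def els_count(max_skip, text_length, word_length):
    """Number of equidistant letter sequences with 2 <= |d| <= max_skip.

    Closed form from the text: (D - 1)(2L - (k - 1)(D + 2)).
    """
    D, L, k = max_skip, text_length, word_length
    return (D - 1) * (2 * L - (k - 1) * (D + 2))


def expected_els(word, letter_freq, max_skip, text_length):
    """Expected number of ELS's spelling word with 2 <= |d| <= max_skip."""
    prob = Fraction(1)
    for letter in word:
        prob *= letter_freq[letter]  # relative frequency within the text
    return prob * els_count(max_skip, text_length, len(word))


def skip_bound(word, letter_freq, text_length, target=10):
    """Smallest D whose expected ELS count reaches target (an assumption)."""
    largest = (text_length - 1) // (len(word) - 1)  # largest skip that still fits
    for D in range(2, largest + 1):
        if expected_els(word, letter_freq, D, text_length) >= target:
            return D
    return largest  # the target is never reached; use every possible skip
```

As a check of the closed form, a text of 10 letters contains 6 three-letter sequences with skip 2 and 6 with skip -2, and els_count(2, 10, 3) gives 12.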
