## 2. Outline of the Procedure

In this section we describe the test in outline. In the Appendix, sufficient details are provided to enable the reader to repeat the computations precisely, and so to verify their correctness. The authors will provide, upon request, at cost, diskettes containing the program used and the texts G, I, R, T, U, V and W (see Section 3).

We test the significance of the phenomenon on samples of pairs of related words (such as hammer-anvil and Zedekia-Matanya). To do this we must do the following:

1. define the notion of "distance" between any two words, so as to lend meaning to the idea of words in "close proximity";
2. define statistics that express how close, "on the whole," the words making up the sample pairs are to each other (some kind of average over the whole sample);
3. choose a sample of pairs of related words on which to run the test;
4. determine whether the statistics defined in item 2 are "unusually small" for the chosen sample.

Task 1 has several components. First, we must define the notion of "distance" between two given ELSs in a given array; for this we use a convenient variant of the ordinary Euclidean distance. Second, there are many ways of writing a text as a two-dimensional array, depending on the row length; we must select one or more of these arrays and somehow amalgamate the results (of course, the selection and/or amalgamation must be carried out according to clearly stated, systematic rules). Third, a given word may occur many times as an ELS in a text; here again, a selection and amalgamation process is called for. Fourth, we must correct for factors such as word length and composition. All this is done in detail in Sections A.1 and A.2 of the Appendix.
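The first three components can be illustrated with a minimal sketch. It assumes the usual reading of an ELS as letters found at positions n, n + d, ..., n + (k - 1)d of the text, searches positive skips only, and uses the plain Euclidean distance between two array cells. The function names `find_els`, `array_position` and `euclidean` are hypothetical, and the sketch is not the variant distance or the amalgamation rule defined in Sections A.1 and A.2 of the Appendix.

```python
import math


def find_els(text, word, max_skip):
    """Return (start, skip) pairs at which `word` occurs as an ELS in `text`.

    Assumption: only positive skips 1..max_skip are searched, and `word`
    has at least two letters.
    """
    hits = []
    k = len(word)
    for skip in range(1, max_skip + 1):
        span = (k - 1) * skip  # index distance from first to last letter
        for start in range(len(text) - span):
            if all(text[start + i * skip] == word[i] for i in range(k)):
                hits.append((start, skip))
    return hits


def array_position(index, row_length):
    """Map a text index to (row, column) when the text is written in rows
    of `row_length` letters."""
    return divmod(index, row_length)


def euclidean(pos_a, pos_b):
    """Ordinary Euclidean distance between two (row, column) cells."""
    return math.hypot(pos_a[0] - pos_b[0], pos_a[1] - pos_b[1])


# Toy text: "abc" appears at indices 1, 3, 5, that is, start 1 and skip 2.
toy_text = "xaxbxcxx"
occurrences = find_els(toy_text, "abc", max_skip=3)

# With row length 4, indices 1 and 5 fall in the same column of adjacent rows.
first_letter = array_position(1, 4)
last_letter = array_position(5, 4)
separation = euclidean(first_letter, last_letter)
```

Changing the row length changes which letters end up adjacent in the array, which is why a rule for selecting arrays and amalgamating the results is needed.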

We stress that our definition of distance is not unique. Although there are certain general principles (like minimizing the skip d), some of the details can be carried out in other ways. We feel that varying these details is unlikely to affect the results substantially. Be that as it may, we chose one particular definition and have used only it throughout; that is, the function c(w, w') described in Section A.2 of the Appendix was defined before any sample was chosen, and it underwent no changes. [Similar remarks apply to choices made in carrying out task 2.]

Next, we have task 2, measuring the overall proximity of pairs of words in the sample as a whole. For this, we used two different statistics P1 and P2, which are defined and motivated in the Appendix (Section A.5). Intuitively, each measures overall proximity in a different way. In each case, a small value of Pi indicates that the words in the sample pairs are, on the whole, close to each other. No other statistics were ever calculated for the first, second or indeed any sample.

In task 3, identifying an appropriate sample of word pairs, we strove for uniformity and objectivity with regard to the choice of pairs and to the relation between their elements. Accordingly, our sample was built from a list of personalities (p) and the dates (Hebrew day and month) (p') of their death or birth. The personalities were taken from the Encyclopedia of Great Men in Israel. (MARGALIOTH, M., ed. (1961). Encyclopedia of Great Men in Israel; a Bibliographical Dictionary of Jewish Sages and Scholars from the 9th to the End of the 18th Century 1-4. Joshua Chachik, Tel Aviv).

At first, the criterion for inclusion of a personality in the sample was simply that his entry contain at least three columns of text and that a date of birth or death be specified. This yielded 34 personalities (the first list--Table 1). In order to avoid any conceivable appearance of having fitted the tests to the data, it was later decided to use a fresh sample, without changing anything else. This was done by considering all personalities whose entries contain between 1.5 and 3 columns of text in the Encyclopedia; it yielded 32 personalities (the second list--Table 2). The significance test was carried out on the second sample only.

Note that personality-date pairs (p, p') are not word pairs. The personalities each have several appellations, there are variations in spelling and there are different ways of designating dates. Thus each personality-date pair (p, p') corresponds to several word pairs (w, w'). The precise method used to generate a sample of word pairs from a list of personalities is explained in the Appendix (Section A.3).

The measures of proximity of word pairs (w, w') result in statistics P1 and P2. As explained in the Appendix (Section A.5), we also used a variant of this method, which generates a smaller sample of word pairs from the same list of personalities. We denote the statistics P1 and P2, when applied to this smaller sample, by P3 and P4.

Finally, we come to task 4, the significance test itself. It is so simple and straightforward that we describe it in full immediately.

The second list contains 32 personalities. For each of the 32! permutations p of these personalities, we define the statistic P1p obtained by permuting the personalities in accordance with p, so that Personality i is matched with the dates of Personality p(i). The 32! numbers P1p are ordered, with possible ties, according to the usual order of the real numbers. If the phenomenon under study were due to chance, it would be just as likely that P1 occupies any one of the 32! places in this order as any other. Similarly for P2, P3 and P4. This is our null hypothesis.

To calculate significance levels, we chose 999,999 random permutations p of the 32 personalities; the precise way in which this was done is explained in the Appendix (Section A.6). Each of these permutations p determines a statistic P1p; together with P1, we have thus 1,000,000 numbers. Define the rank order of P1 among these 1,000,000 numbers as the number of P1p not exceeding P1; if P1 is tied with other P1p, half of these others are considered to "exceed" P1. Let r1 be the rank order of P1, divided by 1,000,000; under the null hypothesis, r1 is the probability that P1 would rank as low as it does. Define r2, r3 and r4 similarly (using the same 999,999 permutations in each case).
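The rank computation can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' program: `rank_fraction` and `permutation_test` are hypothetical names, the observed value is assumed to count itself once in its own rank, and Python's `random.shuffle` stands in for the permutation procedure of Section A.6, which is not reproduced here. The `statistic` argument is a placeholder for any function playing the role of P1.

```python
import random


def rank_fraction(observed, permuted_values):
    """Rank of `observed` among itself plus the permuted values, divided by
    the total count. Permuted values tied with `observed` count one half."""
    below = sum(1 for v in permuted_values if v < observed)
    ties = sum(1 for v in permuted_values if v == observed)
    # Assumption: the observed value itself contributes 1 to its own rank.
    rank = 1 + below + ties / 2
    return rank / (len(permuted_values) + 1)


def permutation_test(statistic, personalities, dates, n_perm, seed=0):
    """Re-pair personalities with randomly permuted dates `n_perm` times and
    return the rank fraction of the statistic for the true pairing."""
    rng = random.Random(seed)
    observed = statistic(personalities, dates)
    permuted = []
    for _ in range(n_perm):
        shuffled = list(dates)
        rng.shuffle(shuffled)  # stand-in for the procedure of Section A.6
        permuted.append(statistic(personalities, shuffled))
    return rank_fraction(observed, permuted)


def toy_statistic(xs, ys):
    """Toy stand-in for P1: small when paired values are close."""
    return sum(abs(x - y) for x, y in zip(xs, ys))


r_toy = permutation_test(toy_statistic, [1, 2, 3, 4], [1, 2, 3, 4], n_perm=99)
```

With 999,999 permutations the denominator becomes 1,000,000, matching the text; the smallest attainable value is then 1/1,000,000.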

After calculating the probabilities r1 through r4, we must make an overall decision to accept or reject the research hypothesis. In doing this, we should avoid selecting favorable evidence only. For example, suppose that r3 = 0.01, the other ri being higher. There is then the temptation to consider r3 only, and so to reject the null hypothesis at the level of 0.01. But this would be a mistake; with enough sufficiently diverse statistics, it is quite likely that just by chance, some one of them will be low. The correct question is, "Under the null hypothesis, what is the probability that at least one of the four ri would be less than or equal to 0.01?" Thus denoting the event "ri <= 0.01" by Ei, we must find the probability not of E3, but of "E1 or E2 or E3 or E4." If the Ei were mutually exclusive, this probability would be 0.04; overlaps only decrease the total probability, so that it is in any case less than or equal to 0.04. Thus we can reject the null hypothesis at the level of 0.04, but not 0.01.

More generally, for any given d, the probability that at least one of the four numbers ri is less than or equal to d is at most 4d. This is known as the Bonferroni inequality. Thus the overall significance level (or p-value), using all four statistics, is r0 := 4 min ri.
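As a small illustration of this correction, the sketch below computes r0 from a list of rank fractions. The function name `overall_significance` is hypothetical, and the example values are invented to mirror the r3 = 0.01 scenario discussed above; they are not results from the test.

```python
def overall_significance(rank_fractions):
    """Bonferroni-corrected level: the number of statistics times the
    smallest rank fraction. With four statistics this is 4 * min(ri)."""
    return len(rank_fractions) * min(rank_fractions)


# Hypothetical values: r3 = 0.01 is the smallest, the others are higher.
example_r = [0.20, 0.35, 0.01, 0.12]
r0 = overall_significance(example_r)  # 4 * 0.01
```

Note that the bound can exceed 1 when all the ri are large, in which case it carries no information.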