|
compiled by James H. Dee Prefatory Note If you already know what relational databases are and how to use them, I hope you'll find this database worth exploring, in spite of its primitive appearance. If all this is new to you, you will need to consult the manual for whatever particular database program you have or seek help from knowledgeable persons at your home institution; every school and college has at least a few people (faculty or students) who know a great deal about this subject, and they can help you learn how to extract information from this particular database. Please keep in mind that this is a far from perfect instrument, as the cautionary remarks below will make abundantly clear, and that all data contained in or derived from this database apply only to those specific texts which were culled to create it and cannot be used for casual extrapolation to the rest of the corpus of classical (much less medieval) Latin. Click here for the database in Excel Click here for the database in plain text | Instructions Introduction Any set of statistics about word frequencies in Latin will inevitably be, in the absence of a survey covering all classical texts (however one might define "classical" or "text"), a function of the specific works and passages chosen for the database. This collection is drawn from two word counts made much earlier in this century, viz.:
Since Avery's anthology has only one passage which also occurs in Lodge's texts (Caes. BG 5.38-48; its words have been subtracted from Diederich's totals) and since Garrod's anthology omits the Aeneid, the two sets are, as presented here, mutually exclusive and thus fully complementary. Lodge's listing includes all words in his corpus except for proper names, regardless of frequency, whereas Diederich includes only items which occurred at least five times. This gives the two sets a certain imbalance, which the user may wish to remove by creating a subset of Lodge which includes only those words occurring at least three times. Diederich's classical sources have 124,686 words, compared with Lodge's 77,142, so a ratio of five to three is approximately the right proportion.²

I created this database collation of those two sources with the intent of converting to electronic form as much valid information as could readily be obtained, without attempting to add very much. The following table shows the "structure" of the database, i.e. the names of the fields, with expanded definitions given below. The Master.dbf file was originally created in Microsoft FoxPro for the Mac 2.6; it has been exported in tab-delimited format and should be usable by other database programs without much difficulty.

Word      Individual entry word
Base      Etymological base word
Base2     Second base for a compound word
Type      Grammatical category
Pro       Diederich: classical prose count
Ver       Diederich: classical verse count
Pd_cl     Diederich: classical prose & verse total
Med       Diederich: medieval count
Pd_sum    Diederich: classical & medieval total
Caes      Lodge: Caesar count
Cic       Lodge: Cicero count
Verg      Lodge: Vergil count
Gl_sum    Lodge: total
Cl_tot    Pd_cl + Gl_sum (classical total)
Pd_gl     Pd_sum + Gl_sum (classical + medieval total)
Gl_Base   Number in Lodge's suggested core of 2000 words

Word: In most instances, this item represents the specific word as given in one or both sources.
Lodge subsumes counts for such categories as noun/adjective uses of participles and irregular adjectives under a single entry (e.g. pactum at paciscor, melior and optimus at bonus); that practice made it difficult, and in a few cases impossible, to match the greater differentiation in Diederich's list. The list given below shows which entries in Diederich were merged and which were separated by means of searches using the PHI Latin CD-ROM 5.3 disk. (Lodge does not state which editions were used in his culling, but the discrepancies in such well-established texts will probably be rather few.) Some rather ungainly abbreviations are used in this field to indicate those mergings.

Diederich includes personal and place names (totalling 7,771 occurrences); since Lodge excludes those categories, they were omitted altogether in this gathering.

There are several points of substantial uncertainty in the assignment of Lodge's and Diederich's data. (1) Diederich, surprisingly, does not distinguish between the two main functions of cum (sc. preposition and conjunction), so Lodge's separate figures had to be combined. (2) Diederich has entries for mecum, tecum, secum, and vobiscum, whereas Lodge has none, and none for nos and vos (the figures for Caesar show that they are not contained within the numbers for ego and tu); that deficiency was remedied by the PHI disk, but it is possible that the counts for the singular forms are included in Lodge's numbers for the pronouns. (3) Diederich has no separate counts for the pronouns mei, tui, nostri, and vestri (though there is one for sui); it is quite possible that occurrences of the pronoun have been counted with the adjectives. (4) Diederich does not specify which word is involved in the entry for sero (sertum has a separate listing); because Verg. Ecl.
8.99 (satas) is in the selections in the Oxford Book of Latin Verse, his counts were put at sero, satum, but I did not check to see if the other 13 cases he cites are all correctly placed.

Base & Base2: Entries in these fields are intended to show the ultimate base words underlying the entries made at Word. Most of these have been arranged according to the judgments of Ernout-Meillet or the Oxford Latin Dictionary, and they claim no originality or independent authority. Some have asterisks at the end to indicate that the given form is not attested. In the case of compound words, the field Base2 is used for the second base; this means that the gathering of etymological families involves dovetailing data from two fields, but restricting the entries to one record per word makes tasks such as counting the total number of words in the whole database, or in any subset thereof, much easier than they would have been if two separate records had been created for a single compound word.

Type: This shows in very crude fashion the general category to which the particular word belongs. Most of the abbreviations will be self-evident: n, v, adj, adv, pr, cj, and int. Further subdivision is provided by numerals and letters (e.g. n1f, n2m, a12, a3, v2, v4d); it should be noted that 'v3i' means a third-conjugation -io verb, whereas 'vi' means irregular verb (though some words typically called irregular, e.g. fero, are left in the most appropriate conjugation). Sometimes these are not quite adequate representations: when an adjective also has a noun function, I did not try to indicate the double role. I acknowledge that this is a somewhat imperfect instrument: there are no subdivisions for different types of, say, third declension nouns, but it will make it quite easy to see the relative sizes and proportions of the main categories. Ascriptions of gender may not always be on target, especially in some cases of medieval vocabulary, but I have made a separate check using several standard sources.
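As a rough illustration of the dovetailing of Base and Base2 described above, the following Python sketch gathers etymological families from records that carry one or two bases. It assumes the records have already been loaded as dictionaries keyed by the field names in the structure table; the sample rows, including their base assignments and counts, are invented for illustration and are not taken from the database, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows: the field names follow the structure table,
# but the base assignments and counts are invented for illustration.
SAMPLE = [
    {"Word": "fero", "Base": "fero", "Base2": "", "Cl_tot": 120},
    {"Word": "signum", "Base": "signum", "Base2": "", "Cl_tot": 45},
    {"Word": "signifer", "Base": "signum", "Base2": "fero", "Cl_tot": 6},
    {"Word": "confero", "Base": "fero", "Base2": "", "Cl_tot": 30},
]


def gather_families(records):
    """Map each base word to the entry words listed under it in Base or Base2."""
    families = defaultdict(list)
    for rec in records:
        # A compound is filed under both of its bases but remains one record.
        for field in ("Base", "Base2"):
            base = rec[field].strip()
            if base:
                families[base].append(rec["Word"])
    return dict(families)


def family_total(records, base, count_field="Cl_tot"):
    """Sum one count field over every record whose Base or Base2 equals `base`."""
    return sum(
        rec[count_field]
        for rec in records
        if base in (rec["Base"].strip(), rec["Base2"].strip())
    )


families = gather_families(SAMPLE)
print(families["fero"])
print(family_total(SAMPLE, "fero"))
```

Because each compound stays a single record, `len(SAMPLE)` still gives the number of entry words, while a compound's count contributes to the family total of each of its bases.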
Counts: The next eleven, presumably self-explanatory, fields display simple numeric data representing individual or combined totals for the categories named in the descriptions above.

Lodge Core: There are 2,000 consecutively-numbered headwords within Lodge's collection of 4,650, intended to provide a trio of cores for the second, third, and fourth years. In his work, these three levels are represented typographically by different fonts, but there seemed no compelling reason to include that element here. Lodge's introduction (iv-v) discusses the principles which governed his selection of the 2,000 words; they are not invariably the most frequent, and the artificial inclusion of gracilis and seven numerals, none of which occurred at all in Lodge's actual corpus, shows why Diederich chose to work from a much wider range of texts.
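Users working from the tab-delimited export may wish to confirm that the combined count fields agree with their components. The Python sketch below is one way to do so, under several assumptions: that the export has a header row using the field names from the structure table, that Pd_cl is Pro + Ver, that Pd_sum is Pd_cl + Med, and that Gl_sum is Caes + Cic + Verg (only the formulas for Cl_tot and Pd_gl are stated explicitly in the table). The two sample rows are invented, with placeholder words, and the second contains a deliberate error.

```python
import csv
import io

# Hypothetical two-row excerpt in tab-delimited form; all counts are invented.
# The Pd_gl value in the second row is deliberately wrong (it should be 10).
SAMPLE_TSV = (
    "Word\tPro\tVer\tPd_cl\tMed\tPd_sum\tCaes\tCic\tVerg\tGl_sum\tCl_tot\tPd_gl\n"
    "word_a\t10\t4\t14\t3\t17\t2\t5\t1\t8\t22\t25\n"
    "word_b\t6\t1\t7\t0\t7\t0\t3\t0\t3\t10\t11\n"
)

# Each combined field and the fields assumed to add up to it.
RULES = {
    "Pd_cl": ("Pro", "Ver"),
    "Pd_sum": ("Pd_cl", "Med"),
    "Gl_sum": ("Caes", "Cic", "Verg"),
    "Cl_tot": ("Pd_cl", "Gl_sum"),
    "Pd_gl": ("Pd_sum", "Gl_sum"),
}


def to_int(value):
    """Treat an empty cell as zero; otherwise parse the count as an integer."""
    return int(value) if value.strip() else 0


def check_totals(tsv_text):
    """Return (word, field, stored, expected) for every total that disagrees."""
    problems = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        for total, parts in RULES.items():
            expected = sum(to_int(row[part]) for part in parts)
            stored = to_int(row[total])
            if stored != expected:
                problems.append((row["Word"], total, stored, expected))
    return problems


print(check_totals(SAMPLE_TSV))
```

To run the same check on the actual export, the file's contents can be read into a string and passed to `check_totals`. Any discrepancy reported may reflect a merged entry rather than an arithmetic slip.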
|