A Dual-Source Database of Word Frequencies in Latin
compiled by James H. Dee

Prefatory Note

          If you already know what relational databases are and how to use them, I hope you'll find this database worth exploring, in spite of its primitive appearance. If all this is new to you, you will need to consult the manual for whatever database program you have or seek help from knowledgeable persons at your home institution; every school and college has at least a few people (faculty or students) who know a great deal about this subject, and they can help you learn how to extract information from this particular database. Please keep in mind that this is a far from perfect instrument, as the cautionary remarks below will make abundantly clear, and that all data contained in or derived from this database apply only to those specific texts which were culled to create it and cannot be used for casual extrapolation to the rest of the corpus of classical (much less medieval) Latin.

Click here for the database in Excel

Click here for the database in plain text | Instructions

Introduction

        Any set of statistics about word frequencies in Latin will inevitably be, in the absence of a survey covering all classical texts (however one might define "classical" or "text"), a function of the specific works and passages chosen for the database. This collection is drawn from two word counts made in the first half of the twentieth century, viz.:

  • Gonzalez Lodge, The Vocabulary of High School Latin (New York: Columbia University Press 1912). Lodge's counts are derived from a corpus of 77,142 words, drawn from Caesar, De Bello Gallico 1-5, Cicero, In Catilinam 1-4, De Imperio Pompei, Pro Archia, and Vergil, Aeneid 1-6.

  • Paul B. Diederich, The Frequency of Latin Words and their Endings (Chicago: University of Chicago Press 1939). Diederich's counts are derived from a corpus of 202,158 words drawn from Maurice W. Avery, Latin Prose Literature: Cato to Suetonius (Boston: Little Brown 1931), containing 49,363 words, The Oxford Book of Latin Verse, ed. H. W. Garrod (Oxford: Oxford University Press 1912), containing 75,323 words (with the omission, it should be noted, of eight pages of pre-Ennian verse), and Charles H. Beeson, A Primer of Medieval Latin (Chicago: Scott Foresman 1925), containing 77,472 words.¹

         Since Avery's anthology has only one passage which also occurs in Lodge's texts (Caes. BG 5.38-48; its words have been subtracted from Diederich's totals) and since Garrod's anthology omits the Aeneid, the two sets are, as presented here, mutually exclusive and thus fully complementary. Lodge's listing includes all words in his corpus except for proper names, regardless of frequency, whereas Diederich includes only items which occurred at least five times. This gives the two sets a certain imbalance, which the user may wish to remove by creating a subset of Lodge which includes only those words occurring at least three times. Diederich's classical sources have 124,686 words, compared with Lodge's 77,142, a ratio of about 1.6 to 1, so five to three is approximately the correct proportion.²
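
The arithmetic behind that proportion, and the suggested trimming of Lodge, can be sketched in a few lines of Python. This is only an illustration and not part of the database itself: the function name `balanced_lodge_subset` and the sample rows are hypothetical, and only the corpus sizes, the thresholds, and the field name Gl_sum (Lodge's total) come from this page.

```python
# Corpus sizes as stated in the source descriptions above.
AVERY_PROSE = 49_363
GARROD_VERSE = 75_323
LODGE_CORPUS = 77_142
DIEDERICH_MIN = 5  # Diederich lists only words occurring at least five times

diederich_classical = AVERY_PROSE + GARROD_VERSE
ratio = diederich_classical / LODGE_CORPUS
lodge_min = DIEDERICH_MIN / ratio  # equivalent cutoff scaled to Lodge's smaller corpus


def balanced_lodge_subset(rows, threshold=3):
    """Keep rows whose Lodge total (Gl_sum) is at least `threshold`."""
    return [row for row in rows if row["Gl_sum"] >= threshold]


# Hypothetical rows for illustration only; the counts are invented.
sample_rows = [
    {"Word": "word_a", "Gl_sum": 12},
    {"Word": "word_b", "Gl_sum": 2},
    {"Word": "word_c", "Gl_sum": 3},
]
kept = [row["Word"] for row in balanced_lodge_subset(sample_rows)]

print(diederich_classical)   # 124686
print(round(ratio, 2))       # 1.62
print(round(lodge_min, 2))   # 3.09
print(kept)                  # ['word_a', 'word_c']
```

The scaled cutoff comes out slightly above three, which is why a threshold of three occurrences is a reasonable whole-number choice for the Lodge subset.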

         I created this database collation of those two sources with the intent of converting to electronic form as much valid information as could readily be obtained, without attempting to add very much. The following table shows the "structure" of the database, i.e. the names of the fields, with expanded definitions given below. The Master.dbf file was originally created in Microsoft FoxPro for the Mac 2.6; it has been exported in tab-delimited format and should be usable by other database programs without much difficulty.


	Word	Individual entry word
	Base	Etymological base word
	Base2	Second base for a compound word
	Type	Grammatical category
	Pro		Diederich: classical prose count
	Ver		Diederich: classical verse count
	Pd_cl	Diederich: classical prose & verse total
	Med	Diederich: medieval count
	Pd_sum	Diederich: classical & medieval total
	Caes	Lodge: Caesar count
	Cic		Lodge: Cicero count
	Verg	Lodge: Vergil count
	Gl_sum	Lodge: total
	Cl_tot	Pd_cl + Gl_sum (classical total)
	Pd_gl	Pd_sum + Gl_sum (classical + medieval total)
	Gl_Base	Number in Lodge's suggested core of 2000 words
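
For readers working with the plain-text export, the following Python sketch shows one way the fields might be read and the summary columns cross-checked. It rests on several assumptions that this page does not confirm: that the export begins with a header row carrying the field names, that a blank count cell means zero, and that Pd_cl, Pd_sum, and Gl_sum are simple sums of the columns their descriptions mention (only the formulas for Cl_tot and Pd_gl are stated in the table). The function names and the two sample records are invented for illustration.

```python
import csv
import io

FIELDS = ["Word", "Base", "Base2", "Type", "Pro", "Ver", "Pd_cl", "Med",
          "Pd_sum", "Caes", "Cic", "Verg", "Gl_sum", "Cl_tot", "Pd_gl", "Gl_Base"]
COUNT_FIELDS = FIELDS[4:15]  # the eleven count fields, Pro through Pd_gl


def load_rows(handle):
    """Read tab-delimited records; assumes a header row naming the fields."""
    rows = []
    for rec in csv.DictReader(handle, delimiter="\t"):
        for name in COUNT_FIELDS:
            rec[name] = int(rec[name] or 0)  # assumption: blank cell means zero
        rows.append(rec)
    return rows


def inconsistent_totals(rec):
    """Return the names of summary fields that disagree with their parts."""
    expected = {
        "Pd_cl": rec["Pro"] + rec["Ver"],                  # inferred from description
        "Pd_sum": rec["Pd_cl"] + rec["Med"],               # inferred from description
        "Gl_sum": rec["Caes"] + rec["Cic"] + rec["Verg"],  # inferred from description
        "Cl_tot": rec["Pd_cl"] + rec["Gl_sum"],            # stated in the table
        "Pd_gl": rec["Pd_sum"] + rec["Gl_sum"],            # stated in the table
    }
    return [name for name, value in expected.items() if rec[name] != value]


# Two invented records: the first is consistent, the second has a wrong Cl_tot.
sample_lines = [
    FIELDS,
    ["word_a", "base_a", "", "n1f", 10, 5, 15, 7, 22, 3, 4, 1, 8, 23, 30, ""],
    ["word_b", "base_b", "", "v2", 2, 0, 2, 0, 2, 1, 0, 0, 1, 9, 3, ""],
]
sample_text = "\n".join("\t".join(str(cell) for cell in line) for line in sample_lines)

rows = load_rows(io.StringIO(sample_text))
for rec in rows:
    print(rec["Word"], inconsistent_totals(rec))
```

With a real export, `load_rows` would be given an open file handle instead of the in-memory sample.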

Word: In most instances, this item represents the specific word as given in one or both sources. Lodge subsumes counts for such categories as noun/adjective uses of participles and irregular adjectives under a single entry (e.g. pactum at paciscor, melior and optimus at bonus); that practice made it difficult, and in a few cases impossible, to match the greater differentiation in Diederich's list. The list given below shows which entries in Diederich were merged and which were separated by means of searches using the PHI Latin CD-ROM 5.3 disk. (Lodge does not state which editions were used in his culling, but the discrepancies in such well-established texts will probably be rather few.) Some rather ungainly abbreviations are used in this field to indicate those mergings. Diederich includes personal and place names (totalling 7,771 occurrences); since Lodge excludes those categories, they were omitted altogether in this gathering. There are several points of substantial uncertainty in the assignment of Lodge's and Diederich's data. (1) Diederich, surprisingly, does not distinguish between the two main functions of cum (sc. preposition and conjunction), so Lodge's separate figures had to be combined. (2) Diederich has entries for mecum, tecum, secum, and vobiscum, whereas Lodge has none, and none for nos and vos either (the figures for Caesar show that they are not contained within the numbers for ego and tu); that deficiency was remedied by the PHI disk, but it is possible that the counts for the singular forms are included in Lodge's numbers for the pronouns. (3) Diederich has no separate counts for the pronouns mei, tui, nostri, and vestri (though there is one for sui); it is quite possible that occurrences of the pronoun have been counted with the adjectives. (4) Diederich does not specify which word is involved in the entry for sero (sertum has a separate listing); because Verg. Ecl. 8.99 (satas) is in the selections in the Oxford Book of Latin Verse, his counts were put at sero, satum, but I did not check to see if the other 13 cases he cites are all correctly placed.

Base & Base2: Entries in these fields are intended to show the ultimate base words underlying the entries made at Word. Most of these have been arranged according to the judgments of Ernout-Meillet or the Oxford Latin Dictionary, and they claim no originality or independent authority. Some have asterisks at the end to indicate that the given form is not attested. In the case of compound words, the field Base2 is used for the second base; this means that the gathering of etymological families involves dovetailing data from two fields, but restricting the entries to one record per word makes e.g. the counting of the total number of words in the whole database or any subset thereof much easier than it would have been if two separate records had been created for a single compound word.
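
The "dovetailing" of the two base fields can be illustrated with a short Python sketch. It is a hedged example rather than a description of how the database was built: the function name `etymological_families` is hypothetical, and the three sample records are illustrative and not copied from the database.

```python
from collections import defaultdict


def etymological_families(rows):
    """Map each base word to the entry words listed under it in Base or Base2."""
    families = defaultdict(list)
    for rec in rows:
        for field in ("Base", "Base2"):
            base = (rec.get(field) or "").strip()
            if base and rec["Word"] not in families[base]:
                families[base].append(rec["Word"])
    return dict(families)


# Illustrative records only; not copied from the database.
sample_rows = [
    {"Word": "ager", "Base": "ager", "Base2": ""},
    {"Word": "colo", "Base": "colo", "Base2": ""},
    {"Word": "agricola", "Base": "ager", "Base2": "colo"},
]

families = etymological_families(sample_rows)
print(families["ager"])   # ['ager', 'agricola']
print(families["colo"])   # ['colo', 'agricola']
print(len(sample_rows))   # 3 records, although agricola belongs to two families
```

Because a compound stays in a single record, a plain count of records still gives the number of words, while the family listing picks the compound up under both of its bases.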

Type: This shows in very crude fashion the general category to which the particular word belongs. Most of the abbreviations will be self-evident: n, v, adj, adv, pr, cj, and int. Further subdivision is provided by numerals and letters (e.g. n1f, n2m, a12, a3, v2, v4d); it should be noted that 'v3i' means a third-conjugation -io verb, whereas 'vi' means irregular verb (though some words typically called irregular, e.g. fero, are left in the most appropriate conjugation). Sometimes these are not quite adequate representations: when an adjective also has a noun function, I did not try to indicate the double role. I acknowledge that this is a somewhat imperfect instrument (there are no subdivisions for different types of, say, third declension nouns), but it will make it quite easy to see the relative sizes and proportions of the main categories. Ascriptions of gender may not always be on target, especially in some cases of medieval vocabulary, but I have made a separate check using several standard sources.
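
To see the relative sizes of the main categories, the Type codes can be split into a category prefix, numerals, and trailing letters. The Python sketch below is an assumption-laden illustration: it is based only on the sample codes quoted above, it treats the short prefix 'a' (as in a12, a3) as a subdivision of adj, and the names `split_type` and `category_sizes` are hypothetical. Codes in the actual file that follow other patterns would raise an error here.

```python
import re
from collections import Counter

# Main-category prefixes named in the text; "a" is the short form seen in a12, a3.
TYPE_PATTERN = re.compile(r"^(adj|adv|int|pr|cj|n|v|a)(\d*)([a-z]*)$")


def split_type(code):
    """Split a Type code into (category, numerals, trailing letters)."""
    match = TYPE_PATTERN.match(code.strip())
    if match is None:
        raise ValueError(f"unrecognized Type code: {code!r}")
    category, numerals, letters = match.groups()
    if category == "a":
        category = "adj"  # assumption: a12 and a3 are subdivisions of adj
    return category, numerals, letters


def category_sizes(codes):
    """Count how many codes fall under each main category."""
    return Counter(split_type(code)[0] for code in codes)


codes = ["n1f", "n2m", "a12", "a3", "v2", "v4d", "v3i", "vi", "adv"]
print(split_type("v3i"))           # ('v', '3', 'i')
print(split_type("vi"))            # ('v', '', 'i')
print(split_type("a12"))           # ('adj', '12', '')
print(category_sizes(codes)["v"])  # 4
```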

Counts: The next eleven, presumably self-explanatory, fields display simple numeric data representing individual or combined totals for the categories named in the descriptions above.

Lodge Core: There are 2,000 consecutively-numbered headwords within Lodge's collection of 4,650, intended to provide a trio of cores for the second, third, and fourth years. In his work, these three levels are represented typographically by different fonts, but there seemed no compelling reason to include that element here. Lodge's introduction (iv-v) discusses the principles which governed his selection of the 2,000 words; they are not invariably the most frequent, and the artificial inclusion of gracilis and seven numerals, none of which occurred at all in Lodge's actual corpus, shows why Diederich chose to work from a much wider range of texts.

 
