Introduction

This is simply an overview of some counts I've produced in my attempts to quantify my impressionistic understanding of the difficulty of reading various books of the Greek New Testament. I doubt I am the first one to generate many of these counts. I've really just done this as part of the joy of exploratory data analysis with an open-source database in hand. I've included the SQL queries, which assume the SQLite database in my sblgnt-to-sqlite repository.

I am done with this for now, but I'm sure that the muse of database queries will move me again at some point, and I'll update this page as needed.

Basic Counts

Querying the number of chapters per book:


SELECT book_name,COUNT(DISTINCT book_number||','||chapter) AS Chapters FROM sblgnt GROUP BY book_name ORDER BY book_number ASC;

Not a lot of surprises here.

Querying the number of verses per book:


SELECT book_name,COUNT(DISTINCT book_number||','||chapter||','||verse) AS Verses FROM sblgnt GROUP BY book_name ORDER BY book_number ASC;
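The `','` separators in the concatenated key are doing real work: without them, chapter 1 verse 11 and chapter 11 verse 1 collapse to the same string and get counted once. Here is a minimal sketch of that effect, using Python's built-in sqlite3 module and a toy in-memory table. The rows are invented for illustration; only the table and column names come from the queries above.

```python
import sqlite3

# Toy in-memory stand-in for the sblgnt table; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sblgnt (_id INTEGER PRIMARY KEY, book_number INTEGER, "
    "book_name TEXT, chapter INTEGER, verse INTEGER, lemma TEXT)"
)
conn.executemany(
    "INSERT INTO sblgnt (book_number, book_name, chapter, verse, lemma) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Toy", 1, 11, "a"),  # chapter 1, verse 11
        (1, "Toy", 1, 11, "b"),  # same verse, second word
        (1, "Toy", 11, 1, "c"),  # chapter 11, verse 1
    ],
)

# Key with separators: '1,1,11' and '1,11,1' stay distinct.
with_sep = conn.execute(
    "SELECT COUNT(DISTINCT book_number||','||chapter||','||verse) FROM sblgnt"
).fetchone()[0]

# Key without separators: both verses become '1111' and merge.
without_sep = conn.execute(
    "SELECT COUNT(DISTINCT book_number||chapter||verse) FROM sblgnt"
).fetchone()[0]

print(with_sep, without_sep)
```

The separated key finds the two verses that are actually there, while the unseparated key undercounts.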

Querying the number of words per book:


SELECT book_name,COUNT(*) AS Words FROM sblgnt GROUP BY book_name ORDER BY book_number ASC;

This is where we get into the question of length proper, in my opinion. Some books have far more verses per chapter than others, for some reason, so I think the word counts give a more accurate sense of how long each book is. Looking at the graph below, I am struck by the fact that the shortest gospel is still longer than any other book except Acts. I also would not have guessed that Revelation would be the next-longest book after the gospels and Acts (and with a pretty big step down before you get to Romans).

The average number of words per verse:


SELECT book_name,
		COUNT(_id) AS Words,
		COUNT(DISTINCT book_number||','||chapter||','||verse) AS Verses,
		CAST(COUNT(_id) AS REAL)/CAST(COUNT(DISTINCT book_number||','||chapter||','||verse) AS REAL) AS WordsPerVerse
	FROM sblgnt GROUP BY book_name ORDER BY book_number ASC;

This is of course just an artifact of how the text was broken up into verses. I would not have guessed that Revelation and 1 John had unusually long verses.

Counting Lemmas

For calculating the length of a book, it's appropriate to count words. For evaluating the difficulty of a book, at least with regard to complexity of vocabulary, it makes more sense to count lemmas. A “lemma” is what most people probably think of when they say “word”: a single entry in the dictionary. So the one lemma γεννάω has many word forms (γεννηθὲν, γεννηθέντος, γεννᾶται, ἐγεννήθησαν, ...), and we don't want to count those separately.
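To make the distinction concrete, here is a small sketch that counts distinct word forms and distinct lemmas for the four forms of γεννάω mentioned above. It uses Python's built-in sqlite3 module with a toy in-memory table; the column name `word` for the inflected form is my assumption for the example and may not match the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# `word` is a hypothetical column name for the inflected form;
# `lemma` is the column used in the queries in this post.
conn.execute(
    "CREATE TABLE sblgnt (_id INTEGER PRIMARY KEY, word TEXT, lemma TEXT)"
)
conn.executemany(
    "INSERT INTO sblgnt (word, lemma) VALUES (?, ?)",
    [
        ("γεννηθὲν", "γεννάω"),
        ("γεννηθέντος", "γεννάω"),
        ("γεννᾶται", "γεννάω"),
        ("ἐγεννήθησαν", "γεννάω"),
    ],
)

# Four different surface forms, but only one dictionary entry.
forms, lemmas = conn.execute(
    "SELECT COUNT(DISTINCT word), COUNT(DISTINCT lemma) FROM sblgnt"
).fetchone()

print(forms, lemmas)
```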

So let's look at how many distinct lemmas occur in each book:


SELECT book_name,COUNT(DISTINCT lemma) AS Lemmas FROM sblgnt GROUP BY book_name ORDER BY book_number ASC;

You'd expect that the more words, the more lemmas. Where there is a discrepancy, it means that the author uses a greater variety of words, relatively speaking. In the graph below, it makes sense to me that Luke-Acts has the highest number of lemmas. Hebrews and Revelation also make sense, since they deal with their own subject matter. I am surprised to see so much varied vocabulary in Matthew.

Another interesting question is how many hapax legomena occur in each book:


SELECT book_name,COUNT(lemma) as NumberHapax FROM 
	(SELECT lemma,COUNT(lemma) AS cnt,book_name,book_number FROM sblgnt GROUP BY lemma) 
	WHERE cnt=1 GROUP BY book_name ORDER BY book_number ASC;

This is where Luke-Acts really shines as a difficult text: 420 hapax in Acts, 290 in Luke. Hebrews is a distant third, and Romans surprisingly (to me) edges out Revelation.
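The hapax query leans on a SQLite behavior worth spelling out: the inner SELECT groups by lemma but also returns book_name and book_number, which are not aggregated. SQLite allows such "bare" columns, and for a lemma that occurs exactly once the value can only come from that single row, so the result is well defined. Below is a sketch of the same query run on a toy in-memory table through Python's sqlite3 module; the books and lemmas are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sblgnt (_id INTEGER PRIMARY KEY, book_number INTEGER, "
    "book_name TEXT, lemma TEXT)"
)
# Invented data: x occurs in all three books, y twice in A,
# and v, z, w once each in the whole toy corpus.
conn.executemany(
    "INSERT INTO sblgnt (book_number, book_name, lemma) VALUES (?, ?, ?)",
    [
        (1, "A", "x"), (1, "A", "y"), (1, "A", "y"), (1, "A", "v"),
        (2, "B", "z"), (2, "B", "x"), (2, "B", "w"),
        (3, "C", "x"),
    ],
)

# The hapax query from the post, unchanged apart from layout.
hapax_per_book = conn.execute(
    """
    SELECT book_name, COUNT(lemma) AS NumberHapax FROM
        (SELECT lemma, COUNT(lemma) AS cnt, book_name, book_number
         FROM sblgnt GROUP BY lemma)
    WHERE cnt=1 GROUP BY book_name ORDER BY book_number ASC
    """
).fetchall()

print(hapax_per_book)
```

Book C has no hapax in this toy data, so it gets no row at all rather than a row with zero. Other database engines may reject the bare columns outright, so this form of the query should be treated as SQLite-specific.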

This leads naturally to the question of how many words occur in each book that occur nowhere else in the New Testament:


SELECT book_name,COUNT(lemma) as NumberOnlyFoundHere FROM 
	( SELECT lemma,COUNT(book_name) as NumberOfBooks,book_name,book_number FROM
		(SELECT DISTINCT book_name,lemma,book_number FROM sblgnt)
		GROUP BY lemma HAVING NumberOfBooks=1 )
	GROUP BY book_name
	ORDER BY book_number;

Acts of course carries the field: all that nautical and Roman administrative vocabulary. Otherwise it's the usual suspects.

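The number of hapax legomena is clearly related to the number of words that occur in only one book. The relationship is additive: every hapax is by definition confined to one book, so the second count is the number of hapax plus the number of lemmas that recur but only within a single book. The two are probably also related in some way that can be deduced from Zipf's Law.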
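The additive relationship can be checked directly by pulling out both sets of lemmas and comparing them. This is a sketch on an invented toy table using Python's sqlite3 module, not a result from the real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sblgnt (_id INTEGER PRIMARY KEY, book_number INTEGER, "
    "book_name TEXT, lemma TEXT)"
)
# Invented data: y occurs twice, but only in book A.
conn.executemany(
    "INSERT INTO sblgnt (book_number, book_name, lemma) VALUES (?, ?, ?)",
    [
        (1, "A", "x"), (1, "A", "y"), (1, "A", "y"), (1, "A", "v"),
        (2, "B", "z"), (2, "B", "x"), (2, "B", "w"),
    ],
)

# Lemmas that occur exactly once in the whole corpus.
hapax = {
    row[0]
    for row in conn.execute(
        "SELECT lemma FROM sblgnt GROUP BY lemma HAVING COUNT(*) = 1"
    )
}

# Lemmas that occur in exactly one book, however many times.
one_book = {
    row[0]
    for row in conn.execute(
        "SELECT lemma FROM sblgnt GROUP BY lemma "
        "HAVING COUNT(DISTINCT book_name) = 1"
    )
}

# The difference is the set of lemmas repeated within a single book.
repeated_in_one_book = one_book - hapax

print(sorted(hapax), sorted(one_book), sorted(repeated_in_one_book))
```

Every hapax lands in the one-book set, and the remainder is exactly the lemmas that an author reuses but nobody else uses.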

The preceding graphs have been dominated by the longer books. To correct for that, let's look at the number of lemmas that occur in only one book, as a fraction of the number of distinct lemmas in that book.


SELECT tmpB.book_name,COUNT(lemma) as NumberOnlyFoundHere, Lemmas, CAST(COUNT(lemma) AS REAL)/CAST(Lemmas AS REAL) AS Fraction FROM 
	( SELECT lemma,COUNT(book_name) as NumberOfBooks,book_name,book_number FROM
		(SELECT DISTINCT book_name,lemma,book_number FROM sblgnt)
		GROUP BY lemma HAVING NumberOfBooks=1 ) AS tmpA
	LEFT JOIN
		( SELECT book_name,COUNT(DISTINCT lemma) AS Lemmas FROM sblgnt GROUP BY book_name ) AS tmpB
	ON tmpA.book_name=tmpB.book_name
	GROUP BY tmpB.book_name
	ORDER BY book_number ASC;

What I appreciate about this graph is that it brings 2 Peter up from obscurity (and to a lesser extent Jude), and shows them to be difficult books in their way.
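One property of the query above is that it starts from the list of book-exclusive lemmas, so any book with no exclusive lemmas would drop out of the result instead of showing a fraction of zero. The sketch below is one possible variant that starts from the per-book lemma totals and keeps every book. It runs on invented toy data through Python's sqlite3 module, and the aliases `t`, `e`, and `only_here` are mine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sblgnt (_id INTEGER PRIMARY KEY, book_number INTEGER, "
    "book_name TEXT, lemma TEXT)"
)
# Invented data: book C shares its only lemma with A and B.
conn.executemany(
    "INSERT INTO sblgnt (book_number, book_name, lemma) VALUES (?, ?, ?)",
    [
        (1, "A", "x"), (1, "A", "y"), (1, "A", "y"), (1, "A", "v"),
        (2, "B", "z"), (2, "B", "x"), (2, "B", "w"),
        (3, "C", "x"),
    ],
)

rows = conn.execute(
    """
    SELECT t.book_name,
           COALESCE(e.only_here, 0) AS only_here,
           t.lemmas,
           CAST(COALESCE(e.only_here, 0) AS REAL) / t.lemmas AS fraction
    FROM (SELECT book_number, book_name, COUNT(DISTINCT lemma) AS lemmas
          FROM sblgnt GROUP BY book_number, book_name) AS t
    LEFT JOIN
         (SELECT book_name, COUNT(*) AS only_here
          FROM (SELECT lemma, MIN(book_name) AS book_name
                FROM sblgnt GROUP BY lemma
                HAVING COUNT(DISTINCT book_name) = 1)
          GROUP BY book_name) AS e
    ON e.book_name = t.book_name
    ORDER BY t.book_number ASC
    """
).fetchall()

for row in rows:
    print(row)
```

Using MIN(book_name) inside the exclusive-lemma subquery avoids relying on bare columns; since each surviving lemma has exactly one book, the minimum is simply that book.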

Other Word Counts

The data are not syntactically tagged, so we're limited to investigating things we can glean from the morphology. Participles have always been relatively challenging for me (some kind of pun intended), so I looked at the number of participles per word. (The query below reflects the fact that in the database, only participles will be tagged as “verb” and have a non-null value for “gender.”)


SELECT tmpA.book_name,CAST( nParticiples AS REAL)/CAST(nWords AS REAL) as Fraction FROM
	(SELECT book_number,book_name,COUNT(_id) AS nWords FROM sblgnt GROUP BY book_name) AS tmpA
		LEFT JOIN
	(SELECT book_name,COUNT(_id) AS nParticiples FROM sblgnt WHERE part_of_speech='verb' AND gender IS NOT NULL GROUP BY book_name) AS tmpB
	ON tmpA.book_name=tmpB.book_name
	ORDER BY tmpA.book_number ASC;

Strange to say, I think this is the graph that best reflects my intuition about which books are hardest to read. The books with the most participles per word are Jude, 2 Peter, 1 Peter, Acts, and Hebrews. The Johannine literature has relatively few participles, and those are indeed the easier books for me to read. The gospels are higher than the Pauline epistles, which is contrary to my intuition, but part of my intuition reflects the difference between narrative and epistolary form, which is not captured here.
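The participle fraction can also be computed in a single pass over the table, without joining two subqueries. The following is a sketch of that alternative on a toy in-memory table using Python's sqlite3 module. The rule that a participle is a 'verb' with a non-null gender comes from the query above; the specific tag values other than 'verb' are placeholders I made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sblgnt (_id INTEGER PRIMARY KEY, book_number INTEGER, "
    "book_name TEXT, part_of_speech TEXT, gender TEXT)"
)
# Invented rows; tag values other than 'verb' are placeholders.
conn.executemany(
    "INSERT INTO sblgnt (book_number, book_name, part_of_speech, gender) "
    "VALUES (?, ?, ?, ?)",
    [
        (1, "A", "verb", "masculine"),   # participle: verb with a gender
        (1, "A", "verb", None),          # finite verb: no gender
        (1, "A", "noun", "feminine"),
        (1, "A", "conjunction", None),
        (2, "B", "verb", "neuter"),      # participle
        (2, "B", "noun", "masculine"),
    ],
)

# Count participles with a CASE expression and divide by all words.
fractions = conn.execute(
    """
    SELECT book_name,
           SUM(CASE WHEN part_of_speech = 'verb' AND gender IS NOT NULL
                    THEN 1 ELSE 0 END) * 1.0 / COUNT(*) AS Fraction
    FROM sblgnt
    GROUP BY book_number, book_name
    ORDER BY book_number ASC
    """
).fetchall()

print(fractions)
```

One difference from the join version: a book with no participles would get a fraction of 0 here, where the LEFT JOIN would return NULL for it.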

Acknowledgements

All contents © 2024 Adam Baker, except where otherwise noted.