Daniel Schreier, Milan Marković & Saša Petković, Selim 23 (2018): 115–136. 
ISSN 1132-631X 

 
i sall synge in haboundance of gastly softne: /h/ insertion in 
Middle English — Methodology, data mining and some first 
interpretations¹
 
 
Daniel Schreier, Milan Marković & Saša Petković 
University of Zurich 

 
This paper discusses some of the methodological challenges for a first corpus-based 
analysis of so-called /h/ insertion in English, a feature that has been widely observed 
yet not analysed empirically so far. We survey the existing literature and present what 
is known about historical variation and change, before describing our data-driven 
approach in the Helsinki Corpus and the Corpus of Early English Correspondence and 
presenting some first results on internal conditioning, restriction to word type, and 
overall frequency. We show that /h/ was inserted in native English words as well as
French loanwords, in nouns, adjectives, verbs, adverbs, and numerals, and that there
was a positive match of identical lexical items in the two corpora (able, am, it, and
itself), making this a historically robust feature. 

 
Keywords: /h/ insertion; corpus analysis; historical variation and change; Helsinki 
Corpus; Corpus of Early English Correspondence 

 
¹ This paper was delivered as a plenary talk at the 2017 Conference of the Spanish 
Society for Medieval English Language and Literature (SELIM), held at the University 
of Málaga on 21–23 September 2017. We wish to acknowledge the feedback and 
contributions from the audience present. A number of colleagues shared data and 
experience, commented on earlier drafts of the manuscript and drew our attention to 
relevant issues, and our thanks go to: Rhona Alcorn, Jorge Luis Bueno-Alonso, Javier 
Calle-Martín, Juan Camilo Conde-Silvestre, Ray Hickey, Jesús Romero-Barranco, 
Olga Timofeeva, and Christine Wallis. Any remaining mistakes are our own. 



1. Introduction 
 
It is perhaps no exaggeration to say that /h/ insertion (in words that receive 
initial stress and begin with a vowel, island, abundance, it, etc.) is one of the 
least understood variables in the history of the English language. It is reported 
in overviews of English historical phonology (Lass 1992, Milroy 1992) and can 
be traced back to Early Middle English. Milroy (1992) has shown that the 
Norfolk Gilds (late 14th century) and the Paston Letters (late 15th century) exhibit 
variable use of <h> spellings, namely both absence (in alpenie ‘halfpenny’) and 
un-etymological insertion (in hoke lewes ‘oak leaves’). It is frequently used as a 
stylistic device in Charles Dickens’ renderings of working-class London 
English (“gas microscopes of hextra power”, in the Pickwick Papers), and 
historical corpora (such as the Linguistic Atlas of Early Middle English, 
LAEME) provide further evidence. Nevertheless, inserted /h/ is very near 
extinction in Present-day British English (surviving in cases of occasional 
hypercorrection, such as in the name of the letter <h>, pronounced /heitʃ/) but has 
been maintained in several post-colonial English varieties around the world (such as 
Tristan da Cunha English; see Schreier forthc.). All in all, /h/ insertion is 
well-known and mentioned in the literature, yet always on a descriptive level, 
and quantitative analyses are lacking. So far, it has not been studied from the 
standpoint of variationist sociolinguistics or (with the exception of Milroy 
1992) historical sociolinguistics. 

This paper goes some way towards redressing this imbalance and reports on a 
first attempt to reconstruct historical variation in /h/ insertion in two corpora, 
namely the Helsinki Corpus (HC) and the Corpus of Early English 
Correspondence Sampler (CEECS). Our prime concern is methodology, and we 
describe our data mining procedures in detail here, in 
the hope that /h/ insertion will be studied in other historical corpora as well. We 
will also provide some information on the lexical dimension of /h/ variation 
and compare the two corpora with the aim of uncovering parallel patterns of 
/h/ insertion in earlier forms of English. We begin with a general discussion of 
/h/ insertion in English and present what is known about this variable, based 
on existing sources. Following the general discussion, we detail the 
methodology developed for our research project, describe the two corpora used 
(HC, CEECS), present our findings, and end with a conclusion and an 
assessment of future research on the variable. 
 
 

2. /h/ insertion in English: What we know so far  
 
One problem with assessing the historical validity of /h/ insertion is that the 
comments are sporadic, non-representative, and selective. The high degree of 
awareness of this variable contrasts with the small number of sources for which it is 
reported, and over the last two centuries, commentators (such as Jespersen 
1949) have focused on very few sources where it is mentioned. To give an example: 
one of the most widely used 19th-century sources is the direct speech of 
working-class characters as reported in novels of the period, most prominently the 
London English of Charles Dickens’ characters. Indeed, 
/h/ variation is a rather prominent feature here: 
 
Adam Bede (George Eliot): 
 Mr Casson, a butler, talks to a traveler. “He’ll be comin’ of hage this ’ay-’arvest” 
 
Nicholas Nickleby: 
 “‘This is the hend, is it?’ continued Miss Squeers, who being excited 
aspirated her h’s strongly!” (p. 518) 
 
Great Expectations: 
 Pip writes to Joe: “mI deEr JO i opE U r krWitE wEll i opE i shAl soN B 
haBelL 4 2 teeDge U JO” (i, p. 98) 
 
The Pickwick Papers: 
 “If they wos a pair o’ patent double million magnifyin’ gas microscopes of 
hextra power, p’raps I might be able to see through a flight o’ stairs and a 
deal door” (Chapter 34) 

 
Previous studies on variation and change of English dialect features in 
dialect contact scenarios overseas (such as the /v–w/ merger, Trudgill et al. 
2003) have readily used Dickens as a resource and taken dialect reports of this 
kind as prima facie evidence of working-class London English (assuming that 
it was brought to overseas areas). This is not unproblematic, however. From the 
standpoint of English historical linguistics, there is no guarantee that the direct 
speech reported in a source is an adequate reflection of how Londoners used 
to speak at the time (this, of course, goes for all historical sources, Chaucer, 
Shakespeare, etc.), let alone that London English served as a donor variety, 
which is a considerable conundrum. Writers have always taken liberties 
when creating their characters; reported dialogues may be at odds with adequate (or 
authentic) renderings of speech and may in fact be nothing more than literary 
artefacts, not to be taken at face value. Mugglestone reminds historical 
linguists that  
 

Dickens’s lower-class characters are, correspondingly, often depicted as being 
seemingly incapable of pronouncing hand other than as and, hungry other than 
as ungry, or, conversely, under other than as hunder. Such systematic patterns 
are, in real terms, fictions just as much as the characters themselves. 
(Mugglestone 1995: 137) 

 
The same observation, albeit in stronger terms, is found in Jespersen: 
 

Many novelists would have us believe, that people who drop their aspirates 
place false aspirates before every vowel that should have no [h]; such systematic 
perversion is not, however, in human nature. But they sometimes inadvertently 
put a [h] between two vowels (rarely after a consonant), especially when the 
word is to receive extra emphasis, and of course, without any regard to whether 
the word ‘ought to’ have [h] or not. The observer […] is struck with the 
instances of disagreement, deducing from them the impression of a systematic 
perversion (‘Am an’ heggs’). (Jespersen 1949: 378) 

 
Both Mugglestone and Jespersen call for a cautious interpretation of 
historical materials (of which Dickens’ novels are just one source, of course, 
even though a rather widely quoted one). However, this is not to deny their 
validity. Jespersen himself discusses /h/ variation in the history of English 
(1949: 378–390) and quotes other sources, such as James Elphinston’s (1787) 
Principles of English Grammar: “E 1787 (vol 2.254 ff.) complains of exactly the 
same errors in this respect as are met with nowadays: ils, ouzes, eariing the owls 
in the hevening, orse, art, arm, etc.”; John Walker’s (1791) Critical Pronouncing 
Dictionary: “W 1791 speaks of the ‘fault of the Londoners: not sounding h 
where it ought to be sounded, and inversely’.”; and Thomas Batchelor’s (1809) 
Orthoëpical Analysis: “B 1809 p. 29 says: ‘the aspirate h […] is often used 
improperly, and is as frequently omitted where it should be used. Give my orse 
some hoats has been given as an example of these opposite errors from the 
Cockney dialect’.” 

Additional evidence comes from regional dialectology. The Survey of 
English Dialects (SED, Luick 1964) reports fieldwork data from so-called NORM 
(Non-mobile Older Rural Male) speakers throughout England. Fieldworkers 
were instructed to elicit data via questionnaires and to transcribe the 
responses. With regard to /h/ insertion, Luick concludes that 
 

Anm. 1. Dass h nicht gänzlich geschwunden ist (so Wright, Dial. Gram. 254), 
ergibt sich aus den Aufzeichnungen bei Ellis (vgl. Güning 3ff) und neueren 
Einzeluntersuchungen. Danach würde heute in Kendal im nördlichen 
Cumberland und in Suffolk h zumeist richtig gesprochen (Hirst 12, 111, 
Brilioth 93, Kökeritz 106) und wäre in Penrith und Süd-Durham bei manchen 
Sprechern oder in gewissen Orten regelmässig vorhanden (Reaney 133, Orton 
6, 141). Für in der Emphase falsch angefügtes h bieten Beispiele: Hackness (Cowling 
101), Adlington (Hargreaves 73), Pewsey (Kjederqvist 113), West-Somerset 
(Kruisinga 93), Stokesley (Klein 76), Oldham (Schilling 107). (Luick 1964: 1093, 
emphasis added) 
‘Obs. 1: The fact that h has not entirely disappeared (cf. Wright, Dial. Gram. 
254), can be seen in the notes of Ellis (cf. Güning 3ff) and some more recent 
case studies. Accordingly, in Kendal in northern Cumberland and in Suffolk h 
was pronounced correctly most of the time (Hirst 12, 111, Brilioth 93, 
Kökeritz 106) and it was regularly found in some speakers and certain localities 
in Penrith and South-Durham (Reaney 133, Orton 6, 141). Examples for h 
that are added incorrectly in emphatic positions: Hackness (Cowling 101), 
Adlington (Hargreaves 73), Pewsey (Kjederqvist 113), West-Somerset 
(Kruisinga 93), Stokesley (Klein 76), Oldham (Schilling 107).’ 

 
This suggests that /h/ insertion was alive and well in the mid-20th century (at 
least in older speakers of traditional dialects) and that it was rather widespread 
regionally. A wide regional distribution is also attested in Middle English, where 
Milroy’s (1992) analysis of the Linguistic Atlas of Late Medieval English (LALME) 
provides evidence in texts from the East Midlands, East Anglia and the South (in c. 
1190–1320, the feature was attested in a region from Lincolnshire and Norfolk 
to the southern counties but “the instability seems to be greatest in the East 
Midlands”, Milroy 1992: 140). Milroy also finds that /h/ variation was attested 
in the Norfolk Gilds (late 14th century) and the Paston Letters (late 15th 
century; alpenie ‘halfpenny’ and hoke lewes ‘oak leaves’). 
Alcorn (personal communication, August 2016) reports even earlier 
evidence of /h/ insertion in LAEME, where a search yielded 
 

multiple examples in multiple LAEME texts from multiple counties, including 
Gloucestershire, Kent, Lincolnshire, Norfolk, Suffolk, and Worcestershire.  
Adj[ective] examples include: holde ‘old’, hiuel ‘evil’, hunkinde ‘unkind’.  
Adv[erb] examples include: heuere ‘ever’, hinne ‘in’, hout ‘out’.  



Conj[unction] examples include: hif ‘if’, her ‘ere’, has ‘as’.  
Prep[osition] examples include: huntil ‘until’, hafter ‘after’, hat ‘at’. 

 
Alcorn informs us that /h/ insertion is widespread, both externally in terms of 
region, and internally in all parts of speech (which is noteworthy, as most of 
the later examples come from nouns; see below). 

Like other sources, Milroy emphasises that there is a relationship between 
/h/ dropping and /h/ insertion, though he is more sociolinguistically sensitive 
to the variable nature of the two processes than scholars such as Jespersen are. 
Milroy (1992) finds that /h/ variation, e.g. ate for ‘hate’ and om for ‘home’, 
halle for ‘alle’ and his for ‘is’, has a complex sociolinguistic history: 
 

Many Early M[iddle] E[nglish] sources exhibit variable use of the letter h in 
syllable-initial positions before vowels (that is, in such words as hate, hopper). 
Sometimes it is omitted where it is historically expected to be present, and 
sometimes it is added where it is not expected. (Milroy 1992: 140)  

 
Milroy offers a historical-sociolinguistic explanation for /h/ variation, 
namely that there was linguistic insecurity as to when to pronounce /h/. One 
of the most openly stigmatised sociolinguistic variables in the history of 
English is /h/ dropping, and Milroy suggests that instability in /h/ usage leads 
to hyper-correction and insertion of /h/ where it is not etymological. 
Speakers, in other words, over-compensate in order to avoid using stigmatised 
innovations, which would explain why one and the same speaker would have 
both dropping and insertion at the same time (which, of course, makes it very 
noticeable). A similar explanation was offered by Christine Wallis 
(unpublished ms.), who studied the Monasteriales Indicia, a short description 
of the sign language used in a monastery during times of silence (covering commonly 
used items to do with liturgy, eating and drinking, clothing, and people in the 
abbey), which has been dated to Canterbury, c. mid-11th century. In this text, 
Wallis noted both 
deletion (is ‘his’, abban ‘to have’) and insertion (hunlocan ‘unlock’, his ‘is’, 
halban ‘alb’) and concluded that “the scribe is inconsistent in his use of <h> 
[…] ‘overcompensating’ in speech for h-dropping. Therefore, the text would 
reflect the writer’s speech. Alternatively, it could be a written feature reflecting 
only h-lessness in speech.”  

Of course, one problem with such an approach is that one would expect 
the variable to be above the sociolinguistic radar. If stigmatised features are 
‘corrected’, then they would have to be noticeable and subject to open 
comments and discourse, particularly in the usage guides and orthoepic 
treatises of the 18th century. However, there are “few comments on 
‘[h]-dropping’ (and [h] insertion) before 1800. The very few comments also refer 
to Cockney, or London English” (Milroy 1992: 138). This is also reflected in 
Jespersen’s discussion (see above).  
 
Linguistic insecurity is questioned by others as well: 
 

Until the beginning of the sixteenth century, there was no evidence of 
association between h-dropping and social or educational status, but the 
attitudes began to shift in the seventeenth century, and by the eighteenth 
century [h]-lessness was stigmatized in both native and borrowed words. 
(Minkova 2013: 107; cf. Mugglestone 2006) 

 
Still, compared with the situation in the following century, any adverse 
sociophonetic consequences of [h]-dropping and adding are, if anything, 
relatively unstressed by late-eighteenth-century observers. (Jones 2006: 257) 

 
To substantiate these claims, we checked a selection of grammar books and 
usage guides published mostly in the second half of the 18th century (see Table 1), 
and it is striking that not a single one of them listed /h/ insertion. 
This, in our view, is evidence that the feature was not in the public eye and 
remained below the sociolinguistic radar (discussion in Schreier forthc.), making 
Milroy’s explanation difficult to uphold. The three sources quoted by 
Jespersen (1949) and reproduced in other publications seem to be the only 
ones that make explicit reference to the feature.  

To sum up this very short overview of what we know about the history of 
/h/ insertion in British English: the evidence we have at present is anecdotal 
and sporadic at best. /h/ deletion and insertion are commonly discussed hand 
in hand and considered as a manifestation of /h/ variation in general. The 
earliest reports we could locate come from the mid-11th century (Monasteriales 
Indicia); inserted /h/ is then reported throughout the Middle English and 
Early Modern English periods (Norfolk Gilds, late 14th century; Paston Letters, 
late 15th century; and also in historical dialect corpora: LAEME, LALME). 
 


Table 1. 18th-/19th-century usage guides checked for comments on /h/ insertion 

Year  Author       Title 
1726  N. Bailey    An Introduction to the English Tongue: Being a Spelling Book 
1750  S. Hammond   A New Introduction to Learning; or, A Sure Guide to the English Pronunciation and Orthography 
1762  G. Sharp     A Short Treatise on the English Tongue: Being an Attempt to Render the Reading and Pronunciation of the Same More Easy to Foreigners 
1764  A. Johnston  Pronouncing and Spelling Dictionary 
1773  W. Kenrick   A New Dictionary of the English Language 
1781  T. Sheridan  A Rhetorical Grammar of the English Language. Calculated Solely for the Purposes of Teaching Propriety of Pronunciation, and Justness of Delivery, in that Tongue, by the Organs of Speech 
1784  R. Nares     Elements of Orthoepy 
1786  W. Scott     A New Spelling, Pronouncing, and Explanatory Dictionary of the English Language 
1791  J. Walker    A Critical Pronouncing Dictionary and Expositor of the English Language 
1792  W. Fogg      Elementa Anglicana; or, The Principles of English Grammar Displayed and Exemplified 
1793  W. Perry     The Royal Standard English Dictionary 
1795  W. Smith     An Attempt to Render the Pronunciation of the English Language More Easy to Foreigners 
1799  J. Adams     The Pronunciation of the English Language Vindicated from Imputed Anomaly & Caprice 

 
As for insights from regional dialectology and historical sociolinguistics, 
Ellis (1889: 1) reports the feature in “uneducated people, speaking an inherited 
language, in all parts of Great Britain where English is the ordinary medium of 
communication between peasant and peasant”, and it is found in a broad range 
of diverse areas, including Yorkshire, Lancashire, Wiltshire, and Somerset. 
According to Milroy, there was a regional concentration in the Southeast and the 
Midlands, and by the late 19th century it had developed a strong association 
with working-class London English (Cockney). This no doubt fuelled its 
prominent usage in the literature (as in Dickens’ novels), where it was 
portrayed (perhaps even stereotyped) as an East London feature. Later on, it 
was all but lost from British English; whereas dialect atlases such as the SED 
still report it as common (at least in the speech of older non-mobile men in 
the countryside), it became obsolescent from the 1950s onwards and is now 
very near extinction (or already extinct). 

As quantitative evidence is lacking, the present study is a first attempt to 
develop a methodology that allows us to retrace historical variation in /h/ 
insertion. We would like to present some first historical data to explore 
the variable nature of /h/ and also get a better understanding of its frequency 
and internal conditioning in Middle and Early Modern English. With this 
aim, we selected two corpora, the HC and the CEECS, and searched for cases 
of /h/ insertion. In the next section, we detail the methodology we employed before 
we go on to present some first findings. 
 
 
3. /h/ insertion in Middle English: Methodology and some first findings  

 
While certainly not the largest of the historical corpora available today, the 
Helsinki Corpus and the Corpus of Early English Correspondence (CEEC) are 
particularly well-suited for diachronic investigation of /h/ insertion, since they 
were both designed and compiled along principles of historical sociolinguistics. 
These two collections of written language represent a proverbial ‘window into 
the past’, as they provide a very good opportunity to extend the analysis of /h/ 
insertion into the domain of English historical linguistics and to assess 
patterns of variation and change. 

The HC (Rissanen et al. 1991) comprises two major components: a 
diachronic and a dialectal one. The diachronic component of the HC was 
released in 1991 and includes a selection of texts that span the period from 
around 730 to 1710 (thus almost one thousand years). These texts are 
organised into three large sections, which correspond to the Old, Middle and 
Early Modern English periods. In total, the HC includes c. 450 texts and 
1,572,800 words. The Old English section contains 413,300 words, the Middle 
English section 608,600 words, and the Early Modern English section 551,000 
words. The representativeness of the language included in the corpus is 
described by Kytö, who notes: 
 

[T]he selectional criteria adopted for including a text […] reflect the principles 
of socio-historical variation analysis […] Periodization has been of primary 
importance […] but attention has also been paid to geographical dialect, type 
and register of writing (text type, relationship to spoken language, setting on 



formal-informal axis) and sociolinguistic variation (different author-related 
parameters such as gender, age, social rank). (Kytö 1996) 

 
However, even though the corpus covers nearly an entire millennium in the 
history of the English language, only its Middle and Early Modern English 
components, which together span the period from 1150 to 1710, were 
included for the present investigation. Its Old English component, on the 
other hand, which covers the period from 730 to 1150, was not taken into 
consideration. 

In a similar vein, the original purpose of the CEEC (Nevalainen et al. 1998) 
was to examine the ways in which a methodology employed for modern 
sociolinguistic research can be applied to historical data. This was made 
possible by including “an extensive database containing background 
information about letter writers” (Nevalainen et al. 1998) into the corpus 
metadata. The original version of the CEEC was released in 1998 and covers 
the time period between 1410 and 1681. It comprises around 5,961 
letters and amounts to more than 2.5 million words. The CEECS, its publicly 
available version, was released in the same year (Nevalainen et al. 1998), and it 
is this version of the corpus that we used for the purposes of the present 
paper. The sampler itself covers the period between 1418 and 1680 and 
contains 1,147 letters from 194 individual writers, amounting to around 
450,000 words in total. It is worth mentioning that this sampler represents a 
“fairly accurate small-scale copy of the full CEEC” as it provides very “similar 
results for many linguistic phenomena” (Nevalainen et al. 1998). 

As mentioned previously, very few attempts have so far been made to 
account for /h/ insertion in quantitative terms. We therefore set out to 
devise a universal, reliable and robust set of criteria that would allow us to 
accurately track the usage of this sociolinguistic variable over long periods of 
time (and that would be general enough to be applied to other 
corpora as well). Our methodological procedure, which is essentially an 
extension of the methodology first employed to investigate /h/ insertion in 
Tristan da Cunha English (see Schreier forthc.), is described in detail in the 
passages below, in the hope that it will help advance further quantitative 
analyses of /h/ variation in the domain of English historical linguistics. 

During the initial stage of the analysis, it was necessary to identify and 
extract all the instances of /h/ insertion from the two corpora in order to get 
some first understanding of its use. This presented us with a first major 
obstacle. In a smaller, specialised corpus, where the feature investigated is 
usually marked via some type of notation in order to facilitate its 
searchability and recall, this would have been a relatively straightforward 
task. The same does not apply to large, general-purpose corpora such as the HC 
and the CEEC, which strive for universality and where non-standard notation is 
reduced to a bare minimum. To make matters more complicated, 
upon running some of our preliminary searches, we also encountered a 
number of ambiguous word forms, a variety of different spelling realisations 
for the same form, as well as forms carrying inflectional suffixes which have 
long disappeared from usage in the English language. This is illustrated in 
Table 2, which shows the complete list of word forms attested in the HC data 
for the lexical units ASK and ABIDE. As a consequence, looking 
for underlying patterns that could facilitate the retrieval of /h/ tokens proved a 
virtually impossible task. 
 
Table 2. List of word forms for ASK and ABIDE attested in the Helsinki Corpus of English Texts 

Lexeme  Word forms 

ASK     ask, asks, asking, asked, ask’d, askande, askede, askeden, askeing, 
        askeinge, asken, askes, askest, asket, asketh, askeyd, askeþ, askid, askis, 
        askist, askit, askith, askt, askte, askus, askyd, askyde, askyn, askyng, 
        askynge, askynges, askys, askyth, axing, akseþ, aksy, axande, axe, axed, 
        axede, axeden, axeing, axeinge, axes, axelyne, axest, axet, axen, axeth, 
        axeyd, axeþ, axid, axis, axist, axit, axith, axt, axte, axus, axyd, axyde, axyn, 
        axyng, axynge, axynges, axys, axyth, hasked, haske 

ABIDE   abide, abiding, abidað, abidæn, abidan, abidand, abidas, abiddan, abide, 
        abideð, abiden, abideth, abideþ, abiding, abidiþ, abidon, abidyng, 
        abidynge, abyd, abydand, abyde, abyden, abydest, abydeth, abydith, 
        abydyn, abydyng, abydynge, abydyth, abydythe, habyde, habydynge 

 
Hence, in order to substantially limit the scope of our investigation and bring 
it to a somewhat more manageable level, we first extracted every single word 
form which starts with the letter h. This procedure was carried out using 
corpus query language; however, the same result can also be achieved with the 
use of regular expression syntax. Unsurprisingly, running a search of such 
general scope resulted in an extremely high number of hits in both the HC 
and the CEECS (N=109,920 and N=27,335, respectively). The concordanced 
results were then exported as a comma-separated values file (.csv) and imported into 
an Excel spreadsheet for further analysis. 
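
The extraction step can be illustrated with a short regular-expression sketch. It is a minimal example under stated assumptions, not the query actually run on the corpora: it assumes the corpus text is available as a plain string, and the names `H_INITIAL`, `h_initial_forms` and `forms_to_csv` are hypothetical.

```python
import csv
import io
import re
from collections import Counter

# Word-initial <h> followed by any letters (Python's \w also covers þ, ð, æ, ȝ),
# optionally continued by apostrophe-joined endings such as hallow'd.
H_INITIAL = re.compile(r"\bh\w*(?:'\w+)*", re.IGNORECASE)

def h_initial_forms(text):
    """Return every word form in `text` that starts with the letter h (lower-cased)."""
    return [match.group(0).lower() for match in H_INITIAL.finditer(text)]

def forms_to_csv(forms):
    """Tabulate the hits per word form and return them as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["form", "hits"])
    for form, hits in sorted(Counter(forms).items()):
        writer.writerow([form, hits])
    return buffer.getvalue()

# Invented sample sentence, for illustration only.
sample = "Hit is hable to haske, and the hallow'd thing."
forms = h_initial_forms(sample)
csv_text = forms_to_csv(forms)
```

A search of this kind deliberately over-generates: it returns every h-initial form, etymological or not, which is why the hits still have to be inspected manually.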



During the next stage, each of the hits was manually inspected for the 
presence of a potential /h/ prefix. Only those cases where /h/ insertion was 
positively attested were retained in the tabulations, while the rest of the data 
was discarded. As a result, we managed to identify a total of 1,661 instances of 
/h/ insertion in the HC and 520 instances of /h/ insertion in the CEECS. 

However, upon closer examination of the results, we noticed two 
distinct types of tokens with an /h/ prefix in the HC data. The first type 
comprised word forms such as habilite (ability), hable (able), haboundance 
(abundance), haske (ask), hevere (every), hit (it), etc. These instances of /h/ 
insertion ‘proper’ were retained in the analysis. On the other hand, we also 
noticed a number of tokens occurring in the Middle English component of 
the HC (1150–1350, specifically) that represent earlier spellings of what later 
became wh- words. This type consisted of word forms such as huanne (when), huere (where), 
huet (what), huich (which), hwile (while), etc. We treated these instances as 
members of the ‘Don’t Count’ class and removed them entirely from the scope 
of our investigation. With the exclusion of these word forms, the total 
number of /h/ tokens in the HC data was reduced from the original 1,661 to 
1,309. 
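
As a rough illustration of how candidates for the ‘Don’t Count’ class might be pre-sorted before manual inspection, the sketch below flags forms that begin with hu- or hw- followed by a vowel letter. This spelling heuristic is an assumption made for the example (the classification reported above was done by inspection), and the names `WH_PRECURSOR` and `split_dont_count` are hypothetical.

```python
import re

# Heuristic (assumed, not from the study): <hu>/<hw> + vowel letter,
# as in huanne, huere, huet, huich, hwile.
WH_PRECURSOR = re.compile(r"^h[uw][aeiy]", re.IGNORECASE)

def split_dont_count(forms):
    """Split forms into retained /h/ tokens and 'Don't Count' candidates."""
    keep, dont_count = [], []
    for form in forms:
        if WH_PRECURSOR.match(form):
            dont_count.append(form)
        else:
            keep.append(form)
    return keep, dont_count

# Example forms quoted in the text above.
sample = ["habilite", "hable", "haboundance", "haske", "hit",
          "huanne", "huere", "huet", "huich", "hwile"]
keep, dont_count = split_dont_count(sample)

# Tokens removed from the HC data by the exclusion, from the reported totals.
excluded_from_hc = 1661 - 1309
```

The flagged forms would still need checking by hand, since a purely orthographic pattern cannot tell a wh- precursor from a genuine h-initial word with the same opening letters.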

In the next step, we categorised each of the /h/ tokens obtained from the 
two corpora according to their corresponding lexical values. Upon the 
completion of this process, we managed to identify nineteen unique lexical 
items in each of the two corpora. These lexemes were then used to find the 
corresponding word forms where /h/ insertion could but did not occur. This 
step was crucial to ensure that our method complies 
with Labov’s (1982: 30) ‘principle of accountability’. 

Specifically, the method which was employed to extract the /h/ variation 
data was the following. First, the entire inflectional paradigm was 
reconstructed for each of the lexemes by adding all the possible suffixes, 
depending on its part of speech membership (e.g. ABILITY: ability, abilities; 
ALLOW: allow, allows, allowing, allowed; ABLE: able, abler, ablest). Next, /h/ 
prefixed equivalents were created for each of the word forms obtained from 
the above procedure. Finally, all the prefixed and non-prefixed word forms 
were combined to create a list in order to extract both realisations of the 
dependent variable.  
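
The three steps above can be sketched as follows. This is a simplified illustration that only covers the regular modern suffixes shown in the examples (ABILITY, ALLOW, ABLE); the function names `paradigm` and `search_list` and the part-of-speech labels are assumptions made for the example.

```python
def paradigm(lemma, pos):
    """Build a simple modern inflectional paradigm for a lemma (regular forms only)."""
    if pos == "N":
        # ability -> abilities; otherwise plain -s
        plural = lemma[:-1] + "ies" if lemma.endswith("y") else lemma + "s"
        return [lemma, plural]
    # Drop a final -e before vowel-initial suffixes (able -> abler).
    stem = lemma[:-1] if lemma.endswith("e") else lemma
    if pos == "V":
        return [lemma, lemma + "s", stem + "ing", stem + "ed"]
    if pos == "Adj":
        return [lemma, stem + "er", stem + "est"]
    return [lemma]

def search_list(lemma, pos):
    """Pair every unprefixed form with its /h/-prefixed equivalent."""
    return [(form, "h" + form) for form in paradigm(lemma, pos)]

ability = search_list("ability", "N")
allow = search_list("allow", "V")
able = search_list("able", "Adj")
```

Irregular paradigms and historical spellings are not generated by rules of this kind and would have to be added to the list by hand.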

However, while this procedure works very well with data from the 
Modern English period (see Schreier forthc.), it is not sufficient on its own when 
earlier forms of English are taken into account, due to the variability in form 
and spelling mentioned earlier. Hence, for our procedure to be 
considered methodologically robust, the word list needed to be expanded with 
all the archaic realisations of the lexemes as well. These were identified with 
the help of two major Middle English resources: LAEME (Laing 2013) and 
the Middle English Dictionary (MED, Lewis 1959). In the last step, all the 
archaic word forms attested in our subsequent searches were included in the 
final list. An illustrative sample of such a list, which was used to extract the 
/h/ variation data from the HC, is presented in Table 3. 
 
Table 3. Sample of the list used for extraction of Ø and /h/ tokens in the Helsinki Corpus of English Texts 

Lexeme             Ø              /h/ 

ABILITY (N)        ability        hability 
                   abilities      habilities 
                   abilite        habilite 
                   abilitie       habilitie 
                   abilitys       habilitys 

ALLOW (V)          allow          hallow 
                   allows         hallows 
                   allowing       hallowing 
                   allowed        hallowed 
                   allow’d        hallow’d 
                   allowd         hallowd 
                   allowe         hallowe 
                   allowes        hallowes 
                   alloweþ        halloweþ 

ABLE (Adj)         able           hable 
                   abler          habler 
                   ablest         hablest 
                   abel           habel 
                   abil           habil 
                   abill          habill 
                   abyl           habyl 
                   abyll          habyll 

ABUNDANTLY (Adv)   abundantly     habundantly 
                   abundauntly    habundauntly 
                   aboundantlie   haboundantlie 
                   aboundantly    haboundantly 
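
A list such as the one in Table 3 can be used to count both realisations of the variable in a text. The following is a minimal sketch that assumes plain-text input and a simple letter-based tokeniser; the function name `count_variants` and the sample sentence are invented for illustration.

```python
import re
from collections import Counter

# Unprefixed forms of ABLE, taken from Table 3.
ABLE_FORMS = ["able", "abler", "ablest", "abel", "abil", "abill", "abyl", "abyll"]

def count_variants(text, plain_forms):
    """Count unprefixed (Ø) and /h/-prefixed tokens of one lexeme in `text`."""
    plain = set(plain_forms)
    prefixed = {"h" + form for form in plain}
    # Tokens are runs of letters, optionally joined by apostrophes (allow'd).
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())
    counts = Counter({"zero": 0, "h": 0})
    for token in tokens:
        if token in plain:
            counts["zero"] += 1
        elif token in prefixed:
            counts["h"] += 1
    return counts

# Invented sample sentence, for illustration only.
sample = "He was hable and abill; she is able, yet not abyl nor habyl."
counts = count_variants(sample, ABLE_FORMS)
```

String matching alone cannot resolve ambiguous forms: a hit for hallowed, for instance, may be the verb hallow rather than /h/-prefixed allowed, so hits of this kind would still need to be checked in context.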

 

3.1 The Helsinki Corpus of English Texts 
 
As shown in Table 4, the initial results of the analysis of the HC provide 
quantitative evidence of /h/ insertion in Middle English varieties. Although 
the overall insertion rate in texts ranging from 1150 to 1710 is only 3.8 per 
cent, it rests on a relatively high count of 1,309 /h/ tokens (out of 34,507 
tokens in total). Considering that these tokens are extracted from records of 
written language produced at a time when only the educated and the nobility were 
literate, such a substantial number of /h/-prefixed instances warrants further 
investigation. 
 
Table 4. Helsinki Corpus of English Texts: Overall /h/ insertion rate 

Ø (N) /h/ (N) Total (N) /h/ (%) 

33,198 1,309 34,507 3.8 
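
The rate in Table 4 follows directly from the two token counts, as the short calculation below shows. The function name `insertion_rate` is hypothetical; the figures are those reported in the table.

```python
def insertion_rate(zero_tokens, h_tokens):
    """Return the total token count and the /h/ insertion rate in per cent."""
    total = zero_tokens + h_tokens
    return total, 100 * h_tokens / total

# Counts from Table 4 (Helsinki Corpus, 1150-1710).
total, rate = insertion_rate(33198, 1309)
rate_rounded = round(rate, 1)
```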

 
Categorised into periods ranging from seventy to one hundred years in 
length, the analysis of /h/ insertion rates reveals a drastic decrease and eventual 
loss of this feature over time. Texts from the earliest period between 1150 and 
1250 show a very high /h/ insertion rate amounting to 41.1 per cent (527 out 
of 1,282), which very noticeably contrasts with the results from the other 
periods. Even though two of the later periods are above the overall 
insertion rate, they exhibit very low insertion rates in comparison. Specifically, 
the periods from 1250 to 1350 and 1420 to 1500 display /h/ insertion rates of 
6.6 per cent and 7.4 per cent, respectively. The remaining periods show a 
substantial decrease in insertion rates, dropping to 1.7 per cent between 1350 
and 1420, then to 0.9 per cent between 1500 and 1570, and finally to 0.3 per 
cent between 1570 and 1640. There are no occurrences of /h/ insertion in the 
text records from the period between 1640 and 1710. This is an indication that 
the feature was lost, or at the very least, that there was a considerable decline 
of /h/ insertion in written language, as seen in Figure 1. 
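
The rates reported here are simple proportions of /h/ tokens to all tokens. As a hedged illustration, the Python sketch below recomputes the two rates for which the raw counts are given in the text (the overall rate and the rate for 1150 to 1250) and lists the periods that lie above the overall rate. The function name insertion_rate is an assumption; since no token counts are reported for the remaining periods, these are entered as the published percentages and not recomputed.

```python
def insertion_rate(h_tokens, total_tokens):
    """Percentage of tokens with initial <h>, rounded to one decimal place."""
    return round(100 * h_tokens / total_tokens, 1)

overall = insertion_rate(1309, 34507)   # counts from Table 4
earliest = insertion_rate(527, 1282)    # counts reported for 1150-1250

# Percentages as reported in the text; only the first one is recomputed above.
reported_rates = {
    "1150-1250": 41.1,
    "1250-1350": 6.6,
    "1350-1420": 1.7,
    "1420-1500": 7.4,
    "1500-1570": 0.9,
    "1570-1640": 0.3,
    "1640-1710": 0.0,
}

# Periods whose reported rate exceeds the overall rate.
above_overall = [period for period, rate in reported_rates.items() if rate > overall]

print(overall, earliest)
print(above_overall)
```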

On the lexical level, the HC results show a varied distribution of /h/ 
insertion rates. These findings are presented in Table 5. Due to their low 
token counts, the lexemes ability, abound, and abundantly will not be discussed 
further. The adjective abundant, although equally low in token count, is an 
exception since it presents a very interesting case, as does the reflexive 
pronoun itself in connection with the pronoun it. Further investigation shows 
that eight out of the sixteen remaining lexemes display low insertion rates. In 
the case of abide, only one example of /h/ insertion out of 171 total tokens was 
discovered, which amounts to a 0.6 per cent insertion rate. Similarly, the 
lexeme in appears only twice with the /h/ prefix out of 18,357 total tokens. The 
lexemes old (N=563), up (N=519), and upon (N=755) have high total token counts 
but the number of their prefixed tokens is low, so that their insertion rates 
range from 0.3 per cent to 0.6 per cent. These appear to be actual instances of 
/h/ insertion; nevertheless, they should be examined further on a larger data 
sample before any conclusions are drawn. The lexeme ask, on the other hand, is 
only minimally more reliable with four /h/ prefixed tokens out of 536, producing 
a 0.7 per cent insertion rate, which is identical to that of the lexeme every 
with six out of 902 total tokens. Likewise, the possessive pronoun our displays 
a conclusively low insertion rate of 0.7 per cent with eleven /h/ inserted 
tokens out of 1,623 in total. 

Figure 1. Helsinki Corpus of English Texts: /h/ insertion rates across different time 
periods 

Moving on to lexemes with higher insertion rates, the verb allow, with 
thirty-five total tokens, appears with the /h/ prefix only twice in the examined 
data sample (insertion rate 5.7%). Similarly, with a total token count of 
fifty-two, the preposition/conjunction until is also prefixed only twice in the 
corpus, yielding a 3.8 per cent insertion rate. The lexeme able appears prefixed 
by /h/ four times out of the total 153 tokens, with an insertion rate of 2.6 per 
cent. Nevertheless, these lexemes require further research on a larger data 
sample in order to 
verify the reliability of these results. The first person singular verb am, on the 
other hand, provides a much clearer picture of its insertion rate (twenty-one 
out of 920, or 2.3 per cent). The remaining lexemes are all 
examples of substantially higher /h/ insertion rates, which warrants closer 
examination. The adjective abundant, although low in total token count, 
appears almost categorically with the /h/ prefix in the data sample. With 
twelve insertion examples out of the total thirteen tokens, its insertion rate 
amounts to 92.3 per cent. Correspondingly, perhaps in line with the previous 
result, the noun abundance, with twelve examples of /h/ insertion out of 
thirty-four total tokens, produces an insertion rate of 35.3 per cent. 

Table 5. Helsinki Corpus of English Texts: /h/ insertion rates per lexeme 

Lexeme        /h/ (N)    Ø (N)  Total (N)  /h/ (%)

ABIDE               1      170        171      0.6
ABILITY             4        7         11     36.4
ABLE                4      149        153      2.6
ABOUND              1       14         15      6.7
ABUNDANCE          12       22         34     35.3
ABUNDANT           12        1         13     92.3
ABUNDANTLY          1       13         14      7.1
ALLOW               2       33         35      5.7
AM                 21      899        920      2.3
ASK                 4      532        536      0.7
EVERY               6      896        902      0.7
IN                  2   18,355     18,357     0.01
IT              1,214    8,601      9,815     12.4
ITSELF              4       15         19     21.1
OLD                 3      560        563      0.5
OUR                11    1,612      1,623      0.7
UNTIL               2       50         52      3.8
UP                  3      516        519      0.6
UPON                2      753        755      0.3

Total           1,309   33,198     34,507      3.8

The most interesting finding in the data sample, however, is the personal 
pronoun it. With 1,214 occurrences of /h/ insertion out of the total 9,815 
tokens, this lexeme has an insertion rate of 12.4 per cent. Its insertion rate is 
not as high as that of the previous two lexemes; nevertheless, with such a large 
number of /h/ inserted occurrences (92.7% of all the /h/ tokens in the HC), 
the pronoun it is a perfect candidate for further study of the underlying 
linguistic constraints which condition /h/ insertion. Similarly, although the 
reflexive pronoun itself has a low total token count (N=19), with four 
occurrences of /h/ insertion it produces an insertion rate of 21.1 per cent and 
merits additional research on a larger data sample, if only in connection with 
the pronoun it. 
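
The weight of the pronoun it in the HC results can be made explicit with a short calculation. The Python sketch below, offered as an illustration only, takes the counts from Table 5, recomputes the share of it among all /h/ tokens, and derives the overall insertion rate that would remain if it were set aside. The variable and function names are assumptions, and the rate without it is arithmetic on the published counts, not a figure reported in the analysis.

```python
# (h_tokens, total_tokens) per lexeme, copied from Table 5.
hc_counts = {
    "abide": (1, 171), "ability": (4, 11), "able": (4, 153),
    "abound": (1, 15), "abundance": (12, 34), "abundant": (12, 13),
    "abundantly": (1, 14), "allow": (2, 35), "am": (21, 920),
    "ask": (4, 536), "every": (6, 902), "in": (2, 18357),
    "it": (1214, 9815), "itself": (4, 19), "old": (3, 563),
    "our": (11, 1623), "until": (2, 52), "up": (3, 519),
    "upon": (2, 755),
}

def rate(part, whole):
    """Percentage of part in whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)

total_h = sum(h for h, _ in hc_counts.values())
total_all = sum(total for _, total in hc_counts.values())
it_h, it_total = hc_counts["it"]

# Share of the pronoun 'it' among all /h/ tokens in the HC sample.
share_of_it = rate(it_h, total_h)

# Overall insertion rate once all tokens of 'it' are excluded.
rate_without_it = rate(total_h - it_h, total_all - it_total)

print(total_h, total_all)
print(share_of_it, rate_without_it)
```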
 
 
3.2 The Corpus of Early English Correspondence Sampler 
 
Although much smaller in size, the CEECS shows an overall insertion rate of 
2.7 per cent, which is very close to the HC rate of 3.8 per cent (see Table 6). 
Out of 19,036 total tokens, 520 have the inserted /h/. Considering that the 
CEEC sampler version is less than half the size of the selected data sample 
from the HC, these results promise to be reliable, comparable and 
generalisable. 
 
Table 6. Corpus of Early English Correspondence Sampler: Overall /h/ insertion rate 

Ø (N) /h/ (N) Total (N) /h/ (%) 

18,516 520 19,036 2.7 

 
Investigating the insertion rates in the CEECS on the lexemic level reveals 
the interesting fact that each of the two corpora yields a total of nineteen 
unique lexemes prefixed with /h/ in written records (see Table 7). Of those 
nineteen, only four appear in both data samples: able, am, it, itself. Again, 
out of the nineteen lexemes, three (abandon, ache, ale) will not be discussed 
due to their low total token numbers. Out of the remaining sixteen, seven 
lexemes (answer, as, at, one, order, over, us) have high total token counts but 
appear only once or twice with the /h/ prefix in the corpus, so they will also 
be disregarded as 
possible hapax legomena. The reflexive pronoun itself will only be mentioned in 
connection with the lexeme it due to its very low number of occurrences (N=10). 
First of all, the lexeme all produces an insertion rate of 0.2 per cent, appearing 
four times with the /h/ prefix out of the total 2,314 tokens, as does the lexeme 
any, occurring three times out of 1,322. The adjective/pronoun other appears 
with the prefix four times out of 832 tokens and produces an insertion rate of 
0.5 per cent. 
 
Table 7. Corpus of Early English Correspondence Sampler: /h/ insertion rates per lexeme 

Lexeme        /h/ (N)    Ø (N)  Total (N)  /h/ (%)

ABANDON             1        2          3     33.3
ABLE               25      155        180     13.9
ACHE                1        1          2     50.0
ALE                 2        5          7     28.6
ALL                 4    2,310      2,314      0.2
AM                 14    1,062      1,076      1.3
ANSWER              1      336        337      0.3
ANY                 3    1,319      1,322      0.2
ARMS                2       26         28      7.1
AS                  1    4,652      4,653     0.02
AT                  2    2,326      2,328      0.1
IT                450    3,797      4,247     10.6
ITSELF              1        9         10     10.0
ONE                 2      707        709      0.3
ORDER               2      169        171      1.2
OTHER               4      828        832      0.5
OVER                1      238        239      0.4
UNCLE               2       22         24      8.3
US                  2      552        554      0.4

Total             520   18,516     19,036      2.7
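
The overlap between the two lexeme inventories can be checked mechanically. The following Python sketch, given for illustration with assumed variable names, enters the lexemes of Tables 5 and 7 as two sets and derives the items attested with /h/ in both corpora by set intersection.

```python
# Lexemes attested with inserted /h/ in the HC (Table 5).
hc_lexemes = {
    "abide", "ability", "able", "abound", "abundance", "abundant",
    "abundantly", "allow", "am", "ask", "every", "in", "it", "itself",
    "old", "our", "until", "up", "upon",
}

# Lexemes attested with inserted /h/ in the CEECS (Table 7).
ceecs_lexemes = {
    "abandon", "able", "ache", "ale", "all", "am", "answer", "any",
    "arms", "as", "at", "it", "itself", "one", "order", "other",
    "over", "uncle", "us",
}

# Lexemes that occur with /h/ in both data samples.
shared = sorted(hc_lexemes & ceecs_lexemes)

# Lexemes that occur with /h/ in only one of the two samples.
hc_only = sorted(hc_lexemes - ceecs_lexemes)
ceecs_only = sorted(ceecs_lexemes - hc_lexemes)

print(len(hc_lexemes), len(ceecs_lexemes))
print(shared)
```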

 

Two of the lexemes, arms, occurring with the /h/ prefix two out of twenty-eight 
times, and uncle, two out of twenty-four times, produce higher insertion 
rates (7.1% and 8.3%, respectively). However, these lexemes require further 
research on a larger data sample in order to establish how robust and frequent 
this insertion process is. Regarding the lexemes that appear in both data 
samples, the verb form am shows an insertion rate of 1.3 per cent (occurring 
fourteen out of 1,076 times), one percentage point lower than in the HC 
(2.3%). The adjective able, on the other hand, produces a very high insertion 
rate of 13.9 per cent (appearing twenty-five out of 180 times), which is 
considerably higher than the 2.6 per cent rate from the HC results. 

Although the two corpora show some similarities as well as differences in 
both lexeme choice and insertion rates, the most interesting finding in both 
data samples is the case of the pronoun it (see Table 8). In the CEECS, this 
lexeme appears prefixed by /h/ 450 times out of the total 4,247 tokens, 
yielding an insertion rate of 10.6 per cent. The pronoun accounts for 86.5 per 
cent of all /h/ prefixed occurrences in the sampler corpus. In the HC, the pronoun 
it has a 12.4 per cent insertion rate (occurring 1,214 out of 9,815 times). Even 
though the occurrences, token numbers and the size of the examined data 
samples are 50–60 per cent smaller for the CEECS compared to the HC, the 
results are similar, and the extracted word form variants are nearly identical. 
The personal pronoun it is definitely a promising candidate for further and 
more granular research as it produces consistent results across two very 
different data samples. Moreover, the reflexive pronoun itself may also provide 
some insight if examined in more detail on a larger corpus. 
 
Table 8. it: /h/ insertion rates across the corpora 

Corpus        /h/ (N)    Ø (N)  Total (N)  /h/ (%)

CEECS             450    3,797      4,247     10.6
HC              1,214    8,601      9,815     12.4
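
For the four lexemes attested with /h/ in both corpora, the rates can be set side by side. The Python sketch below is an illustration with assumed names: it recomputes the insertion rates from the counts in Tables 5 and 7 and expresses the difference between the CEECS and the HC in percentage points. It is a descriptive comparison only and does not test whether the differences are statistically significant.

```python
# (h_tokens, total_tokens) for the lexemes found in both corpora (Tables 5 and 7).
shared_counts = {
    "able":   {"HC": (4, 153),     "CEECS": (25, 180)},
    "am":     {"HC": (21, 920),    "CEECS": (14, 1076)},
    "it":     {"HC": (1214, 9815), "CEECS": (450, 4247)},
    "itself": {"HC": (4, 19),      "CEECS": (1, 10)},
}

def rate(h_tokens, total_tokens):
    """Insertion rate in per cent, rounded to one decimal place."""
    return round(100 * h_tokens / total_tokens, 1)

rates = {}
differences = {}
for lexeme, counts in shared_counts.items():
    hc_rate = rate(*counts["HC"])
    ceecs_rate = rate(*counts["CEECS"])
    rates[lexeme] = (hc_rate, ceecs_rate)
    # Positive values: higher rate in the CEECS; negative: higher in the HC.
    differences[lexeme] = round(ceecs_rate - hc_rate, 1)

for lexeme, (hc_rate, ceecs_rate) in rates.items():
    print(f"{lexeme:<8}{hc_rate:>6}{ceecs_rate:>6}{differences[lexeme]:>7}")
```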

 
4. Conclusion: What we know and need to know  
 
Our analysis of inserted /h/ in two historical corpora has provided some first 
evidence of historical variation and change. We found that /h/ insertion 
operated on a limited set of lexical items only. The comparison of the HC and 
the CEECS suggests that there are some parallels: first, the overall 
insertion rate is low yet similar in the two corpora; second, /h/ insertion is 
reported in identical lexical items (which in our view attests to the robustness 
of the process); and third, there are similarities regarding the diachronic 
manifestation, particularly when it comes to obsolescence. For one, based on 
the criteria selected here, we show that /h/ was inserted on English as well as 
French loanwords, nouns, adjectives, verbs, adverbs, and numerals, and that 
there was a positive match of identical lexical items in the two corpora (able, 
am, it, itself). However, the overall token count was low (in fact even lower had 
we disregarded it, which deserves special mention) so we could not carry out a 
detailed variationist analysis and can (for the moment) only provide a 
superficial understanding of the internal and external factors that correlate 
with variation. As for timing, our analysis indicates that there was ongoing 
obsolescence of the feature: the earliest period (1150–1250) displays by far the 
highest /h/ insertion rate, the periods 1250–1350 and 1420–1500 display 
moderately elevated rates, and the remaining periods up to 1640 show a 
substantial decrease. This supports existing claims (e.g. 
Milroy 1992) and suggests that /h/ insertion became less frequent from the 
Early Middle English period onwards. 

There are a few desiderata for future research in historical sociolinguistics. 
For one, one would need to apply this methodological framework to the 
analysis of other corpora. Given that our findings indicate a decline in the 
14th century, LAEME would be an ideal starting point. Not only would this allow 
us to push the timeframe back by about 150 years, but such a study would also 
bring to light considerably higher numbers of tokens. This ideally would 
enable us to conduct a variationist analysis so that we can take a further step in 
order to unwrap the envelope of variation. Such knowledge would help us to 
contextualise synchronic findings, namely data from present-day varieties of 
English where inserted /h/ has survived (Tristan da Cunha in the South 
Atlantic, Palmerston Island in the Pacific, various Caribbean settings) in order 
to locate founder processes and, though this may be a long shot, contact-
induced innovation mechanisms that operate in the formation of Englishes 
around the globe. Finally, scrutinising lexical variation throughout the history 
of English may provide us with more information on the possible origins of 
/h/ insertion in English (i.e. whether this originated as a contact-induced 
phenomenon with French or whether it was present already in Old English, 
which would point to earlier origins or perhaps even feature legacy from the 
Germanic dialects that served as inputs to Old English). We hope that this 
paper provides a further stepping stone for these research questions. 
 


 
References 
 
Batchelor, T. 1809: An Orthoëpical Analysis of the English Language. 

CEEC: Corpus of Early English Correspondence. 1998: T. Nevalainen, H. 
Raumolin-Brunberg, J. Keränen, M. Nevala, A. Nurmi & M. Palander-Collin comps. 
Helsinki, Department of Modern Languages, University of Helsinki. 

Ellis, A. 1889: On Early English Pronunciation. Part V. London, Truebner and Co. 

HC: Helsinki Corpus of English Texts. 1991: M. Rissanen, M. Kytö, L. Kahlas-Tarkka, 
M. Kilpiö, S. Nevanlinna, I. Taavitsainen, T. Nevalainen & H. Raumolin-Brunberg 
comps. Helsinki, Department of Modern Languages, University of Helsinki. 

Jespersen, O. 1949: A Modern English Grammar on Historical Principles. Vol. I: Sounds 
and Spellings. London, Allen and Unwin. 

Jones, C. 2006: English Pronunciation in the Eighteenth and Nineteenth Centuries. 
Basingstoke, Palgrave Macmillan. 

Kytö, M. ed. 1996: Manual to the Diachronic Part of the Helsinki Corpus of English 
Texts: Coding Conventions and Lists of Source Texts. 3rd ed. Helsinki, Department 
of English, University of Helsinki. 

Labov, W. 1982: Building on Empirical Foundations. In W. P. Lehmann & Y. 
Malkiel eds. Perspectives on Historical Linguistics. Amsterdam & Philadelphia, John 
Benjamins: 17–92. 

LAEME: A Linguistic Atlas of Early Middle English, 1150–1325. 2013–: M. Laing 
comp. Version 3.2. Edinburgh, The University of Edinburgh. 
http://www.lel.ed.ac.uk/ihd/laeme2/laeme2.html. 

Lass, R. 1992: Phonology and Morphology. In N. Blake ed. The Cambridge History of 
the English Language. Vol. II: 1066–1476. Cambridge, Cambridge University Press: 
23–155. 

Luick, K. 1964: Historische Grammatik der englischen Sprache. Oxford, Blackwell. 

MED: Middle English Dictionary. 1959: R. E. Lewis, ed. Ann Arbor, University of 
Michigan Press. 

Milroy, J. 1992: Linguistic Variation and Change: On the Historical Sociolinguistics of 
English (Language in Society 19). Oxford & Cambridge, Blackwell. 

Minkova, D. 2013: A Historical Phonology of English. Edinburgh, Edinburgh University 
Press. 

Mugglestone, L. C. 1995: “Talking Proper”: The Rise of Accent as Social Symbol. Oxford, 
Clarendon Press. 

Mugglestone, L. C. 2006: English in the Nineteenth Century. In L. C. Mugglestone 
ed. The Oxford History of English. Oxford, Oxford University Press: 274–304. 

Schreier, D. Forthcoming: Tracking Language Change via Dialect Transplantation: 
1,200 Years of /h/ Insertion in English. 

Trudgill, P., D. Schreier, D. Long & J. P. Williams 2003: On the Reversibility of 
Mergers: /w/, /v/ and Evidence from Lesser-Known Englishes. Folia Linguistica 
Historica 37 (Issue Historica Vol. 24.1–2): 23–46. 

Walker, J. 1791: A Critical Pronouncing Dictionary (R. C. Alston, English Linguistics 
1500-1800 117). Menston, The Scolar Press. 

Wallis, C. <h> in Old English. (Unpublished ms.) 

 
Author’s address 
English Department 
University of Zurich 
Plattenstrasse 47 
CH-8032 Zürich 
Switzerland 
e-mail: schreier@es.uzh.ch, milan.markovic2@uzh.ch, sasa.petrovic@uzh.ch  
 received: 15 March 2018 
 revised version accepted: 5 April 2018