10.4. Wordhood: An unsolved problem

Julianne Doner

In Section 10.2, we learned about how the word word can be used in many different ways. In addition, we defined many technical terms we can use to distinguish these different uses. But we still haven’t defined what a word is. There is a reason for that! The definition of word is an unsolved problem in linguistics. There is no definition of word that works for all known languages.

Some attempts at defining word

In this section, we will consider many attempts at definitions of word and discuss how they fall short. Before you read this section, though, take a moment and attempt to write a definition of word. As you read through these possible definitions, compare them to the one you wrote. Do any of the problems discussed here also apply to your definition?

Based on spelling

We could attempt to define words based on how they appear in writing. One possible definition, then, could be as in (1).

(1) Definition attempt 1: A string of letters written with a space on each end.

The first problem with defining word based on orthography is that not all languages are written down. Definition (1), then, would mean that languages without a writing system would have no words at all, which does not match speaker intuition. Even languages with a writing system don’t all indicate word boundaries. Classical Chinese, Ancient Greek, and Classical Latin were all written without spaces between words.

In languages that have a writing system that uses spaces, the use of spaces doesn’t always correspond exactly to word units. For example, English compound words can be written as one word, with a hyphen, or with a space, as illustrated in (2).

(2) a. Compound words with spaces: hot dog, high school, common sense, dining room, first aid, peanut butter, post office, prime minister, search engine, remote control, washing machine, role model

b. Hyphenated compound words: sister-in-law, check-in, far-fetched, free-for-all, know-how, merry-go-round, one-sided, well-being, up-to-date, self-esteem, freeze-dried, cage-free

c. Compound words without spaces: greenhouse, airport, bathroom, basketball, daylight, desktop, firefly, grandmother, grapefruit, grasshopper, lipstick, mailbox

Whether a compound word is written with a space, with a hyphen, or as one word is just convention, and it can change over time. For example, in older texts, you may see ice-cream written with a hyphen, whereas in modern texts it tends to be written with a space instead, as ice cream. However, there is evidence, as we will discuss in the chapter on compounding, that compound words behave as a single word, regardless of how they are written.
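To see concretely how definition (1) treats the compounds in (2), here is a minimal Python sketch. It assumes, purely for illustration, that an orthographic word is whatever Python's built-in str.split() returns when it splits at whitespace. The function name orthographic_words is hypothetical and is not part of the analysis above.

```python
def orthographic_words(text):
    """Apply definition (1): split a string wherever there is whitespace."""
    return text.split()


# One compound from each group in (2): spaced, hyphenated, and solid.
compounds = ["hot dog", "sister-in-law", "greenhouse"]

for compound in compounds:
    units = orthographic_words(compound)
    print(f"{compound!r}: {len(units)} unit(s) -> {units}")
```

Under this definition, hot dog comes out as two units while greenhouse comes out as one, even though both are compounds of the same general kind. The count tracks spelling convention rather than any property of the compounds themselves.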

Based on uninterruptibility

We could attempt to define words based on their interruptibility. Perhaps a word is a unit that cannot have anything inserted into its middle.

(3) Definition attempt 2: A unit that cannot have anything added to it, except at the edges.

One of the big problems with this definition is that it doesn’t account for infixes. For example, the Lakhota first-person singular subject marker -wa- is an infix.^[1] This is shown in (4), where the roots máni, aphé, and hoxpé have -wa- inserted between the first and second syllables to form the first-person singular.

(4)	a.	máni	‘he walks’	ma-wá-ni	‘I walk’
	b.	aphé	‘he hits’	a-wá-phe	‘I hit’
	c.	hoxpé	‘he coughs’	ho-wá-xpe	‘I cough’

(Lakhota; Albright 2000: 2)

If we maintain the definition in (3), that a word is a unit that can only be added to at its edges, data like (4) would force us to conclude that the roots máni, aphé, and hoxpé each consist of two words. This is problematic, though, because each root contains only one morpheme, that is, one piece of meaning! How could one morpheme be spread across two words?
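The infixation pattern in (4) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the stems are typed in by hand as lists of syllables, stress marks are left out because the sketch does not model where stress falls, and only the infixing stems from (4) are covered. The function name add_first_singular is hypothetical.

```python
def add_first_singular(syllables):
    """Insert the marker 'wa' after the first syllable of a stem.

    `syllables` is a hand-segmented list of syllables without stress marks.
    """
    first, rest = syllables[0], syllables[1:]
    return "-".join([first, "wa"] + rest)


# The three roots from (4), segmented into syllables by hand.
stems = [["ma", "ni"], ["a", "phe"], ["ho", "xpe"]]

for stem in stems:
    print("".join(stem), "->", add_first_singular(stem))
```

In every output, the marker sits inside the root rather than at either edge, which is exactly the situation that definition (3) rules out.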

Another challenge to the idea that a word is not interruptible comes from phrasal verbs. Phrasal verbs, such as turn on, blow up, break down, and call off, are a subcategory of compound words that usually consist of a verb and a preposition. Phrasal verbs can have past tense markers inserted in between their two components, as in called off. Phrasal verbs are also well-known for allowing the object to occur in between the verb and the preposition, as shown in (5) to (8).

(5)	a.	turn on the lights
	b.	turn the lights on
(6)	a.	blow up the building
	b.	blow the building up
(7)	a.	break down the box
	b.	break the box down
(8)	a.	call off your dog
	b.	call your dog off

English phrasal verbs, on their own, are not the strongest piece of evidence, since we could argue that they are two words, not one.

German has a similar pattern that is even more puzzling.^[2] In German, verbs with separable prefixes are written as one word when there is an auxiliary, for example aufstehen in (9b) and zurückgeben in (10b). But in (9a) and (10a), when there is no auxiliary, the word is split into two pieces which aren’t even next to each other! The verb portions of the particle verbs, stehst in (9a) and gebe in (10a), are the second word in each sentence, while the preposition portions of the particle verbs, auf in (9a) and zurück in (10a), are at the end of each sentence.

(9)	German
	a.	Wann	stehst	du	morgen	auf?
		when	stand.pres.2sg	you	tomorrow	up
		‘When will you get up tomorrow?’

(9)	b.	Wann	willst	du	morgen	aufstehen?
		when	want.pres.2sg	you	tomorrow	stand.up
		‘When do you want to get up tomorrow?’

(10)	German
	a.	Ich	gebe	meiner	Tante	ihr	Buch	zurück.
		I	give.pres.1sg	1sg.dat.f.poss	aunt	3sg.f.dat	book	back
		‘I give my aunt her book back.’

(10)	b.	Ich	muss	meiner	Tante	ihr	Buch	zurückgeben.
		I	must.pres.1sg	1sg.dat.f.poss	aunt	3sg.f.dat	book	give.back
		‘I have to give my aunt her book back.’

Based on independence

Perhaps we should define words, not based on whether we can interrupt them, but based on whether they can stand on their own.

(11) Definition attempt 3: A unit that can be pronounced in isolation.

The problem with this definition is, first of all, that some subcomponents of words can stand on their own. For example, clippings like math, flu, and fridge can stand on their own. Does that mean that mathematics, influenza, and refrigerator are not words, but phrases? Our intuitions (or mine at least) say that can’t be right, because that would mean that the parts of mathematics, influenza, and refrigerator that don’t show up in math, flu, and fridge are also words.

Secondly, some words, especially functional words, can’t really stand on their own, such as the and of.

Based on the interface between semantics and phonology

Perhaps, then, we could define words based on their function as a form-meaning pair. In that case, a word would be a string of sounds or signs that, together, are associated with a meaning.

(12) Definition attempt 4: A form-meaning pair.

However, this definition does not quite work either. There are, of course, form-meaning pairs that are bigger than a word as well as form-meaning pairs that are smaller than a word.

It is easy to find form-meaning pairs that are smaller than a word. Any morpheme is a form-meaning pair, thus any multimorphemic word consists of form-meaning pairs that are smaller than a word. For example, teapot consists of two form-meaning pairs, tea and pot.

There are also form-meaning pairs that are bigger than a word, which are called idioms. Idioms are phrases with non-compositional meaning. In other words, it is the entire phrase that is paired with a meaning. Some examples of English idioms are listed in (13).

(13)	a.	break a leg	to wish someone luck before a performance
	b.	beat around the bush	explain or request something indirectly
	c.	hit the sack	go to bed
	d.	on the ball	prepared, ready
	e.	raining cats and dogs	raining very hard
	f.	spill the beans	tell a secret
	g.	under the weather	sick

Based on phonological domains

Perhaps we can define words based on the limits of phonological processes, such as stress assignment.

(14) Definition attempt 5: The domain in which phonological processes such as stress occur.

The first problem with this definition is that functional words such as the and of often do not receive stress at all, in which case they would not count as words.

The second problem is that the domain of phonological processes varies, both within the same language and across different languages. For example, the two roots in a compound word sometimes behave like they’re in the same phonological domain, and sometimes they do not.

Let’s consider vowel harmony in Finnish, which causes all vowels in the same word to match in backness. In (15), the adessive marker appears as -llä if the vowels in the stem are front vowels, as in (15a), and as -lla if the vowels in the stem are back vowels, as in (15b).

(15)	a.	pöydä-llä
		table-ADESS
		‘on the table’
	b.	kadu-lla
		street-ADESS
		‘in the street’

(Finnish; Julien 2002: 24)

The same pattern does not hold in compound words, as shown in (16). The first root, pää ‘head’, has front vowels, while the second root, kaupunki ‘city’, has back vowels. Vowel harmony does not cross the boundary between the two roots of the compound word.

(16)	pää-kaupunki
	head-city
	‘capital’

(Finnish; Julien 2002: 24)
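The contrast between (15) and (16) can be made concrete with a short Python sketch. It is a deliberately simplified model, not a full account of Finnish harmony: it assumes that ä, ö, and y are the front harmonic vowels, that a, o, and u are the back harmonic vowels, and that e and i are neutral and can be ignored. All function names are hypothetical.

```python
import unicodedata


def nfc(text):
    """Normalize so that letters like 'ä' are single characters."""
    return unicodedata.normalize("NFC", text)


FRONT = set(nfc("äöy"))  # front harmonic vowels (assumption of this sketch)
BACK = set("aou")        # back harmonic vowels; e and i are treated as neutral


def harmony_class(stem):
    """Return 'front' or 'back' based on the last harmonic vowel in the stem."""
    for ch in reversed(nfc(stem.lower())):
        if ch in FRONT:
            return "front"
        if ch in BACK:
            return "back"
    return "front"  # stems with only neutral vowels take front suffixes


def adessive(stem):
    """Attach the adessive suffix that agrees with the stem in backness."""
    suffix = "llä" if harmony_class(stem) == "front" else "lla"
    return nfc(stem + suffix)


def is_single_harmony_domain(form):
    """True if the form does not mix front and back harmonic vowels."""
    letters = set(nfc(form.lower()))
    return not (letters & FRONT and letters & BACK)


print(adessive("pöydä"))                        # the stem from (15a)
print(adessive("kadu"))                         # the stem from (15b)
print(is_single_harmony_domain("pääkaupunki"))  # the compound from (16)
```

The suffixed forms in (15) each pass the single-domain check, but the compound in (16) does not, because it mixes front and back vowels. If we defined a word as the domain of vowel harmony, we would therefore be forced to treat the compound in (16) as two words.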

In contrast, compound words in Greek do behave as a single domain for stress assignment, as shown in (17). In this compound word, there is only one stress, marked by the accent on the final vowel.

(17)	ksilo-θimonyá
	wood-stack
	‘wood stack’

(Greek; Julien 2002: 17)

If this compound word behaved like two separate words, we would expect there to be a stress on each root. Instead, there is only stress on the head of the compound word.

Based on speaker intuition

Let’s attempt one last definition of word. Perhaps we can consider words to be a unit according to the intuition of the language users.

(18) Definition attempt 6: A unit that stands on its own according to the intuition of language users.

The difficulty with this definition is that literate language users don’t normally have intuitions about wordhood that are separate from their writing system. Illiterate language users are becoming more difficult to find, due to global advances in education. The rules from the writing system might reflect underlying grammatical principles, but they also might not.

What do we do now?

We do not have a consistent way of defining words across all contexts and languages. This could mean lots of different things. It could mean that we just haven’t identified the right definition yet. It could mean that there is no universal definition of word, but that it is defined language-by-language. It could also mean that the word word refers to more than one thing, and using the same word in all of these different contexts has confused us! Finally, it could mean that there really is no such thing as words in the grammar at all. Let’s look at these last two ideas more closely.

Different kinds of word

One possible solution is to claim that there are different domains that are sometimes called words: phonological words, grammatical words, and orthographic words. Sometimes these different domains coincide, and refer to the same string, but sometimes they don’t. The different wordhood domains could be defined as follows:

Phonological words are a unit in the prosodic structure of a sentence. They are the domains in which word-level phonological processes occur, including stress assignment. Phonological words are the smallest units that can stand on their own.
Grammatical words are a unit in morphosyntactic structure. They correspond to the terminal nodes in syntax trees. In other words, they are the units that syntax can manipulate. This definition is probably the least satisfying of these three, since we sometimes put morphemes that cannot stand on their own in terminal nodes of syntactic trees.
Orthographic words are a unit we use in writing, based on spelling convention.

In other cases, linguists divide the notion of word into lexeme, word token, and word form, as discussed in Section 10.2.

Maybe words aren’t real

Another possible solution to this problem is to assume that words aren’t actually a real component in our model of human language. Julien (2002) is one linguist who hypothesizes this:

…my working hypothesis in the following will be that ‘word’ in the nonphonological sense is a distributional concept. That is, if a given string of morphemes is regarded as a word, it simply means that the morphemes in question regularly appear adjacent to each other and in a certain order. The reason the morphemes show such behavior is to be found in their syntax. But notably, the structural relation between the morphemes is not directly relevant for the word status of the string; it only matters insofar as some structural arrangements of morphemes may result in independent distribution and internal cohesion, whereas others may not.

Crucially, if wordhood cannot be associated with any particular structural morpheme configuration, it follows that grammar cannot have at its disposal any specific word-forming devices. If a word is just the accidental outcome of the manipulation of morphemes that takes place in syntax, it must be the case that words come into being in our perception; that is, words are perceived rather than formed.

(Julien 2002: 36)

Basically, Julien is arguing that words are just strings of morphemes that appear together frequently enough that we treat them in a special way, rather than units that arise from some special configuration in syntactic structure.
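One way to get a feel for what a distributional notion of word might look like is the following Python sketch. It is not Julien's method, only a toy illustration: the three segmented utterances are invented for this example, and the measure adjacency_rate is a hypothetical stand-in for the idea that some morphemes regularly appear next to each other in a fixed order.

```python
def adjacency_rate(corpus, first, second):
    """Share of utterances containing both morphemes in which
    `second` immediately follows `first`."""
    together = 0
    adjacent = 0
    for utterance in corpus:
        if first in utterance and second in utterance:
            together += 1
            pairs = list(zip(utterance, utterance[1:]))
            if (first, second) in pairs:
                adjacent += 1
    return adjacent / together if together else 0.0


# Invented toy corpus, segmented into morphemes by hand.
corpus = [
    ["turn", "-ed", "the", "light", "-s", "on"],
    ["turn", "-ed", "on", "the", "light", "-s"],
    ["the", "bright", "light", "-s"],
]

for first, second in [("light", "-s"), ("turn", "-ed"), ("the", "light"), ("turn", "on")]:
    print(first, second, round(adjacency_rate(corpus, first, second), 2))
```

In this toy data, light and -s are adjacent every time they co-occur, so they look word-like on this measure, while turn and on are never adjacent. Nothing in the calculation refers to syntactic structure; the word-like behaviour shows up only in the distribution of the morphemes.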

Key takeaways

There is no way to define word that works for all languages because orthographic, phonological, morphological, syntactic, and semantic boundaries don’t all align in the same way across words and across languages.
One possible explanation for this is that phonological words, grammatical words, and orthographic words are three different kinds of units that only sometimes coincide.
Another possible explanation is that words are strings of morphemes that appear together frequently but that don’t correspond to any particular structure.

References and further resources

Academic sources

Albright, Adam. 2000. The productivity of infixation in Lakhota. UCLA Working Papers in Linguistics 0: 1–19.

Julien, Marit. 2002. Syntactic heads and word formation. Oxford: Oxford University Press.

[1] The Lakhota first-person singular subject marker alternates between being a prefix and an infix, depending on the stem.
[2] German data provided by Katharina Pabst.

License

The Linguistic Analysis of Word and Sentence Structures Copyright © 2025 by Julianne Doner is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.