4. Linguistic data in morphosyntax

4.2. Methods of data collection

There are three main ways we collect data for use in morphology and syntax: corpus studies, linguistic elicitation, and experimentation.

Types of data collection

Corpus studies

A corpus study is based on a collection of found real-world data. A corpus (plural: corpora) can be compiled from written text, such as a collection of social media posts, newspaper articles, or books, or it can be compiled from video or audio recordings, such as a set of talk show interviews, political speeches, or podcasts.

Sometimes researchers compile their own corpora, but there are also several corpora that have already been compiled and are available for use, such as the Corpus of Contemporary American English (COCA).

If a word or construction has been found in a corpus, we say it is attested. If it has not been found, we say it is unattested.
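To make the idea of attestation concrete, here is a minimal sketch of a corpus search in Python. The three-sentence corpus is invented for illustration, and the function names `find_attestations` and `is_attested` are our own, not part of any corpus tool. Real corpora such as COCA are searched through their own interfaces.

```python
import re

# Toy corpus: invented sentences standing in for real found data.
TOY_CORPUS = [
    "The prime minister gave a speech in Ottawa.",
    "She ate a salami sandwich on the train.",
    "The speech was broadcast on the radio.",
]

def find_attestations(corpus, phrase):
    """Return the sentences in which `phrase` occurs as a whole-word sequence."""
    # \b anchors the match at word boundaries; re.escape treats the phrase literally.
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    return [sentence for sentence in corpus if pattern.search(sentence)]

def is_attested(corpus, phrase):
    """A phrase counts as attested if it is found at least once in the corpus."""
    return len(find_attestations(corpus, phrase)) > 0

print(is_attested(TOY_CORPUS, "salami sandwich"))  # True
print(is_attested(TOY_CORPUS, "tuna sandwich"))    # False
```

Note that "tuna sandwich" is unattested only in this particular corpus. A larger or different corpus could give a different answer.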

Linguistic elicitation

In linguistic elicitation, linguists work with a user of a language to collect linguistic data from that language. Elicitation is often bilingual, which means that data collection is mediated by another language. The linguist may ask the language user to translate a word, phrase, or sentence into the target language, or the linguist may construct examples in the target language and ask the language user if the constructed examples are acceptable.

Linguistic elicitation can also be performed when the linguist and the language user have no language in common, although it is more difficult. Daniel Everett gave a demonstration of how to perform a monolingual elicitation at the 2013 Linguistic Institute; a recording is available online (Everett 2013). To perform a monolingual elicitation, you can use props or images to elicit words, and you can act out situations or use videos to elicit sentences.

Although bilingual elicitation is a lot easier, it also increases the likelihood that the language user will be influenced by the mediating language. For example, if two word orders are permitted in the target language, a language user might be more likely to use the word order that matches with the word order of the mediating language, even if it is the less common one in the target language.

Another method of data collection that is closely related to elicitation is introspection. In introspection, you are a speaker of the language that you are studying, and you construct your own data and evaluate it based on your own judgments.

Experimentation

An experiment is a highly controlled procedure, usually done in a research lab. The researcher recruits multiple research participants, asks them to perform the same task under conditions that are as similar as possible, and observes their behaviour.

Experiments in morphosyntax can be simple, such as a survey that collects grammaticality judgments from a large number of people, or more complex, such as an eye-tracking study that records where research participants are looking when reading a sentence. In such a study, a pause or slower reaction time can be a clue that the structure is more complex.

Classification of data collection methods

We can classify these methods of data collection by whether they collect observational data or targeted data. Targeted data is data that is specifically sought out by the researcher to test a hypothesis. Both elicitation and experimentation are targeted. In elicitation, the researcher asks the language user for specific constructions that provide evidence for or against their hypothesis. In experiments, the researcher designs the methods of their experiment to provide evidence supporting or contradicting their hypothesis. Observational data, on the other hand, is naturally occurring data that has been observed. Corpus studies use observational data, as the data was created for independent purposes and later analyzed.

We can also classify these methods of data collection by whether the data they produce is categorical or quantitative. Categorical data is sorted into categories, such as grammatical or ungrammatical. Elicitation results in categorical data, based on the judgments of the language user. Quantitative data is data that has been counted and statistically analyzed. Corpus studies result in quantitative data, such as the frequency of a particular word or construction. Experiments also result in quantitative data, such as the rate at which research participants exhibit a particular behaviour.

Table 1. Classification of data collection methods
                     | Categorical data | Quantitative data
Observational data   |                  | Corpus study
Targeted data        | Elicitation      | Experiment
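The difference between categorical and quantitative data can be illustrated with a short Python sketch. All of the data here is invented, and the function names `frequency_per_million` and `acceptance_rate` are hypothetical helpers written for this example, not a standard tool. Frequency per million words is one common way to report corpus counts, but it is only one possible measure.

```python
from collections import Counter

def frequency_per_million(corpus_tokens, word):
    """Quantitative corpus measure: occurrences of `word` per million tokens."""
    counts = Counter(token.lower() for token in corpus_tokens)
    return counts[word.lower()] / len(corpus_tokens) * 1_000_000

def acceptance_rate(judgments):
    """Quantitative experimental measure: share of participants who accepted a sentence."""
    return sum(judgments) / len(judgments)

# Categorical data: a single elicited judgment is just a category label.
elicited_judgment = "ungrammatical"

# Invented toy data for the two quantitative measures.
tokens = "the cat saw the dog and the dog saw the cat".split()
survey = [True, True, False, True]  # one hypothetical judgment per participant

print(elicited_judgment)
print(frequency_per_million(tokens, "the"))
print(acceptance_rate(survey))  # 0.75
```

The elicited judgment is a label with no number attached, while the corpus and survey results are counts that can be compared and analyzed statistically.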

Which methods are best?

Each of the methods has different strengths and weaknesses. Which method is most appropriate for a given study will vary based on the research question and several other factors. It’s best, if possible, to use multiple methods, in the hope that the different methods will all converge on the same results!

In the rest of this section, we will look at two factors that should be considered when choosing a method: negative evidence and the resources of the language community.

Negative evidence

In Section 2.2, we learned that negative evidence, which is evidence that something is not possible, did not occur, or is absent, is important for linguistic analysis. Which methods might produce negative evidence?

Let’s do a little thought experiment. Suppose we searched the entire publishing history of The Toronto Star for the two sentences in (1) and didn’t find either of them.

(1) a. The prime minister of Canada ate a salami sandwich.
b. *The ate of salami a Canada ministerial prime sandwich.

Does this mean that neither (1a) nor (1b) is a possible sentence? According to my intuition, sentence (1a) is a possible sentence that just hasn’t been used by The Toronto Star. On the other hand, sentence (1b) is not a possible sentence of English.

We can’t tell the difference between a sentence that is possible but unattested and a sentence that is not possible at all just by looking at the sentences that have been produced in the past. Since language is creative and productive, not all possible sentences have been produced in the past. Arguably, most possible sentences haven’t been produced yet. Because of this, no conclusions can be drawn if the pattern you are looking for doesn’t show up in your corpus. As Carl Sagan famously said, “Absence of evidence is not evidence of absence.”

Corpus studies, since they result in observational data and not targeted data, cannot provide negative evidence. Elicitation and experimentation, on the other hand, can provide negative evidence, depending on their design.
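The thought experiment can be sketched in Python. The two-sentence archive below is an invented stand-in for a newspaper archive, and the judgments in `ELICITED` are hypothetical values that simply mirror the intuitions reported above for (1a) and (1b).

```python
# Toy stand-in for a newspaper archive (invented sentences).
ARCHIVE = [
    "The prime minister of Canada visited Toronto.",
    "A local deli sold a record number of salami sandwiches.",
]

SENTENCE_1A = "The prime minister of Canada ate a salami sandwich."
SENTENCE_1B = "The ate of salami a Canada ministerial prime sandwich."

def is_attested(archive, sentence):
    """True if the exact sentence appears in the archive."""
    return sentence in archive

# Hypothetical elicited judgments, mirroring the intuitions reported in the text.
ELICITED = {SENTENCE_1A: "acceptable", SENTENCE_1B: "unacceptable"}

for sentence in (SENTENCE_1A, SENTENCE_1B):
    # The corpus search gives the same answer for both sentences;
    # only the elicited judgment tells them apart.
    print(is_attested(ARCHIVE, sentence), ELICITED[sentence])
```

Both searches return `False`, so the corpus result alone cannot separate the possible sentence from the impossible one. The targeted judgment is what supplies the negative evidence.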

Resources of the language community

Another important factor to consider when choosing a data collection method is the resources of the language community. If you are doing a study on English, it is easy to recruit a large number of English speakers to participate in a research study, and there are multiple electronic corpora that can be searched with just a few clicks. But English is one of the best-resourced languages in the world, and the situation is not the same for many minority and endangered languages. However, minority and endangered languages are crucial for helping us understand the breadth of the diversity of human language!

“Minority and endangered languages are exactly the languages for which it may be impossible to gather large numbers of participants, and where the absence of literacy, the age of the speakers, and other factors make certain types of experiments unfeasible.” (Davis et al. 2014: e187)

Even within well-resourced languages like English, corpora have biases. There are numerous dialects of English, which may all get mixed together in a single corpus, obscuring the differences between the dialects and making it unclear which groups of people use which language patterns. Minority dialects might not show up at all, or show up infrequently enough that they get excluded by the statistical analysis.
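A small arithmetic sketch in Python shows how pooling dialects can hide a pattern. The counts are invented, and "dialect_A", "dialect_B", and "construction X" are placeholders rather than real varieties or constructions.

```python
# Invented counts: how many sampled sentences contain a hypothetical
# construction X, for two hypothetical dialects in one mixed corpus.
SAMPLES = {
    "dialect_A": {"sentences": 9_500, "with_x": 0},
    "dialect_B": {"sentences": 500, "with_x": 50},
}

def rate(with_x, sentences):
    """Proportion of sentences that contain construction X."""
    return with_x / sentences

def pooled_rate(samples):
    """Rate of construction X when all dialects are mixed together."""
    total_x = sum(d["with_x"] for d in samples.values())
    total_sentences = sum(d["sentences"] for d in samples.values())
    return rate(total_x, total_sentences)

def per_dialect_rates(samples):
    """Rate of construction X computed separately for each dialect."""
    return {name: rate(d["with_x"], d["sentences"]) for name, d in samples.items()}

print(pooled_rate(SAMPLES))        # 0.005
print(per_dialect_rates(SAMPLES))  # {'dialect_A': 0.0, 'dialect_B': 0.1}
```

In the pooled corpus the construction looks vanishingly rare, even though in this toy data one dialect uses it in one sentence out of ten. An analysis that discards rare patterns could miss it entirely.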

References and further resources

For linguistics students

🔍 Everett, Daniel. 2013. Monolingual demonstration. Linguistic Institute. https://www.youtube.com/watch?v=sYpWp7g7XWU

Academic sources

📑 Davies, Mark. 2008–. The Corpus of Contemporary American English (COCA). https://www.english-corpora.org/coca

Davis, Henry, Carrie Gillon, and Lisa Matthewson. 2014. How to investigate linguistic diversity: Lessons from the Pacific Northwest. Language 90(4). e180–e226.
