What is a corpus?

A corpus is a collection of texts. More specifically, in the words of Sinclair, it is "a collection of naturally-occurring language text, chosen to characterize a state or variety of a language" (1991, p. 171). Beyond this definition, there is today a growing consensus that a corpus is a collection of machine-readable, authentic texts sampled to be representative. Firstly, to say that the texts are machine-readable means that they can be manipulated and searched with the help of a computer, using some kind of specialised interface. Secondly, to say that the texts are authentic means that they have been taken from original sources of written and spoken language, such as published books, periodicals, reports, lectures, talks, meetings, speeches, sermons, and sport commentaries. Finally, to say that they are representative means that the collected texts should ideally represent a particular language variety.

The texts of a corpus are thus normally assembled with particular purposes in mind. For example, the British National Corpus (BNC) is a multi-purpose corpus consisting of approximately 100 million words. One of the main aims in constructing the corpus was to create a resource that would reflect contemporary British English in its various social and generic uses (Kennedy 1998; Meyer 2002). The majority of the BNC consists of written British English material (about 90 per cent), but there is also a smaller part made up of spoken British English material (about 10 per cent). The material is divided into 4124 so-called documents, where each document contains a sample of either written text or transcribed spoken discourse, and where a variety of different genres are represented. Most samples contain between 40,000 and 50,000 words (Aston & Burnard 1998, p. 28). The written material was collected between 1960 and 1993, but no data are given as to when the spoken material was recorded.

When using a corpus, it is very common to retrieve concordances. Concordances, or concordance lines, as they are also called, are a compilation of examples that a computer presents as the result of a search that we have specified. The advantage of concordance lines is that they show the context in which a word or phrase occurs. The size of the context can vary from a number of words on each side of the node word (the word that is searched for) to a whole sentence or even several sentences. This makes it possible to see how a word is used in an authentic context.

For example, a writer of a text may want to know how a particular word is used in English. Let us assume that it is not clear which preposition should be used with the word interested. From a Swedish contrastive perspective, the Swedish construction would be intresserad av. Does this mean that the phrase interested of should be used in English? We can use a concordancer (a programme with which we search the corpus) to find this out by entering our word in the dialogue box of the concordancing software. In response to our search, the programme will display a set of concordance lines which, according to our settings, will show the word we were looking for together with a specified amount of context. The example below shows ten concordance lines from a search for the sequence interested, followed by a preposition. The example has been taken from the Corpus of Contemporary American English (COCA) (Davies 2008).

The ten concordance lines shown in the example above all indicate that the preposition in seems to be commonly used with the adjective interested. Furthermore, a feature in the particular corpus used in the example (COCA) allows us to also retrieve frequency values for the searches we make. For example, the programme can tell us how many instances of interested in there are in the corpus, compared to instances of the word interested followed by any other English preposition. The example below supplies the figures for a search for the sequence interested + any word classified as a preposition in the corpus texts.

As can be seen in the above example, the preposition in overwhelmingly dominates, with 22,733 instances in which it follows the word interested. Far behind comes the second most frequently used preposition, at, with 27 instances. We can safely conclude that the most frequent preposition following the word interested is in. However, we also see that other prepositions have been used. These occur only in very small numbers, and some of them might even be incorrect uses, or cases in which the preposition is not connected to interested but to the following phrase.
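The two kinds of search described above (concordance lines and frequency counts for the word following the node word) can be sketched in a few lines of Python. This is only a minimal illustration, not how COCA or any particular concordancer works: the four sentences are invented, the function names are hypothetical, and the short preposition list stands in for the part-of-speech tagging that a real corpus would provide.

```python
import re
from collections import Counter

# Toy corpus invented for illustration; not drawn from COCA or the BNC.
TEXTS = [
    "She is interested in corpus linguistics and in grammar.",
    "They were interested in the results of the search.",
    "He seemed interested by the idea at first.",
    "Nobody was interested in buying the old car.",
]

# Assumption: a small hand-made list replaces real part-of-speech tags.
PREPOSITIONS = {"in", "at", "by", "of", "with", "to", "for", "about"}


def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(texts, node, width=3):
    """Return one line per hit, with `width` words of context on each side."""
    lines = []
    for text in texts:
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == node:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append(f"{left:>25}  [{tok}]  {right}")
    return lines


def following_words(texts, node, candidates):
    """Count which candidate words occur directly after the node word."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for tok, nxt in zip(tokens, tokens[1:]):
            if tok == node and nxt in candidates:
                counts[nxt] += 1
    return counts


for line in concordance(TEXTS, "interested"):
    print(line)

print(following_words(TEXTS, "interested", PREPOSITIONS).most_common())
```

In this toy data, in follows interested three times and by once, which mirrors on a very small scale the kind of frequency comparison discussed above. The `width` parameter corresponds to the amount of context chosen in the concordancer's settings.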

The term collocation is typically used to describe the frequent co-occurrence of words in a text. Through hundreds of years of language use, certain combinations of words become conventionalised. This means that native speakers of a language tend to use a limited number of preferred ways in which a certain situation, event or phenomenon is described. Thus, even though a language in theory offers a large number of possible word combinations, it seems that only a small number of these are actually used. Collocation is important since it tells us what word combinations are frequently used in a language. For example, if we happen to have a headache and want to tell other people about this, we may choose to say that we simply have a headache. However, headaches can differ in intensity so we might want to modify the word headache with an adjective. The question, then, is how do native speakers of English typically describe headaches? A search in a corpus can tell us what the most common adjective collocates of the word headache are.

major headache 15
bad headache 11
severe headache 11
throbbing headache 8

The above word combinations (adjective + noun) are examples retrieved from a search in the British National Corpus. We learn that a headache can be major, bad, severe or throbbing. The most frequent collocation seems to be major headache. If we look more closely at the list of possible collocates we also find examples like:

splitting headache 6
slight headache 4
terrible headache 3

When investigating what word combinations (collocations) are frequently used, it is common to inspect something called concordance lines, through which it is possible to see the exact contexts in which the collocation occurs in the corpus.
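A collocate list of the kind shown for headache can be approximated by counting the word that occurs immediately before the node word. The following Python sketch rests on several assumptions: the sentences are invented, so the counts say nothing about the BNC; the function name is hypothetical; and a small stop list of function words is used in place of the part-of-speech tags that would normally identify adjectives.

```python
import re
from collections import Counter

# Invented sentences for illustration only; the counts do not reflect the BNC.
SENTENCES = [
    "I woke up with a splitting headache.",
    "The new tax rules are a major headache for small firms.",
    "She complained of a severe headache and nausea.",
    "Parking has become a major headache in the city centre.",
    "He had a bad headache after the flight.",
]

# Hypothetical stop list: function words that should not count as collocates.
FUNCTION_WORDS = {"a", "an", "the", "my", "his", "her", "of", "with"}


def left_collocates(sentences, node):
    """Count the words that occur immediately before the node word."""
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        for prev, tok in zip(tokens, tokens[1:]):
            if tok == node and prev not in FUNCTION_WORDS:
                counts[prev] += 1
    return counts


# Print the collocates in descending order of frequency.
for word, freq in left_collocates(SENTENCES, "headache").most_common():
    print(f"{word} headache {freq}")
```

With a part-of-speech tagged corpus, the stop list would be replaced by a check that the preceding word is tagged as an adjective, which is closer to the adjective + noun search described above.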

Clearly, using a corpus when looking for answers to questions about English words and grammar is a great method. However, like most methods it has its flaws and disadvantages. Lindquist (2009, p. 10) raises a number of caveats:

Since the number of possible sentences in a language is infinite, corpora will never be big enough to contain everything that is known by a speaker of a language.
Some of the findings may indeed be trivial.
The intuition of a native speaker will always be needed to identify what is grammatical and what is not.
Corpora [may] contain all kinds of mistakes, speech errors etc. which may have to be disregarded.

On the whole, then, it is important to remember that just because something you search for does not exist in a specific corpus, it does not mean that that particular word or phrase does not exist at all in the language at hand. Conversely, just because you find an example of a phrase in a corpus, it does not mean that it is used by native speakers of that language. This is especially so when the phrase occurs only one or a couple of times. It could be that this is a case of a mistake or error.
