The concept of collocation was introduced in the middle of the last century by J.R. Firth with his famous dictum “You shall know a word by the company it keeps”. Words are not distributed randomly in a text; instead they stick with each other, their ‘company’. Starting in the late 1980s, increased interest in collocation among computational linguists and others working in NLP has led to a proliferation of methods and algorithms for extracting collocations from text corpora.
Typically one starts with the environment of the target (or node) word and collects all the words that occur within a certain distance (or span) of the node. Their frequency in that environment is then compared with their frequency in a reference corpus, and from the ratio of the two we decide whether a word is near the node by chance or because it is part of the node’s company. A bewildering variety of so-called significance functions exists, the oldest probably being the z-score, used by Berry-Rogghe in 1973; later, Church and Hanks (1990) popularised mutual information and the t-score, which now seem to have been displaced by log-likelihood as the predominant measure of word association.
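As an illustration, all three of these measures can be computed from just four numbers: the corpus frequencies of the node and the candidate collocate, their co-occurrence frequency, and the corpus size. The sketch below uses the textbook formulations; the counts in the usage example are invented.

```python
import math

def association_scores(f_node, f_coll, f_pair, n):
    """Common word-association measures for a node/collocate pair.

    f_node: frequency of the node word in the corpus
    f_coll: frequency of the candidate collocate
    f_pair: co-occurrence frequency (collocate within the span of the node)
    n:      corpus size in tokens
    """
    expected = f_node * f_coll / n               # co-occurrences expected by chance
    mi = math.log2(f_pair / expected)            # mutual information
    t = (f_pair - expected) / math.sqrt(f_pair)  # t-score
    # log-likelihood over the full 2x2 contingency table
    obs = ((f_pair, f_node - f_pair),
           (f_coll - f_pair, n - f_node - f_coll + f_pair))
    rows = (f_node, n - f_node)
    cols = (f_coll, n - f_coll)
    ll = 2 * sum(
        o * math.log(o / (rows[i] * cols[j] / n))
        for i, row in enumerate(obs)
        for j, o in enumerate(row)
        if o > 0
    )
    return {"MI": mi, "t": t, "LL": ll}

# invented counts, purely for illustration
scores = association_scores(f_node=100, f_coll=50, f_pair=20, n=100_000)
```

With these (made-up) figures the pair co-occurs 400 times more often than chance would predict, and all three scores come out strongly positive; the point of the example is only that the measures weight the same four counts differently.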
The problem is that all these metrics yield different results, and nobody knows (or can tell) which of them are ‘right’. Mutual information, for example, favours rare words, while the t-score promotes words that are relatively frequent to begin with. But apart from rules of thumb, there is no linguistic justification for preferring one metric to another. It is all rather ad hoc.
Part of the problem is that collocation as a concept is rather underspecified. What does it mean for a word to be ‘significantly more common’ near the node word, as opposed to being there just by chance? In a sense, collocations are just diagnostics: we know there are words to be expected next to bacon, so we look for collocates and find rasher. Fantastic! Just what we expected. But then we look at fire and find leafcutter as a highly significant collocate. How can that happen? What is the connection between fire and leafcutter? The answer is: ants. There are fire ants, and there are leafcutter ants, and they are sometimes mentioned in the same sentence.
This leads us to an issue which I believe gets us on the right track in the end: the fallacy of using the word as the primary unit of analysis. In the latter example we are not dealing with fire and leafcutter; we are concerned with fire ants. Once we realise that, it is perfectly natural to see leafcutter ants as a collocate, whereas we would be surprised to find engine, which instead is a collocate of the lexical item fire.
So phraseology is the key. If we move away from single words and instead consider multi-word units (MWUs), we also have an explanation for collocations. Single words form part of larger MWUs, together with other single words: leafcutter often forms a unit with ants, as does fire. More generally, MWUs such as parameters of the model are made up of several single words, and here we can observe that parameters and model occur together. But they form a single unit of analysis, and only when we break up this unit into single words do we observe that parameters and model commonly co-occurring.
From this we can define a very simple procedure for computing collocations: from a corpus, gather all the MWUs that contain a particular word, compile a frequency list of the single-word items in those MWUs, sort it by frequency, and there we are.
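A minimal sketch of that procedure, assuming the MWUs have already been extracted from the corpus by some other means (the units below are invented toy data, echoing the fire ants example):

```python
from collections import Counter

def mwu_collocates(mwus, node):
    """Collocates of `node`, read off from a list of multi-word units.

    `mwus` is a list of pre-extracted MWUs, each a tuple of tokens.
    Returns the other words in those units, sorted by frequency.
    """
    counts = Counter()
    for mwu in mwus:
        if node in mwu:
            counts.update(w for w in mwu if w != node)
    return counts.most_common()

# invented MWUs for illustration
mwus = [
    ("fire", "ants"),
    ("leafcutter", "ants"),
    ("fire", "ants"),
    ("fire", "engine"),
    ("parameters", "of", "the", "model"),
]
print(mwu_collocates(mwus, "ants"))  # → [('fire', 2), ('leafcutter', 1)]
```

No significance function is involved: the ‘collocates’ of ants are simply the words that share a unit with it, ranked by how often the shared units recur.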
To conclude, collocation is an epiphenomenon of phraseology, a side-effect of words forming larger units. Phraseological units contain multiple single words, and it is these that collocation software picks up, because they are the ones that commonly occur together in a text. And the reason they occur together is that they form a single unit. Once we look at text in terms of MWUs, the need for collocation disappears. Collocation just picks out the constituent elements of multi-word units.
One could of course argue that this is circular: that we are simply replacing a procedure for calculating collocations with one that calculates MWUs. But the difference between the two is that MWU recognition does not require complicated statistics (for which I find it hard to see any justification anyway); it simply looks at recurrent patternings in language. MWUs are re-usable chunks of text, which can be justified on the grounds of usage. Collocation is a much harder concept to explain and to integrate into views of language. And, as it turns out, we don’t really need it at all.
- Berry-Rogghe, G.L.M. (1973). “The Computation of Collocations and Their Relevance in Lexical Studies.” In A.J. Aitken, R.W. Bailey and N. Hamilton-Smith (eds.), The Computer and Literary Studies. Edinburgh: Edinburgh University Press, pp. 103-112.
- Church, K., and Hanks, P. (1990). “Word Association Norms, Mutual Information, and Lexicography.” Computational Linguistics 16(1), pp. 22-29.
- Firth, J.R. (1957). “A Synopsis of Linguistic Theory 1930-1955.” In Studies in Linguistic Analysis. Oxford: Philological Society.