#CORPUSMOOC : Week 2 Notes from @drbexl

Notes from the second week of the MOOC ‘Corpus Linguistics’ via Lancaster University.


I want you to think of two words – ‘diamond’ and ‘cause’.

Without consulting anybody else, or looking at any reference resources, write two short definitions for these words. Take no more than two minutes to complete this task.

A diamond is a compressed mineral whose rarity ensures that it has high value. It has gained meaning in recent centuries as a valuable gift, especially to signify love, and is commonly used in engagement rings. As an anniversary it signifies a long marriage.

The word cause may refer to ‘a cause’ that one supports, including charitable causes, or ‘to cause’, as in cause something to happen.

Recap and Introduction to Collocation

  • How can we manipulate and exploit that frequency data in order to gain insights?
  • Collocation is one way to do this – the systematic co-occurrence of words in use. Co-occurring words may influence each other’s meanings, e.g. back/front, telephone/operator (pairs like these are the result of hunches).
  • Hunches can be right, but not always, as things may be more/less important than we think they are.
  • Take, e.g., ‘diamond’: seeing the words it occurs with reminds us of a range of meanings.

  • How close do words have to be to collocate? A window of +/- 5 words seems to work, with a minimum of 10 co-occurrences; be aware of sentence boundaries.
  • Know words ‘by the company they keep’ (Firth); collocates can come before or after the word in question.
  • Frequency can’t be the only measurement – also look at the mutual information value, which helps identify words that rarely occur with other words.
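
The window, threshold and mutual information ideas above can be sketched in a few lines of Python. This is only a rough illustration, not the course's own tooling: the function name `collocates`, its parameters, and the exact MI formula (observed co-occurrences divided by the count expected by chance, on a log2 scale) are my assumptions, and the sketch ignores sentence boundaries.

```python
import math
from collections import Counter

def collocates(tokens, node, window=5, min_count=10):
    """Return (collocate, co-occurrence count, MI score) for `node`.

    Hypothetical helper: `tokens` is a list of lower-cased words,
    `window` is the +/- span around each occurrence of `node`.
    """
    n = len(tokens)
    freq = Counter(tokens)
    co = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        # Count every word within `window` positions before or after the node.
        for j in range(max(0, i - window), min(n, i + window + 1)):
            if j != i:
                co[tokens[j]] += 1
    results = []
    for word, observed in co.items():
        if observed < min_count:
            continue  # too rare to trust
        # Co-occurrences expected by chance, given both frequencies,
        # the span size (2 * window) and the corpus size.
        expected = freq[node] * freq[word] * 2 * window / n
        results.append((word, observed, math.log2(observed / expected)))
    # Highest MI first: words most strongly tied to the node.
    return sorted(results, key=lambda r: r[2], reverse=True)
```

Calling `collocates(tokens, "diamond")` on a tokenised corpus would list the words that keep company with ‘diamond’, ranked by MI rather than by raw frequency.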

Collocation, colligation and related features

  • What about grammatical words? Words do have strong affinities for certain prepositions or occasional articles. Colligation – affinity with a grammatical class (rather than meaning).
  • “For now, word form refers to any word that you may find in a corpus. So ‘fighting’ and ‘fought’ are both word forms. On the other hand, a lemma is what we might call the base form of a word – so the lemma ‘fight’ gives rise to multiple word forms, including ‘fighting’ and ‘fought’.”
  • Semantic preference – affinity with a class of meanings, e.g. ‘diamond’ (often appears as part of a class of gems), while ‘a glass of’ is followed by drinkable liquids.
  • Discourse Prosody – expresses speaker attitude = important for ‘discourse analysis’.
    • ‘Cause’ often associated with trouble, pain, suffering – subconsciously the word has negative discourse prosody.
    • “The way that words in a corpus can collocate with a related set of words or phrases, often revealing (hidden) attitudes.”


  • Are there words that appear more frequently in Corpus A than in Corpus B? Can use statistical significance tests.
  • What words are ‘unusually frequent’ in this particular dataset? [I’m thinking here if we did research into words used by those of different religions on Twitter – what words would appear ‘unusually frequently’ in each religion?]
  • Analysts often cut off at the top 50-100 keywords to keep the data manageable; there must be 20+ keywords, and they should be distributed across the range of texts (not bunched in one text/paragraph).
  • Typical keywords: Proper nouns (names), Style/genre markers (grammatical words), spelling idiosyncrasies (British/American English) – for discourse analysis = “the aboutness” of the text – the gist of the text.
    • Once salient words are identified, the analyst picks out interesting factors and explains the ‘meaning’ and why those words are there.
    • Discover words (especially once run through computing power) that our conscious cognitive abilities would not identify as salient.
    • Can the experiment be replicated? Follow the same process, and it should come out the same ‘objectively’.
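
A minimal sketch of the keyword comparison described above, assuming the log-likelihood statistic as the significance test (the notes only say "statistical significance tests", so that choice is mine). The names `log_likelihood` and `keywords`, and the `top_n` and `min_count` parameters, are hypothetical. The sketch does not check how keywords are distributed across texts.

```python
import math
from collections import Counter

def log_likelihood(a, b, size_a, size_b):
    """Log-likelihood for a word seen `a` times in corpus A and `b` times in B."""
    total = a + b
    # Counts expected if the word were equally common in both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / expected_a)
    if b > 0:
        ll += b * math.log(b / expected_b)
    return 2 * ll

def keywords(tokens_a, tokens_b, top_n=50, min_count=5):
    """Words 'unusually frequent' in corpus A compared with corpus B."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    scored = []
    for word, a in freq_a.items():
        b = freq_b.get(word, 0)
        if a < min_count:
            continue  # assumed threshold, not taken from the notes
        if a / size_a <= b / size_b:
            continue  # keep only words relatively MORE frequent in A
        scored.append((word, log_likelihood(a, b, size_a, size_b)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]  # cut off to a manageable list
```

Because the procedure is fully specified, anyone running `keywords` on the same two token lists gets the same ranking, which is the replicability point made above.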

Change over time and lock words

  • Which words have become steadily less/more frequent – or stayed the same (locked in place) – and what this tells us about cultural values.
  • The Brown family of corpora – what were the key shifts happening in language 1931-2006 (4 sample points)? E.g. ‘Mrs’ down, ‘health’ up and ‘money’ largely ‘locked’.
  • What has declined?
    • Titles – a more informal society, so less use of Mr, Miss, etc.
    • Modal verbs – we are less comfortable with ‘imposing’ on people, so these are declining too.
    • Longer forms are contracting – as people seek to squeeze as much as possible into as short a space as possible [e.g. Twitter!]
  • What are lock-words?
    • Weaker modality
    • Wh – question words
    • Body parts
    • Other nouns, including money (we’re still obsessed)
  • Increased use
    • Contracted forms, such as it’s
    • Numbers as 34, rather than thirty-four
    • Social terms
  • Why has the word ‘children’ increased over time?
    • 1990s – fear of danger to children, promoting/supporting children and families… children are being problematized… [That fits with Raising Children in a Digital Age]
    • 2006 corpus – lots of moral panics…

  • Dominant discourse arising in Britain relating to children.
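
One possible way to sort words into "locked", "rising" and "declining" across the sample points, sketched in Python. The coefficient of variation as the stability measure, the `lock_cv` threshold, and the function name `classify_trend` are all my assumptions for illustration, not the method used in the lecture.

```python
from statistics import mean, pstdev

def classify_trend(freqs, lock_cv=0.05):
    """Label a word from its relative frequencies at successive sample points.

    `freqs` is a list such as per-million counts, one per corpus, in
    chronological order. `lock_cv` is a hypothetical cut-off.
    """
    avg = mean(freqs)
    if avg == 0:
        return "absent"
    # Coefficient of variation: spread relative to the average frequency.
    cv = pstdev(freqs) / avg
    if cv <= lock_cv:
        return "locked"  # barely moves between sample points
    steps = list(zip(freqs, freqs[1:]))
    if all(later >= earlier for earlier, later in steps):
        return "rising"
    if all(later <= earlier for earlier, later in steps):
        return "declining"
    return "mixed"
```

With four sample points, a word whose frequency hardly changes comes out as "locked", while one that drops at every step comes out as "declining". Any figures fed in would need to come from the corpora themselves.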


  • Corpora give us insights into the mechanics of language, and of the society within which that language is being used.
  • They can answer some questions really well, but others not so much – be mindful!
  • Corpora should be linked with other methods for the study of language, society, history, etc., which expands the range of studies/findings.
  • Mesh qualitative/quantitative data…
  • Toolbox – use the right tool, in the right combination…