#CORPUSMOOC : Week 2 Notes from @drbexl

The second week of the MOOC ‘Corpus Linguistics’ via Lancaster University:


I want you to think of two words – ‘diamond’ and ‘cause’.

Without consulting anybody else, or looking at any reference resources, write two short definitions for these words. Take no more than two minutes to complete this task.

A diamond is a compressed mineral whose rarity ensures that it has high value. It has gained meaning in recent centuries as a valuable gift, especially to signify love, and is commonly used in engagement rings. As an anniversary it signifies a long marriage.

The word cause may refer to ‘a cause’ that one supports, including charitable causes, or ‘to cause’, as in cause something to happen.

Recap and Introduction to Collocation

  • How can we manipulate and exploit that frequency data in order to gain insights?
  • Collocation is one way to do this: the systematic co-occurrence of words in use. Words that co-occur may influence each other’s meanings. Pairs such as back/front or telephone/operator are the kind we come up with from hunches.
  • Hunches can be right, but not always, as things may be more/less important than we think they are.
  • See, e.g. diamond, and we’ll be reminded of a range of meanings.


  • How close do these have to be to collocate? +/- 5 words seems to work, with a minimum of 10 occurrences, and be aware of sentence boundaries.
  • Know a word ‘by the company it keeps’: collocates can appear before or after the word you are looking at.
  • Frequency can’t be the only measure: a mutual information (MI) score shows how strongly two words are tied to each other, and highlights words that rarely occur with anything else.
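
The window, minimum count and MI points above can be put together in a small sketch. This is my own illustration, not the tool used on the course: the function names, the crude tokenizer and the expected-frequency formula (one common formulation of MI) are all assumptions, and it ignores sentence boundaries, which the notes above say you should watch out for.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def collocates(tokens, node, span=5, min_freq=10):
    """Return (collocate, co-occurrence count, MI) for words within +/- span of node.

    Assumption: MI = log2(observed / expected), where
    expected = f(node) * f(collocate) * window size / corpus size.
    """
    word_freq = Counter(tokens)
    n = len(tokens)
    co_freq = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        # Look span words to the left and right, staying inside the corpus.
        for j in range(max(0, i - span), min(n, i + span + 1)):
            if j != i:
                co_freq[tokens[j]] += 1
    results = []
    for word, observed in co_freq.items():
        if observed < min_freq:
            continue  # too rare to trust
        expected = word_freq[node] * word_freq[word] * 2 * span / n
        results.append((word, observed, math.log2(observed / expected)))
    # Strongest association first.
    return sorted(results, key=lambda r: r[2], reverse=True)
```

Calling `collocates(tokenize(text), "diamond")` on a large enough text would list the words that keep company with ‘diamond’, ranked by MI rather than by raw frequency.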

Collocation, colligation and related features

  • What about grammatical words? Words do have strong affinities for certain prepositions or occasional articles. Colligation – affinity with a grammatical class (rather than meaning).
  • “For now, word form refers to any word that you may find in a corpus. So ‘fighting’ and ‘fought’ are both word forms. On the other hand, a lemma is what we might call the base form of a word – so the lemma ‘fight’ gives rise to multiple word forms, including ‘fighting’ and ‘fought’.”
  • Semantic preferences – e.g. diamond (often part of a class of gems), but ‘a glass of’ includes drinkable liquids.
  • Discourse Prosody – expresses speaker attitude = important for ‘discourse analysis’.
    • ‘Cause’ often associated with trouble, pain, suffering – subconsciously the word has negative discourse prosody.
    • “The way that words in a corpus can collocate with a related set of words or phrases, often revealing (hidden) attitudes.”
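
The word form/lemma distinction quoted above can be shown with a tiny sketch. The lookup table here is hand-written and hypothetical, only there to illustrate the idea; a real project would use a proper lemmatiser.

```python
from collections import Counter

# Hypothetical hand-written lookup, for illustration only.
LEMMAS = {
    "fight": "fight",
    "fights": "fight",
    "fighting": "fight",
    "fought": "fight",
}

def lemma_counts(tokens, lemma_lookup):
    """Count tokens by lemma; unknown word forms are counted as themselves."""
    return Counter(lemma_lookup.get(tok, tok) for tok in tokens)
```

Counting by word form would treat ‘fighting’ and ‘fought’ as unrelated items, whereas counting by lemma adds them together under ‘fight’.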


  • Are there words that appear more frequently in Corpus A than in Corpus B? Can use statistical significance tests.
  • What words are ‘unusually frequent’ in this particular dataset? [I’m thinking here if we did research into words used by those of different religions on Twitter – what words would appear ‘unusually frequently’ in each religion?]
  • Analysts often cut the list off at the top 50-100 keywords to keep the data manageable. There must be 20+ keywords, and they should be distributed across the range of texts (not bunched in one text/paragraph).
  • Typical keywords: Proper nouns (names), Style/genre markers (grammatical words), spelling idiosyncrasies (British/American English) – for discourse analysis = “the aboutness” of the text – the gist of the text.
    • Once salient words are identified, the analyst looks at what makes them interesting and explains their ‘meaning’ and why those words are there.
    • Discover words (especially once run through computing power) that our conscious cognitive abilities would not identify as salient.
    • Can the experiment be replicated? Follow the same process, and it should come out the same ‘objectively’.
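
As a sketch of how ‘unusually frequent’ might be tested, here is a log-likelihood comparison of two corpora. Log-likelihood is one widely used significance test for keywords, but I am assuming it here rather than reporting what the course software does. The function names, the `min_freq` cutoff and the 3.84 threshold (the chi-square critical value for p < 0.05 with one degree of freedom) are my own illustrative choices, and the check that keywords are spread across texts is left out.

```python
import math
from collections import Counter

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood score for one word's frequency in corpus A vs corpus B."""
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing
            score += observed * math.log(observed / expected)
    return 2 * score

def keywords(tokens_a, tokens_b, min_freq=20, threshold=3.84):
    """Words unusually frequent in A relative to B, strongest first.

    Both token lists are assumed to be non-empty.
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    found = []
    for word, freq_a in counts_a.items():
        freq_b = counts_b[word]
        if freq_a < min_freq:
            continue  # illustrative minimum frequency
        if freq_a / size_a <= freq_b / size_b:
            continue  # keep only words over-represented in A
        score = log_likelihood(freq_a, size_a, freq_b, size_b)
        if score >= threshold:
            found.append((word, freq_a, freq_b, score))
    return sorted(found, key=lambda r: r[3], reverse=True)
```

Because the same inputs and settings always give the same list, this kind of procedure is what makes the replication point above possible.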

Change over time and lock words

  • Which words have become steadily less/more frequent – or stayed the same (locked in place) – and what this tells us about cultural values.
  • The Brown family of corpora – what were the key shifts in language between 1931 and 2006 (4 sample points)? E.g. ‘Mrs’ down, ‘health’ up, and ‘money’ largely ‘locked’.
  • What has declined?
    • A more informal society, so less use of Mr, Miss, etc.
    • Modal verbs – we are less comfortable with ‘imposing’ on people, so these are declining also.
    • Longer forms are contracting – as people seek to squeeze as much as possible into as short a space as possible [e.g. Twitter!]
  • What are lock-words?
    • Weaker modality
    • Wh – question words
    • Body parts
    • Other nouns, including money (we’re still obsessed)
  • Increased use
    • Contracted forms, such as it’s
    • Numbers as 34, rather than thirty-four
    • Social terms
  • Why has the word ‘children’ increased over time?
    • 1990s – fear of danger to children, promoting/supporting children and families… children are being problematized… [That fits with Raising Children in a Digital Age]
    • 2006 corpus – lots of moral panics…


  • Dominant discourse arising in Britain relating to children.
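
The rising/falling/locked idea from this section can be sketched in code. This is only my guess at a workable approach: I am assuming a coefficient of variation across the sample points as the measure of stability, and the 5% cutoff and all the names are hypothetical.

```python
import statistics

def per_million(count, corpus_size):
    """Normalise a raw count so corpora of different sizes can be compared."""
    return count * 1_000_000 / corpus_size

def coefficient_of_variation(values):
    """Population standard deviation as a percentage of the mean."""
    mean = statistics.mean(values)
    if mean == 0:
        return float("inf")
    return statistics.pstdev(values) / mean * 100

def classify(rel_freqs, lock_threshold=5.0):
    """Label a word's frequencies over time (oldest sample point first)."""
    if coefficient_of_variation(rel_freqs) <= lock_threshold:
        return "locked"  # barely moves between sample points
    pairs = list(zip(rel_freqs, rel_freqs[1:]))
    if all(earlier < later for earlier, later in pairs):
        return "rising"
    if all(earlier > later for earlier, later in pairs):
        return "falling"
    return "mixed"
```

With four sample points, a word whose per-million frequency hardly changes would come out as ‘locked’, while one that drops at every point would come out as ‘falling’.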


  • Corpora give us insights into the mechanics of language, and of the society within which that language is being used.
  • They can answer some questions really well, but others not so much – be mindful!
  • Corpora should be linked with other methods for the study of language, society, history, etc., which expands the range of studies/findings.
  • Mesh qualitative/quantitative data…
  • Toolbox – use the right tool, in the right combination…

By admin

Dr Bex Lewis is passionate about helping people engage with the digital world in a positive way, where she has more than 20 years’ experience. She is Senior Lecturer in Digital Marketing at Manchester Metropolitan University and Visiting Research Fellow at St John’s College, Durham University, with a particular interest in digital culture, persuasion and attitudinal change, especially how this affects the third sector, including faith organisations, and, after her breast cancer diagnosis in 2017, has started to research social media and cancer. Trained as a mass communications historian, she has written the original history of the poster Keep Calm and Carry On: The Truth Behind the Poster (Imperial War Museum, 2017), drawing upon her PhD research. She is Director of social media consultancy Digital Fingerprint, and author of Raising Children in a Digital Age: Enjoying the Best, Avoiding the Worst  (Lion Hudson, 2014; second edition in process) as well as a number of book chapters, and regularly judges digital awards. She has a strong media presence, with her expertise featured in a wide range of publications and programmes, including national, international and specialist TV, radio and press, and can be found all over social media, typically as @drbexl.
