The second week of the MOOC ‘Corpus Linguistics’ via Lancaster University:
I want you to think of two words – ‘diamond’ and ‘cause’.
Without consulting anybody else, or looking at any reference resources, write two short definitions for these words. Take no more than two minutes to complete this task.
A diamond is a compressed mineral whose rarity ensures that it has high value. It has gained meaning in recent centuries as a valuable gift, especially to signify love, and is commonly used in engagement rings. As an anniversary it signifies a long marriage.
The word cause may refer to ‘a cause’ that one supports, including charitable causes, or ‘to cause’, as in cause something to happen.
Recap and Introduction to Collocation
- How can we manipulate and exploit that frequency data in order to gain insights?
- Collocation is one way to do this – the systematic co-occurrence of words in use. Words that co-occur may influence each other’s meanings, e.g. back/front, telephone/operator. Examples like these are the result of hunches.
- Hunches can be right, but not always, as things may be more/less important than we think they are.
- See a word such as ‘diamond’ and we’ll be reminded of a range of meanings.
- How close do words have to be to collocate? A span of +/- 5 words seems to work, with a minimum of 10 occurrences; be aware of sentence boundaries.
- We know words ‘by the company they keep’, and collocates can appear before or after the word.
- Frequency can’t be the only measurement – look also at the mutual information (MI) value, and check whether the words rarely occur with other words.
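To make the span and mutual information ideas above concrete, here is a minimal Python sketch. It is my own illustration, not the course's tooling: the function name `collocates` and its parameters are hypothetical, the MI formula is one common formulation (log2 of observed over expected co-occurrences), and for simplicity it ignores sentence boundaries, which a real analysis should respect.

```python
import math
from collections import Counter

def collocates(tokens, node, window=5, min_count=10):
    """Return {collocate: (co-occurrence count, MI score)} for `node`.

    Assumption: MI = log2(observed / expected), where expected is the
    number of co-occurrences we would see if the words were independent.
    """
    n = len(tokens)
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        # Look `window` words to the left and right of the node word.
        lo, hi = max(0, i - window), min(n, i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pair_freq[tokens[j]] += 1
    results = {}
    for word, observed in pair_freq.items():
        if observed < min_count:
            continue  # too rare to trust
        # The span covers 2 * window positions around each node occurrence.
        expected = word_freq[node] * word_freq[word] * 2 * window / n
        results[word] = (observed, math.log2(observed / expected))
    return results
```

A high MI score suggests the pair occurs together far more often than chance would predict, which is why it complements raw frequency. MI tends to favour rare words, which is one reason a minimum count is applied.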
Collocation, colligation and related features
- What about grammatical words? Words do have strong affinities for certain prepositions or occasional articles. Colligation – affinity with a grammatical class (rather than meaning).
- “For now, word form refers to any word that you may find in a corpus. So ‘fighting’ and ‘fought’ are both word forms. On the other hand, a lemma is what we might call the base form of a word – so the lemma ‘fight’ gives rise to multiple word forms, including ‘fighting’ and ‘fought’.”
- Semantic preference – e.g. ‘diamond’ often appears as part of a class of gems, while ‘a glass of’ is followed by drinkable liquids.
- Discourse Prosody – expresses speaker attitude = important for ‘discourse analysis’.
- ‘Cause’ often associated with trouble, pain, suffering – subconsciously the word has negative discourse prosody.
- “The way that words in a corpus can collocate with a related set of words or phrases, often revealing (hidden) attitudes.”
- Are there words that appear more frequently in Corpus A than they do in Corpus B? We can use statistical significance tests to check.
- What words are ‘unusually frequent’ in this particular dataset? [I’m thinking here if we did research into words used by those of different religions on Twitter – what words would appear ‘unusually frequently’ in each religion?]
- Analysts often cut off at the top 50-100 keywords to create manageable data; there must be 20+ keywords, and they should be distributed across the range of texts (not bunched in one text/paragraph).
- Typical keywords: Proper nouns (names), Style/genre markers (grammatical words), spelling idiosyncrasies (British/American English) – for discourse analysis = “the aboutness” of the text – the gist of the text.
- Once salient words have been identified, the analyst picks out interesting factors and explains ‘meaning’ and why those words are there.
- Discover words (especially once run through computing power) that our conscious cognitive abilities would not identify as salient.
- Can the experiment be replicated? Follow the same process, and it should come out the same, ‘objectively’.
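The keyword comparison described above (which words are unusually frequent in Corpus A relative to Corpus B) can be sketched in Python. The notes only say "statistical significance tests"; I am assuming the log-likelihood statistic here, which is widely used for keyness, and the function names, `min_freq` and `top_n` are hypothetical. The sketch does not check how evenly a word is distributed across texts.

```python
import math
from collections import Counter

def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Log-likelihood keyness score for one word across two corpora."""
    total = freq_a + freq_b
    # Expected frequencies if the word were equally common in both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    ll = 0.0
    if freq_a > 0:
        ll += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        ll += freq_b * math.log(freq_b / expected_b)
    return 2 * ll

def keywords(tokens_a, tokens_b, min_freq=20, top_n=50):
    """Words over-represented in corpus A, ranked by log-likelihood."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    scored = []
    for word, fa in counts_a.items():
        if fa < min_freq:
            continue  # assumed minimum-frequency cut-off
        fb = counts_b[word]
        if fa / size_a <= fb / size_b:
            continue  # not relatively more frequent in A
        scored.append((word, log_likelihood(fa, fb, size_a, size_b)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

Because the same inputs always give the same ranked list, this also illustrates the replicability point: anyone following the same process on the same corpora should get the same keywords.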
Change over time and lock words
- Which words have become steadily less/more frequent – or stayed the same (locked in place) – and what this tells us about cultural values.
- The Brown family of corpora – what were the key shifts happening in language 1931-2006 (4 sample points)? E.g. ‘Mrs’ down, ‘health’ up and ‘money’ largely ‘locked’.
- What has declined?
- Titles – a more informal society, with less use of Mr, Miss, etc.
- Modal verbs – we are less comfortable with ‘imposing’ on people, so their use is declining also.
- Longer forms are contracting – as people seek to squeeze as much as possible into as short a space as possible [e.g. Twitter!]
- What are lock-words?
- Weaker modality
- Wh – question words
- Body parts
- Other nouns, including money (we’re still obsessed)
- Increased use
- Contracted forms, such as it’s
- Numbers written as digits (34) rather than as words (thirty-four)
- Social terms
- Why has the word ‘children’ increased over time?
- 1990s – fear of danger to children, promoting/supporting children and families… children are being problematized… [That fits with Raising Children in a Digital Age]
- 2006 corpus – lots of moral panics…
- Dominant discourse arising in Britain relating to children.
- Corpora give us insights into the mechanics of language, and of the society within which that language is being used.
- They can answer some questions really well, but others not so much – be mindful!
- Corpora should be linked with other methods for the study of language, society, history, etc., which expands the range of studies/findings.
- Mesh qualitative/quantitative data…
- Toolbox – use the right tool, in the right combination…
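Returning to the lock-word idea from the change-over-time notes: one way to sketch it in Python is to take a word's relative frequency at each sample point and label it as locked, rising or falling. This is my own simplification, not the method used in the course. The function name `classify_trend`, the use of the coefficient of variation, and the 5% threshold are all assumptions.

```python
from statistics import mean, pstdev

def classify_trend(freqs_per_million, lock_threshold=0.05):
    """Label a word's trend across chronologically ordered sample points.

    Assumption: a word counts as 'locked' when its coefficient of
    variation (standard deviation / mean) is at or below the threshold.
    """
    avg = mean(freqs_per_million)
    if avg == 0:
        return "absent"
    cv = pstdev(freqs_per_million) / avg
    if cv <= lock_threshold:
        return "locked"
    # Otherwise compare the first and last sample points.
    if freqs_per_million[-1] > freqs_per_million[0]:
        return "rising"
    return "falling"

# Invented figures for illustration only, not taken from any corpus.
print(classify_trend([100, 101, 99, 100]))
print(classify_trend([400, 300, 200, 100]))
```

Comparing only the first and last points is crude. A fuller analysis would look at whether the change is steady across all four sample points.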