#CORPUSMOOC Week 4 Notes

There are some things you can do whilst on the sofa with a fuzzy head and streaming nose, right? A bit slower than usual, but still…

Warm-Up Activity … appears to involve using AntConc … wonder if I’m reaching the limits of this course on a theory-only basis… let’s see…

Look at the files on your hard drive: How many documents do you have on there which have been written by you? What time period does that collection of documents cover? What types of genres do those documents represent?

Collection of documents from at least 3 years, with some older files back to the 1990s, largely writing, speaking (including video examples), press captures, poster images, with a few HTML files downloaded from the net.

How many words do you think are in your personal corpus if you saved approx. 12 documents in one genre as an example and run through AntConc?

100,000+, with lots of use of words such as is, for, the, etc., I suspect. I expect the highly ranked content words would be digital media, children, internet, social media, propaganda, poster, history, food, body image.

VIDEO 1: Building your own Corpus

As teachers, look at a corpus of student writing (or speech?) – wouldn’t be as large as pre-existing “learner corpora”, but you’d have more control/be more familiar with the content of it.

Kennedy (1998, p70-85)

Design: Without a solid design, nothing else works. What is it going to be used to do? What research questions are we defining? And what are we comparing it to? Speech or writing? Time periods? How big does the corpus need to be? Depends on restrictiveness of language you’re analysing (e.g. adverts = very short, so small corpus allowed analysis across a range of adverts).

British National Corpus needs to be large (100 million words) to represent the range of language.

Brown Corpora = only about 1 million words each, which seems to work, but they cover only written text and not all forms of writing.

A rare feature (e.g. hereof) requires a larger corpus than common words (e.g. because). Sometimes you have to settle for what you can get (time/££ may limit).

What about the individual size of the files within your corpus? Ensure that no one source is over-represented, e.g. 5 essays from one class and 15 from another: still take all the essays, but tag/annotate them so you can double-check balance. What about length of writing? E.g. take samples of 300 words per essay (if the interest is grammatical)? But this loses analysis of the overall structure of the text. What about samples from different parts of the text, as words associate with beginnings/middles/ends (so a single sample position is skewed)? Think about size/representativeness with a pilot study, think about how you’d store articles on your computer (see image: age/ID No/Essay), and how you might stratify the data in order to ask good questions of it.
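The fixed-length sampling idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name `sample_words`, the whitespace tokenisation and the three position labels are not from the course.

```python
def sample_words(text, size=300, position="start"):
    """Return a `size`-word sample from the start, middle or end of a text.

    Words are split on whitespace (a simplifying assumption).
    Texts shorter than `size` are returned whole.
    """
    words = text.split()
    if len(words) <= size:
        return words
    if position == "start":
        begin = 0
    elif position == "middle":
        begin = (len(words) - size) // 2
    elif position == "end":
        begin = len(words) - size
    else:
        raise ValueError(f"unknown position: {position}")
    return words[begin:begin + size]
```

Taking one sample from each position per essay would keep the grammatical evidence comparable across essays while still covering beginnings, middles and ends.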

VIDEO 2: Building a Corpus: The Basics

Keep a list of sources of information, by whom, when (if not obvious from the text), when accessed, gender, topic, language, etc… but only if relevant to research question, or to those who might use your data at a later date.

For under 18s = need parental permission, otherwise subject permission. (Letters re purpose of research, anonymity). If going to share data with others, then need to sign a copyright release form. Ensure anonymisation/ethics.

Sources? Word-process by hand (interesting but time-consuming, but necessary for spoken), scan in (time-consuming to error check), ask friends, etc. for texts, buy, or use an existing corpus that’s in electronic form (take care with copyright for materials taken directly from the internet, but there are a number of text archives available).


Note differences between ‘spoken language’ and ‘written to be spoken’ language – not a problem unless you claim that scripts are representative of spoken language. 

BBC webpage: given as an example to collect data from. Note the issue of underlying code, so save the file in text-only format, although text as image may require typing in, e.g. highlight boxes. Or strip e.g. menu text, or copy/paste text… or use e.g.


Consider using to collect material.
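One way to deal with the underlying code of a saved web page is to strip the markup programmatically. Here is a minimal sketch using Python’s standard-library `html.parser`. The class name `TextOnly`, the function `html_to_text` and the set of skipped tags are my own illustrative choices, not tools named in the course.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect visible text, ignoring scripts, styles and navigation menus."""

    SKIP = {"script", "style", "nav"}  # assumption: tags treated as non-content

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text content of an HTML string, one text run per line."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

Text that only exists as an image would still need typing in by hand, and real pages usually need page-specific rules for what counts as menu text.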

VIDEO 3: Mark Up and Annotation

Add meta-data to help analysts … Header files = title, date, author, etc.

Annotation for stylistic interpretation… e.g. heading levels. But only if you’re interested in the features that help your question.

If sharing with others, you need to be clear about the system, so others can use it.

Grammatical annotation can be done fairly quickly with computers, but accuracy is not always great, especially if the text uses rare words or misspellings that are not recognised. May have to ‘error tag’ – this has to be done manually.

VIDEO 4: American English

Corpora at Brigham Young University, from range of sources, includes historical data. OED dictionary of historical English…

COCA – ‘Must’ is most frequent in academic writing, and least frequent in spoken language – it’s a word in decline, especially after the 1990s.

BNC – only contains texts to 1993.


#CORPUSMOOC – Corpus Linguistics Week 3

Find an article in which the word ‘refugee’ is mentioned – make notes about how refugees, migrants, asylum seekers, etc are talked about. Chose:

  • Referred to in terms of numbers (large numbers)
  • Range of words indicating a ‘problem’ to be solved, stemmed, halted, stop them infiltrating, as a danger, etc.
  • Refugees = a destabilising influence
  • Humanitarian refugees (criteria unknown) only allowed.
  • ¼ people in Lebanon = refugees, highest number in the world = straining infrastructure and driving down wages.
  • Need ££ to deal with “influx”.

Oh, maybe it was supposed to be a British newspaper – ah well, pretty familiar!

Video 1: Refugees and Asylum Seekers in the UK Press

Methodologically – need large amounts of data, frequency data, hunt for co-occurrences, annotation/grouping, quantification and statistical significance.

Merits – helps us get ‘the big picture’, identify the ‘aboutness’/areas of interest that can be interrogated – can work qualitatively/quantitatively and check on ‘gut instinct’

Core terms – keywords, cluster, collocation, semantic prosody, discourse prosody.

Video 2: Building the Corpus and Initial Analysis

UK universities have access to many newspapers, but you need to define the keywords [x OR x OR x AND NOT x]
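The [x OR x OR x AND NOT x] shape of the query can be sketched as a simple filter. The function name `matches_query` and the example terms are hypothetical (they are not the project’s actual query), and matching is on exact lowercase word forms only, so ‘refugees’ would need listing separately from ‘refugee’.

```python
import re


def matches_query(text, any_of, none_of):
    """True if the text contains at least one `any_of` term and no `none_of` term.

    Terms must be given in lowercase; matching is on whole word forms.
    """
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return bool(tokens & set(any_of)) and not (tokens & set(none_of))


# Hypothetical example terms, for illustration only
include = {"refugee", "refugees", "asylum"}
exclude = {"football"}
articles = [
    "The refugee crisis deepened this week.",
    "A former refugee signs for a football club.",
    "House prices rose again.",
]
selected = [a for a in articles if matches_query(a, include, exclude)]
```

Only the first article would be kept here: the second is excluded by the NOT term, and the third has no query term at all.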

How to derive a query? Collected a quick corpus of texts from a pilot study, then compared it to ‘general English’ to define the ‘aboutness’, then used keywords/intuitions/concordancing to include/exclude from the collection. Data was split into ‘tabloids’ and ‘broadsheets’ (interesting distinction). More data in the broadsheets, but articles in broadsheets = longer (so they are not ‘more obsessed’ about them).

Finding ‘topoi’ = finding key ‘theme’ in the data. How do ‘collocates’ (associated words) help construct that theme?

Statistical significance important. Red = tabloids; blue = broadsheets.



  • Generally about entry (mode, place, legality) – discourse largely established by the TABLOIDS
  • Number, Abuse, Numbers, Finance (cost/abuse), threat – also tabloids (except large numbers)
  • Residence, legality, issues with system, unwelcome (authentic and legitimacy only mentioned by broadsheets).
  • PLIGHT – much larger in the broadsheets (so more sympathetic?)

VIDEO 3: Tabloids, Broadsheets and Key Clusters

High probability for collocates. Red = tabloids; blue = broadsheets; black = equal.


Related to numbers/quantity – different ways of doing it, but both speak in quantity metaphors, and also in the idea of ‘plight’ (based on number of collocates).

To look at the word ‘illegal’ – manually checked it, then right-sorted to see what followed the word illegal. Identifying origin, ethnicity, religion, age, type of work, etc.
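Right-sorting just means ordering the concordance lines by the words that follow the search term. Concordancers such as AntConc do this for you, but here is a rough sketch of the idea. The function names, the regex tokenisation and the three-word context width are my own assumptions.

```python
import re


def concordance(text, node, width=3):
    """Return (left context, node, right context) for each hit of `node`."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines


def right_sorted(lines):
    """Sort concordance lines alphabetically by the words after the node."""
    return sorted(lines, key=lambda line: line[2])
```

Once sorted this way, identical right-hand neighbours sit next to each other, which makes it easier to spot the groupings by eye.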


Number of clusters – some are more ‘emblematic’ of tabloids…


Equivalence is being ‘forced’ – terrorism, crime, fraud, etc. all being brought together in the discourse, rather than representing ‘reality’.

How many occurrences per million words (‘normalised’ frequency)? Expect to see more in the tabloids than the broadsheets.
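Normalising matters because the two sub-corpora are different sizes, so raw counts can’t be compared directly. A small sketch of the arithmetic follows. The function name and the figures in the example are made up for illustration and are not the project’s real counts.

```python
def per_million(count, corpus_size):
    """Normalised frequency: occurrences per million words."""
    return count * 1_000_000 / corpus_size


# Made-up figures: the same raw count in corpora of different sizes
smaller_corpus = per_million(120, 4_000_000)   # 30.0 per million
larger_corpus = per_million(120, 12_000_000)   # 10.0 per million
```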


VIDEO 4: IN FOCUS. The expression ‘pose as’.

Who uses the term ‘Pose As’ in relation to RASIM? Tabloids use it 8 times more than broadsheets…

Beggars, crooks, etc. are identified as ‘posing as RASIM’ = taken ‘as fact’, and therefore positive stance towards ‘tougher measures’ – this is particularly in the tabloids. It’s there in the broadsheets too, but the opposite view is presented (if with fewer words).

Identifying problems in the asylum system by police/reporters ‘posing as’ RASIM.

The tabloids focus particularly upon asylum seekers ‘posing’ as nurses, etc…

Criminals may pose as RASIM to harm RASIM – also in tabloids, but very low numbers…

VIDEO 5: Summing Up

Focus upon words ‘suffocated’ and ‘drowned’ – focus upon whether they were represented as ‘illegal’ – directly (illegal immigrants) or indirectly (sneaking)?


Dictionary may have a range of different meanings, but the press gives a range of terms that ‘mean similar’ … used in a particular way continuously.

Remember that there are distinctions within newspapers, rather than labelling ‘tabloids’. Question how helpful your distinctions are.

Move between large-scale analysis, and closer/more-detailed readings of the text.


#CORPUSMOOC : Week 2 Notes from @drbexl

The second week of the MOOC ‘Corpus Linguistics’ via Lancaster University:


I want you to think of two words – ‘diamond’ and ‘cause’.

Without consulting anybody else, or looking at any reference resources, write two short definitions for these words. Take no more than two minutes to complete this task.

A diamond is a compressed mineral whose rarity ensures that it has high value. It has gained meaning in recent centuries as a valuable gift, especially to signify love, and is commonly used in engagement rings. As an anniversary it signifies a long marriage.

The word cause may refer to ‘a cause’ that one supports, including charitable causes, or ‘to cause’, as in cause something to happen.

Recap and Introduction to Collocation

  • How can we manipulate and exploit that frequency data in order to gain insights?
  • Collocation is one way to do this – systematic co-occurrence of words in use, and may influence each other’s meanings, e.g. back/front, telephone/operator = the result of hunches.
  • Hunches can be right, but not always, as things may be more/less important than we think they are.
  • See, e.g. diamond, and we’ll be reminded of a range of meanings.


  • How close do these have to be to collocate? +/- 5 words seems to work, with a minimum of 10 occurrences, and be aware of sentence boundaries.
  • Know these words ‘by the words that they keep’, and can be before/after.
  • Frequency can’t be the only measurement – seek mutual information value, and identify if words rarely occur with other words.
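The window, minimum-count and mutual information ideas above can be put together in a short sketch. This uses one common formulation of MI, log2(observed / expected), with expected co-occurrence estimated from the overall frequencies and the window size. The notes don’t say exactly which formula the course tools use, so treat that as an assumption. The function name `collocates` is also mine, and this version ignores sentence boundaries.

```python
import math
from collections import Counter


def collocates(tokens, node, span=5, min_count=10):
    """Score words co-occurring with `node` within +/- `span` tokens.

    Returns {word: MI score} for words seen at least `min_count` times
    in the window. Sentence boundaries are NOT respected here.
    """
    freq = Counter(tokens)
    n = len(tokens)
    co = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        co.update(window)
    scores = {}
    for word, observed in co.items():
        if observed < min_count:
            continue
        # Expected co-occurrences if the two words were independent
        expected = freq[node] * freq[word] * 2 * span / n
        scores[word] = math.log2(observed / expected)
    return scores
```

A high MI score picks out words that turn up near the node far more often than their overall frequency would predict, which is why it tends to favour rarer, more exclusive partners over very common words.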

Collocation, colligation and related features

  • What about grammatical words? Words do have strong affinities for certain prepositions or occasional articles. Colligation – affinity with a grammatical class (rather than meaning).
  • “For now, word form refers to any word that you may find in a corpus. So ‘fighting’ and ‘fought’ are both word forms. On the other hand, a lemma is what we might call the base form of a word – so the lemma ‘fight’ gives rise to multiple word forms, including ‘fighting’ and ‘fought’.”
  • Semantic preferences – e.g. diamond (often part of a class of gems), but ‘a glass of’ includes drinkable liquids.
  • Discourse Prosody – expresses speaker attitude = important for ‘discourse analysis’.
    • ‘Cause’ often associated with trouble, pain, suffering – subconsciously the word has negative discourse prosody.
    • The way that words in a corpus can collocate with a related set of words or phrases, often revealing (hidden) attitudes.


  • Are there words that appear more frequently in Corpus A than they are in Corpus B? Can use statistical significance tests.
  • What words are ‘unusually frequent’ in this particular dataset? [I’m thinking here if we did research into words used by those of different religions on Twitter – what words would appear ‘unusually frequently’ in each religion?]
  • Analysts often cut off the top 50-100 keywords to create manageable data, and there must be 20+ keywords, and those distributed across the range of texts (and not bunched in one text/paragraph)
  • Typical keywords: Proper nouns (names), Style/genre markers (grammatical words), spelling idiosyncrasies (British/American English) – for discourse analysis = “the aboutness” of the text – the gist of the text.
    • Once salient words are identified – pick out the interesting factors and explain ‘meaning’ and why those words are there.
    • Discover words (especially once run through computing power) that our conscious cognitive abilities would not identify as salient.
    • Can the experiment be replicated – follow the same process, and it should come out the same ‘objectively’.
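For the ‘more frequent in Corpus A than Corpus B’ question, one significance test widely used for keywords is log-likelihood. The notes don’t record which test the course used, so this is a sketch of that one option. The function name and the toy figures in the comments are my own.

```python
import math


def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Log-likelihood keyness of a word across two corpora.

    freq_a, freq_b: occurrences of the word in corpus A and corpus B
    size_a, size_b: total number of words in each corpus
    """
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    ll = 0.0
    if freq_a > 0:
        ll += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        ll += freq_b * math.log(freq_b / expected_b)
    return 2 * ll


# Values above 3.84 are significant at p < 0.05 (1 degree of freedom)
same_rate = log_likelihood(10, 10, 1000, 1000)      # no difference between corpora
different_rate = log_likelihood(100, 10, 1000, 1000)  # far more frequent in A
```

Because it only needs four numbers per word, the same process can be repeated by someone else on the same data and should come out the same.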

Change over time and lock words

  • Which words have become steadily less/more frequent – or stayed the same (locked in place) – and what this tells us about cultural values.
  • The Brown Corpus – what were the key shifts happening in language 1931-2006 (4 sample points). E.g. Mrs down, health up and money largely ‘locked’.
  • What have declined?
    • A more informal society as less use of Mr, Miss, etc.
    • A modal verb – less comfortable with ‘imposing’ on people, so this is declining also.
    • Longer forms are contracting – as people seek to squeeze as much as possible into as short a space as possible [e.g. Twitter!]
  • What are lock-words?
    • Weaker modality
    • Wh – question words
    • Body parts
    • Other nouns, including money (we’re still obsessed)
  • Increased use
    • Contracted forms, such as it’s
    • Numbers as 34, rather than thirty-four
    • Social terms
  • Why has the word ‘children’ increased over time?
    • 1990s – fear of danger to children, promoting/supporting children and families… children are being problematized… [That fits with Raising Children in a Digital Age]
    • 2006 corpus – lots of moral panics…


  • Dominant discourse arising in Britain relating to children.


  • Corpora give us insights into the mechanics of language, and of the society within which that language is being used.
  • They can answer some questions really well, but others not so much – be mindful!
  • Corpora should be linked with other methods for study of language, society, history, etc… which expand the range of studies/findings?
  • Mesh qualitative/quantitative data…
  • Toolbox – use the right tool, in the right combination…

#CORPUSMOOC : Week 1 Notes from @drbexl

Here are my notes from week 1 of Lancaster University’s MOOC ‘Corpus Linguistics’ (Haven’t got time to do the practical exercise, but this is twigging some thinking re my PhD thesis database!):


What is a corpus?

  • A collection of words?
  • It’s a methodology but not a theory of language.

Why might I use corpus linguistics?

  • Look at language ‘as it is’
  • Large amounts of data which are difficult to ID with intuition/anecdotes
  • Large amounts of data show us things we’re doing that we don’t even realise.
  • ID rare/exceptional cases not identifiable in other ways
  • Human beings are slower/less accurate than computers for purposes of this kind of research.

What is your research question/hypothesis?

  • Is the corpus ‘off the shelf’ useful to your question?
  • If you’re developing a corpus – how will you need to define it?
  • 30,000 → billions of words.
  • Needs to be representative of the corpus – e.g.
  • Must be machine-readable (not just a photo of the text) – so that the computer can identify the words
  • It may act as a standard reference for what is typical in language.
  • May be annotated with extra linguistic codes (e.g. grammar)

What is annotation and markup?

Computers do not have the cultural knowledge that we have, so we have to mark-up the text so it can read the nuances, etc.

  • Delimit particular sections as e.g. a ‘heading’, a ‘sentence’, etc. allowing computer to analyse just those areas, etc.
  • Understand how this is done, as the computer can automatically do this, etc. then allows sophisticated searches through the data.

Types of Corpora

Come in different flavours, so different things can be assessed – e.g. date, time, genre, etc. Specific = outline the areas, but there are also general corpora – especially of a language (note the difference between spoken/written).

  • Think about the shape of spoken language – especially the differences between e.g. the different people you talk to.
  • Parallel, new language, historic material, on-going corpus…

Frequency data, concordances and collocation

A search, how often does it appear, but also how frequently per million words, and what kind of documents/context does it appear within.

  • If you think you see a pattern emerging, you can ‘sort’ the results to see it more clearly [and on that basis identify themes].
  • Needs a cycle of extraction of data, and analysis, and close reading of relevant parts of the text.
  • Collocation – co-occurrence – from which meaning (and possibly grammar) appears – words are not randomly put together – words ‘shade one another’s meanings’ and ‘co-construct meaning’ – seek patterns in language.

Corpora and Language Teaching

This is less relevant to me, but it’s interesting that you need to identify which words are used frequently, and so which should come first within a textbook – could be helpful within digital literacy training.

What can’t we do with a corpus?

  • Just because it doesn’t exist in the corpus doesn’t mean it can’t be used – may be rare.
  • As with any scientific method, we are making deductions, not stating facts.
  • No visual information (pictures or body language) – traditionally people have set aside the visual and focused on the written language, but tools are being developed. * See database methodology – visual material collated in the 1990s for PhD research.