Here are my notes from week 1 of Lancaster University’s MOOC ‘Corpus Linguistics’ (I haven’t got time to do the practical exercise, but this is prompting some thinking about my PhD thesis database!):
What is a corpus?
- A collection of words?
- It’s a methodology but not a theory of language.
Why might I use corpus linguistics?
- Look at language ‘as it is’
- Large amounts of data reveal patterns which are difficult to ID with intuition/anecdotes
- Large amounts of data show us things we’re doing that we don’t even realise.
- ID rare/exceptional cases not identifiable in other ways
- Human beings are slower/less accurate than computers for purposes of this kind of research.
What is your research question/hypothesis?
- Is the corpus ‘off the shelf’ useful to your question?
- If you’re developing a corpus – how will you need to define it?
- 30,000 → billions of words.
- Needs to be representative of the language (or language variety) it is meant to capture – e.g. the British National Corpus, http://www.natcorp.ox.ac.uk
- Must be machine-readable (not just a photo of the text) – so that the computer can identify the words
- It may act as a standard reference for what is typical in language.
- May be annotated with extra linguistic codes (e.g. grammar)
What is annotation and markup?
Computers do not have the cultural knowledge that we have, so we have to mark up the text so that the computer can read the nuances, etc.
- Delimit particular sections as e.g. a ‘heading’, a ‘sentence’, etc., allowing the computer to analyse just those areas.
- Worth understanding how this is done: the computer can do this automatically, which then allows sophisticated searches through the data.
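To make the markup idea concrete for myself, here is a minimal sketch in Python of searching only within one kind of delimited section. The tag names (`<text>`, `<head>`, `<s>`), the sample text and the helper `find_in_sections` are my own assumptions for illustration, not a markup scheme taken from the course.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical mini-document: the tags <text>, <head> and <s> are
# illustrative assumptions, not a scheme from the course.
SAMPLE = """<text>
  <head>Corpus notes</head>
  <s>A corpus is a collection of texts.</s>
  <s>Corpus markup tells the computer where each sentence starts.</s>
</text>"""


def find_in_sections(xml_text, tag, word):
    """Return the text of every <tag> element that contains `word`."""
    root = ET.fromstring(xml_text)
    hits = []
    for element in root.iter(tag):
        content = "".join(element.itertext()).strip()
        # Lower-case and split into word tokens so punctuation is ignored.
        tokens = re.findall(r"[a-z']+", content.lower())
        if word.lower() in tokens:
            hits.append(content)
    return hits


# Same search word, restricted to two different kinds of section.
headings = find_in_sections(SAMPLE, "head", "corpus")
sentences = find_in_sections(SAMPLE, "s", "corpus")
print(headings)
print(sentences)
```

Because the sections are delimited, the same search can be limited to headings only or to sentences only, which is the kind of targeted search the markup is there to allow.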
Types of Corpora
Corpora come in different flavours, so different things can be assessed – e.g. date, time, genre, etc. Specialised corpora are limited to outlined areas, but there are also general corpora – especially of a language as a whole (note the difference between spoken/written).
- Think about the shape of spoken language – especially the differences between e.g. the different people you talk to.
- Parallel, new language, historic material, on-going corpus…
Frequency data, concordances and collocation
Run a search to see how often a word appears, but also how frequently it appears per million words, and what kinds of documents/contexts it appears within.
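A quick sketch of why the ‘per million words’ figure matters: raw counts can’t be compared between corpora of different sizes, so the count is divided by the corpus size and multiplied by one million. The two toy texts and the function names below are made up for illustration, and the tokenisation is deliberately crude.

```python
import re


def tokenize(text):
    """Crude tokeniser (an assumption): lower-case alphabetic word forms."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_per_million(word, tokens):
    """Return (raw count, frequency per million words) for `word`."""
    raw = tokens.count(word.lower())
    per_million = raw / len(tokens) * 1_000_000 if tokens else 0.0
    return raw, per_million


# Hypothetical toy "corpora" of different sizes.
small = tokenize("the cat sat on the mat")
large = tokenize("the dog and the cat saw the bird by the old oak tree")

raw_small, rate_small = frequency_per_million("the", small)
raw_large, rate_large = frequency_per_million("the", large)

print(raw_small, round(rate_small))
print(raw_large, round(rate_large))
```

Here ‘the’ has the higher raw count in the larger text but the lower rate per million words, which is why the normalised figure is the one to compare.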
- If you think you see a pattern emerging, you can ‘sort’ the results to make it clearer [and on that basis identify themes].
- Needs a cycle of extraction of data, and analysis, and close reading of relevant parts of the text.
- Collocation – co-occurrence – from which meaning (and possibly grammar) appears – words are not randomly put together – words ‘shade one another’s meanings’ and ‘co-construct meaning’ – seek patterns in language.
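The concordance and collocation ideas above can be sketched in a few lines of Python. This is only a rough illustration under my own assumptions: the sample text, the function names and the window sizes are invented, and it uses raw co-occurrence counts, whereas real corpus tools usually rank collocates with a statistical association measure.

```python
import re
from collections import Counter


def tokenize(text):
    """Crude tokeniser (an assumption): lower-case alphabetic word forms."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(tokens, node, width=2):
    """Return (left, node, right) contexts for each hit, sorted by right context."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    # Sorting on the words to the right groups similar patterns together.
    return sorted(lines, key=lambda line: line[2])


def collocates(tokens, node, window=1):
    """Count words occurring within `window` tokens either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts


# Hypothetical sample text, invented for illustration.
TEXT = ("Strong tea is nice. She drinks strong tea daily. "
        "He made a strong argument. Strong tea again.")
tokens = tokenize(TEXT)

kwic = concordance(tokens, "strong")
for left, node, right in kwic:
    print(f"{' '.join(left):>15} [{node}] {' '.join(right)}")

top = collocates(tokens, "strong").most_common(1)
print(top)
```

Sorting the concordance lines on the right-hand context puts the ‘strong tea’ lines next to each other, and the collocate count shows ‘tea’ as the most frequent neighbour of ‘strong’ in this toy text.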
Corpora and Language Teaching
This is less relevant to me, but it’s interesting that frequency data can identify which words are used most often, and so which should come first within a textbook – could be helpful within digital literacy training.
What can’t we do with a corpus?
- Just because something doesn’t appear in the corpus doesn’t mean it can’t be used – it may just be rare.
- As with any scientific method, we are making deductions, not establishing facts.
- No visual information (pictures or body language) – traditionally people have set aside the visual and focused on the written language, but tools are being developed. * See database methodology – visual material collated in the 1990s for PhD research.