Here are my notes from week 1 of Lancaster University’s MOOC ‘Corpus Linguistics’ (I haven’t got time to do the practical exercise, but this is sparking some thinking about my PhD thesis database!):
What is a corpus?
- A collection of words?
- Corpus linguistics is a methodology, not a theory of language.
Why might I use corpus linguistics?
- Look at language ‘as it is’
- Large amounts of data reveal patterns which are difficult to ID with intuition/anecdotes
- Large amounts of data show us things we’re doing that we don’t even realise.
- ID rare/exceptional cases not identifiable in other ways
- Human beings are slower/less accurate than computers for purposes of this kind of research.
What is your research question/hypothesis?
- Is the corpus ‘off the shelf’ useful to your question?
- If you’re developing a corpus – how will you need to define it?
- 30,000 → billions of words.
- Needs to be representative of the language (or language variety) it is meant to capture – e.g. http://www.natcorp.ox.ac.uk
- Must be machine-readable (not just a photo of the text) – so that the computer can identify the words
- It may act as a standard reference for what is typical in language.
- May be annotated with extra linguistic codes (e.g. grammar)
What is annotation and markup?
Computers do not have the cultural knowledge that we have, so we have to mark up the text so that the computer can read the nuances, etc.
- Delimit particular sections as e.g. a ‘heading’, a ‘sentence’, etc. allowing computer to analyse just those areas, etc.
- Understand how this is done, as the computer can do some of it automatically; this then allows sophisticated searches through the data.
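To make the markup idea concrete for myself, here is a minimal sketch in Python. The tag names (`head`, `s`, `w`) and the `pos` attribute are my own illustrative assumptions, not the scheme of any particular corpus, and the sample text is made up. It only shows how delimiting sections lets the computer analyse just those areas.

```python
import xml.etree.ElementTree as ET

# Hypothetical marked-up text: <head> = heading, <s> = sentence,
# <w pos="..."> = a word with an assumed part-of-speech code.
SAMPLE = """
<text>
  <head><w pos="NOUN">Notes</w> <w pos="ADP">on</w> <w pos="NOUN">corpora</w></head>
  <s><w pos="NOUN">Computers</w> <w pos="VERB">read</w> <w pos="NOUN">text</w></s>
  <s><w pos="NOUN">Markup</w> <w pos="VERB">helps</w> <w pos="PRON">them</w></s>
</text>
""".strip()

root = ET.fromstring(SAMPLE)


def words_in(tag):
    """Return the words found inside every section marked with `tag`."""
    return [w.text for section in root.iter(tag) for w in section.iter("w")]


def words_with_pos(pos):
    """Return every word whose annotation matches the given code."""
    return [w.text for w in root.iter("w") if w.get("pos") == pos]


heading_words = words_in("head")           # search only the heading
sentence_count = len(list(root.iter("s")))  # count marked sentences
verbs = words_with_pos("VERB")              # search by grammatical code

print(heading_words, sentence_count, verbs)
```

Without the markup, the computer would just see one run of words and could not tell a heading from a sentence, or a verb from a noun.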
Types of Corpora
Corpora come in different flavours, so different things can be assessed – e.g. date, time, genre, etc. Specialised corpora cover defined areas, but there are also general corpora, which cover a language as a whole (note the difference between spoken/written).
- Think about the shape of spoken language – especially the differences between e.g. the different people you talk to.
- Parallel, new language, historic material, on-going corpus…
Frequency data, concordances and collocation
Run a search: how often does the word appear, but also how frequently per million words, and in what kinds of documents/contexts does it appear?
- If you think you see a pattern emerging, you can ‘sort’ the results so that patterns start to show [on the basis of which you can identify themes].
- Needs a cycle of extraction of data, and analysis, and close reading of relevant parts of the text.
- Collocation – co-occurrence – from which meaning (and possibly grammar) appears – words are not randomly put together – words ‘shade one another’s meanings’ and ‘co-construct meaning’ – seek patterns in language.
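The three ideas above (frequency per million words, sortable concordance lines, and collocation) can be sketched roughly as follows. This is a toy illustration under my own assumptions: the sample text is invented, the function names are hypothetical, and the collocation step is a simple count of neighbouring words rather than any statistical measure a real corpus tool might use.

```python
import re
from collections import Counter

# Invented sample text; a real corpus would be far larger.
TEXT = (
    "Strong tea is popular. She drinks strong tea every morning. "
    "He prefers strong coffee, but strong tea is fine."
)


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def per_million(count, total):
    """Normalise a raw count to a frequency per million words."""
    return count / total * 1_000_000


def concordance(tokens, node, width=3):
    """Return (left context, node word, right context) for each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def collocates(tokens, node, window=2):
    """Count words appearing within `window` words of the node word."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts


tokens = tokenize(TEXT)
raw = tokens.count("strong")
print(raw, round(per_million(raw, len(tokens))))

# Sorting on the right-hand context groups similar lines together.
for left, node, right in sorted(concordance(tokens, "strong"), key=lambda l: l[2]):
    print(f"{left:>25} [{node}] {right}")

print(collocates(tokens, "strong").most_common(3))
```

In this tiny sample ‘tea’ turns up next to ‘strong’ more often than any other word, which is the kind of pattern you would then go back and read closely in context.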
Corpora and Language Teaching
This is less relevant to me, but it is interesting that a corpus can show which words are used most frequently, and so which should come first within a textbook – could be helpful within digital literacy training.
What can’t we do with a corpus?
- Just because it doesn’t exist in the corpus doesn’t mean it can’t be used – may be rare.
- As with any scientific method, we are making deductions, not establishing facts.
- No visual information (pictures or body language) – traditionally people have set aside the visual and focused on the written language, but tools are being developed. * See database methodology – visual material collated in the 1990s for PhD research.
Dr Bex Lewis is passionate about helping people engage with the digital world in a positive way, where she has more than 20 years’ experience. She is Senior Lecturer in Digital Marketing at Manchester Metropolitan University and Visiting Research Fellow at St John’s College, Durham University, with a particular interest in digital culture, persuasion and attitudinal change, especially how this affects the third sector, including faith organisations, and, after her breast cancer diagnosis in 2017, has started to research social media and cancer. Trained as a mass communications historian, she has written the original history of the poster Keep Calm and Carry On: The Truth Behind the Poster (Imperial War Museum, 2017), drawing upon her PhD research. She is Director of social media consultancy Digital Fingerprint, and author of Raising Children in a Digital Age: Enjoying the Best, Avoiding the Worst (Lion Hudson, 2014) as well as a number of book chapters, and regularly judges digital awards. She has a strong media presence, with her expertise featured in a wide range of publications and programmes, including national, international and specialist TV, radio and press, and can be found all over social media, typically as @drbexl.