#CORPUSMOOC Week 4 Notes

There are some things you can do whilst on the sofa with a fuzzy head and streaming nose, right? A bit slower than usual, but still…

Warm-Up Activity … appears to involve using AntConc … wonder if I'm reaching the limits of doing this course on a theory-only basis… let’s see…

Look at the files on your hard drive: How many documents do you have on there which have been written by you? What time period does that collection of documents cover? What types of genres do those documents represent?

Collection of documents from at least 3 years, with some older files back to the 1990s, largely writing, speaking (including video examples), press captures, poster images, with a few HTML files downloaded from the net.

How many words do you think are in your personal corpus if you saved approx. 12 documents in one genre as an example and run through AntConc?

100,000+, with lots of use of words such as is, for, the, etc., I suspect – I expect the highly ranked content words would be digital media, children, internet, social media, propaganda, poster, history, food, body image.
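A rough way to check a guess like this without opening AntConc is to count words directly. The sketch below is only an approximation of a word list tool: the tokeniser is deliberately crude, and `sample_docs` is a made-up stand-in for real files, so totals will not match AntConc's exactly.

```python
import re
from collections import Counter

def word_frequencies(texts):
    """Count lower-cased word tokens across a list of document strings."""
    counts = Counter()
    for text in texts:
        # Crude tokeniser: runs of letters, allowing internal apostrophes
        counts.update(re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower()))
    return counts

# Hypothetical mini-corpus standing in for the files on a hard drive
sample_docs = [
    "The poster is a form of propaganda for the public.",
    "Social media is the internet for children.",
]

freqs = word_frequencies(sample_docs)
total_tokens = sum(freqs.values())   # corpus size in word tokens
top_words = freqs.most_common(3)     # most frequent words first
print(total_tokens, top_words)
```

As expected, function words such as "the", "is" and "for" float to the top even in this tiny sample.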

VIDEO 1: Building your own Corpus

As teachers, look at a corpus of student writing (or speech?) – wouldn’t be as large as pre-existing “learner corpora”, but you’d have more control/be more familiar with the content of it.

Kennedy (1998, p70-85)

Design: Without a solid design, nothing else works. What is it going to be used to do? What research questions are we defining? And what are we comparing it to? Speech or writing? Time periods? How big does the corpus need to be? Depends on restrictiveness of language you’re analysing (e.g. adverts = very short, so small corpus allowed analysis across a range of adverts).

British National Corpus needs to be large (100 million words) to represent the range of language.

Brown Corpora = only about 1 million words each seems to work, but covers only written text and not all forms of writing.

A rare feature (e.g. hereof) requires a larger corpus than common words (e.g. because). Sometimes you have to settle for what you can get (time/££ may limit).
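The arithmetic behind this is simple: expected hits = rate per million words × corpus size ÷ 1,000,000. A small sketch, where the rates are invented purely for illustration (they are not real figures for "hereof" or "because"):

```python
def expected_hits(rate_per_million, corpus_size):
    """Expected number of occurrences, given a rate per million words."""
    return rate_per_million * corpus_size / 1_000_000

# Hypothetical rates, for illustration only
rare_rate = 2        # a rare word: 2 occurrences per million words
common_rate = 1000   # a common word: 1,000 occurrences per million words

for size in (1_000_000, 100_000_000):
    print(size, expected_hits(rare_rate, size), expected_hits(common_rate, size))
```

With only a handful of expected hits in a 1 million word corpus, there is very little to analyse for the rare word, whereas the common word already gives plenty of examples.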

What about the individual size of your files within the corpus? How do you ensure that no one source is over-represented? E.g. 5 essays from one class, 15 essays from another – still take all the essays, but tag/annotate to double-check balance. What about length of writing? E.g. take samples of 300 words per essay (if the interest is grammatical)? But this loses analysis of the overall structure of the text. What about taking samples from different parts of the text, as words associate with beginnings/middles/ends (so a sample from only one part is skewed)? Think about size/representativeness with a pilot study, think about how you’d store articles on your computer (see image: age/ID No/Essay), and how you might stratify the data in order to ask good questions of it.
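The sampling idea can be sketched as follows. This is my own minimal version, assuming words are simply split on whitespace; the function name and the "start"/"middle"/"end" options are hypothetical, not something from the course.

```python
def sample_words(text, sample_size=300, position="start"):
    """Return a fixed-size run of words from the start, middle or end of a text."""
    words = text.split()  # assumption: whitespace-separated words
    if len(words) <= sample_size:
        return words      # short texts are kept whole
    if position == "start":
        begin = 0
    elif position == "middle":
        begin = (len(words) - sample_size) // 2
    elif position == "end":
        begin = len(words) - sample_size
    else:
        raise ValueError("position must be 'start', 'middle' or 'end'")
    return words[begin:begin + sample_size]

# Hypothetical 1,000-word essay made of numbered dummy words
essay = " ".join(f"w{i}" for i in range(1000))
for part in ("start", "middle", "end"):
    sample = sample_words(essay, 300, part)
    print(part, len(sample), sample[0], sample[-1])
```

Rotating the position across essays is one way to avoid a corpus made up entirely of introductions.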

VIDEO 2: Building a Corpus: The Basics

Keep a list of sources of information, by whom, when (if not obvious from the text), when accessed, gender, topic, language, etc… but only if relevant to research question, or to those who might use your data at a later date.

For under 18s = need parental permission, otherwise subject permission. (Letters re purpose of research, anonymity). If going to share data with others, then need to sign a copyright release form. Ensure anonymisation/ethics.

Sources? Word-process by hand (interesting and time-consuming, but necessary for spoken data), scan in (time-consuming to error-check), ask friends etc. for texts, buy, or use an existing corpus that’s in electronic form (take care with copyright for materials taken directly from the internet – but there are a number of text archives available).


Note differences between ‘spoken language’ and ‘written to be spoken’ language – not a problem unless you claim that scripts are representative of spoken language. 

BBC webpage http://news.bbc.co.uk/1/hi/8047410.stm is given as an example to collect data from. Note the issue of underlying code, so save the file in text-only format, although text held as an image (e.g. highlight boxes) may require typing in. Or strip out e.g. menu text, or copy/paste the text… or use e.g. http://www.httrack.com.
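As an illustration of the "underlying code" problem, here is a minimal sketch that strips tags from a saved page using Python's standard `html.parser`. It is not the method shown in the video. It only drops script and style content; menu text and other page furniture would still come through and need removing separately. The `page` string is a made-up example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when we are outside script/style
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

# Hypothetical page source
page = ("<html><head><style>p{color:red}</style><script>var x=1;</script></head>"
        "<body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>")
print(html_to_text(page))
```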


Consider using https://www.mturk.com/mturk/welcome to collect material.

VIDEO 3: Mark Up and Annotation

Add meta-data to help analysts … Header files = title, date, author, etc.
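A header could be as simple as a tagged block at the top of each text file. The sketch below uses an invented angle-bracket format (the `<header>` wrapper and the field names are my assumptions, not a standard from the course), with values escaped so stray `<` or `&` characters don't get mistaken for tags.

```python
from xml.sax.saxutils import escape

def with_header(body, **fields):
    """Prepend a simple tagged metadata header to a text (hypothetical format)."""
    lines = ["<header>"]
    for name, value in fields.items():
        lines.append(f"<{name}>{escape(str(value))}</{name}>")
    lines.append("</header>")
    return "\n".join(lines) + "\n" + body

# Made-up metadata values for illustration
print(with_header("Essay text goes here.", title="My essay", author="Student 01"))
```

Whatever format is chosen, the point from the video still applies: document it, so anyone else using the files knows what the tags mean.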

Annotation for stylistic interpretation… e.g. heading levels. But only if you’re interested in the features that help your question.

If sharing with others, you need to be clear about the system, so others can use it.

Grammatical annotation can be done fairly quickly with computers, but accuracy is not always great, especially with rare words or misspellings that the software does not recognise. May have to ‘error tag’, which has to be done manually.

VIDEO 4: American English

Corpora at Brigham Young University, from range of sources, includes historical data. OED dictionary of historical English…

COCA – ‘Must’ is most frequent in academic writing, and least frequent in spoken language – it’s a word in decline, especially after the 1990s.

BNC – only contains texts to 1993.
