Completing #CORPUSMOOC

During interactions with JISC and ALT in particular, MOOCs have been hot news for quite some time. MOOC is an acronym for ‘Massive Open Online Course’ – a course freely available to all. MOOCs don’t have the best reputation for completion rates, which has opened up a number of discussions at JISC/ALT events as to whether completion, and particularly full completion, of a MOOC is the point of these things. In 2012, JISC ran a session ‘What is a MOOC?’ – one of the early slides here:


Picking a Course

Last year, I decided to get my head around these, wondering whether CODEC could usefully develop a MOOC (as the financial imperative is not clear, except as a marketing exercise, for many of these courses). I cheerily signed up to about 3 courses, and… didn’t get started on a single one of them, as other work priorities took over. As we were developing potential funding bids earlier this year related to ‘religious identity online’, Pete suggested that I undertake the ‘Corpus Linguistics’ module that he’d had a go at last year. I had no idea what that was – so, as all good academic researchers do, I popped across to Wikipedia first for a definition:

Corpus linguistics is the study of language as expressed in samples (corpora) of “real world” text. This method represents a digestive approach to deriving a set of abstract rules by which a natural language is governed or else relates to another language. Originally done by hand, corpora are now largely derived by an automated process.

So I could see how this would be useful for analysing words collected from Twitter, Facebook, etc., to explore large social and cultural questions.
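The basic corpus move – counting how often words occur across a collection of texts – can be sketched in a few lines of Python (the ‘tweets’ below are invented for illustration):

```python
from collections import Counter
import re

# A toy corpus of invented tweets, standing in for real collected data
corpus = [
    "Faith and community matter online",
    "Online community is where identity is performed",
    "Identity online is shaped by community",
]

# Tokenise very crudely: lowercase everything and split on non-letters
tokens = [w for text in corpus for w in re.findall(r"[a-z]+", text.lower())]

# A frequency list -- the most basic corpus-linguistic artefact
freq = Counter(tokens)
print(freq.most_common(3))
```

Real corpus work adds much more (sampling, annotation, normalisation), but frequency lists like this underpin most of what the course goes on to cover.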

Corpus Linguistics Online

The course, hosted on Futurelearn, and presented by Tony McEnery, Lancaster University, was designed as a practical course for humanities and social science researchers, with the following stated aims:

  • Demonstrate that corpus approaches to social science can offer valuable insight into social reality by investigating the use and manipulation of language in society.
  • Equip social scientists with skills necessary for collecting and analysing large digital collections of text (corpora).
  • Provide educational support for those who want to use the corpus method.
  • Demonstrate the use of corpus linguistics in the humanities, especially History.
  • Give a sense of the incredibly wide uses that corpora have been put to.
  • Allow those with an interest in language, who have not heard of the corpus approach before, a new way of looking at language.

Combined with all the recent changes in CODEC, the first week’s material seemed a little overwhelming, so early on I made a conscious decision to focus on the theoretical explanations each week (likely to take around 60-90 minutes per week, rather than 3-4 hours). That way I could grasp the method, and the kinds of questions it allows us to ask, rather than the practical aspects of the software (also provided). Having just finished the course, though, I am now checking out some of the optional videos, especially one from Claire Hardaker about trolling, as I’ve recently been asked to contribute to debates about trolling, bullying (and the place of restorative justice in these debates):


The Process

First things first was making time in the diary for this. I originally expected it to take around 3 hours a week, and that did drop, but I wanted the process to be not just about the course content, but about thinking how a MOOC works and what it contributes to our learning. Much of this may have been absorbed at gut level rather than laid out here, so this is more of a ‘quick and dirty’ response!

  • Sign up for the course on Futurelearn (there are other providers, including independently hosted MOOCs, e.g. this ‘open theology‘ module I’m undertaking with WTC)
  • Make time, and put reminders in my todoist! Get clear on what I actually wanted to get out of this, so that I would focus time/energy on those areas.

It was really easy to sign up for Futurelearn, and everything comes in via email, so it was simple to search and find the course I wanted (and you’ll see from the screenshot how easy it is to leave it too):


Week by Week

Each week the available material would appear. Clearly, it is technologically possible for all the material to appear at the same time, but there’s a need to encourage people to work on the material together, with a start date, etc. This encourages use of the well-structured (and well-used) ‘commenting’ space, which Tony himself contributes to frequently (and from which he is clearly gaining insights into his own research), supported by a number of mentors who have been assigned and are highly active (I’ve had replies to several comments, but without having asked for permission, I thought I should just share my own!):



So, the material appeared each week, looking like this (I’m assuming that if I’d completed all the practical activities, those lines underneath would have got longer!), with the most basic, introductory material (usually in the form of videos from Tony) at the top – which was the stuff I was really interested in.





Being able to see how much more there is to go is always a good incentive – below the fold there was much extra material (more videos, readings, practical software help, etc.), but I usually finished at the point of the quiz, which isn’t assessed, but helps you “know” that you have “learnt” some material that week (and where one might want to go back and re-assess):





Users are encouraged to keep a journal throughout the course, which I did through notes kept in ‘Word’ and then transferred across to this blog, and shared using the hashtag #CorpusMOOC.

What have I gained?

Well, I may have more to say about this in the longer term, but for now:

  1. I’ve started a MOOC
  2. I’ve finished a MOOC
  3. I’ve done the bits of the MOOC I wanted (if you know me, you know I’m a bit of a completer/over-achiever, and initially thought I couldn’t just do the bits I want!) and no more.
  4. I think I’ve got a good sense of what Corpus Linguistics is capable of, and could see that I could use it in my research, although I would have to spend more time learning the practicalities/partner with A.N.Other.

I thought the material was well-presented and manageable (once I stuck to the first bits), the intentional interaction was good, as was the usability of the software, and I can see how more could be done in this subject area.


Thank you Tony and team!



#CORPUSMOOC: Week 8: A Swearing Extravaganza

This week we looked at ‘swearing’ as it is used within language… so there’s a disclaimer. Some of the comments:

The use of ‘bad language’ seems to me to be very culturally specific. For example, young people seem to use it more often than old people. And I see variation in what’s considered ‘bad language’ between registers and dialects. For example, the same person would never use bad language at work but probably uses it when he is with friends; and what’s considered bad in some areas would not have this consideration in others.

Of course, you have to define what is meant by ‘bad language’; obscenity is very culturally specific (Northern Europeans: body parts, coition and excrement; Southern Europeans: religion, mothers, aspersions re sexuality – the Victorians found the phrase ‘what a cunning hat’ rather racy). The point is well put, though.

Oh dear, the warm-up activity is to listen out for the use of bad language in conversation around us… probably more than you’d expect, even in my own context! Interesting conversations online about whether language teachers should teach this, as students will come across it (don’t we all remember how funny it was once we learnt ‘merde’ in French classes!)

Amazing what you can get used to after a while, and how much these words lose their strength through overuse.

Part 1: Looking at Bad Language

Why say ‘bad language’ and not ‘swearing’? Definitions of what is ‘swearing’ = complex!


Words were developed for the Lancaster corpus of abusive words – including animal, intelligence and sexuality focused insults. They then had to develop an annotation system for the material – including class, gender, age, etc. This can provide some quite useful distinctions that can be researched. Metalinguistic use of a word – I am not using the word, but I’m talking about it/describing it, or quoting someone else saying it.

Who knew there were so many different ways to use ‘fuck’ – fascinating…


The final category = a ‘dustbin category’ for uses that didn’t fit any of the other categories and didn’t really need further work.

A commentator suggests that the video helps give further insights into the use of swearing in language – jocular uses and ‘fillers’ have been mentioned by other commentators.

Another kind of ‘MOOC’ – such dictionaries allow us to see language develop.

Part 2: Swearing and Gender

We can use such corpora to see how such language is actually used – but we’ll likely approach such questions with a number of assumptions, e.g. that men swear more than women. In the early 1990s, there was no statistical difference in usage, but looking at the individual words themselves, these are different… words used by men tended to be stronger.


There are levels of ‘strength’ seen, but there are possibilities that these might be used differently … e.g. ‘religious people’ more offended by God/Jesus than the general population [Note to second year housemates, yes…]

Commentators mentioned encouraging people to rethink phrases that have become everyday:

  • Someone being ‘a bit gay’
  • Someone having ‘a blonde moment’
  • Someone ‘running like a girl’

Is there ‘surgical cleaning’ where such words become sanitised? Corpus tools, of course, are good at identifying the change in language of words e.g. ‘gay’.

Different people will probably see some of the words as more offensive than others… e.g. people say ‘God’ without thinking – probably more offensive to ‘religious people’ than many realize.

Part 3: Swearing and Interaction

How do the genders interact when it comes to the use of ‘bad language’ words? Is there a difference within or across genders? Intra-gender use of swearing is the norm (e.g. men direct swearing at other men more than at women, and vice versa), but men do this much more than women (have they been cultured to swear less in front of women?).

What kind of words are targeted? E.g. ‘cow’ exclusively at women…


Wow… so much complexity!

Part 4: Strength of Swearing

Different categories of words (e.g. general annoyance) = much milder words, but the ‘destinational category’ (reached the end of your tether = “go away”) = much stronger!

Discussions mentioned the new British National Corpus coming this year, where it will be interesting to see how words are used/re-used and reclaimed – e.g. African-Americans reclaiming ‘n****r’, gay people reclaiming ‘queer’ and women reclaiming ‘bitch’ as positive interaction words. Also lots of discussion as to regional/cultural differences and how the right corpus might help explore those.

Part 5: Swearing and Age

Assumption is that younger people tend to swear more, and data seems to bear that out:


Is it down to age? It’s not necessarily their age that is the issue: the cultural environment may have meant that swearing was less accepted, rather than people swearing less as they get older! Are they possibly using ‘swear words’ so mild that they’ve not been measured as swear words (e.g. golly, blimey)? Although that effect doesn’t show up either. What about the strength of swear words/categories? That mirrors the distribution from the graph above – the frequency and strength distributions are similar.

A commentator notes: ‘When angry, count to four; when very angry, swear.’ (Mark Twain). Also questioning whether the extra drop-off is down to being in the presence of children/grandchildren, when people seek to rein themselves in.

Part 6: Swearing and Class

How do we draw out the nuances here? Do lower classes select stronger words, and higher class = weaker ones?

AB: 1.81, C1: 1.76, C2: 2.16, DE: 2.47 (the general pattern holds, but AB = stronger than C1)…

What about the type of bad language use? AB/C1 and C2/DE = inverted.

Lots of discussion about whether upper classes = rules don’t apply, and middle classes more cautious…

Part 7: Combining Factors

What happens when we try to combine the data – e.g. do male AB speakers aged 25-34 swear the most? The BNC was balanced to get roughly similar amounts of data on single factors, so there may be no examples combining particular factors… that particular group = 2,259 words uttered in the spoken BNC.

How many types of speakers are in the BNC? Not many, but we can combine particular types of data to give insights.

Part 8: Combining Factors – 2 Case Studies

Age/Gender combination:


Class/Gender combination:


Class/Gender/Age combination:


Do you want to argue that women are pre-disposed to use fewer swearwords? Surely it’s socially constructed – an artifact of the society within which these two genders are operating, nothing to do with genetics. Debate? Where did the distinctions come from? What were the social processes that constructed this?

Commentator: people are willing to say things in other languages that they’re not prepared to say in their own.

Final Words from Tony

The start of a journey into language… with an overview of the kind of things you should have learned, leaving you in a position to build your own corpora [though I didn’t use the practical elements!]… and don’t think that this course has given you everything…

We often want to study language in its social context, rather than in isolation. Contemporary social issues or historical issues are typically the most interesting.


#CorpusMooc: Week 7 Notes

What languages did you learn and how?

The only test I’ve ever got 100% on is a language aptitude test – apparently I’m good at identifying patterns and working it out from there… which has probably been noticeable “in real language interactions”.

French to GCSE level, via textbooks, but to get through the exams = extra spoken lessons, where saying the correct thing was abandoned for getting ‘the right word’

German for a couple of years – got confused between that and French, very particular words and grammar focused

Latin for 3 years – grammar grammar grammar and vocab

Italian – tried an online course – didn’t need to put it into practice

Brazilian Portuguese – Linkwords (linking words to really silly sentences) gave me something to start with for 5 months in Brazil, then I had to use the language to progress. Now using an iPad app to get back into things – where everything is gamified – largely vocab focused.

Learner Corpora

Learner corpora contain data by those learning a particular language… Native corpora don’t reveal the problems that learners tend to encounter (as natives don’t tend to make the mistakes that learners do). Identifying errors in essays, etc. allows the development of new learner corpora. There can be bigger complications than frequency: what is the background of the original language, so what translates/makes sense, etc.? What about under/overuse of words (especially compared to native speakers)?
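Over/underuse is usually assessed by comparing normalised frequencies (e.g. per million words), since a learner corpus and a native reference corpus are rarely the same size. A minimal sketch, with invented counts:

```python
# Invented raw counts: occurrences of one word, plus total corpus sizes
learner_count, learner_size = 120, 200_000   # learner corpus
native_count, native_size = 300, 1_000_000   # native reference corpus

def per_million(count: int, size: int) -> float:
    """Normalise a raw frequency to occurrences per million words."""
    return count * 1_000_000 / size

learner_pm = per_million(learner_count, learner_size)  # 600.0 per million
native_pm = per_million(native_count, native_size)     # 300.0 per million

# A ratio > 1 suggests overuse by learners, < 1 suggests underuse
print(learner_pm / native_pm)
```

On these invented numbers the learners use the word twice as often as native speakers, relative to corpus size, even though the raw native count is higher.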

Interesting differences between the keywords that were used in discussions re the use of mobile phones – where are the different cultural emphases?



Interesting – Americans tend to use personal pronouns (I/individual experiences), whilst Poles tended to use we/the group – he speculates whether Polish has a more ‘academic’ writing style [or is it the cultural expectations – I’d definitely assume that Americans talk individually]… or Poles have fewer mobile phones so probably use them in a group, and they rely on abstract nouns anyway = more generalisations. Rhetorical style – there can be practical reasons, teaching style/vocabulary, societal differences.

It is more common to come across written than spoken corpus data… speech is more difficult to capture, and the data also suggests a larger range of word forms than there really are, as the computer doesn’t recognise spelling mistakes, etc. If analysis is just at a lexical level, it misses the range of uses. Too much research is not shared.


#CORPUSMOOC Week 6 (Notes)

Before you watch the lecture, create two short dictionary definitions: one is for the word ‘threadbare’, the other is for the word ‘luckily’. Do not consult a dictionary or other reference resource – just use your own intuitions. If you do not think you know either word, just make a list of words that you think may be associated with each. Then watch the lecture.

Threadbare: A condition in which clothes are worn through, nearly to rags.

Luckily: Where a situation could have gone wrong, but the outcome was positive.

VIDEO MATERIAL: History & Development of Corpus Linguistics.

Use large corpora to identify the words that are most frequently used. The most efficient form of language learning ties to the words that people use most frequently. These studies are corpus-based in their philosophy.

Early on, most data was written rather than spoken, and much was not contemporary text (e.g. 19th-century novels and the Bible). By the end of the 1950s the focus moved from teaching words to teaching rules (grammar). Verbs = 60% of what we use, but are hard to teach, + irregular verbs. Look for the popular words that typify speech.

Listening to these videos as a piece of history, as the studies have developed over time, identifying various elements of text and speech, and focusing on the words that people actually use, etc. – a very small number, with a large number of common lexical bundles (less common in academic writing). Developments in dictionaries – large numbers of words, especially rare words, is not helpful – that’s what is required for [e.g. Countdown]. Writing definitions – you need examples of how the word is used in context… I like what can be done with this in making the dictionary digital.

I’m not a linguist (but wanting to interrogate tweets), so I’m multi-tasking on this material and taking fewer notes!





#CORPUSMOOC Week 5 (Notes)

When taking a statement from a witness or suspect, what kinds of factors about them, the crime, or the larger social context should we take into account? One example to get you started: the interviewee’s age – children and the very elderly should be treated especially carefully.

Suspect many would say ideally classless, but their suspected role in the crime, the level of evidence, age, race, gender, religion, class, education level, the recency of the crime?

Forensic Linguistics (Claire Hardaker)

Narrow view – forensics = courtroom uses, etc.

Broad view = anything from a criminal/civil trial or part of the investigative procedure (texts may not have been expected to be forensic data, but they become so).

What is the meaning of this text? (what is the purpose of it?)

Who authored a given text? (actually written by x)

Language of legal texts/processes (e.g. was consent truly informed). Huge area, restricted only by the questions you ask…

Physical Evidence

Analyse ink/paper, etc. to see if appropriate to era, etc.

Historical Evidence

Knowing what the author of a particular document knew; usually people specialise in only one author, as depth of knowledge is required.

Cipher-based decryption

The author has deliberately encoded their name into texts (particular to Bible studies/Shakespeare studies). Not a particularly serious method of analysis.

Manual/Qualitative analysis

Conversation/discourse analysis, syntax, stylistic choices, etc. Look in depth at the language being used. Drawback = cherry picking – in a court this can support either side (prosecution/defence).

Automated/quantitative analysis

Computational linguistics, computational stylometry, and today’s focus… multi-variate approaches…

Combining forensic linguistics and corpus linguistics

What are the benefits or drawbacks? Combining approaches – don’t just celebrate the strengths, but also understand the pitfalls (especially if it’s evidence for a court case).

Looking at ‘disputed authorship’…

Corpus data = large datasets that have often been cleaned for consistency of spelling, etc.

Forensic data = often small, e.g. a text, so difficult to analyse. Often quite messy.

May allow one to set, e.g., a text against a larger dialect corpus.

Looking at ‘style’, we are looking for things that are ‘unconscious’ and therefore unchanged from the writer’s general style – e.g. a forged suicide note. It can be hard to identify unconscious material.

Corpus tools make it easy to search large datasets, whereas forensic information is difficult to encode – e.g. a threat, sarcasm, etc.

Adopting a corpus approach means that you have made assumptions – e.g. that you are going to have something to count, and that the count will be meaningful. On e.g. Twitter, how do you account for variations of e.g. ‘and’ = +, &, n, etc.? Looking at texts: if someone always seems to write xx at the end of texts, but doesn’t on this text, is it therefore not theirs? = needs more context.
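The ‘+, &, n’ problem is usually handled by normalising known variants onto a canonical form before counting; a minimal sketch (the variant map here is invented for illustration):

```python
# Invented variant map: collapse informal spellings onto canonical forms
VARIANTS = {"+": "and", "&": "and", "n": "and", "u": "you", "ur": "your"}

def normalise(tokens):
    """Map known informal variants onto their canonical form,
    leaving unrecognised tokens unchanged."""
    return [VARIANTS.get(t, t) for t in tokens]

print(normalise(["fish", "n", "chips", "&", "u"]))
# ['fish', 'and', 'chips', 'and', 'you']
```

The hard part in practice is building the map itself: ‘n’ is sometimes ‘and’, sometimes the letter, which is exactly the ambiguity the lecture points at.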

If there is no restriction on the possible author, you’re not going to be able to identify them. Corpus forensics works better at narrowing between A and B, rather than across the whole population. Words have a range of meanings, so you can end up with redundant data.


These two approaches can still work together:

  • Shared goal = objective, quantitative facts of yes/no
  • Is it common? What does it typically mean? Is it significant?
  • Corpus = how likely is it that this occurred by pure chance alone.
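That ‘pure chance’ question is commonly answered with Dunning’s log-likelihood (G²) statistic, which compares a word’s observed counts in two corpora against the counts expected if the corpora behaved identically. A sketch with invented counts:

```python
import math

def log_likelihood(a, b, c, d):
    """Dunning's log-likelihood (G2) for a word occurring a times in
    a corpus of c words and b times in a corpus of d words."""
    e1 = c * (a + b) / (c + d)  # expected count in corpus 1
    e2 = d * (a + b) / (c + d)  # expected count in corpus 2
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

# Invented counts: 50 hits in 100k words vs 10 hits in 100k words
g2 = log_likelihood(50, 10, 100_000, 100_000)
print(round(g2, 2))
# G2 above 3.84 is significant at p < 0.05 (chi-square, 1 d.f.)
```

A high G² says the frequency difference between the two corpora is very unlikely to be chance alone; it does not, by itself, say why the difference exists.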

The Case of Derek Bentley: the crime

Diagnosed with a mental age of 10, a reading age of 8, and 66 on an IQ test (unusual). Who armed him (with knife/knuckle dusters)? What did he say? ‘Diminished responsibility’ was not recognised.

The case of Derek Bentley: the evidence

Saying ‘the gun’ = it was assumed (shared knowledge) that there was a gun that Bentley knew about. Police had to write statements down longhand, but couldn’t ask substantive questions (they could ask for a repeat, but not, e.g., ‘what time was that?’). Bentley’s ‘witness statement’ was presented to court as a faithful record of what he’d said. Throughout the trial Bentley said he didn’t write it himself, but 3 police officers said he did. The statement clearly demonstrates that a conversation has been turned into a witness statement = crucial to his conviction.

The Case of Derek Bentley: the analysis and conclusions

Note the use of ‘then’ (temporal = sequences of events). Typically, monologic statements don’t display this, so it suggests that there was intervention.




Can’t use this alone, but it is another indicator; also see the pre/post-positioning and which constructions are un/usual (‘I then’/‘then I’) – 1,000 times more frequent in Bentley’s statement than in the entire COBUILD corpus.

Along with other features, if it becomes clear that Bentley hadn’t written the statement, and he was convicted largely on ‘the gun’, then how reliable is that evidence? He was not fully pardoned until 1998. We can’t give Bentley back his life, but we can challenge a miscarriage of justice.

PART 6: Other cases and datasets

Looking at the language an offender used doesn’t help prevent a crime, but it does help us understand triggers, etc., and may provide warnings of other crimes.


The course offers a set of forensic data, including the Old Bailey, the Unabomber, OJ Simpson, Harold Shipman, David Irving vs Penguin, Enron, Anders Breivik, Paul Ceglia vs Mark Zuckerberg, Conrad Murray, etc.

Be aware of version control, the ethical nature of the material, whether edits have been made, etc. Ensure the rigorous nature of the work that you do, as others’ prison sentences could depend upon it.