#CORPUSMOOC Week 5 (Notes)

When taking a statement from a witness or suspect, what kinds of factors about them, the crime, or the larger social context should we take into account? One example to get you started: the interviewee’s age – children and the very elderly should be treated especially carefully.

Suspect: many would say the process should ideally be classless, but consider their suspected role in the crime, the level of evidence, age, race, gender, religion, class, education level, and the recency of the crime.

Forensic Linguistics (Claire Hardaker)

Narrow view – forensic = courtroom uses, etc.

Broad view = anything from a criminal/civil trial or part of the investigative procedure (the texts may not have been expected to be forensic data, but they become so).

What is the meaning of this text? (what is the purpose of it?)

Who authored a given text? (actually written by x)

Language of legal texts/processes (e.g. was consent truly informed). Huge area, restricted only by the questions you ask…

Physical Evidence

Analyse ink/paper, etc. to see if appropriate to era, etc.

Historical Evidence

Knowing what the author of a particular document knew; usually people specialise in only one author, as real depth of knowledge is required.

Cipher-based decryption

The author has deliberately encoded their name into texts (particular to Bible studies/Shakespeare studies). Not a particularly serious method of analysis.

Manual/Qualitative analysis

Conversation/discourse analysis, syntax, stylistic choices, etc. Look in depth at the language being used. Drawback = cherry-picking – in court it can be made to support either the prosecution or the defence.

Automated/quantitative analysis

Computational linguistics, computational stylometry, and today’s focus… multivariate approaches…

Combining forensic linguistics and corpus linguistics

What are the benefits or drawbacks? Combining approaches – don’t just celebrate the strengths, but also understand the pitfalls (especially if it’s evidence for a court case).

Looking at ‘disputed authorship’…

Corpus data = large datasets that have often been cleaned for consistency of spelling, etc.

Forensic data = often small, e.g. a text, so difficult to analyse. Often quite messy.

May allow you to set, e.g., a single text against a larger dialect dataset.

Looking at ‘style’, we are looking for features that are ‘unconscious’ and therefore unchanged from the writer’s general style – e.g. in a forged suicide note. It can be hard to identify unconscious material.

Corpus – easy to search large datasets, whereas forensic information is difficult to encode – e.g. a threat, sarcasm, etc.

Adopting a corpus approach means you have made assumptions – e.g. that you will have something to count, and that the count will be meaningful. On e.g. Twitter, how do you account for variants of e.g. ‘and’ (+, &, n, etc.)? Looking at text messages: if someone always seems to write xx at the end of texts but doesn’t on this one, does that mean it isn’t theirs? It needs more context.
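The variant problem (‘and’ appearing as +, &, n, …) is usually handled by normalising the text before counting. A minimal sketch – the variant list and function names here are illustrative, not from the course:

```python
import re

# Standalone informal variants of "and"; lookarounds stop us touching
# the letter n inside ordinary words like "then" or "now".
AND_VARIANTS = re.compile(r"(?<!\w)(?:\+|&|n)(?!\w)")

def normalise(text: str) -> str:
    """Canonicalise spelling variants so frequency counts are comparable."""
    return AND_VARIANTS.sub("and", text.lower())

print(normalise("fish + chips & mushy peas n gravy"))
# fish and chips and mushy peas and gravy
```

The design point is that normalisation is itself an analytical assumption: you are deciding in advance which surface forms count as ‘the same word’.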

If there is no restriction on the candidate authors, you are not going to be able to identify them. Corpus forensics works better at narrowing between candidates A and B than across a whole population. Words have a range of meanings, so you can end up with redundant data.
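The ‘narrowing between A and B’ scenario can be sketched as a closed-set attribution task: build a style profile of high-frequency function words for each candidate and see whose profile the disputed text sits closer to. A toy sketch – the marker list and distance measure are illustrative, not a vetted forensic method:

```python
from collections import Counter

# Style markers: high-frequency function words, chosen because such
# choices are largely unconscious. Illustrative list only.
MARKERS = ["the", "and", "of", "to", "i", "then", "a", "in"]

def profile(text: str) -> list[float]:
    """Relative frequency of each marker word per 1,000 tokens."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return [1000 * counts[w] / len(tokens) for w in MARKERS]

def distance(p: list[float], q: list[float]) -> float:
    """Mean absolute difference between two style profiles
    (a crude, unscaled cousin of Burrows' Delta)."""
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)

def closer_author(disputed: str, author_a: str, author_b: str) -> str:
    """Return which candidate's known writing the disputed text resembles."""
    da = distance(profile(disputed), profile(author_a))
    db = distance(profile(disputed), profile(author_b))
    return "A" if da < db else "B"
```

Note the built-in assumption from the notes above: this only decides *between* A and B; it can never tell you the true author was someone outside the candidate set.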


These two approaches can still work together:

  • Shared goal = objective, quantitative yes/no answers
  • Is it common? What does it typically mean? Is it significant?
  • Corpus = how likely is it that this occurred by pure chance alone?
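The ‘pure chance alone’ question is typically answered with a statistical test on frequency counts; corpus linguists commonly use Dunning’s log-likelihood (G²) statistic. A minimal sketch – the counts in the example are made up, not data from the course:

```python
import math

def log_likelihood(freq_a: int, size_a: int, freq_b: int, size_b: int) -> float:
    """Log-likelihood (G2) for comparing a word's frequency in two corpora.
    Higher values = less likely to be chance; G2 > 3.84 corresponds
    roughly to p < 0.05."""
    total = size_a + size_b
    combined = freq_a + freq_b
    # Expected frequencies under the null hypothesis that the word is
    # equally common in both corpora.
    exp_a = size_a * combined / total
    exp_b = size_b * combined / total
    g2 = 0.0
    for obs, exp in ((freq_a, exp_a), (freq_b, exp_b)):
        if obs > 0:
            g2 += obs * math.log(obs / exp)
    return 2 * g2

# Hypothetical: 12 hits in a 1,000-word text vs 150 hits in a
# 1,000,000-word reference corpus.
print(round(log_likelihood(12, 1_000, 150, 1_000_000), 2))
```

A high G² says the difference is unlikely to be chance; it does not, by itself, say *why* the difference exists – that is where the qualitative analysis comes back in.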

The Case of Derek Bentley: the crime

Bentley was diagnosed with a mental age of 10, a reading age of 8, and scored 66 on an IQ test (unusually low). Key questions: who armed him (with knife/knuckle-dusters), and what did he say? ‘Diminished responsibility’ was not recognised at the time.

The case of Derek Bentley: the evidence

Saying ‘the gun’ presupposes (as shared knowledge) that there was a gun Bentley knew about. Police had to write statements down longhand and couldn’t ask substantive questions (they could ask for a repeat, but not e.g. ‘what time was that?’). Bentley’s ‘witness statement’ was presented to the court as a faithful record of what he’d said. Throughout the trial Bentley said he didn’t write it himself, but three police officers said he did. The statement clearly demonstrates that a conversation has been turned into a witness statement = crucial to his conviction.

The Case of Derek Bentley: the analysis and conclusions

Note the use of ‘then’ (temporal = sequencing events). Typically monologic statements don’t display this, so it suggests there was intervention.




This can’t be used alone, but it is another indicator; also note the pre/post-positioning of ‘then’ relative to the subject and which constructions are un/usual (‘I then’ vs ‘then I’). ‘I then’ occurs, proportionally, about 1,000 times more often in Bentley’s statement than in the entire COBUILD corpus.
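A figure like ‘1,000 times more often’ is just a ratio of normalised frequencies (occurrences per million words), since the statement and the reference corpus differ enormously in size. A sketch with illustrative counts, not the actual case data:

```python
def per_million(hits: int, corpus_size: int) -> float:
    """Normalised frequency: occurrences per million words."""
    return hits * 1_000_000 / corpus_size

# Illustrative counts: a phrase appearing 3 times in a 582-word
# statement vs 9 times in 1,500,000 words of ordinary spoken English.
statement = per_million(3, 582)        # ~5,155 per million words
reference = per_million(9, 1_500_000)  # 6 per million words
print(f"{statement / reference:.0f}x more frequent")
# prints "859x more frequent"
```

The point is that raw counts are meaningless across corpora of different sizes; only normalised frequencies can be compared.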

Along with other features, it becomes clear that Bentley hadn’t written the statement himself; and since he was convicted largely on ‘the gun’, how reliable is that evidence? He was not fully pardoned until 1998. We can’t give Bentley back his life, but we can challenge a miscarriage of justice.

PART 6: Other cases and datasets

Looking at the language an offender used doesn’t help prevent a crime, but it does help us understand triggers, etc., and may provide indicators for other crimes.


Offers a set of forensic datasets, including the Old Bailey, the Unabomber, OJ Simpson, Harold Shipman, David Irving vs Penguin, Enron, Anders Breivik, Paul Ceglia vs Mark Zuckerberg, Conrad Murray, etc.

Be aware of version control, the ethical nature of the material, whether edits have been made, etc. Ensure the rigour of the work that you do, as others’ prison sentences could depend upon it.