Cutting out the middleman in Data-Driven Learning

Cut out the middle man via frontbad sketchbook

“…the attempt to cut out the middleman as far as possible and to give the learner direct access to the data” (Johns, 1991, p.30)

Importance is placed on empirical data when taking a corpus-informed and data-driven approach to language learning and teaching. Moving away from subjective conclusions about language based on an individual’s internalized cognitive perception of language and the influence of generic language education resources, empirical data enable language teachers and learners to reach objective conclusions about specific language usage based on corpus analyses. Tim Johns coined the term Data-Driven Learning (DDL) in 1991 with reference to the use of corpus data and the application of corpus-based practices in language learning and teaching (Johns, 1991). The practice of DDL in language education was appropriated from computer science where language is treated as empirical data and where “every student is Sherlock Holmes”, investigating the uses of language to assist with their acquisition of the target language (Johns, 2002, p.108).

A review of the literature indicates that the practice of using corpora in language teaching and learning pre-dates the term DDL, with work carried out by Peter Roe at Aston University in 1969 (McEnery & Wilson, 1997, p.12). Johns is also credited with having coined the term English for Academic Purposes (Hyland, 2006). Johns’ oft-quoted words about cutting out the middleman tell us more about his DDL vision for language learning, in which teacher intuitions about language are put aside in favor of powerful text analysis tools that give learners direct access to some of the most extensive language corpora available, the same corpora that lexicographers draw on for making dictionaries, so that they can discover for themselves how the target language is used across a variety of authentic communication contexts. As with many brilliant visions for impactful educational change, however, his also appears to have come before its time.

This post will argue that the original middleman in Johns’ DDL metaphor took on new forms beyond that of teachers getting in the way of learners having direct access to language as data. An argument will be put forward to claim that the applied corpus linguistics research and development community introduced new and additional barriers to the widespread adoption of DDL in mainstream language education. Albeit well intentioned and no doubt defined by restrictions in research and development practices along the way, new middlemen were paradoxically perpetuated by the proponents of DDL making theirs an exclusive rather than a popular sport with language learners and teachers (Tribble, 2012). And, with each new wave of research and development in applied corpus linguistics new and puzzling restrictions confronted the language teaching and learning community.

The middle man comic – first issue cover via Wikipedia

The middleman in DDL has presented himself as a sophisticated corpus authority in the form of research and development outputs, including text analysis software designed by, and for, the expert corpus user with complex options for search refinement that befuddled the non-expert corpus user, namely language teachers and learners. Replication of these same research methods to obtain the same or similar results for uses in language teaching and learning has often been restricted to securing access to the exact same software and know-how for manipulating and querying linguistic data successfully.

Which language are you speaking?

He has been known to speak in programming languages, with his interfaces often requiring specialist trainers to communicate his most simple functions. Even his most widely known interface for presenting linguistic data, KWIC (Key Word In Context), with strings of search terms embedded in truncated snippets of language context, remains foreign-looking to the mostly uninitiated in language teaching and learning. In many cases he has not come cheap either: costly subscriptions to, and upgrades of, his proprietary software have been the norm, especially in the earlier days.
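For readers who have never seen one, a KWIC display can be sketched in a few lines of Python. This is only an illustration of the idea, not the code behind any of the tools discussed in this post; the function name `kwic`, the `width` parameter and the sample sentences are all made up for the example.

```python
import re

def kwic(text, keyword, width=30):
    """Return one line per hit: left context, the keyword, right context."""
    lines = []
    # Match the keyword as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        # Truncate the context to a fixed number of characters on each side.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines

# Hypothetical sample text, invented for this illustration.
sample = ("Learners can explore the data directly. "
          "The data show how words behave in context, "
          "and data from a corpus can support a claim.")

for line in kwic(sample, "data", width=20):
    print(line)
```

The truncated snippets on either side of the search term are exactly what makes the display look foreign at first: the lines are cut by character count, not by sentence, so the eye has to learn to read down the central column rather than across.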

In particular, with reference to English Language Teaching (ELT), he has criticized many widely used ELT course book publications and their language offerings for ignoring his research findings based on evidence for how the English language is actually used across different contexts of use. In response, a few ELT course book publishers have clamored around him to help him get his words out for a price but in so doing have rendered his corpus analyses invisible, in turn creating even more of a dependency on course books rather than stimulating autonomy among language teachers and learners in the use of corpora and text analysis tools for DDL. And, because publishers were primarily confining him to the course book and sometimes CD-ROM format there were only so many language examples from the target corpora that could possibly fit between the covers of a book and only the most frequent language items made it onto the compact disc.

The Oxford Collocations Dictionary for Students of English (2nd edition, 2009, Oxford University Press), based on the British National Corpus (BNC), is one example where high frequency collocations for very basic words like any and new predominate and where licensing restrictions permit only one computer installation per CD-ROM. The use of closed corpora in leading corpus-derived ELT books compounds the openness issue. The Cambridge University Press (CUP) publication From Corpus to Classroom (O’Keeffe, McCarthy & Carter, 2007) might have been more aptly entitled From Corpus to Book: it draws heavily on the closed Cambridge and Nottingham Corpus of Discourse in English (CANCODE) from Cambridge University Press and Nottingham University and recommends the proprietary concordancing programs WordSmith Tools and MonoConc Pro, thereby putting any replication of its analyses of that corpus out of reach of its readers.

Mainstream language teacher training bodies continue to sidestep the DDL middleman in the development of their core training curricula (for example, the Cambridge ESOL exams) due to the problems he poses with accessibility in terms of cost and complexity. Instead, English language teacher training remains steadily focused on how to select and exploit corpus-derived dictionaries with reference to training learners in how to identify, for example: definitions, derivatives, parts of speech, frequency, collocations and sample sentences. In the same way that corpus-derived course books do not render corpus analyses transparent to their users, training in dictionary use does not bring teachers and their learners any closer to the corpora they are derived from.

Cambridge English Corpus

Michael McCarthy presented ‘Corpora and the advanced level: problems and prospects’ at IATEFL Liverpool 2013. One of the key take-away messages from his talk was that learners of more advanced English receive little return on investment once the highest frequency items of English vocabulary have been acquired (he referred to the top 2000 words from the first wordlist of the British National Corpus, which make up about 80% of standard English use). With each subsequent wordlist of 2000 words, the share of usage covered drops considerably, so the time and money spent on yet more English language classes may not be affordable or feasible. This has particular implications for learning English for Specific Purposes (ESP), including English for Academic Purposes (EAP), which many would argue is always concerned with developing specific academic English language knowledge and usage within specific academic discourse communities.
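The diminishing return described here is a matter of simple arithmetic: rank the words of a corpus by frequency, cut the ranking into bands, and work out what share of all running words each band accounts for. The sketch below does this on a toy sentence with bands of two words rather than 2000; the function name `band_coverage` and the sample text are my own illustrations and have nothing to do with the BNC wordlists themselves.

```python
from collections import Counter

def band_coverage(tokens, band_size):
    """Share of all running words covered by each successive frequency band."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Frequencies only, ordered from the most to the least frequent word.
    ranked = [freq for _, freq in counts.most_common()]
    coverage = []
    for start in range(0, len(ranked), band_size):
        band_total = sum(ranked[start:start + band_size])
        coverage.append(band_total / total)
    return coverage

# Toy text standing in for a corpus; real wordlists use bands of 2000 words.
tokens = "the cat sat on the mat and the dog sat on the log".split()
shares = band_coverage(tokens, band_size=2)

for number, share in enumerate(shares, start=1):
    print(f"band {number}: {share:.0%} of running words")
```

Because the words are ranked by frequency, each band can never cover more than the one before it, which is the pattern behind the point about advanced learners.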

I caught Michael McCarthy on the way out of the presentation theatre and he kindly agreed to walk and talk while rushing to catch his train out of Liverpool. Would the Cambridge English Corpus be made available anytime soon for non-commercial educational research and materials development purposes, I asked. I hastened to add the possibilities of, and the real-world need for, promoting corpus-based resources and practices in open and distance online education as well as in traditional classroom-based language education. He agreed that the technology had become a lot better for finally realising DDL within mainstream language teaching and learning and within materials development. Taking concordance line printouts into ELT classrooms had never really taken off in his estimation and I would have to agree with him on that point. He indicated, however, that the corpus would be unlikely to become openly available in the foreseeable future, owing to the large amount of private investment in its development, with access restricted to the participating stakeholders on the project.

But what would the real risk be in opening up this corpus to further educational research and development for non-commercial purposes with derivative resources made freely available online? Wouldn’t this be giving the corpus resource added sustainability with new lives and further opportunities for exploitation that could advance our shared understanding of how English works across different contexts, using current and high quality examples of language in context? More importantly, wouldn’t this give more software developers the chance to build more interfaces using the latest technology, and more ELT materials developers, including language teachers, the chance to show different derivative resource possibilities for effectively using the corpus in language teaching and learning?

A stipulation limiting use to non-commercial educational purposes could be applied in all of the above resource development scenarios. Indeed, these could all be linked back to the Cambridge English Corpus project website as evidence of the wider social and educational impact of the initial investment. This is what will be happening with most of the publicly funded research projects in the UK following recommendations from the Finch report, which come into effect in April 2014. It follows that Open Educational Resources (OER) and Open Educational Practices (OEP) will allow expertise to be readily available for developing research-driven open teaching and learning derivatives once Open Access research publishing is compulsory for all RCUK and EPSRC funding grants. Privately funded research projects like this one from CUP could also be leading in this area of open access.

Corpora such as the British National Corpus (BNC), the British Academic Written English (BAWE) corpus, and Wikipedia and Google linguistic data used as corpora are some of the many valuable resources that have been developed into language learning and teaching resources that are openly available on the web. In the following sections, I will refer to applied corpus linguistics research and development outputs from leading researchers who have been making their wares freely available if not openly re-purposeable to other developers, as in the example of the FLAX language project’s Open Source Software (OSS). And, hopefully these corpus-based resources are getting easier to access for the non-expert corpus user.

“For the time being” CUP are providing free access to the English Vocabulary Profile website of resources based on the Cambridge English Corpus (formerly known as the Cambridge International Corpus), “the British National Corpus and the Cambridge Learner Corpus, together with other sources, including the Cambridge ESOL vocabulary lists and classroom materials.” Below is a training video resource from CUP available on YouTube, which highlights some of the uses for these freely available resources in language learning, teaching and materials development. This is a very useful step for CUP to be taking with making corpus-based resources and practices more accessible to the mainstream ELT community.

Open practices in applied corpus linguistics

Enter those applied corpus linguistics researchers and developers who have made some if not all of their text analysis tools and Part-Of-Speech-tagged corpora freely accessible via the Web to anyone who is interested in exploring how to use them in their research, teaching or independent language learning. Well-known web-based projects include Tom Cobb’s resource-rich Lextutor site, Mark Davies’ BYU-BNC (Brigham Young University – British National Corpus) concordancer interface and the Corpus of Contemporary American English (COCA) with WordandPhrase (with WordandPhrase training video resources on YouTube) for general English and English for Academic Purposes (EAP), Laurence Anthony’s AntConc concordancing freeware for Do-It-Yourself (DIY) corpus building (with AntConc training video resources on YouTube), and the Sketch Engine by Lexical Computing, which offers some open resources for DDL. The Lextutor and AntConc project developers issue open invitations for input on the design, development and evaluation of existing and proposed project tools and resources by way of social networking sites: the Lextutor Facebook group and the AntConc Google groups discussion list. Responses usually come from a steady number of DDL ‘geeks’, however, namely those who have reached a level of competence and confidence with discussing the tools and resources therein. And, most of those actively participating in these social networking sites are also engaging in corpus-based research.

Data-Driven Learning for the masses?

My own presentation at IATEFL Liverpool was based on my most recent project with the University of Oxford IT Services for providing and promoting OSS interfaces from the FLAX language project for increasing access to the BNC and BAWE corpora, both managed by Oxford. In addition to this, the same OSS developed by FLAX has been simplified with the development of easy-to-use interfaces for enabling language teachers to build their own open language collections for the web. Such collections using OER from Oxford lecture podcasts, which have been licensed as Creative Commons content, have also been demonstrated by the TOETOE International project (Fitzgerald, 2013).

The following two videos from the FLAX language collections show their OSS for using corpus-based resources in ELT that are accessible both in terms of simplicity and in terms of openness. The first training video demonstrates the Web as corpus and how this resource has been effectively mined and linked to the BNC for enhancement of both corpora for uses in DDL. The second training video demonstrates how to build your own Do-It-Yourself corpora using the FLAX OSS and Oxford OER. With open corpus-based resources the reality of DIY corpora is becoming increasingly possible in DDL research and teaching and learning practice (Charles, 2012; Fitzgerald, in press).

So, go ahead, and cut out the middleman in data-driven learning.

FLAX Web Collections (derived from Google linguistic data):

The Web Phrases and Web Collocations collections in FLAX are based on another extensive corpus of English derived from Google linguistic data. In particular, the Web Phrases collection allows you to identify problematic phrasing in writing by fine-tuning words that precede and follow phrases that you would like to use in your writing by drawing on this large database of English from Google. This allows you to substitute any awkward phrasing with naturally occurring phrases from the collection to improve the structure and the fluency of writing.
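The underlying idea of checking which words tend to precede or follow a phrase can be sketched as a simple count over a tokenised text. This is a toy illustration only, not how FLAX or the Google data are actually implemented; the function name `neighbours`, its `side` parameter and the sample sentence are invented for the example.

```python
from collections import Counter

def neighbours(tokens, phrase, side="right"):
    """Count the words found immediately after (or before) a phrase."""
    n = len(phrase)
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            # Position of the word just after, or just before, the phrase.
            j = i + n if side == "right" else i - 1
            if 0 <= j < len(tokens):
                counts[tokens[j]] += 1
    return counts

# Invented sample text standing in for a large database of English.
tokens = ("it plays an important role in learning and an important part "
          "in teaching and an important role in research").split()

print(neighbours(tokens, ["an", "important"]).most_common())
print(neighbours(tokens, ["important", "role"], side="left").most_common())
```

With a corpus of web scale behind it, the same kind of count is what lets a writer see at a glance which continuation of a phrase is the more natural one.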


FLAX Do-It-Yourself Podcast Corpora – Part One:

Learn how to build powerful open language collections through this training video demonstration. Featuring audio and video podcast corpora using the FLAX Language tools and open educational resources (OER) from the OpenSpires project at the University of Oxford and TED Talks.



Anthony, L. (n.d.). Laurence Anthony’s Website: AntConc. Retrieved from

Cobb, T. (n.d.). Compleat Lexical Tutor. Retrieved from

Charles, M. (2012). ‘Proper vocabulary and juicy collocations’: EAP students evaluate do-it-yourself corpus-building. English for Specific Purposes, 31: 93-102.

Davies, M. (1991-present). The Corpus of Contemporary American English (COCA). Retrieved from

Davies, M. & Gardner, D. (n.d.). WordandPhrase. Retrieved from

Fitzgerald, A. (2013). TOETOE International: FLAX Weaving with Oxford Open Educational Resources. Open Educational Resources International Case Study. Commissioned by the Higher Education Academy (HEA), United Kingdom. Retrieved from

Fitzgerald, A. (In Press). Openness in English for Academic Purposes. Open Educational Resources Case Study based at Durham University: Pedagogical development from OER practice. Commissioned by the Higher Education Academy (HEA) and the Joint Information Systems Committee (JISC), United Kingdom.

FLAX. (n.d.). The “Flexible Language Acquisition Project”. Retrieved from

Johns, T. (1991). From printout to handout: grammar and vocabulary teaching in the context of data-driven learning. In: T. Johns & P. King (Eds.), Classroom Concordancing. English Language Research Journal, 4: 27-45.

Johns, T. (2002). ‘Data-driven learning: the perpetual challenge.’ In: B. Kettemann & G. Marko (Eds.), Teaching and Learning by Doing Corpus Analysis. Amsterdam: Rodopi. 107-117.

Hyland, K. (2006). English for Academic Purposes: An Advanced Resource Book. London: Routledge.

McEnery, T. & A. Wilson. (1997). Teaching and language corpora. ReCALL, 9 (1): 5-14.

O’Keeffe, A., McCarthy, M., & Carter R. (2007). From Corpus to Classroom: language use and language teaching. Cambridge: Cambridge University Press.

Oxford Collocations Dictionary for Students of English (2nd Edition) (2009), Oxford University Press.

Tribble, C. (2012). Teaching and Language Corpora Survey. Retrieved from


Leave a Comment

  1. another excellent and rich post Alannah, thanks.

    what i would add is that the middleman also exists because DDL is difficult to use with all but the ablest students. the literature on minimal guided instruction does not bode well for DDL with novice learners e.g. see this collection of papers

    my experience has shown one needs a lot of “pedagogic mediation” to convert corpus info into classroom activity.



  2. Thanks, Mura, for your comment and for reading my posts. Your comment interests me because the design of DDL resources is one of the big accessibility issues we are trying to address and this has informed a lot of our work with interface designs, moving away from the traditional KWIC interface as a means to enable learners and teachers to get closer to the linguistic data as evidence…maybe this can lead to an interest in DDL, a step toward some of the more complex querying you can do with more traditional corpus tools e.g. Sketch Engine, but which require more guided instruction and know-how than what we are doing with FLAX. This is what I will be testing out for my research so it will be good to see how others view the efficacy of and the accessibility of the resources from the FLAX project for the non-expert corpus user who can also be a language learner or teacher without much knowledge of how the target language works.

    Apologies in advance as this is a somewhat lengthy response because the paper you have referred to is a controversial one within educational research – good stuff for discussion though, especially with reference to DDL 🙂

    Interesting that you should bring up the Kirschner, Sweller and Clark paper. I remember well when it came out in 2006, as I was involved in quite a lot of debate about it with my peers back in Canada and even presented a paper at AECT on ‘An Analysis of the Failure of *Evidence* in Educational Research and Practice: In Response to Kirschner, Sweller and Clark’. I would like to hope that the work I’ve done with the FLAX team, evaluating and promoting their design work in developing easier interfaces for DDL that draw on powerful and relevant corpora made more accessible, is helping to demystify DDL somewhat for uses both within and outside the classroom by users who are not necessarily the ablest, and this includes many teachers who have had no exposure to DDL. We know that most language is acquired outside of the classroom, i.e. without guided instruction, so having access to resources that help learners manage, e.g., large text loads is the type of support we have identified as somewhat lacking in EAP/ESP especially, and good (and free) collocations resources that can help with specificity are few and far between also. For example, all the text support tools within the FLAX BAWE and Learning Collocations collections, which link in Wikimedia and live web search resources to provide glossary, word list and part of speech help, are tools to reduce cognitive overload when confronted with difficult and unfamiliar language. Rather than grading or dumbing texts down we are trying to enhance the texts with these help features to make the language and meaning more accessible, as is shown in the BAWE collections, and we would like to see teachers building their own language collections for this purpose also. Most language teachers would have trouble with some of the technical language used in the BAWE so these tools can be of help to learners and teachers alike.
Of course, some initial training to familiarise you or your learners with the features is necessary – minimal guidance with evidence-based resources – and teachers can use the corpus data as evidence to build classroom handouts from or use the interactive tools within FLAX to devise specific tasks – guided evidence-based resources to support instruction.

    Unfortunately, and I’m going back to the KSC paper here, the debate on the appropriateness of evidence-based practice in education is simply narrowed by some to the antagonism between quantitative and qualitative research paradigms. This is all too often extended to the notion of opposing theories on how we learn, whereby think camps that are identified as behavioural, cognitive and constructivist are by definition pitted against one another. Arguably, it would be desirable to understand that different learning goals – whether to change behaviours (behaviourism), to change cognitive processes (cognitivism), to promote the development of mental models through meaning making (constructivism), or adaptation to environments and affordances (situated approaches) – necessitate different learning performance support. Following on from this argument would be the acceptance that different learning objectives in DDL for classroom or autonomous use, or for materials/resources development, require different forms of assessment and evaluation, and most notably different means for measuring the successfulness of any given instruction or resource designed and used to support learning and teaching.


  3. hi again

    the flax database is really pushing fwd the area, and thx to you found out about the web collection which is great for me to try to relate that to my specialized corpus.

    do you have a link to your paper somewhere? i guess that issue about not integrating perspectives in ed research is ever challenging! could you point to any refs regarding evidence on DDL and language learning?

    p.s. any chance i can get some advice regarding a workshop?



    • Great, would be interested in hearing more about the specialized corpus you are building. I see from your blog that you use AntConc a lot.

      Unfortunately that paper for the AECT conference in Anaheim in 2007 was only printed in a paper version of the conference proceedings but there are many papers out there in response to KSC’s paper. Someone who has put together useful DDL bibliographies is Alex Boulton who is based at Nancy in France – anywhere near you? What kind of workshop are you looking for? There are some coming up in Europe this summer for corpora and language teaching:

      Two Summer Courses (July 2013)

      Using Corpora in Language Teaching
      Using Moodle in Language Teaching

      “Both are designed for teachers who would like to use these in their teaching, or who would like to expand their practical knowledge. They are not courses for specialists. All current information is available at the website”

      There’s also Eurocall’s Corpus SIG

      “If you are not already aware of the new site, created in Moodle, that the four committee members (Alex Boulton, Pascual Pérez-Paredes, Johannes Widmann, James Thomas) have been building over the last two years, please feel free to stop by. Note in particular, the CorpusCALL Resources droplist where most of our hard work has gone. The database of resources is growing nicely and you are more than welcome to add your own for the benefit of the whole community.”


  4. A fascinating post. I am particularly interested in your comments about design of interface to make the corpora accessible to learners. It’s what Language Garden tries to do.


  5. Thanks, David. I’ve taken a quick look at Language Garden and like the work you’ve done with the word tree interface and have noted that you offer some free resources as well 🙂 Yes, it is a fascinating area to look at and try and think beyond the box with interface designs for corpus-based resources – it’s also an under-researched area, so lots to keep going forward with.

