
“…the attempt to cut out the middleman as far as possible and to give the learner direct access to the data” (Johns, 1991, p.30)
Importance is placed on empirical data when taking a corpus-informed and data-driven approach to language learning and teaching. Moving away from subjective conclusions about language based on an individual’s internalized cognitive perception of language and the influence of generic language education resources, empirical data enable language teachers and learners to reach objective conclusions about specific language usage based on corpus analyses. Tim Johns coined the term Data-Driven Learning (DDL) in 1991 with reference to the use of corpus data and the application of corpus-based practices in language learning and teaching (Johns, 1991). The practice of DDL in language education was appropriated from computer science where language is treated as empirical data and where “every student is Sherlock Holmes”, investigating the uses of language to assist with their acquisition of the target language (Johns, 2002:108).
A review of the literature indicates that the practice of using corpora in language teaching and learning pre-dates the term DDL, with work carried out by Peter Roe at Aston University in 1969 (McEnery & Wilson, 1997, p.12). Johns is also credited with coining the term English for Academic Purposes (Hyland, 2006). Johns’ oft-quoted words about cutting out the middleman tell us more about his DDL vision for language learning: teacher intuitions about language would be put aside in favor of powerful text analysis tools that give learners direct access to some of the most extensive language corpora available, the same corpora that lexicographers draw on when making dictionaries, so that learners can discover for themselves how the target language is used across a variety of authentic communication contexts. As with many brilliant visions for impactful educational change, however, his also appears to have come before its time.
This post will argue that the original middleman in Johns’ DDL metaphor took on new forms beyond that of teachers getting in the way of learners having direct access to language as data. It will claim that the applied corpus linguistics research and development community introduced new and additional barriers to the widespread adoption of DDL in mainstream language education. Albeit well intentioned, and no doubt shaped by restrictions in research and development practices along the way, new middlemen were paradoxically perpetuated by the proponents of DDL, making theirs an exclusive rather than a popular sport with language learners and teachers (Tribble, 2012). And with each new wave of research and development in applied corpus linguistics, new and puzzling restrictions confronted the language teaching and learning community.
The middleman in DDL has presented himself as a sophisticated corpus authority in the form of research and development outputs, including text analysis software designed by, and for, the expert corpus user, with complex options for search refinement that have befuddled the non-expert corpus user, namely language teachers and learners. Replicating these same research methods to obtain the same or similar results for use in language teaching and learning has often depended on securing access to the exact same software and the know-how for manipulating and querying linguistic data successfully.
Which language are you speaking?
He has been known to speak in programming languages, with his interfaces often requiring specialist trainers to communicate his most simple functions. Even his most widely known KWIC (Key Word In Context) interface for linguistic data presentation, with strings of search terms embedded in truncated language context snippets, remains foreign-looking to the mostly uninitiated in language teaching and learning. In many cases, he has not come cheap either, and requirements for costly subscriptions to and upgrades of his proprietary software have been the norm, especially in the earlier days.
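For readers who have never seen one, a KWIC display is simpler than it looks. The following is a minimal sketch in Python, not the implementation used by any of the tools discussed in this post; the function name `kwic`, the `width` parameter and the sample sentences are all made up for illustration. It finds every occurrence of a search word and prints it with a fixed number of characters of context on either side, so that the search word lines up in a column.

```python
import re

def kwic(text, keyword, width=30):
    """Return one line per hit: left context, [keyword], right context.

    Hypothetical helper for illustration; `width` is the number of
    characters of context kept on each side of the search word.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        # Truncate the context to `width` characters on each side.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context and left-align the right context
        # so that the search word sits in the same column on every line.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines

# Invented sample text, standing in for a real corpus.
sample = ("Learners make progress when they make use of real examples. "
          "Teachers make time for corpus work.")

for line in kwic(sample, "make"):
    print(line)
```

Reading down the aligned column is what lets a learner spot patterns such as make progress, make use of and make time for at a glance.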
In particular, with reference to English Language Teaching (ELT), he has criticized many widely used ELT course book publications and their language offerings for ignoring his research findings based on evidence for how the English language is actually used across different contexts of use. In response, a few ELT course book publishers have clamored around him to help him get his words out for a price, but in so doing have rendered his corpus analyses invisible, in turn creating even more of a dependency on course books rather than stimulating autonomy among language teachers and learners in the use of corpora and text analysis tools for DDL. And because publishers were primarily confining him to the course book and sometimes CD-ROM format, there were only so many language examples from the target corpora that could possibly fit between the covers of a book, and only the most frequent language items made it onto the compact disc.
The Oxford Collocations Dictionary for Students of English (2nd edition, 2009, Oxford University Press), based on the British National Corpus (BNC), is one example where high-frequency collocations for very basic words like any and new predominate and where licensing restrictions permit only one computer installation per CD-ROM. Further restrictions compound the openness issue with the use of closed corpora in leading corpus-derived ELT books such as the Cambridge University Press (CUP) publication From Corpus to Classroom (O’Keeffe, McCarthy & Carter, 2007), which might have been more aptly entitled From Corpus to Book, as it draws heavily on the closed Cambridge and Nottingham Corpus of Discourse in English (CANCODE) from Cambridge University Press and Nottingham University and recommends the use of the proprietary concordancing programs WordSmith Tools and MonoConc Pro, thereby making replication of its analyses inaccessible to its readers.
Mainstream language teacher training bodies continue to sidestep the DDL middleman in the development of their core training curricula (for example, the Cambridge ESOL exams) due to the problems he poses with accessibility in terms of cost and complexity. Instead, English language teacher training remains steadily focused on how to select and exploit corpus-derived dictionaries, with reference to training learners in how to identify, for example, definitions, derivatives, parts of speech, frequency, collocations and sample sentences. In the same way that corpus-derived course books do not render corpus analyses transparent to their users, training in dictionary use does not bring teachers and their learners any closer to the corpora they are derived from.
Cambridge English Corpus
Michael McCarthy presented ‘Corpora and the advanced level: problems and prospects’ at IATEFL Liverpool 2013. One of the key take-away messages from his talk was that learners of more advanced English receive little return on investment once the highest-frequency items of English vocabulary have been acquired (he referred to the top 2000 words from the first wordlist of the British National Corpus, which make up about 80% of standard English use). With each subsequent wordlist of 2000 words, the percentage of frequency in usage drops considerably, so the time and money you might end up spending on yet more English language classes may be neither affordable nor feasible. This has particular implications for learning English for Specific Purposes (ESP), including English for Academic Purposes (EAP), which many would argue is always concerned with developing specific academic English language knowledge and usage within specific academic discourse communities.
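The idea of diminishing returns from each further wordlist can be checked on any text you have to hand. Below is a minimal Python sketch, assuming a very simple tokenizer and an invented one-sentence sample; the function name `coverage` is hypothetical, and the numbers it produces on a toy text say nothing about the BNC figures reported in the talk. It computes the share of all running words accounted for by the N most frequent word forms.

```python
import re
from collections import Counter

def coverage(text, top_n):
    """Share of running words covered by the top_n most frequent word forms.

    Hypothetical helper for illustration; the tokenizer is deliberately
    naive (lower-cased runs of letters and apostrophes).
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / len(tokens)

# Invented sample text, standing in for a real corpus.
sample = "the cat sat on the mat and the dog sat on the rug"

for n in (1, 3, 8):
    print(f"top {n} word forms cover {coverage(sample, n):.0%} of the text")
```

Run against a large corpus with `top_n` set to 2000, 4000, 6000 and so on, the same calculation shows how quickly the gain from each additional wordlist shrinks.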
When I caught Michael McCarthy on the way out of the presentation theatre, he kindly agreed to walk and talk while rushing to catch his train out of Liverpool. Would the Cambridge English Corpus be made available anytime soon for non-commercial educational research and materials development purposes, I asked? I hastened to add the possibilities of, and the real-world need for, promoting corpus-based resources and practices in open and distance online education as well as in traditional classroom-based language education. He agreed that the technology had become a lot better for finally realising DDL within mainstream language teaching and learning and within materials development. Taking concordance line printouts into ELT classrooms had never really taken off in his estimation, and I would have to agree with him on that point. He indicated, however, that the corpus would be unlikely to become openly available in the foreseeable future, due to the large amount of private investment in its development, with access restricted to the participating stakeholders on the project.
But what would the real risk be in opening up this corpus to further educational research and development for non-commercial purposes, with derivative resources made freely available online? Wouldn’t this give the corpus resource added sustainability, with new lives and further opportunities for exploitation that could advance our shared understanding of how English works across different contexts, using current and high-quality examples of language in context? More importantly, wouldn’t this give more software developers the chance to build more interfaces using the latest technology, and give more ELT materials developers, including language teachers, the chance to show different derivative resource possibilities for effectively using the corpus in language teaching and learning?
A stipulation of non-commercial educational use only could be applied in all of the above resource development scenarios. Indeed, these resources could all be linked back to the Cambridge English Corpus project website as evidence of the wider social and educational impact resulting from the initial investment. This is what will be happening with most of the publicly funded research projects in the UK following recommendations from the Finch report, which come into effect in April 2014. It follows that Open Educational Resources (OER) and Open Educational Practices (OEP) will allow for expertise to be readily available when Open Access research publishing is compulsory for all RCUK and EPSRC funding grants for the development of research-driven open teaching and learning derivatives. Privately funded research projects like this one from CUP could also be leading in this area of open access.
Corpora such as the British National Corpus (BNC), the British Academic Written English (BAWE) corpus, Wikipedia and Google linguistic data as a corpus are some of the many valuable resources that have been developed into language learning and teaching resources that are openly available on the web. In the following sections, I will refer to applied corpus linguistics research and development outputs from leading researchers who have been making their wares freely available, if not openly re-purposeable by other developers, as in the example of the FLAX language project’s Open Source Software (OSS). And, hopefully, these corpus-based resources are getting easier to access for the non-expert corpus user.
“For the time being” CUP are providing free access to the English Vocabulary Profile website of resources based on the Cambridge English Corpus (formerly known as the Cambridge International Corpus), “the British National Corpus and the Cambridge Learner Corpus, together with other sources, including the Cambridge ESOL vocabulary lists and classroom materials.” A training video resource from CUP, available on YouTube, highlights some of the uses for these freely available resources in language learning, teaching and materials development. This is a very useful step for CUP to be taking in making corpus-based resources and practices more accessible to the mainstream ELT community.
Open practices in applied corpus linguistics
Enter those applied corpus linguistics researchers and developers who have made some if not all of their text analysis tools and Part-Of-Speech-tagged corpora freely accessible via the Web to anyone who is interested in exploring how to use them in their research, teaching or independent language learning. Well-known web-based projects include Tom Cobb’s resource-rich Lextutor site; Mark Davies’ BYU-BNC (Brigham Young University – British National Corpus) concordancer interface and the Corpus of Contemporary American English (COCA) with WordandPhrase (with WordandPhrase training video resources on YouTube) for general English and English for Academic Purposes (EAP); Laurence Anthony’s AntConc concordancing freeware for Do-It-Yourself (DIY) corpus building (with AntConc training video resources on YouTube); and the Sketch Engine by Lexical Computing, which offers some open resources for DDL. The Lextutor and AntConc project developers issue open invitations for input on the design, development and evaluation of existing and proposed project tools and resources by way of social networking sites: the Lextutor Facebook group and the AntConc Google Groups discussion list. Responses usually come from a steady number of DDL ‘geeks’, however, namely those who have reached a level of competence and confidence in discussing the tools and resources therein. And most of those actively participating in these social networking sites are also engaging in corpus-based research.
Data-Driven Learning for the masses?
My own presentation at IATEFL Liverpool was based on my most recent project with the University of Oxford IT Services for providing and promoting OSS interfaces from the FLAX language project to increase access to the BNC and BAWE corpora, both managed by Oxford. In addition, the same OSS developed by FLAX has been simplified with the development of easy-to-use interfaces that enable language teachers to build their own open language collections for the web. Such collections, using OER from Oxford lecture podcasts that have been licensed as Creative Commons content, have also been demonstrated by the TOETOE International project (Fitzgerald, 2013).
The following two videos from the FLAX language collections show their OSS for using corpus-based resources in ELT that are accessible both in terms of simplicity and in terms of openness. The first training video demonstrates the Web as corpus and how this resource has been effectively mined and linked to the BNC for enhancement of both corpora for uses in DDL. The second training video demonstrates how to build your own Do-It-Yourself corpora using the FLAX OSS and Oxford OER. With open corpus-based resources the reality of DIY corpora is becoming increasingly possible in DDL research and teaching and learning practice (Charles, 2012; Fitzgerald, in press).
So, go ahead, and cut out the middleman in data-driven learning.
FLAX Web Collections (derived from Google linguistic data):
The Web Phrases and Web Collocations collections in FLAX are based on another extensive corpus of English derived from Google linguistic data. In particular, the Web Phrases collection allows you to identify problematic phrasing in writing by fine-tuning words that precede and follow phrases that you would like to use in your writing by drawing on this large database of English from Google. This allows you to substitute any awkward phrasing with naturally occurring phrases from the collection to improve the structure and the fluency of writing.
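The underlying idea of checking which words precede and follow a phrase can be tried on a small text of your own. The following Python sketch is an illustration only and is not how FLAX or the Google data are implemented; the function name `neighbours` and the sample sentence are invented. It counts the words that occur immediately before and immediately after a given phrase.

```python
from collections import Counter

def neighbours(tokens, phrase):
    """Count the words found directly before and after each hit of `phrase`.

    Hypothetical helper for illustration; `tokens` and `phrase` are lists
    of lower-cased words.
    """
    n = len(phrase)
    before, after = Counter(), Counter()
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            if i > 0:
                before[tokens[i - 1]] += 1
            if i + n < len(tokens):
                after[tokens[i + n]] += 1
    return before, after

# Invented sample text, standing in for a large web-derived corpus.
text = ("we play a key role in research and they play a major role in "
        "teaching while others play a key role in policy")
before, after = neighbours(text.split(), ["role", "in"])

print("words before:", before.most_common())
print("words after:", after.most_common())
```

With a large enough corpus behind it, a count like this suggests which wordings around a phrase are common enough to sound natural and which are rare enough to be worth rephrasing.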
FLAX Do-It-Yourself Podcast Corpora – Part One:
Learn how to build powerful open language collections through this training video demonstration, which features audio and video podcast corpora built with the FLAX Language tools and open educational resources (OER) from the OpenSpires project at the University of Oxford and TED Talks.
References
Anthony, L. (n.d.). Laurence Anthony’s Website: AntConc. Retrieved from http://www.antlab.sci.waseda.ac.jp/software.html
Cobb, T. (n.d.). Compleat Lexical Tutor. Retrieved from http://www.lextutor.ca/
Charles, M. (2012). ‘Proper vocabulary and juicy collocations’: EAP students evaluate do-it-yourself corpus-building. English for Specific Purposes, 31: 93-102.
Davies, M. (1991-present). The Corpus of Contemporary American English (COCA). Retrieved from http://corpus.byu.edu/coca/
Davies, M. & Gardner, D. (n.d.). WordandPhrase. Retrieved from http://www.wordandphrase.info
Fitzgerald, A. (2013). TOETOE International: FLAX Weaving with Oxford Open Educational Resources. Open Educational Resources International Case Study. Commissioned by the Higher Education Academy (HEA), United Kingdom. Retrieved from http://www.heacademy.ac.uk/projects/detail/oer/OER_int_006_Ox%282%29
Fitzgerald, A. (In Press). Openness in English for Academic Purposes. Open Educational Resources Case Study based at Durham University: Pedagogical development from OER practice. Commissioned by the Higher Education Academy (HEA) and the Joint Information Systems Committee (JISC), United Kingdom.
FLAX. (n.d.). The “Flexible Language Acquisition Project”. Retrieved from http://flax.nzdl.org/
Johns, T. (1991). From printout to handout: grammar and vocabulary teaching in the context of data-driven learning. In: T. Johns & P. King (Eds.), Classroom Concordancing. English Language Research Journal, 4: 27-45.
Johns, T. (2002). ‘Data-driven learning: the perpetual challenge.’ In: B. Kettemann & G. Marko (Eds.), Teaching and Learning by Doing Corpus Analysis. Amsterdam: Rodopi. 107-117.
Hyland, K. (2006). English for Academic Purposes: An Advanced Resource Book. London: Routledge.
McEnery, T. & Wilson, A. (1997). Teaching and language corpora. ReCALL, 9 (1): 5-14.
O’Keeffe, A., McCarthy, M., & Carter R. (2007). From Corpus to Classroom: language use and language teaching. Cambridge: Cambridge University Press.
Oxford Collocations Dictionary for Students of English (2nd Edition). (2009). Oxford: Oxford University Press.
Tribble, C. (2012). Teaching and Language Corpora Survey. Retrieved from http://www.surveyconsole.com/console/TakeSurvey?id=742964