The doge meme teaches us so much about language learning and how challenging it can be to accurately combine words and patterns when using another language. The FLAX language system teaches us so much about how we can avoid using dodgy language by employing powerful open-source language analysis tools and authentic language resources.

The FLAX (Flexible Language Acquisition) project has won the LinkedUp Vici Competition for tools and demos that use open or linked data for educational purposes. This post is the one I wrote to accompany our project submission to the LinkedUp challenge.

FLAX is an open-source software system designed to automate the production and delivery of interactive digital language collections. Exercise material comes from digital libraries (language corpora, web data, open access publications, open educational resources), providing a virtually endless supply of authentic language learning material in context. With its simple interfaces, FLAX enables non-expert users — language teachers, language learners, subject specialists, instructional design and e-learning support teams — to build their own language collections.

The FLAX software can be freely downloaded to build language collections with any text-based content and supporting audio-visual material, for both online and classroom use. FLAX uses the Greenstone suite of open-source multilingual software for building and distributing digital library collections, which can be published on the Internet or on CD-ROM. Issued under the terms of the GNU General Public License, Greenstone is produced by the New Zealand Digital Library Project at the University of Waikato, and developed and distributed in cooperation with UNESCO and the Human Info NGO.


At FLAX we understand that content and data vary in terms of licensing restrictions, depending on the publishing strategies institutions adopt for the usage of their content and data. FLAX has therefore been designed to offer a flexible open-source suite of linguistic support options for enhancing such content and data across both open and closed platforms.

Featuring the Latest in Artificial Intelligence & Natural Language Processing Software Designs

Within the FLAX bag of tricks we have the open-source Wikipedia Miner Toolkit, which links in related words, topics and definitions from Wikipedia and Wiktionary, as can be seen in the Learning Collocations collection.

Wikipedia Mining Tool in the FLAX Learning Collocations Collection
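The general idea behind ranking collocations, as collections like Learning Collocations do, can be sketched with pointwise mutual information (PMI). This is a minimal illustration and not FLAX's actual implementation; the toy sentence and the `min_count` threshold are assumptions for the example.

```python
import math
from collections import Counter

def pmi_collocations(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).
    Higher PMI means the pair co-occurs more often than chance predicts."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # ignore pairs too rare to score reliably
        # PMI = log( P(w1, w2) / (P(w1) * P(w2)) )
        scores[(w1, w2)] = math.log(
            (count / n) / ((unigrams[w1] / n) * (unigrams[w2] / n))
        )
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

text = ("the common law tradition shapes the common law system "
        "while statute law complements the common law").split()
for pair, score in pmi_collocations(text):
    print(pair, round(score, 2))
```

Over a real corpus the frequency threshold and a part-of-speech filter matter far more than they do in this toy example, since PMI over-rewards rare pairs.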

Featuring Open Data

Available on the FLAX website are completed collections, along with collections that registered users are still developing. Current research and development on the FLAX Law Collections is based entirely on open resources selected by language teachers and legal English researchers, as shown in the table below. These collections demonstrate how users can build collections in FLAX according to their own interests and needs.

Law Collections in FLAX


Type of Resource | Number and Source of Collection Resources
Open Access Law research articles | 40 articles (DOAJ – Directory of Open Access Journals, with Creative Commons licences allowing the development of derivatives)
MOOC lecture transcripts and videos (streamed via YouTube and Vimeo) | 4 MOOC collections: English Common Law (University of London with Coursera), Age of Globalization (Texas at Austin with edX), Copyright Law (Harvard with edX), Environmental Politics and Law (OpenYale)
Podcast audio files and transcripts (OpenSpires) | 15 lectures (Oxford Law Faculty, Centre for Socio-Legal Studies and Department of Continuing Education)
PhD Law thesis writing | 50-70 EThOS theses (sections: abstracts, introductions, conclusions) at the British Library (Open Access but not licensed as Creative Commons; permission for reuse granted by participating Higher Education Institutions)
British Law Reports Corpus (BLaRC) | 8.8-million-word corpus derived from free legal sources at the British and Irish Legal Information Institute (BAILII) aggregation website
FLAX Wikipedia English | A reformatted version of the English Wikipedia, linked in to provide key terms and concepts as a powerful gloss resource for the Law Collections
FLAX Learning Collocations | Lexico-grammatical phrases linked in from the 100-million-word British National Corpus (BNC), the British Academic Written English corpus (BAWE) of 2,500 pieces of assessed university student writing from across the disciplines, and the reformatted English Wikipedia corpus
FLAX Web Phrases | A reformatted Google n-gram corpus (English version) of 380 million five-word sequences drawn from a vocabulary of 145,000 words
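The unit behind an n-gram corpus of five-word sequences is simple to illustrate. The sketch below is a toy extraction over a single invented sentence, not the pipeline used to build the Google n-gram corpus itself.

```python
from collections import Counter

def five_grams(tokens):
    """Yield every contiguous five-word sequence in a token list --
    the unit of a 5-gram corpus."""
    return zip(tokens, tokens[1:], tokens[2:], tokens[3:], tokens[4:])

sentence = "the court held that the defendant was not liable for the loss".split()
counts = Counter(five_grams(sentence))
for gram, freq in counts.items():
    print(" ".join(gram), freq)
```

At corpus scale the same counting is done across billions of words, and only sequences above a frequency cut-off are kept, which is why a vocabulary of 145,000 words can still yield hundreds of millions of distinct five-word sequences.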

FLAX Training Videos

Featuring Game-based Activities

FLAX offers a wide range of game-based activities that can be applied to any language collection.

FLAX Apps for Android

We also have a suite of free game-based FLAX apps for Android devices, so you can now interact with these activities while you’re learning on the move.


FLAX Research & Development

To date, we have distributed the English Common Law and the Age of Globalization MOOC collections in FLAX to thousands of registered learners in over 100 countries – wow!

A collaborative investigation between FLAX and the Open Educational Resources Research Hub (OERRH) is underway, using a cluster of revised OER research hypotheses to evaluate the impact of developing and using open language collections in FLAX with informal MOOC learners as well as formal English language and translation students.

Radio Ga Ga by Queen via Deviant Art

This is the third satellite post from the mothership post, Radio Ga Ga: corpus-based resources, you’ve yet to have your finest hour. I have also made the complete hyperlinked post (in five sections) available as a .pdf on Slideshare.

Radio 3

I confess that I spend most of my time listening to BBC Radio 3. The parallel I will draw here is that I was never formally educated in classical music, in the same way that I have never worked toward formal qualifications in corpus linguistics during any of my studies. Because I work broadly across language resources development and enhancing teaching and learning practices through technology, however, it was only a matter of time before I started exploring and toying with corpus-based resources. I met Dr. Shaoqun Wu of the FLAX project at a conference in Villach, Austria in 2006, and by 2007 I had begun to delve into the world of open-source digital library collections development with the University of Waikato’s Greenstone software, developed and distributed in cooperation with UNESCO, with the much broader vision of reaching under-resourced communities around the world through these open technologies and collections.

Bridging Teaching and Language Corpora (TaLC)

Let’s fast-forward to the 2012 Teaching and Language Corpora (TaLC) conference in Warsaw, Poland. Although I had participated in corpus linguistics conferences before, this was my first time attending the biennial TaLC conference. TaLCers are mainly researchers working in corpus linguistics and data-driven learning (DDL), and this conference was themed around bridging the gap between DDL research and the use of corpus-based resources and practices in language teaching and learning.

One of the keynote addresses, James Thomas’s Let’s Marry, called for greater connectedness between those working in DDL research and those working in pedagogy and language acquisition. At one point he asked for a show of hands from those who knew of big names in the ELT world, including Scrivener, Harmer and Thornbury. Only a few hands went up. He also made the point that these same ELT names don’t make their way into citations for research on DDL. Interestingly, I was tweeting points made in the sessions I attended to relevant EAP and ELT / EFL / ESL communities online, but without a TaLC conference hashtag. It would have been great to have the other TaLCers tweeting along with me, raising questions and noting key take-away points, both to engage interested parties who could not attend in person and to catalogue a Twitter feed for TaLC that anyone could search via the Internet at a later point in time. It would also have been great to record the keynotes and presentations as webcasts for later viewing. When approached about these issues afterwards, the conference organisers did express interest in ways of amplifying their events by building such mechanisms for openness into their next conference.

Prising open corpus linguistics research in Data Driven Learning (DDL)

Problems with accessing and successfully implementing corpus-based resources in language teaching and learning scenarios have been numerous. As I discussed in section 2 of this blog, many of the concordancing tools referred to in the research have been subscription-based proprietary resources (for example, WordSmith Tools), most of which have been designed with at least the intermediate-level concordance user in mind. These tools can easily overwhelm language teaching practitioners and their students with the complex processing of raw corpus data, presented via complex interfaces with too many options for refinement. Mike Scott, the main developer of WordSmith Tools, has also released a free version of his concordancing suite with less functionality, and this would suffice for many language teaching and learning purposes. He attended my presentation on opening up research corpora with open-source text analysis tools and OER and was very open-minded, as were the other TaLCers I met at the conference, regarding new and open approaches for engaging teachers and learners with corpus-based resources.

There are many freely available annotated bibliographies compiled by corpus linguists that you can access on the web for guidance on published corpus linguistics research. Many researchers working in this area are also putting pre-print versions of their publications on the web for greater access and dissemination of their work; see Alex Boulton’s online presence for an example of this. As hinted at in part 2 of this blog, however, much of this published research comes in closed formats: the articles, chapters and few available teaching resources are often restricted to subscription-only journals or pricey academic monographs. For example, Berglund-Prytz’s ‘Text Analysis by Computer: Using Free Online Resources to Explore Academic Writing’ (2009) is a great resource on where to get started with OER for EAP, but ironically the journal it is published in, Writing and Pedagogy, is not free. Lancaster University is home to the openly available BNCweb concordancing software, which you need only register for to install a free standard copy on your personal computer. A valuable companion resource on BNCweb was published by Peter Lang in 2008, but once again this is not openly accessible to interested readers who cannot afford to buy the book. The great news is that the main TaLC10 organiser, Agnieszka Lenko, has spearheaded openness at this most recent event by trying to secure an Open Access publication for the TaLC10 proceedings papers with Versita publishers in London.

DIY corpora with AntConc in English for Specific Academic Purposes (ESAP)

At TaLC10 I discovered a lot of overlap with Maggie Charles’ work on building DIY corpora with EAP postgraduate students using Laurence Anthony’s AntConc freeware. We had also included workshops on AntConc for students in our OER for EAP cascade at Durham, so it was great to see another EAP practitioner working in this way who had gathered data from her ongoing work for presentation and discussion at the conference. Many of her students at the University of Oxford Language Centre are working toward dissertation or thesis writing, which raises interesting questions around enabling EAP students to become proficient in developing self-study resources for English for Specific Academic Purposes (ESAP). Her recent paper in the English for Specific Purposes journal (2012) points to AntConc’s flexibility for student use: as freeware, it can be installed on any personal computer or USB flash drive for portable use. Laurence Anthony’s website also offers many great video training resources on how to use AntConc. Charles also notes the potential AntConc offers students pursuing interdisciplinary studies to build their own select corpora. That said, a few more obscure subject disciplines, Egyptology for example (ibid.), had not yet embraced digital research cultures and were still publishing predominantly in print-based volumes or image-based PDF files, which kept DIY corpora beyond the reach of those students.
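Concordancers such as AntConc are built around the Key Word In Context (KWIC) display: every hit for a search word, aligned with a window of surrounding words. As a rough illustration of the idea (not AntConc's implementation, and with an invented mini-corpus), a minimal KWIC function might look like this:

```python
def kwic(tokens, keyword, width=4):
    """Key Word In Context: return each occurrence of `keyword` with
    up to `width` words of context on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # right-align the left context so keywords line up in a column
            lines.append(f"{left:>30} [{tok}] {right}")
    return lines

corpus = ("The thesis argues that the evidence supports the claim and that "
          "further evidence is required before the claim can be accepted").split()
for line in kwic(corpus, "evidence"):
    print(line)
```

Aligning the keyword in a fixed column is what lets students scan dozens of hits at once and notice the patterns to its left and right, which is the core of the DIY-corpus approach described above.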

Beyond books and podcasts through linking and crowd-sourcing

While presenting on the power of linked resources within the FLAX collections and pushing these outward to wider stakeholder communities through TOETOE, I came across another rapid-innovation JISC-funded OER project at the Beyond Books conference at Oxford. The SPINDLE project, also based at the Learning Technologies Group Oxford, has been exploring linguistic uses for Oxford’s OpenSpires podcasts, with work based on open-source automatic transcription tools. Automatic transcription often comes with a high rate of inaccuracy, so SPINDLE has been looking at ways of developing crowd-sourcing web interfaces that would enable English language learners to listen to the podcasts and correct the automatic transcription errors as a language learning task.

Automatic keyword generation was also carried out in the SPINDLE project on OpenSpires podcasts, yielding far more accurate results. These keyword lists, which can be assigned as metadata tags in digital repositories and channels like iTunesU, offer further resource enhancement by making the podcasts more discoverable. Automatically generated keyword lists can also be used for pedagogical purposes, for example in the pre-teaching of vocabulary. The TED500 corpus by Guy Aston, which I also came across at TaLC10, is based on the TED talks (ideas worth spreading), which have also been released under Creative Commons licences and transcribed through crowd-sourcing.
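Keyword generation of this kind typically compares word frequencies in a target text against a reference corpus, so that words unusually frequent in the target rise to the top. The sketch below uses a simple relative-frequency ratio rather than the statistics real tools use (such as log-likelihood), and the two toy "corpora" are invented for illustration:

```python
from collections import Counter

def keywords(target_tokens, reference_tokens, top=5):
    """Rank words by how much more frequent they are in the target text
    than in a reference corpus: a simple notion of 'keyness'."""
    t = Counter(target_tokens)
    r = Counter(reference_tokens)
    nt, nr = len(target_tokens), len(reference_tokens)

    def score(word):
        # add-one smoothing so words absent from the reference still score
        return (t[word] / nt) / ((r[word] + 1) / (nr + 1))

    return sorted(t, key=score, reverse=True)[:top]

lecture = "copyright law protects original works and copyright lasts decades".split()
general = "the law of the land and the works of the people".split()
print(keywords(lecture, general, top=3))
```

Because common function words occur at similar rates in both corpora, they score near 1 and drop away, leaving topic words like the lecture's subject terms as the keywords that would become metadata tags or pre-teaching vocabulary.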

The potential for open linguistic content to be reused, re-purposed and redistributed by third parties globally, provided that they are used in non-commercial ways and are attributed to their creators, offers new and exciting opportunities for corpus developers as well as educational practitioners interested in OER for language learning and teaching.


References

Anthony, L. (n.d.). Laurence Anthony’s Website: AntConc.

Berglund-Prytz, Y. (2009). Text Analysis by Computer: Using Free Online Resources to Explore Academic Writing. Writing and Pedagogy, 1(2): 279-302.

British National Corpus, version 3 (BNC XML Edition). 2007. Distributed by Oxford University Computing Services on behalf of the BNC Consortium.

Charles, M. (2012). ‘Proper vocabulary and juicy collocations’: EAP students evaluate do-it-yourself corpus-building. English for Specific Purposes, 31: 93-102.

Lexical Analysis Software & Oxford University Press (1996-2012). WordSmith Tools.

Hoffmann, S., Evert, S., Smith, N., Lee, D. & Berglund Prytz, Y. (2008). Corpus Linguistics with BNCweb – a Practical Guide. Frankfurt am Main: Peter Lang.