Discovering Babel – final outcomes

This is a summary of some of the key outcomes of the Discovering Babel project, with links to where you can find out more.

Next steps

If you are looking for electronic literary and linguistic resources, please visit the Oxford Text Archive (OTA) and the CLARIN Virtual Language Observatory (VLO). The OTA will shortly relaunch with a new look and feel, and many new resources. The VLO is under continuous development and is constantly improving.

Those of you creating and sharing language resources, please join the CLARIN-UK mailing list. This list is a forum for creators and users of linguistic resources and tools to discuss how we can go forward to develop better facilities and shared services, and to gather user requirements.

Evidence of reuse

The metadata that has been made available as part of the Discovering Babel project is being harvested by the CLARIN Virtual Language Observatory, and can be viewed on their portal. At the moment, we still have some performance issues with delivering the files via OAI-PMH, so there may only be a few records listed there, but we have identified the problem and will be fixing it in the next few days!

The work in Discovering Babel has contributed to an enhanced Oxford Text Archive, with more reliable and more easily discovered catalogue records, and with open access texts at persistent locations. This is designed to allow others to build services on top of our data, in a distributed environment. It has already helped to make possible the JISC-funded Great Writers project, which will, among other things, link to source texts in various formats, including EPUB, in the OTA.

The OTA is now also working together with the creators of Voyant at the University of Alberta, who have under development exactly the sort of tools that we imagined would bring our texts alive. Visit http://voyeurtools.org/ and paste in the following URI to get a flavour of what will be possible:

http://www.ota.ox.ac.uk/text/3253.xml

You can see more about this text at http://www.ota.ox.ac.uk/desc/3253. At the beginning of 2011, texts from the OTA were only available for download on request. Now, thanks in large part to Discovering Babel, we are already seeing on our desktops the emergence of seamless access to distributed texts with remote tools in a service-oriented architecture.

Several further collaborations will be underpinned in part by the work done in Discovering Babel: hosting language resources in the Cloud for UK researchers with the National Grid Service in the UK, developing a cross-repository search service for CLARIN, and shared services in Project Bamboo.

Skills needed for the project

The basic technical skills needed were for processing XML (e.g. XSLT 1.0 and 2.0), plus installation of modules in an Apache server, including the Shibboleth access and identity management software. Various Perl scripts were also deployed. Nobody in the team had previously done these things in the particular circumstances in which we were working. For example, we had to read about and learn the specifications for the Open Archives Initiative Protocol for Metadata Harvesting, and the element set for describing language resources from the Open Language Archives Community, as well as the Shibboleth software. We were able to call on expertise in the Oxford University Computing Services for the fundamental technical areas and administrative procedures, and on experts in the CLARIN network across Europe for guidance on implementation in the specific scenarios for sharing language resources. Perhaps more than technical skills, knowledge of the work that was going on in our institution, nationally, and around Europe in the relevant areas was key to the success of the project.

Most significant lessons learned

  • don’t build a digital silo: engage with infrastructure initiatives, such as CLARIN, and find out about recommendations for good practice in connecting resources, such as the Resource Discovery Task Force, and avoid building an online resource which is difficult to find and unconnected to other data and tools;
  • at the technical level, be flexible. This work touched on fast-changing fields, and we needed to be prepared to learn about new things, and to change the technological solutions which we deployed. This also meant planning for future change in order to make services sustainable;
  • keep it simple: our successes were not the result of great leaps forward, or building complex and flashy front-ends and tools. Instead, we applied good practice in a systematic way in order to provide reliable services to underpin and fit into a shared services infrastructure. So simply providing crosswalks to Dublin Core from our metadata, and establishing an OAI-PMH service, opened many doors. Putting the resource files at accessible URIs on the web allows new types of service to be developed, with much easier access and more powerful functionality.
Posted in Uncategorized | Leave a comment

CLARIN infrastructure notes – on the record

In a recent informal meeting involving various members of the CLARIN and other infrastructure initiatives, we had an open, frank and “off the record” discussion about successes and failures so far, and plans for the future. In preparation for the meeting, and to get the discussions going, we were asked to think of five points in response to each of three questions. I’m happy to go “on the record” with mine here!

What were your original impulses and dreams [when CLARIN planning started around 2006]?

1. To build an Arts and Humanities Data Service for Europe, on the model of the AHDS in the UK, to support digital work in the literary and linguistic subject areas, and to link with similar initiatives then emerging, e.g. at the CNRS in France.

2. To promote and integrate Central and East European researchers, resources and languages, continuing the work of the TELRI project in the previous period.

3. To build new European networks, built on transparency, openness and a real desire to engage with, support and improve research, to replace failed European initiatives which were sometimes built on careerism, cronyism and corruption.

4. To move the focus of language resource & tool creators (especially computational linguists) towards the requirements of Humanities researchers, making it easier for users with little technical support to do simple yet powerful things with key resources.

5. To facilitate the participation of literary and linguistic disciplines in the emerging e-Science agenda.

What are the most important successes and failures so far?

1. Success: the initiative is almost pan-European, although some key countries are not yet fully involved or integrated (UK, Italy), and a very few are not involved at all (Ireland, Switzerland); the integration of former TELRI partners from central and eastern Europe was successfully achieved.

2. Success: we have succeeded in getting enough funding from national funders to make CLARIN happen!

3. Partial failure: so far we have had only fairly small-scale engagement of scholars to elicit detailed requirements and to develop use cases.

4. Partial failure: we haven’t made the total shift of focus of the CLARIN community away from traditional concerns (own tools and research) to production infrastructure services for the humanities and social sciences.

5. Partial failure: we have not yet created a standards-oriented ecosystem for resource and tool creators to enable them to contribute to sustainable production services. To put it another way, we have no answer to the question “How do I make CLARIN-conformant resources?” I hope that the forthcoming Reference Manual will at least partially solve this problem.

What are the top priorities for future work?

1. We need to work out ways to lobby for and secure funding, in a situation where, in the Humanities, there is a lack of a critical mass of researchers (in any given discipline) who want research computing infrastructure, or who see it as a top priority. This means that there is a lack of an effective lobby group of influential scholars in most forums. This is one of the disadvantages of the cross-disciplinary nature of linguistics and the language resources and tools field.

2. We need to deliver something urgently to show the relevant communities that we can do it, and to give them a clearer idea of what we intend to do. Authentication and authorization infrastructure (AAI) is the key to delivering any kind of production service which can show an end-to-end use case, so we should make solutions in this area a logical priority.

3. Where is the data processing going to take place, who is going to pay for it, and how will we do the accounting? We urgently need to make progress towards solutions here as well if we are to create production-quality services.

4. Humanities and social sciences research has global connections. How will we accommodate users and service providers outside of our AAI domain? As CLARIN starts to rely on national funding, there is an increased danger of two-speed progress, with some countries and communities who are currently engaged being pushed out.

5. What will the platforms for users be, and who is going to make the user interfaces? Are we going to be able to overcome fragmentation and ‘silo-building’ – can we offer a good user experience while still allowing flexibility and connectedness? If so, how, and when?

Making your language resources discoverable and reusable

By Ylva Berglund Prytz and Martin Wynne, University of Oxford

The JISC-funded Discovering Babel project has enabled the Oxford Text Archive to improve the ways in which we make our language resources available for users to find and use. Here we will explain some of the ways in which other resource creators might be able to follow in our footsteps.

Language resources are electronic collections of language data that can be used for language study and research, and are created in a number of contexts. Sometimes the main purpose of a project is to create a dataset, and in many other cases language resources are created as a part of or simply as the result of a larger project to investigate a particular aspect of language. Irrespective of why and how a resource is created, there is usually scope for making the resource available to others. This report will examine some simple ways in which creators of language resources can make it easier for others to find and reuse them.

Why make resources available?

There are many reasons why you may want to make your language resources available to others. It may be a requirement for your funding. It may be that you simply want to give something back to the community, and contribute to our accumulation of knowledge. Sharing your resources can also be a way of drawing attention to your work and getting recognition for what you are doing, and showing that it is having an impact on wider research goals. If you are able to show that something you have created is valuable to a larger group of users, this is likely to work in your favour in future grant applications, and when looking for collaborators and support from the community.

Making language resources available is also a way of minimizing duplication of effort. If you have created a resource that others can use, they do not have to spend time and resources on creating their own resource.

Replicability of research results is another important issue. If others are to test and reproduce your results, or attempt to extend or refine them, then they will need to have access to the data, tools and methods which you used. Making resources available in this way is essential to testing, refining and building on research results, and is considered necessary for the verification of research findings and interpretations in many scientific domains.

Assuming that for one of the above reasons, or for another, you want others to know about and maybe also to reuse your language resources, what are the issues that you need to consider before sharing your resources? Thinking about the questions below should make your task easier and the sharing of the resource more effective.

Issues to consider when deciding whether and how to share your resources include:

  • How do you share?
    • Will you offer metadata, to help users find, evaluate and understand your resource?
    • Will you offer a service for users to access the resource (e.g. online access, or download option only)?
    • Will you deposit the resource in an archive or repository (instead of, or in addition to your own service)?
    • Or do you want to only share on request to users who get in touch with you?
  • Legal issues
    • Do you have the right to share the resources?
    • How do you protect your rights?
    • What kind of licence will you ask users to agree to?
  • Administrative and organizational issues
    • Do you have access to the resources needed to share your resources (server, staff, admin, user support, etc.)?
    • Who will be responsible for the service?
    • Are these reliable, sustainable and likely to be available in the long term?
  • Finding your users
    • How do users find your resource?
    • How can you make it easier for users to find/use your resource?
    • Can you support users?
  • Sustainability
    • How do you ensure you have the necessary resources/support/infrastructure to share your resource?
    • How do you ensure continuation of service?

Let’s now examine in more detail some of the issues relating to how to help users to find your language resources.

Making your resources discoverable

If you want to share your resources you have to make sure people know about them and can find them. The most effective way to do this is to make your metadata available to a portal which brings together information about where to find language resources in different locations. These exist in particular sub-domains (e.g. endangered languages, child language acquisition, learner language, sign language, for particular languages or language families, for historical periods, etc.), and there are a couple of more comprehensive initiatives: the Open Language Archives Community, and the CLARIN Virtual Language Observatory. Some questions to explore in order to market your resource effectively include:

  • Who are the potential users? Where do they currently look for resources?
  • What are the relevant mailing lists, conferences, and publications for your target audience?
  • Where in other domains, or sets of users, or geographical regions (beyond your immediate community or target audience) might you find interest in the resource?
  • If the resource is available online, or has a webpage associated with it, make sure you make it easy for search engines to find and index your page, for example by including the correct keywords in the website metadata (see Google’s guidelines for webmasters).

Once you have decided how and to whom you will make your resource descriptions available, you need to provide the relevant information in the right formats. If you decide to deposit your resource in a repository, you will get some assistance in doing this. If you deposit with the Oxford Text Archive, you will need to fill in a deposit form, and then the repository staff will create an electronic metadata record. This will be transformed automatically to the correct formats for the online catalogue record, for OLAC and for CLARIN. If you want to create your own records, you can follow the guidelines provided by the different repositories. Some expertise in creating and manipulating XML documents will be required.

Social media

You may use social media such as blogs, Twitter, Facebook, Digg, Delicious, and Zotero, if you think that this might be a way to reach your potential users. It might prove to be a way to reach unexpected groups of users by reaching outside of the academy. Your funders might consider this to be a useful way to increase wider impact. It’s probably still not clear how appropriate and useful such methods are, and it’s a fast-changing field. But it doesn’t take much effort to tweet, announce things on Facebook, or make links on various services. Furthermore, writing blogs can be a good way to report your work to a wide variety of stakeholders and potential users.

The point of making your language resources discoverable is to facilitate the reuse of them by others. Let us now briefly examine some of the issues relating to how you can make this happen as effectively as possible, starting with avoiding any potential legal pitfalls.

Before you can share – a little more on legal issues

Before you make your resource available you have to make sure you have the right to share it. You may also want to look at what you can do to protect your rights (for example, release the resource under a particular licence). You also need to consider if there are any restrictions on what users of the resource are allowed to do with it. Can they share it, add to it or develop it further? This could be set out in a user licence which you specify. Rights issues can be complex and often vary between different countries. If you have questions about what rights you have or what you need to do to have the right to share a resource, you may want to consult a legal representative for your area, for example the University lawyers or legal department.
If you are making the resource ‘freely available’, you may want to specify this with an open access licence. One way to encourage reuse is to make it simple for users to see under what conditions a resource is available.

Creative Commons (CC) licences can be used as “a simple, standardized way to grant copyright permissions to [your] creative work”. The CC licences can be used to specify that there are no restrictions whatsoever on re-use, or, for example, that people may only use the resource for non-commercial purposes or that they have to acknowledge the original creator when using it. It is also possible to specify that people may create derivatives (for example use part of the resource and/or add to it) and that such derivatives have to be made available under the same licence conditions. For more information about Creative Commons, please see http://creativecommons.org/.

Whatever rights or restrictions you assign to your resource, you need to consider if the situation is likely to change in the future. For example, will it be the case that restrictions can be lifted after a certain date? Or do you have permission to use certain source texts only for a limited time? If so, you have to ensure that you can deal with this.

As well as considering the legal and ethical issues relating to making your language resources available, you should also consider the licensing of the metadata associated with your resources. In order for users to be able to find, evaluate and reuse the resources, good descriptions of their nature and context are necessary. It is usual in the domains using language resources for these descriptions to be made freely available, but usually there is not a specific and clear statement of the terms under which they are made available. In order to avoid any restrictions on the free sharing of metadata, and to ensure that maximum use is made of it, it is better to assign a specific open access licence to all metadata records, such as ODC-PDDL or a Creative Commons licence.

In the case of the Oxford Text Archive, some of our resources are TEI XML documents, with the metadata embedded in the header of a single file which also contains the resource in the body. For these it was necessary to apply a single licence to both metadata and data, and since we have found that Creative Commons licences best fulfil our needs for licensing the textual data (in most cases), we opted for one of those. In cases where we make just the metadata available, for example as a catalogue, and to metadata harvesters, we will apply the least restrictive possible Creative Commons licence, usually known as the ‘no copyright’ or ‘CC0’ licence (http://creativecommons.org/publicdomain/zero/1.0/).

How do you enable reuse of your language resources?

Depending on the nature of the resources at your disposal, you can opt to share your resources in various ways. Whatever way you choose, the key point is to ensure that the solution is not dependent on specific people, machines, projects, etc. which are likely to be transient, but rather that it is embedded in a stable organizational set-up which is adequate for providing persistent service with high availability. The key questions to ask in deciding what sort of service to offer and how to provide it are the following:

  • Is what I am setting up sustainable?
    • Is the solution technically robust and not subject to discontinuation should current funding/staffing/equipment be cut?
  • Who is responsible for the service?
    • Is this a person (named or defined by function) or an organisation (unit, department, institution)?
  • Who is responsible for the various bits of infrastructure on which the service depends?
    • Technology (server, scripts, physical server space, etc.)
    • Human resources (server maintenance, user support)
  • What will the situation be in 1, 2, 5 or 10 years’ time?
    • What happens if you (or the person responsible for the service or part of it) leave or take on a different role?
    • What happens at the end of the current round of funding?
    • Will additional funding be needed/be available to continue the service?
    • Would it be better to look to move the service to another institutional home?

How can I make it easier for users to use the resources?

Let’s examine some of these options in a little more detail.

Distribution via email or on disk

A simple option, especially where small resources are concerned, is to simply send the resource to whoever requests it either as an email attachment (suitable for very small resources only) or on a CD or DVD.
This is only suitable for low-demand, small resources. You still have to consider legal issues, and what provision there is for making the resource available when you are not personally available to respond to requests. For distribution on disk there is also a cost – for the media and postage. What is more, the end user is left to their own devices when it comes to getting their resources connected to the relevant analysis tools. It can be tricky to work out which ones to use – will you be prepared to offer advice and technical support to users? Some will ask for it.

Online delivery

If you make your resource available online, you can opt to either make it available for download only (with some of the same problems identified above), or you may offer an online service which people can access and use via their web browser (for example a corpus with a search interface). Now a new set of questions arises:

  • Who maintains the website?
  • Can the site handle the volumes of traffic, and the amount of processing required?
  • How will you know how many users have visited the site and downloaded your resource, or performed other operations? Do you need to report this to funders or other stakeholders?
  • Who will maintain the server and ensure that the service is available?
  • Will you offer a service level description, setting down exactly what you offer and under what terms?
  • Can you monitor the availability of the online services (i.e. tell if everything is up and working properly)?
  • Do you need to restrict access to certain classes of user? If so, how will you do this?
  • Do you need to recognize users so that they can come back to datasets and workflows that they have started to assemble on previous visits?
  • How will you deal with user support or queries (technical or about the resource/service)? Will it be available even if you leave the institution, or change your ISP?
  • Is the URL stable, or is it likely to change when the university re-designs its website (or the website host goes into administration)?
  • What happens when the technology behind the service needs updating/renewing (for example to work on different operating systems or in different browsers)?
  • Are you prepared to offer any guarantees of availability and persistence of service to users who might require stable datasets and tools for their research, or who may want to be able to come back and reproduce results at a later date?
  • How will users cite your datasets and services in their reports and publications?

Depositing your resource in an archive or repository

A lot of the issues arising from running your own web service can be avoided if you deposit your resources in a repository, which will deal with distribution, as well as perhaps offering long-term preservation, help with generating and sharing metadata, and connection with other tools and resources. In deciding whether to do this, and whether a particular repository is appropriate, you may wish to consider:

  • Is there a cost associated? If so, is it a once-off, annual, etc? How will you pay ongoing fees after the end of the project?
  • What do you have to do to deposit (for example format of resource and metadata)?
  • How stable and reliable is the repository? How long is their funding likely to be continued?
  • Who knows about the repository? Is it known to potential users of your resource? Does it share metadata with relevant aggregators, and announce new deposits in appropriate forums?
  • Who has the right to use it? Is access restricted to members of particular institutions, associations, countries, etc.? Are there technical barriers which might exclude some sets of users?

There are several archives and repositories available. The Oxford Text Archive offers a service to deposit resources for a small administrative fee. This has the advantage of being a specialist archive for literary and linguistic resources, offering metadata to aggregators in this domain, and being part of the emerging research infrastructure being developed by CLARIN. Other services exist for more specialist resource types, such as SCOTS at the University of Glasgow for Scottish and historical resources, CHILDES for language acquisition studies, ICAME for English language resources, and the Endangered Languages Archive at the School of Oriental and African Studies, University of London. Each is well embedded in its research community, and so deposit with such an archive is an excellent way to reach particular sets of users.

There is also a lot of ongoing work in developing institutional repositories in Universities in the UK. While some of these are focussed exclusively on e-prints, some offer repository services for research data as well. Creators of resources should check on the facilities and services available in their institution (often based in the library or information services department), as deposit with the institutional repository may be a viable option. This may be useful for raising your profile locally and as a secure storage solution. It is however highly unlikely to satisfy all of your needs. An institutional repository which aims to cater for research output of all types and for all disciplines cannot have specialist curation expertise in all areas, and will not, for example, know about all of the relevant metadata standards, best practice in digital preservation of language resources, or connection to relevant discipline-specific resource discovery services. Repositories will typically offer non-exclusive deposit agreements, which means that when you deposit your resources, you do not give up any of your rights. There is normally no barrier to you depositing your resource in numerous archives. This is effective for preservation purposes, although you may need to consider the impact that it might have in terms of version control (will the resource be updated, and how do you check that the latest version is available in all places?), and monitoring usage.

Furthermore, it is increasingly likely that federations of archives will emerge, with the possibilities of cross-searching resources and connecting disparate collections and tools. Beyond this, sophisticated virtual research environments will emerge allowing more operations, as well as collaborations between groups of scholars, and connections to publications and other outputs. It is likely to be the specialist repositories which are connected to this new infrastructure, and it is likely to become increasingly difficult for the individual scholar to connect up their resources without the assistance of the repository and infrastructure specialists.

Whichever of the options you choose, you can help to ensure that users can work with your resource as effectively as possible by considering offering the following facilities:

  • A full description of the resource, carefully crafted user guidelines, FAQ, instructions (preferably with screenshots);
  • Support for answering user queries;
  • A forum for users where they can discuss issues that come up. Make sure that you, or someone with good knowledge of using the resource, is available to respond to queries, in particular if the forum is new or under-used;
  • Make it easy for users to give appropriate accreditation to resource creators and access services, thereby also further promoting your resource and announcing its availability;
  • Make it clear what the title of the resource is, who the creator is and where it is found (at a persistent URL);
  • Make any licence restrictions clear (especially if your licence stipulates that the resource creator/owner should be attributed by any user);
  • Include on your website a sample citation/bibliography entry that users can use for reference;
  • If you are offering an online service, test the interface during development, and try to find some resources for ongoing development in response to user feedback.

In summary, you need to take as wide a view as possible about who the potential users are, how they will find the resources, how they might want to use them, and then to think about how the arrangements will continue in the future. Good luck!

Discovering Babel: technical issues

The Discovering Babel project aims to make the digital resources in the Oxford Text Archive easier to discover for potential users. The technical issues in the project relate to the ways in which we are making the OTA catalogue data available in new ways. There are several aspects to this work:

  1. making the catalogue records available to be collected by online resource discovery services;
  2. transforming the catalogue records into a variety of different formats for the different services;
  3. updating catalogue records for the items in the archive.

Making the records available

Before Discovering Babel, the OTA metadata was available only in abbreviated form in the catalogue list on the website, and on the webpages for each resource, or in full when a user downloaded the resource. An important additional service delivered as part of the project workplan was to make the full metadata available for online services to collect, or harvest. We chose to do this using the most widely used protocol for this purpose, the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH for short).

In order to do this we took the following steps:

  • add the appropriate Apache and Perl modules to our web server to allow OAI-PMH queries to our web service;
  • implement crosswalks (using XSLT) from our metadata in TEI Header format to the Dublin Core format;
  • register as a metadata provider with relevant aggregators;
  • set up procedures to ensure the ongoing availability, persistence, maintenance and updating of the OAI-PMH service.

We have chosen to make metadata available in a number of formats via OAI-PMH, to fit the expectations and requirements of a number of harvesters relevant to our field. We therefore deliver Dublin Core, Dublin Core with the extensions defined by the Open Language Archives Community (OLAC), and TEI Headers. We were also planning to provide CMDI metadata for the CLARIN aggregator, but this format has not yet achieved sufficient maturity and stability, so we aim to add this later. In the meantime, the CLARIN aggregator is harvesting OLAC metadata, and in this way they are presenting OTA resources in the Virtual Language Observatory service at http://www.clarin.eu/vlo.

The OTA records are harvested from http://ota.oerc.ox.ac.uk/oai2/XMLFile/ota/oai.pl.
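To give a flavour of what a harvester sends to this address, here is a minimal sketch in Python that only builds the request URLs and does not contact the server. The verbs ListMetadataFormats and ListRecords, and the metadataPrefix value oai_dc, come from the OAI-PMH specification. The helper name oai_request_url is invented for this example, and the prefixes that the OTA uses for its OLAC and TEI Header formats are not stated here, so a harvester would look them up with ListMetadataFormats first.

```python
from urllib.parse import urlencode

# Endpoint quoted in the post above.
OTA_ENDPOINT = "http://ota.oerc.ox.ac.uk/oai2/XMLFile/ota/oai.pl"


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL from a verb and its arguments.

    Hypothetical helper for illustration; it is not part of the OTA code.
    """
    # OAI-PMH requests are plain HTTP GETs: one 'verb' plus key=value arguments.
    query = [("verb", verb)] + sorted(arguments.items())
    return base_url + "?" + urlencode(query)


# Ask the repository which metadata formats it offers.
formats_url = oai_request_url(OTA_ENDPOINT, "ListMetadataFormats")

# Request all records as unqualified Dublin Core. 'oai_dc' is the one
# prefix that every OAI-PMH repository is required to support.
records_url = oai_request_url(OTA_ENDPOINT, "ListRecords", metadataPrefix="oai_dc")

print(formats_url)
print(records_url)
```

A real harvester would fetch these URLs and follow any resumptionToken in the response to page through a long list of records.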

Crosswalks: transforming the metadata to different formats

We initially wrote the crosswalks using XSLT 2.0, but we found that the performance was very poor, and too slow for the harvesting services. We therefore backported the code to XSLT 1.0, which provided adequate performance and enabled the harvesters to operate. We plan to investigate these issues further together with other CLARIN centres to see if future improvements to the performance can be achieved.

What we understand so far is that the repeated calls to the Java-based XSLT 2.0 processor Saxon (in our case, using the saxonb-xslt package on Ubuntu) seem to be the problem. The original stylesheet which we wrote to transform the TEI Headers worked on a directory of header files. However, because of the way in which the OAI-PMH architecture works, the stylesheets had to be rewritten to work on one file at a time. The Java Virtual Machine therefore starts afresh for each call of Saxon, i.e. for every metadata item. This was computationally very costly, and simply providing more computing power would not have been a good solution, since the procedure does not seem to scale easily.
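The effect of a fixed start-up cost per call can be seen in a back-of-the-envelope model. The sketch below is in Python, and the figures in it are invented purely for illustration; they are not measurements from the OTA server. It compares starting the processor once per record with starting it once for the whole batch.

```python
def per_file_cost(n_records, startup_s, transform_s):
    """Total time when the processor is started afresh for every record."""
    return n_records * (startup_s + transform_s)


def batch_cost(n_records, startup_s, transform_s):
    """Total time when one long-running process handles every record."""
    return startup_s + n_records * transform_s


# Hypothetical figures for illustration only, not OTA measurements:
# 1000 records, 1 second to start the processor, 0.01 seconds per transform.
N, STARTUP, TRANSFORM = 1000, 1.0, 0.01

print("one start per record:", per_file_cost(N, STARTUP, TRANSFORM), "seconds")
print("one start in total:  ", batch_cost(N, STARTUP, TRANSFORM), "seconds")
```

Whatever the real numbers are, the per-record total grows with the start-up time multiplied by the number of records, which is why faster hardware alone does little to help.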

A key point for us to consider at this stage was that our original stylesheets made use of XSLT 2.0 features, but there are few 2.0 processors available. None seems to be based on C or C++. The only real alternative to Saxon of which we were aware was AltovaXML, a closed-source product available only for 32-bit Windows architectures.

We therefore ran tests with C-based XSLT 1.0 processing (with the xsltproc package on Ubuntu), which was lightning fast in comparison for the hundreds of metadata records, with a time factor improvement of 100-200 times compared to Saxon. We therefore rewrote the pertinent parts of the stylesheets to conform to XSLT 1.0 and implemented this solution.

We also considered another possibility, of moving to a servlet-based solution. There is a Java-based OAI implementation (jOAI), for example, to be deployed on a Tomcat server. Another option would have been to investigate setting up the Java-based Saxon XSLT 2.0 processor as a service in its own right, which could be consumed by the Perl code. Neither solution would involve starting up the JVM again and again. However, either solution would make it necessary to set up a server (Tomcat or Jetty, respectively), and we considered that, in addition to the extra implementation effort, this would add a maintenance overhead, with serious risks to the robustness, persistence and sustainability of the service.
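For readers unfamiliar with the term, a crosswalk is simply a mapping from fields in one metadata format to fields in another. Our crosswalks are written in XSLT; the sketch below shows the same idea in Python, using only the standard library, so that it can be tried without an XSLT processor. The sample header, the three-field mapping and the function names are all invented for this example, and a real crosswalk covers many more fields.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
DC = "http://purl.org/dc/elements/1.1/"
NS = {"tei": TEI}

# Invented sample header for illustration, not a real OTA record.
SAMPLE_HEADER = """<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <fileDesc>
    <titleStmt>
      <title>An example text</title>
      <author>A. N. Author</author>
    </titleStmt>
    <publicationStmt><p>Example only</p></publicationStmt>
    <sourceDesc><p>Example only</p></sourceDesc>
  </fileDesc>
  <profileDesc>
    <langUsage><language ident="en">English</language></langUsage>
  </profileDesc>
</teiHeader>"""

# Hypothetical mapping: (Dublin Core element, path in the header, attribute or None).
MAPPING = [
    ("title", "tei:fileDesc/tei:titleStmt/tei:title", None),
    ("creator", "tei:fileDesc/tei:titleStmt/tei:author", None),
    ("language", "tei:profileDesc/tei:langUsage/tei:language", "ident"),
]


def tei_header_to_dc(header_xml):
    """Return a list of (dc_element, value) pairs taken from a TEI header string."""
    header = ET.fromstring(header_xml)
    pairs = []
    for dc_name, path, attribute in MAPPING:
        for node in header.findall(path, NS):
            # Take the named attribute if one is given, otherwise the element text.
            value = node.get(attribute) if attribute else (node.text or "").strip()
            if value:
                pairs.append((dc_name, value))
    return pairs


def dc_to_xml(pairs):
    """Serialise the pairs as a flat container of Dublin Core elements."""
    ET.register_namespace("dc", DC)
    root = ET.Element("metadata")
    for dc_name, value in pairs:
        ET.SubElement(root, "{%s}%s" % (DC, dc_name)).text = value
    return ET.tostring(root, encoding="unicode")


print(dc_to_xml(tei_header_to_dc(SAMPLE_HEADER)))
```

Keeping the mapping in a table separate from the code that applies it makes it easier to add a second target format later, which is much the same reason for keeping one stylesheet per target format.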

Updating the records

The OTA has always made freely available the descriptions of the electronic resources in the archive. These descriptions take the form of catalogue records, or metadata, and contain information useful to potential users about the resource, including its title, a summary of the content, where the electronic resource came from (its provenance), technical formats, types of annotation, size of the files, any restrictions on its use, etc.

This metadata for each resource is encoded in an XML file, and the information is encoded according to the guidelines of the Text Encoding Initiative (TEI), following the latest (P5) version of the guidelines. In the area of literary and linguistic computing, the TEI Guidelines are a widely recognized and respected reference point and standard for the encoding of data and metadata. The metadata for OTA resources is therefore in the form of a TEI Header.

The work in Discovering Babel on making this metadata more visible, and on transforming it into other formats, has revealed some areas where it was necessary to update, correct or add to the existing information in the metadata. For example, it was found that the description of the language of a resource was missing in some cases, usually where the language was English, and was perhaps considered the default value in the past!
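A gap of this kind is easy to check for automatically. The following Python sketch assumes that the language is declared in the usual TEI P5 place, a language element with an ident attribute inside langUsage in the profileDesc. The record identifiers and headers are invented for the example, and this is not the procedure that we actually used.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def declared_languages(header_xml):
    """Return the language codes declared in a TEI header's <langUsage>."""
    header = ET.fromstring(header_xml)
    languages = header.findall("tei:profileDesc/tei:langUsage/tei:language", NS)
    return [lang.get("ident") for lang in languages if lang.get("ident")]


def headers_missing_language(headers):
    """Given {record_id: header_xml}, list the ids with no declared language."""
    return sorted(rid for rid, xml in headers.items() if not declared_languages(xml))


# Invented records for illustration; the ids are hypothetical.
HEADERS = {
    "rec-a": '<teiHeader xmlns="http://www.tei-c.org/ns/1.0"><profileDesc>'
             '<langUsage><language ident="en">English</language></langUsage>'
             '</profileDesc></teiHeader>',
    "rec-b": '<teiHeader xmlns="http://www.tei-c.org/ns/1.0"><profileDesc/>'
             '</teiHeader>',
}

print(headers_missing_language(HEADERS))
```

Run over the whole catalogue, a check like this gives a list of records to correct by hand, which is safer than silently assuming English.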


Discovering Babel workshop

A workshop on How to make your language resources discoverable was held at Oxford University Computing Services on Friday June 24th, as part of the JISC-funded Discovering Babel project.

Ylva Berglund-Prytz from OUCS welcomed the participants, who introduced themselves and revealed that they came from numerous universities, representing teachers, researchers, post-graduate students and archivists, from the UK and abroad. See slides (pptx).

Andy McGregor introduced the work of the Resource Discovery Task Force and the JISC programme ‘Infrastructure for Resource Discovery’, with a refreshing willingness to acknowledge the different standards and practices in different disciplines. See slides (pptx).

Martin Wynne then spoke about Discovering Babel, the project within the programme which relates to language resources, focussing on the issues relating to the different ways of describing and cataloguing language corpora (and other resources) and making those descriptions available to users in a variety of ways. See slides (pdf).

Alexander König of the Max Planck Institute for Psycholinguistics then gave a demonstration of the CLARIN Virtual Language Observatory, which is collecting and making available to users in a single place the information about language resources from all around Europe. Most impressive was the overlay of the geographical data on Google Earth, allowing users to find resources via the map. See slides (ppt).

James Wilson then spoke about the suite of projects (many of them JISC-funded) in OUCS which are addressing the more general data management needs of researchers. After the discipline-based and pan-European scope of the CLARIN initiative, it was fascinating to compare the idea of service provision which we might hope to find within an institution. See slides (pptx).

In the afternoon, a ‘show-and-tell’ session then allowed participants to share information about the resources and services that they were sharing with other researchers. This fascinating whirlwind tour of a snapshot of the resources available in the UK showed us all what a variety of extremely valuable datasets continue to be created.

The presentations included:

The final session was a discussion which went beyond concerns about discovering resources, and focussed more on the re-use of resources, and on ways in which they can be exploited online, cross-searched, combined together, and connected with online tools and services.

From a very open and frank discussion about our needs, concerns and frustrations there emerged a strong feeling that a UK network was needed to express our requirements more forcefully to funders and other relevant organisations who can help us to build the kind of services that we need.

Recent informal meetings with partially overlapping sets of people in Glasgow, Newcastle and Oxford have reinforced my impression that there is a strong desire to form a UK network of researchers interested in language data and tools. The motivations and proposed activities are to:

  • find ways to find, share and reuse resources;
  • develop joint projects to build resources and services;
  • promote interoperability of resources so that they can more easily be used with generic tools, and combined with each other;
  • lobby for UK funders to invest in infrastructure for creating and using language resources;
  • lobby for language data and tools to be included in national computing infrastructure;
  • lobby for UK participation in the European CLARIN infrastructure;
  • provide channels of communication between UK researchers and CLARIN (e.g. to feed in our requirements, get access to services, participate in technical discussions, etc.).

Clearly this meeting was only a starting point!


How to make your language resources discoverable

The Oxford Text Archive will host a one-day workshop on Friday June 24th entitled How to make your language resources discoverable, as part of the JISC-funded Discovering Babel project. The workshop is aimed at researchers who create and use corpora and other digital language resources, and will address the following questions:

How can we help users to find corpora and other digital language resources? Can we hope to have a one-stop shop where we can find them all?

Are there ways to describe the content of language resources in ways that help users to compare them, and find the right ones for their research?

How can I make MY language resources easier to discover and use?

Once we discover what we want, how can we make it easier to use language resources and tools? Can we create virtual research environments for corpus users?

What existing initiatives at the national and international level are addressing these problems, and what are the solutions? What can a grass-roots initiative at the UK level do?

Speakers will introduce Discovering Babel and the CLARIN Virtual Language Observatory, including presentation of the work that has been done in the OTA to make it easier for users to find and use the language resources, and how this work might help support other creators and providers of resources and services. A ‘show-and-tell’ session will then allow participants five minutes each to showcase the resources that they wish to share, or would like to have access to. Discussion will then go beyond the discovery of resources to how we can provide the services and tools that we need for online access to a variety of corpora and lexical datasets.

The workshop will also be one of the events which will launch CLARINET, a new network for UK-based researchers with an interest in furthering digital research in the language sciences and related disciplines. CLARINET will be loosely affiliated to the CLARIN European research infrastructure, and other relevant initiatives, but the focus will be on the requirements of researchers in the UK. The workshop will conclude with a round-table discussion on what CLARINET should aim to achieve.

Click here to sign up for this free workshop.


Do we still need language corpora?

Language corpora were originally developed as datasets for linguistic research, in a world where researchers rarely had access to machine-readable language data. Pioneers such as Stig Johansson (to whom this year’s ICAME conference is dedicated) provided an invaluable service and helped to create a new paradigm in linguistic research. Corpus linguistics subsequently developed sets of procedures and methodologies based on discrete, bounded datasets, created to represent certain types of language use, and studied as exemplars of that domain. The growth of the field and advances in technology mean that corpora have become bigger, more plentiful and more various, with huge reference corpora for a vast range of languages and time periods, and numerous specialist corpora representing a wide range of language varieties.

Nowadays, the enormous wealth of digital language data at our fingertips brings the role of the corpus into question. Large-scale digitization projects are delivering the writings of the past to our desktops in ways that allow us to configure bespoke datasets to help address our research questions. Much current language data is “born digital” and is easily captured and shared. Books and newspapers are published in electronic form, and made available in large collections. Online tools allow us to search for texts and collect them in virtual corpora. The boundaries between the corpus and other ad hoc datasets are blurring. What is the case for the carefully crafted corpus today?

At a session at the ICAME conference on 1st June 2011 I will be chairing a formal debate, with two speakers for and two against the motion, questions from the floor, and a summing up by the speakers, ending with a vote by the audience.

The motion will be:

“Language corpora are no longer necessary for linguistic research.”

Participants in the ICAME conference are warmly encouraged to come along and participate in what promises to be an entertaining debate on the key question confronting corpus linguistics today.

Speakers for the motion:

Silvia Bernardini, University of Bologna
Elena Tognini-Bonelli, University of Siena

Speakers against the motion:

Gregory Garretson, Uppsala University
Janne Bondi Johannessen, University of Oslo

We plan to release a recording of the event as a podcast from the University of Oxford.


Discovering Babel

I am currently working on the project Discovering Babel: enhanced language resource discovery, as part of a wider project to bring the latest technologies for finding and sharing data to the Oxford Text Archive. The project is funded by JISC, as part of the Infrastructure for Resource Discovery programme. This blog post sets out the project plan.

Aims, Objectives and Final Outputs of the project

The Oxford Text Archive is home to almost 2,000 literary and linguistic resources in electronic form. Many of the resources are the outputs of projects funded by UK and other funding agencies, including the British Academy, AHRC and AHRB. In the past two and a half years since the demise of the Arts and Humanities Data Service (AHDS), there have been more than 60,000 separate downloads of resources from the OTA.

This project aims to enhance and upgrade the resource discovery mechanisms of the OTA to ensure that it continues to offer free services to researchers in response to their needs, and to ensure that the technical infrastructure of the OTA is in line with the latest good practice in the Resource Discovery Task Force vision (see http://rdtf.jiscinvolve.org/wp/).

In particular, the OTA will implement procedures to ensure that:

  • each resource in the collection is assigned an http URI;
  • each resource URI is registered as a persistent identifier with the EPIC Handle Service;
  • URIs are made available via Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) in a machine-processable form in XML, in the following formats: Dublin Core; Open Language Archives Community (OLAC) metadata set; Text Encoding Initiative (TEI) Header; CLARIN Component Metadata Infrastructure element set (CMDI);
  • metadata is made freely available as open linked data;
  • every URI is resolved to a machine-processable resource containing relevant metadata.

As the vast majority of the collections described by the metadata are made freely available by the OTA, these enhancements will not only facilitate aggregation of metadata, but will also be key steps towards enabling enhanced access services for the collections, where standards-conformant web services can perform operations on them, and they can be deployed in virtual research environments.

Wider Benefits to Sector & Achievements for Host Institution

The OTA provides a free service to Higher Education institutions and other users worldwide. The enhancements to the service will be of benefit to all users, the vast majority of whom are outside of the University of Oxford. Indeed, a successful outcome for the project would lead to easier and more widespread discovery of the OTA’s services. The workshop, manual and other dissemination and networking activities will help to spread across the sector information about the lessons we learn and mistakes that we make.

For the University of Oxford, these changes to the technical set-up of the OTA will help to connect together more of our services, lowering maintenance costs, sharing facilities, spreading expertise and allowing new services to be built linking together disparate datasets. Learning to deploy these technologies will help us to acquire skills which will be transferable to other projects.

Risk Analysis and Success Plan

The risks associated with this project appear to be relatively low. No new recruitment has been necessary, and the key technical tasks are the responsibility of a team of developers in OUCS, so the potential risks associated with staffing are low. There is the chance that some of the services with which we aim to interoperate will change – for example new protocols for harvesting metadata might be applied – but we have sufficient flexibility to adapt our workplan. The biggest risk with this type of project is usually the sustainability of the outputs. While this can never be certain, we have tried to embed, as far as possible, our work in ongoing production services which are part of Oxford’s core institutional IT infrastructure. The OTA has 35 years of success at ensuring the sustainability of our services, and we aim to continue this proud record!

IPR

The issues relating to intellectual property rights will not be a barrier to the successful completion of this project. Metadata for the resources in the OTA have been created by OTA staff and have always been freely available.

The datasets to which the resource discovery metadata refer are not owned by the OTA, but the OTA has permission to make the resources available subject to a user licence, which restricts use to exploitation for the purposes of education and research.

Project Team Relationships and End User Engagement

Martin Wynne is the principal investigator and project manager. The technical tasks will be carried out by the InfoDev team at OUCS, which includes Sebastian Rahtz, James Cummings, Joseph Talbot, Alexander Dutton and Richard Buckner. Ylva Berglund of OUCS will also contribute to dissemination activities.

Projected Timeline, Workplan & Overall Project Methodology

The specific items of work and outputs of the project are as follows:

By the end of March:

  • Add records for British National Corpus datasets in OTA catalogue

By the end of April:

  • OAI-PMH target for harvesting OTA metadata (minimum 1000 records)
  • Establish persistent locations for metadata
  • Establish persistent locations for datasets
  • Enhanced and corrected metadata to ensure meaningful interoperation with access services and aggregators
  • Crosswalks for metadata from TEI Headers to DC, OLAC, CMDI
  • Crosswalks for metadata from TEI Headers to RDF

By the end of May:

  • Register persistent identifiers with a handle service
  • Ensure visibility of metadata in OLAC and CLARIN aggregators
  • Workshop ‘How to make your language resources discoverable’ in Oxford

By the end of July (end of project):

  • A freely available online manual ‘How to make your language resources discoverable’
  • Make XSLT for crosswalks available from OTA website
  • Deliver enhanced metadata in all above formats via OAI-PMH

Budget

The overall cost of the project is £37,465, of which the JISC grant pays £28,632, and the University of Oxford is providing the rest. Expenditure in the major cost categories is broken down in the figure below:

[Figure: Breakdown of major budget categories]


CHAIN Panel Session

In a panel session at the Digital Humanities 2010 conference, representatives of key organisations and infrastructure initiatives explained some of the ways in which they are engaged in building services to support research in the Humanities, and were asked to address the following questions: What are the main barriers to progress? What are the most exciting opportunities?

The panel was organised under the umbrella of CHAIN, the Coalition of Humanities and Arts Infrastructures and Networks. Session chair Martin Wynne introduced CHAIN as a forum for discussion, with a very light-weight organisational structure, with fluid membership and boundaries, no budget, and meeting only when necessary. CHAIN participants have resolved to work together on advocacy for improved infrastructure, and on aligning our infrastructure initiatives to allow the maximum interoperability of services.

John Unsworth for the Alliance of Digital Humanities Organizations (ADHO) gave an interesting overview of the current state of the digital humanities, and how it was becoming ‘sexy’, to the surprise of many long-standing practitioners. He made an appeal for the audience to be more open and outgoing beyond our own cliques and sub-communities in order to make bigger impacts.

Neil Fraistat unveiled centerNet’s exciting new website, and emphasised the importance of Digital Humanities Centers. He also drew attention to the dangers of creating digital silos, where collaboration only occurs on small-scale projects, with no coordination on “big issues” such as linking and sharing data sets and providing shared services. Neil promoted centerNet’s ambitious expansion plans, with affiliated regional steering committees already in place in Asia Pacific, Europe, North America, and Britain and Ireland. Neil saw the current challenges as relating to crossing national boundaries, cultural divides and language communities, especially given the lack of truly international funding opportunities. A key challenge now is for centerNet to deliver a sustainable governance and business plan, and there is the perennial problem of translating good intentions into effective action.

Steven Krauwer presented CLARIN, a European federation of digital archives with language data and resources, which is building an e-infrastructure with uniform single sign-on access to language and speech technology tools to retrieve, manipulate, enhance, explore and exploit data, with a target audience of humanities and social sciences scholars. In terms of barriers and problems, Steven was happy to report good progress in all technical areas, but said that the risks lay in sustainability and take-up by users. CHAIN offers an opportunity for CLARIN to work together with other infrastructures on issues such as defining business models and criteria for success, in order to make the infrastructures, data and tools sustainable. There are also opportunities for us to work together to address legal and IPR issues, to introduce simplified licensing schemes and to influence legislators. CHAIN also offers important routes to access communities of potential users of infrastructures outside of CLARIN’s current network. Coordination is the key, otherwise a multiplicity of infrastructure initiatives risks making the problem of fragmentation even worse.

Sheila Anderson presented the ambitious plans of DARIAH to enhance and support digitally enabled research across the humanities and arts through the establishment of virtual competency centres, focusing on four areas: research and education; infrastructure; content and legal; and advocacy, outreach and promotion. Sheila explained how the vision for DARIAH was not just about technology, but as much social and cultural, involving changes in the ways in which we think about research objects and practice, and in a new social model for the exchange of data and ideas. Effecting these changes will involve harnessing the collective intelligence of the digital humanities in new and innovative ways.

David Robey introduced the Network of Expert Centres in Britain and Ireland, a new network aiming to preserve some of the gains of former programmes such as the AHDS and Methods Network, and the large investments that have been made in digital outputs. The Network is based around arts-humanities.net, a community support and knowledge base that provides the minimum necessary virtual infrastructure in the current situation of a lack of national provision or funding.

Chad Kainz joined us via Skype in the middle of the night from Chicago and told us the latest developments in Project Bamboo. The Bamboo Technology Project has been proposed to the Andrew W. Mellon Foundation, and is planned to start in October 2010. Bamboo is a multi-institutional, interdisciplinary, and inter-organizational effort that brings together researchers in arts and humanities, computer scientists, information scientists, librarians, and campus information technologists to tackle the question: How can we advance arts and humanities research through the development of shared technology services? The Technology Project will provide such tools, services and platforms, and is set to be a key part of the jigsaw in the emerging international e-infrastructure. One of the key challenges today is to engage with European and other international initiatives in order to ensure the maximum possible interoperability of tools and datasets.

This year’s panel session demonstrated the gains of a year of development work and increased coordination activity. There is still much to be done, but there was a refreshing openness and willingness to cooperate which will be the key to overcoming the current fragmented environment.


Summit meeting of Digital Humanities Centres

centerNet had their first international summit at King’s College London on the 3rd and 4th July 2010. The summit was supported by the NEH and organized by Neil Fraistat and Kay Walter. The summit was a chance for directors of centers and funders to talk to each other, to develop collaborations, and to develop regional groups.

I was invited as the initiator of CHAIN as well as wearing my hats as member of the CLARIN Executive Board, member of the steering committee of the Network of Expert Centres in Britain and Ireland, and representative of Oxford University, along with David Robey.

For an overview of the proceedings, I recommend Geoffrey Rockwell’s blog:

I will focus here on the elements of relevance to Oxford.

I am pleased to say that we are involved in many of the most important initiatives: CLARIN, DARIAH, CHAIN, Bamboo, Network of Centres, centerNet; we certainly seem to be involved in more things than anyone else!

The regional breakout group for Britain and Ireland discussed recommendations to funders. We identified a barrier to collaboration in that institutions are in competition with each other for funding, and we discussed how this could be addressed by financial incentives for collaboration. There are funds for regional collaboration in devolved countries (e.g. Wales, Scotland) which have produced useful results. One way to foster cooperation would be to give more incentives to share resources and services.

Funders insisting on sustainability plans involving institutional buy-in and embedding (as JISC do, for example) can help to improve institutional policies and develop capacity. Funders could also help with the promotion of infrastructure and standards: they could give a big boost to (bottom-up) initiatives that promote collaboration and cooperation by using grant conditions and recommendations, at least suggesting them as ways to promote re-use and linking of data, and thus obtain impact and value for money. There would be no cost to funders to do this. But not all institutions can build a DH centre, or a comprehensive institutional repository, or other services. What is the incentive for big centres to collaborate with small ones? What could be a business model for institutions with well-developed infrastructure to support others?

The AHRC have said that they won’t fund or get involved with infrastructure, so there seems little to discuss with them, unless we can suggest small and cheap things to make an impact. Networks and workshops can be useful, but current schemes are directed at new initiatives, and are short term. Short term funding doesn’t help to sustain the outputs of these activities.

It was strongly felt that we, the researchers, should provide evidence of value in terms of improved or transformed research and teaching and other impacts, via “compelling case studies”. And we felt that the current impact agenda, for all of its faults, could be an opportunity, because it may be a route to rewarding reusability and sharing of resources.

Discussions about the mission, structure and business model for centerNet foundered a little on the notion of ‘center’. I argued that it was not necessarily desirable for an institution to organize itself with a digital humanities centre, but rather that computing in the Humanities could be promoted and supported by other means. Furthermore, the promotion of centres, and the promotion of the ‘discipline’ of Digital Humanities, risk ghettoization and a reduced relevance of digital activities to the mainstream of research in the various disciplines. It seems that the experience and outlook of the University of Queensland, at least, is in line with ours.

Invited speaker Jon Orwant from Google tried to be controversial, and succeeded with the provocative assertion that funders should only promote bottom-up initiatives. I pointed out (the “good question” cited in Geoffrey’s blog!) that we have decades of experience of bottom-up creation of tools and data, which has resulted in fragmentation, with a variety of standards, data formats and licensing arrangements, and that this is currently the biggest barrier to progress. So the provision of some infrastructure, or at least promoting the adoption of some shared policies and standards, is the key challenge today. I would agree, though, that this could be done in as light-weight a manner as possible, so as not to thwart innovation and bottom-up initiatives.

In fact, successful infrastructure initiatives, such as CLARIN, are bottom-up in the sense that the researchers and technologists identified the problem of fragmentation and went to the funders asking for money to build research infrastructure.

In summary, I believe that centerNet is a very useful vehicle for us here in Oxford as a way to connect with numerous centres, communities, regions and funders. In particular, our ongoing involvement can play an important role in:

  • linking our services and resources to users;
  • building collaborative projects;
  • dissemination of our research and other activities;
  • advocacy for digital humanities to funders and politicians and other bodies;
  • international expansion of research communities and collaborations.

To get a visual flavour of the proceedings, you can see some photos: John Unsworth’s photos

The centerNet website is at:

And the new beta site:
