I was recently invited to Helsinki by Varieng to teach a workshop on TEI XML, concentrating specifically on transcription. The workshop slides and materials are at http://tei.oucs.ox.ac.uk/Oxford/2010-10-helsinki/, though these were largely based on the TEI Summer School 2010 that we taught earlier in the year. We hope to partner with Varieng to convert the Helsinki Corpus to TEI P5 XML.
simple dynamic transformation of xml with htaccess, php, and xslt
I often transform from TEI XML to XHTML as part of projects, but in some instances it is more difficult to manage using things like the eXist XML Database or Apache Cocoon, or even AxKit. This is because the hosting arrangement means that only a limited number of technologies are available.
In most cases these days a Linux-based server will have Apache’s HTTP server installed, hopefully with the rewrite module (mod_rewrite) enabled. In addition most hosting, even shared hosting, has PHP installed with its libxslt-based XSL extension for XSLT processing. Sadly, this only copes with XSLT1, not XSLT2.
However, one way to use this is to have one’s .htaccess file rewrite incoming URLs to run an xml2html.php conversion.
Basic preceding stuff:
#Turn on Rewriting
RewriteEngine On
RewriteBase /
# Redirect any svn requests
RewriteRule ^\.svn/(.*)$ http://subversion.tigris.org [R]
# utf-8 please
AddDefaultCharset UTF-8
# change directory index to index.xml as default
DirectoryIndex index.xml index.php index.html index.shtml
#ErrorDocuments
ErrorDocument 404 /unavailable.html
ErrorDocument 403 /forbidden.html
Here we start by turning the RewriteEngine on and setting the RewriteBase to the root of the domain. I’ve also got a RewriteRule that takes any requests for stuff in subversion directories and redirects them to the subversion site instead. (Though actually I’m thinking of having that just 404 or 403 instead.) After that we set the default character set to UTF-8, change the default directory index file names, and specify some error documents for 404s and 403s. (These are of course actually unavailable.xml and forbidden.xml, and are transformed by the rule further down.)
After this comes the bit where requests for HTML files are rewritten into parameters on a PHP script:
# If I ask for .xhtml then give me xml2html
RewriteRule ^(.*)\.xhtml$ /scripts/xml2html.php?xml=../$1.xml&xsl=site.xsl&%{QUERY_STRING} [L]
# If I have asked for .html then if the .html file exists, then give it.
RewriteRule ^(.*)\.html$ $1 [C,E=WasHTML:yes]
RewriteCond %{REQUEST_FILENAME}.html -f
RewriteRule ^(.*)$ $1.html [L]
# else provide XML dynamically with xml2html.php
RewriteCond %{ENV:WasHTML} ^yes$
RewriteCond %{REQUEST_FILENAME}.xml -f
RewriteRule ^(.*)$ /scripts/xml2html.php?xml=../$1.xml&xsl=site.xsl&%{QUERY_STRING} [L]
The first of these says that when I ask for any URL on the site ending in .xhtml, it should take the XML file of the same name and transform it using the xml2html.php script and the site.xsl stylesheet, both in the /scripts directory. This is just for me, so that I can force it to run the transformation even if both foo.xml and foo.html exist in the same directory.
After this the next RewriteRule matches any request on the site that ends in .html and keeps the first part of it (the path and filename). Simultaneously it uses ‘C’ to chain this with the next rule and ‘E’ to set an environment variable ‘WasHTML’ to ‘yes’. Then there is a RewriteCond testing whether this filename with a .html extension exists. If so, it rewrites the request to that filename.html and ends. If not, it tests whether the environment variable WasHTML is set to yes (because remember we’ve taken off the extension), and whether the filename we’ve asked for, ending in .xml, exists. If so, it runs the script, giving the filename with .xml as the xml parameter and in this case site.xsl (in the same scripts directory) as the xsl.
That .htaccess file as a whole looks like:
#Turn on Rewriting
RewriteEngine On
RewriteBase /
# Redirect any svn requests
RewriteRule ^\.svn/(.*)$ http://subversion.tigris.org [R]
# utf-8 please
AddDefaultCharset UTF-8
# change directory index to index.xml as default
DirectoryIndex index.xml index.php index.html index.shtml
#ErrorDocuments
ErrorDocument 404 /unavailable.html
ErrorDocument 403 /forbidden.html
# If I ask for .xhtml then give me xml2html
RewriteRule ^(.*)\.xhtml$ /scripts/xml2html.php?xml=../$1.xml&xsl=site.xsl&%{QUERY_STRING} [L]
# If I have asked for .html then if the .html file exists, then give it.
RewriteRule ^(.*)\.html$ $1 [C,E=WasHTML:yes]
RewriteCond %{REQUEST_FILENAME}.html -f
RewriteRule ^(.*)$ $1.html [L]
# else provide XML dynamically with xml2html.php
RewriteCond %{ENV:WasHTML} ^yes$
RewriteCond %{REQUEST_FILENAME}.xml -f
RewriteRule ^(.*)$ /scripts/xml2html.php?xml=../$1.xml&xsl=site.xsl&%{QUERY_STRING} [L]
The PHP script this is using (which I borrowed from a colleague) uses PHP’s libxslt-based XSL extension (http://www.php.net/manual/en/book.xsl.php) for its XSLT processing. It is fairly short and consists of:
<?php
# Basic check for directory/site traversal
if (preg_match('/\.\.\/\.\./', $_REQUEST['xml'])) { die("invalid input"); }
if (preg_match('/http/', $_REQUEST['xml'])) { die("invalid input"); }
if (preg_match('/http/', $_REQUEST['xsl'])) { die("invalid input"); }
if (preg_match('/\.\.\//', $_REQUEST['xsl'])) { die("invalid input"); }
# Load the XSL document into an XSLTProcessor
$xp = new XSLTProcessor();
$xsl = new DOMDocument();
$xsl->load($_REQUEST['xsl']);
$xp->importStylesheet($xsl);
# Pass the XML filename to the stylesheet as the 'xml' parameter
$xp->setParameter('', 'xml', $_REQUEST['xml']);
# Load the XML document
$xml_doc = new DOMDocument();
$xml_doc->load($_REQUEST['xml']);
# Process any XIncludes
$xml_doc->xinclude();
# Transform the XML with the XSL, or put out an error
$html = $xp->transformToXML($xml_doc);
if ($html !== false && $html !== null) {
  echo $html;
} else {
  trigger_error('XSL transformation failed.', E_USER_ERROR);
}
?>
The first bit of this is just a security precaution against directory (or site) traversal which rejects anything that has ‘../..’ in it or ‘http’. I’m sure there are much better ways to do this, but just checking the xml and xsl parameters seemed the easiest. I could have made a function and then passed each of them to it, or had the regex look for either of these two things, but I think it all works out the same and doesn’t seem to have much of a speed implication. Then we start a new XSLTProcessor() and a new xsl DOMDocument, load in the xsl file given in the xsl parameter, and also pass the processor the parameter ‘xml’ so that we can use this in our XSLT if we want. Then we start a new xml_doc DOMDocument, load in the requested XML file, and do any XIncludes in that XML file. We then transform the XML doc to HTML with transformToXML, or otherwise trigger an error and put that out.
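For what it’s worth, the receiving end of that ‘xml’ parameter might look something like the following. This is only a minimal sketch of a site.xsl, assuming TEI P5 input; the choice of elements handled and the ‘source’ class are made up for illustration and not taken from any real site. It sticks to XSLT1, since that is all PHP’s processor copes with.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:tei="http://www.tei-c.org/ns/1.0"
  xmlns="http://www.w3.org/1999/xhtml"
  exclude-result-prefixes="tei"
  version="1.0">
  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

  <!-- filled in by setParameter() in xml2html.php; empty if run by hand -->
  <xsl:param name="xml"/>

  <xsl:template match="/">
    <html>
      <head>
        <title><xsl:value-of select="//tei:titleStmt/tei:title[1]"/></title>
      </head>
      <body>
        <xsl:apply-templates select="//tei:text/tei:body"/>
        <!-- hypothetical footer showing which file was transformed -->
        <p class="source">Source: <xsl:value-of select="$xml"/></p>
      </body>
    </html>
  </xsl:template>

  <!-- a couple of illustrative TEI-to-XHTML mappings -->
  <xsl:template match="tei:head">
    <h2><xsl:apply-templates/></h2>
  </xsl:template>
  <xsl:template match="tei:p">
    <p><xsl:apply-templates/></p>
  </xsl:template>
</xsl:stylesheet>
```

Anything not matched here just falls through to the built-in templates, so the text still comes out, only without markup.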
This is a fairly lightweight way to transform XML to HTML on the fly using the technologies (PHP and .htaccess) that most hosting solutions provide. I’m using something like this on one of my personal sites and it is in use in a slightly different form in a number of work sites.
Hope it is useful to someone.
For Loops in XSLT2
A colleague asked me the other day about the proper way to do for-loops in XSLT2 or more specifically in XPath2. He knows all about xsl:for-each and xsl:for-each-group iteration over things, and of course recursively calling a template while passing a variable to let you count how many times you’ve done it.
I’ve always found that kind of recursion annoying, and in XSLT2 if you just want to do something a number of times, then it is also unnecessary. XPath2 allows you to do XQuery-like for-loops as part of your path statement. Take this short and stupid XSLT2 stylesheet for example:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="2.0">
<xsl:output indent="yes"/>
<xsl:param name="start" select="1"/>
<xsl:param name="end" select="10"/>
<xsl:variable name="from" select="$start"/>
<xsl:variable name="to" select="$end"/>
<xsl:template match="/" name="main">
<foo>
<xsl:for-each select="
for $i in $from to $to
return $i
">
<blort><xsl:value-of select="concat('value is: ', . )"/></blort>
</xsl:for-each>
</foo>
</xsl:template>
</xsl:stylesheet>
Let’s break this simple example down a bit.
First we have some starting stuff:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="2.0">
<xsl:output indent="yes"/>
<xsl:param name="start" select="1"/>
<xsl:param name="end" select="10"/>
<xsl:variable name="from" select="$start"/>
<xsl:variable name="to" select="$end"/>
<!-- ... -->
</xsl:stylesheet>
All this is doing is starting up the stylesheet, saying that we want the result indented, and saying that there are two parameters ‘start’ and ‘end’ which if they aren’t set should be ‘1’ and ‘10’ respectively. I then copy these to global variables ‘from’ and ‘to’ just to make my life easier.
<xsl:template match="/" name="main">
<foo>
<xsl:for-each select="
for $i in $from to $to
return $i
">
<blort><xsl:value-of select="concat('value is: ', . )"/></blort>
</xsl:for-each>
</foo>
</xsl:template>
The whole template here is fairly simple. It is run either when it matches the root node ‘/’ or when it is called by its name (i.e. with “saxon -it:main for-loops.xsl”). We then output a ‘foo’ root element of our output document. Then we have an xsl:for-each, which isn’t really the for-loop itself but does something for each iteration of this loop. For each number in the series we put out a ‘blort’ element whose content says what the value is. But in order to create the series which the xsl:for-each is iterating over we have made our select statement be “for $i in $from to $to return $i”. This says: bind a new variable ‘i’ to each of the things in the range from the ‘from’ variable to the ‘to’ variable, and give us back the value of the ‘i’ variable. So in our case it will create a series from 1 to 10 for the xsl:for-each to operate on.
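Strictly speaking, when all you return is $i itself the range expression alone is already the sequence you want, and the ‘for’ only earns its keep when each item is computed. A small sketch of both, using the same parameter names as above (the ‘square’ element is just made up for the example):

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="xs"
  version="2.0">
  <xsl:output indent="yes"/>
  <xsl:param name="start" select="1" as="xs:integer"/>
  <xsl:param name="end" select="10" as="xs:integer"/>

  <xsl:template match="/" name="main">
    <foo>
      <!-- '$start to $end' is itself the sequence of integers -->
      <xsl:for-each select="$start to $end">
        <blort><xsl:value-of select="concat('value is: ', .)"/></blort>
      </xsl:for-each>
      <!-- a 'for' expression is useful when something is computed per item -->
      <xsl:for-each select="for $i in $start to $end return $i * $i">
        <square><xsl:value-of select="."/></square>
      </xsl:for-each>
    </foo>
  </xsl:template>
</xsl:stylesheet>
```

It can be run the same way as the stylesheet above, by calling the ‘main’ template by name.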
Hopefully that is the last time I hear that XSLT can’t do for-loops. I’ve put this here to remind me later when I’ve forgotten.
ENRICH
Until December 2009 I worked on the ENRICH project, and as it has now finished, I thought that I should reflect on some of what the project has done and the aspects we’ve been involved with here in Oxford. For the most part the project has been attempting to both aggregate manuscript descriptions into the manuscriptorium framework and standardise these manuscript descriptions to a single, common, agreed format. For the background to the ENRICH project, see the website, and especially this article on the ENRICH Project and TEI P5. A list of deliverables is also available.
Standardisation of Specification
The workpackage we were most involved with, partly because we were leading it, was workpackage 3 whose object was:
To ensure interoperability of the metadata used to describe all the shared resources by analysing the various standards used by different partners and ensuring their mapping to a single common format, which will be expressed in a way conformant with current standards.
As one might expect, in practice, this common format was a more tightly constrained subset of the TEI recommendations on Manuscript Description. The difficulty in any such endeavour is getting coherent agreement between a large number of representatives on a wide variety of customisations. As part of this process we undertook a comparison of MASTER, TEI P5, and Manuscriptorium formats. A number of revisions were made to the ENRICH schema through the course of the project. Deliverable D3.1 was a “Revised TEI-Conformant specification” available in a number of schema languages. The ENRICH Schema is publicly and freely available as DTD, RELAX NG, and W3C Schema, but we recommend the RELAX NG format:
- DTD
- RELAX NG (compact)
- RELAX NG (XML)
- W3C Schema (also needs xml.xsd)
Documentation
The next deliverable, D3.2, was “Documentation and training materials for use with the ENRICH Specification”. Because the TEI ODD had been written with documentation in it, the same TEI ODD which generated the schemas above could also be used to generate project-specific documentation. In addition to the documentation written specifically for the ENRICH project, this gave us access to all the internationalised reference material available in the TEI Guidelines as a whole. We could therefore produce versions of the documentation which, while still primarily in English, contained glosses of the elements in another language. So for example:
<msIdentifier> (manuscript identifier) contains the information required to identify the manuscript being described.
in the English documentation for the ENRICH Specification became, in the French:
<msIdentifier> (identifiant du manuscrit) Contient les informations requises pour identifier le manuscrit en cours de description.
While this is admittedly of limited benefit, since the bulk of the documentation remains in English, it can aid comprehension to those reading in a foreign language to have the element descriptions in their own language. The ENRICH Specification documentation is available in the following languages and formats:
- English, HTML
- English, PDF
- French glosses, HTML
- French glosses, PDF
- Spanish glosses, HTML
- Spanish glosses, PDF
- Italian glosses, HTML
- Italian glosses, PDF
(HTML needs odd.css and tei.css)
Training Materials
Training materials were also created as part of D3.2 and took the form of slide sets as PDF, HTML, and TEI XML that project partners were free to take, modify, and use in teaching the ENRICH schema:
- What is XML markup for? (PDF; also HTML and XML source)
- Live long and prosper! Lessons from the TEI (PDF; also HTML and XML source)
- Using the basic TEI structural elements (PDF; also HTML and XML source)
- Names, People, and Places (PDF; also HTML and XML source)
- Handling primary sources in TEI XML (PDF; also HTML and XML source)
- booklet with all the above
Migration Tools
While the primary migration tools from other formats to the ENRICH Specification were undertaken by the lead technical partner, we were tasked with undertaking a case-study-based analysis of the construction of migration tools and then making recommendations to the project based on this. The migration case studies focussed on MASTER records that we had accumulated as a testbed and EAD records given to us by the Bodleian Library. The Case Studies on Migration to the ENRICH Specification and all their materials are freely available online. The case studies examined methods for transformation of MASTER and EAD records to TEI P5, mainly using XSLT-based conversions. The report on the Development and Validation of Migration Tools is available online.
ENRICH Garage Engine
Originally D3.4 of the ENRICH Project was a “Report on METS/TEI interoperability, best practice with respect to handling of Unicode and non-Unicode data in Manuscriptorium and P5 conversion techniques”. However, after much investigation it was determined that the use of METS was unnecessary for our extension to the Manuscriptorium platform. (This is not to say that it would not have been suitable for this or other uses.)
Part 1 of D3.4 and some of the work on it was replaced through the development of the ENRICH Garage Engine (EGE) and a report on the Documentation and Use of the ENRICH Garage Engine. This is a primarily web-service based format conversion engine developed by PSNC which enables document conversion through a number of formats. The engine itself consists of a web service and website frontend and underneath consists of a recognizer, a validator, and a converter. As the EGE website explains:
- Recognizer – this plug-in is responsible for the recognition of the Internet Media Type (MIME type) of the given input data. For example, it will receive the input data and state that the input data has text/xml MIME type. The recognized data may then be further validated to check the format of the data.
- Validator – this plug-in is responsible for validation of the input data. For example it may be used to validate the ENRICH TEI P5 data stored in a MIME type (e.g. text/xml) either received from end user or created by one of the converters. The following notation is assumed: ENRICH TEI P5 (text/xml) – it means that validator is able to validate ENRICH TEI P5 format encoded in text/xml.
- Converter – this plug-in is responsible for converting the input data. It may be, for example, conversion from XML to Word, conversion from Word to PDF, conversion of the XML from one form to another (e.g. MASTER -> ENRICH TEI P5) or even cleaning the input data (e.g. removing redundant information).
You can try the EGE at its website:
ENRICH gBank and Non-Unicode Characters
One problem encountered in the migration of legacy documents to the ENRICH Specification might be that these records use characters which are not currently present in Unicode. The Medieval Unicode Font Initiative (MUFI) campaigns for inclusion of some of these specialized characters into the Unicode Specification. The second half of the D3.4 deliverable we produced was a report on Best practice in handling non-unicode characters. This included the description of a software tool, the ENRICH gBank, produced to assist in normalization and documentation of non-Unicode characters. This contains a list of all of MUFI’s non-Unicode characters in the Private Use Area (PUA), images of them, and a representation of them using a TEI <char> element. For the most part these were automatically generated from the MUFI Spec. Conversion of this involved exporting the Adobe InDesign file as RTF, converting this to a basic presentation TEI XML, and running a transformation script on this to extract just the data we needed for our own tables. In addition, the PUA references were used, in conjunction with the Andron Scriptor Web font, to produce first SVG files (using Apache Batik) and then specific-sized PNG files from these. This allowed us to have character images for each of the characters in the PUA.
You can see the ENRICH gBank on the ENRICH beta website at:
ENRICH Templates
As part of the ENRICH teaching materials we also created some ENRICH templates, to assist those who wanted a guide as to the kind of material that should be present in an ENRICH manuscript description.
- A CSS file for manuscript descriptions (e.g. for use in oXygen’s author mode)
- A basic ENRICH template file for manuscript descriptions (view source)
- A detailed ENRICH template file for manuscript descriptions (view source)
A number of projects have taken these templates as starting points to further customise in their own use of the ENRICH Specification or TEI P5 msDesc.
Conclusions
Working for any large and dispersed EU project always has its benefits and drawbacks. In the case of ENRICH we were able to draw on a wide range of experience, technologies and data because of the diverse nature of the project. One of the major drawbacks stems from being partnered with commercial organisations. While all the work they did in their development and support of the Manuscriptorium platform was top notch, they naturally have the commercial interests of their business model at the forefront of their activities. This meant, for example, that while the ENRICH Specification and all the software, documentation, training materials and tools that we (OUCS) produced were licensed under an open licence, the same was not true of the main commercial company behind Manuscriptorium. The platform itself is not open source, and at no point were we able to see the workings of the platform, nor contribute patches or bug fixes to it. This meant any of our development took place in an isolated manner and at arm’s length.
Fair enough, the EU (via its eContent+ programme) funded this project with the understanding, presumably, that this would be the case. However, I feel that it is wrong for the EU to fund projects with commercial partners where those partners are not required to release the products of the funded work under an open licence of some sort. I’m not in any way against these commercial companies, but there are plenty of workable business models which enable them still to profit from materials they have developed and released under an open licence.
The ENRICH project has produced a lot that is good and interesting, and one of its major achievements is the network of individuals, projects, and institutions which are all approaching medieval manuscript description in the same manner. Although ENRICH (as a schema or project) is certainly not the last word in large-scale projects for the aggregation and standardization of medieval manuscript descriptions, it is a good development and milestone along that road.
List of Deliverable Reports
- D 3.1 — Revised ENRICH TEI P5 Specification
- D 3.2 — Documentation and training materials for use with ENRICH Specification
- D 3.3 — The development and validation of migration tools
- D 3.4 — 1: Documentation and use of the ENRICH Garage Engine and 2: Best practice in handling of Unicode and non-Unicode data
Thunderbird + Lightning Nexus Calendar Export to Google Calendar
There are plenty of ways to sync one’s work (nexus, Oxford’s version of Exchange) calendar with google if you are using Windows and Outlook. However, I’m using Ubuntu Linux. The solution I’ve chosen for getting mail and shared calendaring is Thunderbird + Lightning + Davmail. This works, but has idiosyncrasies such as not allowing you to share calendars (though you can use calendars you have shared through another method such as Outlook2007 or OWA-Messageware).
Let’s be clear here, I do not need full synchronisation. What I want to do is:
- when looking at my google apps calendars (which I intentionally separate from my work ones) I want to be able to have at least read-only view of my work calendars. Basically I want to just see them so I know that work activities are not overlapping with personal ones.
- make my calendars available read-only to specific other people who either are not inside *.ox.ac.uk or whose departments do not use calendaring aspects of nexus
The solution I’ve come up with is an ad-hoc one involving a Mozilla Thunderbird extension called automatic export. Once it is installed and its icon is added to the toolbar, you can select a cyclical export from a dropdown menu on this icon. I have this set to export my calendar every 10 minutes. As long as you export this to a web-accessible location, google calendar can subscribe to it. In addition, I store mine on a remote server, so have a shell script that scp’s it to the correct location every 10 minutes… so at very worst it is 20 minutes out of date. On google you just subscribe to the remote .ics file… though it sometimes takes a while for google to finally realise it is there.
Drawbacks
- The export only works while a copy of Thunderbird that is set to do this is running. So, for example, TB on my laptop is not set to do this, and if I add an appointment with OWA-lite it doesn’t end up in my google calendar until I load up TB at work on Monday.
- It is fairly insecure. The entire calendar is exported as an .ics file that is world readable. While it is in a place that is fairly obscure, security by obscurity isn’t really security.
- I tried having it on a passworded WebDAV storage, but even giving google the username/password in the url, it had problems finding it.
- Private events are shared with those with whom you share the calendar… so they basically see anything you see.
- You need to have a constantly web-accessible location in which to put the calendar, exporting it to your desktop machine isn’t sufficient since google will think it has disappeared when the machine is off. (And we all hibernate our desktops and use OUCS’s Wake-On-Lan service to wake them up when needed… don’t we?)
I don’t know if this will be useful to anyone else… but that is how I export my Thunderbird+lightning+davmail Nexus Calendar to my Google Apps Calendar.
-James
TEI-Comparator
I have just finished my poster for DRHA 2009 which is about the TEI-Comparator that RTS worked on for the Holinshed Project. My poster is available online in PDF and PNG formats. (Though for the record it was created in Inkscape as an SVG file).
The poster discusses the creation of the tool for the Holinshed Project at the University of Oxford. Holinshed’s Chronicles of England, Scotland, and Ireland was the crowning achievement of Tudor historiography and an important historical source for contemporary playwrights and poets. Holinshed’s Chronicles was first printed in 1577 and a second revised and expanded edition followed in 1587. EEBO-TCP had already encoded a version of the 1587 edition, and the Holinshed Project specially commissioned them to create a 1577 edition using the same methodology. The resulting texts were converted to valid TEI P5 XML and used as a base to construct a comparison engine, known as the TEI-Comparator, to assist the editors in understanding the textual differences between the two editions.
Using the TEI-Comparator has several stages. The first was to decide what elements in the two TEI XML files should be compared. In this case the appropriate granularity was at the paragraph (and paragraph-like) level. The project was primarily interested in how portions of text were re-used, replaced, expanded, deleted, and modified from one edition to another. This first stage ran a short preparatory script which added unique namespaced IDs to each relevant element in both the TEI files. It is the proper linking of these two IDs which the TEI-Comparator hoped to facilitate.
The second stage was to prepare a database of initial comparisons between the two texts using a bespoke fuzzy text-comparison n-gram algorithm designed by Arno Mittelbach (the technical lead for the TEI-Comparator). This algorithm, called Shingle Cloud, transforms both input texts (needle and haystack) into sets of n-grams. It matches the haystack’s n-grams against the needle’s and constructs a huge binary string where they match. This binary string is then interpreted by the algorithm to determine whether the needle can be found in the haystack and if so where. The algorithm runs in linear time and, given the language of the originals, was found to work better if the strings of text were regularized (including removal of vowels).
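To give a flavour of the n-gram idea (and only a flavour: this is not Arno’s Shingle Cloud algorithm, just a naive sketch of shingling and marking matches), something along these lines could be written in XSLT2. The function names, the example namespace, the default texts, and the crude vowel-stripping regularisation are all assumptions made up for the illustration.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:f="http://example.org/ns/ngram-sketch"
  exclude-result-prefixes="xs f"
  version="2.0">
  <xsl:output indent="yes"/>
  <xsl:param name="needle" select="'the king of england'"/>
  <xsl:param name="haystack" select="'and then the kinge of englande rode north'"/>
  <xsl:param name="n" select="2" as="xs:integer"/>

  <!-- hypothetical regularisation: lower-case, drop vowels, collapse whitespace -->
  <xsl:function name="f:regularise" as="xs:string">
    <xsl:param name="text" as="xs:string"/>
    <xsl:sequence select="normalize-space(replace(lower-case($text), '[aeiou]', ''))"/>
  </xsl:function>

  <!-- word n-grams of the regularised text -->
  <xsl:function name="f:ngrams" as="xs:string*">
    <xsl:param name="text" as="xs:string"/>
    <xsl:param name="size" as="xs:integer"/>
    <xsl:variable name="tokens" select="tokenize(f:regularise($text), ' ')"/>
    <xsl:sequence select="for $i in 1 to count($tokens) - $size + 1
                          return string-join(subsequence($tokens, $i, $size), ' ')"/>
  </xsl:function>

  <xsl:template match="/" name="main">
    <xsl:variable name="needle-grams" select="f:ngrams($needle, $n)"/>
    <match>
      <!-- one character per haystack n-gram: 1 if it also occurs in the needle, else 0 -->
      <xsl:value-of select="for $g in f:ngrams($haystack, $n)
                            return if ($g = $needle-grams) then '1' else '0'"
                    separator=""/>
    </match>
  </xsl:template>
</xsl:stylesheet>
```

A run of 1s in the resulting string suggests where the needle sits in the haystack; the real algorithm’s interpretation of that string is considerably cleverer than anything shown here.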
The third stage in using the comparator was for the research assistant on the project to confirm, remove, annotate, or create new links between one edition and the other using a custom interface to the TEI-Comparator constructed in Java using the Google Web Toolkit API. The final stage was to produce output from the work put in by the RA through generating two standalone HTML versions of the texts which were linked together based on the now-confirmed IDs.
Shortly the TEI-Comparator will be publicly available on Sourceforge with documentation and examples to make it easy for others to re-purpose this software for other similar uses, and submit bugs and requests for future development.
Although known as the ‘TEI-Comparator’, the program does not require TEI input; it works with XML files of any vocabulary as long as the elements being compared have sufficient unique text in them.
For more information about the TEI-Comparator e-mail: tei@oucs.ox.ac.uk
addingIDs
Rehdon asked me about giving @xml:id attributes to things, so I whipped up this quick XSLT stylesheet. Some people prefer to use generate-id() to get an opaque, unique ID without semantic baggage (though the values are implementation-dependent and not guaranteed to be the same from one run to the next). In many cases, where IDs are exposed to the public, I prefer to use ones which make sense and are human readable.
Warning: there is a distinct flaw in the lack of testing I’ve done before applying the @xml:id. If something other than a <p> element already has xml:id="p5" then it will still add ‘p5’ as an @xml:id to the fifth paragraph. This means that it will produce an XML document that is not valid, since one of the requirements of @xml:id is that it is unique in the document. Also it would number paragraphs in other namespaces as well. (This may be a bug or a feature depending on your outlook.) It numbers from tei:text, so if you don’t have that in your document you should change the ‘from’ attribute on xsl:number.
The XSLT stylesheet takes a parameter ‘e’ to which you can pass the local-name of the element in question. It assumes ‘p’ otherwise, but you could use it to number div, head, w, or really any element just by passing it e=w (or whatever).
Update: Rehdon asked about a configurable optional prefix to the ID and a 4-digit zero-padded number for it. So I changed the script to do that.
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:tei="http://www.tei-c.org/ns/1.0"
xmlns="http://www.tei-c.org/ns/1.0"
exclude-result-prefixes="tei"
version="1.0">
<!-- Parameter to pass to the stylesheet, assumes 'p' if nothing given -->
<xsl:param name="e" select="'p'"/>
<!-- If it exists, a prefix string: include a separator, like 'text1_' to get 'text1_p0005' -->
<xsl:param name="pre"/>
<!-- typical copy-all template -->
<xsl:template match="@*|node()|comment()|processing-instruction()" priority="-1">
<xsl:copy><xsl:apply-templates select="@*|node()|comment()|processing-instruction()"/></xsl:copy>
</xsl:template>
<!-- higher priority one to match elements -->
<xsl:template match="*" >
<xsl:copy>
<!-- If the local-name is the element we've passed it, and there is not an @xml:id attribute -->
<xsl:if test="local-name() = $e and not(@xml:id)">
<!-- make a variable numbering current nodes at any level from tei:text -->
<xsl:variable name="num"><xsl:number level="any" from="tei:text" format="0001"/></xsl:variable>
<!-- Then create an @xml:id attribute with the name and the number concatenated -->
<xsl:attribute name="xml:id"><xsl:value-of select="concat($pre, local-name(), $num)"/></xsl:attribute>
</xsl:if>
<!-- apply any other templates (i.e. copy other stuff) -->
<xsl:apply-templates select="@*|node()|comment()|processing-instruction()"/></xsl:copy>
</xsl:template>
</xsl:stylesheet>
Hope that is useful. I’ll try to remember to add it to the TEI wiki as well.
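The flaw mentioned in the warning could be guarded against by checking each candidate ID against the IDs already in the document before using it. This is an untested-in-anger sketch of one way to do that with xsl:key; the key name ‘existing-ids’ and the decision simply to skip the clash and emit a message are my assumptions, and you may well prefer to pick a different ID instead. It uses format="0001" for the four-digit zero-padded number.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:tei="http://www.tei-c.org/ns/1.0"
  exclude-result-prefixes="tei"
  version="1.0">
  <xsl:param name="e" select="'p'"/>
  <xsl:param name="pre"/>

  <!-- index every element in the input that already carries an @xml:id -->
  <xsl:key name="existing-ids" match="*[@xml:id]" use="@xml:id"/>

  <!-- copy-all template -->
  <xsl:template match="@*|node()" priority="-1">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>

  <xsl:template match="*">
    <xsl:copy>
      <xsl:if test="local-name() = $e and not(@xml:id)">
        <xsl:variable name="num"><xsl:number level="any" from="tei:text" format="0001"/></xsl:variable>
        <xsl:variable name="candidate" select="concat($pre, local-name(), $num)"/>
        <xsl:choose>
          <!-- the candidate is already used somewhere in the input: leave this element alone -->
          <xsl:when test="key('existing-ids', $candidate)">
            <xsl:message>Not adding <xsl:value-of select="$candidate"/>: already in use</xsl:message>
          </xsl:when>
          <xsl:otherwise>
            <xsl:attribute name="xml:id"><xsl:value-of select="$candidate"/></xsl:attribute>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:if>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

Note that this only checks against IDs present in the input, which is enough here because the generated numbers are themselves distinct for a given element name.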
adding word-level markup
Rehdon, snail, and others have occasionally asked me about marking up words inside an element that may itself contain markup (sometimes wrapping more than one word), so I thought I’d write it up.
So for example we might have an XML file that looked like:
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <line>This is a test;</line>
  <line>Only a <seg type="foo">test</seg> ok?</line>
  <line>And <seg>so; is</seg> this as well.</line>
</root>
Let’s say we want to mark up each of the whitespace-separated words, and for some reason the randomly added semi-colons, as words with a <w> element. What we can use is <xsl:analyze-string> and a regex. For example:
<xsl:template match="line//text()">
  <xsl:analyze-string regex="(\w+|;+)" select=".">
    <xsl:matching-substring><w><xsl:value-of select="."/></w></xsl:matching-substring>
    <xsl:non-matching-substring><xsl:value-of select="."/></xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>
In this example we’re matching any text() anywhere inside a <line> element, and if it matches the \w+ regex (or is a run of semicolons) it will get wrapped in a <w> element. If it doesn’t match, then the text that was there gets output unchanged. Because this is line//text() (as opposed to line/text()) it will recurse down into grandchildren elements and further.
So assuming we have a copy-all template something like:
<xsl:template match="@*|node()" priority="-1">
  <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
</xsl:template>
(where we basically copy any nodes and attributes unless something else matches them) then we should get the result:
<root>
  <line><w>This</w> <w>is</w> <w>a</w> <w>test</w><w>;</w></line>
  <line><w>Only</w> <w>a</w> <seg type="foo"><w>test</w></seg> <w>ok</w>?</line>
  <line><w>And</w> <seg><w>so</w><w>;</w> <w>is</w></seg> <w>this</w> <w>as</w> <w>well</w>.</line>
</root>
Of course that is only the beginning, as your documents will probably have weird special cases and punctuation that you want to handle differently. And also it would, of course, be useful to create an @xml:id attribute for each word element.
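On the @xml:id point: inside xsl:matching-substring the context item is a string rather than a node, so xsl:number can’t be used there directly. One way round this (a sketch only, with the mode names ‘wrap’ and ‘number’ and the ‘w1’, ‘w2’ ID pattern being my own choices) is to do the wrapping into a variable first and then number the resulting <w> elements in a second pass:

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <xsl:template match="/">
    <!-- pass 1: wrap the words, holding the result in a temporary tree -->
    <xsl:variable name="wrapped">
      <xsl:apply-templates mode="wrap"/>
    </xsl:variable>
    <!-- pass 2: copy that tree, numbering each w element -->
    <xsl:apply-templates select="$wrapped" mode="number"/>
  </xsl:template>

  <!-- copy-all template shared by both passes -->
  <xsl:template match="@*|node()" mode="wrap number" priority="-1">
    <xsl:copy><xsl:apply-templates select="@*|node()" mode="#current"/></xsl:copy>
  </xsl:template>

  <xsl:template match="line//text()" mode="wrap">
    <xsl:analyze-string regex="(\w+|;+)" select=".">
      <xsl:matching-substring><w><xsl:value-of select="."/></w></xsl:matching-substring>
      <xsl:non-matching-substring><xsl:value-of select="."/></xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

  <!-- number every w in document order, regardless of depth -->
  <xsl:template match="w" mode="number">
    <w>
      <xsl:attribute name="xml:id">w<xsl:number level="any"/></xsl:attribute>
      <xsl:apply-templates select="@*|node()" mode="#current"/>
    </w>
  </xsl:template>
</xsl:stylesheet>
```

If the input already has <w> elements or @xml:id attributes of its own, you would need to take more care than this does.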
-James
Evaluate a string as an XPath
Looking at ways to process a suggested change in TEI P5, I wanted to test that there is a straightforward way to evaluate a string that exists in a document as if it was an XPath you had included in your document.
So say I have a made-up document where I store some xpaths relating to that very document in the document itself as bits of text.
Input
<?xml version="1.0" encoding="UTF-8"?>
<foo>
<paths>
<path>/foo/blort/wibble[1]</path>
<path>/foo/blort/wibble[2]</path>
<path>//*[@xml:id='wibNum2']/splat/@att</path>
</paths>
<blort>
<wibble>test text 1</wibble>
<wibble>Another wibble </wibble>
<wibble xml:id="wibNum2">This is <splat att="value1">a
test</splat></wibble>
</blort>
</foo>
To grab these and evaluate them as XPaths, you need to use an extension in saxon, unfortunately, saxon:evaluate(). For example in this stylesheet:
XSLT
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="2.0" xmlns:saxon="http://saxon.sf.net/"
exclude-result-prefixes="#all">
<xsl:output indent="yes"/>
<xsl:template match="/foo">
<foo>
<xsl:for-each select="paths/path">
<out>
<xsl:value-of select="saxon:evaluate(.)"/>
</out>
</xsl:for-each>
</foo>
</xsl:template>
</xsl:stylesheet>
This should produce the output:
Output
<?xml version="1.0" encoding="UTF-8"?>
<foo>
  <out>test text 1</out>
  <out>Another wibble </out>
  <out>value1</out>
</foo>
This does use the saxon:evaluate(.) extension. There are similar extensions in a variety of other implementations for XSLT1 as well.
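For those with an XSLT 3.0 processor, the standard xsl:evaluate instruction should do the same job without a vendor extension. The following is a sketch of how the stylesheet above might look with it, assuming the same input document; support for xsl:evaluate is an optional feature and varies by processor and edition, so check yours before relying on it. The variable names ‘doc’ and ‘result’ are just mine.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="3.0">
  <xsl:output indent="yes"/>

  <xsl:template match="/foo">
    <!-- the evaluated expression gets no context item unless we supply one -->
    <xsl:variable name="doc" select="/"/>
    <foo>
      <xsl:for-each select="paths/path">
        <out>
          <!-- 'as' keeps the result as a sequence of nodes rather than building a new tree -->
          <xsl:variable name="result" as="item()*">
            <xsl:evaluate xpath="string(.)" context-item="$doc"/>
          </xsl:variable>
          <xsl:value-of select="$result"/>
        </out>
      </xsl:for-each>
    </foo>
  </xsl:template>
</xsl:stylesheet>
```

Here the document node is passed as the context item so that the absolute paths stored in the <path> elements have something to start from.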
-James
XSLT2 collection() with dynamic collections from directory listings
Something I didn’t know about XSLT2’s collection() function. I had previously used it in the form:
<xsl:variable name="files" select="collection('docs.xml')"/>
where docs.xml has a structure of:
<?xml version="1.0"?>
<collection>
<doc href="blort1.xml"/>
<doc href="blort2.xml"/>
</collection>
You can then address, via the variable, the structure of those files blort1 and blort2 and iterate over them etc. e.g. you can do something like:
<xsl:for-each select="$files/tei:TEI/tei:text/tei:div">
  <xsl:apply-templates mode="TOC" select="tei:head"/>
</xsl:for-each>
Ok… I already knew how to do that and have used it to run XSLT on a whole raft of files. To get the docs.xml file I used to run “xmlstarlet ls” and then I have a dir2collection.xsl that transforms its output to the correct format.
However, what I didn’t know is that I didn’t need to bother creating the collection file at all. Saxon can generate the collection file from a parameter on the URI that you hand collection(). That is you can do something like:
<xsl:variable name="files" select="collection('../foo/?select=blor*.xml')"/>
And $files is then addressable in the same way as if you had made a collection document of all the files matching blor*.xml in the directory ../foo/ (and of course you can just do *.xml)
But wait, that’s not all. You can get a bit more complicated about it, pass the path as a parameter, and supply the collection() function extra parameters. So something like:
<xsl:param name="path2collection">../foo/</xsl:param>
<xsl:variable name="path">
<xsl:value-of
select="concat('../',$path2collection,'?select=*.xml;recurse=yes;on-error=warning')"
/>
</xsl:variable>
<xsl:variable name="docs" select="collection($path)"/>
And so $docs contains a recursive collection of all the XML files under the path2collection parameter you give it.
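Putting that together, a complete stylesheet using a directory collection might look roughly like this. It is a sketch assuming Saxon (the ?select, recurse, and on-error query parameters are Saxon’s) and a directory of TEI P5 files; the ‘dir’ parameter and the <toc>, <file>, and <head> output elements are invented for the example.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:tei="http://www.tei-c.org/ns/1.0"
  exclude-result-prefixes="tei"
  version="2.0">
  <xsl:output indent="yes"/>

  <!-- directory to read, relative to this stylesheet -->
  <xsl:param name="dir" select="'../foo/'"/>
  <xsl:variable name="docs"
    select="collection(concat($dir, '?select=*.xml;recurse=yes;on-error=warning'))"/>

  <xsl:template match="/" name="main">
    <toc>
      <!-- one entry per document in the collection, sorted by location -->
      <xsl:for-each select="$docs">
        <xsl:sort select="base-uri(.)"/>
        <file href="{base-uri(.)}">
          <xsl:for-each select="tei:TEI/tei:text//tei:div/tei:head">
            <head><xsl:value-of select="normalize-space(.)"/></head>
          </xsl:for-each>
        </file>
      </xsl:for-each>
    </toc>
  </xsl:template>
</xsl:stylesheet>
```

Since it doesn’t need a source document, it can be run by calling the ‘main’ template by name.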
Isn’t that fun? Ok, maybe only me.