More about @rend

Lou Burnard has provided a technical summary of some of the issues recently discussed concerning @rend, but I thought I might provide some more explanation for those not as familiar with the technical background to the discussion. I would have done so sooner but was driving around too-narrow farm roads in Cornwall on holiday without much reception on my phone. What follows are my own opinions and interpretations of the TEI Guidelines, which are continually evolving based on community consensus.

The @rend attribute

The TEI provides a @rend attribute which indicates an interpretation of how the element in question was rendered or presented in the source text. It has nothing to say about what should be done with the element in any particular output from processing or displaying the TEI text. The assumption that many people make is that processing TEI means outputting HTML designed to help you read the text, but this is certainly not necessarily the case. The TEI text might have any number of outputs: just for reading it might be HTML, ePub, PDF, DOCX, and many more. Moreover, those encoding the texts might not be intending to read them at all, but to process them for other forms of text analysis in any number of formats. While individual projects can provide project documentation on how they intend certain elements to be presented in particular forms of output, other people processing those texts could choose to do something completely different.

@type and @rend values and their whitespace

During a TEI-L discussion concerning why the @type attribute did not allow spaces it was explained that this is because the @type attribute does not contain free text, but a special token that categorises the element in some way. Moreover, the recommended practice is for projects to customise the TEI to constrain the choices available for the value of the @type attribute on some elements and document in their customisation exactly what those special tokens mean. The values of @type have the datatype data.enumerated, which means that they are “expressed as a single XML name taken from a list of documented possibilities”. That means that this value has to obey the rules of what it means to be an XML name, and it should be from a set list that the project has documented (preferably in its TEI customisation, but possibly just in prose documentation preserved with the TEI file). Most elements that have a @type attribute get it from claiming membership in the att.typed attribute class, and if a secondary type classification is allowed they also get @subtype.

The discussion moved on (possibly because I referenced my earlier post on @rend) to how the @rend attribute differs, and to the use of CSS inside it. With the @rend attribute the situation is slightly more confusing. It allows one to infinity occurrences of the datatype data.word. A data.word datatype “defines the range of attribute values expressed as a single word or token.” As I’ve discussed elsewhere, this means if someone marks up a text using:

<hi rend="It looks a bit like that other one">text</hi>

This actually has 8 tokens: “It”, “looks”, “a”, “bit”, “like”, “that”, “other”, “one”. The point is that the whitespace between these words in the attribute makes each of them a separate value or token, not part of a phrase. The encoder might just have written:

<hi rend="big bold beautiful">text</hi>

or indeed

<hi rend="largeStyle42">text</hi>

The data.word datatype says that “Attributes using this datatype must contain a single ‘word’ which contains only letters, digits, punctuation characters, or symbols: thus it cannot include whitespace.”
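
To make the tokenisation concrete, here is a minimal sketch in Python of how a processor might split a @rend value. The function name rend_tokens is my own invention for illustration and is not part of any TEI tool; the only assumption is that tokens are separated by XML whitespace (space, tab, carriage return, line feed).

```python
import re

def rend_tokens(value: str) -> list[str]:
    """Split an attribute value into its whitespace-separated tokens."""
    # Match runs of characters that are not XML whitespace
    # (space, tab, carriage return, line feed).
    return re.findall(r"[^ \t\r\n]+", value)

phrase = rend_tokens("It looks a bit like that other one")
print(phrase)       # eight separate tokens, not one phrase
print(len(phrase))
print(rend_tokens("largeStyle42"))
```

However the encoder thought of the value, a processor working this way sees only a list of discrete tokens.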

Some encoders believe that the TEI should reverse its decision on free text in attributes and allow @rend to contain “It looks like that other one” and this not to be a set of discrete tokens. Personally, I disagree and feel that would be a retrograde step.

@rend values and their order

Other than defining it as a set of data.word occurrences the TEI does not dictate what the @rend values should look like. In my opinion it would be wrong if the TEI tried to codify all the possible rendition values that appear in every sort of text. Moreover, describing the way something appears in a text is always an interpretative process and two separate encoders looking at the same text, or looking at it for different reasons, might perceive it in very different ways. In fact the Guidelines explicitly say:

“These Guidelines make no binding recommendations for the values of the rend attribute; the characteristics of visual presentation vary too much from text to text and the decision to record or ignore individual characteristics varies too much from project to project.” (http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.global.html)

Some encoders believe that it is a shame that the TEI has not defined a syntax by which they should specify the @rend attribute values. I disagree because I feel that the greatest flexibility should be given to projects and sub-communities to customise and constrain such values for themselves. It could be argued that the TEI has indeed provided a syntax, but in a very general way: these are whitespace-separated tokens containing only letters, digits, punctuation characters or symbols. The point is that these are intended as magic tokens whose meaning individual projects can decide (and document) for their own use. If I put in the magic token ‘bold’ it might mean something different in my project than it means in yours.

It came out in the TEI-L discussion that some encoders believe that the order of @rend values provided should be important, as if they are making a phrase. Others tend to put the most important rendition classification first, and still others always provide different types of classification in the same order. I find these all prone to human inconsistency and so I choose to believe that they are an unordered set of values that could be entered in any order. i.e. that:

<hi rend="big bold beautiful">text</hi>

should be understood to be semantically equivalent to:

<hi rend="beautiful big bold">text</hi>

My beliefs here are, perhaps unduly, influenced by long and painful experience in processing hand-encoded texts (which also influences my beliefs on the value of automatic and semi-automatic up-converting markup). In my encoding projects I recommend that no special significance be granted based on the order of the tokens present in the @rend value. The TEI, I think sensibly, allows individual projects to do what they want but does specify that these are individual tokens.
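
As a small illustration of treating @rend as an unordered set, the following Python sketch compares two values without regard to token order. The helper name same_rend is hypothetical, and this reflects my own recommended reading rather than anything the Guidelines require. Note that a set also collapses repeated tokens, which a project might or might not want.

```python
def same_rend(a: str, b: str) -> bool:
    """True when two @rend values hold the same tokens, ignoring order."""
    # frozenset discards both the order and any duplicate tokens.
    return frozenset(a.split()) == frozenset(b.split())

print(same_rend("big bold beautiful", "beautiful big bold"))
print(same_rend("big bold", "big"))
```

A project that does give meaning to token order would need a list comparison instead, and would need to document that decision.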

Some projects decide to put various standard presentation-description formats, e.g. Cascading Style Sheets, into the @rend attribute. I personally feel that this is misguided and sloppy. Partly this is because I suspect that some of them are actually encoding for a particular output format (rather than documenting what the original source looked like) and this is the wrong place to store this information. Partly this is because such presentation-description formats often use significant whitespace (which then means an abuse of the data.word datatype). And partly this is because I feel there is a better and easier way to do this more consistently using the @rendition attribute.

@rendition and <rendition> really aren’t extreme

As with many other things in the TEI, the Guidelines provide a simple use-case (@rend’s magic tokens) and a more complex system (@rendition). The @rendition attribute allows you to point to a <rendition> element up in the header where you can use any form of free text to describe how this was rendered in the original source. This means that instead of putting a set of magic tokens or classifications like “largeStyle42” an encoder can completely transparently point to a fuller description using the standard URI fragment pointing mechanism that is common throughout the TEI recommendations. Thus instead of writing:

<hi rend="largeStyle42">text</hi>

and having it documented somewhere what this meant, the encoder can point to a <rendition> element by its @xml:id attribute and have a fuller description there. For example this could be:

<hi rendition="#largeStyle42">text</hi>

and while that doesn’t look much different, the URI fragment ‘#largeStyle42’ points to a place inside the TEI file’s <teiHeader> (specifically inside the <tagsDecl> element) where there is a better description:

<rendition scheme="free" xml:id="largeStyle42">This text is really big, bold, and beautiful</rendition>

Okay, admittedly that might not be a very useful description. But the point with the ‘free’ scheme is that it is free text. It can be any prose, in any language, and any way of describing it. The @scheme attribute also allows for ‘css’ for those people wishing to use cascading stylesheet language, ‘xslfo’ for those wanting to use extensible stylesheet language formatting objects, and ‘other’ for those using another rendition description language. So ‘#largeStyle42’ could point to something using CSS that looked like:

<rendition scheme="css" xml:id="largeStyle42">
font-weight:bold;
font-size: 75pt;
font-family:"brushstroke", fantasy;
color:#002147;
</rendition>

If a more precise description (in whatever language) is able to be provided for ‘largeStyle42’, then this can be changed at a later date. Equally this could be broken up into multiple <rendition> elements and you can have:

<rendition scheme="css" xml:id="bold">font-weight:bold;</rendition>
<rendition scheme="css" xml:id="big">font-size:75pt;</rendition>
<rendition scheme="css" xml:id="beautiful">font-family:"brushstroke", fantasy;</rendition>
<rendition scheme="css" xml:id="oxBlue">color:#002147;</rendition>

and in the text:

<hi rendition="#big #oxBlue #bold #beautiful">text</hi>

Moreover, because @rendition is one of the TEI’s many pointing attributes it does not need to point to a <rendition> element in the very same file! Instead a project could centralise all their rendition information in a single place. So that might look like:

<hi rendition="renditionFile.xml#largeStyle42">text</hi>

or indeed

<hi rendition="http://www.example.com/renditionFile.xml#largeStyle42">text</hi>
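
To show what following these pointers might involve in processing, here is a minimal Python sketch using only the standard library. It is an illustration under stated assumptions, not a recommended implementation: the function name resolve_renditions is hypothetical, the sample document is trimmed down and would not validate as full TEI, and pointers into other files are detected but deliberately left unresolved.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Trimmed-down sample: not a schema-valid TEI document.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><encodingDesc><tagsDecl>
    <rendition scheme="css" xml:id="bold">font-weight:bold;</rendition>
    <rendition scheme="css" xml:id="big">font-size:75pt;</rendition>
  </tagsDecl></encodingDesc></teiHeader>
  <text><body><p>
    <hi rendition="#big #bold renditionFile.xml#largeStyle42">text</hi>
  </p></body></text>
</TEI>"""

def resolve_renditions(root, element):
    """Return (pointer, description) pairs for one element's @rendition."""
    # Index every <rendition> in this file by its xml:id.
    local = {r.get(XML_ID): (r.text or "").strip()
             for r in root.iter(f"{{{TEI_NS}}}rendition")}
    results = []
    for pointer in element.get("rendition", "").split():
        document, fragment = urldefrag(pointer)
        if document:
            # Points into another file: left unresolved in this sketch.
            results.append((pointer, None))
        else:
            results.append((pointer, local.get(fragment)))
    return results

root = ET.fromstring(SAMPLE)
hi = root.find(f".//{{{TEI_NS}}}hi")
for pointer, description in resolve_renditions(root, hi):
    print(pointer, "->", description)
```

A real project would also fetch and parse the external file, and would probably want to check the @scheme of each <rendition> before interpreting its content.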

Some encoders feel that pointing to a <rendition> element is a lot harder than just sticking some tokens into the @rend attribute. Others argue that as part of the process of hand encoding users should be able to add whatever they want to @rend, and for this to be valid because rationalising these in advance is more difficult than doing so afterwards. Or indeed that it is more convenient to encode unusual variants ‘in-line’ rather than pointing back to the header. Both of these are good points, and have some truth to them. In the first case, it depends on the level of specification needed. Most encoders in my experience use very general and imprecise @rend categorisations. That is, they could have a rend value of ‘big72pt’ but they tend to just use ‘big’ (or small/medium/large/x-large).

How much time and energy one wants to spend worrying about specifying @rend and/or @rendition values depends on how important it is to your project that this information is documented, and documented in a consistent manner. If you just want to record whether something is in one of a handful of different colours, sizes, or styles, then you probably just want to agree a project specification of @rend values (and what they mean) for your TEI customisation.

Other @rend issues

Some encoders believe that there is no formal way of indicating what syntax you have used for your @rend values. I disagree because I believe these are magic tokens which are most properly documented in the TEI customisation. This enables an encoder to give a free text description for every magic token used in @rend attribute values, and moreover if they wish it enables a project to constrain it to be just this set of values. If a project is using a specified syntax inside their @rend attribute values (so-called ‘rendition ladders’ are one such format) then this should be documented inside the <encodingDesc>, perhaps in prose or perhaps the TEI will add a mechanism in response to the TEI-L discussion which enables categorisation and description of the taxonomy of @rend attribute values.

Changing @rend

My arguments here are based on my own views and understanding of the current (P5 2.0.2) version of the TEI Guidelines. However, these are subject to change (both my views and the Guidelines). I’ve often been told that the TEI recommendations seem like dictates coming down from on high saying “do it this way”, but that is really not how I view the TEI Guidelines or the community that creates them. The TEI is an open source project which takes bug reports and feature requests from anyone and everyone. This can be from someone encoding their very first TEI document, reading the Guidelines for the first time, or it can be from those with a long history of experience with the TEI. Each and every bug and feature request should be considered on its own merits by the TEI Technical Council elected by the TEI community. [Note: there is scope for electoral reform, but this is a very different topic.] The recommendations of the TEI are not a fixed quantity but an evolving record of the concerns and experience of the community that produces them. In many ways hearing what users new to the TEI have difficulty with, or where they find the Guidelines confusing, is more valuable in the long run than some of the more arcane technical discussions.

Posted in TEI, XML | 1 Comment

Self Study: Introducing XML and Markup

I’m occasionally asked what people should read and do if they want to teach themselves TEI P5 XML. Where should they start? This depends, obviously, on what time and resources they have. I tend to recommend directed intensive training such as the Digital.Humanities@Oxford Summer School as a good way to get an introduction to such topics.

However, some people are unable to participate in such training and prefer self-directed learning. What should they do? There are lots of resources online such as TEI By Example and the TEI Guidelines. Where to start?

When people are taking an Introduction to TEI workshop I usually introduce markup but move onto TEI and XML very quickly because in such intensive workshops time is limited. Instead, when people are undertaking self-directed learning I think they should use the time they have to learn more about HTML and then XML before starting to learn about the TEI vocabulary of XML itself.

There is so much reading that is possible to suggest for an initial exploration of XML and Markup.  I would suggest at least looking at:

as a good start.

If I were to suggest a series of assignments someone might undertake based on this reading it would be to do the following, writing up answers to the questions.

  1. Read the W3Schools HTML basic section and XHTML section, do the HTML and XHTML quizzes
  2. Read the W3Schools XML basic section and XML Namespaces page, do the XML quiz
  3. Read the TEI Guidelines Gentle Introduction to XML; and the Wikipedia article on XML.
  4. How does XML differ from HTML? Why might it be more powerful to describe what some piece of data is, rather than say how it should be presented?
  5. Download and install the oXygen XML editor (you can get a 1 month free trial license, otherwise it costs $64 USD)
  6. Choose a very short (1 page) sample of a document you are interested in.
  7. Create a list of the overall structural aspects you feel define this sort of document. Create a list of any data-like entries (like names or dates) in the document. Create a list of presentational aspects of the document that you think important to record.
  8. Funding challenge part 1: Hypothetically, imagine you had funding to mark up several thousand pages of this material. Look at the list of aspects you would like to record. Why is each one important? What benefit does recording each of these things give those wanting to use or understand the text (or culture from which it originates)? Which would you choose to mark up? How consistently can you mark up this feature? Such document analysis should be done long before any project starts (or asks for funding).
  9. Funding challenge part 2: An uncaring government has slashed its funding for higher education research projects and has reduced your project’s funding by 50%! What would you do? Will you mark up only 50% of the material? If so, how do you decide which parts? Will you only mark up certain aspects? If so, which ones and why?
  10. Using the ‘Text’ (code view) mode of the oXygen XML editor create a well-formed XML file of your sample document with elements and attributes that you have invented yourself. What difficulties do you encounter doing this?
  11. Why might it be better for communities of users to agree on elements, what they mean, and how they should be used?
  12. What are the central ideas of Michael Wesch’s YouTube video? How do they relate to the nature of XML and how it is used?
  13. Read the Wikipedia article on RSS, and find an RSS feed to subscribe to in Google Reader to see its application.
  14. Does order really matter in an XML document?  What is the difference between:

    <list><item n="1">item 1</item><item n="2">item number 2</item></list>  and
    <list><item n="2">item number 2</item><item n="1">item 1</item></list>

    And how much difference does this make when viewing XML as a data storage format rather than a presentational one?

  15. Join the TEI-L mailing list and start lurking.

This certainly isn’t exhaustive, but with a bit of support, I suggest someone undertaking this would be much better placed to start learning about TEI P5 XML from the online sources available.


@rend and the war on text-bearing attributes

When discussing the TEI attribute @rend from att.global, I recently explained to John, Paul, George, or Ringo (can’t remember which) that although it seems to let you type just about anything in it, it doesn’t actually allow anything more than a set of single tokens. It really doesn’t mean that spaces are allowed, simply that whitespace is the delimiter in the attribute value.

The definition of @rend is “(rendition) indicates how the element in question was rendered or presented in the source text.” but it is very often used by some encoders to signal to processing how you want the output to appear.  In the remarks on the values allowed for the attribute it says:

may contain any number of tokens, each of which may contain letters, punctuation marks, or symbols, but not word-separating characters.

The point here is the ‘word-separating characters’ part. So although you can say <hi rend="It looks a bit like that other one">text</hi>, this actually has 8 tokens: “It”, “looks”, “a”, “bit”, “like”, “that”, “other”, “one”. Sometimes people stick CSS or CSS-like rendition information into @rend and so have values like “text-align: right”. I would say this is wrong, or at least that it says there are two classifications applicable to its rendition in the source material, one that it is “text-align:” and another that it is “right”. Of course they could solve this just by deleting the space: “text-align:right” would be better, or even “text-align:right; font-size:large;” if you wanted to add another token. However, even better would be to use @rendition to point to at least one @xml:id of a <rendition> element in the header. This allows you to specify exactly what scheme you are using (e.g. CSS) and to give multiple statements for one classification.

Why does this matter you might ask? Well, of course, it doesn’t really — they are all magic tokens of one sort or the other to be interpreted (or not) by your processing for whatever reason you are undertaking the encoding. The <rendition> method is the most detailed in documenting precisely how you are interpreting the rendition in the original document.

However, the reason it matters to me is that there are NO attributes in the TEI which allow free-text.

By that I mean that all attributes are assigned to one datatype or another, and in none of them can you just type sentences of prose and have it be semantically meaningful.  This is as a result of the long War on Text-Bearing Attributes that was undertaken in the run-up to the first release of TEI P5. This took as one of its many principles that because any bit of free text might have a need to use a non-Unicode character, and that the TEI’s method for documenting non-Unicode characters was to use its <g> element, that you couldn’t have free-text attributes because you can’t use an element inside an attribute value. This is the reason for the creation of many new child elements like <desc> which are intended to contain free text concerning the nature of the element that contains them.

In the case of the @rend attribute it allows one to infinity of the data.word datatype. This datatype, even in P5 1.0.0, “defines the range of attribute values expressed as a single word or token.” Thus when people put space-separated characters into it, they are really putting in multiple tokens. The War on Text-Bearing Attributes attempted to limit the places where people were able to do this by the use of datatypes and the removal of free text in attribute values.

This helps to highlight the difference between syntactic and semantic validity. Just because your document validates against a schema, does not mean that it is semantically valid.  You can put the text of a title inside an <author> element and vice-versa and there is no way your schema can know that you have done this.

So really, I’ve posted this post so I can point to it later when people ask me about spaces in @rend and similar datatype kerfuffles.


Is it Bill or Ben that is speaking of flowerpot men?

A friend asked a question about how to encode a dramatic speech that possibly should be considered two speeches. Owing to a printing mistake, the second speaker’s name was omitted, so some consider it a single speech by the first speaker. However, a later hand has added the second speaker’s name in the margin after the fact, so some may wish to understand it as two speeches. The question was how do you encode these two possibilities simultaneously. Of course an entire stand-off solution is possible where you just mark the words and simultaneously mark word 1 to 20 as belonging to one speaker and the other speaker. But ignoring that more complicated solution here is some of the thinking I went through.

Let’s say we have some play, where Bill has two paragraphs. In the first he says “Bill and Ben, Bill and Ben,” and in the second he says “Bill and Ben, Bill and Ben, flowerpot men”. In TEI we might encode this as:

<!-- bill is speaker -->
<sp who="#bill">
   <!-- #bill points to more information about this speaker somewhere else in the document -->
  <speaker>Bill</speaker>
   <p>Bill and Ben, Bill and Ben,</p>
   <p>Bill and Ben, Bill and Ben, flowerpot men</p>
</sp>

Now let’s say that the speaker marker ‘Bill’ was there and it had been crossed out by a later hand and replaced by ‘Ben’. We could indicate who we thought the real speaker was with the @who attribute whilst still retaining the orthographic distinction that a substitution had been made inside the <speaker> element.

<!-- Ben is speaker but a substitution noted-->
<sp who="#ben">
 <speaker>
    <subst>
      <del>Bill</del>
      <add>Ben</add>
    </subst>
  </speaker>
  <p>Bill and Ben, Bill and Ben,</p>
  <p>Bill and Ben, Bill and Ben, flowerpot men</p>
</sp>

But this means we have to make the editorial decision, for all outputs, that one of them (here ‘Ben’) is the speaker. Another similar type of occurrence might be when Bill and Ben both say the paragraphs at the same time. In this case, we just note both of them as speakers:

<!-- bill and ben are both simultaneously speakers-->
<sp who="#bill #ben">
  <speaker>Bill</speaker>
  <p>Bill and Ben, Bill and Ben,</p>
  <p>Bill and Ben, Bill and Ben, flowerpot men</p>
</sp>

Similar to this is the case where the entire speech is spoken by either Bill or Ben, but the text just says Bill. In this case one solution (of a number of them) is not to point to a <person> element but instead to point to a <listPerson> identified as ‘billOrBen’. Then in processing we can choose to assign this to one or the other, even though the text still says ‘Bill’. We’ve documented that we can only have one of them by using the @exclude attribute on each <person> element to point to the other.

<!-- billOrBen listPerson is speaker, but contents are mutually exclusive, so sort out in processing -->
<sp who="#billOrBen">
  <speaker>Bill</speaker>
  <p>Bill and Ben, Bill and Ben,</p>
  <p>Bill and Ben, Bill and Ben, flowerpot men</p>
</sp>

<listPerson xml:id="billOrBen">
  <person xml:id="bill" exclude="#ben"><persName>Bill</persName></person>
  <person xml:id="ben" exclude="#bill"><persName>Ben</persName></person>
</listPerson>
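
As a rough sketch of the ‘sort it out in processing’ step, the Python below follows a @who pointer and lists the candidate speakers it leads to. The function name candidate_speakers is hypothetical, and for brevity the sample omits the TEI namespace and wraps the fragments in an invented <div>; which candidate to choose is left to the project.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Namespace omitted and a wrapper <div> added purely to keep the sample short.
SAMPLE = """<div>
  <sp who="#billOrBen">
    <speaker>Bill</speaker>
    <p>Bill and Ben, Bill and Ben,</p>
  </sp>
  <listPerson xml:id="billOrBen">
    <person xml:id="bill" exclude="#ben"><persName>Bill</persName></person>
    <person xml:id="ben" exclude="#bill"><persName>Ben</persName></person>
  </listPerson>
</div>"""

def candidate_speakers(root, sp):
    """Names of every person a speech's @who could refer to."""
    by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}
    names = []
    for pointer in sp.get("who", "").split():
        target = by_id.get(pointer.lstrip("#"))
        if target is None:
            continue  # dangling pointer: ignored in this sketch
        # A <person> is one candidate; a <listPerson> offers each child <person>.
        persons = [target] if target.tag == "person" else target.findall("person")
        names.extend(p.findtext("persName") for p in persons)
    return names

root = ET.fromstring(SAMPLE)
print(candidate_speakers(root, root.find("sp")))
```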

But in the case that I was asked about the speaker’s name is added partway through a speech. Now, one way to deal with this is just to say that ‘Bill’ is the speaker, and the name ‘Ben’ is just an addition in the text. There is nothing wrong with this: you’re just documenting the original printing and the addition of the new name, but not changing the structure of the text.

<!-- bill is speaker but addition of  name partway through noted -->
<sp who="#bill">
  <speaker>Bill</speaker>
  <p>Bill and Ben, Bill and Ben,</p>
  <p><add place="left"><name>Ben</name></add>Bill and Ben, Bill and Ben, flowerpot men</p>
</sp>

The other option, of course, would be to understand the intellectual content of the addition as splitting the two speeches, and encode not the original printed work, but the final version with the editorial addition provided by a later hand. (So this would just be ).

<!-- two separate speeches: bill speaks the first, ben speaks the second -->
<sp who="#bill">
  <speaker>Bill</speaker>
  <p>Bill and Ben, Bill and Ben,</p>
</sp>

<sp who="#ben">
   <speaker rend="left">Ben</speaker>
   <p>Bill and Ben, Bill and Ben, flowerpot men</p>
</sp>

But that isn’t really what was asked for… this says that there are two speeches, and while they want to have this as a possibility, they also want to record that it is possible that the ‘flowerpot men’ paragraph was actually said by ‘Bill’ and this ‘Ben’ in the margin is just an addition. One way to do this is to use the @exclude attribute again and to do so at slightly different levels of granularity.

<!-- bill speaks first bit, and possibly second bit, but possibly ben speaks second bit -->
<sp who="#bill">
  <speaker>Bill</speaker>
  <p>Bill and Ben, Bill and Ben,</p>
  <p exclude="#benPara" xml:id="billPara">
    <add place="left"><name>Ben</name></add>
     Bill and Ben, Bill and Ben, flowerpot men</p>
</sp>

<sp who="#ben" exclude="#billPara" xml:id="benPara">
  <speaker rend="left">Ben</speaker>
  <p>Bill and Ben, Bill and Ben, flowerpot men</p>
</sp>

In this case we’re saying that the second paragraph of Bill’s speech is mutually exclusive with the whole speech by Ben. In processing for any particular output we need to decide how to handle this: do we have the speech by Bill (which has the addition of a name to the left of the second paragraph), or do we have the speech by Bill consisting of only the first paragraph, and a speech by Ben?
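
One way a processor might act on that decision is sketched below in Python: given the xml:id of the reading to keep, it removes every element whose @exclude points at it. This is only an illustration under assumptions. The function name choose is hypothetical, the TEI namespace is omitted, a wrapper <div> is invented, and the sample gives the second <sp> an xml:id of ‘benPara’ so that the pointer from the paragraph has something to resolve to.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<div>
  <sp who="#bill">
    <speaker>Bill</speaker>
    <p>Bill and Ben, Bill and Ben,</p>
    <p exclude="#benPara" xml:id="billPara"><add place="left"><name>Ben</name></add>
      Bill and Ben, Bill and Ben, flowerpot men</p>
  </sp>
  <sp who="#ben" exclude="#billPara" xml:id="benPara">
    <speaker rend="left">Ben</speaker>
    <p>Bill and Ben, Bill and Ben, flowerpot men</p>
  </sp>
</div>"""

def choose(root, keep_id):
    """Remove every element whose @exclude points at the kept element."""
    # ElementTree has no parent pointers, so build a child-to-parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    for el in list(root.iter()):
        if "#" + keep_id in el.get("exclude", "").split():
            # Note: remove() also discards any text following the element.
            parents[el].remove(el)
    return root

# Reading 1: keep Bill's second paragraph, so Ben's speech is dropped.
one_speech = choose(ET.fromstring(SAMPLE), "billPara")
# Reading 2: keep Ben's speech, so Bill's second paragraph is dropped.
two_speeches = choose(ET.fromstring(SAMPLE), "benPara")
print(len(one_speech.findall("sp")), len(two_speeches.findall("sp")))
```

The same choice could equally be made in XSLT; the point is only that the encoding records both readings and the processing picks one.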

Another way to do this is to use the <alt> element to record this elsewhere. In this case you just need to make sure there are proper @xml:id attributes on all the elements you want to point to, so here ‘billPara2’ is the second paragraph of Bill’s speech, and ‘benPara2’ is the whole of Ben’s speech. We then use the <alt> element to say that these two IDs are mutually exclusive, and specifically that we think it 70% likely that ‘billPara2’ is the correct one to choose and only 30% that ‘benPara2’ should be the correct choice.

<!-- bill speaks first bit, and possibly second bit, but (less) possibly ben speaks second bit  stand-off alternation-->
<sp who="#bill">
  <speaker>Bill</speaker>
  <p>Bill and Ben, Bill and Ben,</p>
  <p xml:id="billPara2"><add place="left">Ben</add>Bill and Ben, Bill and Ben, flowerpot men</p>
</sp>

<sp who="#ben" xml:id="benPara2">
  <speaker rend="left">Ben</speaker>
  <p>Bill and Ben, Bill and Ben, flowerpot men</p>
</sp>

<alt mode="excl" targets="#billPara2 #benPara2" weights="0.7 0.3"/>

It is important to note that all of this is just a way to document whichever interpretation the encoder wishes to record. I’m not aware of any off-the-shelf processing which will do anything with @exclude or <alt> elements; however, I can picture that doing this in XSLT would not necessarily be too onerous, depending on the circumstances in which it is used.
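
To give a flavour of how little processing this needs, here is a hedged Python sketch (rather than XSLT) that reads an <alt> element and reports its highest-weighted target. The function name preferred_target is hypothetical, and the attribute names @targets and @weights are simply those used in the <alt> example in this post.

```python
import xml.etree.ElementTree as ET

def preferred_target(alt):
    """Return the id (without '#') of the highest-weighted target of an <alt>."""
    targets = alt.get("targets", "").split()
    weights = [float(w) for w in alt.get("weights", "").split()]
    if not targets or len(targets) != len(weights):
        raise ValueError("targets and weights must pair up one-to-one")
    # Pair each pointer with its weight and keep the heaviest.
    best, _ = max(zip(targets, weights), key=lambda pair: pair[1])
    return best.lstrip("#")

alt = ET.fromstring(
    '<alt mode="excl" targets="#billPara2 #benPara2" weights="0.7 0.3"/>'
)
print(preferred_target(alt))
```

A processor could then keep the preferred element and drop the other, or present both readings with their weights to the reader.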

Oh, and obviously the original enquiry did not use a play based on the Bill and Ben theme song, but a much more famous Renaissance poet and playwright.


TEI P4 Support, Survey Results

Introduction

This post contains the results of a survey that  collected information which the TEI Technical Council will use to assess the need for ongoing support for the TEI P4 version of its Guidelines. These have largely been replaced by the TEI P5 Guidelines since November 2007. At that point it was promised that support would continue for TEI P4 for 5 years, until November 2012. As that is just over a year away we are starting a slow process of phasing out support for the TEI P4 Guidelines. The TEI Technical Council is planning to de-emphasize the appearance of TEI P4 as an offering since support for it will be ending in November 2012. We will continue to support it over the next year but may take steps to stop it being indexed by search engines or make it less prominent on the website. These are the results of this survey, which I’ve also transformed to TEI P5 XML at http://users.ox.ac.uk/~jamesc/SurveySummary.tei.xml.

1. Are you involved with projects that are still using TEI P4?

Answers for Question 1

My reading of these results is that many people are either not using TEI P4, or planning to migrate it to TEI P5. I suspect, given the other answers that those with TEI P4 projects probably do not rely on a lot of support from the TEI Consortium.

2. How important is ongoing TEI P4 support to you?

Answers to question 2

This seems fairly clear: out of 54 respondents 44 said it was not important, unnecessary or that we should get rid of it. But that it is important or very important for 18.5% of respondents is still significant and must be remembered when making decisions concerning ongoing support for TEI P4.

3. How much should the TEI Consortium begin to de-emphasize TEI P4 on its website before November 2012?

Answers to question 3

There seems to be a strong vote for making TEI P4 available only from the TEI Vault and making sure existing links redirect.

4. Should search engines be dissuaded from indexing TEI P4 materials?

Answers to question 4

This result is less clear cut with some people feeling it shouldn’t be indexed, and some people thinking it should be (with slightly more weight on it being indexed than not indexed).

5. Approximately how many TEI P4 projects have you been involved with?

Answers to question 5

This is simply a statistical question (and of course depends how the respondent interprets ‘projects’). It is interesting that the majority of people seem to be involved with more than one project, but that is hardly unexpected. More were involved with 6-15 projects than I thought.

6. Approximately how many TEI P5 projects have you been involved with?

Answers to question 6

It is interesting that the percentages are vaguely the same as with TEI P4 projects, though slightly higher overall.

7. What amount of TEI P4 data do your projects have? (In documents, number of files, how many megabytes, or whatever convenient measure makes sense for your project)

This was a textual question, attempting to get a measure of how much TEI P4 stuff people have. It was deliberately left vague as to how it should be expressed, partly because I was interested to see how people would quantify their TEI P4 data, and partly because I recognise that it would be difficult for everyone to provide the same form of measurement. I was interested to see that the answers ranged more widely than I had expected.

  • 0
  • none
  • zero
  • Several hundred files.
  • I have about 500 texts
  • 3,200 files, 170Mb.
  • nil
  • Very roughly: 60,000 books = 5 million pages = 10 GB of marked-up text.
  • 40 megabytes in the one P4 project I still manage; a bunch more in ones I’m no longer involved in.
  • This varies a lot, but projects range from 3-150 MB In practice, the TEI files are a small part of the overall operation, which includes authority information usually in non-TEI format, and various generated TEI XML files used for web publication only
  • 50 files
  • Appx. 7000 files, 29 MB total data
  • Appr. 6500 documents (mostly letters)
  • 0
  • less than 10%
  • 0
  • about 3,000 XML files currently in P4.
  • in summa: about 4 Mb
  • All of the [Institution]’s projects are in migration from p4 to p5, so this is a snapshot of the migration process. The data is migrated, but the sites are not all rewritten yet. My hope is that by May of 2012, all of the current [Institution] sites will be serving out texts based on p5.
  • 0
  • Help files used by about 1000 Modes users.
  • 5 text-critical editions
  • 7000+ [P4 Customization] encoded letters
  • Main current project: several dozen megabytes including a few large files but mostly 10-20 kb: roughly 3000 files.
  • Roughly twelve published electronic editions, with at least a dozen more in the pipeline, in process of being finished (though they now have to be migrated to be published).
  • I have no clue, but it’s a lot.
  • The [Institution] has 113MB bytes of P4 documents, of archival interest only.
  • None, since we upgraded.
  • I’m not sure. I think I might have one project that is in TEI P4, but it’s a legacy project and I’m actually not positive. I haven’t looked at it in a while.
  • 2.5 million text pages
  • zero
  • None
  • Between 300 and 600 files.
  • ca. 70 files
  • dozens of documents.
  • Lots. Can’t access the figures quickly.
  • 700MB

This ranges from zero to multiple gigabytes of TEI text. What I should have asked was “And is all the TEI freely available for download?” as, of course, that is something I’d like to encourage.

8. Please list the URLs of any TEI P4 projects you want us to know about.

I’ve decided not to provide these in this summary. If projects wish to provide samples they should add them to http://wiki.tei-c.org/index.php/Samples and/or describe their projects on the wiki.

9. Please list the URLs of any TEI P5 projects you want us to know about.

I’ve decided not to provide these in this summary. If projects wish to provide samples they should add them to http://wiki.tei-c.org/index.php/Samples and/or describe their projects on the wiki.

10. Have you submitted a Bug or Feature Request to the TEI Technical Council?

Answers to question 10

Lots of people have provided bug or feature requests, but most people have either only contributed to discussion or not submitted any. We should, of course, strive to increase feedback from the TEI community. I’d be interested in any ideas on how to make it easier for the community to participate.

11. Where do you think the TEI Technical Council should expend its time and effort?

Answers to question 11

This is also an interesting result. Scoring highest on ‘top priority’ is the idea that the TEI Technical Council should spend its time fixing bugs and implementing feature requests by the community. Analysing where the TEI Guidelines could be improved and undertaking these improvements was also ranked highly, along with developing the infrastructural basis for future versions of the TEI Guidelines. What scored lower was the idea of the TEI Technical Council setting up a repository of TEI texts, or developing software to make publication of TEI texts easier. I would suspect that this is because maintaining the Guidelines is the central mandate of the TEI Technical Council, and looking for how they can be improved is related to that, while the creation of repositories is already done better by people who already focus on those activities. Although it is a community-based activity, only the TEI is really in charge of maintaining the Guidelines, whereas any third party can develop software or archives. We should certainly encourage those activities and implement community suggestions which facilitate the greater development of community software.

12. Any other comments?

Here are the comments that I received (lightly edited), with my personal responses:
For people with large repositories of transcriptions (where the text content will never be updated), markup stability is essential. P4 to P5 is not essential but recommended, but it’s going to mean a huge effort. My worry is that there will be a far too rapid succession to P6, P7, P8, etc which adds bells and whistles but does not contribute anything meaningful to static repositories.
There is not necessarily any reason to migrate if your systems are set up and working fine with P4. I would, personally, recommend using P5 in any new project.  And then you probably reach a state where it is easier to migrate the P4 to P5 than support multiple systems, but different people’s experiences will vary.  The Birnbaum Doctrine suggested that the TEI Council should only move to new major versions (P6 etc.) when a large external technological change meant that it would be beneficial (e.g. SGML to XML) or a large internal infrastructural change (e.g. development of the P5 class system) was deemed significantly beneficial. I personally do not believe that we are at a juncture which would necessitate development of P6, rather I’d prefer to see P5 2.5, P5 4.5, P5 35.2, etc. than have people feel they need to move major versions.  This has its own challenges, of course, and your project in its TEI ODD can point to the very specific version of TEI P5 that it uses.
Yes – thanks for doing such a great service to the community!
You’re welcome, it was my pleasure. Although I know filling in surveys can be annoying I think it is a quick and easy way to get at least a vague indication of the community’s feeling on certain issues.
I think that lack of easy tools for presentation / publication of TEI documents is a serious drawback. Many of my younger colleagues would learn (or actually have learned) the TEI editing in Oxygen, but they are unable — and not willing! — to learn XSLT for the presentation of their texts (not to mention the publication – servers etc.). An average user who is not able to modify Sebastian’s stylesheets for his edition is left completely alone with his/her TEI document (only *exceptionally*, an XSL-expert is available for help in big institutions). As for now, the TEI is an ideal tool for only one part of the communication chain — but not for the whole …
This is of course difficult, but so is the publication of research in print or other mediums. Usually these forms of publication involve the work of other people, for which researchers pay in one way or another. Perhaps it is because I happen to help manage a service, InfoDev, which would be more than happy to undertake paid work in this area for you and other external institutions, but I don’t see this as much of a hurdle. If the research is worthwhile, then hopefully funding is available, and some of this could be budgeted for technical development. However, that said, researchers often spend years learning ancient languages or obscure discipline-based technicalities, and arguably they should be able to learn some basic XSLT and HTML with a very small dedication of their time. Whether they should and could do this is, of course, a personal decision, but these are just more tools in a toolbox that might also include knowledge of how to write complex statistical queries or how to collaborate using version control systems. But again, we’re happy to undertake work, especially TEI-related work, from any part of the digitization to publication, analysis and visualization aspects of research projects.
Perhaps, a marketing campaign would help.
This would perhaps help get more people involved in the TEI. We would want, I suggest, that anyone doing a humanities text project applying for funding should feel (or get the advice that) they should be using the TEI (or at least justifying why they are using some other open standard instead). I feel this is probably more in the mandate of the TEI Board than the TEI Technical Council, but would encourage SIGs and indeed individuals to undertake whatever outreach activities are feasible.
about question 11 : it would be interesting to relate software/tools development and training/workshop. offering training sessions dedicated to one tool or category of tools, and looking at how people use tools IRL during the training sessions to get a better idea of need specifications… ?
This would be interesting, though those who have just been trained in tools are likely to perceive different needs from those who use them on a daily basis. But I do wonder whether this should be a priority for the TEI Technical Council, which has its hands full maintaining, improving, and extending the Guidelines themselves. We should encourage tool development by third parties, and facilitate this development where it is in our power.
Please, please, please don’t spend time and money on building a TEI-wide repository. Instead, convince Google to recognize the TEI format so that one can easily do a web search for TEI texts. Then, get people to put their texts on the web. I think the building of publishing tools and education are very important, but that they shouldn’t be Council functions per se. Similarly, I think the interchange question is very, very important, but Council’s role in it should be limited. This is the kind of thing a SIG (or SIGs) should tackle, and Council should be involved in blessing/criticizing their output.
Personally, I agree with you about building repositories. I feel there are more than enough people with a lot more experience in undertaking this kind of activity. There has already been discussion and work with Google regarding exporting from Google Books in TEI P5 XML format, which is promising. I agree the community, potentially through SIGs, can handle a lot of these issues. I worry about the idea of Council “blessing/criticizing” the output of SIGs, rather than just being on hand to provide support and implement changes recommended by them.
Creating and managing a content repository is vastly different from developing and maintaining markup guidelines, and would require a serious redirection of TEI-c’s resources. Let others who are already in the repo business (e.g., HathiTrust, OTA) take care of that.
I would agree with this, and it is what I would recommend to the TEI.
Thank you for undertaking this survey.

You’re welcome, it was my pleasure. I’m always interested in getting a sense of where the TEI community agrees on certain issues.


13. You may optionally include your email address so we can contact you if (and only if) we have any follow-up questions concerning your responses.

I’m certainly not going to provide these for spam-bots!

Conclusion

My recommendation to the TEI Council is going to be that we slowly start phasing out TEI P4 support. Closer to the end-of-support date (November 2012) we should move the TEI P4 materials to the TEI Vault but redirect links to there. I think this survey bears out my belief that the TEI Technical Council should focus on the maintenance and improvement of the Guidelines, and looking for ways to improve these in the future.




TEI Consortium and its Future

John Unsworth, interim chair of the TEI Consortium (TEI-C), has asked those running for TEI Board or TEI Technical Council, and those who are remaining in place, to answer some questions regarding the development of the TEI. I’m already serving a term through 2012 so am not up for potential re-election this year. I’ve chosen to write my answers up as a blog post because I found it difficult to adhere to John’s plea for brevity.

1) Should the TEI cease to collect membership fees, and cease to pay for meetings, publications, services, etc.?

I feel it would be difficult for the TEI Consortium to continue its work without collecting membership fees. However, I think the majority of this money should not be reserved for travel. The majority of it should be available for application in the same manner as we have done the SIG grants in the past. (However, this might be used for travel for a particular TEI Technical Council additional workgroup, or bursaries for the conference, or targeted tool development (‘bounties’) for tools useful to the TEI-C’s mission, amongst many other things.) There should not necessarily be any limits on what could qualify for an application for funding. Not all revenues would need to be spent in a single year.

2) Assuming paid membership continues, should institutional members have a choice between paying in cash and paying by supporting the travel of their employees to meetings, or committing time on salary to work on TEI problems?

The cost of running meetings for the TEI Board or Technical Council should mostly be borne by the institution and agreed to at time of nomination (i.e. if your institution won’t commit to fund your attendance (travel and subsistence) at a couple of meetings a year, then you should not necessarily be accepted as a candidate). I realise this is unfair, but so is participation in most standards-creating bodies, and there is nothing stopping significant participation by any member of the community (i.e. they don’t need to be on Board/Council to effect change). It may be that public funds could be sought by the institution or individual to further supplement this. TEI-C money would be used for any overall expenses, such as the costs of room hire, or such things not covered by institutions. If an institutional member was in dire straits financially, but the participation of a person elected from that institution was deemed to be of sufficient benefit to the TEI-C, they could apply for support from the TEI-C. However, this should not be the norm. All Partner-level institutions should offer services as part of their partnership agreement in addition to the top-level membership fee. These partnership agreements should be made public on the TEI-C website. ‘Membership’ at a lower non-Partner rate might be replaced solely by services. There should be nothing stopping voluntary participation in TEI-C activities by motivated individuals who are not institutional members.

3) Should the TEI have individual members (paying or not) who can vote to elect people to the board and/or council?

All members at every single level, especially including individual subscribers, should have a single vote. Institutions become Partners to support the TEI Consortium and tend to view it as participation in a standardization body; I doubt many care strongly about their privileged position of having a vote at election time. One vote for one member (whether individual, Partner, or otherwise).

4) Should the email discussions of the TEI Board be publicly accessible?

Yes. The TEI Technical Council archives were made public partly because of my suggestion that they should be. See http://lists.village.virginia.edu/pipermail/tei-council/2006/005757.html … in that post I assumed that the TEI Board mailing list might contain details that would be detrimental if made public. Having had reports back from institutional representatives on the mailing list, I no longer believe that this is true for the majority of posts there. I would recommend that when something of an extremely confidential nature is discussed, this happen off the TEI Board mailing list, but that an edited summary of the discussion be posted back on the list for all to see. However, such in camera discussions should be very unusual and justified before taking place.

5) Should the Board and the Council be combined into a single body, with subsets of that group having the responsibilities now assigned to each separate group?

I agree that the TEI Board and TEI Technical Council might seem a bit cumbersome. I’ve been on the TEI Technical Council since 2004 and have enjoyed that it is not in its remit to worry about the fiscal, marketing, and organizational aspects of the TEI-C. Although I think the TEI Board could do a better job in these areas, especially marketing, these are not my strengths.  If they were merged together I think it might distract from the technical work. If we then made sub-groups with responsibilities for Board-like activities and Technical Council-like activities, aren’t we just reinventing the Board and Technical Council?  If the activities and discussions of the TEI Board were conducted publicly (i.e. the mailing list archives were public), then I think that would be enough. The community could then lobby elected individuals if they wished to get their points of view heard.

6) Assuming we continue to collect funds, we will still have limited resources. Given that, in the next two years, which of the following should be the TEI’s highest priority? Pick only one:

a) providing services that make it easy for scholars to publish and use TEI texts online
b) providing workshops, training, and other on-ramp services that help people understand why they might want to use TEI and how to begin to do so
c) encouraging the development of third-party tools for TEI users
d) ensuring that large amounts of lightly but consistently encoded texts (e.g., TEI Tite) are generated and made publicly available, perhaps in a central repository or at least through some centrally coordinated portal
e) developing a roadmap for P6 that positions the TEI in relation to other standards (HTML5, RDF, etc.)
f) tackling hard problems not addressed in other encoding schemes, in order to maximize the expressive and interpretive power of TEI

This is a difficult choice because so many of these are things that I feel strongly need to be encouraged.

a) is very vague and I feel it is not the role of the TEI-C to be providing lots of services, rather maintaining a standard.
b) also sounds good, but we already have lots of people providing training (my own institution included) at cost-recovery basis. Some more basic guides might be beneficial.
c) The TEI-C can encourage these through SIG grants and bounties where appropriate, but third-party tools should be developed by third parties.
d) I’m highly resistant to the idea that any TEI users should even see TEI Tite documents at all! This schema is not TEI Conformant or Conformable by itself as it breaks the TEI Abstract Model in several ways. Tite is fine as a mass-digitization schema, but should be transformed instantly and internally to the project to a proper TEI file with a <teiHeader>. I have nothing against lots of sample TEI texts being made available, in TEI Lite or, better, a different slimmed-down mostly structural encoding. However, I think that having these all in one place is unlikely, and distributed collections of archives (all linked to from http://wiki.tei-c.org/index.php/Samples or another location), or some OAI-PMH or RDF aggregator, is probably an easier start. Again, this should be done by the community, not the TEI-C. There are no barriers to the community just doing this and I know the Oxford Text Archive has some plans in this area.
f) Is a possibility, but the suggestions and developments for the TEI Guidelines should come from the community. However, the TEI Guidelines are not Guidelines of the Gaps handling just those things not done by other standards. It plays nicely with other standards where at all possible and developments should continue to improve it in this area.
e) Which I’ve cunningly left to last is probably central to what the TEI-C or at least the TEI Technical Council should be doing. We already have a statement on the conditions for maintenance of P5 and developments of such things like P6 http://www.tei-c.org/Activities/Council/Working/tcw09.xml and I do not believe we have reached such a major change in technology or infrastructure to warrant TEI P6, yet. However, I agree that there are things we can do with the TEI Guidelines to help those seeking transformations to HTML5, RDF, and other newer formats and recommendations to be made in this area. I disagree entirely that these somehow replace the need for TEI.  A roadmap is a good idea, but a lot of the necessary changes can be done under the umbrella of TEI P5 and its intended deprecation mechanisms.

So, on balance, I plump for ‘e)’, however I think all the other ideas are beneficial things, with c) and f) being my second choices.

Overall, I do not think the TEI-C is horribly broken, and believe that the TEI has a good and useful role to play in the development of digital resources. The suggested revisions moving towards openness and transparency would be beneficial. I feel the problems people have had with the TEI Board stem from not knowing what is going on there (lack of transparency) and members of the Board acting as individuals rather than remembering that they are there as representatives of the community at large.

-James


Digital Humanities 2011

My report from Digital Humanities 2011 is below. If anyone wants any more information about the various sessions I attended, I’m happy to try and dredge my memory for a recollection of my impressions. Otherwise the book of abstracts is available. Most of the interesting things were really in between sessions and in the evenings, in talking to people about possible future projects, advertising InfoDev services, etc.

Friday 17 June 2011

Sebastian and I took an afternoon flight to SFO, as we were attending the Digital Humanities 2011 conference. I was lucky enough to get a row to myself, but Sebastian kept to his assigned seat rather than join me and be tormented by my cackling at juvenile films. I watched four films, the only one of which I’d recommend is Submarine, whose screenplay and direction were by Richard Ayoade. Sebastian’s estimate of c.250 is a bit off; there were about 375 registered participants with various other hangers-on, according to the organisers.

Saturday 18 June 2011

Sebastian and I woke early (thank you jetlag) to teach our Introductory TEI ODD workshop at 8:30am. Unfortunately, nothing on campus that serves anything which even vaguely resembles food opens until 8am on a Saturday. The course materials are at: http://tei.oucs.ox.ac.uk/Talks/2011-06-18-odd/ and we had about 15 participants. We went perhaps a bit too fast, and talked too long, but most of them made it through the first exercise. Some had difficulty with the idea that we weren’t teaching the stated prerequisite of TEI and XML but the TEI’s customization language instead. It really would have been better to do it as a full day workshop.

Afterwards Craig Bellamy and I drove (in a Mustang he had rented) down to Santa Cruz and ate a burrito on the beach. It was better than the ones I get here in Oxford and was not dissimilar to the real thing. We also went to look at UC Santa Cruz where Craig had spent some undergraduate time, a truly bizarre campus. Craig is responsible for setting up the Australasian Association for Digital Humanities (see http://www.craigbellamy.net/2011/05/31/australasian-association-for-digital-humanities-aadh/ and http://aa-dh.org/), which is seeking to join ADHO (Alliance of Digital Humanities Organizations) alongside ACH, ALLC, and SDH-SEMI. Much of our conversation related to this topic and the ADHO Steering Committee meeting the next day. (Boy, don’t we know how to spoil a beach!) We returned to Stanford and met up with various other DH conference goers for ‘food’ and ‘drink’ in the local students’ union.

Sunday 19 June 2011

I intended to go swimming this day, but the lane swimming wasn’t open until the afternoon, so instead I rented a bicycle. I purchased a variety of items to put in the huge fridge that was part of the full-sized kitchen (with stove, sink, dishwasher, microwave, etc.) that was in my room. Sadly the kitchen didn’t come with anything useful to, you know, cook or eat with. It didn’t come with anything at all. Since Sebastian also had a bicycle we cycled to the Stanford Shopping Centre, where we looked around at things we could possibly buy, had lunch, and eventually cycled back to the residences. The conference’s opening plenary was by David Rumsey (http://www.davidrumsey.com/) talking about “Reading Historical Maps Digitally: How Spatial Technologies Can Enable Close, Distant and Dynamic Interpretations”, but it partly seemed to be a demonstration of the proprietary Luna Browser (http://www.davidrumsey.com/view/luna), a Java servlet-based image viewer, which I didn’t like at all. At the reception afterwards there was much pleasant conversation.

Monday 20 June 2011

I attended a morning session consisting of the following papers:

  • Maciej Eder & Jan Rybicki “Do Birds of a Feather Really Flock Together, or How to Choose Test Samples for Authorship Attribution”
  • Jan Rybicki “Alma Cardell Curtin and Jeremiah Curtin: The Translator’s Wife’s Stylistic Fingerprint.”
  • David L. Hoover “The Tutor’s Story: A Case Study of Mixed Authorship”

And then one with:

  • Yves Marcoux, Michael Sperberg-McQueen, & Claus Huitfeldt “Expressive power of markup languages and graph structures”
  • Gary F. Simons, Steven Bird, Christopher Hirt, Joshua Hou, & Sven Pedersen “Mining language resources from institutional repositories”
  • Thomas Eckart, David Pansch, & Marco Büchler “Integration of Distributed Text Resources by Using Schema Matching Techniques”

Of these the one by Yves Marcoux on OO-TexMECS was the most interesting (though Eckart’s showed some promise). However, I fundamentally disagreed that breaking XML is necessary for recording the majority of the graph data-structures he was presenting. TEI-style basic fragmentation, or even basic stand-off linking seems to do the trick in 99% of cases. It is an interesting discussion for markup geeks interested in the theory behind markup languages, but solving a problem that I feel isn’t really a problem for the majority of work we do here.

After lunch I went to a bit of:

  • Reinhild Barkey, Erhard Hinrichs, Christina Hoppermann, Thorsten Trippel, & Claus Zinn “Trailblazing through Forests of Resources in Linguistics”
  • Michele Pasin “Browsing highly interconnected humanities databases through multi-result faceted browsers”
  • Alan Galey “Approaching the Coasts of Utopia: Visualization Strategies for Mapping Early Modern Paratexts”

before nipping off to the location where the posters were to be displayed to put up my Wandering Jew’s Chronicle poster as well as Sebastian’s Claros poster, both right in front of the doors where you walk in, ensuring maximum throughput of people to look at them. The poster session was quite busy. Shortly before it I took photos of all the posters; however, these are on the camera which later went missing. There was a reception that followed this, but I was so busy talking to people about the poster that I seemed to miss it. Luckily someone brought me a drink (and we arranged a tour of SLAC for the next day).

Tuesday 21 June 2011

Sebastian woke up extra early to go on a punishing ‘fun run’ up huge mountains, whereas I slept in. From 08:30 we interviewed a potential ePub and/or OpenData intern via Skype. Since we’d missed the beginning of the sessions (and from their abstracts I didn’t feel cheated), Sebastian went off to catch the end of the sessions while I cycled to the nearby B. Gerald Cantor’s Rodin Sculpture Park and looked at a bronze cast of Rodin’s “The Gates of Hell” (see http://museum.stanford.edu/view/rodin__1985_86.html).

Afterwards I caught one of the next sessions, specifically a panel discussing “The Interface of the Collection” consisting of: Geoffrey Rockwell, Stan Ruecker, Mihaela Ilovan, Daniel Sondheim, Milena Radzikowska, Peter Organisciak, & Susan Brown.

Over lunch, instead of nattering away to people about visualization, we went on a visit that Mike Toth had arranged to the Stanford Linear Accelerator Center (http://yfrog.com/ke3m7tmj), now ‘SSRL’. He had done work here using X-ray fluorescence to uncover the Archimedes Palimpsest, and they wrote up a glowing press article about our visit: https://news.slac.stanford.edu/features/digital-humanities-experts-learn-how-ssrl-can-shed-light-past

We can, indeed, use real science tools to help digital humanities.

After this I ate some lunch in the back of the following session:

  • David Beavan “ComPair: Compare and Visualise the Usage of Language”
  • Trevor Muñoz, Virgil Varvel, Allen Renear, Kevin Trainor, & Molly Dolan “Tasks vs. Roles: A Center Perspective on Data Curation Needs in the Humanities”
  • Deborah Anderson “Handling Glyph Variants: Issues and Developments”
  • Scott Weingart & Jeana Jorgensen “Computational Analysis of Gender and the Body in European Fairy Tales”
  • Hiroyuki Akama, Maki Miyake, & Jaeyoung Jung “Automatic Extraction of Hidden Keywords by Producing “Homophily” within Semantic Networks”

Later we went to the Zampolli Prize Lecture in the Dinkelspiel Auditorium and listened to the winner, Chad Gaffield, tell us about “Re-Imagining Scholarship in the Digital Age”. This was a very motivational session by the president of the SSHRC funding body. I wouldn’t have been surprised if he had got everyone up and singing praises, but the auditorium was far too hot for that kind of thing.

Wednesday 22 June 2011

This morning I went to the panel on “Integrating Digital Papyrology” featuring Gabriel Bodard, Hugh Cayless, Ryan Baumann, Joshua Sosin, & Raffaele Viglianti.

After a break I attended “The “#alt-ac” Track: Digital Humanists off the Straight and Narrow Path to Tenure” featuring Bethany Nowviskie, Julia Flanders, Tanya Clement, Doug Reside, Dot Porter, & Eric Rochester. I attended partly because I have an article (as the last word) in the open access book they were launching: http://mediacommons.futureofthebook.org/alt-ac/.

After lunch there was a panel on Funding Digital Humanities, with funders from the USA and Canada. There was not a UK, European, Australian, Japanese, or Mexican funder represented. Still, it was good to hear what they said.

After this there was the closing plenary by JB Michel & Erez Lieberman-Aiden, who had worked with Google to produce the Google ngram viewer. The long ‘s’ problem in OCR’ed data is clearly visible by looking at ‘best,beft’ from 1700 to the modern day in http://ngrams.googlelabs.com/. (Something I tweeted about a couple of days after its launch but using presumption vs prefumption.) Unlike Chad, who seemed to be celebrating what Digital Humanities had done, these two seemed intent on telling us quite obvious things that DH as a community should be doing… most of which I’m pretty sure we already are doing or striving to do. Because it was so hot during Chad’s talk, on the way there I stopped to get a mango smoothie, which made the talk more tolerable.

Following this there was a banquet at the Computer History Museum in Mountain View. The food and drink were so-so, the company was excellent, and the museum was fairly USA-centric in its outlook.

Thursday 23 June 2011

While most people went on organised tours to Silicon Valley or the Sonoma Wine Country, Craig Bellamy (with his Mustang), Peter Organisciak, and I instead drove up Highway 1, stopping off for delicious Mexican food and beaches, and crossing the Golden Gate Bridge. In S.F. we walked around Fisherman’s Wharf and some other places, before returning to Stanford. There was simultaneously a meeting on the curation of digital humanities data which I followed via Twitter.

Friday 24 June 2011

I was flying home in the evening, so accompanied by Raffaele Viglianti I went to S.F. on the train, where we met up with some other people, wandered up and down the hills of Chinatown, had some dim sum, and eventually I caught a shared van to SFO to catch my flight. This time I got a seat in the much smaller ‘upper deck’ of the plane, but still didn’t capitalise on it and watched several more films. Arrived back Saturday midday horribly jetlagged.

Posted in Conference | 2 Comments

grouping by group-adjacent="boolean(self::lb)"

A project I was doing some work for had some input that looked like:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns:tei="http://www.tei-c.org/ns/1.0" xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader xmlns:xi="http://www.w3.org/2001/XInclude" type="text">
<fileDesc>
   <titleStmt>
      <title>A sample file</title>
   </titleStmt>
   <publicationStmt>
      <distributor>InfoDev</distributor>
   </publicationStmt>
   <sourceDesc>
      <p>VSARPJ project</p>
   </sourceDesc>
</fileDesc>
<profileDesc>
   <creation>
      <date/>
   </creation>
   <langUsage>
      <language ident="ojp">Old Japanese</language>
   </langUsage>
   <textClass>
      <catRef target="#bussoku"/>
   </textClass>
</profileDesc>
<encodingDesc>
   <samplingDecl>
      <p>This text was transcribed phonemically and edited to parallel the content from the
         corresponding item in the <title>Nihon koten bungaku taikei</title> version of the
            <title>Man'yôshû</title>, <ref>Man'yôshû I</ref>. </p>
   </samplingDecl>
</encodingDesc>
</teiHeader>
<text>
<body xml:id="BS.1">
   <div>
      <ab type="original" xml:lang="ojp"> 美阿止都久留 <lb xml:id="BS.1-orig_1"
            corresp="#BS.1-trans_1"/> 伊志乃比鼻伎波 <lb xml:id="BS.1-orig_2" corresp="#BS.1-trans_2"
         /> 阿米爾伊多利 <lb xml:id="BS.1-orig_3" corresp="#BS.1-trans_3"/> 都知佐閇由須礼 <lb
            xml:id="BS.1-orig_4" corresp="#BS.1-trans_4"/> 知知波波賀多米爾 <lb xml:id="BS.1-orig_5"
            corresp="#BS.1-trans_5"/> 毛呂比止乃多米爾 </ab>
      <ab type="transliteration" xml:lang="ojp-Latn">
         <s>
            <phr>
               <phr>
                  <cl>
                     <phr type="arg">
                        <w>
                           <m type="prefix">
                              <c type="phon">mi</c>
                           </m>
                           <w>
                              <c type="phon">ato</c>
                           </w>
                        </w>
                     </phr>
                     <w type="verb" function="adnconc" ana="#L031144">
                        <c type="phon">tukuru</c>
                     </w>
                  </cl>
                  <w type="verb" function="adnconc" ana="#L031144">
                     <c type="phon">tukuru</c>
                  </w>
                  <lb xml:id="BS.1-trans_1" corresp="#BS.1-orig_1"/>
                  <w>
                     <c type="phon">isi</c>
                  </w>
                  <w type="particle" subtype="case" function="gen" ana="#L000520">
                     <c type="phon">no</c>
                  </w>
               </phr>
               <w>
                  <c type="phon">pibiki</c>
               </w>
               <w type="particle" subtype="top" ana="#L000522">
                  <c type="phon">pa</c>
               </w>
            </phr>
            <lb xml:id="BS.1-trans_2" corresp="#BS.1-orig_2"/>
            <cl>
               <phr>
                  <w>
                     <c type="phon">ame</c>
                  </w>
                  <w type="particle" subtype="case" function="dat" ana="#L000519">
                     <c type="phon">ni</c>
                  </w>
               </phr>
               <w type="verb" function="infinitive" ana="#L030170">
                  <c type="phon">itari</c>
               </w>
            </cl>
            <lb xml:id="BS.1-trans_3" corresp="#BS.1-orig_3"/>
<!-- etc -->
         </s>
      </ab>
   </div>
</body>
</text>
</TEI>

What they wanted as output was a table-layout (icky) that aligned two nested tables of the original and the transliteration like:

<table>
   <tr>
      <td>
         <table>
            <tr>
               <td><span class="origLine">美阿止都久留</span></td>
            </tr>
            <tr>
               <td></td>
            </tr>
            <tr>
               <td><span class="origLine">伊志乃比鼻伎波</span></td>
            </tr>
            <tr>
               <td></td>
            </tr>
            <tr>
               <td><span class="origLine">阿米爾伊多利</span></td>
            </tr>
            <tr>
               <td></td>
            </tr>
            <tr>
               <td><span class="origLine">都知佐閇由須礼</span></td>
            </tr>
            <tr>
               <td></td>
            </tr>
            <tr>
               <td><span class="origLine">知知波波賀多米爾</span></td>
            </tr>
            <tr>
               <td></td>
            </tr>
            <tr>
               <td><span class="origLine">毛呂比止乃多米爾</span></td>
            </tr>
         </table>
      </td>
      <td>
         <table>
            <tr>
               <td><span class="w">miato</span>
                  <span class="w">tukuru</span>
                  <span class="w">tukuru</span>
               </td>
            </tr>
            <tr>
               <td></td>
            </tr>
            <tr>
               <td><span class="w">isi</span>
                  <span class="w">no</span>
                  <span class="w">pibiki</span>
                  <span class="w">pa</span>
               </td>
            </tr>
            <tr>
               <td></td>
            </tr>
            <tr>
               <td><span class="w">ame</span>
                  <span class="w">ni</span>
                  <span class="w">itari</span>
               </td>
            </tr>
            <tr>
               <td></td>
            </tr>
            <tr>
               <td><span class="w">tuti</span>
                  <span class="w">sape</span>
                  <span class="w">yusure</span>
               </td>
            </tr>
            <tr>
               <td></td>
            </tr>
            <tr>
               <td><span class="w">titipapa</span>
                  <span class="w">ga</span>
                  <span class="w">tame</span>
                  <span class="w">ni</span>
               </td>
            </tr>
            <tr>
               <td></td>
            </tr>
            <tr>
               <td><span class="w">moropito</span>
                  <span class="w">no</span>
                  <span class="w">tame</span>
                  <span class="w">ni</span>
               </td>
            </tr>
         </table>
      </td>
      <td>BS.1</td>
   </tr>
</table>

If we ignore the icky aspect of using tables for layout and alignment purposes, then there is something interesting to learn from the solution. This is, at heart, a grouping problem. The solution I came up with was:

<?xml version='1.0'?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xpath-default-namespace="http://www.tei-c.org/ns/1.0" version="2.0">
    <xsl:template match="TEI">
        <html>
            <head>
                <title>test corpus</title>
            </head>
            <body>
                <xsl:apply-templates/>
            </body>
        </html>
    </xsl:template>

    <!-- You can put things you want to do nothing to all in one template -->
    <xsl:template match="teiHeader | note | entry | list"/>

    <!-- Or similarly things you want to just have the tags vanish from.  w is here and elsewhere, hence priority. -->
    <xsl:template match=" choice | m |w | s |phr|cl " priority="-1"><xsl:apply-templates/></xsl:template>

    <!-- If you are using tables for layout purposes (icky) then you don't need to change lb's to BRs. -->
    <!--
    <xsl:template match="lb">
        <br/>
     </xsl:template>-->

    <xsl:template match="body">
        <table>
            <tr>
                <td>
                    <!-- Nesting an icky table for each cell, so you can get one tr per line. -->
                   <table>
                       <xsl:apply-templates select="descendant::ab[@type='original']"/>
                   </table> </td>
                <td>
                    <!-- Nesting an icky table for each cell, so you can get one tr per line. -->
                    <table><xsl:apply-templates select="descendant::ab[@type='transliteration']"/></table>
                 </td>
                <td>
                    <xsl:value-of select="@xml:id"/>
                </td>
            </tr>
        </table>
    </xsl:template>

    <!-- Not really necessary but in case you wanted to be able to do something with the original lines, wrap an element around them. -->
    <xsl:template match="ab[@type='original']//text()"><span class="origLine"><xsl:value-of select="normalize-space(.)"/></span></xsl:template>

    <!-- For original things group by any child nodes or text, and create the groups adjacent to whether there is a linebreak or not. -->
    <xsl:template match="ab[@type='original']">
        <xsl:for-each-group select="child::node()| child::text()"  group-adjacent="boolean(self::lb)">
        <tr><td><xsl:apply-templates select="current-group()"/></td></tr>
        </xsl:for-each-group>
        </xsl:template>

    <!-- For transliterations first flatten hierarchy (you could do this a variety of ways), by copying just the top w elements and linebreaks, and for each of these group adjacent to the line breaks. -->
    <xsl:template match="ab[@type='transliteration']">
        <xsl:variable name="test"><xsl:copy-of select=".//w[not(ancestor::w)] | .//lb"/></xsl:variable>
        <xsl:for-each-group select="$test/*" group-adjacent="boolean(self::lb)">
                    <tr>
                        <td><xsl:apply-templates select="current-group()"/></td>
                    </tr>

        </xsl:for-each-group>
    </xsl:template>

    <!-- Since we have w's nested inside w's, when we have one of the top ones wrap an element around it, and then take the value stripping out any spaces. (There are other ways to do this as well.) -->
    <xsl:template match="w[not(ancestor::w)]"><span class="w"><xsl:value-of select="translate(normalize-space(.), ' ', '')"/></span><xsl:text> </xsl:text></xsl:template>
</xsl:stylesheet>
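For anyone who finds the flatten-then-group step in the transliteration template easier to follow outside XSLT, here is a rough Python sketch of the same idea. It is only an analogy, not part of the project: the cut-down SAMPLE drops the TEI namespace and most attributes, and the names flatten, word and rows are my own for illustration.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

# A cut-down, namespace-free stand-in for the transliteration <ab> (an assumption).
SAMPLE = """<ab type="transliteration">
  <s><phr>
    <w><m><c>mi</c></m><w><c>ato</c></w></w>
    <w><c>tukuru</c></w>
    <lb/>
    <phr><w><c>isi</c></w><w><c>no</c></w></phr>
  </phr></s>
</ab>"""


def flatten(elem, inside_w=False):
    """Yield outermost <w> elements and every <lb/> in document order,
    roughly what .//w[not(ancestor::w)] | .//lb selects."""
    for child in elem:
        if child.tag == "lb":
            yield child
        elif child.tag == "w" and not inside_w:
            yield child
            # Nested <w> are not yielded again, but an <lb/> inside one would be.
            yield from flatten(child, True)
        else:
            yield from flatten(child, inside_w)


def word(w):
    """All text inside an outermost <w>, with whitespace removed."""
    return "".join("".join(w.itertext()).split())


ab = ET.fromstring(SAMPLE)
rows = []
# groupby() puts consecutive items with the same key together, like group-adjacent.
for is_break, run in groupby(flatten(ab), key=lambda e: e.tag == "lb"):
    rows.append([word(e) for e in run if e.tag == "w"])

print(rows)
```

Each inner list corresponds to one table row; the empty list is the row produced by the group that contains only the line break, just as the XSLT output has an empty cell between lines.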

Most of this is pretty straightforward, and I’ve included comments in the XSLT to help anyone wondering why I’m doing something. But if we look at just one bit of it:

  <!-- For original things group by any child nodes or text, and create the groups adjacent to whether there is a linebreak or not. -->
    <xsl:template match="ab[@type='original']">
        <xsl:for-each-group select="child::node()| child::text()"  group-adjacent="boolean(self::lb)">
        <tr><td><xsl:apply-templates select="current-group()"/></td></tr>
        </xsl:for-each-group>
     </xsl:template>
 

The reason this is interesting is the use of @group-adjacent="boolean(self::lb)". I’m using the truth or falsity of whether the current node is a line-break element as the key for grouping adjacent nodes. In XSLT 2.0 there are basically two types of grouping condition: patterns and expressions. @group-starting-with and @group-ending-with require their values to be a pattern, but @group-by and @group-adjacent accept any XPath expression. This means that with those two you can have a bit more fun! With these, the expression is evaluated against each item in the population you are grouping in order to calculate its grouping key. With those accepting patterns, the pattern must match specific nodes in the population, and those nodes either start or end a newly created group. This is an important distinction to keep in mind: with group-adjacent you can use an expression that calculates the key to be matched rather than one that is that key. So in this case we use boolean(self::lb) to test whether the current node is an <lb> or not. The key is true for each <lb> and false for everything else, and every run of adjacent nodes sharing the same key becomes one group: the text of each line forms one group and each <lb> between lines forms another.
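To make the difference between a calculated key and a matching pattern concrete, here is a small Python analogy (not XSLT, and not how any XSLT processor is implemented). It assumes a namespace-free <ab> containing only text and <lb/> children; the helper names is_lb, label, group_adjacent and group_starting_with are mine, chosen to echo the XSLT attributes.

```python
from itertools import groupby
from xml.dom import minidom

# First three lines of the original text, without namespace or attributes (an assumption).
SAMPLE = "<ab> 美阿止都久留 <lb/> 伊志乃比鼻伎波 <lb/> 阿米爾伊多利 </ab>"


def is_lb(node):
    """The calculated key: True for an <lb/> element, False for anything else."""
    return node.nodeType == node.ELEMENT_NODE and node.tagName == "lb"


def label(node):
    """Readable stand-in for a node; assumes only text nodes and <lb/>."""
    return "lb" if is_lb(node) else node.data.strip()


def group_adjacent(nodes, key):
    """Like @group-adjacent: each run of neighbours with the same key is a group."""
    return [(k, [label(n) for n in run]) for k, run in groupby(nodes, key=key)]


def group_starting_with(nodes, matches):
    """Like @group-starting-with: every matching node opens a new group."""
    groups = []
    for node in nodes:
        if matches(node) or not groups:
            groups.append([])
        groups[-1].append(label(node))
    return groups


children = list(minidom.parseString(SAMPLE).documentElement.childNodes)
adjacent = group_adjacent(children, is_lb)
starting = group_starting_with(children, is_lb)

for key, members in adjacent:
    print(key, members)
print(starting)
```

With the calculated key the line breaks end up in groups of their own (key True) between the text groups (key False), which is why the table gets an empty row between lines. With the starting pattern each <lb/> would instead lead the group containing the text that follows it.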

Posted in TEI, Uncategorized, XML, XSLT | Leave a comment

Ubuntu Twinview Maximizing Windows problem

This is more of a note-to-self. After my recent upgrade to the latest Ubuntu I had a problem: with my two monitors set to ‘twinview’, the panels, task bars, and maximized windows spanned both monitors. What you really want is to be able to move windows from one monitor to the other, but have them stay within a single monitor when you maximize them.

My guess at a solution, which turned out to work, was to comment out the ‘metamodes’ option in the Screen section of my xorg.conf, i.e.:


Section "Screen"
    Identifier "Screen0"
    Device "Device0"
    Monitor "Monitor0"
    DefaultDepth 24
    Option "TwinView" "1"
    Option "TwinViewXineramaInfoOrder" "CRT-0"
    #Option "metamodes" "CRT-0: 1280x1024 +0+0, CRT-1: 1280x1024 +1280+0"
    SubSection "Display"
        Depth 24
    EndSubSection
EndSection

That sorted out the problem as soon as I logged back in again.
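If I ever wanted to script that edit rather than do it by hand, something like the following Python sketch would do. It works on the file contents as a string; the function name comment_out_option is invented for illustration, and reading and writing the real xorg.conf (and backing it up first) is deliberately left out.

```python
import re


def comment_out_option(conf_text, option_name):
    """Put '#' in front of every uncommented  Option "<option_name>" ...  line."""
    pattern = re.compile(
        r'^([ \t]*)(Option[ \t]+"' + re.escape(option_name) + r'".*)$',
        re.IGNORECASE | re.MULTILINE,
    )
    # Keep the indentation (group 1) and prefix the rest of the line with '#'.
    return pattern.sub(lambda m: m.group(1) + "#" + m.group(2), conf_text)


# A shortened Screen section used only as demo input.
demo = (
    'Section "Screen"\n'
    '    Option "TwinView" "1"\n'
    '    Option "metamodes" "CRT-0: 1280x1024 +0+0, CRT-1: 1280x1024 +1280+0"\n'
    'EndSection\n'
)
result = comment_out_option(demo, "metamodes")
print(result)
```

Lines that are already commented out are left alone, because the pattern only matches when Option is the first thing on the line after any indentation.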

Posted in Uncategorized | Leave a comment

Thunderbird Calendar Automatic Export

Previously I wrote about Thunderbird, DavMail, Exchange and exporting to Google Calendar, and my system was set up and working fine. Then I upgraded (full wipe and install) to the latest Ubuntu operating system and I had to set things up again. Part of the problem was that the Thunderbird Automatic Export add-on wouldn’t work with the new version of Thunderbird. While I know that changes to software sometimes mean that a plugin will no longer function, I didn’t think this would be a problem with Automatic Export… I mean all it does is take the calendar you’ve set and export it, which hopefully isn’t too reliant on the way the program itself works. Hopefully.

It turned out that if I unzipped the Thunderbird plugin package available from the Automatic Export page on the Mozilla add-ons site, I was able to edit the install.rdf file which tells Thunderbird about the package. When I did, I found that it had an em:maxVersion attribute, and all I did was change that to be far past the current version. (Note: there were two of these; I changed both since I wasn’t sure which applied to what.) Zipping the file back up again and renaming it to .xpi was all that was needed for a successful install.

Everything is now working again perfectly.
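For the record, the unzip, edit, and re-zip steps could be scripted along these lines. This is a sketch under assumptions, not a tested tool for the real add-on: it assumes install.rdf sits at the top level of the package, it handles em:maxVersion written either as an element or as an attribute, and the names bump_max_version and repack_xpi and the version string "99.*" are made up for illustration. The demo builds a tiny fake package in memory rather than touching a real .xpi.

```python
import io
import re
import zipfile


def bump_max_version(rdf_text, new_version):
    """Replace every em:maxVersion value, in element or attribute form."""
    as_element = re.compile(r'(<em:maxVersion>)[^<]*(</em:maxVersion>)')
    as_attribute = re.compile(r'(em:maxVersion=")[^"]*(")')
    for pattern in (as_element, as_attribute):
        rdf_text = pattern.sub(
            lambda m: m.group(1) + new_version + m.group(2), rdf_text
        )
    return rdf_text


def repack_xpi(src, dst, new_version):
    """Copy a zip package from src to dst, patching install.rdf on the way.
    src and dst may be file names or file-like objects."""
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for info in zin.infolist():
            data = zin.read(info.filename)
            if info.filename == "install.rdf":
                text = bump_max_version(data.decode("utf-8"), new_version)
                data = text.encode("utf-8")
            zout.writestr(info.filename, data)


# Demo with a fake two-file package held in memory.
original = io.BytesIO()
with zipfile.ZipFile(original, "w") as xpi:
    xpi.writestr(
        "install.rdf",
        "<em:minVersion>3.0</em:minVersion><em:maxVersion>3.1.*</em:maxVersion>",
    )
    xpi.writestr("chrome.manifest", "content example chrome/")

patched = io.BytesIO()
repack_xpi(original, patched, "99.*")

with zipfile.ZipFile(patched) as xpi:
    patched_rdf = xpi.read("install.rdf").decode("utf-8")
    patched_names = xpi.namelist()
print(patched_rdf)
```

Like my manual edit, this changes every em:maxVersion it finds rather than trying to work out which one applies to which application.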

Posted in Uncategorized | Leave a comment