Text Encoding Initiative (TEI) for Oracc

An ATF corpus can be turned into a TEI corpus; this document describes the TEI conventions used and discusses some of the issues in this conversion process.

The ATF processor turns ATF into XTF, a multi-stream XML output which separates the transliteration, the lemmatization, and the multi-word analysis of phrases, named entities, measures, etc. The TEI implementation converts this to a single-stream representation that conforms to the TEI P5 guidelines and can be validated using a schema generated by Roma.

In a few cases, the mapping of XTF to TEI is suboptimal as a result of the lack of tags with exactly appropriate semantics or the forced use of infelicitous constructs.

Preamble

Each ATF text is turned into a TEI text within a teiCorpus. Support for the kinds of information that can go in the teiHeader is weak in ATF corpora; this should be corrected.

The individual documents are available from a link on the project page for each text, under the label `Analytic View'. The label was chosen because the TEI version of the transliterations integrates the results of running various kinds of content-analyzers on the texts with the text data itself, which makes it easy to colourize the components the analyzers identify.

The final version of the TEI corpus is the concatenation of all of the files in a project prefaced by a TEI header which includes elements derived from the project glossaries. Thus, the TEI corpus has the potential to represent all of the project's textual and glossary data in a single file. Further developments of the XTF to TEI conversion will aim to make the TEI corpus a complete representation of the project's glossaries, texts and metadata.

Implementation

Schema

The schema is very nearly vanilla TEI P5 as generated by Roma; the full text can be browsed from the Resources section below.

The only additions made by hand are the definition of a simple XLink attribute set (att.xlink.attributes) and the referencing of that definition as an optional part of the name, note and title elements. This allows a few key links to be implemented directly in the browsable XML (when viewed with Firefox, at least).

Header

A very basic header is generated in order to meet the TEI minimum requirements.

Discourse

TEI div elements are used for discourse blocks (body, witnesses, document-date and others). Blocks which come before the body are placed in the TEI front section; blocks which come after the body are placed in the TEI back section.
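The placement rule above can be sketched in a few lines of Python. This is only an illustration, not the converter's actual code: the function name partition_blocks, the use of plain strings for block types, and the order of the blocks in the usage example are all assumptions.

```python
def partition_blocks(block_types):
    """Split an ordered list of discourse block types around the body.

    Hypothetical helper: blocks before 'body' go to the TEI front section,
    blocks after it go to the TEI back section.
    """
    if "body" not in block_types:
        raise ValueError("expected a 'body' block")
    i = block_types.index("body")
    return {
        "front": block_types[:i],
        "body": [block_types[i]],
        "back": block_types[i + 1:],
    }
```

For example, partition_blocks(["witnesses", "body", "document-date"]) places witnesses in front and document-date in back; the ordering of those two block types here is invented for the example.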

Structure

XTF structural divisions are rendered with milestone tags. In the case of the outer structural division type `surface' we use the TEI milestone tag. For column and line breaks we use cb and lb respectively.
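As a rough sketch of this mapping, the following Python builds the corresponding empty elements with the standard library. The function name structural_milestone and the choice of the unit and n attributes are assumptions made for illustration; the attributes actually emitted by the XTF-to-TEI conversion may differ.

```python
import xml.etree.ElementTree as ET


def structural_milestone(kind, n):
    """Return the empty TEI element for an XTF structural division.

    'surface' becomes a generic milestone; 'column' and 'line' become
    cb and lb. The attribute choices (unit, n) are assumptions.
    """
    if kind == "surface":
        return ET.Element("milestone", {"unit": "surface", "n": n})
    if kind == "column":
        return ET.Element("cb", {"n": n})
    if kind == "line":
        return ET.Element("lb", {"n": n})
    raise ValueError(f"unknown structural division: {kind!r}")
```

Calling structural_milestone("line", "1") yields an lb element that can be appended to the text stream at the point where the line begins.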

Inline

Almost all of the inline markup used by XTF (more precisely, by GDL) is handled well by TEI. A few exceptions are noted here.

There is no suppliedSpan, though it would be a natural addition since there are addSpan, delSpan and damageSpan. As a result, the equivalent of square-bracketed text is implemented using paired anchor tags (it is not possible to use an anchor ... ptr pair because ptr is not allowed in w).

There is no direct TEI tagging for the Assyriological practice of indicating collations (in this case, collation as in `checking of tablet' rather than collation of manuscript folios as in TEI). The XTF/TEI implementation uses a conventional mapping of flagged graphemes to TEI tags based on the corr element. The values high/medium/low are defined to be the specific equivalents of ATF flag combinations as in the following list.

? = <corr cert="low">
*? = <corr cert="medium">
* = <corr cert="high">
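The list above amounts to a small lookup table. A minimal Python sketch follows; the names COLLATION_CERT and corr_for are hypothetical, and placing the grapheme as the content of the corr element is an assumption about how the tag is applied.

```python
from xml.sax.saxutils import escape

# ATF flag combination -> value of the cert attribute on corr,
# exactly as given in the list above.
COLLATION_CERT = {
    "?": "low",
    "*?": "medium",
    "*": "high",
}


def corr_for(grapheme, flags):
    """Wrap a flagged grapheme in a corr element (hypothetical helper).

    Graphemes whose flags are not one of the three listed combinations
    are returned unwrapped.
    """
    cert = COLLATION_CERT.get(flags)
    text = escape(grapheme)
    if cert is None:
        return text
    return f'<corr cert="{cert}">{text}</corr>'
```

With this table, corr_for("an", "*") gives a corr element with cert="high", while a grapheme with no collation flag passes through unchanged.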

Lemmata

The lemmatization is partly integrated in the use of the w element. We push the definition of the @lemma attribute to include the full citation-form/guide-word/POS triple that is the standard referencing mechanism between XTF texts and their corresponding Corpus-Based Dictionaries. The additional annotation that is encoded in the forms structures in XTF files is not presently included. This will be rectified in a future release, probably by defining a TEI fs (feature-structure) item for each form and including it in the corpus preamble, then referencing it using the @ana attribute on w.
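To make the @lemma convention concrete, here is a small Python sketch that builds a w element carrying the triple. The serialization of the triple as CF[GW]POS is an assumption based on the usual Oracc citation style, and the function name make_w is hypothetical; the text above does not specify the exact string format.

```python
import xml.etree.ElementTree as ET


def make_w(form, cf, gw, pos):
    """Build a TEI w element whose @lemma holds the CF/GW/POS triple.

    The 'cf[gw]pos' layout is assumed, not taken from the converter.
    """
    w = ET.Element("w", {"lemma": f"{cf}[{gw}]{pos}"})
    w.text = form
    return w
```

For instance, make_w("lugal", "lugal", "king", "N") produces a w element with the content lugal and lemma="lugal[king]N".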

Handling of orthographic forms which contain more than one grammatical word is not discussed in TEI P5. The approach taken in the XTF/TEI conversion is to wrap the entire orthographic form in the first w tag, then to emit additional w elements with empty content as hosts for the subsequent lemmata.
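The approach just described might look like the following in Python. This is a sketch under assumptions: the function name words_for_form is hypothetical, and the form and lemma strings in the usage note are placeholders rather than real data.

```python
import xml.etree.ElementTree as ET


def words_for_form(form, lemmata):
    """Emit one w element per lemma for a single orthographic form.

    The first w wraps the whole orthographic form; each further w is
    empty and exists only to host a subsequent lemma.
    """
    if not lemmata:
        raise ValueError("at least one lemma is required")
    words = []
    for index, lemma in enumerate(lemmata):
        w = ET.Element("w", {"lemma": lemma})
        if index == 0:
            w.text = form  # only the first w carries text content
        words.append(w)
    return words
```

Given a placeholder form "FORM" and two placeholder lemmata, the result is two w elements: the first with the text FORM, the second with no content.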

Persons

Persons are handled in conformance with TEI P5. Two lists are generated from the names glossary: a listPerson and a listNym. These are then referenced from the forename tags in the body of texts. At present the export of data from the names glossary to listPerson/listNym is not complete, but the father, gfather and ancestor properties are emitted as relations in the listPerson.

Places

Not yet annotated.

Measures

Not yet annotated.

18 Dec 2019 osc at oracc dot org

Steve Tinney

Steve Tinney, 'Text Encoding Initiative (TEI) for Oracc', Oracc: The Open Richly Annotated Cuneiform Corpus, Oracc, 2019 [http://oracc.museum.upenn.edu/doc/about/standards/tei/]