The main aim of the LEADERS Project is to enhance on-line access to archives. In achieving this high level aim, we are seeking to link two established and largely complementary encoding schemes, Encoded Archival Description (EAD) to provide metadata about archival collections, and the Text Encoding Initiative (TEI) used to encode structured electronic versions of the archival documents themselves. We are also using a third encoding scheme called Encoded Archival Context (EAC) which is a standard for the provision of information about the creators of archival material. This information takes the form of an authority record with an emphasis on the provision of a biographical or administrative history for the person or organisation concerned. In linking these encoding schemes we are using the expanding Extensible Markup Language (XML) family of standards and tools to create a generic toolkit which will be reusable by archivists and others who wish to provide on-line access to archives. To illustrate the potential of the toolkit we have produced a demonstrator application based on sample documents from the Orwell Archive and UCL's own archive. In relation to this material, we have created EAD encoded finding aids, EAC encoded authority records, TEI encoded transcripts and digital images of the original archive documents to act as test data for the demonstrator.
The LEADERS Project uses the term 'archives' to indicate a set or collection of documents which are being preserved because they have long-term value. The work of the archivist is concerned with providing the means through which individuals can access archives. Users of archives need access tools that will describe the contents of archive collections. Such access tools are often called 'finding aids' and they are produced through a process of capturing, collating, analysing, and organising any information that serves to identify, locate and interpret archive documents within archive collections and explain the contexts and record systems from which the documents have been selected.
EAD is an encoding standard for archival description, created and used by archivists to structure and exchange electronic finding aids using technology that is independent of proprietary hardware and software platforms. However, as a stand-alone tool, EAD cannot give users access to the actual content of archive documents. There is therefore real potential in a resource that integrates EAD with other tools that do allow remote use of the contents of archival collections. The TEI is one such tool: it enables electronic texts of all kinds to be searched and presented to users in a variety of different ways.
When EAD and TEI are brought together alongside digitised images of archival material the potential benefits for users are significant. Within a single environment the user can find items in archival collections; learn about their contexts; view representations of the items themselves; and read, study, analyse and manipulate their content.
On the surface, EAD and TEI appear to be perfectly compatible with one another, so integrating the two could be as simple as placing a link to the TEI transcript at the relevant part of the EAD finding aid. In fact, this is exactly what many archival repositories have started to do to bring the two together.
However, on closer examination an issue begins to emerge which complicates the simplicity of this integration method, relating to an overlap between information held in an EAD encoded finding aid and information usually held in a TEI document. Although TEI is primarily concerned with re-presenting the intellectual content of documents and texts it also necessarily contains metadata that can put the text in context. This metadata must describe the newly created object of study and describe the original object from which the text has been derived. When used to provide access to archive documents, overlaps between EAD and TEI therefore exist in relation to metadata that:
In any LEADERS application we need to capture metadata that describes the TEI transcripts and the digitized images, and, at the same time, provide an adequate description of the original archive documents from which these digital forms were derived. On the one hand we have EAD, which provides an excellent framework for seeing archive documents in relation to a whole archive collection. Through the use of multiple levels of description the collection can first be described as a whole and then as smaller parts, which get more specific at each level, until at the lowest level the individual archive documents are described. However, EAD is primarily designed to hold data about original archive documents and not digital representations and so cannot adequately cover all of the metadata we are interested in giving the user. On the other hand we have TEI which through its header provides the metadata relevant to the electronic transcription but in its description of the original archive document lacks the capability to describe the overall context gained through the use of EAD's hierarchy of descriptive levels.
Given that the overlap between EAD and TEI exists we are left with four choices as to how it can be handled:
The first of these choices is unacceptable because it means that archivists will have to input the same information in different places, reshaping it each time to reflect how metadata is recorded within each framework. Furthermore, from the viewpoint of data management, repetition of this kind is a major source of errors and makes it more likely that inconsistencies will creep into the data.
The second of these choices would not solve the problem of overlap between EAD and TEI because metadata that relates to the content of the document is the same for both the original and its derived digital representations. Therefore, if the metadata for the original, the transcript and the image were held separately then the same content information would need to be repeated across the three.
When considering the third choice it is important to acknowledge the work that has been carried out by the MASTER Project. MASTER seeks to establish a standard for the creation of computer-readable descriptions of manuscripts, using TEI. The structure devised by the MASTER team allows for the creation of simple manuscript inventories or highly-formalised descriptions which include complete manuscript transcriptions and complete digital facsimiles of the manuscript. The aim of MASTER in describing the original manuscripts and providing access to digital representations is similar to our goals on LEADERS. Ultimately, however, we decided it was not compatible with our particular needs because MASTER is designed to provide access to manuscripts where each document is a separate and unique entity. We, by contrast, are interested in providing access to archive collections where the individual documents are not described separately but as component parts of a larger whole, and EAD is directly suited to this need. Furthermore, there is an intellectual question over whether the <teiHeader> is a suitable place for recording metadata relating to the digital image.
The fourth option stands out as being the best, both in terms of intellectual principles and practical solutions because EAD allows archivists to describe archival materials in a way that respects their provenance. The concept of provenance is a vital component of archival description. It asserts that the contents of archive documents are intrinsically bound up with the life of the individual or the functions of the organisation from which they emanated. If the relationship between the emanating organization or person and the documents that they create or assemble is lost then the documents themselves cannot be fully understood. It is therefore vital that the documents are retained and described as a body of related materials connected by the person or organizations that created or assembled them. If EAD, with its multiple levels of description and its ability to describe the whole and its component parts, is not used as the metadata framework then the principle of provenance and the concept of the archive collection as a whole is lost.
Expanding EAD to include metadata about the transcripts and the digital images does not involve major restructuring or extensive revision, only extension. Furthermore, having all the metadata in one place for both the original and the digital documents makes sense when there is both convergence and divergence between them (i.e. since the content metadata are the same across all three forms, it makes sense for all the metadata to be brought together so that metadata relating to content isn't repeated in three separate places).
Our solution to the overlap problem is to use EAD as our overall metadata framework, while enriching it so that the finding aid can act as an adequate holding place for metadata relevant to the search, retrieval and interpretation of both the original documents and their derived digital representations. In order for this to work, the similarities and differences between the original and the digital forms need to be understood, as they have a direct impact on the adequate capture of metadata relating to both.
The original, the TEI encoded transcript and the digital image of the archive document have one thing in common: their content. This content is captured in different ways but it is nevertheless the same for all three. Therefore any metadata about content is equally applicable to both the original and its digital forms.
The original, the transcript and the image are likely to differ in every other respect. Therefore their identification, administrative and contextual metadata will vary and will need to be recorded separately for each form.
In order to successfully apply the EAD framework to the kind of applications that will be built using the LEADERS toolkit, EAD needs to be expanded so that it can adequately capture the different metadata for the digital forms. A close examination of the EAD DTD for the most relevant 'hooks' in which to place this metadata led to a consideration of the EAD element called <altformavail> which according to the tag library should be used to hold:
"information about copies of the [original] materials being described, including the type of alternative form, significant control numbers, location, and source for ordering if applicable. The additional formats are typically microforms, photocopies, or digital reproductions."
It is the element within the EAD framework that is closest to our requirements for holding information about digital forms. However, the child elements that can be used within <altformavail> are limited and do not allow for the detailed and structured metadata required.
It is necessary to distinguish between the two types of digital form that will be held within LEADERS applications, namely transcripts and images. In relation to the transcripts, the <teiHeader> provides a useful starting point for considering what information needs to be recorded to make the transcript identifiable and intelligible. In relation to digital images, the NISO MIX Schema provides a comprehensive overview of the technical metadata that needs to be associated with images. From these two sources it becomes clear that some elements from both schemes should be added to the EAD framework.
The need to expand EAD led to the development of a LEADERS EAD Schema which is based on the EAD2002 DTD but includes fragments from TEI and NISO MIX. The schema development was driven largely by the practical experience and needs of the project.
Within the EAD framework, all the levels of description within the hierarchy lead to the lowest level of description (item level) where metadata about the actual physical archive document is held. Therefore it is at item level that the intersection between the archive document and its digital representations occurs and it is here that the <altformavail> element can be used with references to TEI and NISO MIX as applicable.
This example is an item level description from the George Orwell Papers for one particular document entitled 'Notes on the Spanish Militias', and shows how metadata about the original and the digital forms is brought together in a finding aid.
This item level description within the finding aid for the George Orwell Papers begins with a description of the original document, providing details such as the title, date of creation, unitid, physical description, controlled vocabulary terms, and a description of the document's scope and content. It is important to remember that this item level description sits within a full multi-level finding aid where a rule of inheritance is applied to the data. Some information relevant to this original document, such as creator details and details of the repository where the document is held, therefore sits at the highest level of description in the document tree because it applies to all the material held within the collection. This explains why the item level description of the original looks a little less detailed than might be expected.
Having given details of the original document, we then move on to describe any digital representations of the original. The first of these is the TEI transcript, which is described in terms of its title, date of creation, id, creator, funder, rights information and encoding description. Following the description of the transcript are the descriptions of the digital images relating to this item. As this item is made up of three pages there are three images, which are described separately, each with a title, date, creation information, id and technical metadata.
The raw XML illustrates the mixture of EAD, TEI and NISO MIX tags that have been used to create the metadata.
The item level description begins by using EAD tags to describe the original archive document. Anyone who has used EAD will be familiar with the core elements available in the <did> element, such as <unittitle>, <unitdate> and <abstract>. The controlled vocabulary terms are enclosed in the <controlaccess> element and a description of the document's content in the <scopecontent> element. All of this is standard EAD. We have then used the <altformavail> element with a type attribute set to "transcription", and it is here that we use a mixture of EAD and TEI elements to describe the TEI encoded transcription. First come some EAD tags giving the title of the transcript, its id, its date of creation and its creator. Then we have introduced the TEI tag for funder, the TEI publication statement for assigning copyright, and the TEI's <encodingdesc>, which is vital for recording sampling and editorial decisions imposed on the transcription. We then close that <altformavail> element and open another, this time with the type attribute set to "online image", and it is here that the image metadata is placed using a mixture of EAD and NISO MIX elements.
Because there is usually a one-to-many relationship between a document and the images of the pages that make up the document, this <altformavail> element contains a series of <altformavail> child elements, one for each separate image. Within each of these the image is described, starting with EAD elements for title, id and date of creation, and then using NISO MIX elements for more technical metadata such as MIMETYPE, DEVICESOURCE and SOURCETYPE.
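The nesting described above can be sketched in a few lines of Python. This is an illustration only, not the LEADERS schema itself: the element nesting follows the prose, but the unitid, the titles and the values in the fragment are invented, and the helper name digital_forms is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical item-level fragment. The nesting follows the description
# above; the ids, titles and values are invented for illustration.
ITEM_XML = """
<c level="item">
  <did>
    <unittitle>Notes on the Spanish Militias</unittitle>
    <unitid>EXAMPLE-1</unitid>
  </did>
  <altformavail type="transcription">
    <unittitle>Transcript of Notes on the Spanish Militias</unittitle>
    <funder>Example funder</funder>
  </altformavail>
  <altformavail type="online image">
    <altformavail>
      <unittitle>Page 1</unittitle>
      <MIMETYPE>image/jpeg</MIMETYPE>
    </altformavail>
    <altformavail>
      <unittitle>Page 2</unittitle>
      <MIMETYPE>image/jpeg</MIMETYPE>
    </altformavail>
  </altformavail>
</c>
"""


def digital_forms(item):
    """Return (type, title) pairs for each digital form of an item."""
    forms = []
    for alt in item.findall("altformavail"):
        kind = alt.get("type")
        children = alt.findall("altformavail")
        if children:
            # One nested <altformavail> per page image.
            for child in children:
                forms.append((kind, child.findtext("unittitle")))
        else:
            forms.append((kind, alt.findtext("unittitle")))
    return forms


item = ET.fromstring(ITEM_XML)
forms = digital_forms(item)
```

Here forms holds one entry for the transcript and one for each page image, which mirrors the one-to-many relationship between a document and its images.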
The use of EAD as an all encompassing metadata framework means that the metadata about the digital documents will not be stored in the same file as the transcripts and images themselves. In relation to the TEI transcript, the <teiHeader> is effectively made redundant because all the information that is usually stored there will be in the finding aid. In an integrated system where the finding aid, the transcripts and the images are managed together this is acceptable. However we are also keen to create documents that can be exported and used in other systems and applications for other purposes as stand-alone documents. By stand-alone we mean documents that contain all the necessary components that make them complete and intelligible, and metadata is one such component. In relation to the TEI transcript it means that the transcript and its metadata must be able to be brought together in one file to enable successful export out of the LEADERS application. Fortunately, this can be achieved through the use of XML stylesheets.
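As a rough sketch of what such an export has to do, the Python below rebuilds a header from metadata held elsewhere and wraps a header-less transcript in it. LEADERS does this with XML stylesheets; Python is used here only to illustrate the idea. The function name export_standalone, the metadata keys, the sample values and the choice of root element name are all assumptions, and only a title and a funder are mapped.

```python
import xml.etree.ElementTree as ET


def export_standalone(transcript_text_xml, metadata):
    """Wrap a header-less transcript in a document with a rebuilt header."""
    # Root element name is an assumption made for this sketch.
    root = ET.Element("TEI.2")
    header = ET.SubElement(root, "teiHeader")
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    # Metadata that would come from the finding aid is copied into the header.
    ET.SubElement(title_stmt, "title").text = metadata["title"]
    ET.SubElement(title_stmt, "funder").text = metadata["funder"]
    # The transcript body is attached unchanged.
    root.append(ET.fromstring(transcript_text_xml))
    return ET.tostring(root, encoding="unicode")


standalone = export_standalone(
    "<text><body><p>Sample transcript.</p></body></text>",
    {"title": "Example transcript", "funder": "Example funder"},
)
```

The resulting string is a single document carrying both the transcript and its descriptive metadata, which is the property a stand-alone export needs.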
Another area which we had to consider in the development of our solution relates to the use of metadata for controlled vocabulary subject-based search and retrieval within any LEADERS application, i.e. the retrieval of relevant documents from the application that provide subject-coverage for people, places, topics or dates as specified in the user's search request.
In any given LEADERS application, controlled versions of names, topics and dates are encoded within the EAD finding aid in the <controlaccess> element. However, through the guidelines that will accompany our TEI for Archives DTD, we are also encouraging the encoding of names and dates within the TEI transcript itself. There is therefore a potential overlap between the EAD finding aid and the body of the TEI text, as data placed in EAD's <controlaccess> and data placed in the TEI's names and dates tags could both be used as controlled vocabulary index terms from which it would be possible to retrieve documents relating to particular people, places and dates.
In the end we came to the conclusion that this overlap is actually a help and not a hindrance in building up controlled vocabulary indexes that can be used for search and retrieval within the application. We have developed a stylesheet which reads the information stored in <name> elements within TEI and the terms placed in <controlaccess> within EAD, checks for and discards any that are repeated across the TEI text and the EAD finding aid, and then formulates an index term list. This list combines the two sets of terms with other descriptive data extracted from the EAD finding aid for the formulation of a hitlist. The search engine then interrogates this index document in response to a user query. A full description of the LEADERS indexing techniques can be found here.
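The merge-and-discard step can be sketched as follows. This is a simplified Python illustration, not the LEADERS stylesheet: the sample terms are invented, the function name build_index_terms is hypothetical, and treating two terms as duplicates when they match case-insensitively is an assumption about how repeats might be detected.

```python
import xml.etree.ElementTree as ET

# Invented sample data for illustration only.
EAD_XML = """
<c level="item">
  <controlaccess>
    <corpname>POUM</corpname>
    <geogname>Spain</geogname>
  </controlaccess>
</c>
"""
TEI_XML = """
<text><body><p>I joined the <name>POUM</name> militia in
<name>Barcelona</name>.</p></body></text>
"""


def build_index_terms(ead_item, tei_text):
    """Combine EAD and TEI terms, keeping the first copy of any repeat."""
    ead_terms = [el.text for el in ead_item.find("controlaccess")]
    tei_terms = [el.text for el in tei_text.iter("name")]
    terms = []
    seen = set()
    for term in ead_terms + tei_terms:
        key = term.strip().lower()  # assumed matching rule
        if key not in seen:
            seen.add(key)
            terms.append(term.strip())
    return terms


index_terms = build_index_terms(ET.fromstring(EAD_XML), ET.fromstring(TEI_XML))
```

Because the EAD terms are read first, a term that appears in both places is kept in its finding aid form and the repeat from the transcript is dropped.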
Having determined the conceptual and practical relationships between EAD finding aids, TEI transcriptions and EAC contextual description, we still needed links between the documents that could be parameterised and passed between application components if the system was to be usable. All three schemes define their own linking mechanisms (those of EAD and EAC are the same), but they also all support the XLink mechanism, which is the one used in LEADERS. XLink has been used to encode the links between the EAD finding aids and the EAC creator descriptions, the TEI transcripts and the images.
A consistent file naming convention has been used, based on the reference codes used in the finding aid. These codes are themselves part of an international numbering scheme used by the archive community to ensure the unique identification of each unit of description. Thus in LEADERS we have derived the file names for the TEI documents and images from the unitid of the relevant item used in the finding aid. Using XLink enables our links to be parsed by our stylesheets and scripts, which can extract the file reference and use it as a parameter to be passed from one component of the application to another, thus ensuring that the creator history, images and finding aid fragments in the detailed displays match the transcript being displayed.
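Extracting a file reference from a link can be sketched as below. The XLink namespace URI is the standard one, but everything else is an assumption made for illustration: the unitid, the file path, the placement of the href attribute on <altformavail>, and the function name file_reference.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical item: the unitid and path are invented, and the file name
# is derived from the unitid as the naming convention above describes.
ITEM_XML = """
<c level="item" xmlns:xlink="http://www.w3.org/1999/xlink">
  <did><unitid>EXAMPLE-1-2</unitid></did>
  <altformavail type="transcription" xlink:href="transcripts/EXAMPLE-1-2.xml"/>
</c>
"""


def file_reference(element):
    """Return the file name, without directory or extension, from an href."""
    href = element.get(f"{{{XLINK_NS}}}href")
    return PurePosixPath(href).stem


item = ET.fromstring(ITEM_XML)
reference = file_reference(item.find("altformavail"))
```

Since the file name is derived from the unitid, the extracted reference can be passed between components as a single parameter that identifies the item, its transcript and its images.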
A different mechanism has been used to link the EAC descriptions to text fragments in the TEI transcripts. We have used the key attribute to associate an EAC document with a name element within the transcript thus:
.. I joined the <name key="GB01038-a">POUM</name> militia at the end of <date>1936</date>....
The name of the EAC document is the same as the id assigned to the record in the description, which again is part of an international scheme for unique identifiers. As with the other links, this mechanism allows the stylesheet to construct links to the correct EAC document, and also enables the indexing stylesheet to tunnel through the TEI transcript to the EAC document in search of authorised forms of name to index.
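A minimal Python sketch of the link construction follows, using the transcript fragment quoted above. The directory name, the .xml extension and the function name eac_links are assumptions; the source states only that the EAC document name matches the key value.

```python
import xml.etree.ElementTree as ET

# The fragment quoted above, wrapped in a <p> so that it parses on its own.
TEI_FRAGMENT = (
    '<p>I joined the <name key="GB01038-a">POUM</name> militia '
    'at the end of <date>1936</date>.</p>'
)


def eac_links(tei_element, eac_dir="eac"):
    """Map each keyed name in the transcript to an assumed EAC file path."""
    links = {}
    for name in tei_element.iter("name"):
        key = name.get("key")
        if key:
            # Directory and extension are hypothetical.
            links[name.text] = f"{eac_dir}/{key}.xml"
    return links


links = eac_links(ET.fromstring(TEI_FRAGMENT))
```

An indexing process could follow the same path to open the EAC document and read the authorised form of the name from it.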