LEADERS Documentation > Architecture and Technology

Design considerations

The LEADERS project has developed a generic Extensible Markup Language (XML)-based toolkit for use on multiple projects and with a wide variety of archival source materials. This toolkit will enable the development of applications where transcripts and images of archive documents are delivered to the end-user alongside descriptive information.

The overarching generic requirement has led us to concentrate on reusability at all levels in our design.

There are three principal components of the system, each of which can be re-used independently of the others:

Data files
LEADERS Toolkit
LEADERS Demonstrator Application

In order to meet the generic requirement of the project from a technical viewpoint, the system is built entirely using open source, standard and platform-independent technologies. Thus the principal files are XML and related formats, and the images are in JPEG format. LEADERS-specific programs are written in Java and XSP, with some small elements using JavaScript. Third-party programs, where used, are all Java-based and available under open source licenses.

The encoding of the finding aids, the source documents and the EAC files, along with the basic system design and application design, has been done by the LEADERS Project team. We contracted BookMARC, a software development organisation based at the University of Coimbra, Portugal, to implement the designs for both the toolkit and the application. BookMARC were selected because of their experience in using XML and Web Services in a bibliographic context, and because of their commitment to open source developments. During the course of development BookMARC worked very closely with the project team and played a major part in refining and enhancing the architecture in order to realise it.

Data files

The data files consist of XML files and images in JPEG format. The JPEGs were derived from high resolution TIFF images which were created by photographing the original documents. The image files are sized for viewing on 1024x768 resolution screens.

Standard XML encoding frameworks

There are three encoding frameworks that form the building blocks for the LEADERS toolkit. These are:

The Text Encoding Initiative (TEI) for transcribing and encoding archive documents.
The Encoded Archival Description (EAD) for encoding archive finding aids.
The Encoded Archival Context (EAC) for encoding administrative and biographical information about the organisations and people connected with the archival material.

These encoding schemes have been adapted in order to use them in coordination with each other. The issues and changes are described in the related encoding documentation. In the LEADERS Demonstrator there are:

2 finding aids (for the Orwell materials and for the UCL documents)
14 TEI documents
54 EAC documents
114 image files

The data files may reside on the same server and file system as the toolkit and/or the application. In theory they may also be held in an entirely separate location, provided that location is HTTP-addressable by the toolkit server, although this has not been tested.

LEADERS Toolkit

The LEADERS toolkit comprises DTDs, Schemas and encoding guidelines, indexing utilities, a search engine and Web Services.

The DTDs, Schemas and encoding guidelines are used to create the data files. Any source materials encoded using these tools can be exploited by the indexing, search and Web Services of the toolkit, and can thus act as a platform for a client application.

The toolkit is designed to operate within a Cocoon framework and to run from a servlet container, in our case Tomcat (see System environment below). The toolkit files may be hosted on the same server as the data files and the application, or on a different one.

Indexing Techniques

The indexing utilities consist of an XSL stylesheet and a Java routine, which together harvest index data from the <controlaccess> tags in the EAD files, as well as from the <name> elements in the TEI documents and the <pershead> and <corphead> tags in the EAC documents.

The index stylesheet may be modified so that more and/or different tags are indexed, and so that different descriptive information and link data are harvested.

The index stylesheet creates an index document for each finding aid, which contains all the index terms, together with basic descriptive information and links from the EAD file to the other data files.
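The harvesting step can be illustrated with a minimal sketch. It is not the LEADERS stylesheet or Java routine: it swaps in the JDK's built-in DOM and XPath support, and the class name, the output format and the sample EAD fragment are all assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the LEADERS toolkit.
class ControlAccessHarvester {

    /** Returns "elementName: text" for every element directly inside a <controlaccess>. */
    static List<String> harvest(String eadXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(eadXml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Selects all element children of any <controlaccess>, wherever it occurs.
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//controlaccess/*", doc, XPathConstants.NODESET);
            List<String> terms = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Node n = nodes.item(i);
                terms.add(n.getNodeName() + ": " + n.getTextContent().trim());
            }
            return terms;
        } catch (Exception e) {
            throw new IllegalStateException("Could not harvest index terms", e);
        }
    }

    public static void main(String[] args) {
        // Invented sample data; a real finding aid is far larger.
        String ead =
              "<ead><archdesc><controlaccess>"
            + "<persname>Orwell, George</persname>"
            + "<corpname>University College London</corpname>"
            + "</controlaccess></archdesc></ead>";
        harvest(ead).forEach(System.out::println);
    }
}
```

A fuller version would also collect the descriptive information and link data mentioned above, and would treat nested <controlaccess> groups and their <head> elements separately rather than as terms.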

The index document is used by the search engine, Lucene, to generate browseable indexes for searching and retrieval.

When creating the index with Lucene, a Lucene Document is added for each document element in the index XML file. The Lucene Document contains the header element as an XML string and the entry elements as searchable fields. Storing the header element in the index as an XML string simplifies the display of search results, because it is enough to apply an XSL stylesheet to the information returned from the index.

The searchable fields are stored in the index as lower-case strings, which allows case-insensitive searches. However, a problem arose when displaying the search fields held in the index: some of the values are acronyms and must be displayed in upper case. To solve this, the original value of each searchable field is added to the index alongside the lower-cased version and is used for display.
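The idea of keeping a lower-cased search key next to the original display value can be sketched in plain Java. This deliberately does not use the Lucene API; the class, its methods and the sample header content are hypothetical stand-ins for a Lucene Document with a stored header and searchable fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical in-memory stand-in for the index; not Lucene and not LEADERS code.
class CaseInsensitiveIndex {

    /** One indexed entry: the header kept as an XML string, plus its terms. */
    private static final class Entry {
        final String headerXml;
        // lower-cased search key -> original spelling used for display
        final Map<String, String> terms = new LinkedHashMap<>();

        Entry(String headerXml, List<String> originalTerms) {
            this.headerXml = headerXml;
            for (String term : originalTerms) {
                terms.put(term.toLowerCase(Locale.ROOT), term);
            }
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    void add(String headerXml, List<String> originalTerms) {
        entries.add(new Entry(headerXml, originalTerms));
    }

    /** Original spellings of indexed terms that equal the query, ignoring case. */
    List<String> search(String query) {
        String key = query.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Entry e : entries) {
            String original = e.terms.get(key);
            if (original != null) {
                hits.add(original);
            }
        }
        return hits;
    }

    /** Stored header XML of every entry that contains the query term, ignoring case. */
    List<String> headers(String query) {
        String key = query.toLowerCase(Locale.ROOT);
        List<String> result = new ArrayList<>();
        for (Entry e : entries) {
            if (e.terms.containsKey(key)) {
                result.add(e.headerXml);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        CaseInsensitiveIndex index = new CaseInsensitiveIndex();
        // Invented sample header and terms.
        index.add("<header><title>Sample finding aid</title></header>",
                  List.of("UCL", "Orwell"));
        System.out.println(index.search("ucl"));
        System.out.println(index.headers("orwell"));
    }
}
```

Searching for "ucl" matches on the lower-cased key but returns the stored spelling "UCL", which is the behaviour the paragraph above describes. The stored header string would then be handed to an XSL stylesheet for display.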

This indexing approach has two benefits. Firstly, it improves the speed of retrieval and display by avoiding the need to search through large numbers of documents, in particular EAD documents, which can become extremely large and complex. Secondly, it saves the overhead of using a database purely to speed up searching. Since LEADERS requires no data management functionality, and the data files, once created, will be non-volatile, there is no need to use a database.

Web Services

The other major element in the LEADERS Toolkit is the Web Services. These services, written in Java, provide the search, retrieve, and display functions which are consumed by the client application and are described in the LEADERS WSDL file. The main benefits achieved by using Web Services are:

System independence - the client application can exist on a different file system, and a different operating system, and be written in a different language to the toolkit and the data.
Application flexibility - the same data and services may be consumed by different applications, concurrently or at different times. For example, the encoded data may be consumed by an application which provides highly structured and pre-defined access, by an application which allows a high degree of flexibility in search functions, and perhaps by a further application which simply uses the data as illustrative material for educational purposes, e.g. to teach palaeography.
Extensibility and adaptation - the use of Web Services makes it simpler to add and/or amend the functions offered by the toolkit to exploit the data.

LEADERS Demonstrator Application

The present application has been constructed to meet three main purposes.

Primarily it is designed to show the potential of having transcripts of archival documents presented on-line alongside digitised images and contextual material, and as such to serve as a vehicle for obtaining user feedback.
It serves as an example of an application which can be built on the LEADERS Toolkit, consuming the Web Services.
It is built entirely from industry standard, open source components and is capable of modification and/or extension in its own right.

The application itself consists of a series of XML and XHTML documents, XSL stylesheets, CSS stylesheets and eXtensible Server Pages (XSP) scripts. The files are parameterised and, like the toolkit, are designed to operate together in a Cocoon framework and to run from a Tomcat servlet container, although not necessarily on the same physical server. Xerces and Xalan are used to parse the XML files and to apply the XSL stylesheets, both to the application's own XML files and to the data files served up by the Web Services of the toolkit.

The colour scheme, fonts etc. can be changed by modifying the CSS stylesheets, and the content and layout of the search and results screens and the on-line help can be changed by modifying the XSL stylesheets. Alternatively, a completely new application may be created, based around the services described in the LEADERS WSDL file.

System environment

Tomcat Application Server

Tomcat is an open source servlet container that implements the Java Servlet and JavaServer Pages (JSP) specifications. The Tomcat version used in this project, version 4.x, implements the Servlet 2.3 and JSP 1.2 specifications.

Tomcat is a widely used servlet container that can run as a standalone server or be integrated with the Apache Web Server in Unix environments and with IIS in Windows environments.

Tomcat is maintained and supported by the Apache Software Foundation.

Cocoon Framework

Cocoon is a Java server framework that allows dynamic publishing of XML content using XSLT transformations. The use of XML to describe content and XSLT to transform that content into multiple formats turns Cocoon into a platform for building applications with strong separation between content, logic and presentation.

Cocoon is supported by the Apache Software Foundation.

Tutorials: Introduction to Cocoon 2; Working with XSP in Apache Cocoon 2

Cocoon allows the configuration of pipelines of transformations to support the functionality of the demonstrator application. Within Cocoon, HTML is rendered on the fly. When Cocoon sees a request for a file with a .html extension, it fetches the corresponding .xml file and applies the transformation. It caches as needed to improve performance.

Here is an example of how Cocoon maps transformations:

<map:match pattern="teidocs/*.html">
  <map:generate src="teidocs/{1}.xml"/>
  <map:transform src="stylesheets/transcript.xsl"/>
  <map:serialize type="html"/>
</map:match>

How it works:

When a request is made to http://someURL/teidocs/orwell.html
Cocoon gets the orwell name from the request
Gets the corresponding XML file as specified in map:generate
Applies the defined stylesheet in the map:transform
And outputs the result as HTML as defined with map:serialize
The HTML version is cached, and Cocoon will serve it the next time if neither the XML nor the stylesheet has changed
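The steps above can be approximated outside Cocoon with the XSLT support built into the JDK. The following is only a sketch of the match, generate, transform, serialize and cache sequence, not Cocoon's implementation: the class name, the in-memory map standing in for the file system, the content-comparison cache and the sample XML and stylesheet are all assumptions.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical illustration of the pipeline; not Cocoon code.
class MiniPipeline {

    // Regex equivalent of pattern="teidocs/*.html"; group 1 plays the role of {1}.
    private static final Pattern MATCH = Pattern.compile("teidocs/([^/]*)\\.html");

    /** Cached output together with the inputs it was produced from. */
    private static final class Cached {
        final String xml, xsl, html;
        Cached(String xml, String xsl, String html) {
            this.xml = xml; this.xsl = xsl; this.html = html;
        }
    }

    private final Map<String, String> sources; // source path -> XML text (stand-in for files)
    String stylesheet;                         // XSL text (stand-in for transcript.xsl)
    private final Map<String, Cached> cache = new HashMap<>();
    int transformCount = 0;                    // number of real transformations performed

    MiniPipeline(Map<String, String> sources, String stylesheet) {
        this.sources = sources; // kept by reference so later edits to the map are seen
        this.stylesheet = stylesheet;
    }

    /** map:match and map:generate: request path -> source path, or null if no match. */
    static String resolveSource(String requestPath) {
        Matcher m = MATCH.matcher(requestPath);
        return m.matches() ? "teidocs/" + m.group(1) + ".xml" : null;
    }

    /** Returns the HTML for a request, or null when nothing matches. */
    String handle(String requestPath) {
        String src = resolveSource(requestPath);
        String xml = (src == null) ? null : sources.get(src);
        if (xml == null) {
            return null;
        }
        Cached hit = cache.get(src);
        if (hit != null && hit.xml.equals(xml) && hit.xsl.equals(stylesheet)) {
            return hit.html; // neither input has changed: serve the cached page
        }
        String html = transform(xml, stylesheet);
        cache.put(src, new Cached(xml, stylesheet, html));
        return html;
    }

    /** map:transform and map:serialize: apply the stylesheet and return the output text. */
    private String transform(String xml, String xsl) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            transformCount++;
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> files = new HashMap<>();
        // Invented stand-in document; a real TEI transcript is far richer.
        files.put("teidocs/orwell.xml", "<doc><title>Sample transcript</title></doc>");
        String xsl =
              "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
            + "<xsl:output method='html'/>"
            + "<xsl:template match='/doc'><html><body><h1>"
            + "<xsl:value-of select='title'/></h1></body></html></xsl:template>"
            + "</xsl:stylesheet>";
        MiniPipeline pipeline = new MiniPipeline(files, xsl);
        System.out.println(pipeline.handle("teidocs/orwell.html"));
        pipeline.handle("teidocs/orwell.html"); // second request is served from the cache
        System.out.println("transformations: " + pipeline.transformCount);
    }
}
```

The sketch decides whether a cached page is still valid by comparing the input text, which keeps it self-contained. A file-based setup would more likely compare modification times of the XML file and the stylesheet.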