The LEADERS project has developed a generic Extensible Markup Language (XML)-based toolkit for use on multiple projects and with a wide variety of archival source materials. This toolkit will enable the development of applications where transcripts and images of archive documents are delivered to the end-user alongside descriptive information.
The overarching generic requirement has led us to concentrate on reusability at all levels in our design.
There are three principal components of the system (the data files, the toolkit and the application), each of which can be re-used independently of the others.
To meet the generic requirement of the project from a technical viewpoint, the system is built entirely using open source, standard and platform-independent technologies. Thus the principal files are XML and related formats, and the images are in JPEG format. LEADERS-specific programs are written in Java and XSP, with some small elements using JavaScript. Third-party programs, where used, are all Java-based and available under open source licences.
The encoding of the finding aids, the source documents and the EAC files, along with the basic system design and application design, has been done by the LEADERS Project team. We contracted BookMARC, a software development organisation based at the University of Coimbra, Portugal, to implement the designs for both the toolkit and the application. BookMARC were selected because of their experience in using XML and Web Services in a bibliographic context, and because of their commitment to open source development. During the course of development BookMARC worked very closely with the project team and played a major part in refining and enhancing the architecture in order to realise it.
The data files consist of XML files and images in JPEG format. The JPEGs were derived from high resolution TIFF images which were created by photographing the original documents. The image files are sized for viewing on 1024x768 resolution screens.
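There are three encoding frameworks that form the building blocks for the LEADERS toolkit: Encoded Archival Description (EAD) for the finding aids, the Text Encoding Initiative (TEI) guidelines for the transcripts of the source documents, and Encoded Archival Context (EAC) for contextual information about persons and corporate bodies.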
These encoding schemes have been adapted in order to use them in coordination with each other. The issues and changes are described in the related encoding documentation. In the LEADERS Demonstrator there are:
The data files may reside on the same server and file system as the toolkit and/or the application. In theory they may also be held in an entirely separate location, provided it is HTTP-addressable by the toolkit server, although this has not been tested.
The LEADERS toolkit comprises the DTDs, Schemas and encoding guidelines, indexing utilities, a search engine and web services.
The DTDs, Schemas and encoding guidelines are used to create the data files. Any source materials encoded using these tools can be exploited by the indexing, search and web services of the toolkit, and can thus act as a platform for a client application.
The toolkit is designed to operate within a Cocoon framework and to run from a servlet container, in our case Tomcat. See System environment below. The toolkit files may be hosted on the same server as the data files and the application, or on a different one.
The indexing utilities consist of an XSL stylesheet and a Java routine, which together harvest index data from the <controlaccess> tags in the EAD files, as well as from the <name> elements in the TEI documents and the <pershead> and <corphead> tags in the EAC documents.
The index stylesheet may be modified so that more and/or different tags are indexed, and so that different descriptive information and link data are harvested.
The index stylesheet creates an index document for each finding aid, which contains all the index terms, together with basic descriptive information and links from the EAD file to the other data files.
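The harvesting step can be illustrated with a minimal Java sketch. It is not the LEADERS indexing utility, which combines an XSL stylesheet with a Java routine; it only shows the general idea of collecting index terms from <controlaccess> elements, using the standard Java XML parser. The class name, the method name and the sample EAD fragment are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of the LEADERS toolkit.
final class ControlAccessHarvester {

    /** Returns the trimmed text of each index term found directly inside a controlaccess element. */
    static List<String> harvest(String eadXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(eadXml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse the EAD document", e);
        }
        List<String> terms = new ArrayList<>();
        // getElementsByTagName returns nested controlaccess elements too, in document order.
        NodeList blocks = doc.getElementsByTagName("controlaccess");
        for (int i = 0; i < blocks.getLength(); i++) {
            NodeList children = blocks.item(i).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                if (child.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace and comments
                }
                String name = child.getNodeName();
                if (name.equals("head") || name.equals("controlaccess")) {
                    continue; // headings are not terms; nested blocks are visited separately
                }
                terms.add(child.getTextContent().trim());
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Invented sample data, for illustration only.
        String ead = "<ead><archdesc><controlaccess>"
                + "<persname>Example, Person</persname>"
                + "<corpname>Example Organisation</corpname>"
                + "</controlaccess></archdesc></ead>";
        for (String term : harvest(ead)) {
            System.out.println(term);
        }
    }
}
```

The same loop could be pointed at the <name>, <pershead> and <corphead> elements mentioned above by changing the tag name passed to getElementsByTagName.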
The index document is used by the search engine, Lucene, to generate browsable indexes for searching and retrieval.
When creating the index with Lucene, a Lucene Document is added for each <document> element in the index XML file. The Lucene Document contains the <header> element as an XML string and the <entry> elements as searchable fields. Storing the <header> element in the index as an XML string simplifies the display of the search results, because applying an XSL stylesheet to the information that comes from the index is enough.
The searchable fields are stored in the index as lower-case strings, which allows for case-insensitive searches. However, a problem arose when displaying the search fields held in the index: some of the values are acronyms and must be displayed in upper case. To solve this, the original value of each searchable field is added to the index alongside the lower-cased version and is used for display.
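The idea of searching on a lower-cased key while displaying the original value can be sketched without any Lucene code. The example below is a plain in-memory map, not the LEADERS index, and the class name, method names and sample values are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the search index; the toolkit itself uses Lucene.
final class DisplayPreservingIndex {

    // Lower-cased search key -> original values, kept for display.
    private final Map<String, List<String>> entries = new HashMap<>();

    /** Stores the value under its lower-cased form and keeps the original spelling. */
    void add(String originalValue) {
        String key = originalValue.toLowerCase(Locale.ROOT);
        entries.computeIfAbsent(key, k -> new ArrayList<>()).add(originalValue);
    }

    /** Case-insensitive lookup that returns the values as they were originally written. */
    List<String> search(String query) {
        List<String> hits = entries.get(query.toLowerCase(Locale.ROOT));
        return hits == null ? Collections.emptyList() : Collections.unmodifiableList(hits);
    }

    public static void main(String[] args) {
        DisplayPreservingIndex index = new DisplayPreservingIndex();
        index.add("NATO");           // invented sample acronym
        index.add("Example Person"); // invented sample name
        System.out.println(index.search("nato")); // prints [NATO]
    }
}
```

The query is lower-cased in the same way as the stored key, so "nato", "Nato" and "NATO" all find the same entry, while the acronym is still shown in upper case.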
This indexing approach has two benefits. Firstly it improves speed of retrieval and display by avoiding the need to search through large numbers of documents, in particular EAD documents which can become extremely large and complex. Secondly it saves the overhead of using a database purely to speed up searching. Since there is no data management functionality required by LEADERS, and the data files, once created will be non-volatile, there is no need to use a database.
The other major element in the LEADERS Toolkit is the Web Services. These services, written in Java, provide the search, retrieve, and display functions which are consumed by the client application and are described in the LEADERS WSDL file. The main benefits achieved by using Web Services are:
The present application has been constructed to meet three main purposes.
The application itself consists of a series of XML and XHTML documents, XSL stylesheets, CSS stylesheets and eXtensible Server Pages (XSP) scripts. The files are parameterised and, like the toolkit, are designed to operate together in a Cocoon framework and to run from a Tomcat servlet container, although not necessarily on the same physical server. Xerces and Xalan are used by the XSL stylesheets to parse and transform the XML files and the data files served up by the Web Services of the toolkit.
The colour scheme, fonts, etc. can be changed by modifying the CSS stylesheets, and the content and layout of the search and results screens and the on-line help can be changed by modifying the XSL stylesheets. Alternatively, a completely new application may be created, based around the services described in the LEADERS WSDL file.
Tomcat is an open source servlet container that implements the Java Servlet and JavaServer Pages specifications. The Tomcat version used in this project, 4.x, implements the Servlet 2.3 and JSP 1.2 specifications.
Tomcat is a widely used servlet container that can be used as a standalone server or it can be integrated with the Apache Web Server in Unix environments and with IIS in Windows environments.
Tomcat is maintained and supported by the Apache Software Foundation.
Cocoon is a Java server framework that allows dynamic publishing of XML content using XSLT transformations. The use of XML to describe content and XSLT to transform that content into multiple formats turns Cocoon into a platform for building applications with strong separation between content, logic and presentation.
Cocoon is supported by the Apache Software Foundation.
Cocoon allows the configuration of pipelines of transformations to support the functionality of the demonstrator application. Within Cocoon, HTML is rendered on the fly. When Cocoon sees a request for a file with a .html extension, it fetches the corresponding .xml file and applies the transformation. It caches as needed to improve performance.
Here is an example of how Cocoon maps transformations:
<map:match pattern="teidocs/*.html">
  <map:generate src="teidocs/{1}.xml"/>
  <map:transform src="stylesheets/transcript.xsl"/>
  <map:serialize type="html"/>
</map:match>
How it works:
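map:generate reads the source document teidocs/{1}.xml, where {1} is the part of the request matched by the * wildcard.
map:transform applies the stylesheets/transcript.xsl stylesheet to that document.
map:serialize writes the result of the transformation out as HTML.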
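The three stages can be mimicked in plain Java with the standard javax.xml.transform API. This is only a sketch of the generate, transform and serialize sequence, not of how Cocoon is implemented: it ignores caching and Cocoon's internal streaming, and the class name, method names, sample document and sample stylesheet are assumptions made for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Optional;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical illustration; Cocoon does this work from the sitemap configuration.
final class PipelineSketch {

    /** Mimics pattern="teidocs/*.html" with src="teidocs/{1}.xml"; empty if the request does not match. */
    static Optional<String> sourceFor(String requestPath) {
        String prefix = "teidocs/";
        String suffix = ".html";
        if (!requestPath.startsWith(prefix) || !requestPath.endsWith(suffix)) {
            return Optional.empty();
        }
        String wildcard = requestPath.substring(prefix.length(), requestPath.length() - suffix.length());
        if (wildcard.contains("/")) {
            return Optional.empty(); // a single * does not cross directory boundaries
        }
        return Optional.of(prefix + wildcard + ".xml");
    }

    /** Parses the XML (generate), applies the stylesheet (transform) and writes HTML (serialize). */
    static String render(String xml, String xsl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xsl)));
            transformer.setOutputProperty(OutputKeys.METHOD, "html");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Invented sample transcript and stylesheet, for illustration only.
        String xml = "<text><body>Hello</body></text>";
        String xsl = "<xsl:stylesheet version=\"1.0\""
                + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
                + "<xsl:template match=\"/text\"><p><xsl:value-of select=\"body\"/></p></xsl:template>"
                + "</xsl:stylesheet>";
        System.out.println(sourceFor("teidocs/letter1.html").orElse("no match"));
        System.out.println(render(xml, xsl));
    }
}
```

In the real application none of this code is written by hand: the sitemap entry above declares the same sequence, and Cocoon resolves the file names and runs the stages itself.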