Building the Metadata
Trails of Hope, as a digital publication, consists of 59 diary volumes, 82 photographs and works of art, 43 contemporary maps, and seven trail guides. In all there are nearly 10,000 images, including 7,000 images of trail diaries and letters, 2,350 images of trail guides, 82 images of photographs and works of art, and 43 map images. The publication couples the searching power of Extensible Markup Language (XML) with the stability of a controlled vocabulary to produce over 10,000 unique headings. These headings include personal names, geographic names, ten broad topics, Library of Congress subject headings, and trail names.
The Lee Library Metadata Unit set as its overarching goal the application of both a controlled vocabulary and authority control to individual objects, in order to provide the widest possible discovery.
To accomplish this goal, full USMARC (subsequently referred to as MARC) parent records were created for the 49 separate diaries and letters, the 43 contemporary maps, and the seven trail guides. These records are searchable in the online catalog at Brigham Young University, RLIN, OCLC, and the American Memory site. Each MARC record contains a URL that links directly to the digital representation of the object. Metadata records for each of the 82 photographs and works of art and the 45 diarists' biographies, in addition to the maps, trail guides, and diaries, were created in CONTENTdm and are searchable at Brigham Young University and through the American Memory site of the Library of Congress.
To greatly enhance the discovery process, the Metadata Unit set a goal to use and add to the rigorous work of the XML Mark-up Team, which created rich text-level tagging. This tagging made every page of each diary and letter searchable. The Metadata Unit used Library of Congress Subject Headings (LCSH) authority control in the creation of page-level metadata.
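To make the link between text-level tagging and page-level metadata concrete, the following is a minimal sketch in Python of how tagged terms on one transcribed page might be collected into metadata fields. It is not the project's actual loader. The element names (`page`, `persName`, `placeName`, `term`), the `reg` attribute holding the authorized form, the field names, and the sample headings are all assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from XML tags to page-level metadata fields.
TAG_TO_FIELD = {
    "persName": "personal_names",
    "placeName": "geographic_names",
    "term": "topics",
}

# Invented sample page; the names and headings are not from any real diary.
SAMPLE_PAGE = """
<page n="12">
  <p>Camped on <placeName reg="Example Creek">the creek</placeName> with
  <persName reg="Doe, Jane">Sister Doe</persName>. Heavy
  <term reg="Weather">rain</term> all day; left
  <placeName reg="Example Creek">Example Cr.</placeName> at dusk with
  <persName>Mr. Roe</persName>.</p>
</page>
"""


def extract_page_fields(page_xml):
    """Collect tagged terms from one page into metadata fields."""
    root = ET.fromstring(page_xml)
    fields = {name: [] for name in TAG_TO_FIELD.values()}
    for elem in root.iter():
        field = TAG_TO_FIELD.get(elem.tag)
        if field is None:
            continue  # not a tag we load into a metadata field
        # Prefer the authorized form; fall back to the text as written,
        # so an unestablished name still survives as a keyword.
        heading = elem.get("reg") or "".join(elem.itertext()).strip()
        if heading and heading not in fields[field]:
            fields[field].append(heading)
    # Keep the full transcription, with whitespace collapsed, for keyword search.
    fields["full_text"] = " ".join("".join(root.itertext()).split())
    return fields


page_fields = extract_page_fields(SAMPLE_PAGE)
```

In this sketch a name that appears twice on a page under different spellings yields a single heading, because both occurrences carry the same `reg` value; that is the practical benefit of doing authority work at tagging time.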
Plan of Work
- A parent record was created for each of the digital titles for the diaries and letters, the maps, and the trail guides. As each of these digital publications had a MARC record, previously created to identify the physical object, that record was used as the basis for creating the MARC record representing the electronic version of the original. First, the records were upgraded, when necessary, by applying the Anglo-American Cataloguing Rules, 2nd ed. (AACR2), LCSH, and name authority.
- An updated 520 field (summary/description) was created in the MARC record for diaries and letters, as more information about the diarists was discovered when the biographies for each diarist were written. The upgrades were made to both the original and electronic versions of the MARC records. The metadata records created for photographs and art work, and for the biographies, also included this 520 field.
- To capture in the MARC records the description of the electronic publication, a number of fields were added to each record. To make this process easier, a form was developed that indicated what needed to be added to the new MARC records for each format type: diaries and letters, maps, and trail guides.
- Once CONTENTdm was selected as the software to deliver Trails of Hope, it was then possible to create metadata fields and map these fields to Dublin Core and to MARC for each of the five different types of digital publications - diaries and letters, photographs and artwork, maps, biographies, and trail guides. Crosswalk links were developed between the metadata fields, the MARC tags, and Dublin Core elements. Metadata records were then created for all five types of digital publications. The crosswalks also include information regarding which fields are searchable, displayable, and under authority control. The administrative metadata, which documents the technical details of the creation of the digital image, is not stored on CONTENTdm, but on a separate server.
- The URL to the digital publication for the diaries and letters, the trail guides, and the maps, was added to the MARC records for both the digital and the physical object, allowing online user access from either record. The MARC records were contributed to RLIN and OCLC, which also allows access directly from those online union catalogs.
- The major work occurred within the diaries and letters. The architecture of CONTENTdm requires that every image have metadata attached. Each page of the diary or letter is considered an image. The Metadata Unit and the Overland Trails Team determined that it was now possible to use the XML previously created by the XML Mark-up Team for every place name, personal name, and topic. Early in this XML mark-up process, the team included authority control as an integral part of their work when tagging terms. The results allow incredibly deep searching within the diaries and letters (see "Transcription and Mark-up of the Diaries").
- Each page of the diary transcription, with its XML mark-up, was programmatically loaded into the metadata record by extracting selected tags and importing their content into designated metadata fields. Although considerable record cleanup was required, in general it was quicker than manually cutting and pasting.
- The decision was made, without totally understanding its ramifications, to run every single XML tagged place name, personal name, and topical heading that appeared on every page of the diary or letter, through name authority and to submit any new geographic or personal names to the Name Authority Control (NACO) and Subject Authority Control (SACO) programs of the Library of Congress. In this first edition, 12 of the 59 diary volumes have the page-level authority work completed for both geographic and personal names. Forty-eight of the diary volumes have the page-level geographic authority work completed. The personal name authority work is far from complete.
- Each geographic name was searched through the following sources until verified or a new heading created: Lee Library authority file; Library of Congress authority files via RLIN; Geographic Names Information System (GNIS); Getty Thesaurus of Geographic Names; The Columbia Gazetteer of the World; The Columbia Lippincott Gazetteer of the World; Webster's New Geographical Dictionary; the National Imagery and Mapping Agency (NIMA); the Internet; various place name books; and county histories.
- An effort was made to establish family member names, and names of persons mentioned prominently in the diaries. The personal names chosen for name authority, a list more than twice as long as the geographic names, were searched in the following sources: Lee Library authority file; Library of Congress authority file via RLIN; the specific diary or letter; and the online family history sources of the Church of Jesus Christ of Latter-day Saints.
- Geographic and personal name work was submitted to the Library of Congress on the appropriate NACO and SACO forms.
- In many cases, the geographic or personal name proved impossible to establish. In these cases, the term was still used as a keyword and co-exists with the fully established terms. Its form was created following LCSH rules as closely as possible. These headings were not submitted to the Library of Congress.
- The authority work changes were given to the XML Mark-up Team for tagging corrections.
- The corrected XML-encoded full text of each transcribed diary page was next loaded into the metadata record to permit searching using the XML tags for geographic names, personal names, and topics. This data load revealed problems with the authority work due to hidden tags and errors created in part because the authority work in each diary volume is not yet complete. Some of the errors may also be the result of switching from the WordPerfect 9.0 software, with its XML editor, to Altova GmbH's XML Spy (3.5 and 4.3). An analysis of these errors has yet to be completed. There are also errors in the full-text transcriptions of each page, usually in the form of some words dropped from the text. None of these errors have been manually fixed, nor has a complete analysis been done on how these errors were introduced into the programmatic load from XML to the CONTENTdm metadata fields.
- Several "types" of subject categories were created. The work is not yet complete for each of those categories, but will be upgraded with a second version of the database. The category types are topics, personal names, geographic place names, trail names, subject, and date the diary entry was made. The topics category was specifically created to categorize broad topics for subject content within the pages of the diaries. The subject category also includes the names of newspapers, books, ships, organizations, groups, and events in addition to the diary entry date and all the Library of Congress subject headings. Date entries were normalized to month, day, and year (e.g., June 6, 1853) and appear in the subject field. This permits a single search to cover all of the digital publications at both the page and parent level.
- Because it is not the practice of the Lee Library to provide full MARC cataloging for individual photographs and art works, a metadata record was developed which mapped to Dublin Core and was created and loaded on the CONTENTdm server only. Five separate institutions contributed these images, which created some challenges in uniformly gathering and entering information in the appropriate metadata fields. This included a field entitled "Patron Usage Restrictions" to identify intellectual property rights, attribution for citation, and an address to write for use permission when that use would fall outside of educational fair use. A field for "Provenance" is also included.
- Biographies were researched and written for 45 of the 49 diarists. Three of the diarists were not named, leaving only Alphonse B. Day without any biographical information. Metadata records were created only in CONTENTdm, which included the full-text of the biography, trail names, personal names, and subject headings, but no geographic names. Including metadata for the biographies, however, will permit a simple keyword search to pick up the biographies as well as diaries, maps, photographs and art work, and the trail guides.
- The newly created MARC records for the diaries, maps, and trail guides served as the basis for the creation of the metadata record which resides, along with the digital objects, in CONTENTdm. Creation of these records was a manual process of "cut and paste" between the MARC and the metadata records, with additional fields added in the metadata record. Each volume of a diary or trail guide was treated separately; thus the 49 MARC records for the diaries turned into 59 metadata records, and the seven trail guides are represented by nine metadata records in CONTENTdm. The newly created MARC records for the diaries also were used to build comprehensive Text Encoding Initiative (TEI) headers in the XML files. Because trail guides are not searchable text, but image only, the page-level metadata is minimal.
- All software programs have specifications, field lengths, and sort features that may not work the way that the authors of a digital publication might want. For example, in CONTENTdm all title lists are sorted alphabetically. This created some interesting arrangements for maps and the photographs and art works. To have the title lists display in the order that we wished, special title fields were created which display maps by the year of publication and photographs and art work in a geographic order, roughly from east to west. Titles were shortened in the case of trail guides, and diary titles were altered to sort alphabetically by the author's name.
- Working in two physically separate teams for metadata creation and XML mark-up made the communication between these teams difficult. It seems desirable to either merge the two teams into one, or have the XML mark-up and the metadata creation be worked on simultaneously. This would help determine at what level tagging should occur on personal and geographic names.
- The extreme amount of page-level authority work, which was undertaken to provide a controlled vocabulary in conjunction with the XML tagging, caused numerous stops and starts on this digital publication. The depth of XML mark-up, normalization, and authority work must be determined at the inception of each digital project, before work begins. This process is akin to setting the level of description for a manuscript collection with its resulting register. The depth of description might be to the item level, folder level, or box level, depending on the need and expected usage of that collection.
- Selection of software for managing and delivering a digital publication should be made early in any project, as the architecture and the capabilities of the software package help determine what can be delivered for a given publication. The Lee Library was within months of finishing this three-year project when CONTENTdm was purchased in the Fall of 2001.
- The opportunity to work closely with a database/software company in the development of programming to deliver the kind of digital publication, and the appropriate access to that publication, is critical. An example of one of the issues that will be discussed in the near future with CONTENTdm is the need to programmatically export XML into the metadata fields housed on the CONTENTdm server.
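The crosswalk described in the plan of work (local metadata fields mapped to MARC tags and Dublin Core elements, with flags for searchable, displayable, and authority-controlled fields) can be pictured as a small lookup table. The sketch below, in Python, is only an illustration under stated assumptions: the MARC tags and Dublin Core element names are real, but the local field names and the particular pairings shown here are hypothetical and are not taken from the project's actual crosswalks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CrosswalkEntry:
    local_field: str            # field label in the delivery system (assumed)
    dublin_core: str            # Dublin Core element name
    marc_tag: Optional[str]     # MARC tag, or None when no MARC record exists
    searchable: bool
    displayed: bool
    authority_controlled: bool


# Hypothetical crosswalk; the pairings are illustrative, not the project's own.
CROSSWALK = [
    CrosswalkEntry("Title", "title", "245", True, True, False),
    CrosswalkEntry("Author", "creator", "100", True, True, True),
    CrosswalkEntry("Summary", "description", "520", True, True, False),
    CrosswalkEntry("Subject", "subject", "650", True, True, True),
    CrosswalkEntry("Geographic name", "coverage", "651", True, True, True),
    CrosswalkEntry("Patron Usage Restrictions", "rights", None, False, True, False),
]


def to_dublin_core(record):
    """Re-key a local metadata record by Dublin Core element."""
    by_field = {entry.local_field: entry for entry in CROSSWALK}
    dc = {}
    for field, value in record.items():
        entry = by_field.get(field)
        if entry is None:
            continue  # no crosswalk entry, e.g. administrative metadata
        values = value if isinstance(value, list) else [value]
        dc.setdefault(entry.dublin_core, []).extend(values)
    return dc


def authority_controlled_fields():
    """List the local fields whose values must match an authority file."""
    return [e.local_field for e in CROSSWALK if e.authority_controlled]
```

Keeping the searchable, displayed, and authority flags in the same table as the mappings means one document answers both "where does this field go?" and "how must this field be filled in?", which is the role the crosswalks play in the workflow described above.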
Susan L. Fales,
Curator of Digital Historical Collections,
Lee Library, Brigham Young University
Head, Metadata Unit,
Lee Library, Brigham Young University