DRAFT 04 - December 23, 2003
Observations on the relationship between container elements of an EAD container list
Abstract
A number of EAD container list examples show uses of
the <container> element that seem to be at
odds with one another, sometimes within the same list. This note provides some
observations on the issues involved and raises a question about the lack of a
definition of implied relationships between container elements.
Background
Most textual container lists rely on context to convey the container
(box, folder, and so on) that holds each item in the collection. A human
reader can generally find enough information within a page or two to determine
the complete container description. Sometimes the description is given in full,
like box2:folder3. Sometimes there is a folder column in which repeated entries
are omitted after the first, so one looks back up the page to find the last
folder noted. Sometimes boxes are shown in a similar column, and sometimes the
box is shown only when it changes, so again one looks back to find the last box
noted.
In any case, the need is to be able to unambiguously
find the full set of information to identify the complete container for an
item. Sometimes this is just a folder, or a box, or a folder within a box. A
notation like ':folder', 'box:' or 'box:folder' completely describes the
container in each of these cases. In this note, only box and folder are
described, but the problem in general may have considerably more complexity
(box:folder:page, carton:reel:frame, oversize folder 3, etc.).
When converting a container list from a textual
document to an EAD XML document, similar issues arise. The best discussion I've
found is in Section 3.5.2.4, "Physical Location and Container Information
<container>", and Section 7.2.5, "The PARENT Attribute on the Container
<container> and Physical Location <physloc> Elements", of the "EAD Application
Guidelines for Version 1.0" at http://lcweb.loc.gov/ead/ag/agcreate.html.
That document directly addresses several of the main issues that arise when the
container element for a box needs to be referenced by a folder element. In
particular, it addresses the case of a box containing parts of different logical
components (the <c> and <cnn> elements) and suggests the use of the 'parent'
attribute. Much of this note is a rephrasing of that information and a
discussion of a few problem areas.
I note that these observations come from developing a
computer program to convert a textual container list into an EAD XML document.
I am what is referred to as a "Domain Idiot" in object-oriented
computer terms, with little background in library science.
The container structure and the logical structure
The EAD makes clear that the hierarchy of logical
elements should be captured using components (the <c...> elements) rather
than the physical structure of folders within boxes. So there are two hierarchies
intertwined here, the logical and the physical. Although the organizer of a
collection tends to keep them closely related, they might be completely
independent if the physical structure cannot be changed, for example.
Here is a simple example of a logical structure that
highlights some of the issues.
Series I
    item a in box1:folder1
Series II
    item b in box1:folder2
    item c in box2:folder1
Series III
    item d in box3
    item e in box3
Here Series II is split between boxes 1 and 2, and box 1
contains items from both Series I and II. Series III and box 3 are matched,
although there are no folders present (for example, items in an oversized box).
Examined as a physical hierarchy, this looks quite
different:
box1
    folder1
        item a
    folder2
        item b
box2
    folder1
        item c
box3
    item d
    item e
Representing containers within the logical structure
One unambiguous way of representing the containers is
to give each logical item of the collection a single container element such as
<container type="box-folder">box 1:folder 2</container>
or, for items held in just a box or just a folder, a container of type box or
folder. This allows the items of the logical collection to be scattered
throughout the containers in an arbitrary manner, and thus is general enough to
handle any collection. Placing containers only at the leaves of the logical
hierarchy is sufficient.
Here is a complete 'dsc' element of the example above
using this approach:
<dsc type="combined">
  <c01 level="series">
    <did><unittitle>Series I:</unittitle></did>
    <c02 level="file">
      <did>
        <unittitle>item a</unittitle>
        <container type="box-folder">box 1:folder 1</container>
      </did>
    </c02>
  </c01>
  <c01 level="series">
    <did><unittitle>Series II:</unittitle></did>
    <c02 level="file">
      <did>
        <unittitle>item b</unittitle>
        <container type="box-folder">box 1:folder 2</container>
      </did>
    </c02>
    <c02 level="file">
      <did>
        <unittitle>item c</unittitle>
        <container type="box-folder">box 2:folder 1</container>
      </did>
    </c02>
  </c01>
  <c01 level="series">
    <did><unittitle>Series III:</unittitle></did>
    <c02 level="file">
      <did>
        <unittitle>item d</unittitle>
        <container type="box">box 3</container>
      </did>
    </c02>
    <c02 level="file">
      <did>
        <unittitle>item e</unittitle>
        <container type="box">box 3</container>
      </did>
    </c02>
  </c01>
</dsc>
Reservations and an alternative
However, this does require type attribute values for
every combination of nested container types, with appropriate sets of
identifiers. So one might need a 'carton-box-folder-page' coding, for example.
The only composite other than box-folder I've encountered is reel-frame, from
the document "The Encoded Archival Description, Retrospective Conversion
Guidelines. A Supplement to the EAD Tag Library and EAD Guidelines" at
http://sunsite.berkeley.edu/amher/upguide.html.
The 'parent' attribute gives a way of creating the necessary physical
hierarchy from a smaller set of type values. So a container with
type="box-folder" might instead be coded with attributes
type="folder" parent="box1" where, somewhere, there is a unique container with
attributes type="box" id="box1".
Here is the example where a container for a box occurs
with the first concrete item that needs it.
<dsc type="combined">
  <c01 level="series">
    <did><unittitle>Series I:</unittitle></did>
    <c02 level="file">
      <did>
        <unittitle>item a</unittitle>
        <container type="box" id="box1">box 1</container>
        <container type="folder" parent="box1">folder 1</container>
      </did>
    </c02>
  </c01>
  <c01 level="series">
    <did><unittitle>Series II:</unittitle></did>
    <c02 level="file">
      <did>
        <unittitle>item b</unittitle>
        <container type="folder" parent="box1">folder 2</container>
      </did>
    </c02>
    <c02 level="file">
      <did>
        <unittitle>item c</unittitle>
        <container type="box" id="box2">box 2</container>
        <container type="folder" parent="box2">folder 1</container>
      </did>
    </c02>
  </c01>
  <c01 level="series">
    <did><unittitle>Series III:</unittitle></did>
    <c02 level="file">
      <did>
        <unittitle>item d</unittitle>
        <container type="box" id="box3">box 3</container>
      </did>
    </c02>
    <c02 level="file">
      <did>
        <unittitle>item e</unittitle>
        <container type="box">box 3</container> <!-- Note lack of an id attribute -->
      </did>
    </c02>
  </c01>
</dsc>
Notice here that every container for a folder has a
reference to the box containing it, even if that box container is part of the
same logical component. This creates the physical hierarchy. Thus the
information does not depend on context or any particular ordering of elements.
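As a minimal sketch of how a program might recover the physical hierarchy from the id and parent attributes, the following Python uses the standard library's xml.etree.ElementTree. The function name physical_paths and the abbreviated sample are illustrative assumptions, not part of EAD or of the conversion program described in this note, and the sketch assumes each parent attribute holds a single id.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hypothetical sample based on items a and b of the example.
SAMPLE = """
<dsc type="combined">
  <c01 level="series">
    <did><unittitle>Series I:</unittitle></did>
    <c02 level="file">
      <did>
        <unittitle>item a</unittitle>
        <container type="box" id="box1">box 1</container>
        <container type="folder" parent="box1">folder 1</container>
      </did>
    </c02>
  </c01>
  <c01 level="series">
    <did><unittitle>Series II:</unittitle></did>
    <c02 level="file">
      <did>
        <unittitle>item b</unittitle>
        <container type="folder" parent="box1">folder 2</container>
      </did>
    </c02>
  </c01>
</dsc>
"""


def physical_paths(dsc):
    """Map each unittitle to the full physical path(s) of its containers."""
    # Index every container that carries an id, wherever it occurs.
    by_id = {c.get("id"): c for c in dsc.iter("container") if c.get("id")}

    def path(container):
        # Follow parent references upward until a container has none.
        parts, seen = [container.text.strip()], set()
        parent_id = container.get("parent")
        while parent_id:
            if parent_id in seen:
                raise ValueError(f"circular parent reference: {parent_id}")
            seen.add(parent_id)
            container = by_id[parent_id]  # KeyError means a dangling reference
            parts.append(container.text.strip())
            parent_id = container.get("parent")
        return ":".join(reversed(parts))

    result = {}
    for did in dsc.iter("did"):
        containers = did.findall("container")
        if not containers:
            continue
        # A container named as a parent within the same <did> is reported
        # only as part of its child's path, not as a separate location.
        referenced = {c.get("parent") for c in containers if c.get("parent")}
        title = did.findtext("unittitle").strip()
        result[title] = [path(c) for c in containers
                         if c.get("id") not in referenced]
    return result


print(physical_paths(ET.fromstring(SAMPLE)))
```

Because the lookup goes through the id index rather than document order, item b resolves to box 1 even though the box container sits in a different series.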
One nagging detail
The "EAD Application Guidelines" document does not show
a parent attribute in a folder contained in the same component as the box
element. So in the above example item a might be shown with two containers
coded like this:
<container type="box" id="box1">box 1</container>
<container type="folder">folder 1</container> <!-- Note the lack of a parent attribute -->
This style of coding, without the id attributes, is
common in collections I have seen. For example, it occurs in the templates of
the "Retrospective Conversion Guidelines" at Berkeley:
<c04>
  <did>
    <container type="box"></container>
    <container type="folder"></container>
    <unittitle>[Title], <unitdate>[Date or date range]</unitdate></unittitle>
  </did>
</c04>
However, if we change the types involved to ones where
the hierarchical relationship between the container types is not so clear, the
need for an explicit parent becomes apparent:
<container type="folio" id="folio1">folio 1</container>
<container type="volume" id="vol1">vol 1</container> <!-- volume in folio or folio in volume? -->
In other words, a program trying to understand the
physical relationships between containers needs either an explicit parent
attribute or domain-specific knowledge of container types. This argues for
coding the parent attribute in every case, with no exception made when the
parent container is in the same component.
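To illustrate what "domain-specific knowledge" would mean for a program, here is a small Python sketch. The NESTING_RANK table and the infer_parent function are hypothetical; the ranks are an assumption about typical practice, not something defined by EAD.

```python
# Hypothetical nesting ranks (outermost first); real collections may differ.
NESTING_RANK = {"carton": 0, "box": 1, "folder": 2}


def infer_parent(type_a, type_b):
    """Return the type assumed to be the outer container, or None if unknown."""
    rank_a = NESTING_RANK.get(type_a)
    rank_b = NESTING_RANK.get(type_b)
    if rank_a is None or rank_b is None or rank_a == rank_b:
        return None  # no basis for a guess: an explicit parent is needed
    return type_a if rank_a < rank_b else type_b


print(infer_parent("box", "folder"))
print(infer_parent("folio", "volume"))
```

The first call can be answered from the table, while the folio and volume pair cannot, which is exactly the situation an explicit parent attribute avoids.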
Note also that when trying to represent the containers
for items d and e, we run into a dilemma since they share a container, box 3.
They are not in containers with box 3 as a parent, so the parent attribute
cannot be used. We must generate two instances of the container element for box
3. The first one can carry the id and be referenced, if necessary, by any
folders in the box.
Unclear areas
There are also other contextual techniques that are
tempting to use, like that of placing container information higher in the
logical structure so that it is inherited by the lower level components and
hence they need no container themselves. This suits the situation with Series
III and box 3. A container may be factored out, moving it up the component
hierarchy until it covers all components within that container. This is similar
to what is done in textual documents.
<c01 level="series">
  <did><unittitle>Series III:</unittitle>
    <container type="box" id="box3">box 3</container>
  </did>
  <c02 level="file">
    <did>
      <unittitle>item d</unittitle> <!-- note there is no longer any container element -->
    </did>
  </c02>
  <c02 level="file">
    <did>
      <unittitle>item e</unittitle> <!-- note there is no longer any container element -->
    </did>
  </c02>
</c01>
This interprets the scope of a container as the hierarchy below it. More on this later.
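Under this reading, a program would resolve a component's containers by walking down the component tree and carrying the nearest ancestor's containers along. The Python sketch below shows one such resolution. The function name inherited_containers is hypothetical, and the rule that a component's own containers replace the inherited ones is an assumption made for illustration, since the standard does not define it.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<c01 level="series">
  <did>
    <unittitle>Series III:</unittitle>
    <container type="box" id="box3">box 3</container>
  </did>
  <c02 level="file">
    <did><unittitle>item d</unittitle></did>
  </c02>
  <c02 level="file">
    <did><unittitle>item e</unittitle></did>
  </c02>
</c01>
"""


def inherited_containers(component, inherited=(), out=None):
    """Map each unittitle to its own containers, else the nearest ancestor's."""
    out = {} if out is None else out
    did = component.find("did")
    own = [c.text.strip() for c in did.findall("container")]
    # Assumed rule: own containers override whatever was inherited.
    effective = own or list(inherited)
    out[did.findtext("unittitle").strip()] = effective
    for child in component:
        if child.find("did") is not None:  # child is itself a component
            inherited_containers(child, effective, out)
    return out


print(inherited_containers(ET.fromstring(SAMPLE)))
```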
Notice that, in general, the parent relationship is
still required for folders within boxes. This is apparent in Series II, which
spans boxes 1 and 2: a container for both boxes would appear at the series
level, so a folder with no parent reference would be ambiguous.
As attractive as this contextual approach seems, it is
at odds with the interpretation that a container's scope is the remainder of
the document, as is shown in section 3.5.2.4 of the "EAD Application
Guidelines", where the items titled "Correspondence" and "Scripts and
screenplays" inherit the container for Box 47 (not Box 46). A computer program
parsing an XML document would find that very peculiar, since Box 47 is defined
in a part of the logical hierarchy 4 levels up and then 3 levels down a
different branch. If the reference to Box 47 were made via a parent/id
relationship, there would be little doubt, since a validating parser checks
that id values within an XML document are unique and that parent references
point to a valid id.
Moreover, whichever interpretation of the scope of a
container is used, a document that relies on context implied higher in the
hierarchy or earlier in the document will not be robust to changes. In the
example, if another box were added to Series III, there would be two box
containers, and containers would have to be reintroduced into all the lower
level components. Similarly, if the container element for box 3 were
inadvertently deleted, there would be no XML diagnostic to warn that items d
and e no longer have containers, or perhaps have inherited a container from
higher in the collection. Both the lack of definition of implied relationships
and the lack of robustness argue against using contextual relationships between
containers.
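A tool that does not validate against the DTD would have to supply such diagnostics itself. The Python sketch below shows the kind of checks involved: duplicate ids, parent references with no target, and leaf components with no container. The function name container_problems and the BROKEN sample are hypothetical. The sample is a variant of Series III in which the box 3 container has been deleted and item d has been given a folder for the sake of illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical damaged fragment: the container with id="box3" is missing.
BROKEN = """
<c01 level="series">
  <did><unittitle>Series III:</unittitle></did>
  <c02 level="file">
    <did>
      <unittitle>item d</unittitle>
      <container type="folder" parent="box3">folder 1</container>
    </did>
  </c02>
  <c02 level="file">
    <did><unittitle>item e</unittitle></did>
  </c02>
</c01>
"""


def container_problems(root):
    """Return a list of human-readable container problems found under root."""
    problems = []
    containers = list(root.iter("container"))
    id_counts = Counter(c.get("id") for c in containers if c.get("id"))
    for cid, count in id_counts.items():
        if count > 1:
            problems.append(f"duplicate id: {cid}")
    for c in containers:
        parent = c.get("parent")
        if parent and parent not in id_counts:
            problems.append(f"dangling parent: {parent}")
    # Leaf components: elements with a <did> but no child components.
    for comp in root.iter():
        did = comp.find("did")
        if did is None:
            continue
        has_children = any(ch.find("did") is not None for ch in comp)
        if not has_children and did.find("container") is None:
            title = did.findtext("unittitle", "").strip()
            problems.append(f"no container: {title}")
    return problems


for problem in container_problems(ET.fromstring(BROKEN)):
    print(problem)
```

The last check only makes sense under the convention that every leaf carries its own container. A document that relies on inherited containers would trip it constantly, which is another way of seeing the robustness problem.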
Multiple containers
Another technique that is frequently used is listing
ranges of boxes or folders within one container:
<c02 level="file">
  <did>
    <unittitle>item h</unittitle>
    <container type="folder">folder 1-3, 5</container>
  </did>
. . .
The intent here seems clear that item h is housed in 4
folders. There are other cases where multiple containers are used instead:
<c02 level="file">
  <did>
    <unittitle>item h</unittitle>
    <container type="folder">folder 1</container>
    <container type="folder">folder 2</container>
    <container type="folder">folder 3</container>
    <container type="folder">folder 5</container>
  </did>
. . .
In this case, the relationship between the containers
is intended to be one of a union. This clashes with the more common use of
multiple containers of differing types (a box and a folder, for example) where
the intent is to express a parent-child relationship.
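The two codings of item h can be related mechanically if a tool is willing to interpret the container text. The Python sketch below expands a range such as "folder 1-3, 5" into one label per folder. The function name expand_container_text is hypothetical, and the syntax it accepts (a word label followed by comma-separated numbers or hyphenated ranges) is an assumption based only on the example above. Anything else is returned unchanged.

```python
import re


def expand_container_text(text):
    """Expand 'folder 1-3, 5' into one label per container."""
    match = re.fullmatch(r"\s*([A-Za-z ]+?)\s*(\d[\d\s,\-]*)", text)
    if match is None:
        return [text.strip()]  # not a recognized label-plus-numbers form
    label, numbers = match.groups()
    expanded = []
    for part in numbers.split(","):
        item = re.fullmatch(r"\s*(\d+)\s*(?:-\s*(\d+))?\s*", part)
        if item is None:
            return [text.strip()]  # leave unrecognized text unexpanded
        start = int(item.group(1))
        end = int(item.group(2) or start)  # a single number is a range of one
        expanded.extend(range(start, end + 1))
    return [f"{label} {n}" for n in expanded]


print(expand_container_text("folder 1-3, 5"))
```

This treats the text content as data with its own grammar, which EAD does not define, so it is a convenience for one collection's conventions and not a general solution.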
This same structure is also seen at times at upper
levels of the hierarchy. Series II from the example might have shown a complete
list of containers used by all the items below it:
<c01 level="series">
  <did>
    <unittitle>Series II:</unittitle>
    <container type="box-folder">box 1:folder 2, box 2:folder 1</container>
  </did>
. . .
Although a person reading these XML documents may have
a clear idea of the meaning or semantics, EAD tools that use the XML documents
as input need a clear definition of how to interpret multiple containers at the
same level, and how to process ranges contained within them. This semantic
interpretation is missing from the standard.
Statement of the need for a clarification of the standard
The issue might be stated as follows. Given a number
of container elements within an EAD component hierarchy, what is the implied
relationship between them?
If two containers exist within one component, is one
contained within the other? Is the component contained in both containers?
If a component with a container has an ancestor with a
container, does the lower level container override the higher level container?
Is there an implied parent relationship?
If a component has no container, does it inherit the
container(s) of an ancestor? Does it have an implied container from an
unrelated part of the structure?
Summary
Intertwining a physical hierarchy of containers within
a logical hierarchy of components can be done unambiguously with the parent and
id attributes of the container element. It can also be done for box and folder
hierarchies using the box-folder type. However, relying on context implicit in
either the logical structure or earlier parts of the document raises
questions about the definition of the relationship between container elements.
Paul Jensen pdjensen@agileimage.com
Agile Image Movers http://agileimage.com/