DOCAM’03, Berkeley 13-15 August 2003
joacim.hansson@hb.se
Helena Francke
helena.francke@hb.se
mats.dahlstrom@hb.se
Mikael Gunnarsson
mikael.gunnarsson@hb.se
Swedish School of Library and Information Science
University College of Borås and Göteborg University
Development
of research within the humanities and the social sciences depends upon the
ability to formulate new problems and to identify new research needs in
society.[1]
This is particularly explicit in a young discipline like Library and
Information Science (LIS). Largely owing to rapid technological development and its
social implications, researchers have turned their interest towards Document
Studies (DS), paying attention to the phenomenology of discrete documents at
the micro-level as a complement to a more traditional collection-based view of
documents. Micro-level analyses are today becoming an increasingly important
area of LIS, and they contribute by raising new questions for the discipline.[2]
The past decade has shown an ongoing discussion in LIS over the scientific
legitimacy of the discipline.[3]
Arguments have been put forward for “information” as a key concept around which
the development of problem statements should evolve.[4]
This has in some ways been theoretically fruitful, but the argument has failed
to prove valid when it comes to letting “information” be the very founding
concept of the discipline. Researchers who oppose this view have formulated,
with some emphasis, a set of arguments that is often referred to as the
“institutional paradigm”.[5]
This viewpoint considers that the basis and legitimacy of the discipline should
rest upon the social structure of institutions that contain documents and / or
distribute information in some way. The theoretical claims of these two
approaches can be defined as, in the former case, conceptually
intra-scientific, and, in the latter case, sociologically extra-scientific.[6]
In focusing on conceptual and social features of LIS both perspectives
contribute, in part, to the definition of the discipline. However, the
theoretical foundation of the discipline has lacked the perspective that comes
from the study of the core entities of LIS – documents. A theory of LIS based
on the materiality of documents is sought, but has yet to be formulated.
Document
studies are of course not new to LIS. In many respects DS has been at the
centre of the intellectual development of the discipline since at least the
1950s, and the set of DS-related problems around which LIS has formed was
already formulated in the late 19th century through the practical
considerations of Melvil Dewey in the USA and, somewhat later, Paul Otlet in
Europe.[7] The
definitions of documents within LIS generally come close to what is sometimes
labelled a “socio-technical” perspective in the sense presented by David Levy.[8]
The point of departure is the treatment of individual documents within a
collection or an environment that is understood to be culturally and socially
defined. Hjørland expresses this by analysing documents and the studies of
documents in relation to what he sees as a general division of labour in
society between so-called memory institutions: libraries, archives, and
museums. He also includes journals and the systems of primary, secondary and
tertiary literatures as well as source literature as parts of traditional
document collection environments.[9]
This traditional view of documents and document environments has been at the
centre of the scientific discussion of the organisational aspect of documents
within LIS. Within the subfield Knowledge Organisation, DS have often been a
hidden gem, concentrated on the larger understanding of the role of individual
documents within collections. Michael Buckland has raised the question of the
nature of documents in this sense.[10]
In doing this he attempts something which has otherwise rarely been done within
LIS. The nature of a document has often been taken for granted and fundamental
questions concerning its nature and phenomenology have simply not been
considered necessary to raise. But in order to specify a theoretical foundation
of the discipline based on the materiality of documents, in our view the most
fundamental of entities investigated in LIS, we need to take into account more
complicated document definitions and also study the documents themselves in new
ways.
The aim
of this paper is not to present such a theoretical foundation, but, by showing
three examples of new and ongoing LIS research, to stimulate a discussion on
the role of DS as a third founding “perspective” within the discipline.[11]
As a point of departure in all three studies, the socio-technical perspective
on documents is accepted. This enables a wide range of combinations of
empirically and theoretically stated arguments and questions. In this paper we
present 1) a study on document genres with special focus on scholarly editions,
which creates a link between traditional material bibliography and studies of
book history and hypermedia editions; 2) an investigation of the concept
“document architecture” as an analytical framework, with special empirical
focus on the scholarly journal, in which attention is given to the media-specific
characteristics and epistemological norms underlying the structure of
documents; and 3) a study of markup performance using a Transformation-Based
Learning approach within the context of natural language processing.
Documents
are the real core units of bibliographical practice and theory. Potential
knowledge or information comes in the shape of documents of some sort.[12]
We cannot take for granted that library system units deliver the same
information to everyone. We cannot even assume that they pass along information
at all. What we can settle for is that the objects are documents. Documents and
document surrogates are the interface with which patrons interact in order to
extract information. So document management is in a sense what library-like
institutions and bibliography are really about.
Acting as
containers of information is however not the sole purpose of documents. They
might also be the extant expressions of aesthetic acts, and so documents have
both informational and aesthetic potentials. In fact, if we think of documents
as surrogates for all kinds of human activities, extending their reach beyond
the individual, we will see that the varying categories of document potentials
are equivalent to whatever list of categories we ascribe to human activity
itself. If we restrict our thinking of human activities to communication, we
will regard documents as conveyors of human utterances and communication. On
that view, any document serves as a component in a communication
process. (One might even feel tempted to taxonomise documents along the lines
of various act or even speech-act categories, such as sanctioning, authorizing,
positioning, or controlling.[13] Such a categorisation however runs the risk
of reductionism, not least when dealing with document genres whose “function”
is unclear or multiform, e.g. aesthetic documents.) We can further restrict
this communication process to verbal utterances, and see documents as verbally
encoded artefacts, coming down to what are normally (but not exclusively)
understood as the documents managed in libraries, and in particular in various
bibliographical situations. If we are interested in categorising documents
according to functionality, this class of documents can be labelled bibliographic
documents, bearing in mind the fact that there are other functionality
subclasses of documents as well.[14]
Bibliographic documents are thus a special class of objects serving as
documents in particular bibliographic situations and for bibliographic purposes
in libraries and library-like institutions.
But
documents are not only interesting for the content – informational
or aesthetic – they might or might not provide. As intersections where many
different social and medial ecologies meet, they prove to be complex systems to
analyse in themselves, exhibiting mental, social and material dimensions.[15]
Documents are sociotechnical: they are shaped by (and are shaping)
sociocultural communities, but are also always manifest products of media and
technology. How we define documents depends not insignificantly on what
material and technical media are at work in the sociocultural groups we study.
When a certain technology enjoys hegemony, the nature of its particular
documents becomes a role model for documents in general and their definition.
This is why document definitions are generally modelled upon printed documents.
As digital media and distribution technologies enter the scene of media
ecology, the nature and culture of documents change, as does our notion of
them.
So as we
are currently experiencing not one but several parallel introductions of new
media and technologies, exhibiting radically different logistics and parameters
for document production and distribution than print-based technologies do, it
should come as no surprise that the concept and the epistemological assumptions
of documents are reconsidered. Most bibliographic theory work has hitherto
maintained a notion of documents as material, discrete and fixed phenomena, being
the more or less trivial containers of abstract works.[16]
Perhaps it is safe to say that as LIS began shifting its focus from “document”
to “information” in the middle of the last century, there was a declining
interest in considering the material, technical and structural aspects of
documents per se as something posing scholarly problems.[17]
New media and distributive networks however challenge such illusory
transparency of texts and documents. They make us attentive to the material and
technical parameters of documents, and make us aware of how these – as in fact
all bibliographical concepts – are media typical if not media specific.[18]
For instance, in what way are printed and digital documents both documents?
What characteristics do they share that distinguish them from non-documents and
other classes of objects?
In the
background, material bibliographers, historians of the book and other media
(and recently scholars of electronic publishing and markup), are indeed
interested in the material, technical and structural matters of documents.[19]
To them, the document level in communication is, and has been all along, quite
crucial. Scholars such as Matthew Kirschenbaum have suggested that the
perspectives of textual scholarship and material bibliography might prove
fruitful in such a reconsideration and re-evaluation of media concepts.[20]
There is of course no denying that the conceptual apparatus of e.g. analytical
bibliography is largely formed by print culture, and needs careful and critical
reexamination if it is to be applied to digital objects in a meaningful way.[21]
This will best be accomplished if e.g. bibliographers and book historians begin
devoting some of their time to understanding digital media and various aspects of
electronic publishing. Conversely, observations and theory attempts within
digital media studies still largely suffer from oversimplification, e.g. in the
way bifurcation and binary rhetorics tend to overemphasize revolution,
punctuation of equilibrium, and difference at the cost of continuum, evolution,
and similarity between media families.[22]
Digital media studies, still in its infancy, might in other words benefit from
the kind of historical perspective that studies of older media have developed.
So if we
want to reconsider the nature and quality of documents in the light of new
media and their various production, publication and distribution technologies,
how do we go about this? One way is to dig deep into digital documents
themselves, much like the way material bibliographers have done with written
and printed documents for a century, while at the same time performing
comparative DS – across time, space and media. This means we have to try to
understand both the technology and the sociology of the document.
A
manageable way to design such DS is to concentrate them around a particular
type of documents, i.e. a document genre. If documents accord with social
functions, as suggested by David Levy, it might be possible to extend the
categorisation to genres of documents, given the reservation mentioned above.[23]
With appropriate quality of analysis and level of scrutiny, any document genre
study – the scientific article, the grocery list, the receipt, the greeting
card, the home page – will teach us much about the history, nature, social
functions and technologies of documents and media, also at a general level.[24]
Acknowledging
both the linguistic (i.e. conventionalised media forms with inscribed text) and
the sociological (i.e. Bazerman’s “forms of life” or what Orlikowski &
Yates labelled “a distinctive type of communicative action, characterized by a
socially recognized communicative purpose and common aspects of form”) concept
of genre, we understand document genres as sociotechnical typifications of
primarily textual media.[25]
This means that genres are products characterised by particular medial and
textual qualities, and at the same time activities and tools formed by time,
social function and ideology (not least the ideologies and strategies of the
particular communities using the genre as intermediary tool). Conversely, the
bibliographical tools (standards, practices, software etc.) in use in various
library and information activities are never genre neutral, but on the contrary
steeped in particular document genre assumptions: their social functions,
structure and visual patterns. Comparative studies of how document genres and
their structures and forms change over time, space and medium might offer
valuable insights into the interplay between social context and the conceptual
tools of LIS.
The
medial and textual characteristics that genre members share are manifested in
particular conventions, i.e. genre typical visual shapes (or silhouettes, if
you will) and doceme composition.[26]
These silhouettes signal that a given document belongs to a particular genre,
and make the reader more or less genre alert. But along with these silhouettes
at presentational interface level, digital document genre members also share
characteristics at layer composition, machine, and code level. Both these
interface and subsurface levels are parts of what we call document architecture
(DA, more about which in the following sections). Document genres are in other
words characterised by particular DAs, which can thus be genre specific.
Simply put, the way a particular document genre looks at a particular
moment in history is not haphazard but a result of particular historical and
sociotechnical parameters. A student familiar with the architecture of a
digital document on many levels stands a better chance of recognizing the
potentiality as well as the limitations of the document at hand. DA analysis,
enriched by insights into the historical and sociotechnical factors
influencing the genre as activity and tool, is, we believe, a fruitful way of
conducting DS that in the end might contribute to document theory development
within LIS in general and bibliography in particular.
The idea
that each genre is defined by a particular combination of both social function
and media architexture suggests interesting fields of research for LIS. What is
the balance between function and form? What historical relations exist between
old and new genres?[27]
How much do we consider genres to be bound to their social function and,
respectively, technological surroundings? When are genre parameters altered to
the degree that we can justifiably talk about new genres? These types of
questions relate to the way documents are subjected to categorizations,
classifications, and typifications by bibliography and its tools, themselves
being genres. In fact: what happens to the bibliographical genres themselves,
as there are changes in their social or technological surroundings?
A
bibliographic document genre where many academic areas and technological trends
converge is the scholarly edition (SE). This is a large and heterogeneous
genre, its subgenres ranging from facsimile, diplomatic, synoptic, genetic,
critical (also known as historical-eclectic), and variorum editions to
large-scale digital archives on compact discs or mounted on the web.[28]
The Ph.D. studies conducted by Dahlström deal with the SE as a bibliographical
document genre, looking at its form and function, ideology and epistemology,
architecture and media, variation and characteristics. The studies focus on how
the document genre as a mediating tool of a community (or activity system, if
you want to take that stance) reflects or causes changes in its social activity
systems and surrounding technological media ecology. In particular they look at
the structure, textual organisation and DA of web-distributed SEs, with a
special emphasis on the transition from print to digital media. How, if at all,
do the severe changes of both evolutionary and revolutionary nature in the
overall media ecology affect the labour division in the SE genre, on the one
hand between different tools and media, and on the other between the various
social groupings that by tradition have been actors in the entire activity
system producing and using SEs? Are we e.g. witnessing a change in labour division
between various document outputs in order to cope with new media parameters?
The rest of this section will provide an example or two of the kind of
questions and observations that have sprung to mind along the SE document genre
study.
There are
particular circumstances making the SE a complex and interesting case for
document genre analysis, media history and textual theory: for editorial theory
and its methodological foundation, textual criticism, media and documents
themselves have never been regarded as transparent and therefore trivial, but
have on the contrary always been of primary interest and discussed in depth.[29]
So the whole discourse around SEs is imbued with discussions on concepts,
matter and media. Further, the very genre class members, particularly the
critical editions, are themselves bibliographical tools working at metatextual
(or rather: meta-work) levels. Critical editing means examining a bulk of
documents and their texts, clustering these around the abstract notion of a
work, arranging them in a mental web of relations and trying to represent this
web in the particular document subgenre called the critical (or eclectic)
edition, purporting to represent the work. The way the edition positions the
documents to the work, and itself as mediator between them, is affected by
several crucial factors, e.g. ideology, epistemology, aim and function, context
/ community, tradition and supporting and distributing media, each of which is a
fine candidate for analysis.
Among
these many factors, take epistemology. Throughout the history of editorial
practice and textual criticism, a prominent idea has been that the scholarly
editor mediates the work without affecting it, that she is a more or less
neutral filter “correcting” the hitherto corrupt edited work and passing it
along to future generations. This is of course true in a trivial sense. But it
also has brought about a tendency to suppress the bias of the editor. The
editor’s interpretations, the tools she uses, the form of the media she works with,
and the theoretical foundations of her editorial principles, are all mediating
filters in the process. These are not always discussed as such by the editors
themselves. Reading many scholarly editorial prefaces and statements, one is
struck by their scientistic discourse and their failure to acknowledge just how the
editor shapes and contributes to the edited work, through deliberate choices in
contents and versions, in typographical, visual and compositional form and many
other factors.[30] The user of
an edition is dependent on the more or less deliberate choices made by the
persons responsible for creating the edition and the particulars of the tools
they were working with (e.g. version and variant discrimination, typography,
resolution, colour or lighting parameters).[31]
Therefore the editions are interpretations of the edited material. You will
never be able to predict precisely what variables individual readers or editors
are likely to be or not be interested in, now or in the future. What editors
can settle for is to create an edition that hopefully is a meaningful and
powerful instrument to as many readers as possible. But it is always a biased
product, as Nowviskie noted.
The very
form, media and type of edition the editor chooses as instrument frames how we
are able to see the edited work at all, and becomes a statement on the
appropriate way of conceiving the edited material, not least due to the
authoritative status of an SE. Many subgenres of the SE genre are tied to
particular bibliographical conceptual levels: the facsimile edition to the
traditional bibliographical concept of (visual and material) document, the
transcription edition to linguistic text, the eclectic edition to the
bibliographical understanding of work as the intentional entity inherent in or
constructable from manifest documents, and so on.
Nevertheless,
scholarly editing is conceived of as recreating or enhancing existing documents
when it is also always a matter of creating new documents. The positivist
stance was, and to some extent still is, quite powerful in scholarly editing.[32]
An outcome of this is the idea that editorial practice and textual criticism
are recreating original material, be it an abstract intentional authorial text
or a particular document text such as the reception text or a manuscript text.
Moreover, this is performed with the conviction that it is both possible and
ideal to confine the editorial task to mere discovery and proliferation of the
original, to being somewhat of a transparent medium in which the work can safely
be transported to its readers.[33]
The
notion that all the ingredients in the edited bulk of documents can be
identified, expressed, encoded and reproduced in new documents, is very much
alive in many current digital archive projects that hope not only to collect,
digitise and distribute everything in and about a particular work or author,
but also to enable archives from which a reader can generate practically any
type of edition she desires and thus partly or wholly fulfill the editorial
task herself.[34] The notion
is really based on the idea that documents in their entirety can be reproduced,
which is a kind of mimetic fallacy.[35]
Besides, if possible at all, such omniarchives would promise to annihilate the
need for particular future edition projects, since any reader could generate
any edition she needed from the archive, whereas developments in the last
decade rather suggest that each editing project itself requires new or
different software or the radical tailoring of existing software. In other words
it might be the case that each such archive can only support one edition type
based on one editorial-theoretical stance and in that sense be as monoform as
the printed edition ever was. Furthermore, libraries and their bibliographic tools
are, as Brown and Duguid noted, valuable for what they exclude as much as
include.[36]
A digital archive that has “everything” while at the same time hopes to be able
to skip the editor intermediary phase and pass those tasks along to the
readers, threatens to drown the reader with cognitive overload, abandoning its
discriminatory task.
There are
indeed many more observations, related to media and function, to be made around
the SE genre. Just to name one: the SE is an “informational” genre normally
trying to capture “poetic” genre texts. Taking into account the tension noted
by e.g. Jerome McGann between these two genre groups (where e.g. polysemy (when
a word has or is open to several meanings), ambiguity, recursivity, inter- and
intratextuality, visuality and materiality are used and valued differently),[37]
this makes the SE an interesting battlefield of medial conflicts, as does the
frequent observation that the traditional SE strives to render a 3D spatial
work on the medially sequential 2D surface of a document.
When subjected
to close analysis, the SE (and the discourse that surrounds it) reveals itself
as both a hermeneutically biased tool and a complex document genre. The SE as
bibliographical tool is certainly not alone in this respect. In fact any
bibliographical tool exhibits hermeneutical dimensions and bias, yet is
frequently conceived of as a value neutral surrogate and condensation.[38]
Scholarly editions, hypermedia archives, digital libraries and all of their
bibliographical tools are powerful discriminatory instruments and important
scholarly communication components. We need to see them not as neutral
prolongers of the life of the works and documents but as filtering media
affecting them and our way of perceiving them.
Architecture
as a metaphor was applied to text early on in rhetorical studies as a help for
students in their struggle with constructing texts on the levels of grammar and
disposition and as an image of the entire process of the rhetorical activity: inventio
– dispositio – ornatus.[39]
In more recent times, the metaphor has come to be used within computer-based
text creation, processing, and management in juxtaposition with several other
concepts, among them document and information, in
which case it usually refers to models of document types or information
networks or the study and application of these.[40]
Document architecture is a concept often used in connection to platform- and
software-independent document transfer in digital environments, such as in the
standards SGML (ISO 8879:1986) and ODA (Open Document Architecture; ISO
8613:1989) and increasingly in connection with information processing.
Taking
the cue from the description of DA in ISO 8879:1986, which reads “Rules for
the formulation of text processing applications” (clause 4.97), Dahlström and
Gunnarsson have suggested in an earlier article a broadening of the concept,
viewing it as a potential analytical tool that may disclose to us “a lot of the
practices and underlying theory of the production of the document”.[41]
The work proposed here is mainly concerned with the document as an organising,
as well as organisable, entity, and as such builds on Dahlström and
Gunnarsson’s broader document architecture concept. The resulting model will
draw on aspects of both information architecture and text structures, focusing
on such variables as logical and physical text structures, composition, layout,
medium, navigation, and human-computer interaction.
Several
models of document structures or architectures exist. Important contributions
have been made and continue to be made within the text encoding community when
it comes to capturing essential parts of text and document structures.[42]
Similar interests are found in different projects focusing on automated
analysis and understanding of digitised documents. Tools are developed that
make use of the signs of text structures that the layout of the document
provides in order to draw conclusions about the type and function of text
elements.[43]
Many
projects and researchers show an interest in the nature of documents; however,
most existing models focusing on architecture are created with the intention to
facilitate and make more efficient the production, editing, digitisation of, or
value adding to documents. Here, a different approach is suggested: a model for
the comparative analytical study of DAs in different media – focusing on
historical change, differences among disciplines or genres, or perhaps cultural
differences. While drawing on existing models, some additional variables may
serve to inform the analysis. Such a model needs to take into account
media-specific characteristics and epistemological norms underlying the
composition[44] while at the same time being firmly rooted in a theory
of the document. The results of such comparative – synchronic or diachronic –
studies may provide insights that can inform the construction of digital
libraries as well as individual documents in order to improve for example
knowledge organisation, information processing, and remediation.
If
architecture is viewed as handling space, and the knowledge of or ability to
(re)create that space with the use of physical materials so that it becomes
inhabitable, then the analogy with text as a handling of information[45]
that is given a more or less functioning form by the way it is expressed in a
material medium is not far-fetched. In both cases we are reminded of the mutual
dependency of the “content” and the matter (although the latter may only be
experienced as matter momentarily, as when displayed on a computer screen[46]),
where neither content nor matter can be regarded as preceding the other. To
reintroduce the connection between architectural practice and rhetoric made
earlier, all three practices deal with material, structure, and style (all
dependent on cultural, social, and technological factors). While the creator of
a document needs to compose a text (with its own material, structure and style
depending on e.g. medium and genre), she also at the same time has to entrust
it to a document which, to borrow David Levy’s words, she can imbue “with the
ability to speak” for her.[47]
This includes a choice of storage medium, signs and technology as well as
presentation medium, signs and technology,[48]
and a choice of how the text or texts should be displayed in the document both
with regard to making the function of the textual elements clear, and with
regard to aesthetic concerns. Without the ability to talk, the document is no
document,[49] and without
the material, the text will not be communicated over time and geographical
space. In a similar way, the walls, roof and floor that encapsulate space make
it into a room, but the space also makes the surrounding walls into a building.
While
drawing on the architecture metaphor, exactly which variables will be essential
in a model of DA is still very much under consideration. As can be inferred from
the discussion above, two important and quite evident parts are the material
aspects of the document, both of the storage medium and signs and of the
presentation medium and signs, and their interaction with the text structures
as they are expressions of markup (in the form of tags, typographical marking
etc.) and as they can be inferred through rhetoric and composition analysis.[50]
These aspects of the DA will be highly influenced by agent(s) and purpose(s),
as is the case with the architecture of buildings. However, documents often
provide metadata that speak of their production, use, and purpose, and special
interest will be paid in the model to parts of the structures that can be used
in different forms of knowledge organisation and document retrieval, such as
bibliographical information.
The
architecture metaphor brings to the fore the fact that buildings and documents
share a dependency on a number of cultural, social, and technological factors
that ultimately have to do with time. Documents as vessels of communication are
still (in our fleeting world) construed to ensure that the message will survive
a certain life span, as are buildings. But they also suffer the wear of time;
they are historical artefacts capable of informing us of the past (in terms of
production, use and reuse, renovation, etc.).
In our
time, hypertextual, networked and/or potentially ergodic[51]
document environments as well as new forms of communication, constitute an
additional challenge to a model of DA, not least because they often give
topical interest to the questions of what actually constitutes a document and
what constitutes the limits of a document.[52]
These challenges may be approached by introducing variables concerned with
space, time, and (domain-based[53])
usage, all important concepts in architecture and architectural theory.[54]
The
project presented in this section focuses on scholarly communication as it is
enacted in scholarly journals. As an increasingly important forum for
presenting research ideas and results and upholding the scholarly discussion,[55]
the journals are a constant concern for libraries and information centres. The
changeover of scholarly journals from paper to digital form not only
facilitates access; it is also an important condition for the work carried out
at the moment by SPARC[56]
and others[57] aiming to
move the control of the journals from the publishing corporations to the
researchers and universities. These transitions, in combination with the local
research archives initiated by many academic libraries, call for further
research within LIS into the nature of scholarly journals as documents and
texts, focusing on the relationship between new and old conditions and
practices, both social and technological. An investigation into the DAs of
scholarly journals, both traditional and innovative ones in different media,
may help answer questions concerning knowledge organisation, especially in a
digital environment, such as “Can the organisation of documents at a
micro-level be of use to us when dealing with the organisation of documents at
a macro-level?”, and help further an understanding of how the individual document
relates to the macro-level of the docuverse.
The
purpose of Francke’s project is to analyse a historical transition, brought
about by the emergence of new media and new communication networks, within the
area of scholarly journals, with regard to DA (if that continues to be a
relevant and useful metaphor). As has been outlined above, DA will be used as
an analytical framework for studying some qualitative aspects of the documents.
Journals within one or two disciplines will be chosen for case studies based
either on their representativeness or their innovativeness. The unit of study
will, in this case, be considered to be the journal issue, or possibly a series
of volumes, rather than individual articles, reports or reviews. Thus, no
restriction will be made to an individual article genre, although the genres
involved can be expected to be limited and determined by the journal’s domain.[58]
The intention is to capture the aspects of different material in the scholarly
journal, whether of a presentational, interactive, administrative, or reportative
nature, and to investigate whether, and if so how, the DA is affected by
media transformations and how closely the DA is dependent on domain membership.
Apart from the potential influence that such knowledge may have on the
structuring of document retrieval and management systems, it may bring
increased insight into the nature of documents and the possibilities and problems
involved in document analysis that will be of interest to document studies.
The
preceding section dwelt on how documents can be modelled according to the
innate characteristics of what metaphors allude to. Thus “document
architecture” puts focus on how a document is built and how particular
procedures make operations on its smaller parts possible. Another metaphor that
is often used in text encoding theory and practice, and is inherent to HTML
as a text encoding standard, is the arboreal metaphor, which conceives of
documents as trees, organic structures that embody the typified constraints
that define the species. Beeches are not beeches unless certain structural
characteristics of surface and spatial extension can be determined. HTML
documents (or instances of documents, to be correct) are not HTML documents if
the conditions set up in the content model are not met. In other words, they
are not true instantiations of the (abstract) HTML document unless its typified
constraints are respected.
This
section will confront these two metaphoric understandings of documents with the
fact that documents (or textual artefacts) in “real life” often violate the
models in mind: the contents of files with the htm or html extension are nevertheless
apprehended as HTML even though they are invalid instantiations of the model.
Architectural ideas, which may be reified as prescriptions of how the artefacts
should be processed, as well as arboreal model specifications that restrict possible
variations of HTML instances, seem to have been largely neglected in common
electronic publishing practice.[59]
These are of course consequences of the widespread use of HTML as a pure
formatting technology, which has resulted in more or less unpredictable
artefacts that threaten to resist every possible attempt at intelligent
document processing. Document processing, such as automatic generation of
significant document representations or tables of contents, is useful, but
restricted in that crucial elements, such as titles, author names, headings or
abstracts, cannot easily be identified, since they are not unambiguously
labelled.
As Allen
Renear has touched upon, and as is expressed in a recent debate on the adequacy
of XML encoding in Journal of Digital Information, there is something Platonic about
some document “ideologies”, especially the ones that put stress on the
separation of content and presentation.[60]
This strongly propagated distinction in SGML encoding theory is obviously
difficult to come to grips with for text encoding novices.[61]
The call for separation of content and presentation, the very idea that the
descriptive encoding of structural elements and the presentational formatting
of textual artefacts should be respected as two disparate activities, may in
fact be too unfamiliar an approach to gain any widespread success among encoders
used to Microsoft Word or other WYSIWYG word processing tools. The constraints
put upon the author by word processing tools are of another kind.
The
possible counterintuitivity of SGML encoding and the awkwardness of having one
generic document model for all genres may be the cause of this deterioration.
It can also be argued that during the last ten years, development of software
for text encoding support (i.e. HTML editing) and for visualisation of encoded
text (i.e. authoring tools and web browsers) has not contributed to any
paradigmatic change in common text encoding practice that fits what has been
termed a “content-based strategy”.[62]
This
situation may draw attention to a presumed relation between what can be termed
the “logical document structure” (to be encoded by descriptive markup in the
SGML paradigm) and the “physical document structure” (which is commonly held to
be a presentational matter, to be separately encoded).[63]
If the logical structure is an abstract notion realized in formal grammars,
such as in an HTML DTD, and the physical structure is the embodiment of a
semiotic process “based on conventions and traditions that determine the
production process of printing objects”[64]
or screen objects, is then HTML (or any other SGML application) much more than
a confusing tag set for encoding subjective theories of document models? If the
answer is yes, it may very well be a consequence of that undertheorization
which Dubin et al. suspect that the lack of machine-processable formalisms for
document structure semantics is expressing.[65]
To use a content-based strategy requires a thorough (intersubjective)
understanding of the particular element semantics for a document type.
Specifications of such semantics, especially for HTML, are poor.
Besides,
the situation may also be dependent on “undertheorization” of other aspects of
the DA. Is it really convenient that the physical structure is to be encoded by
separate style sheets, the rules of which operate on what is intended to be
content markup, elements of the logical document structure? Such situations
seem to presuppose a one-to-one relation between elements of the presumed two
structures, which is a disputable assumption. Certainly, one of the common
objections to the tree model of the SGML family points to the difficulty of
encoding concurrent structures. Dino Buzzetti’s critical view on what he terms
“strongly embedded markup”, that content structure cannot be encoded in such a
markup scheme, and Nelson’s advocacy of what can be termed stand-off
annotation are two other reactions to the problematic notion of document
structure as expressed in the SGML paradigm.[66]
But all these examples question the qualities of a particular technology and
its model of a world with which it is intended to be used in conjunction. It is
a question of another kind to ask if it is possible to model actual use of an
existing markup scheme, of whatever kind, to enable document processing for
certain purposes. Some elements, whether considered logical, physical or
anything else, are desirable to extract, modify or access for particular
purposes. It may be that markup does not convey anything other than itself, or,
to paraphrase Wittgenstein, that the meaning of markup is nothing but its use
in a game of (electronic) publishing. But even if we accept a notion of
document structure in which expression (i.e. markup) is primary to some
construed content, the need for content object inference is apparent in the
context of e.g. metadata generation, document space traversal and
personalization.
The
question then arises: is the performance of markup totally arbitrary, or is it
possible to discern discrete patterns of markup that reflect social, linguistic
or technological conventions, the rules of the game,
so to speak? Such a task is based on an assumption that the
production of text is a kind of typified act, just as in natural language the
theoretically infinite set of possible utterances is constrained in practice by
how a particular discourse is supposed to be enacted. If discursive rules and
conventions determine natural language expression and its contents, why should
not the same be true for the instances of document types? The task of determining
such discursive regularities may owe its classification principles to material
bibliography as portrayed in the section by Dahlström of this paper, as well as
to linguistics or results from other disciplines. It is probably also
important, as touched upon by both Dahlström and Francke, to take into account
changes in the wake of new media that remediate and transform the socially
determined entities that we call documents.
As may
have been clarified in the preceding subsection, we are advocating the study of
markup performance, and believe such a stance could yield results viable for
several disciplines. LIS needs a revival of document studies, of the entities
that convey this thing called information that LIS sometimes seems to have taken as
primary to social and linguistic artefacts. Computational Linguistics, which
often seems to dismiss markup as a non-linguistic phenomenon, despite its
apparent significance for discourse, may benefit from the results in much the
same way as from the results of research in phonetics. For markup theory,
regarded as a different level of linguistics, or as “visual phonetics” (to use
a paradoxical neologism), performance studies may be as valuable as corpus
linguistics has been for linguistics. Genre theory can obtain valuable data for
the study of new emerging genres. In particular, what this project is trying to
do is to find methods capable of identifying elements in HTML markup that are
significant for certain purposes and can thus assist in document processing. Such
methods will ideally result in sets of XSLT templates that can extract element
content, or transform instances of HTML documents. The aim is thus to find
methods for identification of such transformation rules, in which an element or
a set of elements (E1), given a specific context (C1), can be transformed to
another element (E2) that disambiguates its content with respect to another
context (C2). This can be seen as a logical relation and be encoded in a logic
programming style as in Fig. 1.[67]
elementTransform(E1, E2) :-
    condition(E1, C1),
    condition(E2, C2).
Fig. 1.
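The relation in Fig. 1 can also be read procedurally. The following Python sketch is only an illustration under stated assumptions, not the project's implementation: the names Element, Rule, followed_by_paragraph and apply_rule are hypothetical, the context C1 is modelled as a test on the neighbouring element, and the target context C2 is left implicit in the new label.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Element:
    tag: str   # generic identifier as found in the HTML instance
    text: str  # textual content of the element


# A context test receives the flat element sequence and an index into it.
Condition = Callable[[List[Element], int], bool]


@dataclass
class Rule:
    source_tag: str       # E1
    condition: Condition  # C1
    target_tag: str       # E2 (its context C2 is not modelled here)


def followed_by_paragraph(seq: List[Element], i: int) -> bool:
    """Hypothetical C1: the next element in the sequence is a paragraph."""
    return i + 1 < len(seq) and seq[i + 1].tag == "p"


def apply_rule(rule: Rule, seq: List[Element]) -> List[Element]:
    """Return a new sequence in which every E1 that satisfies C1 is relabelled E2."""
    out = []
    for i, el in enumerate(seq):
        # Conditions are always evaluated against the original sequence.
        if el.tag == rule.source_tag and rule.condition(seq, i):
            out.append(Element(rule.target_tag, el.text))
        else:
            out.append(el)
    return out


bold_to_heading = Rule("b", followed_by_paragraph, "heading")
doc = [Element("b", "Method"), Element("p", "We sampled ..."), Element("b", "note")]
result = apply_rule(bold_to_heading, doc)
```

In this toy sequence only the first b element is relabelled, since the last one is not followed by a paragraph.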
The task
can thus be seen as content-aware document transformation, where content is intimately
related to the document as a discursive artefact. Finally, it may be important
to underline that such transformations are relevant to certain genres but
irrelevant to others, e.g. genres of fiction.
To use
(explicit or implicit) markup as empirical evidence of the process of document
production, and for detection of its constituents, is not new.[68]
Several attempts to use linguistic content together with typographical
features, and even HTML markup, have been made before.[69]
However, few have recognized document instances as artefacts of social action
and tried to treat such data as valuable in actual processing tasks.
Regarding document structure as a kind of typification of documentary practices
may contribute computationally critical variables that will also guard
studies against linguistic reduction.
Performance
studies in linguistics are carried out in corpus linguistics, in which
determination of recurrent syntactic features is prominent. Markup performance
will in some respects be a more complex task. One of the main problems to
handle is that the instantiations of HTML patterns that can satisfy a
particular goal (of e.g. layout) are in principle infinite. The syntax of
sentences in natural language performance may seem restricted to a higher
degree, but this is really a question yet to be answered. Another problem of
markup performance studies is the amount of code that has to be analysed in
some way. At this stage, two initial working strategies have been adopted to
handle these two problems. To restrict the domain(s) some stratification
methods will be used, which cluster entities on the basis of variables that
reflect social, linguistic and technological factors, e.g. genre, culture,
rhetorical gesture and code generator. These will be discussed in the following
subsection. To analyse code, machine learning algorithms from the natural
language processing (NLP) domain will be used and these are described in the
subsection Machine Learning. The implementation (apart from the resulting XSLT templates) will
probably be done in Oz, a “multi paradigm” programming language that will be
briefly described in the subsection Implementation Framework.
One of my
hypotheses is that for a restricted domain of HTML encoding practices the
variation of the resulting HTML patterns can be determined, and a crucial
question is then on what grounds such domains can be determined. Which
sociocultural, linguistic and technological factors can be of assistance in
such attempts and function as computational variables? This is still an open
question and will hopefully be answered in the end, but it is certain that e.g.
layout characteristics will not be regarded as solely aesthetic or ergonomic
issues. The visual rhetorics conveyed by plotting out boxes of textual content
on a media surface may reveal an enactment of how the particular genre is
understood in terms of e.g. its components (or elements).[70]
One may assume that the boxes represent elements with genre specific meaning.
It is also stated by “genre theorists” that specific genres exhibit certain
determinable typifications with respect to several structural aspects, such as
the sectional sequence Introduction, Method, Results, Discussion and Summary
(IMRDS) in research articles.[71]
This
postulate can reasonably be defended in the case of elements such as
titles and headings. It can be stated that headings are usually identified as
headings when they in some way stand out on the page in a prominent position with
special typographic features and are followed by what can be determined as a
section of one or more paragraphs. Such presuppositions seem to be the ones in
use in cataloguing practices, and at least this is true for genres closely
related to print culture, such as research articles and reports. Remediated
genres pose problems, which can be seen in the cataloguing rules for “electronic
resources” that in some cases are extremely difficult to apply. In HTML
encoding, unless represented by an inline image, headings are usually contained
in an element of their own, whether labelled more or less unambiguously by one
of the generic identifiers in the set {h1, h2, h3, h4, h5, h6} or ambiguously by any other tag
pair (or set of tag pairs) whose identifiers indicate presentational features.
Conversely, the actual semantics of a particular markup pattern is just as
ambiguous. What is the intention of embedding textual content in table cells? A
two column space for the body of a text, carried out by the help of identifiers
from the set {table, tr, td} does not necessarily imply that the border between the two columns has
any semantics at all. On the other hand, a two column space, defined by the
same pattern, in which a navigational bar is contained in one of the columns,
the body of the text in the other, may have significance, such as in hyperlink
classification.
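The distinction between unambiguously and ambiguously labelled heading candidates can be illustrated with a small Python sketch based on the standard library's html.parser. It is a sketch under assumptions: the list PRESENTATIONAL_TAGS and the names HeadingCandidateParser and heading_candidates are hypothetical, and whether an "ambiguous" candidate really is a heading is exactly what the sketch cannot decide.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
# Assumed, incomplete list of identifiers that indicate presentational features.
PRESENTATIONAL_TAGS = {"b", "i", "u", "font", "big", "strong", "em"}


class HeadingCandidateParser(HTMLParser):
    """Collect (kind, tag, text) triples for possible headings."""

    def __init__(self):
        super().__init__()
        self.open = []        # stack of [tag, text_parts] for tracked elements
        self.candidates = []  # results in order of closing tags

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS or tag in PRESENTATIONAL_TAGS:
            self.open.append([tag, []])

    def handle_data(self, data):
        # Text belongs to every tracked element that is currently open.
        for entry in self.open:
            entry[1].append(data)

    def handle_endtag(self, tag):
        if self.open and self.open[-1][0] == tag:
            name, parts = self.open.pop()
            text = "".join(parts).strip()
            if text:
                kind = "explicit" if name in HEADING_TAGS else "ambiguous"
                self.candidates.append((kind, name, text))


def heading_candidates(html):
    parser = HeadingCandidateParser()
    parser.feed(html)
    parser.close()
    return parser.candidates


sample = '<h2>Results</h2><p><font size="5"><b>Discussion</b></font></p><p>Body text.</p>'
found = heading_candidates(sample)
```

Here the h2 element is reported as an explicit heading, while the nested b and font elements are reported only as ambiguous candidates; further contextual evidence, such as a following paragraph, would be needed to decide about them.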
That
these elements are genre dependent can be argued for by examining the roles
they play where they are used and how they are used. Titles of research
articles are usually indicative of and try to summarize the contents of the
article in question, whereas titles of journals and, to take an extreme
example, titles of artwork, show other socio-technical characteristics. Titles
of artwork can in fact be said to determine the possible interpretation of its
content or simply be inseparable from the artwork itself, such as in the case
of Magritte’s “ceci n’est pas une pipe”. Footnotes in the humanities styles and
the scientific styles of documentation likewise play different roles, which may
point to the fact that any attempt to automatically identify citations must
take these variations into account.
What
follows from this is that identification of significant elements must take into
account features that adhere to the entity in which the elements are embedded.
Crucial factors that have to be encoded into the algorithms are factors that
reflect the circumstances in which the document has been enacted.
As
pointed out in the introduction to this section, documents in the SGML family
are modelled as abstract tree structures. SGML itself, or XML, defines how to
specify grammars for validation of instances of such structures. These grammars
specify in their turn something of a container structure for natural language
expression, and natural languages are often depicted as tree structures (e.g.
in phrase structure grammars) and their performance can be described (and
desperately prescribed in schools and in many other places) by the help of
grammars. The similarities with natural language are almost self evident and it
would not be surprising to find methods for natural language processing (NLP)
adaptable for document structure processing.[72]
Parsing a natural language sentence is as much a traversal of a tree structure
as is parsing of e.g. an XHTML instance.
In NLP
complete and deep parsing is extremely difficult and attempts to do this are
still rather unsuccessful. However, algorithms for shallow parsing and
part-of-speech (POS) tagging have yielded good results.[73]
Several algorithms have been developed for these purposes and some have, in
fact, been tried for HTML to XML transformation tasks.[74]
There are several successful approaches to shallow parsing and POS tagging and
there is a need for investigating which are the most successful ones for adaptation
to document structure analysis, if at all. Some of the POS taggers are
examples of probabilistic approaches to the problem, exemplified by the Hidden
Markov Model and the Maximum Entropy Learning model; others exhibit different
non-probabilistic approaches, but all these presuppose an existing corpus with
a manually tagged counterpart, considered a “gold standard”.[75]
The performance of the tagger is evaluated by comparing its result of tagging
to the gold standard. Accuracy around 97 % is typical.
To
investigate one such algorithm for document processing, the Transformation Based
Learning (TBL) approach will be tried out, because it has been used before for
document processing tasks and gives the possibility of including other
variables than preceding and following entities.[76]
For POS tagging the TBL method begins by tagging every word with its most
likely tag taken from a predefined tagset representing word classes.[77]
Since the principle of tagging is based on producing rules for transforming
tags, the tagger needs a set of predefined templates to limit the number of
transformations to be tried out. These templates are usually manually written
as functions of the form “change tag X to Y if the preceding or
following word is tagged Z”. The tagger then iteratively tries to instantiate the variables and
evaluates the result by comparing to the gold standard. The iterations stop
when no better result is given.
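The learning loop described above can be sketched in Python as follows. This is a deliberately reduced illustration, not Brill's tagger: it uses a single template ("change tag X to Y if the preceding word is tagged Z"), a toy lexicon and a four-word gold standard, and all names (initial_tags, apply_rule, learn_rules) are hypothetical.

```python
from itertools import product


def initial_tags(words, lexicon, default="NN"):
    """Tag every word with its most likely tag (looked up in a lexicon)."""
    return [lexicon.get(w, default) for w in words]


def apply_rule(tags, rule):
    """Template: change tag X to Y if the preceding word is tagged Z."""
    x, y, z = rule
    out = list(tags)
    for i in range(1, len(tags)):
        # The context is tested against the tags as they were before this pass.
        if tags[i] == x and tags[i - 1] == z:
            out[i] = y
    return out


def errors(tags, gold):
    """Number of positions where the tagging differs from the gold standard."""
    return sum(t != g for t, g in zip(tags, gold))


def learn_rules(words, gold, lexicon):
    """Greedily pick the rule that reduces errors most; stop when none helps."""
    tags = initial_tags(words, lexicon)
    tagset = sorted(set(gold))
    rules = []
    while True:
        best_rule, best_err = None, errors(tags, gold)
        for rule in product(tagset, repeat=3):  # instantiate X, Y, Z
            if rule[0] == rule[1]:
                continue
            err = errors(apply_rule(tags, rule), gold)
            if err < best_err:
                best_rule, best_err = rule, err
        if best_rule is None:
            break
        tags = apply_rule(tags, best_rule)
        rules.append(best_rule)
    return rules, tags


words = ["the", "can", "can", "rust"]
gold = ["DT", "NN", "MD", "VB"]
lexicon = {"the": "DT", "can": "MD", "rust": "VB"}  # toy most-likely tags
rules, final_tags = learn_rules(words, gold, lexicon)
```

On this toy input the only rule learned is ("MD", "NN", "DT"), that is, "change MD to NN if the preceding word is tagged DT", after which the tagging agrees with the gold standard and the iteration stops.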
Thus, as
can be seen from the template example, the basic idea is of course that a
particular word that can be tagged with different tags, like “can”, can be
determined to be a noun under certain contextual circumstances. “can” preceded by the
word “the” is most likely a noun, and not a verb or a modal. Taking this over
to the task of element recognition implies that certain elements can be
determined through examination of preceding or following elements. We can for
instance say that a heading is most likely followed by a paragraph and that a
citation is followed or preceded by an inline reference to its full
bibliographic description, provided the document in question is of a particular
kind. This last statement points to the fact that the syntagmatic axis of the
HTML markup is not enough for transformation templates. We need other data too,
such as genre, or more specifically, if it is a research article, the documentation
style used. Moreover, we know that different authoring tools behave in certain
ways, and consequently embedded traces of these tools, e.g. as attribute values
to the meta element, can be of value in the learning process.
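As an illustration of how such a trace could be collected as a document-level variable, the following Python sketch reads the content of a meta element named "generator" with the standard library's html.parser. The names GeneratorSniffer and detect_generator are hypothetical, and the sketch assumes that the authoring tool announces itself in this way, which many tools do not.

```python
from html.parser import HTMLParser


class GeneratorSniffer(HTMLParser):
    """Remember the content of the first <meta name="generator"> element."""

    def __init__(self):
        super().__init__()
        self.generator = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.generator is not None:
            return
        # Attribute names arrive lowercased; values may be None.
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "generator":
            self.generator = attributes.get("content", "")


def detect_generator(html):
    """Return the declared authoring tool, or None if none is declared."""
    sniffer = GeneratorSniffer()
    sniffer.feed(html)
    sniffer.close()
    return sniffer.generator


page = ('<html><head><META NAME="Generator" CONTENT="Microsoft Word 97">'
        "<title>x</title></head><body><p>Text</p></body></html>")
tool = detect_generator(page)
```

The returned string could then be used as one of the stratification variables, alongside genre or documentation style, when transformation templates are instantiated.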
Still
more decisions have to be made in applying TBL algorithms for element
identification. One such decision concerns the significance of the existing
markup. Are the text chunks determined by the opening and closing HTML tags
adequate entities? Curran & Wong seem not to think so. In their
implementation parts of the HTML encoding are stripped off the instances. On
the one hand, overriding the element borders defined by the HTML tags may seem
to assume that markup is secondary to some abstract notion of the document
structure, that the creative process in some way is independent of the tools
for creation. On the other hand, the reverse case may imply a radical denial of
media independent writing strategies.
HTML
instances, as they stand out as sources for investigation, present some
practical computational problems. Foremost, every instance is a complex tree
structure that does not fit machine learning algorithms, so the instances have
to be flattened. Also, there are other preparatory operations needed on the
instances before more sophisticated algorithms can be applied to them, such as
tidying up of invalid instances. The next section will initially elaborate on
some of these issues.
In
choosing a programming language for the implementation of the TBL approach, a
reasonable choice is to pick a language that can offer suitable ready-made
applications and that fits the computational tasks involved. There are at least
two languages widely used for NLP, Prolog and Oz. Both languages have been used
for implementation of the TBL algorithm for POS tagging, but the latter
language is more flexible and offers the possibility for what has been termed
“multi paradigm programming”. This means that declarative approaches
(characteristic of Prolog programming) as well as e.g. functional and object
oriented approaches are possible. Oz is further described in a recent book by
van Roy and Haridi.[78]
Generally,
the flattening of a tree structure is a task of tree traversal picking up data
at each node. Two concerns have to be dealt with: 1) which data are to be preserved
and represented for later processing in the learning stage, and 2) how best to
represent this data. A basic data type that seems to be adequate for representation
is the record,
a compound data structure constituted by a label and a set of fields, where each field is constituted by
a pair consisting of a feature and a variable identifier (which is to be instantiated as a value).[79]
An example of a record is given in Fig. 2.
The
public repository MOGUL of Oz packages offers an XML parser written by Denys
Duchier, which serves as a starting point for this implementation.[80]
The output of this parser is a record (representing the abstract root node)
with a set of fields among which the one with the feature children carries the remainder of the tree, either
as another similar record or as a list of similar records, depending on the
number of children. My implementation flattens this record and outputs a list of
simple records; a non-instantiated record is given in Fig. 2. The three dots
indicate a not yet determined set of features. The feature pos is a number indicating the
sequential position (according to a depth-first search strategy) of the element
in the tree. The feature xpath is another way of preserving the position of
the element in the tree, by the help of a location path following what is
specified in the XML Path Language.[81]
The value of the feature text carries the content of an element, or
the atom nil if the element is empty.
pcdata(pos:Pos xpath:XPath text:Content ...)
Fig. 2.
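The flattening step can be sketched in Python with the standard library's xml.etree.ElementTree. This is not the Oz implementation referred to above, only an analogous illustration under assumptions: records are represented as dictionaries with the features pos, xpath and text, None stands in for the atom nil, only the text directly preceding the first child is kept, and the function name flatten is hypothetical.

```python
import xml.etree.ElementTree as ET


def flatten(xml_text):
    """Return a depth-first list of simple records for a well-formed instance."""
    root = ET.fromstring(xml_text)
    records = []

    def visit(node, path):
        text = (node.text or "").strip()
        records.append({
            "pos": len(records) + 1,         # sequential depth-first position
            "xpath": path,                   # location path to this element
            "text": text if text else None,  # None plays the role of nil
        })
        seen = {}  # per-parent counter for same-named siblings
        for child in node:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            visit(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    visit(root, f"/{root.tag}[1]")
    return records


instance = "<html><body><h1>Title</h1><p>First</p><p>Second</p></body></html>"
flat = flatten(instance)
```

For the small instance above, the second paragraph ends up as the fifth record with the location path /html[1]/body[1]/p[2]. Text that follows a child element (tail text) is ignored in this sketch and would need its own treatment.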
What has
not been said before is that the final TBL implementation will involve NLP of
textual content as well.[82]
This fact poses a problem regarding how to determine linguistic entities, a
problem parallel to the text chunking decision depicted above. It is highly
probable that an element tagged with the I or B identifier is syntactically and
semantically part of a surrounding “block” element, so for NLP procedures this
element must be analysed as part of the larger context. It may be helpful to
use the (essentially presentational) categorization of elements in the HTML
specification, such as the distinction between “block” and “inline” elements.
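One way of using the block/inline distinction for this purpose is to treat the innermost block element as the linguistic unit and to fold the text of its inline descendants into it. The Python sketch below illustrates the idea under assumptions: BLOCK_TAGS is a partial, hand-picked list, the input is well-formed XHTML without namespaces, text placed directly in a block that also contains other blocks is ignored, and the name block_texts is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed, partial list of block-level identifiers.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6",
              "li", "blockquote", "td"}


def block_texts(xhtml):
    """Return (tag, text) pairs for the innermost block elements."""
    root = ET.fromstring(xhtml)
    chunks = []

    def visit(node):
        has_block_child = any(
            d.tag in BLOCK_TAGS for d in node.iter() if d is not node
        )
        if node.tag in BLOCK_TAGS and not has_block_child:
            # itertext() also yields the text of inline descendants (b, i, ...).
            text = " ".join("".join(node.itertext()).split())
            if text:
                chunks.append((node.tag, text))
        else:
            for child in node:
                visit(child)

    visit(root)
    return chunks


page = "<body><div><p>A <b>bold</b> claim <i>here</i>.</p><p>Next.</p></div></body>"
units = block_texts(page)
```

The b and i elements thus contribute their words to the sentence of the surrounding paragraph instead of being analysed as separate chunks.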
The studies that have been described in this
paper hope to contribute to a new agenda of document studies. We have been
trying to provide examples of the types of presuppositions, research questions
and problems around documents that we find particularly interesting and
challenging. They are all triggered by observations made in the context of new
media, digital document production and electronic publishing, which provide at
times radically different document conditions from previous, print-based
technologies. The studies pull their respective theoretical and methodological
approaches from different academic areas (some of which are by tradition a little
foreign to LIS), but are joined by a sociotechnically and historically based
document understanding, in many ways setting their respective research agenda.
This suggests a possible common ground in LIS where various DS research can
meet, a hub around which approaches with quite different origin can rotate.
If we are searching for the establishment of a
third “paradigm”, alongside an information paradigm and an institutional
paradigm, document studies such as these might be indicative. We need to search
for some sort of unifying formulation of their epistemological contribution as
well as their empirical results. The epistemological analysis of DS as an
autonomous perspective within LIS has not yet been made. Theoretical
development can only come when a certain body of research has been carried out.
It is our hope that the studies presented in this paper can contribute to such
a development.
Aarseth, Espen. Cybertext: Perspectives on Ergodic Literature. Baltimore: Johns Hopkins UP, 1997.
Andersen, Jack. “The Materiality of Works: The Bibliographic Record as Text.” Cataloguing and Classification Quarterly 33, no. 3/4 (2002): 39-65.
Bachrach, Steven, et al. “Who Should Own Scientific Papers?” Science 281, no. 5382 (1998): 1459-1460. Also available at <http://www.sciencemag.org/cgi/content/full/281/5382/1459> (17 July 2003).
Bates, Marcia. “The Invisible Substrate of Information Science.” Journal of the American Society for Information Science 50, no. 12 (1999): 1043-1050.
Bazerman, Charles. Shaping Written Knowledge : The Genre and Activity of the Experimental Article in Science. Madison WI: University of Wisconsin Press, 1988.
Bazerman, Charles. “Systems of Genres and the Enactment of Social Intentions.” In Genre and the New Rhetoric, edited by Aviva Freedman and Peter Medway, 79-101. London: Taylor & Francis, 1994.
Beghtol, Clare. “The Concept of Genre and Its Characteristics.” Bulletin of the American Society for Information Science 27, no. 2 (2001): 1-5.
Brill, Eric. “Transformation-Based Error-Driven Learning and Natural Language Processing : A Case Study in Part of Speech Tagging.” Computational Linguistics 21, no. 4 (1995).
Briet, Suzanne. “What is Documentation?” Translated by Ronald E. Day and Laurent Martinet. [Qu’est-ce que la documentation? Paris: Éditions Documentaires Industrielles et Techniques (EDIT), 1951]. <http://www.lisp.wayne.edu/~ai2398/briet.htm> (8 July 2003).
Brown, John Seely and Duguid, Paul. The Social Life of Information. Boston, Mass.: Harvard Business School Press, 2000.
Buckland, Michael. Information and Information Systems. Westport, Conn.: Praeger, 1991.
Buckland, Michael. “What is a document?” Historical studies in Information Science. Eds Trudi Bellardo Hahn & Michael Buckland, 215-220. Medford, NJ : Information Today, 1998.
Buckland, Michael. “What is a ‘Digital Document’?” Preprint of article published in Document Numérique 2(2), 1998: 221-230. <http://www.sims.berkeley.edu/~buckland/digdoc.html> (27 May 2003).
Buzzetti, Dino. “Digital Representation and the Text Model.” New Literary History 33 (2002): 61-88.
Clark, James, and Steven DeRose. XML Path Language (XPath): Version 1.0. W3C Recommendation 16 November 1999. W3C, 1999. <http://www.w3.org/TR/xpath>
Comaromi, J.P. The eighteen editions of the Dewey Decimal Classification. Albany, NY : Forest Press, 1976.
Cowling, David. Building the Text: Architecture as Metaphor in Late Medieval and Early Modern France. Oxford: Clarendon Press, 1998.
Curran, James R., and Raymond K. Wong.
ÒTransformation-Based Learning for Automatic Translation from Html to Xml.Ó
Paper presented at the Fourth Australasian Document Computing Symposium, Coffs Harbour, Australia 1999.
Dahlström, Mats and Mikael Gunnarsson. “Document Architecture Draws a Circle: On Document Architecture and Its Relation to Library and Information Science Education and Research.” Information Research 5, no. 2 (2000). <http://InformationR.net/ir/5-2/paper70.html> (1 Nov. 2002).
Dillon, Andrew. “Information Architecture in JASIST: Just Where Did We Come From?” Journal of the American Society for Information Science and Technology 53, no. 10 (2002): 821-823.
Dillon, Andrew and Barbara Gushrowski. “Genres and the Web: Is the personal home page the first uniquely digital genre?” Journal of the American Society for Information Science 51, no. 2 (2000): 202-205.
Dubin, David, Allen Renear, C.M. Sperberg-McQueen and Claus Huitfeldt. “A Logic Programming Environment for Document Semantics and Inference.” Literary and Linguistic Computing 18, no. 1 (2003): 39-47.
Enmark, R. (1998) The non-existing point : on the subject of defining Library and Information Science and the concept of information. Paper at the 64th IFLA General Conference, August 16-21, 1998, available at: http://www.ifla.org/ifla/IV/ifla64029-94e.htm.
Gants, David. “Toward a Rationale of Electronic Textual Criticism.” Paper at ALLC-ACH ’94. Paris, 1994 (online at: <http://parallel.park.uga.edu/dgants/ach94.html>).
Goldfarb, Charles. The SGML Handbook. Oxford: Clarendon, 1990.
Greetham, David. “Editorial and Critical Theory : from Modernism to Postmodernism.” In Palimpsest : Editorial Theory in the Humanities, edited by George Bornstein and Ralph G. Williams, 9-28. Ann Arbor: Univ. of Michigan Press, 1993.
Greetham, David. Theories of the Text. Oxford: Oxford UP, 1999.
Greg, Walter W. Collected Papers, edited by J.C. Maxwell. Oxford: Clarendon, 1966.
Gunder, Anna. “Forming the Text, Performing the Work: Aspects of Media, Navigation, and Linking.” Human IT 5, no. 2-3 (2001): 81-206.
Hammerton, James, Miles Osborne, Susan Armstrong, and Walter Daelemans. “Introduction to Special Issue on Machine Learning Approaches to Shallow Parsing.” Journal of Machine Learning Research 2 (2002): 551-558.
Hansson, Joacim. “The social legitimacy of Library and Information Studies : reconsidering the institutional paradigm.” In Aware and responsible. Ed. Boyd Rayward, 49-69. Lanham, MD : Scarecrow Press, 2003 [in press].
Harnad, Stevan. “Electronic Scholarly Publication: Quo Vadis?” Serials Review 21, no. 1 (1995): 70-72. Also available at: <http://www.ecs.soton.ac.uk/~harnad/Papers/Harnad/harnad95.quo.vadis.html> (17 July 2003).
Hayles, N. Katherine. Writing Machines. Cambridge, Mass.: MIT Press, 2002.
Hjørland, Birger and Hanne Albrechtsen. “Toward a New Horizon in Information Science: Domain-Analysis.” Journal of the American Society for Information Science 46, no. 6 (1995): 400-425.
Hjørland, Birger. “Information Retrieval, Text Composition, and Semantics.” Knowledge Organization 25, no. 1/2 (1998): 16-31.
Hjørland, Birger. “Library and Information Science : practice, theory and philosophical basis.” Information Processing and Management 36, no. 3 (2000): 501-531.
Hjørland, Birger. “Documents, memory institutions and Information Science.” Journal of Documentation 56, no. 1 (2000): 27-41.
Houser, Lloyd. “Documents: The Domain of Library and Information Science.” Library and Information Science Research 8 (1986): 163-188.
Jurafsky, Daniel, and James H. Martin. Speech and Language Processing : An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Edited by Stuart Russell and Peter Norvig. Upper Saddle River: Prentice Hall, 2000.
Kirschenbaum, Matthew. “Editing the Interface: Textual Studies and First Generation Electronic Objects.” TEXT 15 (2002): 16-51.
Klein, Bertin and Andreas Abecker. Distributed Knowledge-Based Parsing for Document Analysis and Understanding. (GMD Report 48). Sankt Augustin: GMD – Forschungszentrum Informationstechnik, 1999. <http://www.gmd.de/publications/report/0048/Text.pdf> (26 April 2003).
Larsen, Poul Steen. “Books and Bytes : Preserving Documents for Posterity.” Journal of the American Society for Information Science 50 (1999): 1020-1027.
Lavagnino, John. “The Analytical Bibliography of Electronic Texts.” Paper at ALLC-ACH '96. Bergen: University of Bergen, 1996 (online at: <http://www.hit.uib.no/allc/lavagnin.pdf>).
Levy, David. “Documents and Libraries: A Sociotechnical Perspective.” In Digital Library Use: Social Practice in Design and Evaluation, edited by Ann P. Bishop, N. Van House and B. Buttenfield. Cambridge, MA: MIT Press, 2000.
Levy, David. Scrolling Forward : Making Sense of Documents in the Digital Age. New York: Arcade, 2001.
Lund, Niels Windfeld. “Documentation in a Complementary Perspective.” In Aware and Responsible, edited by Boyd Rayward. Lanham, Md : Scarecrow Press, 2003 [In press].
McGann, Jerome. Radiant Textuality. New York: Palgrave, 2001.
McKenzie, Donald F. Bibliography and the Sociology of Texts. London : The British Library, 1986.
Megyesi, Beáta. “Data-Driven Syntactic Analysis.” Doctoral thesis, KTH, 2002.
Meyer, Heinrich. Edition und Ausgabentypologie : eine Untersuchung der editionswissenschaftlichen Literatur des 20. Jahrhunderts. Bern: Lang, 1992.
Miksa, Francis. “Library and information science : two paradigms.” In Conceptions of library and information science : historical, empirical and theoretical perspectives. Eds. Pertti Vakkari & Blaise Cronin, 5-27. London : Taylor Graham, 1992.
Nelson, Theodor H. “Embedded Markup Considered Harmful.” In XML, Principles, Tools, and Techniques, edited by Dan Connolly, 129-34. O'Reilly, 1997.
Nowviskie, Bethany. “Interfacing the Edition.” [Talk at the Conference “Literary Truth and Scientific Method”]. Charlottesville, VA: Univ. of Virginia, 2000. Online at: <http://jefferson.village.virginia.edu/~bpn2f/1866/interface.html>
Odlyzko, Andrew M. “The Future of Scientific Communication.” In Access to Publicly Financed Research: The Global Research Village III, Amsterdam 2000, edited by P. Wouters and P. Schroeder, 273-278. NIWI, 2000. Also available at <http://www.dtc.umn.edu/~odlyzko/doc/future.scientific.comm.pdf> (17 July 2003).
Oppenheim, Charles, Clare Greenhalgh and Fytton Rowland. “The Future of Scholarly Journal Publishing.” Journal of Documentation 56, no. 4 (2000): 361-398.
Orlikowski, W. J. and Yates, J. “Genre repertoire: The structuring of communicative practices in organizations.” Administrative Sciences Quarterly 33 (1994): 541-574.
Peels, A. J. H. M., N. J. M. Janssen, and W. Nawijn. “Document Architecture and Text Formatting.” ACM Transactions on Information Systems 3, no. 4 (1985): 347-69.
Price-Wilkin, John. “Using the World-Wide Web to Deliver Complex Electronic Documents : Implications for Libraries.” The Public-Access Computer Systems Review 5, no. 3 (1994): 5-21.
Rayward, Boyd. The universe of information : the work of Paul Otlet for documentation and international organisation. Moscow : FID, 1975.
Renear, Allen H. “Out of Praxis: Three (Meta)Theories of Textuality.” In Electronic Text: Investigations in Method and Theory, edited by Kathryn Sutherland, 107-126. Oxford: Clarendon, 1997.
van Roy, Peter, and Seif Haridi. Concepts, Techniques, and Models of Computer Programming. MIT, 2003.
Salminen, Airi, Katri Kauppinen and Merja Lehtovaara. “Towards a Methodology for Document Analysis.” Journal of the American Society for Information Science 49, no. 7 (1997): 644-655.
Schamber, Linda. “What is a Document? Rethinking the Concept in Uneasy Times.” Journal of the American Society for Information Science 47, no. 9 (1996): 669-671.
Schryer, Catherine F. “The Lab Vs. The Clinic: Sites of Competing Genres.” In Genre and the New Rhetoric, edited by Aviva Freedman and Peter Medway, 105-24. London: Taylor & Francis, 1994.
Smiraglia, Richard P. The Nature of 'A Work' : Implications for the Organization of Knowledge. Lanham, Md.: Scarecrow Press, 2001.
Spang-Hansen, Henning. “How to Teach About Information as Related to Documentation.” Human IT 5, no. 1 (2001, orig. 1970): 125-143, also available at <http://www.hb.se/bhs/ith/1-01/hsh.htm>
SPARC. SPARC: The Scholarly Publishing and Academic Resources Coalition. SPARC Europe, 2003. <http://www.arl.org/sparc/> (17 July 2003).
Stehno, Birgit and Gregor Retti. “Modelling the Logical Structure of Books and Journals Using Augmented Transition Network Grammars.” Journal of Documentation 59, no. 1 (2003): 69-83.
Stehno, Birgit, Alexander Egger and Gregor Retti. “METAe: Automated Encoding of Digitized Texts.” Literary and Linguistic Computing 18, no. 1 (2003): 77-88.
Svenonius, Elaine. The Intellectual Foundation of Information Organization. Cambridge, Mass.: MIT Press, 2000.
Swales, John M. Genre Analysis: English in Academic and Research Settings. Cambridge: Cambridge Univ. Press, 1990.
Taghva, Kazem, Allen Condit, and Julie Borsack. “An Evaluation of an Automatic Markup System.” Las Vegas, Nevada: Information Science Research Institute, 1995.
Tanselle, G. Thomas. “Bibliography and Science.” Studies in Bibliography 27 (1974): 55-91.
Tanselle, G. Thomas. Literature and Artifacts. Charlottesville: The Bibliographical Society of the University of Virginia, 1998.
TEI Consortium. Text Encoding Initiative. 2001-2003. <http://www.tei-c.org> (17 July 2003).
Toms, Elaine G., Campbell, D. Grant, and Blades, Ruth. “Does genre define the shape of information: The role of form and function in user interaction with digital documents?” In ASIS '99: Proceedings of the 62nd ASIS Meeting, Washington, DC, October 31st-November 4th, 1999, 693-704. Medford, NJ: Information Today, 1999.
Van der Weel, Adriaan. “Review : Stijn Streuvels, De teleurgang van den Waterhoek. Edited by Marcel De Smedt and Edward Vanhoutte.” TEXT 14 (2001). Online at: <http://www.textual.org/text/reviews/vanderwe.htm>
Van House, Nancy & Sutton, Stuart. “The Panda syndrome: an ecology of LIS education.” Journal of education for library and information science 41, no. 1 (2000): 52-68.
W3C (World Wide Web Consortium). World Wide Web Consortium. 1994-2003. <http://www.w3.org> (14 July 2003).
Wilson, Patrick. Two Kinds of Power : An Essay on Bibliographical Control. Berkeley, Cal.: UCLA Press, 1968.
Witte, Stephen P. “Context, Text, Intertext : Toward a Constructivist Semiotic of Writing.” Written Communication 9: 237-308.
Zevi, Bruno. Architecture as Space: How to Look at Architecture. Rev.ed. New York: Da Capo Press, 1993.
[1] This version is a preliminary version meant only as a basis for presentation and discussion at DOCAM’03 at Berkeley, August 2003. It is later to be expanded into a full article.
[2] Hjørland, Birger. “Library and Information Science : practice, theory and philosophical basis.” Information Processing and Management 36, no. 3 (2000): 501-531.
[3] For an overview of the basic epistemological positions in this discussion, see Miksa, Francis. “Library and information science : two paradigms.” In Conceptions of library and information science : historical, empirical and theoretical perspectives. Eds. Pertti Vakkari & Blaise Cronin. (London : Taylor Graham, 1992), 5-27.
[4] Bates, Marcia. “The Invisible Substrate of Information Science.” Journal of the American Society for Information Science 50, no. 12 (1999): 1043-1050.
Van House, Nancy & Sutton, Stuart. “The Panda syndrome: an ecology of LIS education.” Journal of education for library and information science 41, no. 1 (2000): 52-68.
[5] Enmark, R. (1998) The non-existing point : on the subject of defining Library and Information Science and the concept of information. Paper at the 64th IFLA General Conference, August 16-21, 1998, available at: http://www.ifla.org/ifla/IV/ifla64029-94e.htm.
Hansson, Joacim. “The social legitimacy of Library and Information Studies : reconsidering the institutional paradigm.” In Aware and responsible. Ed. Boyd Rayward. (Lanham, MD : Scarecrow Press, 2003 [in press]), 49-69. It is only with hesitation that the term “paradigm” is mentioned here. In its Kuhnian sense, it is not fully applicable to a discipline such as LIS.
[6] By intra-scientific is meant that the basis of legitimacy is found within the discipline’s ability to grow and mature according to lines of development in traditional science. Hypothesis testing and cumulative theoretical growth are crucial and, thus, this perspective requires a well-defined core of the discipline around which such growth can occur. By extra-scientific is meant that the basis of legitimacy of LIS is founded mostly in needs of research within society rather than in needs within the scholarly community. A discipline such as LIS is seen as contributing to social development by putting scientific effort into the analyses of significant sectors in society. The contribution to social development is seen as more important than cumulative theoretical growth.
[7] Comaromi, J.P. The eighteen editions of the Dewey Decimal Classification. (Albany, NY : Forest Press, 1976). Rayward, Boyd. The universe of information : the work of Paul Otlet for documentation and international organisation. (Moscow: FID, 1975).
[8] Levy, David. “Documents and libraries : a sociotechnical perspective.” In Digital library use : social practice in design and evaluation. Eds. A.P. Bishop, N. Van House & B. Buttenfield. (Cambridge : MIT Press, 2000).
[9] Hjørland, Birger. “Documents, memory institutions and Information Science.” Journal of Documentation 56, no. 1 (2000): 27-41, p. 36.
[10] Buckland, Michael. “What is a document?” In Historical studies in Information Science. Eds. Trudi Bellardo Hahn & Michael Buckland. (Medford, NJ : Information Today, 1998), 215-220.
[11] The three studies are performed by, in order of appearance: Mats Dahlström, Helena Francke and Mikael Gunnarsson. They are all ongoing Ph.D. projects.
[12] Cf. Buckland, Michael. Information and information systems. (New York: Greenwood Press, 1991)
[13] See e.g. Houser, Lloyd. “Documents: The Domain of Library and Information Science.” Library and Information Science Research 8 (1986): 163-188, referring to the categorisation of documents according to their discursive function.
[14] We might e.g. conceive of archival documents, i.e. pieces of matter functioning as documents in archival situations, e.g. for preservational purposes, cf. Larsen, Poul Steen. “Books and Bytes : Preserving Documents for Posterity.” Journal of the American Society for Information Science 50 (1999): 1020-1027. One might equally postulate aesthetic documents, i.e. pieces of matter bringing us primarily aesthetic experiences, either as a representation of an artistically intended aesthetic work of art or else conceived of as aesthetically significant. There are in fact all kinds of ways to distinguish various subclasses of documents, depending on the purpose and theoretical position of your analysis and the sociocultural context in which it is performed.
[15] Lund, Niels Windfeld. “Documentation in a Complementary Perspective.” In Aware and Responsible, edited by Boyd Rayward. (Lanham, Md : Scarecrow Press, 2003 [In press]).
[16] As in Svenonius, Elaine. The Intellectual Foundation of Information Organization. (Cambridge, Mass.: MIT Press, 2000), or in Smiraglia, Richard P. The Nature of 'A Work' : Implications for the Organization of Knowledge. (Lanham, Md.: Scarecrow Press, 2001).
[17] Noted early on by Patrick Wilson in Two Kinds of Power : An Essay on Bibliographical Control (Berkeley, Cal.: UCLA Press, 1968) and Henning Spang-Hansen in “How to Teach About Information as Related to Documentation.” Human IT 5, no. 1 (2001, orig. 1970): 125-143, also available at <http://www.hb.se/bhs/ith/1-01/hsh.htm>.
[18] Cf. Hayles, N. Katherine. Writing Machines. (Cambridge, Mass: MIT Press, 2002).
[19] The term material bibliography is a collective for textual, descriptive, historical and analytical (or critical) bibliography. It is also referred to as “physical” bibliography. It is to be distinguished from “reference” (or “enumerative”) bibliography. Roughly, material bibliography is document-oriented whereas reference bibliography is work-oriented.
[20] Kirschenbaum, Matthew. “Editing the Interface : Textual Studies and First Generation Electronic Objects.” TEXT 15 (2002): 16-51.
[21] For discussions on the concepts of bibliography versus new media, see Gants, David. “Toward a Rationale of Electronic Textual Criticism.” Paper at ALLC-ACH ’94. Paris, 1994 (online at: <http://parallel.park.uga.edu/dgants/ach94.html>) ; Lavagnino, John. “The Analytical Bibliography of Electronic Texts.” Paper at ALLC-ACH '96. Bergen: University of Bergen, 1996 (online at: <http://www.hit.uib.no/allc/lavagnin.pdf>) ; Kirschenbaum “Editing the Interface”.
[22] As media theory matures, we have increasingly seen theoretical works (e.g. the writings by Wiebe Bijker, Espen Aarseth or Lev Manovich) attempting to identify and complicate such oversimplified media and technology models.
[23] Levy, David. Scrolling Forward : Making Sense of Documents in the Digital Age. (New York: Arcade, 2001)
[24] Cf. Bazerman, Charles. Shaping Written Knowledge : The Genre and Activity of the Experimental Article in Science. (Madison WI: University of Wisconsin Press, 1988) ; Dillon, Andrew and Barbara Gushrowski. “Genres and the Web: Is the personal home page the first uniquely digital genre?” Journal of the American Society for Information Science 51, no. 2 (2000): 202-205 ; suggestions for genre studies of the receipt and the greeting card in Levy, David. Scrolling Forward ; of the grocery list in Witte, Stephen P. “Context, Text, Intertext : Toward a Constructivist Semiotic of Writing.” Written Communication 9: 237-308.
[25] Bazerman, Charles. “Systems of Genres and the Enactment of Social Intentions.” In Genre and the New Rhetoric, edited by Aviva Freedman and Peter Medway (London: Taylor & Francis, 1994), 79-101; Orlikowski, W. J. and Yates, J. “Genre repertoire: The structuring of communicative practices in organizations.” Administrative Sciences Quarterly 33 (1994): 543. As a consequence of acknowledging material and symbolic dimensions in the document beyond its written text, it makes sense in this case to talk about document genres rather than the arguably narrower term preferred by Bazerman and other genre theorists: written genres (a deliberate focus on written text).
[26] Toms, Elaine G., Campbell, D. Grant, and Blades, Ruth. “Does genre define the shape of information: The role of form and function in user interaction with digital documents?” In ASIS '99: Proceedings of the 62nd ASIS Meeting, Washington, DC, October 31st-November 4th, 1999 (Medford, NJ: Information Today, 1999), 693-704.
The term “doceme” is borrowed from Lund “Documentation in a Complementary Perspective”, and designates the components of documents.
[27] Is e.g. the key ingredient in document genre colonisations (see Beghtol, Clare. “The Concept of Genre and Its Characteristics.” Bulletin of the American Society for Information Science 27, no. 2 (2001): 1-5) the social function, the visual pattern, the architecture, or all at the same time?
[28] Scholarly edition typologies are suggested by Meyer, Heinrich. Edition und Ausgabentypologie : eine Untersuchung der editionswissenschaftlichen Literatur des 20. Jahrhunderts. (Bern: Lang, 1992)
[29] See e.g. Thomas Tanselle's recent collection of essays, Literature and Artifacts. (Charlottesville: The Bibliographical Society of the University of Virginia, 1998). Besides, W. W. Greg’s early declaration from 1914 has been a motto for many textual critics and bibliographers: “What the bibliographer is concerned with is pieces of paper or parchment covered with certain written or printed signs. With these signs he is concerned merely as arbitrary marks; their meaning is no business of his” (reprinted in Greg, Walter W. Collected Papers, edited by J.C. Maxwell (Oxford: Clarendon, 1966), 247).
[30] Nowviskie says it succinctly: “In the codex form, a scholarly edition contains an editorial essay, which makes an argument about a text or set of texts, and is then followed by an arranged document that constitutes a frozen version of that argument. Let me make this clear: the text of a scholarly edition is an embodied argument being made by the text’s editor” (my italics). Nowviskie, Bethany. “Interfacing the Edition.” [Talk at the Conference “Literary Truth and Scientific Method”]. Charlottesville, VA: Univ. of Virginia, 2000. Online at: <http://jefferson.village.virginia.edu/~bpn2f/1866/interface.html>
[31] A review of a SE formulates the dilemma: “The software (…) is very emphatically present, to a large extent setting the editorial agenda.” Van der Weel, Adriaan. “Review : Stijn Streuvels, De teleurgang van den Waterhoek. Edited by Marcel De Smedt and Edward Vanhoutte.” TEXT 14 (2001): § 14, my italics. Online at: <http://www.textual.org/text/reviews/vanderwe.htm>
[32] Tanselle, G. Thomas. “Bibliography and Science.” Studies in Bibliography 27 (1974): 55-91 ; Greetham, David. “Editorial and Critical Theory : from Modernism to Postmodernism.” In Palimpsest : Editorial Theory in the Humanities, edited by George Bornstein and Ralph G. Williams (Ann Arbor: Univ. of Michigan Press, 1993), 9-28.
At the same time we must admit that there has always been a parallel but much less prominent recognition in textual criticism of the way the editor frames the possible horizon of the edition user. A strikingly early example of this is Schlegel (who explicitly referred to editing as “Übersetzung”); late examples are Tanselle 1998 (esp. p. 264), and Greetham, David. Theories of the Text. (Oxford: Oxford UP, 1999).
[33] The last decades have in fact given rise to alternative strategies in scholarly editing, most notably the ones associated with the so-called sociology of texts, accepting the work as an ever-changing multitude of versions, all of them manufactured for different purposes and audiences and using different modes of production. What you can hope for according to this line of editing is to simulate certain aspects of the work as it happened to appear to particular readers at specific moments in time. Cf. McKenzie, Donald F. Bibliography and the Sociology of Texts. (London : The British Library, 1986) ; McGann, Jerome. Radiant Textuality. (New York: Palgrave, 2001).
[34] An early passage by Price-Wilkin is indicative (Price-Wilkin, John. “Using the World-Wide Web to Deliver Complex Electronic Documents : Implications for Libraries.” The Public-Access Computer Systems Review 5, no. 3 (1994): 5-21, section 4.1): “With proper markup, an edition can be viewed in as many ways as the reader desires. It can be a variorum, a study edition, a critical edition, or historical evidence. The form the edition takes is defined by the user’s needs or preferences.”
[35] If the editorial role was transferable to the reader, then the latter would have to have access to all the original documents that the editor would have had in making the archive. This means the original documents themselves have to be included in their entirety in the digital archive, which is impossible. You cannot possibly computerise ÒeverythingÓ about an author or even a work or even a document.
[36] Brown, John Seely & Duguid, Paul. The Social Life of Information (Boston, Mass.: Harvard Business School Press, 2000), 181.
[37] E.g. McGann Radiant Textuality.
[38] For a thought-provoking comment on the polyvocal paratextuality of the bibliographic record, see Andersen, Jack. “The Materiality of Works: The Bibliographic Record as Text.” Cataloguing and Classification Quarterly 33, no. 3/4 (2002): 39-65.
[39] Gathering or constructing material; Ordering the material according to some form of structure (which may be hierarchical); Decorating the text in accordance with the intended style and audience. David Cowling, Building the Text: Architecture as Metaphor in Late Medieval and Early Modern France (Oxford: Clarendon Press, 1998), 140ff.
[40] For approaches within LIS, see e.g. the two special issues of Journal of the American Society for Information Science and Technology (on Document Architecture 48, no. 7 (1997) and on Information Architecture 53, no. 10 (2002)); Andrew Dillon presents an interesting suggestion for a working definition of IA in the latter, Andrew Dillon, “Information Architecture in JASIST: Just Where Did We Come From?” Journal of the American Society for Information Science and Technology 53, no. 10 (2002), 821.
[41] Mats Dahlström and Mikael Gunnarsson, “Document Architecture Draws a Circle: On Document Architecture and Its Relation to Library and Information Science Education and Research,” Information Research 5, no. 2 (2000). (1 Nov. 2002). <http://InformationR.net/ir/5-2/paper70.html>.
[42] Influential examples are described in e.g. Charles Goldfarb, The SGML Handbook (Oxford: Clarendon, 1990); TEI Consortium, Text Encoding Initiative, 2001-2003, (17 July 2003). <http://www.tei-c.org>; W3C (World Wide Web Consortium), World Wide Web Consortium, 1994-2003, (14 July 2003). <http://www.w3.org>. Cf. the critical discussion around markup as a representational strategy for text touched upon later in this paper in connection to Dino Buzzetti, “Digital Representation and the Text Model,” New Literary History 33 (2002); cf. also e.g. Allen H. Renear, “Out of Praxis: Three (Meta)Theories of Textuality,” in Electronic Text: Investigations in Method and Theory, ed. Kathryn Sutherland (Oxford: Clarendon, 1997), 107-126, for a discussion of the theories of textuality underlying different models; and David Dubin et al., “A Logic Programming Environment for Document Semantics and Inference,” Literary and Linguistic Computing 18, no. 1 (2003), for an approach to the problems posed to document architectures by markup semantics.
[43] For examples of applications of automated document analysis see Birgit Stehno and Gregor Retti, “Modelling the Logical Structure of Books and Journals Using Augmented Transition Network Grammars,” Journal of Documentation 59, no. 1 (2003); Bertin Klein and Andreas Abecker, Distributed Knowledge-Based Parsing for Document Analysis and Understanding, GMD Report 48, (Sankt Augustin: GMD – Forschungszentrum Informationstechnik, 1999), (26 April 2003). <http://www.gmd.de/publications/report/0048/Text.pdf>; and Airi Salminen et al., “Towards a Methodology for Document Analysis,” Journal of the American Society for Information Science 49, no. 7 (1997).
[44] Cf. Birger Hjørland, “Information Retrieval, Text Composition, and Semantics,” Knowledge Organization 25, no. 1/2 (1998): 23.
[45] In the sense of “information-as-thing”, cf. Michael Buckland, Information and Information Systems (Westport, Conn.: Praeger, 1991), 43.
[46] Cf. also virtual reality and the relation between space and matter in the virtual architectural room.
[47] David Levy, “Documents and Libraries: A Sociotechnical Perspective,” in Digital Library Use: Social Practice in Design and Evaluation, ed. Ann P. Bishop et al. (Cambridge, MA: MIT Press, 2000). The concepts of document and text, as well as of the architectural building, will in this section be restricted to artefacts that are created with some sort of intent.
[48] In Anna Gunder, “Forming the Text, Performing the Work: Aspects of Media, Navigation, and Linking,” Human IT 5, no. 2-3 (2001): 98ff., Gunder makes a distinction between the medium and signs used for storing a text, such as magnetic imprints (signs) on a videotape (medium), and those used for presenting the text, in this case images and sound (signs) on the screen and loudspeakers of a television set (medium). While this is an example of indirect access, the codex book has direct access, i.e. storage medium and signs coincide with presentation medium and signs.
[49] This seems to be a basic assumption even in the most accepting of document understandings. Cf. for example Suzanne Briet, whose examples of documents include the by now famous antelope, which through cataloguing and display, through being an object that will speak to its viewers about itself, becomes a document, a vessel that carries a text; see Suzanne Briet, “What is Documentation?” transl. Ronald E. Day and Laurent Martinet [Qu’est-ce que la documentation? (Paris: Éditions Documentaires Industrielles et Techniques (EDIT), 1951)], (8 July 2003). <http://www.lisp.wayne.edu/~ai2398/briet.htm>.
[50] On experiences with the interaction between text structures and material aspects, see e.g. Stehno and Retti, “Modelling the Logical Structure”; and Birgit Stehno et al., “METAe: Automated Encoding of Digitized Texts,” Literary and Linguistic Computing 18, no. 1 (2003).
[51] Espen Aarseth describes ergodic as a process in which “nontrivial effort is required to allow the reader to traverse the text”, see Espen Aarseth, Cybertext: Perspectives on Ergodic Literature. (Baltimore: Johns Hopkins UP, 1997), 1.
[52] Levy, David. “Documents and Libraries”; Michael Buckland, “What is a ‘Digital Document’?” Preprint of article published in Document Numérique 2(2), 1998, (27 May 2003). <http://www.sims.berkeley.edu/~buckland/digdoc.html>; Linda Schamber, “What is a Document? Rethinking the Concept in Uneasy Times,” Journal of the American Society for Information Science 47, no. 9 (1996).
[53] Understood as knowledge-domains consisting of thought or discourse communities, cf. e.g. Birger Hjørland and Hanne Albrechtsen, “Toward a New Horizon in Information Science: Domain-Analysis,” Journal of the American Society for Information Science 46, no. 6 (1995): 400-425.
[54] See e.g. Bruno Zevi, Architecture as Space: How to Look at Architecture, Rev.ed. (New York: Da Capo Press, 1993).
[55] Although there are suggestions that other forms of communication may be of increasing importance, see e.g. Andrew M. Odlyzko, “The Future of Scientific Communication.” In Access to Publicly Financed Research: The Global Research Village III, Amsterdam 2000, eds. P. Wouters and P. Schroeder, (NIWI, 2000). Also available at <http://www.dtc.umn.edu/~odlyzko/doc/future.scientific.comm.pdf> (17 July 2003).
[56] SPARC, SPARC: The Scholarly Publishing and Academic Resources Coalition, (SPARC Europe, 2003), (17 July 2003). <http://www.arl.org/sparc/>.
[57] Some often-cited articles are Stevan Harnad, “Electronic Scholarly Publication: Quo Vadis?” Serials Review 21, no. 1 (1995). Also available at: <http://www.ecs.soton.ac.uk/~harnad/Papers/Harnad/harnad95.quo.vadis.html> (17 July 2003); and Steven Bachrach et al., “Who Should Own Scientific Papers?” Science 281, no. 5382 (1998). Also available at <http://www.sciencemag.org/cgi/content/full/281/5382/1459> (17 July 2003); for the commercial publishing companies’ opposing view, see e.g. Charles Oppenheim et al., “The Future of Scholarly Journal Publishing,” Journal of Documentation 56, no. 4 (2000).
[58] However, article genres will be one important variable to incorporate into the model.
[59] Though, in the wake of the introduction of different XML technologies that seem to offer more easily understood advantages for the end user, some indications of increasing awareness even of the significance of document structure grammars seem to emerge.
[60] Renear, Allen H. “Three (Meta)Theories of Textuality.” In Electronic Text : Investigations in Method and Theory, edited by Kathryn Sutherland (Clarendon, 1997), 107-26.
Renear, Allen H. “The Descriptive/Procedural Distinction Is Flawed.” Markup Languages : Theory & Practice 2, no. 4 (2001): 411-20.
Hillesund, Terje. “Many Outputs – Many Inputs: XML for Publishers and E-Book Designers.” Journal of Digital Information 3, no. 1 (2002).
Walsh, Norman. “XML: One Input – Many Outputs: A Response to Hillesund.” Journal of Digital Information 3, no. 1 (2002).
[61] Perhaps needless to say, references to the “SGML family” or the “SGML paradigm” are applicable to XML as well.
[62] The typology of strategies, to which the content-based strategy is counted, is elaborated in Renear 1997.
[63] This distinction can be traced back at least to the mid 80s, clearly described in Peels, A. J. H. M., N. J. M. Janssen, and W. Nawijn. “Document Architecture and Text Formatting.” ACM Transactions on Information Systems 3, no. 4 (1985): 347-69.
[64] Stehno, Birgit, Alexander Egger, and Gregor Retti. “METAe – Automated Encoding of Digitized Texts.” Literary & Linguistic Computing 18, no. 1 (2003): 77-88.
[65] Dubin, David, Allen Renear, C. Michael Sperberg-McQueen, and Claus Huitfeldt. “A Logic Programming Environment for Document Semantics and Inference.” Literary & Linguistic Computing 18, no. 1 (2003): 39-47.
[66] Buzzetti, Dino. “Digital Representation and the Text Model.” New Literary History 33 (2002): 61-88.
Nelson, Theodor H. “Embedded Markup Considered Harmful.” In XML, Principles, Tools, and Techniques, edited by Dan Connolly (O'Reilly, 1997), 129-34.
[67] This is of course an ideal generalization. The construction of rules that take into account both source and target contexts will probably be too large a task, but this formalisation accepts a theory in which document architectures are determined by usage and open for interpretation.
[68] There is a continuum of markup categories ranging from carefully chosen disambiguating tags from elaborated markup schemes to ambiguous traces of more or less unconsciously applied systems of signification, such as italicization.
[69] Curran, James R., and Raymond K. Wong. “Transformation-Based Learning for Automatic Translation from HTML to XML.” Paper presented at the Fourth Australasian Document Computing Symposium (Coffs Harbour, Australia, 1999).
Taghva, Kazem, Allen Condit, and Julie Borsack. “An Evaluation of an Automatic Markup System.” (Las Vegas, Nevada: Information Science Research Institute, 1995).
Stehno, Birgit, and Gregor Retti. “Modelling the Logical Structure of Books and Journals Using Augmented Transition Network Grammars.” Journal of Documentation 59, no. 1 (2003): 69-83.
[70] Bazerman’s investigation of US patent applications and grants is but one example. Bazerman, Charles. “Systems of Genres and the Enactment of Social Intentions.” In Genre and the New Rhetoric, edited by Aviva Freedman and Peter Medway (London: Taylor & Francis, 1994), 79-101.
[71] See e.g. Swales, John M. Genre Analysis: English in Academic and Research Settings (Cambridge: Cambridge Univ. Press, 1990), and Schryer, Catherine F. “The Lab Vs. The Clinic: Sites of Competing Genres.” In Genre and the New Rhetoric, edited by Aviva Freedman and Peter Medway (London: Taylor & Francis, 1994), 105-24.
[72] Explicit analogies are developed e.g. in Taghva, Kazem, Allen Condit, and Julie Borsack. “An Evaluation of an Automatic Markup System.” (Las Vegas, Nevada: Information Science Research Institute, 1995).
[73] Shallow parsing consists of identifying e.g. nominal and prepositional phrases, leaving parts of the tree structure ambiguous. A formal definition as “assigning partial syntactic structure to sentences” is given in James Hammerton et al. “Introduction to Special Issue on Machine Learning Approaches to Shallow Parsing.” Journal of Machine Learning Research 2 (2002): 551. Part-of-speech (POS) tagging is the task of automatically or semi-automatically identifying and tagging word classes.
[74] Curran & Wong 1999. It is interesting to note that the HTML tags play a rather marginal role in this case, where the text chunks defined by HTML element borders are often reorganised into smaller or larger chunks.
[75] A corpus can be defined as a collection of text, representative of a domain delimited with respect to the scope of interest for its use. For an overview of POS tagging, see Jurafsky, Daniel, and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Edited by Stuart Russell and Peter Norvig (Upper Saddle River: Prentice Hall, 2000).
For an overview and evaluation of its use in applications for Swedish, see Megyesi, Beáta. “Data-Driven Syntactic Analysis.” Doctoral thesis, KTH, 2002.
[76] The TBL algorithm was originally developed by Eric Brill in the mid-1990s. See e.g. Brill, Eric. “Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging.” Computational Linguistics 21, no. 4 (1995).
[77] An example of a small tagset is the Penn Treebank tagset, consisting of 45 tags.
[78] van Roy, Peter, and Seif Haridi. Concepts, Techniques, and Models of Computer Programming (MIT, 2003).
[79] For those inclined to LIS interpretations, it may be important to underline that this notion of record is context free and has a slightly different denotation than in the context of e.g. “bibliographic records”.
[80] The documentation of the parser can be found at
http://www.mozart-oz.org/mogul/doc/duchier/xml/parser/index-0.2.1.html
[81] Clark, James, and Steven DeRose. XML Path Language (XPath): Version 1.0. W3C Recommendation, 16 November 1999. W3C, 1999. http://www.w3.org/TR/xpath
[82] An obvious reason for this is that syntactic and morphological features of natural language contents may correspond to features in markup. Some representations of these aspects will be encoded into the record as one or more fields, forming part of the conditional context and functioning as variables of a certain weight.