Mining the Electronic Documents for Local Collections
by Raleigh Muns
Transcript of a talk delivered at the Spring, 1995 Depository Library
Council Meeting and Federal Depository Conference, Wednesday, April 12,
1995, Arlington, Virginia
OUTLINE
-. Some initial quotes about information
I. Who am I?
In which a personal and individual context is set.
II. Why am I doing what I am doing?
In which motivation and opportunity are explored.
III. What am I doing?
In which the overall approach is explained.
IV. How am I doing it?
In which some nuts and bolts are examined.
V. Bells and whistles
In which some fancier things about mining and providing are explained.
VI. Risks
In which some unforeseen problems are put forth.
VII. Results
In which feedback on activities is presented.
VIII. Conclusion
Some initial quotes about information:
When action grows unprofitable, gather information; when
information grows unprofitable, sleep.
-Ursula K. Le Guin (The Left Hand of Darkness (1969), ch. 3).
Information is the oxygen of the modern age. It seeps through
the walls topped by barbed wire, it wafts across the electrified
borders.
-Ronald Reagan (Guardian; London, 14 June 1989).
The government is us; we are the government, you and I.
-Theodore Roosevelt (Speech, 9 Sept. 1902, Asheville, N.C.).
I. Who am I?
By design and trade I am a Reference Librarian and not a
Government Documents specialist. I used to belong to GODORT as a
"gov docs junkie" but the reality of having children and a
librarian's pay caused me to forgo that frill and thrill after
about two years. As a product of UCLA I had access to their
extensive documents collection and probably became unabashedly
addicted to the information the government provides when I ran
across a tattered volume of hearings from the 1950's on how
comic books were turning America's youth into a bunch of crazed
and violent communists. I love that kind of stuff!
The University of Missouri-St. Louis is a state-supported
university that honestly delivers a fine education but, to be
honest, has no real reputation as a flagship of higher learning.
Established in the early 1960's, living within a budget imposed
by a frugal state government, and existing in a country that
appears to increasingly be supporting its educational
institutions and libraries with Nike slogans ("Just do it") the
university decided, as many others have done, to gorge on the
govdocs teat as a full-depository; this was followed by a
re-scaling four years later to about 90 percent selectivity
which is where we stand today.
Because we are young and underfunded, necessity has led us to
rely heavily on the documents collection. Underfunding also
means understaffing, which at UM-St. Louis means that we are all
multi-specialists, or, as I like to say, at UM-St. Louis we are
ALL government documents librarians. Our single dedicated
government documents librarian is a REAL reference librarian and
all of the REAL reference librarians can cite SuDoc numbers in
our sleep.
One of the final pieces in the puzzle has been the intellectual
integration of the collection by including the Government
Printing Office/OCLC tapes in our online catalog allowing
patrons to access the collection transparently.
II. Why am I doing what I am doing?
1. I see this as traditional librarianship.
2. We are poor.
3. We can.
4. The information we are providing has real-world
applications in our mission.
"1. I see this as traditional librarianship." The main
activities of our profession revolve around "access" and
"preservation." Simple, basic librarianship
consists of acquiring materials (collection development);
organizing (cataloging, shelving); intermediating (reference
services); and maintaining (preserving). Technical
considerations aside, this is exactly what I am doing with a
local internet gopher-based collection.
"2. We are poor." This is a flip way of pointing out the value
of the depository program. Because of the materials we receive,
we can put resources in other areas not covered by the
depository program. There should be nothing new here to any in
the audience. What I do is an extension of the desire to extract
value from existing resources at a minimal cost.
One of my colleagues contends that what I do is not traditional
librarianship. She points out that I am more in the publishing
business than the library business. I counter that when we take
the traditional roles of librarianship, and apply the context of
a specific institution with a specific mission, what I am doing
is the same as what we have always done in the profession. This
last part is the practical key to all that I do: the context of
what we do.
Let me elaborate: rather than become a vacuum cleaner for
everything that is out there, I suggest that you act as I do and
deal in a world where acquisition decisions of electronic
materials (i.e., mined electronic government documents) are the
same as acquisition decisions for "real" documents, or "real"
non-documents. A projected need must be met.
For example, I do not choose to put an electronic document up on
our Internet gopher because I think it will be used; I put it up
because I know it will be used. This is based on my hands-on
experience with the government documents collection via our
Reference Desk. When I ran across the Occupational Outlook
Handbook on CD-ROM from the depository program, I knew that this
was an item that would be in demand because of the constant use
of the print version. Sometime last year I gave a talk in San
Francisco that stated "Everything I ever learned of value I
learned in library school." The struggle many librarians are
having with the new technologies can be mitigated by stepping
back and realizing that though information formats are changing
radically, the underlying concepts of what we do have not
changed. Evaluation of a resource, for example, should be
independent of the medium. What good is it? What need does it
meet? Are there alternatives? If the process of accessing an
electronic document seems stupid, confusing, and non-intuitive,
it is probably because it is stupid, confusing, and
non-intuitive. I think what I am saying is that if you are a
confused, yet fearless, librarian, you will do fine. Now, we may
still use stupid, confusing, non-intuitive resources, but at
least we should be doing it with open eyes.
Why are we doing this? "3. We can." Two conditions come together
in a large number of the federal documents I use which make
mining the electronic documents a minor technical exercise, and
they are:
1. The documents are already in an electronic format.
2. Uses of the documents are (usually) not restricted by
copyright.
Lots of useful print documents are not copyrighted and require
extra effort (prohibitive effort based on most of our resources)
to utilize; lots of other electronic documents are simple to
use, but are under copyright; but the synergy of these two
simple conditions creates an explosive mix that ignited one
over-caffeinated, altruistic librarian's ongoing activities.
I would like to point out that one of my frustrations is the
problem of determining the copyrighted nature of a depository
item. For example, one of the products I have raided is the
eminently useable National Trade Data Bank (NTDB) CD-ROM. On the
NTDB is an excellent small monograph:
OPPORTUNITIES IN MEXICO: A SMALL BUSINESS GUIDE is the
product of a public/private sector initiative among the
U.S. Small Business Administration (SBA), the Service
Corps of Retired Executives (SCORE) and AT&T. This
guide provides U.S. small businesses with practical
trade information on exporting to Mexico.
Unfortunately, the Program Description part of the file
unambiguously states:
Contents of this publication are copyrighted. All rights
are reserved to Free Trade Consultants. No portion of
this book may be reproduced mechanically, electronically
or by any other means, including photocopying, without
written permission from John L. Manzella, Author,
President of Free Trade Consultants, Buffalo, New York.
Since this item is available on the official NTDB gopher
(gopher://sunny.stat-usa.gov:70/00/STAT-USA/NTDB/) worrying
about this seems absurd, but violation of copyright in our
profession is serious, even when it is absurd.
Another item I would dearly love to mine is the Joint Electronic
Library CD-ROM (D 5.21:994/2/1 A) which is chock full of all
sorts of historical papers from Military War College sources. I
have neither the time nor the inclination to pursue determining
to a conclusion the true copyright nature of this source (and
suspect that it is a piece-by-piece answer anyway and not global
to the entire CD-ROM). However, this is again a barrier to
mining information, one I would rather was not there. In any case, although
these items are coming through the Depository program, as with
printed materials via the program, there is no guarantee that
they are in the public domain. This problem is magnified in what
I do because of the nature of providing electronic information
on the Internet, even locally. It is one thing to make a single
photocopy, and yet another to create a resource that can be
easily reproduced by fifty million people.
The final piece of "Why am I doing this" is that, "4. The
information we are providing has real-world applications in our
mission." The information items provided are inherently useful.
This is not an exercise in academic experimentation, but another
dimension in providing desired information to those who need or
want it. If I can inspire anyone to contribute to this common
pursuit of our profession as I am doing, then I have again
leveraged more value than it would appear out of these "dry as
dust" government documents.
III. What am I doing?
In brief: raiding, stealing, pointing, mirroring, manipulating
documents received via the depository library program or on the
Internet. As institutions, especially government institutions,
shift from paper to electronic formats, the availability of
electronic documents is exploding, and thus the available
opportunities are exploding.
Based on what is currently on our gopher, a user can find the
Army Area Handbooks, Economic Reports of the President, the U.S.
Industrial Outlook, "The Green Book (1994)" Overview of
Entitlement Programs, and a list (unique?) of all depository
libraries organized by state.
I would like to even brag a bit about preceding the official
National Trade Data Bank gopher site by about a year (and grouse
at the same time at the initial announcement of the NTDB's
availability on the Internet as "for the first time anywhere").
Though we did not mount all NTDB files on our gopher, we did
extract, again, those we found most useful from our immediate
experience such as the Background Notes and aforementioned Army
Area Handbooks (among others). In fact, by extracting the most
useful files (again reflecting our experience with local user
needs) we have found that we have cut down on what I call the
"noise" on the NTDB CD-ROM of having too rich a body of
information. This is application of the selection and collection
activities of traditional librarianship. The pleasing thing
about this is that in mining the electronic documents we are
less tied to pure economic forces (how much does an item cost?)
and more tied to the intellectual activity of determining patron
needs in an almost abstract manner.
Though I am addressing "Mining the Electronic Documents for
Local Collections," the borderless nature of the Internet
really means that everything is universally accessible. I admit
to being driven by local needs, and encourage you to do the same.
The truth is that many of our local needs are the same local
needs as users of the Cleveland Public Library, the Library of
Congress, or the America Online service. In fact, according to
our user logs, the largest group of users of our Internet gopher
government documents are subscribers to America Online.
IV. How am I doing it?
Also, how can YOU do it? Undeniably, a certain level of
technical expertise is required. The more expertise you have,
the more you can do, the fancier you can get, the sexier your
site, and the happier you can make your patrons. However, you do
not need to know how to do computer programming (though if you
know any programming, you can do some fun things); you do not
need to know calculus or algebra; you do not need to know
assembly language programming; in fact, if you have conquered
any modern word-processing program, you have already learned
what is probably the most difficult (and onerous) part of all I
do.
What DO you need?
1. An existing Internet infrastructure of some kind.
2. Public domain files in an electronic (ASCII preferred)
format.
3. The aforementioned word-processing skills.
4. The ability to download/upload files from/to local
PC's/Macs and your net site.
5. About one hour of instruction (or decent documentation).
Whether you are dealing with the World Wide Web, gopher, ftp
sites, or whatever, a necessary, but not sufficient, condition is
that someone at your institution be running a machine on the
Internet. Mainframes, PC's, Macs, whatever, can all be used to
run freeware Internet server software. You will be hard-pressed
to find institutions with sites on the Internet that do not have
an existing server of some kind already up and running. Your
job, Mr. and Ms. Phelps, should you decide to accept it, is to
make the human connection to the people running the machines.
Without an existing Internet infrastructure of machines,
software, and people, you cannot do any of what I am about to
describe.
Interestingly, there is a growing array of commercial providers
who will do this for you. For $9.95 a month you can lease space
on the World Wide Web with a company called Webcom
(http://www.webcom.com). They become the infrastructure about
which I am talking. This is not a recommendation of Webcom.
I am just using them as what I consider a prototypical example
of how the commercial sector can provide the needed Internet
infrastructure.
In my situation, I noted that some of our computer techies had
set up a prototype gopher server on the campus mainframe and I
innocently asked if I could have an account called "The
Library." After about fifteen minutes of instruction and with a
single sheet of paper showing me how to set up gopher menu
structures (all done with simple text editors), I was told I
could start uploading files that could be accessed. For those of
you who think some mysterious and arcane knowledge is required
to put files on the Internet I cannot stress how far from the
truth is such a misconception. You can do mysterious and arcane
things on the Internet, but being a basic provider is incredibly
simple, provided you have an existing Internet infrastructure
(or buy access to one).
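
To give a sense of how little is involved, here is a minimal
sketch in Python of what a gopher menu looks like when a server
sends it, following the line format in RFC 1436: an item type
character and display string, then a selector, host, and port,
separated by tabs. The host name, selectors, and function names
are hypothetical, and your own server software may describe menus
in its own configuration files, so treat this as an illustration
of the idea rather than a recipe for any particular server.

```python
def menu_line(item_type, display, selector, host, port=70):
    """Format one gopher menu item as described in RFC 1436."""
    return f"{item_type}{display}\t{selector}\t{host}\t{port}\r\n"


def build_menu(items, host, port=70):
    """Join menu items and end the listing with a lone period line."""
    lines = [menu_line(t, d, s, host, port) for (t, d, s) in items]
    return "".join(lines) + ".\r\n"


# Hypothetical host and selectors, for illustration only.
HOST = "gopher.example.edu"
ITEMS = [
    ("1", "Government Documents", "/govdocs"),  # type 1 = directory
    ("0", "Occupational Outlook Handbook", "/govdocs/ooh.txt"),  # type 0 = text
]

menu = build_menu(ITEMS, HOST)
print(menu, end="")
```

Each menu item is one line of plain text, which is why a simple
text editor is all that is needed to maintain such structures.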
Now, being a depository library, we (and you no doubt) receive
tons of CD-ROMs. This is the crop from which you will harvest.
Remember, WHAT you harvest is partly limited by technical
considerations, but more critically related to understanding in
a real-world sense what information is worth mining.
Initially, I install the software for accessing a CD-ROM as
directed by any accompanying documentation. There are still many
people that do not know that the information on a CD-ROM is as
accessible as files on a diskette or your workstation's hard
drive. One does not necessarily need to install special software
to look at files on a CD. It is not unusual to have
workstations, old and cheap ones, which cannot use the interface
software supplied. It may not have enough memory; it may not
have a color monitor; it may not have the most recent version of
the DOS operating system. By looking at the files on a CD as
you would files on a diskette, one can still extract valuable
information that would be otherwise inaccessible.
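
As a rough illustration of looking at a CD-ROM as you would a
diskette, the following Python sketch walks a directory tree and
counts the files it finds by extension. The mount point
"/mnt/cdrom" and the function name survey_files are assumptions
for the example; substitute whatever drive letter or path your
own workstation uses.

```python
import os
from collections import Counter


def survey_files(root):
    """Walk a directory tree and count files by extension (lowercased)."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower() or "(none)"
            counts[ext] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical mount point; substitute the path of your CD-ROM.
    cd_root = "/mnt/cdrom"
    for ext, n in survey_files(cd_root).most_common():
        print(f"{ext:10} {n}")
```

A survey like this shows at a glance whether a disc holds plain
text, spreadsheets, images, or something only the supplied
software can read.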
Certainly much of what you can probably look at directly may
require other programs. For example, by looking directly at the
directories of files on GPO distributed CD-ROM's I've found
groupings of Lotus 1-2-3 spreadsheet files that require
spreadsheet software to access. If you have your Internet
infrastructure in place and working, it can become as simple as
setting up a gopher menu item "1-2-3 Spreadsheets from the 1995
Federal Budget" and then just uploading all of the files from
the CD-ROM to the Internet server account. Though not
recommended, this could be done without even having such a
spreadsheet program yourself.
The key here is to poke around directly and not to rely on the
native accessing software. You may find all kinds of neatly
arrayed files just sitting around. The Joint Electronic Library
CD-ROM mentioned above is an outstanding example. I have used
the same technique to extract GIFs or pictures from USGS CD-ROMs
to create a local exhibit of disaster photographs. Also, do not
keep yourself from understanding how the "native" search software
for a CD-ROM product works, either. The NTDB, and other Dept. of
Commerce products, usually have two available interfaces. By
familiarizing yourself with the software you can select files on
the NTDB to be extracted as separate files (e.g., all Department
of State Background Notes come out as separate files for each
country) or create one large file with all sections appended.
Here is where you could create an ftp (file transfer protocol)
archive with the entire text of a single Army Area Handbook, or
create, as I have done (and as is done at the STAT-USA site that
carries the NTDB on the Internet) Army Area Handbooks with each
chapter a separate file.
Each product is different and subsequent editions may have
updated or changed interface software. The general approach
again is to:
1. Access the CD-ROM directly
2. Familiarize yourself with the supplied interface
software.
Of special note is that for those products that (hopefully) come
out with a certain regularity, such as the NTDB, the
familiarization process will pay off over time as you understand
how to extract information with each new edition, and then carry
over that expertise to subsequent issues.
V. Bells and Whistles
So far I have spoken broadly about how easy it is to just pull
files off a CD-ROM and post them to a gopher or World Wide Web
site (and begged the question of exactly HOW to do that as
beyond the scope of this, or any presentation - how you do
things is so tied into local resources that it is impossible to
say in any generic sense how one should proceed). You can do
some fascinating things with these files with a little
expertise.
First, some files are prohibitively large. Putting the entire
Occupational Outlook Handbook on an Internet server is trivial
since the CD-ROM version has a single file with the full-text on
it. By writing programs that can chop up larger files into
constituent pieces, one can add value to the product. Accessing
five or six paragraphs on the occupation of "library clerk"
using the Internet gopher software is a lot more efficient, and
faster, than using that same gopher software to transfer the
entire Occupational Outlook Handbook. The same thing can be said
for documents such as the North American Free Trade Agreement
(NAFTA), the entire Federal Budget, or the Economic Report of
the President.
Overall, the value one can add is by judiciously chopping up
larger documents for easier access to the constituent pieces. As
dull and dry as this may sound, I consider this a necessary
component to providing universal access. By catering to the
lowest common denominators, whatever the components of those
denominators are, the universal access (hopefully) mandated by
various federal information distribution programs can be met.
One never knows whether an accessor is using a dumb terminal
logged onto a Unix computer account or a top-of-the-line, fully
networked, high end workstation on the Internet. The least
common denominator here requires designing for the slowest
possible transfer speed. It is going to take longer for
someone with a 2400 baud modem to look at a document than
someone with Mosaic on a networked Macintosh.
Another level of value that can be added involves organizing the
pieces of chopped up information. By expending more effort,
complex documents can be arranged in hierarchies for easier
access. Chapters can be listed within which sections can be
arranged within which tables can be arrayed, all at different
levels. There are no shortcuts to doing this, but when such
judicious arrangement is done, we are again acting like
librarians more than technicians.
No shortcuts. Librarians, of all professionals, have little
problem understanding the importance of thankless tasks. I like
to point out a difference between technicians and librarians.
If you ask a technician to do something onerous and
time-consuming, you are likely to be told "it cannot be done"
(and what they mean is "I do not want to spend the time doing
this onerous and time-consuming task"). As a counter example,
when our government documents librarian was asked about shelf
shifting and rearranging our growing collection, the answer was
twofold:
1. An analysis that the job would take six months of hard
work.
2. Six months of hard work.
Similarly, many of the best things one can do in mining
electronic documents for both local and universal collections
are time-consuming, onerous, thankless, invisible, and
absolutely critical for providing useable and useful online
resources. It is fine if you can find files on CD-ROMs or on the
Internet chopped up into nice packages. Nevertheless, if you
cannot, roll up your sleeves and start hacking.
The chopping up need not be difficult. Most word processing
programs can open large documents and allow cutting and
pasting. I have written some programs in BASIC that do the
chopping automatically. The level of programming skill is that
required by the most basic courses of even twenty years ago. In
the case of the Occupational Outlook Handbook, all sections of
the one large file were flagged with the unique characters of two
slashes. By writing a program that chopped up the
larger file on every occurrence of "//" I was able to quickly
produce files consisting of separate occupations to be mounted
on the Internet.
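
The chopping step can be sketched in a few lines. The programs
described above were written in BASIC; what follows is a Python
sketch of the same idea, not the original code. The sample text,
the function names chop and piece_filename, and the "ooh" file
name prefix are all made up for illustration.

```python
def chop(text, marker="//"):
    """Split text on every occurrence of marker, dropping empty pieces."""
    pieces = [p.strip() for p in text.split(marker)]
    return [p for p in pieces if p]


def piece_filename(index, prefix="ooh"):
    """Build a simple numbered file name, e.g. ooh001.txt."""
    return f"{prefix}{index:03d}.txt"


# Made-up sample standing in for one large full-text file.
sample = "//Library clerks\nDuties...\n//Librarians\nDuties...\n"

for i, piece in enumerate(chop(sample), start=1):
    title = piece.splitlines()[0]  # first line of each piece
    print(piece_filename(i), "->", title)
```

One caution: this only works when the marker never appears inside
the text itself, so it is worth searching the file for stray
occurrences before trusting the output.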
Note again that the driving force for doing this was based on a
first-hand knowledge at the Reference Desk of the utility of
this specific work, and how people use it - it is not read cover
to cover but is accessed by specific profession.
Another bell and whistle possible when you create and provide
local access to government document collections is what I call
commercials on the Internet. I've long proposed (to the snorts
of disdain of my colleagues) that we put commercials on our
online catalog, or OPAC. When I chop up larger files for local
collections, I do just that when I make sure that each piece has
a bit of advertisement for the University of Missouri-St. Louis.
With the exception of some of the first documents I placed on
our gopher, all other electronic documents placed on our gopher
and World Wide Web servers have, and will have, innocuous little
tags saying something like "access to this chapter of the China
Army Area Handbook is brought to you courtesy of the libraries
of the University of Missouri-St. Louis." Additionally,
information as to the source of the electronic document (e.g.,
the NTDB for a specific month) is also included. Note these two
important functions:
1. Providing provenance information of the document.
2. Advertising the expertise of the university.
Both of these things are extremely relevant. The issue of
provenance comes into play when patrons wish to find similar
items at a local depository. How many times have you had to deal
with a patron carrying a photocopy of a single page of a
government document asking "where is the rest of this item?" It
is more than a courtesy to include the source of a document in a
piece of a larger electronic file - it is a necessity. When a
patron retrieving a chapter of an Army Area Handbook from a
UM-St. Louis Library Internet node brings the printout to you,
there should be no problem directing them to your local
holdings. I strongly suggest that this is another dull, dry, and
thankless area that is crucial to proper maintenance of local
electronic collections.
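
Attaching such a tag to every piece is easy to automate. Here is
a minimal Python sketch, assuming each piece is already a string
of text; the function name tag_piece and the exact layout of the
note are assumptions for illustration, and the source line is
deliberately left generic.

```python
def tag_piece(text, title, source, institution):
    """Append a short provenance and courtesy note to one document piece."""
    note = (
        f"Access to {title} is brought to you courtesy of {institution}.\n"
        f"Source: {source}\n"
    )
    return text.rstrip("\n") + "\n\n" + note


tagged = tag_piece(
    "Chapter text goes here.\n",
    title="this chapter of the China Army Area Handbook",
    source="National Trade Data Bank CD-ROM (edition not specified here)",
    institution="the libraries of the University of Missouri-St. Louis",
)
print(tagged)
```

Putting the note at the end of each piece means it travels with
every printout, whichever piece the patron happens to retrieve.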
Advertising one's expertise, I hold, is also relevant and not an
ego trip. In an environment where dependency on public support
is crucial, it is important that we toot our own horns in an
attempt to keep ourselves visible to our local, national, and
even international patrons. When America Online subscribers
consistently run across "free" depositories of useful
information, it is in our mutual self-interest to let these
voters on tax issues understand from whence this information is
coming. Without tooting our horn these prototypical America
Online users are likely to erroneously assume that it is their
network provider (America Online) who is giving them this
information. We did it. We do it. We will do it. And if the
citizenry benefits from our services it behooves us to let them
know to whom to give credit. This is less an ego issue than a
survival issue. For an honest public institution such as mine, this
is also an opportunity to demonstrate value returned for tax
dollars invested. If we can all proceed in this manner, our
modest and invisible profession can only benefit.
VI. Risks
Erroneous attribution
A recurring theme of this talk is the connection between local
collections and universal access. In practice, what this
means is that local activities can be criticized by anyone on
the Internet. I have received messages from Norway explaining to
me that their country voted NOT to join the European Economic
Union (EEU); Austrians have told me about abbreviations in the
CIA World Factbook which are in error; and Pakistanis have
corrected me on the transcription into English of the name of
their currency. By providing access to information you will be
setting yourself up as appearing to be the publisher of that
information.
Personal Attacks
As a local/universal provider of access to government
information, erroneous attribution can lead to personal
attacks. I was recently called a Jew killer and Benedict Arnold
for my efforts of mining government documents for our local
collection. My crime? I posted "as is" a copy of the Yugoslav
Army Area Handbook from the National Trade Data Bank. The irate
virtual patron decided that my publication of this work was a
racist slap in his face. As a good librarian I calmly responded
to his complaint and apologized and explained the situation. I
said that, at his request, I would forward to my colleagues on
the internet a proposal to remove all magazine articles,
atlases, globes, and books with the word "Yugoslavia" in them. I
heard nothing more.
Responsibility
By providing access to this information you should be setting
yourself up as a consistent and responsible provider of
information. Another area where we can add value to our local
electronic collections is by maintaining our documents,
continuing to update them, and making sure that access is
robust. The risk is that irresponsibility can be seen
immediately by all users of your information.
VII. Results
Relative fame and no fortune are the results. The best we can
hope for is enough recognition to continue support for us and
our institutions so we can continue to provide resources and
services to our constituencies. One of the most interesting
things about setting up local electronic collections on Internet
servers is the ability to monitor use. Gopher and World Wide Web
servers typically have the capacity to create user logs. These
files contain the date, time, accessor, and files accessed by
anyone utilizing the local server's resources. At UM-St. Louis
we have, for about two years, also provided information other than
that mined primarily from GPO distributed CD-ROMs. Our gopher logs,
however, indicate that the items most heavily used are from the
Government Documents section of our virtual collection.
Specifically, the Army Area Handbooks are the single most
heavily used items.
Due to technical problems, we are only able to track what I call
"accesses" or "transactions." Whenever a user presses a key to
move to another level of the gopher or to retrieve a document, a
line of text is written to the day's gopher log indicating,
again, date, time, user, and file or path accessed. Accesses of
government documents are up from a hundred transactions per month
two years ago, to over one hundred THOUSAND transactions monthly
today. In abeyance is my desire to demonstrate WHICH documents
are being accessed. Efforts to cease publication of things like
the Occupational Outlook Handbook or the Industrial Outlook
might be combatted by hard statistics showing continued and
heavy use of these documents.
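
Producing such statistics from a log is not hard once the log
can be read line by line. The following Python sketch counts
transactions per file or path. The tab-separated layout (date,
time, user, path), the sample entries, and the function name
count_accesses are all hypothetical; real gopher and World Wide
Web logs vary by server, so the field splitting would need to be
adapted.

```python
from collections import Counter


def count_accesses(log_lines):
    """Count transactions per path from tab-separated log lines.

    Assumed (hypothetical) line layout: date, time, user, path.
    """
    counts = Counter()
    for line in log_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        _date, _time, _user, path = fields
        counts[path] += 1
    return counts


# Made-up log entries for illustration.
sample_log = [
    "1995-04-01\t09:15:02\tuser1.example.com\t/govdocs/aah/china/ch1\n",
    "1995-04-01\t09:16:40\tuser2.example.com\t/govdocs/ooh/library-clerk\n",
    "1995-04-02\t14:03:11\tuser1.example.com\t/govdocs/aah/china/ch1\n",
]

for path, n in count_accesses(sample_log).most_common():
    print(n, path)
```

Grouping the counts by the leading part of the path would give
totals per document rather than per chapter or section.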
VIII. Conclusions
My conclusions are handwritten at the last minute because, for
the life of me, I couldn't come up with any way to tie
everything together. I posit that the reason for this is that
this is an open-ended, ongoing, amorphous and ambiguous process
(a salient feature of these new technologies). There is no
conclusion to these activities; products, changing formats of
information (e.g., the increasing use of Adobe Acrobat and the
PDF file format), and technologies in general are all in flux.
My conclusion, and advice, is to invert the popular
environmentalist's aphorism of
"think globally and act locally,"
to the new internet maxim of
"think locally and act globally."
WWW Home Page URL: http://www.umsl.edu/~muns/