Updated August 24, 2008
San Jose State University
School of Library and Information Science
LIBR202, Information Retrieval
- dr. joanne twining
Assignment 3 - SIMPLE Datastructure with Abstracts, Indexing
Available Points: 50
In this assignment you will use the inMagic database software we learned during Assignment 1 to individually design and create a database for at least fifteen of the articles listed in the course supplemental readings. You might want to start this assignment early in the semester by retrieving and reading the supplemental readings, since you will also be expected to discuss them throughout the semester, in relation to our discussions of the text, in the "weekly readings" forum of our Blackboard.
For this assignment your User is YOU!
YOU are the designer, as well as the indexer, as well as the user for this datastructure. Outside of the requirements that are outlined below, you have complete creative freedom for this assignment. Think of this assignment as a way you can create a system now that will help you keep all the readings for this class, and possibly for all the other classes you'll be taking as you continue your MLIS studies, organized and easily retrieved later. You will want to design your datastructure in a way that serves YOU. Think ahead several classes... how useful might it be to you THEN to be able to do a quick search of this datastructure to retrieve that ONE particular article (you may not remember the exact title or author) that you read today, and then be able to quickly add its citation to a paper you're writing?
Your datastructure for this assignment will contain the usual and necessary bibliographic fields, such as title, author, publication, etc., in whatever way you want to design them. It will also contain a precoordinate index field, a postcoordinate index field, and an abstract field.
You will deposit a single .zip file named yourlastname_yourfirstname.zip into the Course Blackboard discussion forum named "Assignment 3: Indexing" and will use the subject line: yourlastname_yourfirstname for the post. Your zipped document will include:
a user guide, including any rules, authorized terms, or notes to aid in indexing and use later; filename: yourlastname_yourfirstname_userguide.rtf (rich text format)
your zipped inMagic datastructure containing one record for each of at least fifteen articles from the supplemental readings; filename: yourlastname_yourfirstname_index.zip
your evaluation document as described below; filename: yourlastname_yourfirstname_evaluation.rtf (rich text format)
Background:
Along with the standard bibliographic fields, you will build two indexes for this datastructure. Each index will be a separate field in your datastructure.
The first index field will be a precoordinated index, and the field will be term indexed (this is done using inMagic's field design functions).
The second index field will be a postcoordinated index and will be word AND term indexed (again, using inMagic's field design functions).
Note: You would normally not find both types of index in a single datastructure, but we will create them here, since we want to compare and contrast them in order to understand how they function, and their value.
A precoordinated index is an index (a controlled vocabulary, or an authorized set of values allowed in a field) in which otherwise distinct concepts are brought together ahead of time (in the index), hence "pre," to form an entirely new, combined concept. "Pre" means "before" the query. For instance, the preco index term "history - United States - colonial" combines three distinct concepts or words: history, United States, and colonial, to form a single new "term" so that everything indexed using this precoordinated "term" is about and only about the colonial history of the United States. In an index using this precoordinated scheme we might also find index "terms" like:
"history - Mexico - revolutionary"
"culture - United States - 1950s"
"history - United States - revolutionary"
"culture - Brazil - 1930's"
A precoordinated index treats each of these "new - combined - strings of - words" as a single search term... a brand new concept. When you tell inMagic to "term index" the field, it will treat each string of concepts like a single query item. The precoordinated index's controlled vocabulary is usually published along with its datastructure, either in the form of an accompanying book, or perhaps as a drop-down menu on the search screen, since precoordinated index terms can be a challenge to remember. Precoordination is particularly helpful for collections that are large or highly focused or specialized, and where queries using otherwise uncoordinated words would provide either too many, or too few, retrieved records.
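To see what "term indexed" means in practice, here is a minimal Python sketch. It is not how inMagic works internally; the record numbers and the helper names (build_term_index, search_term) are made up for illustration. The point it shows is that the whole precoordinated string is the lookup key, so a query on just one of its parts finds nothing.

```python
from collections import defaultdict

# Hypothetical records: record id -> precoordinated terms assigned to it.
records = {
    1: ["history - United States - colonial"],
    2: ["history - United States - revolutionary"],
    3: ["culture - United States - 1950s",
        "history - United States - colonial"],
}


def build_term_index(records):
    """Map each WHOLE term string to the set of record ids that carry it."""
    index = defaultdict(set)
    for record_id, terms in records.items():
        for term in terms:
            index[term].add(record_id)
    return index


def search_term(index, term):
    """Return the ids of records indexed under exactly this term string."""
    return sorted(index.get(term, set()))


term_index = build_term_index(records)

# The full string matches as one unit.
print(search_term(term_index, "history - United States - colonial"))  # [1, 3]

# A fragment of the string is not a term in this index, so nothing matches.
print(search_term(term_index, "United States"))  # []
```

This is also why the controlled vocabulary needs to be published for the user: the query has to match the authorized string exactly.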
A postcoordinated index is an index in which words (or word phrases) are put together after the point of query, by the retrieval engine. The user might string together multiple single words or word phrases in the query box, and the retrieval engine does the coordinating of these words during the retrieval process. A postcoordinated index looks like a simple list, like this:
history
culture
Mexico
United States
revolutionary
1950's
colonial
with each of the above words or word phrases treated as a single index term. When a query is executed on a postco index, the retrieval engine looks for the first word in the query and creates a set of retrieved documents, then it searches that set of documents for the second word (eliminating all documents that don't include both), and so on until the query is completed. A postco index query consumes far more system resources than a preco index query does, but it allows the user to combine search words in new and creative ways.
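The narrowing process described above can be sketched in a few lines of Python. This is an illustration under assumptions, not inMagic's actual retrieval code: the records are invented, and the function names (build_postco_index, search_postco) are hypothetical. Each query term narrows the result set left by the previous one, which is the same as intersecting the sets of records for each term.

```python
# Hypothetical records: record id -> postcoordinate terms assigned to it.
records = {
    1: {"history", "United States", "colonial"},
    2: {"history", "Mexico", "revolutionary"},
    3: {"culture", "United States", "colonial"},
    4: {"history", "United States", "revolutionary"},
}


def build_postco_index(records):
    """Map each single term to the set of record ids that carry it."""
    index = {}
    for record_id, terms in records.items():
        for term in terms:
            index.setdefault(term, set()).add(record_id)
    return index


def search_postco(index, query_terms):
    """Coordinate the terms at query time by narrowing the set term by term."""
    if not query_terms:
        return []
    # Start with every record that carries the first term...
    results = set(index.get(query_terms[0], set()))
    # ...then drop any record that lacks each following term.
    for term in query_terms[1:]:
        results &= index.get(term, set())
    return sorted(results)


postco_index = build_postco_index(records)
print(search_postco(postco_index, ["history", "United States"]))        # [1, 4]
print(search_postco(postco_index, ["revolutionary", "United States"]))  # [4]
```

Notice that the user, not the indexer, decided to combine "revolutionary" with "United States"; no one had to foresee that combination when the records were indexed.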
Procedures:
Select at least 15 articles from the Supplemental Readings.
Compile a candidate term pool --
Go through the articles, concentrating mostly on the abstracts, titles, introductions, and conclusions, and compile a list of the general concepts that you think should be represented in a controlled vocabulary that represents (and allows aggregating and differentiating) your articles. This is your candidate term pool. It will evolve. You may take terms straight out of the articles in some cases; you will substitute other terms for the concept in other cases. Your terms will probably continue to evolve throughout your work.
Create your lists of authorized terms --
Go through your candidate term pool and start to turn the terms into a vocabulary. It probably will seem most natural to start with a post-coordinate vocabulary, but some people prefer to create the pre-coordinate vocabulary first. Either way is fine.
Remember that you need to be able to aggregate with this language -- the whole purpose of the vocabulary is to bring together all articles on the same subject. So if you have unique terms for each article, it won’t be a very useful vocabulary. You need terms that will work to pull articles on the same subject together (but you’re working with a small collection, so especially if you’ve read widely, you may have some terms that apply to only one article now -- but would apply to others on the list, or others you might add in the future).
You also have to be able to use the language to discriminate between articles -- if you apply the same three terms to every article, that won’t be useful either. So there should be some terms that reflect concepts that are not shared by all the articles.
Keep in mind that with a post-coordinate vocabulary, a lot of the discrimination occurs in the combination of terms: you may have 3 articles on search engines, but one is about design, one is about evaluation, and one is about current search engines available on the Web.
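One way to check whether a vocabulary both aggregates and discriminates is to count how many articles each term is assigned to. The Python sketch below is only an illustration; the article names, the term assignments, and the helper names (term_usage, review_terms) are hypothetical. It lists terms assigned to every article (which do not discriminate) and terms assigned to only one article (which do not aggregate yet).

```python
from collections import Counter

# Hypothetical indexing: article -> postcoordinate terms assigned to it.
assignments = {
    "article_a": ["search engines", "design"],
    "article_b": ["search engines", "evaluation"],
    "article_c": ["search engines", "Web"],
    "article_d": ["indexing", "evaluation"],
}


def term_usage(assignments):
    """Count the number of articles each term is assigned to."""
    counts = Counter()
    for terms in assignments.values():
        counts.update(set(terms))  # set() so a repeated term counts once
    return counts


def review_terms(assignments):
    """Return (terms on every article, terms on exactly one article)."""
    counts = term_usage(assignments)
    total = len(assignments)
    on_every = sorted(t for t, n in counts.items() if n == total)
    on_one = sorted(t for t, n in counts.items() if n == 1)
    return on_every, on_one


on_every, on_one = review_terms(assignments)
print(on_every)  # []
print(on_one)    # ['Web', 'design', 'indexing']
```

A term on the "only one article" list is not automatically a problem in a small collection; it is simply a term worth a second look.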
The precoordinate vocabulary should cover the same concepts as the postco, though the terms will obviously be different. It should serve the same functions -- you need to be able to use a term for more than one article, but not all of them. Aggregation with the preco vocabulary will probably not be as good as for the postco vocabulary; you will likely have more terms in your preco vocabulary that are assigned to only one article than you will have in your postco vocabulary.
You will need to be consistent in creating subdivisions for your preco subjects, and will probably have to create some rules to stay consistent. This is where “syntactical rules” and “standards” come in.
A syntax rule is the sort of thing we looked at in the handouts about LCSH (the Library of Congress Subject Headings, a precoordinated index) -- what can be a main heading, what can be a subdivision, etc. Take the form subdivision, for instance: in LCSH, if it is used it must be the last part of any subject heading. In other words, if the work is a bibliography, BIBLIOGRAPHY is the last in any string of subject headings and subdivisions (CANADA--BIBLIOGRAPHY, CANADA--HISTORY--WARS--GENERALS--BIBLIOGRAPHY, VETERINARY MEDICINE--DICTIONARIES, etc.). So any rules that govern how you put your main and sub-headings together are syntax rules.
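A syntax rule like "the form subdivision comes last" is concrete enough to check mechanically. The Python sketch below is one possible way to do that, assuming headings are written with "--" between parts; the short list of form subdivisions is taken only from the examples above and is not a complete LCSH list, and the function names are hypothetical.

```python
# Form subdivisions taken from the examples in the text (illustrative only).
FORM_SUBDIVISIONS = {"BIBLIOGRAPHY", "DICTIONARIES"}


def split_heading(heading):
    """Split a heading such as 'CANADA--BIBLIOGRAPHY' into its parts."""
    return [part.strip() for part in heading.split("--")]


def form_subdivision_is_last(heading):
    """True unless a form subdivision appears before the final position."""
    parts = split_heading(heading)
    return not any(part in FORM_SUBDIVISIONS for part in parts[:-1])


print(form_subdivision_is_last("CANADA--BIBLIOGRAPHY"))           # True
print(form_subdivision_is_last("VETERINARY MEDICINE--DICTIONARIES"))  # True
print(form_subdivision_is_last("CANADA--BIBLIOGRAPHY--HISTORY"))  # False
```

You will not need code to enforce your own rules on fifteen records, but writing a rule this precisely is a good test of whether it belongs in your user guide.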
Standards are pretty much the same thing -- the rules or organizational principles that you use in putting together your languages. They govern things like whether you say INTERFACES--USABILITY or USABILITY--INTERFACES. It matters because if you put INTERFACES as the subject heading, you can group other works related to interfaces under similar subject headings (INTERFACES--DESIGN and INTERFACES--EVALUATION, for instance), or if you use USABILITY as your subject heading, that allows you to group other works related to usability together (USABILITY--IR SYSTEMS, USABILITY--OPACS, USABILITY--INTERFACES, USABILITY--IR SYSTEMS--EVALUATION, USABILITY--OPACS--EVALUATION, etc.).
These kinds of rules or standards keep you from having situations where you have INTERFACES--USABILITY and USABILITY--OPACS. You need to be consistent, and the syntax rules are the consistencies you create.
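The inconsistency described above (a concept used as a main heading in one place and as a subdivision in another) can be spotted with a simple check. This Python sketch is one possible approach, assuming "--" separates the parts of a heading; the function name mixed_role_concepts is hypothetical. It only flags concepts for review. Whether a flagged concept is really a problem depends on the rules you have set for your own vocabulary.

```python
def mixed_role_concepts(headings):
    """Return concepts used both as a main heading and as a subdivision."""
    main_headings, subdivisions = set(), set()
    for heading in headings:
        parts = [part.strip() for part in heading.split("--")]
        main_headings.add(parts[0])    # first part is the main heading
        subdivisions.update(parts[1:])  # everything after it is a subdivision
    return sorted(main_headings & subdivisions)


# The inconsistent pair from the text: USABILITY plays both roles.
print(mixed_role_concepts(["INTERFACES--USABILITY", "USABILITY--OPACS"]))
# ['USABILITY']

# A consistent list built on USABILITY as the main heading.
print(mixed_role_concepts([
    "USABILITY--INTERFACES",
    "USABILITY--OPACS",
    "USABILITY--OPACS--EVALUATION",
]))
# []
```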
Index your articles --
Go through your articles and index each one using your vocabulary. You will probably find you have to tweak the vocabulary a little bit at this point -- you may find you need to add terms, or that some terms aren’t clear, or that some overlap. So make those adjustments and go ahead and index the articles.
For the postcoordinate index, you will probably assign 2 to 6 terms per record; on rare occasions an article may need only 1 term or more than 6. You will probably assign fewer precoordinate terms when you get to that stage -- maybe only 2 or 3, and for a few articles only 1.
Create the records for your database --
Now that you’ve made all the decisions for your database, write the user guide -- an introduction for someone wanting to search your database. Consider: What do people need to know about your vocabularies to be able to search? Do users need to know the syntax rules you’ve worked out? What do they need to know about the kinds of decisions you’ve made concerning your vocabulary? When should they search the natural language fields and when should they search the controlled vocabularies? (You may not know this for sure till after you’ve done your evaluation, but you should have some ideas.) It is useful to include your subject term lists in the user guide, perhaps as a printout of your validation lists, or as a printout from the QBE screen using F3.
Evaluate your database --
First you have to decide how you will evaluate it -- what will your criteria be? How will you determine how well your vocabularies work, whether one works better than the other, whether they work better or worse than natural language fields for retrieval? Refer back to the texts, your other reading, the material we’ve covered in class -- you know enough now to know what you want from a database.
Establish your criteria, then figure out how you will know if your database meets your criteria. What tests or searches can you do to find out?
Run these test searches and keep careful track of your tests and results.
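If you choose precision and recall as criteria (they are only one possible choice, and the assignment does not require them), the arithmetic for each test search is simple. The Python sketch below assumes you have written down, for each test query, which records were retrieved and which records you judge relevant; the record numbers and the function name precision_recall are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one test search.

    precision = relevant records retrieved / all records retrieved
    recall    = relevant records retrieved / all relevant records
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical test search: 4 records came back, 3 records are relevant,
# and 2 of the relevant ones were among those retrieved.
precision, recall = precision_recall(retrieved=[1, 3, 4, 7], relevant=[1, 4, 9])
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Running the same query against the preco field, the postco field, and a natural language field, and recording the two numbers each time, gives you a side-by-side comparison to discuss in your evaluation document.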
Look at your results and decide which fields give the best subject access and why. Is one field superior for one kind of search, but another field for other kinds? Do some not give good results? Analyze the strengths and weaknesses of your fields, and your indices, according to your tests and your own insights.
Share your findings and discoveries in your evaluation document.