Updated August 24, 2008

San Jose State University School of Library and Information Science
LIBR202, Information Retrieval - dr. joanne twining


Assignment 3  - SIMPLE Datastructure with Abstracts, Indexing  

Available Points: 50

In this assignment you will use the Inmagic database software we learned during Assignment 1 to individually design and create a database for at least fifteen of the articles listed in the course supplemental readings. You might want to start this assignment early in the semester by retrieving and reading the supplemental readings, since you will also be expected to discuss them throughout the semester, in relation to our discussions of the text, in the "weekly readings" forum of our Blackboard.

For this assignment your User is YOU!

YOU are the designer, the indexer, and the user for this datastructure. Outside of the requirements outlined below, you have complete creative freedom for this assignment. Think of this assignment as a way to create a system now that will help you keep all the readings for this class, and possibly for all the other classes you'll be taking as you continue your MLIS studies, organized and easy to retrieve later. You will want to design your datastructure in a way that serves YOU. Think ahead several classes...how useful might it be to you THEN to be able to do a quick search of this datastructure to retrieve that ONE particular article you read today (you may not remember the exact title or author), and then quickly add its citation to a paper you're writing?

Your datastructure for this assignment will contain the usual and necessary bibliographic fields, such as title, author, publication, etc., in whatever way you want to design them. It will also contain a precoordinate index field, a postcoordinate index field, and an abstract field.

You will deposit a single .zip file named yourlastname_yourfirstname.zip into the Course Blackboard discussion forum named "Assignment 3: Indexing" and will use the subject line: yourlastname_yourfirstname for the post. Your zipped document will include:

Background:

Along with the standard bibliographic fields, you will build two indexes for this datastructure. Each index will be a separate field in your datastructure.

The first index field will be a precoordinated index, and the field will be term indexed (this is done using Inmagic's field design functions).
The second index field will be a postcoordinated index and will be word AND term indexed (again, using Inmagic's field design functions).
Note: you would not normally find both types of index in a single datastructure, but we will create both here, since we want to compare and contrast them in order to understand how they function and what value each offers.

A precoordinated index is an index (a controlled vocabulary, or an authorized set of values allowed in a field) in which otherwise distinct concepts are brought together ahead of time (in the index), hence "pre," to form an entirely new, combined concept. "Pre" means "before" the query. For instance, the preco index term "history - United States - colonial" combines three distinct concepts or words: history, United States, and colonial, to form a single new "term" so that everything indexed using this precoordinated "term" is about and only about the colonial history of the United States. In an index using this precoordinated scheme we might also find index "terms" like:

"history - Mexico - revolutionary"
"culture - United States - 1950s"
"history - United States - revolutionary"
"culture - Brazil - 1930's"

A precoordinated index treats each of these "new - combined - strings of - words" as a single search term: a brand new concept. When you tell Inmagic to "term index" the field, it treats each string of concepts as a single query item. The precoordinated index's controlled vocabulary is usually published along with its datastructure, either in the form of an accompanying book or perhaps as a drop-down menu on the search screen, since precoordinated index terms can be a challenge to remember. Precoordination is particularly helpful for collections that are large, highly focused, or specialized, where queries using otherwise uncoordinated words would retrieve either too many or too few records.
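To make the difference between term indexing and word indexing concrete, here is a minimal Python sketch. It is not how Inmagic works internally; the function names (term_index, word_index) and the three sample records are invented for illustration only. It shows that a term index files the whole string as one entry, while a word index files each word separately.

```python
import re

def term_index(records, field):
    """Term indexing: each whole field value becomes ONE index entry."""
    index = {}
    for rec_id, rec in records.items():
        for value in rec[field]:
            index.setdefault(value.lower(), set()).add(rec_id)
    return index

def word_index(records, field):
    """Word indexing: every individual word becomes its own index entry."""
    index = {}
    for rec_id, rec in records.items():
        for value in rec[field]:
            for word in re.findall(r"[a-z0-9']+", value.lower()):
                index.setdefault(word, set()).add(rec_id)
    return index

# Hypothetical records using the preco terms from the examples above.
records = {
    1: {"preco": ["history - United States - colonial"]},
    2: {"preco": ["history - Mexico - revolutionary"]},
    3: {"preco": ["culture - United States - 1950s"]},
}

terms = term_index(records, "preco")
words = word_index(records, "preco")

# The full string is a single searchable item in the term index...
print(terms.get("history - united states - colonial"))
# ...but the lone word "history" is not an entry there.
print(terms.get("history"))
# In the word index, "history" finds every record containing that word.
print(sorted(words["history"]))
```

In a term-indexed field, a search on "history" alone finds nothing unless "history" is itself an authorized term, which is why the published vocabulary matters so much for a preco index.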

A postcoordinated index is an index in which words (or word phrases) are put together after the point of query, by the retrieval engine. The user might string together multiple single words or word phrases in the query box, and the retrieval engine coordinates these words during the retrieval process. A postcoordinated index looks like a simple list, like this:

history
culture
Mexico
United States
revolutionary
1950s
colonial

with each of the above words or word phrases treated as a single index term. When a query is executed on a postco index, the retrieval engine looks for the first word in the query and creates a set of retrieved documents; it then searches that set for the second word (eliminating all documents that don't include both), and so on until the query is completed. A postco index query consumes much more system resources than a preco index query does, but it allows the user to combine search words in new and creative ways.
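The narrowing process just described can be sketched in a few lines of Python. This is an illustration under assumptions, not Inmagic's actual retrieval engine: the function names (build_postco_index, postco_search) and the four sample records are hypothetical. Each query term narrows the set produced by the term before it.

```python
def build_postco_index(records):
    """Map each postco term to the set of record ids it was assigned to."""
    index = {}
    for rec_id, terms in records.items():
        for term in terms:
            index.setdefault(term.lower(), set()).add(rec_id)
    return index

def postco_search(index, query_terms):
    """Start with the records for the first term, then keep only the
    records that also carry each following term."""
    if not query_terms:
        return set()
    # Copy the first set so narrowing does not alter the index itself.
    result = set(index.get(query_terms[0].lower(), set()))
    for term in query_terms[1:]:
        result &= index.get(term.lower(), set())
    return result

# Hypothetical records indexed with the postco terms listed above.
records = {
    1: ["history", "United States", "colonial"],
    2: ["history", "Mexico", "revolutionary"],
    3: ["culture", "United States", "1950s"],
    4: ["history", "United States", "revolutionary"],
}
index = build_postco_index(records)

us_history = postco_search(index, ["history", "United States"])
us_revolution = postco_search(index, ["revolutionary", "United States"])
mexican_culture = postco_search(index, ["culture", "Mexico"])
print(sorted(us_history), sorted(us_revolution), sorted(mexican_culture))
```

Notice that the user can combine "revolutionary" with "United States" even though no indexer ever joined those two words ahead of time. That freedom is the postco advantage; the extra set work at query time is its cost.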

Procedures:

Select at least 15 articles from the Supplemental Readings. 

Compile a candidate term pool -- 

Go through the articles, concentrating mostly on the abstracts, titles, introductions, and conclusions, and compile a list of the general concepts that you think should be represented in a controlled vocabulary that represents (and allows aggregating and differentiating) your articles.  This is your candidate term pool.   It will evolve. You may take terms straight out of the articles in some cases; you will substitute other terms for the concept in other cases.  Your terms will probably continue to evolve throughout your work.

Create your lists of authorized terms --

Go through your candidate term pool and start to turn the terms into a vocabulary.  It probably will seem most natural to start with a post-coordinate vocabulary, but some people prefer to create the pre-coordinate vocabulary first.  Either way is fine.

Sort your candidate terms into alphabetical order.   Remove any duplicates and decide on the preferred form (memory or computer memory?  Internet or World Wide Web or both?)   Which terms have similar meanings and should be collapsed?  Where can you merge singular and plural forms of the words?  Do two terms have overlapping meanings?  -- should one of them be broadened and the other eliminated?
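If you like to work with a script alongside your notes, the clean-up steps above can be sketched as follows. The PREFERRED table stands in for your own editorial decisions; every entry in it, and the sample pool, is a made-up example and not a recommended vocabulary.

```python
# Hypothetical editorial decisions: variant form -> preferred form.
PREFERRED = {
    "computer memory": "memory",
    "world wide web": "internet",
    "search engines": "search engine",   # merge plural into singular
    "interfaces": "interface",
}

def normalize_pool(candidates):
    """Lower-case, tidy spacing, map variants to the preferred form,
    drop duplicates, and return the terms in alphabetical order."""
    cleaned = set()
    for term in candidates:
        key = " ".join(term.lower().split())
        cleaned.add(PREFERRED.get(key, key))
    return sorted(cleaned)

pool = [
    "Search engines", "memory", "World Wide Web", "search engine",
    "Computer memory", "Internet", "interfaces", "Usability",
]
vocabulary = normalize_pool(pool)
print(vocabulary)
```

The script only applies decisions you have already made. Deciding whether "Internet" and "World Wide Web" really mean the same thing for your collection is still your job as the indexer.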

Remember that you need to be able to aggregate with this language -- the whole purpose of the vocabulary is to bring together all articles on the same subject.  So if you have unique terms for each article, it won’t be a very useful vocabulary.  You need terms that will work to pull articles on the same subject together (but you’re working with a small collection so especially if you’ve read widely, you may have some terms that apply to only one article now -- but would apply to others on the list, or others you might add in the future).   

You also have to be able to use the language to discriminate between articles -- if you apply the same three terms to every article, that won’t be useful either.  So there should be some terms that reflect concepts that are not shared by all the articles.  

Keep in mind that with a post-coordinate vocabulary, a lot of the discrimination occurs in the combination of terms:  you may have 3 articles on search engines, but one is about design, one is about evaluation, and one is about current search engines available on the Web.  

The precoordinate vocabulary should cover the same concepts as the postco, though the terms will obviously be different.  It should serve the same functions – you need to be able to use a term for more than one article, but not all of them.  Aggregation with the preco vocabulary will probably not be as good as for the postco vocabulary; you will likely have more terms in your preco vocabulary that are assigned to only one article than you will have for your postco vocabulary.
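One rough way to check aggregation and discrimination is to count how many articles each term has been assigned to. The sketch below is only an illustration: the function name term_usage_report and the three sample articles (borrowed from the search engine example above) are assumptions. It flags terms used on just one article and terms used on every article.

```python
from collections import Counter

def term_usage_report(assignments):
    """Count how many articles carry each term and flag the extremes."""
    total = len(assignments)
    counts = Counter(
        term for terms in assignments.values() for term in set(terms)
    )
    return {
        # Terms on a single article do not aggregate anything (yet).
        "used_once": sorted(t for t, n in counts.items() if n == 1),
        # Terms on every article cannot tell the articles apart.
        "used_everywhere": sorted(t for t, n in counts.items() if n == total),
    }

# Hypothetical postco assignments for three articles.
assignments = {
    "article_a": ["search engines", "design", "information retrieval"],
    "article_b": ["search engines", "evaluation", "information retrieval"],
    "article_c": ["search engines", "web", "information retrieval"],
}
report = term_usage_report(assignments)
print(report)
```

A flagged term is not automatically a mistake. In a small collection a term used once may earn its place as the collection grows, and in postco indexing a term that appears everywhere can still discriminate once it is combined with others.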

You will need to be consistent in creating subdivisions for your preco subjects, and will probably have to create some rules to keep consistent.  This is where “syntactical rules” and “standards” come in.

A syntax rule is the sort of thing we looked at in the handouts about LCSH (the Library of Congress Subject Headings, a precoordinated index): what can be a main heading, what can be a subdivision, etc. Take the form subdivision, for instance: in LCSH, if it is used, it must be the last part of any subject heading. In other words, if the work is a bibliography, BIBLIOGRAPHY comes last in any string of subject headings and subdivisions (CANADA--BIBLIOGRAPHY, CANADA--HISTORY--WARS--GENERALS--BIBLIOGRAPHY, VETERINARY MEDICINE--DICTIONARIES, etc.).  So any rules that govern how you put your main and sub-headings together are syntax rules.

Standards are pretty much the same thing: the rules or organizational principles that you use in putting together your languages.  They govern things like whether you say INTERFACES--USABILITY or USABILITY--INTERFACES.  It matters because if you use INTERFACES as the subject heading, you can group other works related to interfaces under similar subject headings (INTERFACES--DESIGN and INTERFACES--EVALUATION, for instance), whereas if you use USABILITY as your subject heading, you can group other works related to usability together (USABILITY--IR SYSTEMS, USABILITY--OPACS, USABILITY--INTERFACES, USABILITY--IR SYSTEMS--EVALUATION, USABILITY--OPACS--EVALUATION, etc.).

These kinds of rules or standards keep you from having situations where you have INTERFACES--USABILITY and USABILITY--OPACS.  You need to be consistent, and the syntax rules are the consistencies you create.  
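Once you have written your syntax rules down, you can check your preco list against them mechanically. The sketch below assumes two example rules: a form subdivision must come last, and a concept should not appear as a main heading in one place and a subdivision in another. Both rules, the FORM_SUBDIVISIONS set, and the function name check_headings are illustrations only; your own rules may well differ.

```python
# Assumed set of form subdivisions, taken from the LCSH examples above.
FORM_SUBDIVISIONS = {"BIBLIOGRAPHY", "DICTIONARIES"}

def check_headings(headings):
    """Return a list of messages for headings that break the two rules."""
    problems = []
    mains, subs = set(), set()
    for heading in headings:
        parts = [p.strip() for p in heading.split("--")]
        mains.add(parts[0])
        subs.update(parts[1:])
        # Rule 1: a form subdivision may only appear in the last position.
        for part in parts[:-1]:
            if part in FORM_SUBDIVISIONS:
                problems.append(
                    f"{heading}: form subdivision {part} is not last"
                )
    # Rule 2: a concept is either a main heading or a subdivision, not both.
    for concept in sorted(mains & subs):
        problems.append(f"{concept}: used as both main heading and subdivision")
    return problems

inconsistent = check_headings(
    ["INTERFACES--USABILITY", "USABILITY--OPACS", "CANADA--BIBLIOGRAPHY--HISTORY"]
)
consistent = check_headings(
    ["USABILITY--INTERFACES", "USABILITY--OPACS", "CANADA--BIBLIOGRAPHY"]
)
print(inconsistent)
print(consistent)
```

The first list reproduces the INTERFACES--USABILITY and USABILITY--OPACS situation described above and is flagged; the second list follows one consistent pattern and passes cleanly.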

Index your articles --

Go through your articles and index each one using your vocabulary.  You will probably find you have to tweak the vocabulary a little bit at this point – you may find you need to add terms, or that some terms aren’t clear, or some overlap.  So make those adjustments and go ahead and index the articles.

For the postco index, you will probably assign 2-5 or 6 terms per record; on rare occasions an article might have only 1 term, or more than 6.  You will probably assign fewer precoordinate terms when you get to that stage: maybe only 2 or 3, and for a few articles only 1.

Create the records for your database --

Now that you’ve made all the decisions for your database, write the user guide -- an introduction for someone wanting to search your database.  Consider:   What do people need to know about your vocabularies to be able to search?  Do users need to know the syntax rules you’ve worked out?  What do they need to know about the kinds of decisions you’ve made concerning your vocabulary?  When should they search the natural language fields and when should they search the controlled vocabularies?  (You may not know this for sure until after you’ve done your evaluation, but you should have some ideas.)  It is useful to include your subject term lists in the user guide, perhaps as a printout of your validation lists, or as a printout from the QBE screen using F3.

Evaluate your database

First you have to decide how you will evaluate it: what will your criteria be?  How will you determine how well your vocabularies work, whether one works better than the other, and whether they work better or worse than the natural language fields for retrieval?   Refer back to the texts, your other reading, and the material we’ve covered in class; you know enough now to know what you want from a database.

Establish your criteria, then figure out how you will know if your database meets your criteria.  What tests or searches can you do to find out? 

Run these test searches and keep careful track of your tests  and results.
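The criteria are yours to choose. If, for example, you decide to use precision and recall (an assumption made here for illustration, not a requirement of the assignment), a small log like the one sketched below is one way to keep track of each test search. The queries, record numbers, and relevance judgments are invented.

```python
def precision_recall(retrieved, relevant):
    """Precision: share of retrieved records that are relevant.
    Recall: share of relevant records that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical test log: one entry per search you run.
test_log = [
    {"query": "search engines evaluation", "field": "postco",
     "retrieved": [2, 5, 9], "relevant": [2, 5]},
    {"query": "SEARCH ENGINES--EVALUATION", "field": "preco",
     "retrieved": [2], "relevant": [2, 5]},
]

results = []
for entry in test_log:
    precision, recall = precision_recall(entry["retrieved"], entry["relevant"])
    results.append((entry["field"], entry["query"], precision, recall))
    print(f"{entry['field']:7} {entry['query']:28} "
          f"precision={precision:.2f} recall={recall:.2f}")
```

Whatever criteria you settle on, recording the query, the field searched, what came back, and what you judged relevant gives you the evidence you need when you compare fields.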

Look at your results and decide which fields give the best subject access and why.  Is one field superior for one kind of search, but another field for other kinds?   Do some not give good results?  Analyze the strengths and weaknesses of your fields, and your indices, according to your tests and your own insights.

Share your findings and discoveries in your evaluation document.
