ACADEMIC COMPUTING and COMMUNICATIONS CENTER
 
Web Searching / Indexing


 

Send comments or bug reports to wwwtech@uic.edu.

 
   
 
     
Introduction
 

This document will help you build a custom index and query form for your Web documents. Or, at the least, you will learn how to adjust your documents so that the existing indexing and query forms find them when they should.

There are now two options: Google and the UIC Search Engine. Most of this document is about the older UIC Search Engine. Consider using a custom Google Search if it can meet your requirements. If not, keep reading.

Note: I assume you are already familiar with HTML. If you intend to build your own query page, you must already be familiar with HTML forms, as well. You don't need to be a programmer, but attention to detail is quite helpful.

The ADN is running the Netscape Catalog Web indexing system at UIC. Netscape Catalog is a complicated system with many moving parts, but it allows a huge degree of flexibility in customizing which documents are indexed, how they are indexed, how one performs a query, and how the results are presented. We have tried to configure this system to give individual users as much flexibility as possible, consistent with general Web security practices. At the same time, we've tried to make this system usable by those who don't want to understand every last detail. But since ease of use and flexibility pull in opposite directions, some compromise had to be reached. You will need to know some details, although I hope they aren't too hard to learn.

If you want to rely on the default indexing and searching, you still might want to know a little about Netscape Catalog, because the nature of your HTML documents can affect how they are indexed and, therefore, how easily they are found by the right people. You should read the Bare Minimum, just to make sure your documents turn up in the right searches.

This document is a bit long. If you want to get started quickly, go to the example section, copy one of the example search forms to your own Web directory, and make changes to it. After you understand how Netscape Catalog works, and how to formulate queries, you'll be more able to customize a search for others.

Even though this document is long, it is also incomplete. The material here is oriented toward how to construct search forms using Netscape Catalog and the relevant UIC modifications. A real discussion of how the catalog server works is beyond the scope of this document, as is a full discussion of the Verity search engine. Also, some features related to use at UIC are still under construction.

 
     
Create Your Own Search Form -- Bird's Eye View
 

If you want to create your own search form, either to create your own look-and-feel or to easily search through a subset of all the pages at UIC, here are the general steps.

  1. Make sure the files you want to search through are, in fact, indexed. And make sure they display well when retrieved. You can check this by using the general search from the UIC home page to find your files and see how they look. Some considerations are covered in Bare Minimum.
  2. In order to search for files, you must know what about each file actually gets indexed. For example, you can search for words in the title separately from words in the URL. This is covered in Searchable Fields.
  3. Once you know how files are indexed, you must know a little about how to construct a query to examine these indexes. The internal query language is governed by the Verity Engine. This is discussed in Construction of Queries. You don't have to know a lot about this, but a small familiarity is helpful.
  4. Make an HTML form that executes a query. Do so by naming each input field in the HTML form correctly. Once you set up the input fields (where the user will type in the words he wants to search for), an existing CGI script will translate this info into the Verity Query Language, perform the query, and return the results. This is discussed in Custom Web Search Forms.
  5. Once the search is working properly, you can customize the look-and-feel. You can add buttons to determine the order of the search results or the maximum number of hits. You can write a config file that will put your custom banner at the top of the results, or will put an additional search form on the search results, so that the user can refine his search. You can construct a link to search and display the "next 20". Or you can construct a URL to do a canned search. This is all discussed in Custom Output Formats.

The documentation is written mostly as a reference, and can be a bit complicated. You might want to start with the Examples and Tutorial.

 
     
Make Your Files Findable -- Bare Minimum
 

Even if you don't create a custom query page, there are a few things to keep in mind about your HTML documents.

  1. Netscape Catalog parses your HTML documents to find which words appear in which tagged elements. It is therefore important that you use valid HTML so that Netscape Catalog can understand it. This statement is less trivial than it may seem, because many browsers are quite tolerant of invalid HTML. Just because your page looks good on one (or more) browsers does not mean your HTML is valid. Try using weblint or some other HTML validator to check.
  2. Be careful to include a descriptive <TITLE> element inside the <HEAD> of every document. The TITLE is relatively unimportant when viewing the document with a browser, but can be quite important when searching for a given document. The words within a title are apt to describe a document more succinctly and clearly than almost any other automatically identifiable phrase. Also, the title is often used as a one-line identifier in the list of documents returned from a query.
  3. Use heading tags such as <H1> or <H3> to denote the headings of a document, not just font changes and centering. And don't use heading tags solely to change the font. Using heading tags to denote headings (and putting useful section titles into the headings) will help users navigate your document, and help the catalog server to categorize your document by placing the words from your headings into a special table-of-contents field.
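
As a rough self-check for the points above, the following Python sketch uses the standard library's html.parser module to pull the title and heading text out of a page. It only illustrates the kind of information an indexer can extract from tagged elements; it is not how Netscape Catalog itself parses documents, and the names `OutlineParser` and `outline` are made up for this example.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class OutlineParser(HTMLParser):
    """Collect the <title> text and the text of each heading element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []      # list of (tag, text) pairs
        self._current = None    # tag whose text is being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> and <title> both match.
        if tag == "title" or tag in HEADING_TAGS:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        # Collapse runs of whitespace so line breaks do not matter.
        text = " ".join("".join(self._buffer).split())
        if tag == "title":
            self.title = text
        else:
            self.headings.append((tag, text))
        self._current = None


def outline(html_text):
    """Return (title, headings) for one HTML document."""
    parser = OutlineParser()
    parser.feed(html_text)
    parser.close()
    return parser.title, parser.headings


sample = """<HTML><HEAD><TITLE>Physics 101 Syllabus</TITLE></HEAD>
<BODY><H1>Physics 101</H1><P>Welcome.</P>
<H3>Office Hours</H3><P>Tuesdays.</P></BODY></HTML>"""

title, headings = outline(sample)
print(title)
print(headings)
```

An empty title or an empty heading list from a check like this suggests the page relies on font changes rather than real <TITLE> and heading tags.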
 
     
General Scheme
 

Netscape Catalog reads and indexes a set of Web documents, most of which are HTML, but it handles other formats, too. This reading and indexing takes place automatically. (That is to say, the reading and indexing is done centrally, and hardly anyone needs to worry about the details.) All of the files are put into one large index, but appropriately worded queries can be constructed that search only a selected part of this data.

Roughly speaking, a file will be indexed or refreshed once per week; its index entry will be deleted after about 3 months if not refreshed. So once you create or change a file, it may take as long as a week for the change to show up in the index. Once you delete a file, its index entry will live on for about another 3 months.
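
To make the timing concrete, here is a small Python sketch of the arithmetic. It assumes "once per week" means exactly 7 days and "about 3 months" means 90 days; both figures are approximations chosen for illustration, and the function names are hypothetical.

```python
from datetime import date, timedelta

# Assumptions for illustration only: the real schedule is approximate.
REFRESH_INTERVAL = timedelta(days=7)
EXPIRY_AFTER_LAST_REFRESH = timedelta(days=90)


def latest_first_appearance(changed):
    """Latest date by which a new or changed file should be in the index."""
    return changed + REFRESH_INTERVAL


def latest_expiry(deleted):
    """Latest date a deleted file's entry should remain in the index.

    The last refresh happened on or before the deletion date, so the
    entry expires no later than 90 days after the deletion.
    """
    return deleted + EXPIRY_AFTER_LAST_REFRESH


print(latest_first_appearance(date(2008, 3, 1)))
print(latest_expiry(date(2008, 3, 1)))
```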

Netscape Catalog holds the index in a special web server called a "Catalog Server", whose function is to build the index and to respond to queries. The search engine used by the Catalog Server is from Verity.

One queries the Catalog Server through a web form. The data entered on the web form is transmitted to a CGI script. The CGI script then connects to the Catalog Server, performs the search, reformats the output, and sends the list of hits back to the end user.
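
The flow just described (form data in, search, reformatted hits out) can be sketched as a toy Python program. Everything here is a stand-in: the in-memory `INDEX`, the form field name `words`, the example.edu URLs, and the plain substring matching are all assumptions for illustration. The real CGI script, its field names, and the Verity search behave differently and are described in the later sections.

```python
from urllib.parse import parse_qs

# Toy stand-in for the index held by the Catalog Server (hypothetical data).
INDEX = [
    {"url": "http://www.example.edu/physics/", "title": "Physics Department"},
    {"url": "http://www.example.edu/physics/syllabus.html",
     "title": "Physics 101 Syllabus"},
    {"url": "http://www.example.edu/chem/", "title": "Chemistry Department"},
]


def parse_form(query_string):
    """Step 1: decode the submitted form data into a dict of single values."""
    return {name: values[0] for name, values in parse_qs(query_string).items()}


def search(index, words):
    """Step 2: return entries whose title contains every search word."""
    wanted = [w.lower() for w in words.split()]
    if not wanted:
        return []
    return [e for e in index if all(w in e["title"].lower() for w in wanted)]


def format_hits(hits):
    """Step 3: reformat the hits as lines of text for the end user."""
    return "\n".join(f"{e['title']} <{e['url']}>" for e in hits)


def handle_request(query_string):
    """Run the three steps for one submitted form."""
    form = parse_form(query_string)
    hits = search(INDEX, form.get("words", ""))
    return format_hits(hits)


print(handle_request("words=physics+syllabus"))
```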

The owner of a set of web pages may want to customize this search procedure. There are several opportunities for user customization.

  1. The user can add some web pages to the overall index that might not have been included otherwise. Or the user can ask to have certain pages not included in the overall index. (The details of how to do so aren't ready yet. Stay tuned.)
  2. The user can modify his web pages in various ways to change the manner in which the page is indexed. For example, the user should put a descriptive title in the <title> element, because pages are often searched and indexed by their titles. Also, it is possible to create other categories of indexing through use of the <meta> tag. That is, a user could label a document as having an author whose name is Bob, and then search for all documents authored by Bob.
  3. The user can prepare his own HTML form for initiating searches. This custom form can be used to restrict the search in various ways, perhaps to the subpages of a single department.
  4. The user can customize the format of the returned hits to some extent, by preparing a special configuration file, or by the use of hidden variables in the original search form.
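
To illustrate the <meta> idea from item 2, the Python sketch below reads name/content pairs from <meta> tags and then selects the documents whose author is Bob. The meta name `author`, the sample pages, and the helper names are assumptions for this example; which meta fields Netscape Catalog actually indexes is covered under Searchable Fields.

```python
from html.parser import HTMLParser


class MetaParser(HTMLParser):
    """Collect name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            # Store names in lowercase so "Author" and "author" match.
            self.meta[name.lower()] = content


def meta_fields(html_text):
    """Return a dict of the meta name/content pairs in one document."""
    parser = MetaParser()
    parser.feed(html_text)
    parser.close()
    return parser.meta


# Hypothetical pages, keyed by file name.
pages = {
    "notes.html": '<HEAD><TITLE>Notes</TITLE>'
                  '<META NAME="Author" CONTENT="Bob"></HEAD>',
    "faq.html": '<HEAD><TITLE>FAQ</TITLE>'
                '<META NAME="Author" CONTENT="Alice"></HEAD>',
}

by_bob = [name for name, html in pages.items()
          if meta_fields(html).get("author") == "Bob"]
print(by_bob)
```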
 
 



2008-2-12  wwwtech@uic.edu