ACCC Home Page ACADEMIC COMPUTING and COMMUNICATIONS CENTER
 
Web Searching / Indexing

Which Files Get Indexed?

   
 
     
How the Robot Finds Files
 

There are several avenues the catalog server uses in finding files to index:

  • Netscape Catalog starts with the UIC home page, http://www.uic.edu. It finds all the links in that page, downloads the associated files, and indexes them, then descends the Web hierarchy recursively. In this pass, however, it indexes only files served from tigger or icarus. At the time of writing, this pass runs twice per week; the timing or frequency may change.
  • Another pass selects non-ADN servers on the UIC campus that are referenced from the UIC home page, such as the Math and EECS home pages. These pages and their links are descended to a depth of four. This pass runs once per week.
  • Yet another pass selects off-campus servers that are referenced from the UIC home page. These pages are indexed, but their links are not traversed. This pass also runs once per week.
  • A list of personal home pages is compiled from the ph database. (If you want your personal home page listed in ph, log onto tigger or icarus and use the phupdate command.) Links are followed only when they involve personal pages.
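
The first three passes above differ mainly in which servers they accept and how deep they follow links. As a rough sketch only (not the actual Netscape Catalog implementation), the idea can be modelled as a breadth-first walk with a depth limit and a host filter. The LINKS table, the crawl function, and the host math.example are hypothetical stand-ins; a real robot would download and parse each page instead of reading a dictionary.

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical link graph standing in for real page downloads.
LINKS = {
    "http://www.uic.edu/": ["http://www.uic.edu/depts/", "http://math.example/"],
    "http://www.uic.edu/depts/": ["http://www.uic.edu/depts/a.html"],
    "http://math.example/": ["http://math.example/courses.html"],
}


def crawl(links, start, max_depth, allowed_hosts=None):
    """Return the URLs one pass would index, in breadth-first order.

    max_depth=0 indexes the start page without traversing its links.
    allowed_hosts=None means any host is accepted (an assumption of this sketch).
    """
    indexed = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        host = urlsplit(url).hostname
        if allowed_hosts is not None and host not in allowed_hosts:
            continue  # outside this pass's servers: neither index nor follow
        indexed.append(url)
        if depth >= max_depth:
            continue  # index the page but do not traverse its links
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return indexed
```

Under these assumptions, a pass like the second one would call crawl with max_depth=4 on each selected server, while a pass like the third would use max_depth=0 so that pages are indexed but their links are left alone.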
 
     
Robots - Keep Off!
 

What if you don't want your files indexed? By convention, indexing robots normally download a file called /robots.txt when first contacting a server. This file tells the robot which files and directories to stay away from; the Netscape Catalog server at UIC is configured to respect this convention. So if you have files on tigger or icarus that you don't want indexed, all you have to do is get those directories listed in the system-wide robots.txt. (If your files are on another server, see your sysadmin.)

How do you get listed in robots.txt? Just create your own file named robots.txt in your own webspace. Once per day, a background task finds all such files and rewrites the system-wide file.

Here are details on the robots.txt conventions, but the file can be pretty simple. Just list the directory subtrees, one to a line, that you want the robot to avoid. For example, I might put the following in /homes/home1/bobg/public_html/robots.txt :

disallow: /~bobg/restricted
disallow: /~bobg/good_stuff/dontlookhere
Then any robot should avoid asking for URLs that begin with these strings. Notes:
  • Start each rule with the string disallow:
  • Start each URL with a slash ( / ) and include the full path of the URL you want to block.
  • You can only use URLs that refer to files below your own directory in the filesystem. In the above file, I could not put in disallow: /depts because that does not refer to anything in my personal home pages.
  • If you want to check the results, look at the system-wide tigger robots.txt or icarus robots.txt. Remember, it takes a day to register your entries.
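
To see what "begin with these strings" means in practice, here is a minimal sketch of the prefix test a well-behaved robot might apply. The function names parse_disallow and is_blocked are made up for illustration, and the sketch handles only the disallow: lines described above, not the full robots.txt conventions.

```python
from urllib.parse import urlsplit


def parse_disallow(text):
    """Collect the path prefixes from 'disallow:' lines, ignoring anything else."""
    rules = []
    for line in text.splitlines():
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow":
            value = value.strip()
            if value.startswith("/"):  # each rule must start with a slash
                rules.append(value)
    return rules


def is_blocked(url, rules):
    """True if the path part of the URL begins with any disallowed prefix."""
    path = urlsplit(url).path
    return any(path.startswith(rule) for rule in rules)


rules = parse_disallow(
    "disallow: /~bobg/restricted\n"
    "disallow: /~bobg/good_stuff/dontlookhere\n"
)
print(is_blocked("http://www.uic.edu/~bobg/restricted/notes.html", rules))
print(is_blocked("http://www.uic.edu/~bobg/index.html", rules))
```

Because the test is a plain string-prefix match, a rule such as /~bobg/restricted would also cover a sibling path like /~bobg/restricted2. Adding a trailing slash to the rule narrows it to the directory itself.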

Example

Assume you have a tigger directory, /homes/home1/bobg/public_html. Under there, you have some further directories that you want to block. The urls you want to block are:

 http://www.uic.edu/~bobg/block1
 http://www.uic.edu/~bobg/ok/block2
Then prepare a robots.txt file like this, and place it in the public_html directory:
disallow: /~bobg/block1
disallow: /~bobg/ok/block2

NOTE: You must include the full prefix of the path part of the url (in this case, /~bobg/). The disallow line will be added to the master robots.txt file without editing, if it seems relevant to the directory it is in.
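
The exact relevance test used by the daily background task is not documented here. As an illustration only, the sketch below assumes that a rule counts as relevant when its path begins with the owner's /~username/ prefix, and that relevant lines are copied unchanged. The function merge_user_rules and its input format (a dictionary from username to the text of that user's robots.txt) are hypothetical.

```python
def merge_user_rules(user_files):
    """Gather relevant 'disallow:' lines from per-user robots.txt files.

    user_files maps a username to the text of that user's robots.txt.
    Assumption of this sketch: a rule is relevant only if its path starts
    with '/~username/'. Relevant lines are copied without editing.
    """
    merged = []
    for user in sorted(user_files):
        prefix = "/~" + user + "/"
        for line in user_files[user].splitlines():
            field, sep, value = line.partition(":")
            if not sep or field.strip().lower() != "disallow":
                continue  # not a disallow rule
            if value.strip().startswith(prefix):
                merged.append(line)
    return merged


merged = merge_user_rules({
    "bobg": (
        "disallow: /~bobg/block1\n"
        "disallow: /~bobg/ok/block2\n"
        "disallow: /depts\n"  # outside bobg's pages, so dropped
    ),
})
print("\n".join(merged))
```

Under this assumption, a rule without the /~bobg/ prefix, such as disallow: /depts, never reaches the master file, which is why the full prefix matters.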

 
 



2005-6-18  wwwtech@uic.edu