Reproduced with permission from the NLA Review

Effective Information Acquisition Using Internet Technology
By
Thomas C. Omer
University of Illinois at Chicago

Introduction

The Internet represents a vast storehouse of information, with over 50 million sites providing information on everything from A to Z. Such a plethora of information creates two problems: where to start looking and, once the starting point is found, how to acquire information in the most cost-beneficial way (in terms of time).

The Internet of several years ago provided only a few sites that concentrated on information searches. Today the offering of online information services is growing daily. These sites compete for users by continually upgrading and adding to their information offerings. The sites are commonly referred to as "search engines" and represent one of the fastest growing segments of the Internet information system. Supported by advertising dollars, these sites compete based on the "hits" each site can provide to a sponsor's web site.

Search engines are free to users and represent an excellent means for acquiring information on almost any research topic. This article is not a recommendation of any particular site but an overview intended to increase reader knowledge about the use of search engines and to provide a place to start a search. Choosing a search engine is truly a matter of personal taste. The names and Internet addresses of the five search engines mentioned are:

Yahoo http://search.yahoo.com
Infoseek http://www.infoseek.com
Lycos http://www.lycos.com
Excite http://www.excite.com
AOL Netfind http://netfind.aol.com

What is a Search Engine?

The term "search engine", as generally used, refers to three types of information sites. Two of these types are "true search engines" and "directories". The third type combines the features of the other two and is referred to as a "hybrid search engine."

True Search Engine

A "true" search engine has three parts. The first part is a "robot" or "crawler" that goes to every page, or to representative pages, on the Web and creates an index. The second part is the index itself, which contains a copy of every web page visited by the robot or crawler. The third part is the interface program that receives visitor requests, compares each request with entries in the index, and returns the results. Depending on the search engine, the results can be returned as just the hypertext link or as the link and a short information clip about the page at that link. The abbreviated result allows more links on one page, but the information clip helps identify the information source most relevant to the search.
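
The three parts can be illustrated with a minimal Python sketch. It is not how any engine named in this article is built. The pages are a small made-up collection held in memory, the "robot" simply walks that collection rather than following links on the Web, and the index is simplified to a word-to-address lookup instead of full page copies. All names (PAGES, crawl, build_index, search) are hypothetical.

```python
from collections import defaultdict

# Hypothetical in-memory "web": address -> page text. A real robot would
# fetch pages over the network and follow hyperlinks.
PAGES = {
    "http://example.com/a": "Depreciation recapture rules for residential property",
    "http://example.com/b": "State tax law summary",
    "http://example.com/c": "Federal tax law and depreciation schedules",
}


def crawl(pages):
    """Part 1: the robot visits each page and yields (address, text)."""
    for url, text in pages.items():
        yield url, text


def build_index(visited):
    """Part 2: the index maps each lowercased word to the pages containing it."""
    index = defaultdict(set)
    for url, text in visited:
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Part 3: the interface returns the pages that contain every query word."""
    words = query.lower().split()
    if not words:
        return []
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(matches)


index = build_index(crawl(PAGES))
print(search(index, "tax law"))
```

Because the index is built ahead of time, a visitor's request is answered from the index alone and no page is revisited at search time.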

Directories

A directory depends on humans for its listings. Web site creators submit a short description of their web page, or directory editors write descriptions for web sites they have reviewed. A search of a directory looks for matches only in the descriptions submitted by creators or written by reviewers.

Hybrid Search Engines

Hybrids primarily use robots or crawlers to gather information on the web but also maintain an associated directory created by submissions or reviews from editors. Because the focus of the hybrid is on the robot or crawler, a submission by a web page creator may be less likely to be accepted into a hybrid's directory than one submitted to a true directory. Visitors to the hybrid engines are given the option to search using the search engine or the directory.

Which One?

Whether a user is using a "true search engine", directory, or hybrid is probably not obvious. However, a few points about these "search engines" should be made. The "true search engine" is limited in its search only by program parameters (which may be few in number). Therefore, a "true search engine" is likely to collect, in total, more web sites in its index than a directory. If the user is looking for really obscure information on the web, a search may be more successful using a "true search engine" or a hybrid. On the other hand, "true search engines" may collect more extraneous information, which requires users to develop their own information filter. Directories, because they rely on submissions and editor reviews, are already filtered at one level and are most likely to contain quality information sites. Thus, depending on the user's search needs, one site may be preferred over another. Of the "search engines" mentioned in this article, Yahoo is the only true directory. Excite, Lycos, and Infoseek are hybrids, and AOL Netfind is the closest to a "true search engine." The remainder of this article will ignore these differences and use the common term "search engine".

How to Get There

How do users get to these engines? Obviously one approach is to type in the address provided in this article. However, to minimize access time, use the features of a favorite web browser. Netscape and Microsoft's Internet Explorer provide access to multiple search engines. If you click once on the "Net Search" button on the Netscape toolbar or on the "Search" button on the Internet Explorer toolbar, you will find the search engines mentioned here along with over thirty indexes covering such things as business pages, phonebooks, and topic indexes.

How They Work

Most of the major search engines attempt to do something close to indexing the entire content of the World Wide Web. Once a site's pages have been indexed, the search engine will return periodically to the site to update the index. Some search engines give special weighting to words in the title, words in subject descriptions, words listed in HTML META tags, the first words on a page, and the frequent recurrence (up to a limit) of a word on a page.
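
The weighting idea can be sketched in a few lines of Python. This is an illustration only: the weights, the ten-word "first words" window, and the cap of five repetitions are invented for the example, and the page records and the score function are hypothetical. No search engine publishes its actual rules in this form.

```python
def score(page, term, title_weight=3.0, meta_weight=2.0, lead_weight=1.5,
          lead_words=10, max_count=5):
    """Score one page for one search term. All weights are illustrative."""
    term = term.lower()
    body = page["body"].lower().split()
    total = 0.0
    if term in page["title"].lower().split():
        total += title_weight          # word appears in the title
    if term in (k.lower() for k in page.get("meta_keywords", [])):
        total += meta_weight           # word listed in the META keywords
    if term in body[:lead_words]:
        total += lead_weight           # word among the first words on the page
    # Recurrence raises the score, but only up to the cap.
    total += min(body.count(term), max_count)
    return total


# Hypothetical page records.
page_a = {
    "title": "Tax Law Basics",
    "meta_keywords": ["tax", "law"],
    "body": "An overview of state tax rules. Tax rates vary.",
}
page_b = {"title": "Welcome", "body": "tax " * 20}

print(score(page_a, "tax"), score(page_b, "tax"))
```

The cap is the design point worth noting: page_b repeats the word twenty times but still scores below page_a, which uses it in the title and the META keywords.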

Because each search engine uses a somewhat different indexing and retrieval scheme, users may find themselves gravitating to one engine repeatedly, usually because its scheme seems consistent with how they categorize or index information. A good strategy is to begin the search at a favorite search engine and to use the other available engines when the primary engine does not produce the information. Remember, a slightly different indexing and retrieval scheme may be the key to a successful information search.

How to Search

Searching for information is an individualistic exercise. Thus, search routines suggested by colleagues may not be consistent with how you like to proceed. However, there are some general rules regarding searches. While these rules are quite general they represent an approach that can be adapted to your own style.

Begin a search using ideas and concepts (a selection sometimes provided as an alternative to keyword searches) instead of just keywords. Always use more than one word in a search. Some search engines (Excite for example) attempt to find relationships that exist between words and ideas, so the search results will contain words related to the concept being searched.

Use descriptive, specific words as opposed to general ones. For example, a search for "depreciation recapture" will return much more specific results than a search for "depreciation." Of course "Section 1250 depreciation recapture" is even more specific. If the search is for particular asset recapture rules a search might include "Section 1250 depreciation recapture" and "Residential."

Use the filtering features provided at the search engine being used. Most search engines provide some way to filter the pages returned from a search. For example, Excite provides a power search option, which guides users through restricting a search using several methods. Some engines sort the search results by site. This presents several pages from the same site together, thus minimizing search time when one site has multiple documents relevant to the search. Other search engines present search results sorted by the relevance of the information on the page (this depends on the search engine's relevance rules). Thus, if the search is for "tax law", the search engine's first result might be the page with the most occurrences of the words tax and law on the same page. In addition to the relevance ordering, many search engines provide additional means for narrowing a search. These take the form of additional links related to the search terms. Infoseek uses a related topics index, Excite uses links titled "More Like This Link", and Lycos provides a section for additional information related to a topic.
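
The two orderings described above, by relevance and by site, can be compared in a short Python sketch. The result list and its relevance numbers are made up for illustration, and the function names are hypothetical. Only urlparse, from the Python standard library, is a real call.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical search results: (address, relevance score).
results = [
    ("http://www.example.gov/tax/forms.html", 0.77),
    ("http://www.example.edu/law/index.html", 0.84),
    ("http://www.example.gov/tax/law.html", 0.91),
]


def by_relevance(results):
    """Order results from most to least relevant."""
    return sorted(results, key=lambda r: r[1], reverse=True)


def by_site(results):
    """Group addresses under their host name, most relevant first."""
    groups = defaultdict(list)
    for url, _relevance in by_relevance(results):
        groups[urlparse(url).netloc].append(url)
    return dict(groups)


print(by_relevance(results))
print(by_site(results))
```

Grouping by site keeps both pages from www.example.gov together, which is the time saving the site-sorted engines aim for.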

These general search hints should reduce the number of pages returned when searching for information. However, the general search routines described above are likely to produce more pages than users might be willing or have time to deal with. How can the number of pages returned be reduced even more? Many users rely on the special search operators discussed below.

Character Operators

To reduce the number of pages returned from a search, you can use special characters that tell the search engine to include or exclude words or to search for specific phrases. The common characters available at all search engines are discussed first, followed by characters specific to the sites mentioned above. When using these characters, pay careful attention to spaces. Some operators do not allow spaces between the character and the search term, and others require that the search term and the operator be separated by at least one space.

Common Character Operators

" " Quotation Marks: If searching for a phrase where word order is important, enclose the phrase in quotes. For example a search on the terms management and discussion returns all pages with any or all of the words, in any order somewhere on the page. But a search on "management discussion" returns pages with the exact phrase.

Plus (+): If a plus sign is placed directly in front of a word, all the documents retrieved will contain that word. Thus, a search for +tax+law will retrieve pages with the words tax and law. On the Excite search engine that search term configuration returns over 12,760 pages. Using the phrase "tax law" on the Excite search engine returns 5000 pages. Combining the phrase operator and the plus sign to narrow the search for tax laws can further reduce the pages returned. For example, a search using the combination +"tax law"+state returns only 600 pages on the Excite search engine. Finally, the combination +"tax law"+state+Minnesota returns only 314 pages.

Minus (-): If a minus sign is placed in front of a word, the search engine will NOT retrieve documents containing that word. For example, a search for "tax law"+state-federal will return only 524 pages instead of the 600 pages returned when "tax law"+state was used.
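
The way the three common operators combine can be sketched in Python. This is an assumption-laden illustration, not the logic of Excite or any other engine: the function names are hypothetical, a term is treated as ending at the next plus or minus sign (so hyphenated words are not supported), and matching is on whole words after punctuation is removed.

```python
import re

# A term is an optional + or - sign followed by a quoted phrase or a bare word.
TOKEN = re.compile(r'([+-]?)(?:"([^"]+)"|([^\s+\-"]+))')


def parse(query):
    """Split a query into required (+), excluded (-), and unmarked terms."""
    required, excluded, unmarked = [], [], []
    for sign, phrase, word in TOKEN.findall(query):
        term = (phrase or word).lower()
        if sign == "+":
            required.append(term)
        elif sign == "-":
            excluded.append(term)
        else:
            unmarked.append(term)
    return required, excluded, unmarked


def normalize(text):
    """Lowercase, drop punctuation, and pad with spaces for whole-word tests."""
    return " " + " ".join(re.findall(r"\w+", text.lower())) + " "


def matches(text, query):
    """Required terms must all appear and excluded terms must not.
    If no term is marked required, at least one unmarked term must appear."""
    required, excluded, unmarked = parse(query)
    page = normalize(text)
    if any(normalize(t) not in page for t in required):
        return False
    if any(normalize(t) in page for t in excluded):
        return False
    return bool(required) or any(normalize(t) in page for t in unmarked)


print(parse('+"tax law"+state-federal'))
print(matches("State tax law in Minnesota.", '+"tax law"+state-federal'))
print(matches("Federal and state tax law.", '+"tax law"+state-federal'))
```

Each added operator can only remove pages from the result, which is why the page counts in the examples above shrink as operators are combined.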

Engine Specific Characters

Pipe (|): This character tells the Infoseek search engine that a search should move from the general category to a specific topic. Continuing with the tax law example, "tax law"|state produces 836 pages. Thus, the pipe allows a category to be specified ("tax law") and a specific search to be conducted ("state") within that category.

Wildcard ($ or *): These operators are used by Lycos ($) and Yahoo (*) to allow a search for word variations. For example, Liab$ or Liab* will search for words that begin with Liab, such as Liability. If unsure about spelling, users can search based on known letters. For example, assume the correct spelling of Deloitte Touche is unknown. A search on Lycos can be conducted using the following:

Del$ AND To$.
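
The wildcard idea amounts to matching on the beginning of a word. The Python sketch below shows that reading under stated assumptions: the wildcard is allowed only at the end of the pattern, matching ignores case, and the function names and sample text are made up for the example.

```python
def wildcard_match(pattern, word):
    """True if word matches pattern; a trailing $ or * stands for any ending."""
    pattern, word = pattern.lower(), word.lower()
    if pattern.endswith(("$", "*")):
        return word.startswith(pattern[:-1])
    return word == pattern


def find_words(pattern, text):
    """Return the words in text that match pattern, without punctuation."""
    words = [w.strip(".,;:") for w in text.split()]
    return [w for w in words if wildcard_match(pattern, w)]


text = "Deloitte Touche reported a contingent liability and other liabilities."
print(find_words("Liab$", text))
print(find_words("Del*", text), find_words("To$", text))
```

A short stem such as To$ matches many unrelated words, so pairing it with a second stem, as in the Lycos example above, keeps the search from becoming too broad.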

Document Properties (t:, u:, link:, url:, title:, site:): These operators are used by Yahoo (t:, u:) and Infoseek (link:, url:, title:, site:) to search document properties. Searches using these characters will look for the search term only in the specified document property. Thus, a search using "dell" will return all pages with the term "dell" on the page. A search using u:dell will return pages with URLs containing "dell" in the address (e.g. http://www.dell.com).
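
A property search can be pictured as the same lookup restricted to one field of the page record. The Python sketch below covers only the title and address prefixes (t:, title:, u:, url:). It assumes a made-up list of page records, uses simple substring matching, and is not a description of how Yahoo or Infoseek implement these operators.

```python
# Hypothetical page records.
PAGES = [
    {"url": "http://www.dell.com", "title": "Computers",
     "body": "Dell laptops and desktops"},
    {"url": "http://www.example.com/poems", "title": "The Dell",
     "body": "A farmer in the dell"},
]

# Operator prefix -> the page field it restricts the search to.
PREFIXES = {"u": "url", "url": "url", "t": "title", "title": "title"}


def property_search(query, pages):
    """Search one property if the query has a known prefix, else title and body."""
    prefix, sep, term = query.partition(":")
    if sep and prefix.lower() in PREFIXES:
        fields = [PREFIXES[prefix.lower()]]
    else:
        term = query
        fields = ["title", "body"]
    term = term.lower()
    return [p["url"] for p in pages
            if any(term in p[field].lower() for field in fields)]


print(property_search("dell", PAGES))
print(property_search("u:dell", PAGES))
print(property_search("t:dell", PAGES))
```

The plain search finds both pages, while each prefixed search finds only the page whose address or title contains the term.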

BEFORE, FAR, NEAR, and ADJ: These operators are used by the Lycos search engine to help define relationships between search terms.

BEFORE: Documents found must contain the search terms in the order specified. BEFORE is similar to the use of AND in a Boolean search (discussed below). The difference is that in a BEFORE search the words must occur in the order specified. For example:

Interest BEFORE Basis

However, there is no limit to the number of words that might occur between the search terms.

FAR: Documents found must contain the search terms more than 25 words apart. For example:

Partnership FAR contribution

FAR ignores search terms closer than 25 words on the same page. Thus, searches using FAR should be used in conjunction with other search operators.

NEAR: Documents found must contain the search terms within 25 words of each other. NEAR is the opposite of FAR. For example:

Partnership NEAR basis

ADJ: Documents found must contain the search terms right next to each other. The search will return the occurrence of the search terms next to each other regardless of order. For example:

Property ADJ Dividend

"O" and "/": These characters can be combined with NEAR, ADJ, and FAR to increase the precision of the search. "O" imposes order on searches conducted with NEAR, ADJ, and FAR. "/" allows the user to specify the number of words that the NEAR, FAR, and ADJ operators use to find occurrences of the search terms. For example:

Corporate OADJ/2 Distribution

This will return pages with corporate and distribution in that order separated by at most one word. The search indicated above will produce pages with phrases like "corporate property distribution" or "corporate cash distribution."
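
The proximity operators above all reduce to comparing the positions of the two terms on a page. The Python sketch below is one reading of the descriptions in this article, not the Lycos implementation. The function name and its parameters are hypothetical; "ordered" stands in for the "O" prefix and "distance" for the number after "/". It also assumes that a single qualifying pair of occurrences is enough for a page to match.

```python
import re

# Default word distances taken from the descriptions above.
DEFAULT_DISTANCE = {"ADJ": 1, "NEAR": 25, "FAR": 25}


def proximity(text, left, op, right, distance=None, ordered=False):
    """Test one page against: left OP right (op is BEFORE, ADJ, NEAR, or FAR)."""
    words = re.findall(r"\w+", text.lower())
    left_pos = [i for i, w in enumerate(words) if w == left.lower()]
    right_pos = [i for i, w in enumerate(words) if w == right.lower()]
    pairs = [(i, j) for i in left_pos for j in right_pos]
    if op == "BEFORE" or ordered:
        # Keep only occurrences where the left term comes first.
        pairs = [(i, j) for i, j in pairs if i < j]
    if op == "BEFORE":
        return bool(pairs)  # any number of words may separate the terms
    limit = DEFAULT_DISTANCE[op] if distance is None else distance
    gaps = [abs(i - j) for i, j in pairs]
    if op == "FAR":
        return any(gap > limit for gap in gaps)
    return any(gap <= limit for gap in gaps)  # ADJ and NEAR


# Corporate OADJ/2 Distribution
print(proximity("corporate property distribution",
                "corporate", "ADJ", "distribution", distance=2, ordered=True))
print(proximity("distribution of corporate assets",
                "corporate", "ADJ", "distribution", distance=2, ordered=True))
```

Under this reading, ADJ is simply NEAR with a distance of one word, and FAR is NEAR with the comparison reversed.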

Boolean Operators

Boolean operators tell the search engine to search for documents that contain exactly the words entered. Boolean operators include AND, AND NOT, OR, and parentheses. These operators must appear in ALL CAPS with a space on either side.

AND: Documents found must contain all words joined by the AND operator. For example, to find documents that contain the words "FASB," "pension," and "costs," enter:

FASB AND Pension AND costs

OR: Documents found must contain at least one of the words joined by OR. For example, to find documents that contain the phrase "balance sheet" or the word "asset," enter:

"Balance sheet" OR asset

AND NOT: Documents found cannot contain the word that follows the term "AND NOT." For example, to find documents that contain information about financial derivatives you might enter:

Derivative AND NOT calculus

( ): Parentheses are used to group portions of Boolean queries together for more complicated queries. For example, to find documents with information about the risk of or losses associated with financial derivatives you might enter:

"Financial Derivative" AND (risk OR loss)

Character operators, in many cases, can be used instead of Boolean operators and may be preferred because Boolean operators are concerned only with the search terms entered rather than with any relationships that might exist between the words. The best search strategy involves using combinations of operators to retrieve web pages containing the desired information. Some search engines allow character operators and Boolean operators to be used in the same search phrase; others do not. If you are not sure, try a combination. The search engine will indicate whether the combination is possible.
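
The Boolean operators described in this section correspond directly to operations on sets of documents: AND is intersection, OR is union, and AND NOT is difference. The Python sketch below shows that correspondence using a small made-up document collection. The names DOCS and containing are hypothetical, and the queries are written out by hand as set expressions rather than parsed from a search string.

```python
import re

# Hypothetical document collection: id -> text.
DOCS = {
    1: "Financial derivative risk disclosures",
    2: "Financial derivative accounting",
    3: "The derivative in calculus",
    4: "Loss on a financial derivative contract",
    5: "FASB pension costs guidance",
}


def normalize(text):
    """Lowercase, drop punctuation, and pad with spaces for whole-word tests."""
    return " " + " ".join(re.findall(r"\w+", text.lower())) + " "


def containing(term):
    """Return the set of document ids that contain the word or phrase."""
    return {doc_id for doc_id, text in DOCS.items()
            if normalize(term) in normalize(text)}


# FASB AND pension AND costs
print(containing("FASB") & containing("pension") & containing("costs"))

# Derivative AND NOT calculus
print(containing("derivative") - containing("calculus"))

# "Financial Derivative" AND (risk OR loss)
print(containing("financial derivative") & (containing("risk") | containing("loss")))
```

The parentheses in the last query matter in the same way they do in arithmetic: the union of the risk and loss documents is formed first and then narrowed to those that also contain the phrase.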

Evaluation of Information Sources

Information accessed on the Internet suffers, at times, from a lack of credibility. Because no oversight committee determines the content of Web pages, researchers using the Internet must develop criteria for evaluating the information sources available through this medium. Whatever criteria you develop, consider the following about the information source.

Information Source Credibility

Not all information sources on the Web are created equal. Consequently, you should develop a reliability index for the sources providing information. High on the list of reliable information sources should be government sites (U.S. Treasury), academic institutions (University of Chicago), respected private organizations (Tax Analysts), and some reputable commercial sites (Microsoft).

In addition, users should consider whether the information provided is available in other formats. For example, congressional bills are available through other sources (i.e., printed documents). Thus, high reliability should be associated with the Internet site "Thomas" (http://thomas.loc.gov) which is a government source with information also available in another medium.

Also consider whether the material provides proper references to the underlying sources it relies upon. In other words, if the information makes a factual (versus opinion) statement that relies on other documentation, you should be able to obtain the other document source to verify the statement.

Summary

The purpose of this article was to provide an overview of search engines and hints about using the technology to gather information. Using the features associated with the search engines can dramatically reduce the time and effort necessary to retrieve information using the Internet. While the search process is individualistic and determined by the question being addressed, this overview should provide the basic building blocks for effective information retrieval. See you on the Web.