Finding things on the web

A Page on the Web, published in the Solicitors Journal, June 1999.

I suspect that few people other than information professionals and other web technophiles like using search engines. What users want is to find things on the web quickly and easily. They are not interested in how the data is gathered, indexed or delivered, simply in the results themselves.

Search mechanisms are ubiquitous on the web. The term ‘search engine’ is most commonly understood to refer to the global search mechanisms which seek to index the whole of the web, but it should also be understood to encompass facilities with more modest aspirations: similar technologies are used to index information stored on local servers or groups of servers.

Search mechanisms are also employed for application-specific purposes, ie to search not the web pages on a server, but particular structured databases held on the server – everything from train timetables and product catalogues to case reports and other document databases. These work quite differently and are not the subject of this article.

Global searches

Back in early 1995 it was estimated that there were 7 million pages on the web. Frankly this seemed to me, even at the time, to be a serious underestimate. Nevertheless a database of that order of magnitude was a formidable challenge for the search engines which grew up to service it. Four years on, the web consists of probably more than a billion pages of information, and an elite handful of services, including such names as Alta Vista, Excite, Infoseek, Lycos and Yahoo, have established themselves as robust and efficient solutions to searching the whole of the web (though none indexes every page).

Which search engine to use comes down in the end to personal preference. My own favourite is Alta Vista at altavista.digital.com, which consistently has the largest number of pages indexed, and the detail below is based on its functions. Other major search engines will provide similar functions.

Simple searches

I’m told 80 per cent of people who own a VCR use it only to play back pre-recorded video tapes. Similarly, most people who use these sophisticated search tools use them in a very unsophisticated way, simply typing in a few words and hoping for the best. If you’re one of this 80 per cent, let me suggest four simple points:

  1. Think before you type. Understand that you are searching a database which can contain literally anything: where ‘case’ does not mean just court proceedings, but 12 bottles of wine, travel goods, patient or client records etc. Understand also that you are searching web ‘pages’ which may be anything from a few to many hundreds of lines long.
  2. Get the basics right! Typing in a series of words will match any one or more of those words. Typing a phrase within “double quotes” will match that exact phrase. Generally use lower case only; use Initial Capitals only where you wish to restrict your search to instances of capitals.
  3. Interpret the results. Don’t just click on the first titles you see. Assess the list for relevance according to the information provided: usually title, date and URL, and maybe summary as well. Refine or revise your search rather than ploughing through a long list of hits.
  4. Print out the Help pages. You may never read them otherwise.
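
For the technically curious, the matching rules in point 2 can be pictured with a toy model in Python. This is only a sketch of the behaviour described above, not how Alta Vista or any other engine is actually built; the function names and the sample page are invented for illustration.

```python
import re

def words(text):
    """Split text into words, keeping each word's capitalisation."""
    return re.findall(r"[A-Za-z0-9]+", text)

def term_matches(term, word):
    """An all-lower-case term ignores case; a term with capitals must match exactly."""
    if term == term.lower():
        return term == word.lower()
    return term == word

def matches_any_word(query, page):
    """A plain series of words matches a page containing any one of them."""
    page_words = words(page)
    return any(term_matches(t, w) for t in words(query) for w in page_words)

def matches_phrase(phrase, page):
    """A quoted phrase matches only if its words appear consecutively and in order."""
    terms, page_words = words(phrase), words(page)
    if not terms:
        return False
    n = len(terms)
    return any(
        all(term_matches(t, w) for t, w in zip(terms, page_words[i:i + n]))
        for i in range(len(page_words) - n + 1)
    )

# Hypothetical page text, showing two senses of 'case' side by side.
page = "The Court of Appeal ordered a case of wine"
print(matches_any_word("appeal timetable", page))  # one word is enough
print(matches_phrase("court of appeal", page))     # exact phrase, any capitalisation
print(matches_any_word("Wine", page))              # capital W restricts the match
```

On the sample page the first two searches succeed and the third fails, because the page contains ‘wine’ only in lower case.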

Advanced searches

Once you are comfortable with the basics, why not read those Help pages you printed out? You’re now into the world of advanced searches where formidable terms, like ‘Boolean operators’, once the preserve of information professionals, crop up.

In truth an effective advanced search requires no more expertise than – to go back to the VCR analogy – being able to select a channel and press ‘Record’. You can move on to advanced advanced searches later if you’re up to it – program your VCR; understand what those numbers are in the TV listings!

Boolean operators or simply ‘operators’ are essentially the words AND, OR, NOT and NEAR – or their syntactical equivalent – used to connect search terms, thus refining the results. It should be obvious what each does on its own, but always bear in mind that the record being searched is a whole web page, so AND might give you more hits than you would expect.

In my own experience the operator NEAR has proved the most useful: it finds terms within 10 words of each other, which is looser than an exact phrase where I cannot afford to be too precise, but much more precise and useful than AND.
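
The difference between AND and NEAR is easier to see in a small model. The Python sketch below treats a page as a plain sequence of words and is an illustration only: the function names are invented, the 10-word window follows the figure quoted above, and a real engine’s behaviour should be checked against its Help pages.

```python
import re

def positions(term, page):
    """Word positions at which the term occurs on the page, ignoring case."""
    page_words = re.findall(r"[a-z0-9]+", page.lower())
    return [i for i, w in enumerate(page_words) if w == term.lower()]

def op_and(a, b, page):
    """Both terms appear somewhere on the page, however far apart."""
    return bool(positions(a, page)) and bool(positions(b, page))

def op_or(a, b, page):
    """At least one of the terms appears on the page."""
    return bool(positions(a, page)) or bool(positions(b, page))

def op_and_not(a, b, page):
    """The first term appears and the second does not."""
    return bool(positions(a, page)) and not positions(b, page)

def op_near(a, b, page, window=10):
    """Both terms appear within `window` words of each other."""
    return any(
        abs(i - j) <= window
        for i in positions(a, page)
        for j in positions(b, page)
    )

# Two hypothetical pages: in the second the terms are 31 words apart.
near_page = "the solicitor was sued for negligence"
far_page = "negligence " + "filler " * 30 + "solicitor"
print(op_and("solicitor", "negligence", far_page))   # whole page is the record
print(op_near("solicitor", "negligence", far_page))  # too far apart
print(op_near("solicitor", "negligence", near_page))
```

Because the record is the whole page, AND accepts the second page even though its two terms have nothing to do with each other, while NEAR rejects it.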

Operators and search terms can be combined in as many ways as you can imagine, and wildcards can also be used with word stems, but here you do need to read those Help pages again and develop your expertise further.

Searching fields and properties

HTML – the markup language for pages on the web – was originally designed to convey some structural information about the content of pages (eg there are tags for address blocks, defined terms and definition of those terms). However, it is now essentially used as a simple formatting language. There are thus no specific content-related fields which can be searched: you are searching the entire contents of each page.

However, each page on the web does have certain references and properties outside its content which may be useful for search purposes, and there is a particular syntax for searching on them. These are:

  • the date of the page
  • the title of the page (which appears in the title bar of most browsers)
  • the URL of the page
  • the domain of the site (ie .com, .org, .uk etc)
  • the host computer for the site (eg www.bigco.com)
  • the text of hyperlinks
  • Java applets (programs)
  • the filenames of images
  • URLs of links on the page
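
To show how a search restricted to one of these properties differs from a search of the page content, here is a minimal Python sketch. The `field:term` notation and the field names (`title`, `url`, `host`, `domain`, `anchor`) are assumptions chosen for the illustration; the exact keywords each engine accepts are listed in its Help pages.

```python
from urllib.parse import urlparse

def page_properties(url, title, link_texts):
    """Collect the properties of a page that sit outside its body text."""
    host = urlparse(url).hostname or ""
    return {
        "title": title,
        "url": url,
        "host": host,
        "domain": host.rsplit(".", 1)[-1],  # last label of the host, eg 'uk'
        "anchor": " ".join(link_texts),     # the text of the page's hyperlinks
    }

def field_match(query, props):
    """Match 'field:term' against one property only (simple substring test)."""
    field, _, term = query.partition(":")
    return term.lower() in props.get(field, "").lower()

# A hypothetical page record.
props = page_properties(
    "http://www.widgetco.co.uk/products/pricelist.htm",
    "The Widget Co: 1999 price list",
    ["Home", "Contact us"],
)
print(field_match("title:price list", props))
print(field_match("domain:uk", props))
print(field_match("host:bigco", props))
```

The first two searches match this record and the third does not, whatever the body of the page happens to say.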

Results

The results of your search will usually be displayed as a list of matches (hits) comprising:

  • the title of the page, with a hypertext link to the page
  • the URL
  • the date of the page
  • perhaps a brief extract or summary

Usually hits are listed in order of relevance, as determined by the frequency of occurrence of the search terms you have used and the exactness with which the combinations of terms have been matched.
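
As a rough picture of what ‘relevance’ means here, the Python sketch below scores a page by how often the search terms occur and adds a bonus each time they appear together as an exact phrase. The scoring formula, the bonus weight of 5 and the sample pages are all invented for illustration; real engines use their own, generally unpublished, formulas.

```python
import re

def tokens(text):
    """Lower-case words of a text."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score(query, page_text, phrase_bonus=5):
    """Frequency of the query terms, plus a bonus per exact-phrase occurrence."""
    terms, page_words = tokens(query), tokens(page_text)
    if not terms:
        return 0
    frequency = sum(page_words.count(t) for t in terms)
    n = len(terms)
    exact = sum(
        1 for i in range(len(page_words) - n + 1) if page_words[i:i + n] == terms
    )
    return frequency + phrase_bonus * exact

def rank(query, pages):
    """Return page titles ordered from most to least relevant."""
    return sorted(pages, key=lambda title: score(query, pages[title]), reverse=True)

# Hypothetical pages, keyed by title.
pages = {
    "Widget Co: price list": "price list for widgets and a second price list",
    "Widget Co: news": "the list shows each price",
    "Welcome": "nothing relevant here",
}
print(rank("price list", pages))
```

The page that repeats the exact phrase comes first, the page that merely contains both words somewhere comes second, and the page with neither comes last.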

Clearly, rather than working through the links in the list one by one, you need to review the list and decide whether to follow up some of the links offered or whether it is necessary to refine your current search.

It is worth pointing out here to the uninitiated that the title of a page is not the 72 point Arial bold title at the top of the page announcing ‘The Widget Co’, but the wording which appears in the title bar at the top of your browser’s window. Hopefully, this should also announce ‘The Widget Co’, or if you are viewing a sub-page, perhaps ‘The Widget Co: 1999 price list’. But it is surprising how many web pages omit a title altogether (in which case the URL is displayed) or carry a title which out of context is meaningless (eg ‘Products’, ‘Introduction’ or – the worst of them all – ‘Home Page’ or ‘Welcome’).

If the title (and summary) are not of much help – and they often won’t be – the URL will usually identify or give a clue to the name of the author or organisation as well as the country of publication and other details. For example www.widgetco.co.uk/products/pricelist.htm is clearly the price list for a UK company called Widget Co or similar; ourworld.compuserve.com/homepages/kthompson may well be the home page of your friend Keith Thompson and so on.
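
Reading a URL in this way is mechanical enough to sketch in a few lines of Python. The function name and the short table of country codes below are assumptions made for the example (the table is far from complete); the splitting itself is done by the standard library’s `urlparse`.

```python
from urllib.parse import urlparse

# A deliberately tiny, illustrative table of country-code domains.
COUNTRY_CODES = {"uk": "United Kingdom", "de": "Germany", "fr": "France"}

def url_clues(url):
    """Break a URL into the parts a reader would use as clues."""
    if "://" not in url:
        url = "http://" + url  # results lists often omit the scheme
    parts = urlparse(url)
    host = parts.hostname or ""
    return {
        "host": host,
        "country": COUNTRY_CODES.get(host.split(".")[-1]),  # None if not in the table
        "path": [p for p in parts.path.split("/") if p],
    }

print(url_clues("www.widgetco.co.uk/products/pricelist.htm"))
print(url_clues("ourworld.compuserve.com/homepages/kthompson"))
```

For the first URL this gives the host www.widgetco.co.uk, the country United Kingdom and the path steps ‘products’ and ‘pricelist.htm’; for the second there is no country clue, but the last step of the path, ‘kthompson’, hints at the author.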

The date of a web page is of course useful if you are seeking out any sort of information which is time critical or where the date of publication has a particular significance. It is also a useful way of identifying pages which may be redundant or dormant (and there are an awful lot lying around). Dated pages will also point out those sites which are not updated as frequently as you would expect and hence may be of less value than would otherwise appear.

Local searches

As mentioned in the introduction, similar technologies are commonly employed to index and enable you to search local websites or groups of sites. Clearly if you are after information originating from a specific source, then the quickest route will be to visit that site and use the search engine provided there. For example the CCTA Government Information Service at www.open.gov.uk provides a search facility across all Government servers. Searching individual websites is, however, often unrewarding unless they do contain large quantities of hard information. If the structure of the site is any good, it will usually be quicker to find the appropriate page by browsing.

Searching for things legal

Although there are search sites offering specifically legally-oriented searches, such as Findlaw, these are so heavily US-oriented that they are not much use to UK lawyers. There are no UK-based equivalents, so you are probably better off using a standard search engine.

When using search engines to find legal materials on the web, bear in mind that you will only find what is publicly available. In the first place, don’t expect to find the text of all current statutes or SIs on the web: they are just not there. At present HMSO publishes Acts from 1996 and SIs from 1997. Texts of some other Acts and SIs may well be available on other sites, but these will be non-official versions.

Similarly, there are few cases published publicly on the web. The Court Service has published some recent cases of note and this service may well expand substantially in future.

You will also not find on the web materials which may nevertheless be deliverable over it. Many text databases are stored off-line and the data is only ‘served up’ onto the web when a specific request is keyed in on the web search page by the user. Because the data itself is off-line, it cannot be indexed by the global search engines. A good example is Smith Bernal’s Casebase database of Appeal cases. There is free access to this database, but only via the Casebase search page.

Finally, a search engine will not find web pages which are held in protected areas of a publisher’s web server, for the same reason: the search engines don’t have access.

Tell me more!

Several sites enable you to access all the major search engines from one page. One such is search.com.

For the real enthusiast or aspiring search engine junkie there is a huge amount of information and statistics about search engines on the Search Engine Watch site.

Official sites where legal and government materials are published are on the infolaw: Lawfind page [now removed].