Grinding Java - Searching the WWW in Java

by Shai Almog

Searching the WWW in Java
I was finally able to set Linux up to "work properly" on my system (properly is defined very loosely, as in X-Windows I have no colours and Netscape won't work). I did this because I got tired of waiting on IBM's Java promises. This however, gave me a first hand view of the issues of Java's portability, which is amazing but not as amazing as Sun would have us think.

From the next article onwards I will try to test my articles on 2-3 platforms: OS/2 (if there will be a *working* JDK 1.1 by then), Linux and Windows 95. I was able to run this article's program under OS/2 but they did not work correctly; I have no idea if this is due to the platform or due to a bug of my own. It is hard to determine this without a decent debugger.

When I first saw AltaVista I was amazed, yet I find that for more particular searches there aren't any good search engines. Was there any time when you found a company's site using a search engine but spent an hour on searching the site itself for the exact keyword you were looking for? For me this is a recurring problem, so I decided to do something about it. I wrote a simple small search engine to search a small domain for a particular phrase.

I tested my search engine by searching for the word Shai in the domain http://www.edm.com/ with 0 server hops. I found 4 URLS. The search was made on 01/06/97 and had up to 250 simultaneous threads running! The search took 20-30 minutes using a frame relay link. I tried the same search using hotbot and did not find a single URL.

Abstract
The WWWUtility gets a URL and starts its search by following the links in it. In every URL it reaches it searches for the phrase. The main problem here is the huge number of possible paths which can flood a T1 line with ease. The solution was to limit the server to the current server alone, i.e. if the URL http://www.edm2.com/ contains a link to http://www.os2ezine.com/ that link will not be followed by the engine since it is on a different server.

Design
I implemented the search using a class called SearchEngine. A SearchEngine object loads a URL (using the URLParaser class) and searches the retrieved document. For every link in the document it creates a new instance of SearchEngine which sits on a separate thread.

Each search engine gets a URL to search and a number of servers to jump. If the number is 0 the SearchEngine will not leave this server if it is greater than 0 then whenever the search engine leaves the server it will reduce the number by 1 thus creating the effect of distance from the (original) server.

Drawbacks
In testing the program it kept crashing on one file which turned out to be 1.4MB file containing a very large number of links (these can be quite common sizes for programmers who rely on the JavaDoc utility). I was actually able to crash the VM (not the system) from lack of memory! I added some workaround code to call the garbage collector and left it there for educational purposes; that code didn't do much good and my mistake was programming as if I had an unlimited amount of memory. I will probably make that mistake again but hopefully not in such harmful conditions.

I did not implement a boolean search or user interface to control the number of threads or the amount of memory used or the priority of the threads; these are all very easy to implement yet the program is already too complicated and I feel I will be wasting my time in adding features to a program which is here only as a tutorial.

Despite these drawbacks I have already used this program on a couple of occasions and believe a slightly improved version will be very useful for site developers.

Implementation
WWWUtility.java

Things to notice:
 * I did not go over pulldown menu creation in AWT. It is actually very simple and does not need much explanation. A menu has to have a frame (a window) of its own and there are 2 main classes involved in its creation: Menu and MenuBar. Each item in the Menubar is a Menu object which can be disabled enabled and removed.
 * The Runtime class is very important in understanding the VM. The Runtime class allows you to check the amount of available memory, execute a file, call garbage collection and more. We use the Runtime class for checking the amount of memory and calling the garbage collector. It turned out that this was still useless and the reason that the memory ran out was due to my bad design which did not take into consideration the possible large size of certain URLS. I have no intention of changing my design since I feel it is "interesting".
 * I make use in this program of a FileDialog which is a Java abstraction of the OS's native file open dialog. This class is very simple to use and extends Dialog so it has Dialog's look and feel yet some added functionality for opening files.
 * We use URLs for both regular and HTTP files. The URL class in Java is very dynamic and was designed for both input and output. Anyone who ever tried writing an HTTP program in C++ will be thrilled to see that in Java files and HTTP URLs are almost the same thing.
 * I use panels to group the controls in the program. Remember that in the AWT mentioning a fixed location or size is not considered good programming and thus I strategically locate the controls using the Panel as my grouping mechanism. Notice I don't keep the references to the Panels I create; I don't need them.
 * We have not discussed GridLayout yet. It is actually a very simple layout which tries to set everything on an invisible grid. The power of the grid layout comes from a feature I did not use this time called Insets. Insets are proportions for objects on the grid, thus you could define a particular button on the grid to be twice as large as another one without specifying an exact size.

There is little to say about the URLLocatedListener interface since it is self explanatory.

URLLocatedListener.java

Things to notice:
 * We use a thread group to reference all the search threads as one thread. We even use the stop method on the group in order to stop all the threads.
 * We implement an action listener interface to allow several classes to listen to the search results.
 * The method wakeupSleepingThreads is declared synchronized. This is in order to prevent other threads from entering that method. I had a problem with other threads terminating while that method was running.
 * Garbage collection is invoked manually.

URLParaser.java

Things to notice:
 * In this class we make heavy use of java.net.URL. As you see java.new makes working with URLs and the Internet practically invisible, thus classes can use streams and be unaware of the underlying protocol being used. I only showed a fraction of what java.net is capable of. java.net contains classes to handle everything from sockets to UDP datagrams to HTTP connections.
 * We store an entire file in one String. The limit of length on a string is quite high and it should not pose a problem. I did not check though into the subject of its efficiency.
 * Notice the wasteful use of memory. If this were a single threaded no recursive class this behaviour would be tolerated but in this case it is a clear design flaw.

ProgressIndicator.java

Things to notice:
 * This class embodies only the user interface thus keeping those details away from the program's logic. I consider this to be good design.
 * I use a somewhat dirty trick to force sizes of certain objects. In the AWT size and location are usually not mentioned explicitly, this is important to prevent the program from differing in its look between platforms. When I want a listbox to be able to contain a certain number of characters, I have to take the font size into account. That is not so hard to do using the font size methods and such. I used a simple trick of writing a string at a certain size and putting it in the panel with the list, the panels width was determined by the string width which is widest.

ProgressBar.java

Things to notice:
 * I perform some casts from integer to double before division. The trick of casting to double is widely used by C++ programmers but other programmers might find it odd so I will clarify by example: 8 / 9 == 0 is a correct statement if all the variables are integers. Since integers do not have fractions all fraction data is discarded so we need to cast in order for the division to work correctly. [If you do something like "double = int / int;", then you get integer division with the answer converted and stored in a double. To get the desired answer you need to be dividing things that can hold fractions (eg, doubles), which means doing the conversion before the division. In some instances the compiler will do this for you but if you have one integer divided by another integer you have to do it yourself. Hence the casting, above. EDM]

Well that's all for this time. This article was actually quite simple and is here only to illustrate how (relatively) easy it is to use Internet resources in Java using the java.net package. Next month we will explore plugins and writing your own application macro language based on Java using Java without any effort!