Searching the WWW in Java
I was finally able to set Linux up to "work properly" on my system
(properly is defined very loosely, as in X-Windows I have no colors and
Netscape won't work). I did this because I got tired of waiting on IBM's
Java promises. This however, gave me a first hand view of the issues of
Java's portability, which is amazing but not as amazing as Sun would have
From the next article onwards I will try to test my articles on 2-3
platforms: OS/2 (if there will be a *working* JDK 1.1 by then), Linux and
Windows 95. I was able to run this article's program under OS/2 but they
did not work correctly; I have no idea if this is due to the platform or
due to a bug of my own. It is hard to determine this without a decent
* * * *
When I first saw AltaVista I was amazed, yet I find that for more
particular searches there aren't any good search engines. Was there any
time when you found a company's site using a search engine but spent an
hour on searching the site itself for the exact keyword you were looking
for? For me this is a recurring problem, so I decided to do something
about it. I wrote a simple small search engine to search a small domain
for a particular phrase.
I tested my search engine by searching for the word Shai in the domain
http://www.edm.com/ with 0 server hops. I found 4 URLS. The search was
made on 01/06/97 and had up to 250 simultaneous threads running! The
search took 20-30 minutes using a frame relay link. I tried the same
search using hotbot and did not find a single URL.
The WWWUtility gets a URL and starts its search by following the links in
it. In every URL it reaches it searches for the phrase. The main problem
here is the huge number of possible paths which can flood a T1 line with
ease. The solution was to limit the server to the current server alone,
i.e. if the URL http://www.edm2.com/ contains a link to
http://www.os2ezine.com/ that link will not be followed by the engine
since it is on a different server.
I implemented the search using a class called SearchEngine. A SearchEngine
object loads a URL (using the URLParaser class) and searches the retrieved
document. For every link in the document it creates a new instance of
SearchEngine which sits on a separate thread.
Each search engine gets a URL to search and a number of servers to
jump. If the number is 0 the SearchEngine will not leave this server if
it is greater than 0 then whenever the search engine leaves the server it
will reduce the number by 1 thus creating the effect of distance from the
In testing the program it kept crashing on one file which turned out to be
1.4MB file containing a very large number of links (these can be quite
common sizes for programmers who rely on the JavaDoc utility). I was
actually able to crash the VM (not the system) from lack of memory! I
added some workaround code to call the garbage collector and left it there
for educational purposes; that code didn't do much good and my mistake was
programming as if I had an unlimited amount of memory. I will probably
make that mistake again but hopefully not in such harmful conditions.
I did not implement a boolean search or user interface to control the
number of threads or the amount of memory used or the priority of the
threads; these are all very easy to implement yet the program is already
too complicated and I feel I will be wasting my time in adding features to
a program which is here only as a tutorial.
Despite these drawbacks I have already used this program on a couple of
occasions and believe a slightly improved version will be very useful for
Things to notice:
There is little to say about the URLLocatedListener interface since it
is self explanatory.
- I did not go over pulldown menu creation in AWT. It is actually
very simple and does not need much explanation. A menu has to have
a frame (a window) of its own and there are 2 main classes involved
in its creation: Menu and MenuBar. Each item in the Menubar is a
Menu object which can be disabled enabled and removed.
- The Runtime class is very important in understanding the VM. The
Runtime class allows you to check the amount of available memory,
execute a file, call garbage collection and more. We use the
Runtime class for checking the amount of memory and calling the
garbage collector. It turned out that this was still useless and the
reason that the memory ran out was due to my bad design which did
not take into consideration the possible large size of certain URLS.
I have no intention of changing my design since I feel it is
- I make use in this program of a FileDialog which is a Java
abstraction of the OS's native file open dialog. This class is very
simple to use and extends Dialog so it has Dialog's look and feel yet
some added functionality for opening files.
- We use URLs for both regular and HTTP files. The URL class in Java
is very dynamic and was designed for both input and output. Anyone
who ever tried writing an HTTP program in C++ will be thrilled to
see that in Java files and HTTP URLs are almost the same thing.
- I use panels to group the controls in the program. Remember that in
the AWT mentioning a fixed location or size is not considered good
programming and thus I strategically locate the controls using the
Panel as my grouping mechanism. Notice I don't keep the references
to the Panels I create; I don't need them.
- We have not discussed GridLayout yet. It is actually a very simple
layout which tries to set everything on an invisible grid. The
power of the grid layout comes from a feature I did not use this
time called Insets. Insets are proportions for objects on the
grid, thus you could define a particular button on the grid to be
twice as large as another one without specifying an exact size.
Things to notice:
- We use a thread group to refrence all the search threads as one
thread. We even use the stop method on the group in order to stop all
- We implement an action listener interface to allow several classes
to listen to the search results.
- The method wakeupSleepingThreads is declared synchronized. This is in
order to prevent other threads from entering that method. I had a
problem with other threads terminating while that method was running.
- Garbage collection is invoked manually.
Things to notice:
- In this class we make heavy use of java.net.URL. As you see
java.new makes working with URLs and the Internet practically
invisible, thus classes can use streams and be unaware of the
underlying protocol being used. I only showed a fraction of what
java.net is capable of. java.net contains classes to handle
everything from sockets to UDP datagrams to HTTP conections.
- We store an entire file in one String. The limit of length on a
string is quite high and it should not pose a problem. I did not
check though into the subject of its efficiency.
- Notice the wasteful use of memory. If this were a single threaded
no recursive class this behaviour would be tolerated but in this
case it is a clear design flaw.
Things to notice:
- This class embodies only the user interface thus keeping those details
away from the program's logic. I consider this to be good design.
- I use a somewhat dirty trick to force sizes of certain objects. In
the AWT size and location are usually not mentioned explicitly,
this is important to prevent the program from differing in its look
between platforms. When I want a listbox to be able to contain a
certain number of characters, I have to take the font size into account.
That is not so hard to do using the font size methods and such. I
used a simple trick of writing a string at a certain size and
putting it in the panel with the list, the panels width was
determined by the string width which is widest.
Things to notice:
Well that's all for this time. This article was actually quite simple and
is here only to illustrate how (relatively) easy it is to use Internet
resources in Java using the java.net package. Next month we will explore
plugins and writing your own application macro language based on Java
using Java without any effort!
- I perform some casts from integer to double before division.
The trick of casting to double is widely used by C++ programmers but
other programmers might find it odd so I will clarify by example:
8 / 9 == 0 is a correct statement if all the variables are integers.
Since integers do not have fractions all fraction data is discarded
so we need to cast in order for the division to work correctly.
[If you do something like "double = int / int;", then you get integer
division with the answer converted and stored in a double. To get
the desired answer you need to be dividing things that can hold
fractions (eg, doubles), which means doing the conversion before
the division. In some instances the compiler will do this for you
but if you have one integer divided by another integer you have
to do it yourself. Hence the casting, above. EDM]