By
Jack Dean, Secure Data Systems
{jack at securedatasystems.net}
In this paper, I present an alternative to the large-scale hypertextual web search engine as the de facto method for searching vast web-based hypertext collections. The method is presented through a prototype, ZHEH. ZHEH is designed to crawl and index specific portions of the Web and to produce satisfying search results, often more satisfying than those of existing large-scale systems, at a fraction of the cost. The prototype, with its hyperlink database, is available at http://www.zheh.com.
Engineering a search engine is, of course, a challenging task. Large-scale search engines index billions of web pages and answer hundreds of millions of queries every day. Despite the continuing importance of large-scale search engines on the web, I believe that ZHEH effectively demonstrates the advantages of the user-driven approach to search. With fewer than half a million indexed web pages, ZHEH delivers search results that are sufficient for a majority of search situations, and that are often more satisfying than those of the major search engines in a normal search situation.
This paper provides an overview of my web search engine, the first such public description I know of to date. A more detailed description will be developed over time.
Keywords: World Wide Web, Search Engines, Information Retrieval, ZHEH, User driven search results.
The web continues to create new challenges for information retrieval. The amount of information on the web is growing rapidly, as is the number of new users inexperienced in the art of web research. People are likely to surf the web using large-scale search engines such as Google or AltaVista, high-quality human-maintained indices such as DMOZ or Yahoo!, or other popular search engines. While human-maintained lists cover popular topics effectively, they continue to be subjective, expensive to build and maintain, and slow to improve, and they cannot cover all esoteric topics.
Automated search engines that rely on keyword matching usually return too many low-quality matches. To make matters worse, some advertisers continue to attempt to gain people's attention by co-mingling their products with standard search results. The result is a profitable but somewhat misleading collection of search results.
Arguably, the largest and most successful large-scale search engine is Google. With 8,058,044,651 web pages indexed as of this writing, it holds a commanding lead in the search world. But think about it: at this same time there are 48,138,837 domains registered in the US, not including many more websites in the remainder of the world. Dividing the first number by the second shows that, on average, each domain is responsible for about 167 web pages. This is likely not a normal distribution. Most websites probably have far fewer pages than this, with relatively few domains having several thousand pages.
      | Domain Counts                     | Daily Changes (last 24hrs)
TLD   | Active     | Deleted    | On-Hold | New    | Deleted | Transferred
.COM  | 34,020,075 | 17,820,002 | 346,746 | 58,982 |  18,298 |      37,282
.NET  |  5,397,479 |  3,447,658 |  63,376 |  8,337 |   3,260 |       5,182
.INFO |  3,375,280 |  1,004,324 |   1,274 |  2,211 |     761 |       1,898
.ORG  |  3,351,223 |  2,098,538 |  33,401 |  5,346 |   2,038 |       2,611
.BIZ  |  1,092,437 |    441,615 |   1,021 |  1,656 |     786 |       1,284
.US   |    902,343 |    520,135 |     490 |    883 |     525 |         707
Total | 48,138,837 | 25,185,057 | 446,308 | 77,415 |  25,668 |      48,964
Last Updated 2/4/2005
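As a quick check of the arithmetic above, the following Python sketch sums the Active column of the table and divides Google's reported index size by the total. The figures are the ones quoted in this paper; the variable names, and the assignment of each row to its TLD, reflect my reading of the table and are otherwise illustrative.

```python
# Google's reported index size, as quoted in the text.
GOOGLE_INDEXED_PAGES = 8_058_044_651

# "Active" domain counts per TLD, as read from the table (2/4/2005).
ACTIVE_DOMAINS = {
    ".COM": 34_020_075,
    ".NET": 5_397_479,
    ".INFO": 3_375_280,
    ".ORG": 3_351_223,
    ".BIZ": 1_092_437,
    ".US": 902_343,
}

# The per-TLD counts should add up to the table's Total row.
total_domains = sum(ACTIVE_DOMAINS.values())

# Mean number of indexed pages per registered domain.
pages_per_domain = GOOGLE_INDEXED_PAGES / total_domains

print(f"Active domains: {total_domains:,}")
print(f"Average indexed pages per domain: {pages_per_domain:.1f}")
```

The mean says nothing about the shape of the distribution, which is why the caveat about a few very large sites matters.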
To highlight the point, searching Google for “Bill Clinton” yields a whopping 14,300,000 web pages with something to say about the former President. The coveted top 10 results are:
Biography of William J. Clinton
... William J. Clinton. During the administration of William Jefferson Clinton, the
US enjoyed more peace and economic well being than at any time in its history. ...
www.whitehouse.gov/history/presidents/bc42.html - 35k - Cached - Similar pages

The "Unofficial" Bill Clinton
... William Jefferson Clinton Memorial Library - A Rush Limbaugh featured site Bill
Clinton Books - Scandals and Impeachment [dropbears.com] [What do you think are ...
www.zpub.com/un/un-bc.html - 47k - Cached - Similar pages

Bill Clinton - Wikipedia, the free encyclopedia
... Bill Clinton was the first United States President born after ... He was named after
his father, William Jefferson Blythe Jr ... Billy, as he was called, was raised by ...
en.wikipedia.org/wiki/Bill_Clinton - 101k - Cached - Similar pages

President Bill Clinton - The Dark Side
... hay out of his brother's personal cocaine habit, or Bill's admitted pot ... reporter
who targeted George Bush much more than Clinton) and William Colby (CIA ...
www.realchange.org/clinton.htm - 19k - Cached - Similar pages

Bill Clinton's Morality
Bill Clinton's Morality. by Bob Wilson. ... Bill Clinton is a master politician.
To me, a moral person is someone who is honest and consistent. ...
www.spectacle.org/1096/wilson.html - 7k - Cached - Similar pages

Republicans for Bill Clinton
... William Krystol and Bill Bennet are saying Bob Dole is ... lose and are concerned that
Bob Dole will suck the ... of New York is about to endorse Clinton for President ...
www.perkel.com/politics/clinton/repub.htm - 39k - Cached - Similar pages

Amazon.com: Books: My Life
... Clinton, born to humble Arkansas roots, never knew his father. William Jefferson
Blythe was killed... read more Book Description President Bill Clinton's My ...
www.amazon.com/exec/obidos/ tg/detail/-/0375414576?v=glance - 101k - Feb 4, 2005 - Cached - Similar pages

C. J. BURKE'S BILL CLINTON JOKE-OF-THE-DAY PAGE
... THE BILL CLINTON JOKE-OF-THE-DAY PAGE. established August 1, 1996. "I didn't do
it. ... THE BILL CLINTON JOKE-OF-THE-DAY ARCHIVES. 2000 Archives, The Year In Review. ...
www.io.com/~cjburke/clinton.html - 29k - Cached - Similar pages

Podium Videos - Monday - 2004 Democratic National Convention ...
... Monday's Videos President Bill Clinton...
www.dems2004.org/site/apps/nl/content3. asp?c=luI2LaPYG&b=125919&ct=158734 - 37k - Cached - Similar pages

American Presidents: Life Portraits
... Article 2 - Yea 50; Nay 50 More... Bill Clinton (August 19, 1946 - ). Life
Facts. Personal: • First Lady: Hillary Rodham Clinton...
www.americanpresidents.org/presidents/ president.asp?PresidentNumber=41 - 13k - Cached - Similar pages
Executing the same search on ZHEH will yield the following results:
BeachBum's Clinton Scandal Page
More than you wanted to know about the First Felons, Bill Clinton and Hillary Clinton - Translate
URL: users.aol.com/beachbt/index.html · Score: 18.1 · Links: 12 - Verified: December 20, 2004

The Atlantic | Feb 2001 | Bill Clinton and His Consequences | Beatty
An alternative history of the Clinton administration - Translate
URL: www.theatlantic.com/issues/2001/02/beatty.htm · Score: 17.1 · Links: 3 - Verified: February 27, 2004

Bill Clinton's autobiography - Clinton's writing assistants - Rob...
Well, almost. How three very dogged collaborators helped Clinton write his book. - Translate
URL: newyorkmetro.com/nymetro/news/people/columns/intelligencer/9275/ · Score: 14.2 · Links: 3 - Verified: December 21, 2004

World of Fun
Jede Menge an Funstuff, wie Comics, Videos, Spiele, Cartoons, Desktopbegleiter. Hiert wirst du es sicher nicht bereuen vorbeigeschaut zu haben, denn hier wirst du viele nette Sachen finden - Translate
URL: www.theworld.ch/offun/ · Score: 12.9 · Links: 0 - Verified: April 05, 2004

Rivsys - The Ultimate News Site
Get all your news in one place. No other site has more news from more sources. - Translate
URL: www.rivsys.com/ · Score: 12.9 · Links: 0 - Verified: March 05, 2004

The Clintongate Administration
A look at the scandals of the Clinton presidency and what they involve. - Translate
URL: members.tripod.com/~GOPcapitalist/clinton-scandals.html · Score: 12.3 · Links: 0 - Verified: July 14, 2004

Political Satire, Jokes and Humor, Political Gifts, Political Ran...
Funny Political Satire Joke Site Skewers Republicans and Democrats, John Kerry, John Kerry Bill, John Kerry Jokes, Arnold Schwarzenegger, John Edwards, John Edwards Jokes, Political Novelties, Political Playing Card... - Translate
URL: www.slick.com/ · Score: 12.2 · Links: 63 - Verified: August 21, 2004

President Bill Clinton's Latest Crimes Top News
Latest News Stories About Clinton’s Crimes and Scandals. Includes links to news articles and sites. - Translate
URL: www.geocities.com/CapitolHill/Senate/5773/c.html · Score: 11.3 · Links: 0 - Verified: February 16, 2004

Julie Hiatt Steele Legal Case News - Prosecutorial Misconduct
The day the Senate began its trial of President Clinton, Kenneth Starr's grand jury indicted Julie Hiatt Steele. She is a remote, figure in the Starr campaign against the President, and a single mother without resou... - Translate
URL: www.juliehiattsteele.com/Motions/ProsecutorialMisconduct.htm · Score: 11.2 · Links: 0 - Verified: January 19, 2005

Hey Buddy Headquarters
Home of the Clinton parody, 'Hey Buddy, Hey Socks! Letters from Snotty-Nosed Kids to the White House Pets' - Translate
URL: www.heybuddy.com/ · Score: 11.0 · Links: 0 - Verified: February 16, 2004

The results are different, to be sure: there is no link to Amazon in the ZHEH results (though there could be).
What we have is a very similar collection of commentary, opinion, history and comedy, the same as is found in the Google results. Others may take exception to my observations, but the simple truth is that, out of Google's results:
Results 1 - 10 of about 9,270,000
The average user will page through only about 20-30 links before finding what they are looking for and leaving the engine. ZHEH’s results contain more than enough quality links to satisfy the average search appetite.
Results 1 - 10 of about 2,039
That is more than enough quality results to satisfy the average web surfer. How can ZHEH return similarly valid results with such a limited index? The answer lies in user-driven search. (Note: Although Google indicates that over 9 million web pages match your query, if you page through all of the results available to you, you'll find that only about the top 1000 web pages are returned, even if you persistently click the "Next" link at the bottom of the page. Eventually you reach a page where the "Next" option is no longer available.)
Therefore, 9 million web pages is not an accurate measure of what a user can actually reach. If you can only reach the top 1000 results, then the advantage of the large-scale search engine lies in the esoteric search alone. Generally, then, user-driven search will yield higher-quality results in the normal search situation.
A new industry has grown up around getting your website listed high enough in the results of the popular search engines that a large number of web users will click on your website and provide you with 'traffic', presumably to promote your product, idea, or project. The results provided by all of the major search engines have become filled with paid listings, offering advertisers a precise, quantifiable vehicle for gaining new customers. Many search sites freely mix paid results with non-paid results, more or less identifying the paid results for users to choose if they like. While this author sees nothing wrong with deriving an income by providing a service on the web, it does seem that current search results are becoming increasingly cluttered with paid-for placements and optimized websites, at the expense of valuable but less well-funded sites.
DMOZ, the Open Directory Project, is at this point in time the leader among catalogs used for hypertext search. Using over 66,000 human editors, DMOZ has been able to index over 4 million web pages. DMOZ indicates that it has been categorizing web pages using human editors since 1998. In 8 years, at an average of about 60 web pages per editor, DMOZ has accumulated a good-sized index of quality sites. In a similar amount of time, ZHEH at its current growth rate can catalog nearly 30 million web pages. These will not be human-edited in the same sense as the DMOZ results, but indexed as a direct result of being of value to an internet user: valuable enough that the user bookmarked the site with the intention of returning again. Automated user-driven search using ZHEH's method should therefore provide a greater volume of quality pages than the Open Directory Project can produce with its current categorization method.
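The comparison above reduces to a few divisions, sketched here in Python. All inputs are the figures quoted in the paragraph; the 30 million figure is my projection from ZHEH's current growth rate, not a measurement, and the variable names are illustrative.

```python
# Figures quoted in the paragraph above.
DMOZ_EDITORS = 66_000
DMOZ_PAGES = 4_000_000
YEARS = 8
ZHEH_PROJECTED_PAGES = 30_000_000  # projection, not a measured value

# Average number of pages each DMOZ editor has contributed.
pages_per_editor = DMOZ_PAGES / DMOZ_EDITORS

# Average yearly growth of each index over the same period.
dmoz_pages_per_year = DMOZ_PAGES / YEARS
zheh_pages_per_year = ZHEH_PROJECTED_PAGES / YEARS

# How many times larger the projected ZHEH index would be.
size_ratio = ZHEH_PROJECTED_PAGES / DMOZ_PAGES

print(f"DMOZ pages per editor: {pages_per_editor:.1f}")
print(f"DMOZ pages per year:   {dmoz_pages_per_year:,.0f}")
print(f"ZHEH pages per year:   {zheh_pages_per_year:,.0f}")
print(f"Projected size ratio:  {size_ratio:.1f}x")
```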
At About.com, 475 "Guides" presumably edit 50,000+ topics, providing a large collection of web pages for their users. While it may be questioned how a single guide can be a 'passionate expert' on over 100 topics, this human-edited collection of websites is very popular. About.com boasts a library of "almost one million pieces of original content". It is unclear to this author whether this 'library' is owned by About.com or refers, at least in part, to its collection of offsite reference or links pages. About.com was founded in 1997 and has been using various "guides" to collect websites for about 9 years. Assuming a reasonable amount of productivity for a rotating group of past and present 'guides', 1 million seems to be the approximate combined number of internally generated pieces of content and external web pages that they have linked to or framed in their own pages.
Contrasting ZHEH's 'user-driven' index with About.com's 'guided' experience, it should be noted that the user-bookmarked links indexed in ZHEH would have a value similar to those indexed by a 'passionate' guide for About.com. Due to the automated nature of ZHEH, the results should be delivered at a much greater rate, with a vastly reduced human cost to compile them.
As with nearly any human categorization scheme, the tendency towards editorial bias is most evident in human-edited directories. One editor's valuable link is another's useless spam. To test this theory, DMOZ was queried for the controversial subject concerning origins known as "Intelligent Design":
Open Directory Sites (1-20 of 1337)
(Note SERP truncated for clarity)
Two links can be noted that contain information about the desired subject; however, these links are located in the DMOZ category "Society: Religion and Spirituality", reflecting an editorial worldview biased towards philosophical naturalism. A proponent of intelligent design theory would have placed the links in the more appropriate "Science: Physics: Cosmology", where pages about the competing big bang theory had been placed.
It is just this sort of editorial bias that the ZHEH index avoids. A search for "intelligent design" on ZHEH produces a number of sites both advancing and challenging intelligent design theory. Without the rigid categorization imposed by the DMOZ directory, ZHEH supplies websites that reflect the general preferences of users on the web, not the editorial biases of human editors. (Note: due to the technical limits of the existing search software, this specific query takes about a minute to complete as of this writing.)
THE KEY TO GETTING INTO DMOZ?: Show respect (and have patience).
There is a mystique surrounding DMOZ, perpetuated by the dominant position the directory has obtained in the web search space. Many believe at this time that a listing in DMOZ will significantly improve their site's overall visibility. Getting listed, however, generally takes weeks or months, and many sites are rejected, sometimes for editorially biased reasons like those illustrated above. ZHEH brings a level playing field to the subject of search engine optimization: site owners no longer have to deal with an arbitrary or perhaps capricious human editor to get listed. The bottom line for a listing in the ZHEH index is that your website will have to be found useful by your users, useful enough that they will want to return to your site and will bookmark it so they can find it again easily.
Mamma.com is an example of a metasearch engine, culling its results from many diversified large-scale hypertext systems such as Google, AltaVista and others. Experimenting with the same queries used above at Mamma.com produced excellent results, often superior (in my opinion) to the results delivered by searching a single search engine alone. Nevertheless, the links retrieved via Mamma.com and other metasearch engines contain all of the really crummy sites too.
Owing to the user-driven nature of the ZHEH index, improving placement in the index becomes a matter of increasing the number of users who bookmark your website, not of word counts, keyword overloading, or other popular SEO techniques. (To be sure, as user-driven search becomes more popular, the incredibly creative leanings of some individuals will require certain safeguards to prevent abuse.) ZHEH's listings rise to the top based on one factor alone: relevance.
Presently, the single most cited reason for someone bookmarking a website is that they intend to return to it later. It can only be inferred that the site is of value for any specific purpose. Generally, however, my research has shown that people usually bookmark a site because they found it to be well done, relevant to their original search, and suitable for saving in their local favorites.
Bookmark synchronization has been around for at least the past 10 years. BookmarkSync and others have been staple websites for tens of thousands of computer users for nearly a decade. The index data used in ZHEH's search is generated separately and is entirely removed from the users of the bookmark synchronization service. There is no connection between the bookmark tables and the search engine source tables. There is physically no way to relate any particular search result to the link of any individual user of the synchronization service. No information is taken from the user except the URL of the linked site. All information displayed in a search result is taken from the website itself, during a continuous spidering operation that also provides link validation as an added service for users of the synchronization service.
Still, the mere fact of uploading one's personal preferences to a remote server, no matter how secure, is troubling to some. Perhaps an additional method of manually reviewing sites could be implemented, where a selection of submitted sites is made available for all users to look at. If any one of them finds a site interesting enough to bookmark, then the site will be added to the index through the preferred method, keeping the premise of the search intact. (I like this idea.)
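The idea that placement follows bookmarking can be illustrated with a minimal Python sketch. This is not ZHEH's actual scoring formula, which is not described here; it simply assumes that each indexed page carries a count of users who bookmarked it and orders matching pages by that count. The `IndexedPage` type, the `search` function, and the example.org URLs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedPage:
    url: str
    text: str            # title and description taken from the page itself
    bookmark_count: int  # assumed signal: number of users who bookmarked it


def search(index: list[IndexedPage], query: str) -> list[IndexedPage]:
    """Return pages containing every query term, most-bookmarked first."""
    terms = query.lower().split()
    if not terms:
        return []
    matches = [
        page for page in index
        if all(term in page.text.lower() for term in terms)
    ]
    # Ties are broken by URL so the ordering is deterministic.
    return sorted(matches, key=lambda p: (-p.bookmark_count, p.url))


# Hypothetical index entries for illustration only.
index = [
    IndexedPage("example.org/a", "Bill Clinton biography", 3),
    IndexedPage("example.org/b", "Bill Clinton jokes and satire", 12),
    IndexedPage("example.org/c", "Gardening tips", 40),
]
results = search(index, "bill clinton")
print([page.url for page in results])
```

Under this assumption, keyword stuffing does nothing for placement: a page can only move up by being bookmarked by more users, which is also why safeguards against artificial bookmarking would be needed.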
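The separation described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the production code: the `Bookmark` record and `build_search_source` function are hypothetical, and counting distinct users per URL is my assumption about what the search side might keep. The point it shows is that only URLs, never user identifiers, cross from the bookmark data into the search source data.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Bookmark(NamedTuple):
    user_id: str  # exists only on the synchronization side
    url: str


def build_search_source(bookmarks: Iterable[Bookmark]) -> Counter:
    """Collapse bookmark records into URL -> number of distinct users."""
    # De-duplicate so one user saving a link twice counts once.
    distinct_pairs = {(b.user_id, b.url) for b in bookmarks}
    # Only the URL crosses over; the user id is dropped here.
    return Counter(url for _user, url in distinct_pairs)


# Hypothetical bookmark records for illustration only.
bookmarks = [
    Bookmark("u1", "example.org/a"),
    Bookmark("u2", "example.org/a"),
    Bookmark("u2", "example.org/a"),  # duplicate from the same user
    Bookmark("u3", "example.org/b"),
]
source = build_search_source(bookmarks)
print(sorted(source.items()))
```

Because the result holds nothing but URLs and counts, a search result built from it cannot be traced back to any individual user's bookmark list.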