Tuesday, February 08, 2005

Google is your friend

After a long hiatus, I'm going to start blogging again with this paper, "Automatic Meaning Discovery using Google" by Paul Vitanyi (the Kolmogorov complexity guy) and Rudi Cilibrasi, which has generated a flurry of interest, even sparking a Slashdot discussion. The essence of the paper is this: use the page counts returned by the Google search engine to define a distribution over words and word pairs, and use that distribution to automatically extract meaning from the world wide web. For example, the page count returned by the query "horse"+"rider" (about 2,710,000) versus the one for "hoarse"+"rider" (31,400) gives some information on the semantic associations between these words. In this way they are trying to exploit the huge but low-quality information source that is the world wide web to generate a lexicon and an ontology, in contrast to Cyc, for example, which is building a hand-crafted knowledge base.

The idea has been suggested before, but this is the first realistic attempt that I know of. Their approach is interesting for two reasons. First, it has strong theoretical justification: an argument based on Kolmogorov complexity and optimal string encodings. Basically, the metric they use, called the Normalized Google Distance (NGD), is universal with respect to the Google Distance (GD) of individual authors, i.e. the NGD of any two words is within a linear factor of the GD of those words in the web documents originating from any one source.
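
For concreteness, here is a minimal Python sketch of the NGD computation, using the formula as I understand it from the paper: NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f(.) is a page count and N is the total number of indexed pages. Only the two pair counts (2,710,000 and 31,400) come from the queries mentioned above; the single-word counts and N are made-up placeholders, so the printed values are illustrative only.

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from raw page counts.

    fx, fy -- page counts for each term on its own
    fxy    -- page count for pages containing both terms
    n      -- total number of indexed pages (normalising constant)
    """
    # Terms that never occur (or never co-occur) are infinitely far apart.
    if min(fx, fy, fxy) <= 0:
        return float("inf")
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

# Hypothetical placeholder values, NOT real Google counts.
N = 8_000_000_000
F_HORSE = 40_000_000
F_HOARSE = 1_000_000
F_RIDER = 12_000_000

# Pair counts taken from the queries quoted in the post.
print("horse/rider :", ngd(F_HORSE, F_RIDER, 2_710_000, N))
print("hoarse/rider:", ngd(F_HOARSE, F_RIDER, 31_400, N))
```

With these placeholders the horse/rider distance comes out smaller than the hoarse/rider one, which is the kind of signal the paper relies on. Real single-word counts would of course change the numbers.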

Second, they have impressive experimental results, especially one involving the hierarchical classification of a set of numbers and colours. Another set of experiments uses the NGD between an instance word and a set of "anchor" words to define a set of features that is used as input to an SVM. By using the correct set of anchor words, they were able to classify all words that are "electrical" terms with 100% accuracy.
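
The anchor-word idea can be sketched in a few lines of Python: each word becomes a vector of NGD values against a fixed anchor list, and those vectors are what a classifier such as an SVM would be trained on. Everything here is assumed for illustration: the `COUNTS` table stands in for live search queries, its numbers are invented, and the anchor words are my own picks rather than the ones used in the paper. The SVM step itself is left out.

```python
import math

N = 8_000_000_000  # hypothetical index size

# Hypothetical page counts standing in for search-engine queries.
# Keys are sets of terms: one term for a single-word query, two for a pair.
COUNTS = {
    frozenset(["volt"]): 5_000_000,
    frozenset(["tulip"]): 4_000_000,
    frozenset(["circuit"]): 50_000_000,
    frozenset(["garden"]): 200_000_000,
    frozenset(["volt", "circuit"]): 1_500_000,
    frozenset(["volt", "garden"]): 50_000,
    frozenset(["tulip", "circuit"]): 20_000,
    frozenset(["tulip", "garden"]): 2_000_000,
}

def count(*terms):
    """Look up the (made-up) page count for a query; 0 if unknown."""
    return COUNTS.get(frozenset(terms), 0)

def ngd(x, y):
    """Normalized Google Distance between two words, from COUNTS."""
    fx, fy, fxy = count(x), count(y), count(x, y)
    if min(fx, fy, fxy) <= 0:
        return float("inf")  # would need clipping before training
    lx, ly, lxy = map(math.log, (fx, fy, fxy))
    return (max(lx, ly) - lxy) / (math.log(N) - min(lx, ly))

def features(word, anchors):
    """One NGD value per anchor word: the input vector for a classifier."""
    return [ngd(word, a) for a in anchors]

ANCHORS = ["circuit", "garden"]  # illustrative anchors, not the paper's
print("volt :", features("volt", ANCHORS))
print("tulip:", features("tulip", ANCHORS))
```

With these invented counts, "volt" sits closer to "circuit" and "tulip" closer to "garden", so even a two-anchor vector separates them. Presumably the choice of anchors matters a great deal in practice, which fits the remark above about using the correct set.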

16 Comments:

Blogger Sridhar said...

Very interesting paper. So we know that Google has a *huge* amount of information. We know that meaning of some sort can be pulled from it. Will we be looking at Arnie fighting Google in 2050?

Check this out.
googlezonI'll be back.

12:46 AM  
Blogger Sridhar said...

ok. blogger doesn't let people edit posts and so I must sadly keep my missing newline with me.

12:47 AM  
Anonymous Anonymous said...

Dear Deepak,

Thanks for the nice summary article. I am glad to hear you are inspired to blog again. I will be even more interested to hear about all the things people do with this and similar techniques; the possibilities really do seem pretty open right now and I am looking forward to many people replicating and improving our work. Cheers,

Rudi

2:12 PM  
Anonymous makc said...

The thing is that the Google folks have said somewhere that those numbers are mere estimates, not real result counts. Today I typed "rainbow" into web search. A link appeared saying "go check 721,000 image results for rainbow". When I clicked it, image search said there were more than 1M results. Another example: http://www.flashmove.com/forum/showthread.php?t=16774 (It's old and does not work right now).

12:47 AM  
Blogger AM said...

Ditch, Try this blog for lots of Math/CS related info. www.geomblog.blogspot.com

10:47 AM  
Anonymous makc said...

To add to my last comment, check the link I've put under my name. Here's a quote:

> The numbers that Google reports are strange. As of June 4, 2005 they are 203,000 and 279,000 for the two searches above. Considering that NameBase has never had more than 132,000 pages on its site, and our file-naming conventions have been stable now for several years, this means Google cannot count. If this isn't sufficiently bizarre, try just site:www.namebase.org, which shows a count of 2,330,000. On September 10, 2005 the above links were checked again, and the numbers had jumped to 588,000 and 2,010,000 and 4,670,000. At the same time, the referrals from Google for NameBase have been decreasing. Even assuming that Google includes every page from NameBase, this still leaves their count 35 times higher than it should be. <

2:15 AM  
Blogger grecchia said...

I've found that you can do a pretty good job of this even without Google's vast amount of information... If you download a Wikipedia dump, that's > 3 gigs of text that you can search to your heart's content ^^

1:41 PM  
Blogger Ray Lightning said...

There are smarter ways of learning word associations than just using Google. Prof Dekang Lin has improved the performance of his shallow parsing using this approach of statistical frequencies, albeit on a smaller corpus. Similar approaches are being attempted in various other natural language processing tasks, including machine translation and dialogue processing. It is widely acknowledged that knowledge-based approaches have had their day. Cyc is a good effort, but nobody is waiting for it or expects much from it.

5:34 AM  
