Written by David Harry   
Monday, 24 July 2006

Google and LSI ; the Missing Link?

Ok, so I had no idea what to name this journey into the darkness of relational/relevance document indexing algorithms. Let’s remember that these are algorithmic filters, not the core of how a search engine works. They are the building blocks that the engineers would build upon with their own personal ‘flavor’. Let’s not take this as the hand of the Google gods, but as a peek at what makes a search engine tick.

Where’d it come from?

Google purchased Applied Semantics in 2003. Applied Semantics was the creator of CIRCA, a software application which:

"understands, organizes, and extracts knowledge from websites and information repositories in a way that mimics human thought and enables more effective information retrieval." Press release: http://www.google.com/press/pressrel/applied.html

"Applied Semantics is a proven innovator in semantic text processing and online advertising," said Sergey Brin, Google's co-founder and president of Technology. "This acquisition will enable Google to create new technologies that make online advertising more useful to users, publishers, and advertisers alike."

From there, seeing the leap to incorporating LSI into Google's core search operation is not exactly rocket science.


How does LSI work?

LSI, or Latent Semantic Indexing, enables search engine algorithms to work out what a page is about beyond the literal terms of the search.
For example, a page about Ford Motor Company is likely to contain terms such as ‘Mustang’, ‘Explorer’ or ‘transmission’. The first two are not ‘naturally’ related to automobiles (a horse and an adventurer respectively), but LSI ‘learns’ that they can be related to the core term ‘Ford Motor Company’. As SEO Book founder Aaron Wall put it, “LSI considers documents that have many words in common to be semantically close, and ones with few words in common to be semantically distant.” This illustrates the value of LSI to today’s search engines. He goes on to say, “Although the LSI algorithm doesn't understand anything about what the words mean, the patterns it notices can make it seem astonishingly intelligent.”
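The "many words in common" idea from the quote can be sketched in a few lines. This is only the shared-word intuition, measured with plain cosine similarity on word counts; it is not the full LSI technique, which also reduces a term-document matrix mathematically. The page snippets and function names are invented for illustration.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count how often each word appears."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical page snippets, made up for this example.
ford_page = term_counts("Ford Mustang transmission and Ford Explorer engine specs")
dealer_page = term_counts("Used Ford Explorer and Mustang transmission repair")
horse_page = term_counts("Wild mustang horse herds roam the open range")

ford_vs_dealer = cosine_similarity(ford_page, dealer_page)
ford_vs_horse = cosine_similarity(ford_page, horse_page)
```

With these snippets the two car pages score much closer to each other than the Ford page does to the horse page, even though all three mention a mustang.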

Utilizing LSI, the algorithm can give additional weight to a search term based on the relevance of the other words and terms contained in the document (a web page in this case). A document that is lower in ‘relevance weighting’ or, worse, has irrelevant correlating information within it, naturally ranks lower. Starting to see the pattern here? Well, so are they! Relevance patterns of terms, words and phrases.

While its existence in the algorithm is (un)certain, the actual data is way too large to put here. You'll have to do some reading to make up your own mind. I did find an easy way to see it in action. Go to your stats package of choice (log analysis); most hosting comes with AWStats or an equivalent. Look at the search phrases that have brought visitors to the pages on your site. Notice any search terms that aren’t targeted, that are only loosely related, or that aren’t even on the page in question? That looks like LSI at work. It’s using ‘relational’ thinking, so to speak. I see a ton of them all over my clients’ site logs as well as our own.
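If you would rather not eyeball the stats report, the check above can be roughed out in code. This is a minimal sketch that assumes you have already exported the search phrases and the copy of the pages they landed on; the data, URL and function names are hypothetical. It only flags phrases worth a closer look and says nothing about why the page ranked.

```python
import re


def words(text):
    """Return the set of lowercase words in a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def missing_terms(search_phrase, page_text):
    """Return the search-phrase words that never appear in the page text."""
    return sorted(words(search_phrase) - words(page_text))


# Hypothetical data: page copy keyed by URL, plus (phrase, landing URL)
# pairs as they might be copied out of a stats package.
page_copy = {
    "/ford.html": "Ford Mustang and Explorer transmission service guide",
}
referrals = [
    ("ford transmission service", "/ford.html"),
    ("mustang gearbox repair", "/ford.html"),
]

# Map each search phrase to the words that are absent from its landing page.
report = {
    phrase: missing_terms(phrase, page_copy[url]) for phrase, url in referrals
}
```

Any phrase with a non-empty list in `report` brought a visitor to a page that does not contain all of its words.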

Got me by the link and curly!

LSI can also be used to weight your link profile (back links, outbound links). An inbound link may only contain a few terms; the more varied and related the text of your incoming links is, the higher the overall weighting score would be. This means you should vary the link text in your link building programs. This can also be seen in some of the comments by Matt Cutts (Google’s unofficial PR guy) on Big Daddy and linking guidelines, if read with the concept of LSI in mind. When analyzing a problem site he said,

“This is also a real estate site, this time about a Eastern European country. I see 387 pages indexed currently. Aha, checking out the bottom of the page, I see ‘your’ Linking to a free ringtones site, an SEO contest, and an Omega 3 fish oil site? I think I’ve found your problem. I’d think about the quality of your links if you’d prefer to have more pages crawled. As these indexing changes have rolled out, we’ve improving how we handle reciprocal link exchanges and link buying/selling.”

The relevance weighting of LSI is well suited to helping with this indexing problem: links and documents that are not relevant receive less weight. As with the human mind, there is an expectation of relevance when it weights a link or document.
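To act on the earlier advice about varying link text, it helps to know how varied your inbound link text is right now. The sketch below is one rough way to summarize it, assuming you already have a list of anchor texts from a backlink report; the sample anchors and the `anchor_text_profile` helper are made up, and the numbers are not a score any search engine is known to use.

```python
from collections import Counter


def anchor_text_profile(anchors):
    """Summarize how varied a list of inbound link texts is."""
    # Normalize case and whitespace so trivial variants count as one text.
    normalized = [a.strip().lower() for a in anchors if a.strip()]
    counts = Counter(normalized)
    total = len(normalized)
    if total == 0:
        return {"total": 0, "distinct": 0, "top_text": None, "top_share": 0.0}
    top_text, top_count = counts.most_common(1)[0]
    return {
        "total": total,
        "distinct": len(counts),
        "top_text": top_text,
        "top_share": top_count / total,
    }


# Hypothetical anchor texts pulled from a backlink report.
anchors = [
    "Toronto real estate",
    "toronto real estate",
    "Homes for sale in Toronto",
    "Toronto condos",
    "Toronto real estate ",
]
profile = anchor_text_profile(anchors)
```

A high `top_share` means most of your links use the same text, which is the pattern the advice above suggests moving away from.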

Getting all wrapped up.

There is one caveat that I must mention. For any of this to mean more than wind blowing out my aspirations, you have to believe that Google utilized LSI beyond its original purpose when purchased, 'text relevant advertising'. You know, the Google AdWords and AdSense programs. I personally believe, as an SEO provider and an application developer, that a tight integration between the advertising programs and the core search engine would be the logical way to go. There are other factors, relating to other research we have done, that lead me to believe that LSI is in place in the mainstream search results. I shan't be posting that publicly though.

I am truly sorry if any of your limbs have gone numb or should you be adrift with a blank stare at the moment. It’s an important aspect of understanding SEO with Google that I have tried to make as palatable as an algorithm can be. Many more pages were ingested in the making of this article.
Oh, and by the way kiddies, I stumbled on some other theories such as IDF (Inverse Document Frequency) that may warrant a look or two. When my brain thaws out from LSI, old HillTop theories and VIPS (Vision-based Page Segmentation)... I’ll see what I can find out...

For now here are the basics of IDF:
Inverse document frequency: a term used to help determine the position of a term in a vector space model. idf = log ( total documents in database / documents containing the term )
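The formula above works out as follows on a toy collection. This is a minimal sketch: the four documents are invented, each is reduced to a set of words, and the natural logarithm is an assumption since the formula does not state a base.

```python
import math


def inverse_document_frequency(term, documents):
    """idf = log(total documents / documents containing the term)."""
    total = len(documents)
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        # The formula is undefined when no document contains the term.
        raise ValueError(f"term {term!r} does not appear in any document")
    return math.log(total / containing)  # natural log (assumed base)


# Hypothetical database: each document is a set of lowercase words.
documents = [
    {"ford", "mustang", "transmission"},
    {"ford", "explorer"},
    {"ford", "dealer", "transmission"},
    {"wild", "mustang", "horse"},
]

idf_ford = inverse_document_frequency("ford", documents)    # in 3 of 4 documents
idf_horse = inverse_document_frequency("horse", documents)  # in 1 of 4 documents
```

The rarer term ('horse') gets the larger idf value, so a term that appears in fewer documents does more to set one document apart from the rest.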

Until then, have some alphabet soup… make your own acronyms.


Resources: Finding related keywords/phrases

Search Google for results with related terms using the tilde ( ~ ) symbol. For example, searching ‘~marketing’ will return pages with terms matching or related to marketing (media, communications and market are a few for this one) and will highlight some of the related words in the search results. Not to mention the ones that LSI likes.

Search “a lexical database for the English language” such as WordNet: http://wordnet.princeton.edu/

Read the page copy and analyze the backlinks of high ranking competitor pages.

Google AdSense Sandbox Tool; http://www.digitalpoint.com/tools/adsense-sandbox/
This is a handy little utility if you would like to see what sort of Google AdSense ads are served based on content or keywords. Simply enter the URL or keywords into the tool, and you will see up to 20 sample AdSense ads for that URL or those keywords.

More on Google and Latent Semantics from John Andrews. Also have a read through 'Google Latent Semantic Indexing Technology' by fellow rebel Visio.

 

Last Updated ( Sunday, 29 October 2006 )
 
All designs,images and original content © Copyright 2006 Comprehensive Development Services