Wednesday, March 6, 2013

Thinking Like a Search Engine (part 4)


Thinking again like a search engine operator, how would you go about detecting keyword relevance? One thing you would soon realize is that you do not want to throw the baby out with the bath water. That is, if you simply checked whether the keywords show up in the actual content of the site and assumed that cheating had occurred whenever they do not, you would exclude some very relevant, very valuable sites. That, too, would be bad for you as a search engine operator.

There are many legitimate situations where a keyword may not be repeated in the actual content of the site. Say a Webmaster has a regional site which provides news and current events for a three-county area commonly known among people in the region as the "River Basin Area," the "Wiregrass Area," or some similar term. Say the three counties are Washington County, Adams County, and Jefferson County. Imagine further that the Webmaster includes the state name and the names of all three counties in her keywords, but never actually mentions them in the content of her Website. The Website just refers to "news and events for the River Basin Area." The county names are very relevant keywords, even though they do not appear in the content of the site. People looking for news and events for Washington County would want to see this page, and as a search engine operator, you would want them to find it because it has the information they are looking for. It would have been better if the Webmaster had included a statement like "covering Washington County, Adams County, and Jefferson County news and events and more" in her content. But since she did not, you will have a more effective search engine if you recognize that she is not cheating and that her keywords are relevant. Thus, you realize that you are going to have to come up with a fairly sophisticated procedure for determining keyword relevance and ranking the Websites in your search engine.

Since you cannot afford to pay someone to sit and personally review every one of the millions of pages submitted to your engine, you will have to develop an algorithm that performs this task as well as it can be done without human intervention. Clearly, it's not going to be extremely accurate; it would be far too difficult to write a program sophisticated enough to account for all the different variations of relevant keywords. Most likely, you will have to settle for doing it on some statistical basis. Say, for example, you decide that if 90% of the keywords actually appear on the site, that's close enough. The other 10% could be cheating, or it could simply be a legitimate oversight like the River Basin Area example above. You may have to settle for that margin of error. On top of that, you could look for certain keywords that suggest cheating and deal with them separately, and you could try to develop algorithms which judge whether the keywords that do appear in the content are actually in context or just thrown in to fool the search engines. Whatever means are actually employed, it is a constant struggle between the aggressive Webmasters who would manipulate the search engines and the search engine operators who want to keep their engines effective.
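To make the statistical approach concrete, here is a minimal sketch in Python of the kind of keyword-coverage check described above. The tokenizer, the page and keyword structures, and the 90% threshold are illustrative assumptions, not a description of how any real search engine actually works.

```python
import re

COVERAGE_THRESHOLD = 0.90  # assumed: the "90% of keywords appear" rule of thumb


def tokenize(text):
    """Lowercase a text and split it into simple word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_coverage(keywords, page_text):
    """Return the fraction of declared keywords whose words all
    appear somewhere in the page's visible content."""
    words = tokenize(page_text)
    if not keywords:
        return 1.0
    present = sum(
        1 for kw in keywords
        if all(token in words for token in tokenize(kw))
    )
    return present / len(keywords)


def looks_legitimate(keywords, page_text, threshold=COVERAGE_THRESHOLD):
    """Crude relevance test: accept the page if enough of its
    declared keywords actually occur in its content."""
    return keyword_coverage(keywords, page_text) >= threshold


# The "River Basin Area" page from the example: the county names are
# relevant keywords, yet none of them appear in the content, so this
# crude check wrongly flags the page -- exactly the margin of error
# the search engine operator may have to accept.
keywords = ["Washington County", "Adams County", "Jefferson County"]
content = "News and events for the River Basin Area."
print(keyword_coverage(keywords, content))   # 0.0
print(looks_legitimate(keywords, content))   # False
```

A real engine would layer other signals on top of a check like this, such as lists of keywords known to be abused and tests for whether matched keywords appear in sensible context rather than being stuffed in.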


from George Little's Internet Income Course. Register with SFI for free and get immediate access to the complete 79-part course.
