Behind the Form – Google, The Deep-Web Crawl, and Impact on Search Engine Visibility

Crazy Things That Really Rich Companies Do

Kind of like that weird guy at the party with an acoustic guitar and the Pink Floyd shirt, Google is getting DEEP. Some would say…uncomfortably deep. After an already busy year, wherein Google released an open source mobile OS and a browser that’s rapidly gaining market share, they recently announced that they had mapped the sea floor, including the Mariana Trench. And hey, why not found a school featuring some of the greatest scientific minds out there and see what happens?

So Google’s been more visible than ever lately, and there’s no doubt that this’ll continue as they get their hands into more and more projects – but let’s drop down a few floors and look at something that should dramatically affect the way Google’s indexing programs (“spiders” or “crawlers”) collect data, analyze websites, and present the results. As much work as the BEM Interactive search engine marketing team puts into making sites appeal to spiders (and there’s a lot we can do to make those spiders love a site), the spider programs themselves are pretty straightforward: hit a site’s page index, check out the structure and content, and compare that to what Google has determined to be “relevant” or “popular.”
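For the fellow nerds: here’s a bare-bones sketch (in Python, with an invented seed URL) of what that link-following routine boils down to. The real Googlebot is vastly more sophisticated, but the basic loop – fetch a page, stash its content, queue up its links – looks something like this:

```python
# Toy link-following crawler: fetch a page, keep its text, queue its links.
# The seed URL is hypothetical; a real crawler adds robots.txt handling,
# politeness delays, deduplication, ranking signals, and a lot more.
from collections import deque
from urllib.parse import urljoin, urldefrag

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_pages=50):
    seen, queue, index = set(), deque([seed_url]), {}
    while queue and len(index) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        # "Check out the structure and content" -- here we just keep the visible text.
        index[url] = soup.get_text(" ", strip=True)
        # Follow hyperlinks: this is essentially all a classic crawler can see.
        for a in soup.find_all("a", href=True):
            link, _ = urldefrag(urljoin(url, a["href"]))
            if link.startswith("http") and link not in seen:
                queue.append(link)
    return index

# index = crawl("https://www.example.com/")  # hypothetical seed URL
```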

But because of the way these programs are written, there are certain areas that they simply can’t reach…namely pages that require human information, input, or action. As a basic example, there’s usually a confirmation page after a user submits a “Contact Us” or “Newsletter Sign-up” form – this could contain a promotional code or some other kind of unique data. This dynamically generated content (it could also be a search results page, a set of calculations or conversions, even the results of a symptom tool on a medical site) simply doesn’t exist until the user creates it! Depending on the form you fill out, the resulting page is yours and yours alone – so try to ignore that tingle of omnipotence next time you Google something.
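To make that concrete, here’s a minimal sketch of a newsletter sign-up handler (using Flask, with an invented /signup route and promo-code logic – not any particular site’s code). The confirmation page is assembled on the spot for each submission, so there’s no standing URL for a spider to stumble onto:

```python
# Minimal sketch: the "thank you" page is generated per submission, so it has no
# permanent URL for a spider to discover. Route name and promo logic are invented.
import secrets
from flask import Flask, request

app = Flask(__name__)

@app.route("/signup", methods=["POST"])
def signup():
    email = request.form.get("email", "someone")
    promo_code = secrets.token_hex(4).upper()  # unique to this one submission
    # This HTML exists only inside the response to this single POST request.
    return f"<h1>Thanks for signing up, {email}!</h1><p>Your promo code: {promo_code}</p>"

# app.run()  # POST an "email" field to /signup and the resulting page is yours alone
```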

But search engine spiders can’t understand what a form is asking for or the info being delivered to the user – and even if they could, how would they figure out what to insert in order to generate any relevant content? Drop-down boxes, category selections, zip code inputs – any of these can prevent data from being indexed. Collectively, this blocked data is referred to as the “Deep Web.” By some estimates, the Deep Web contains an astounding amount of data – several orders of magnitude more than what’s currently searchable. Since they chiefly rely on site maps and hyperlinks, search engine crawlers just can’t find a way to access the information.

So can Google really expect to find, log and interpret this data? Well, between mapping the ocean and opening a school that will probably discover the meaning of life before lunch, Google did just that. Working with scientists from Cornell and UCSD, Google researchers (whom I can only hope will not become supervillains at some point) have devised a method for their spiders to complete and submit HTML forms populated with intelligent content. The resulting pages are then indexed, treated like any other crawled content, and displayed in search results – in fact, at this moment, content gathered from behind an HTML form shows up on the first page of Google search results more than 1,000 times a second. The methods the bots use are pretty cool, but then, I’m Nerd McNerdleson about that kind of thing. We won’t dive into the technical stuff here, but check out the article if you’re into it.
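For the curious, here’s roughly what “filling out forms with intelligent content” can look like in miniature. This is a hedged illustration in Python, not Google’s actual system (their approach also generates candidate values for free-text boxes and filters out uninformative result pages): parse a GET form, enumerate combinations of its drop-down options, fetch the resulting URLs, and hand those pages to the normal indexer.

```python
# Rough sketch of "surfacing" a GET form: enumerate drop-down choices, build the
# query URLs, and fetch each result page so it can be indexed like any other page.
# Illustration only -- Google's system also guesses values for text inputs and
# prunes result pages that turn out to be empty or duplicated.
from itertools import product
from urllib.parse import urljoin, urlencode

import requests
from bs4 import BeautifulSoup

def surface_forms(page_url, max_queries=20):
    soup = BeautifulSoup(requests.get(page_url, timeout=10).text, "html.parser")
    surfaced = {}
    for form in soup.find_all("form"):
        if form.get("method", "get").lower() != "get":
            continue  # POST forms are skipped in this sketch
        action = urljoin(page_url, form.get("action", page_url))
        # Collect each <select>'s name and its option values.
        selects = {
            sel["name"]: [opt.get("value") or opt.get_text(strip=True)
                          for opt in sel.find_all("option")]
            for sel in form.find_all("select") if sel.get("name")
        }
        if not selects:
            continue
        names = list(selects)
        for combo in list(product(*(selects[n] for n in names)))[:max_queries]:
            query_url = action + "?" + urlencode(dict(zip(names, combo)))
            result = requests.get(query_url, timeout=10)
            surfaced[query_url] = result.text  # hand this off to the regular indexer
    return surfaced

# pages = surface_forms("https://www.example.com/used-cars")  # hypothetical URL
```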

That’s cool…NERD. But what does it mean?

Everyone knows Google loves relevance – their entire business model is built upon it. This technology is about pulling exactly what the user is searching for and immediately providing it without even requiring them to visit any page outside of the Google results page! Spooky.

Say that you’re feeling under the weather. Rather than type in “symptom checker” and find a WebMD-type page, you type “coughing, runny nose, strange bubonic plague-like swelling” directly into the search engine. Google – who has already had their spiders hit every medical symptom form out there, query them in endless varieties and combinations, and determine the relevance & popularity of the results – immediately comes back with “You’ve got the Black Death” and you’re set (or…maybe not).

From a retailing standpoint, many sites have functions to generate product lists based on user input. As it stands now, a shopper looking for a red, American-made minivan with under 30K miles would find the appropriate website and input his or her criteria; the site would then query its database and return the results. If Google continues to move forward with its deep web crawls, this information could be displayed directly in the search results, without the user ever accessing any site other than Google (if the user makes a purchase, does Google get a cut? Hmm…).
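Under the hood, that minivan search usually boils down to a plain old parameterized URL (the domain and parameter names below are invented for illustration). Once a crawler has figured out sensible values for each parameter, it can pre-build and index pages like this one ahead of time:

```python
# The search form's criteria end up as query-string parameters (names invented).
from urllib.parse import urlencode

criteria = {"type": "minivan", "color": "red", "origin": "USA", "max_miles": 30000}
print("https://www.example-autos.com/search?" + urlencode(criteria))
# https://www.example-autos.com/search?type=minivan&color=red&origin=USA&max_miles=30000
```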

Obviously, this is a massive step forward in search technology and, in an industry that seems to change every hour, represents a new method of obtaining and presenting information. As web marketers, this is another variable, another challenge to consider in our work – how can we optimize pages that can be generated in a seemingly limitless number of ways? With search engines becoming increasingly powerful and their data mining capabilities reaching ever deeper, will there come a time when all data is presented through one aggregate portal? This may be years down the line, but the technology and the foundations are here now; forward-thinking businesses and web marketers need to be there as well.
