Thursday, 6 October 2011

Getting a search engine to re-index an erroneous piece of metadata

The other day I had a request from a PhD author who was unhappy that her date of birth had been included as part of her name in our institutional repository. I made the necessary changes to the repository and to her EThOS record, but she was still unhappy that Google and other search engines kept showing her date of birth when you searched for her name. This was, of course, because search engines only crawl the web periodically, so their indexes still held the old version of the record.

This prompted me to do some research into how search engines crawl the web. I found a lot of forum posts from people asking how to make a search engine re-index a site, and a lot of disheartening answers saying it was impossible. It seems that search engines crawl the web in their own way and to their own timetable. High-ranking sites (those with many links pointing to them) can be crawled every day, whereas lower-priority sites can take months to be updated in the index.

However, I was very pleased to discover that there was one way of trying to get a page crawled a bit faster. Each search engine has a 'Submit URL' page, where you can enter the URL of the page you wish to be crawled. The engines are clear that this won't necessarily mean your page is crawled any sooner, but they will add it to their queue. I therefore submitted the individual URL of the record in question, and crossed my fingers.

Because I was desperate, and also interested in experimenting, I tried one or two other things as well. I posted a link to the record on my blog (which is hosted by Google, and therefore, I hoped, crawled by them). I also changed the format of the record's title, as this automatically becomes the title of the webpage, and changes to the title are seen as important by the engines.

Amazingly, a search on Google the very next day showed that the record had been re-crawled, and the author's date of birth had been removed. I was extremely happy! Bing followed a few days later, and Yahoo a day or two after that. Of course, I don't know which of my methods caused the page to be re-indexed, but I would guess that submitting the URL was the main reason. I'm really pleased that I investigated this, as I will know what to do if the situation arises again.