Metadata Monkey: 2011

Thursday, 6 October 2011

Getting a search engine to re-index an erroneous piece of metadata

I had a request the other day from a PhD author who wasn't happy at the fact that her date of birth had been used in her name in our institutional repository. I made the necessary changes to the repository and to her EThOS record, but she still wasn't happy about the fact that Google and other search engines were still bringing up her date of birth when you searched for her name. This was, of course, because search engines only crawl the web periodically to update their index.

This made me do some research into search engines and the way they crawl the web. I found a lot of forum posts with people asking how to make a search engine re-index a site, and a lot of disheartening answers saying it was impossible. It seems that search engines crawl the web in their own way and to their own timetable. High ranking sites (those which are linked to a lot) can be crawled every day, whereas lower priority sites can take months to be updated on search engines.

However, I was very pleased to discover that there was one way of trying to get a site crawled a bit faster. Each search engine has a 'Submit URL' page, where you can enter the URL of the page you wish to be crawled. The engines are clear that this won't necessarily mean your site is crawled any faster, but they will add your page to their queue. I therefore added the individual URL of the record in question, and crossed my fingers. Because I was desperate, and also interested in experimenting with this, I also tried one or two other things. I posted a link to the record in my blog (which is hosted by Google, and therefore crawled by them, I hoped). I also changed the format of the title of the record, as this automatically becomes the title of the webpage, and changes to this are seen as important by the engines. Amazingly, a search on Google the very next day showed that the record had been re-crawled, and the author's date of birth had been removed. I was extremely happy! Bing followed a few days later, and Yahoo a day or two after that. Of course, I don't know which of my methods caused the page to be re-indexed, but I would guess that the submission of the URL was the main reason. I'm really pleased that I investigated this, as I will know what to do next time if the situation should arise again.

Monday, 11 July 2011

Musings on authority control

One thing that I find very frustrating with the EPrints software used to host our institutional repository is the way it represents authors. Or in fact, doesn't represent authors. It seems fairly obvious that the best implementation would be to have a sort of 'identity' for an author with some distinguishing features: a unique reference for example. EPrints attempts to create this with their e-mail address and homepage fields. But this is fairly useless. E-mail addresses can be written with lower or uppercase letters, an author may have multiple e-mail addresses, or just multiple forms of e-mail address. And authors publish under a multitude of different names. I am continually faced with the dilemma: is Smith, J. the same as Smith, J. A.? And are they the same as Smith, John, or Smith John A., or Smith J. A. (John A.)? The only way of finding out is to perform a search on EPrints to find out what else that author has published that we have in our database. This can be tricky for someone who combines multiple research areas. What is needed is a record for each author. This record would begin with a unique identifier, perhaps a reference number, or perhaps, in the semantic web era, a web address. This record could maintain a list of publication forms. It could also give information about their department, career history, etc.
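To make the idea of a per-author record a little more concrete, here is a minimal sketch in Python. It is not how EPrints stores authors; the class name `AuthorRecord`, its fields and the example.org identifier are all hypothetical, and lower-casing e-mail addresses is an assumption that suits most university mail systems rather than a universal rule.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class AuthorRecord:
    """Hypothetical authority record: one per real-world author."""
    identifier: str                                   # unique key, e.g. a web address (assumed format)
    name_forms: Set[str] = field(default_factory=set)  # every form the author has published under
    emails: Set[str] = field(default_factory=set)
    department: Optional[str] = None

    def add_name_form(self, name: str) -> None:
        self.name_forms.add(name.strip())

    def add_email(self, email: str) -> None:
        # Assumption: addresses differing only in case belong to the same mailbox.
        self.emails.add(email.strip().lower())


def find_by_name(records: List[AuthorRecord], name: str) -> List[AuthorRecord]:
    """Return every record that lists `name` as one of its publication forms."""
    return [r for r in records if name in r.name_forms]


# Illustrative data only.
smith = AuthorRecord(identifier="https://example.org/authors/0001")
for form in ("Smith, J.", "Smith, J. A.", "Smith, John A."):
    smith.add_name_form(form)
smith.add_email("J.A.Smith@Example.ac.uk")
smith.add_email("j.a.smith@example.ac.uk")
```

With a record like this, the question "is Smith, J. the same as Smith, J. A.?" becomes a lookup on the identifier rather than a guess based on the name string.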

But hang on, am I just re-inventing the wheel here? Isn't this what a Library of Congress authority file is? But the key is that a lot of Warwick academics do not have their own authority record. So, is there something out there that does a better job than the Library of Congress for British academics?

A few months ago I met with a member of the Names project - a JISC-funded collaboration between Mimas and the British Library (http://names.mimas.ac.uk/). I discovered that the team are, in effect, creating a British version of Library of Congress authorities, for use with institutional repositories. They have realised that there is a need for an authority service for UK academics, many of whom will not have their own LC heading because they only publish in journals. And they have also realised that each institutional repository already provides a wealth of information about who UK academics are, and what areas they publish in. By writing algorithms to try and match academics with the same name who publish research in the same subject area, they can automatically create a list of unique authors. I'm waiting with interest to see how their project goes.
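I don't know the details of the algorithms the Names team use, so the following is only my own rough sketch in Python of what "same name, same subject area" matching might look like. The functions `name_key` and `cluster_authors`, and the sample papers, are invented for illustration. The approach is deliberately crude: names are reduced to surname plus first initial, and a paper joins the first existing cluster that shares a subject term with it.

```python
from collections import defaultdict


def name_key(name):
    """Reduce 'Smith, John A.' or 'Smith, J.' to ('smith', 'j')."""
    surname, _, forenames = name.partition(",")
    initial = forenames.strip()[:1].lower()
    return (surname.strip().lower(), initial)


def cluster_authors(papers):
    """Group (author_name, subjects) pairs into candidate unique authors.

    Two papers land in the same cluster when their name keys match and
    they share at least one subject term. Clusters are never merged
    afterwards, so the result depends on the order of the input.
    """
    clusters = defaultdict(list)  # name key -> list of cluster dicts
    for name, subjects in papers:
        subjects = set(subjects)
        key = name_key(name)
        for cluster in clusters[key]:
            if cluster["subjects"] & subjects:
                cluster["names"].add(name)
                cluster["subjects"] |= subjects
                break
        else:
            # No existing cluster shares a subject: treat as a new author.
            clusters[key].append({"names": {name}, "subjects": subjects})
    return [c for group in clusters.values() for c in group]


# Illustrative data only.
papers = [
    ("Smith, J.", ["Microbiology"]),
    ("Smith, John A.", ["Microbiology", "Veterinary medicine"]),
    ("Smith, J.", ["Medieval history"]),
]
authors = cluster_authors(papers)
```

Here the two microbiology papers are grouped as one candidate author and the medieval historian is kept separate. A real service would need something far more careful, not least for the author who combines multiple research areas.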

I've also just been made aware of another initiative: ORCID (Open Researcher and Contributor ID) (http://orcid.org/). ORCID are aiming to produce an independent registry of academics, giving each their own unique identifier which can be linked to all their research output. It aims to 'transcend discipline, geographic, national and institutional boundaries' and therefore has a much broader scope than the UK-based Names. I am led to believe there is also a higher integration with the Library of Congress name authorities, something that is important to us at Warwick as we already use them where possible.

I believe the initiative behind both these projects is much needed in the future of research on the web, and I will be watching their progress with interest.

Wednesday, 18 May 2011

Musings on subject headings

I'm starting a blog in order to put down my thoughts about various metadata related topics, and I thought I'd begin with my musings on subject headings. I quite often tweet about my frustrations with Library of Congress subject headings, so I thought it was about time I wrote down properly my issues with LCSH, and my suggested solutions.

Firstly, to provide a bit of context: I create metadata for all submissions to the University of Warwick's institutional repository. I'm therefore cataloguing all sorts of journal articles and PhD theses on sometimes very random and bizarre topics. This raises several issues.

1. The topics can be very narrow and complex, something which LCSH does not cope with very well as it has to cover the world, the universe and everything. A lot of the subject headings are very general.

2. One way to make LCSH more specific is to add subdivisions, but I get very frustrated at the lack of flexibility here. For example, I recently catalogued an article on bacteria that can be found residing in sheep's feet. I would have loved to create a heading such as Sheep -- Feet -- Bacteria found in. However this is not possible for several reasons, and so I ended up basically assigning three headings: Sheep, Foot and Bacteria. Now this is fine if a user understands and is happy to use post co-ordination. But for a browsing individual looking for, say, works on the anatomy of the human foot, having articles about sheep's (and who knows what other animals') feet is only going to annoy them. I would think it very unlikely that anyone in the repository would be interested in anything's feet, ovine or otherwise.

3. Back on topic, almost 100% of what I catalogue will not be catalogued by the Library of Congress. I am therefore creating subject headings for works that are very unlikely to have been viewed by those that have the power to create new headings. I am cataloguing new, innovative research, which by definition will need new, innovative subject headings.

4. There is no way of defining relationships between headings. For example, it would be nice for a paper comparing the effects of physiotherapy and self-help for people with back pain, to represent this comparison in the subject headings.

5. Even some seemingly established research topics and phrases do not have their own heading, for example: metadiscourse, or professionalism.

6. Of course, there are always the American-centric gripes. I object to referring to 'College' students rather than 'University' students, and continually having to remember to write 'organization' or 'labor'.
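To illustrate the post co-ordination problem from point 2, here is a small Python sketch. The two records and the `search` function are made up for the example and have nothing to do with how our repository actually searches; the point is only to show what happens when separate headings are combined at search time.

```python
# Illustrative records: title -> set of headings assigned by the cataloguer.
records = {
    "Bacteria in sheep's feet": {"Sheep", "Foot", "Bacteria"},
    "Anatomy of the human foot": {"Foot", "Human anatomy"},
}


def search(records, *headings):
    """Return titles whose assigned headings include every heading asked for."""
    wanted = set(headings)
    return sorted(title for title, assigned in records.items() if wanted <= assigned)


# A user who combines all three headings finds exactly the sheep article.
combined = search(records, "Sheep", "Foot", "Bacteria")

# A user browsing on "Foot" alone gets the sheep article as well.
browsing = search(records, "Foot")
```

The combined search works well, but only for someone who knows to combine the headings. The browsing search is where the sheep get in the way.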

But then again, sometimes I am very pleasantly surprised at what there are headings for. 'Tissue scaffolds' for instance, or 'Domain-specific programming languages'.

So how do I think things could be improved? Well, it would seem that, certainly for the domain I work in, a much more specialised subject heading system is required. But as the items I catalogue could cover any topic, it would be infeasible to create one from scratch. So are there already taxonomies of terms for subject areas that could be utilised? Some subject areas do have their own systems, for example the Mathematics Subject Classification produced by the American Mathematical Society. Maybe such a scheme could be designated for each subject area, and used accordingly.

This is definitely something to investigate further...