Friday 7 November 2008

Annotation and Ontologies in the context of patent retrieval

So far, the Information Retrieval Facility Symposium (IRFS) has pretty much focused on the tricky issues surrounding the search of patent data, that is, from the IR specialist's point of view. Approaching the issue from the other side was Pierre Buffet, Executive Vice President and co-founder of Questel. His presentation, "How NLP techniques can and should help in structuring Patent Information", aimed to explain to us how that could and should be done. For the uninitiated, NLP stands for natural language processing, the field concerned with how computers interact with human language.
Buffet stressed that in order to improve IR, the structure of patent documents needs to change. He gave delegates a variety of examples of what patents comprise and the areas that need to change for them to be more searchable.
For example, the language styles used in a patent disclosure and a patent claim can differ widely. Patent filings with a corresponding drawing are "a nightmare to map the differences between the illustration and the description". Data held in patents in the form of tables, however small, can be the crucial information the researcher is hunting for, yet it is difficult to search a table and still keep an understanding of what it says when viewed as a standalone piece of information.
The way citations are currently applied is also a cause for concern, considering that there are no standards; the example that Buffet gave us from his own organisation's files had different styles all within one page. One salient point Buffet added was the frequency of hyperlinks used in citations. The problem, of course, is how long the pages they link to will remain online. Ten years? Five years?
The classification of patent materials also received somewhat of a drubbing, with Buffet offering us a reminder of what classification is for: the arrangement of a collection of objects in a single-dimensional world, for the purpose of storage and, when needed, retrieval at a later time. The problem, I think he was arguing, is that these 'one-dimensional worlds' are weak because the systems in place, USCLASS (USA), ECLA (Europe) and the Japanese FI, are culturally specific.
Some standardisation wouldn't go amiss, I presume.
As Buffet says though, "Describing a document isn't classifying a document."
There were plenty of suggestions mooted on how to get around these problems, such as linking descriptive items in an illustration to the same descriptive items within the text, as well as the necessity of standardising data, for instance in the area of statistical analysis.
Critically though, few would disagree with Buffet that what the IR/IP worlds need to find is a way of deciding who takes responsibility for what.
