Thursday 6 November 2008

The challenge of Multi-language Patent data

The opening sessions of the IRFS in Vienna have been dominated by the need to find workable solutions for retrieving and translating multi-language patent information, in particular that from Asia.
In the last 30 years the amount of information from the region has exploded. Most of it originates from Japan, although China, Korea and Taiwan are doing their level best to catch up. In total, half of all patent application filings now originate from the region.
The options for finding accurate patent information are limited. At present they are manual human translation, which is slow and the most expensive option, and automatic machine translation, which is faster but beset by the complexity of differences in language context between different parts of the world. A hybrid of the two, human-assisted translation, is a third possibility.
In his presentation, Professor Jian-Yun Nie from the University of Montreal outlined what he thought the ideal system should be, using the analogy of 3 cogs in a translation machine:
1) A query in English goes in and is automatically translated into an Asian language.
2) The exact required information is then retrieved.
3) That information is then made available to be read in English.
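To make the three cogs concrete, here is a minimal Python sketch of the pipeline. It is not the system Professor Nie described: the tiny dictionaries stand in for real machine translation, the documents and their ids are made up, and every function name is hypothetical.

```python
# Toy stand-ins for real machine translation: tiny hand-made dictionaries.
EN_TO_ZH = {"engine": "发动机", "brake": "制动器"}
ZH_TO_EN = {zh: en for en, zh in EN_TO_ZH.items()}

# A made-up collection of Chinese patent titles keyed by invented ids.
# The words are pre-separated by spaces here purely to keep the sketch short.
DOCUMENTS = {
    "doc-1": "发动机 冷却 装置",
    "doc-2": "制动器 控制 方法",
}


def translate_query(query_en):
    """Cog 1: translate each known English query word into Chinese."""
    return [EN_TO_ZH[w] for w in query_en.lower().split() if w in EN_TO_ZH]


def retrieve(terms_zh):
    """Cog 2: return ids of documents containing any translated term."""
    return [doc_id for doc_id, text in DOCUMENTS.items()
            if any(term in text for term in terms_zh)]


def translate_result(doc_id):
    """Cog 3: gloss the document back into English, word by word.

    Words missing from the toy dictionary are left in brackets.
    """
    words = DOCUMENTS[doc_id].split()
    return " ".join(ZH_TO_EN.get(w, f"[{w}]") for w in words)


def search(query_en):
    """Run all three cogs in sequence."""
    return {doc_id: translate_result(doc_id)
            for doc_id in retrieve(translate_query(query_en))}
```

Calling `search("engine")` returns only the first document, with the untranslated words left in brackets. Even in this toy version the weak points are visible: anything the dictionary doesn't know is silently dropped from the query or left unreadable in the result.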
Sounds simple enough, but it's a tough problem to crack. Because of the many differences in language structure between the West and Asia, a number of obstacles will need to be overcome, such as the way a term in one language can correspond to several different words in another, and this has to be considered when applying the relevant technology.
In part there is a need to recognise the relationship between these different terms, but the real question seems to be: how do you determine that relationship, and how do you weight the terms?
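One possible way to picture the weighting question, and not necessarily the approach discussed at the conference, is to give each candidate translation of a query term a weight and add up the weights of the candidates a document actually contains. In the Python sketch below the candidate list, the numbers and the function name are all invented for illustration.

```python
# Hypothetical translation candidates for one English term, each with an
# assumed weight. The numbers are made up, not taken from any real system.
CANDIDATES = {
    "engine": {"发动机": 0.7, "引擎": 0.3},
}


def score(doc_text, query_en):
    """Sum the weights of the translation candidates found in the document."""
    total = 0.0
    for word in query_en.lower().split():
        for candidate, weight in CANDIDATES.get(word, {}).items():
            if candidate in doc_text:
                total += weight
    return total
```

A document that uses the more heavily weighted translation scores higher than one that uses the rarer one. Where those weights should come from is exactly the open question.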
As far as language structure is concerned, consider for a moment that Chinese doesn't use spacing in its sentence structure, while Korean does, albeit sporadically and less rigidly than western languages. There are also marked differences between how the Chinese language is used in patents in China and in the different but related regions of Taiwan and Hong Kong. According to the presentation given by Benjamin T'sou, Director of Language Information Sciences at the University of Hong Kong, taking into account the differences between these three areas in just one sector (the automotive industry) is proving to be a major headache. It illustrates well the difficulty of applying the same approach to a very different culture with its own set of languages.
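To show why the missing spaces matter, here is a minimal Python sketch of one well-known way to split unspaced Chinese text into words: greedy forward maximum matching against a word list. It is an illustration only, not a method attributed to either speaker, and the tiny lexicon and the function name are assumptions.

```python
# A toy word list; a real segmenter would need a large lexicon, and the
# regional differences mentioned above suggest one per region or sector.
LEXICON = {"发动机", "发动", "冷却", "装置"}


def segment(text, lexicon=LEXICON, max_len=3):
    """Greedy forward maximum matching.

    At each position, take the longest word found in the lexicon;
    fall back to a single character when nothing matches.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words
```

With this lexicon, `segment("发动机冷却装置")` splits the string into three words. The result depends entirely on the word list, so a term that is written differently in another region simply falls apart into single characters.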
The suggestion was that what seems to work effectively right now is to use machine translation to get the gist of the document and then have a human evaluate whether it's worth digging deeper and engaging the services of a professional translator. This might be good enough for now, but whether that remains true is another matter.
