Wednesday 2 April 2008

Comprende computer?

“The myth of fully automated translation is just that – a myth,” and “Languages are just too complex for us to be able to automate the whole process,” said Mark Lancaster, chief executive of SDL, a UK commercial translation company, in today’s FT.


If, as an information professional, you are ever tasked with scouring a foreign language source for vital research data, then you may have needed to use professional translation services. But what if the amount of data you need to wade through means that using a team of humans is not time or cost effective for your limited budget? Is automated or Machine Translation (MT) an option?


MT is certainly not perfected. It has difficulties, as Lancaster says, and as the article points out, significant human input is still required.


For any information professional (those working in patent research, for example), understanding a foreign language resource, as well as navigating it, will be a difficult challenge. The language barriers that researchers can come up against include sentence structures and word meanings that differ from those of their native tongue and must be translated correctly. There are many complex factors to consider when relying on MT to handle the subtleties that Lancaster mentions.


Using the patent example, a researcher could be searching for a vital, if obscure, piece of information from an East Asian country. The structure of the Korean, Japanese or Chinese languages is fundamentally different from that of Western languages written in the Latin alphabet. The FT article highlights the considerable time it would take with “Japanese double byte type projects”. Double bytes, the FT says, refer to “the number of bytes required to code Japanese, Chinese or Korean characters – English is a single-byte language”.
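To make the single-byte versus double-byte distinction concrete, here is a minimal Python sketch that counts how many bytes one character occupies once encoded. The sample characters and the legacy encodings (Shift JIS, EUC-KR, GB2312) are my own picks for illustration; the FT article does not name any particular encoding.

```python
# Assumption: these legacy encodings were chosen for illustration only;
# the FT article does not say which encodings it has in mind.
SAMPLES = [
    ("A", "ascii"),       # English letter
    ("日", "shift_jis"),  # Japanese
    ("한", "euc_kr"),     # Korean
    ("中", "gb2312"),     # Chinese (simplified)
]


def byte_length(char, encoding):
    """Return the number of bytes needed to store char in the given encoding."""
    return len(char.encode(encoding))


for char, encoding in SAMPLES:
    print(char, encoding, byte_length(char, encoding))
```

In these encodings the English letter takes one byte and each of the three East Asian characters takes two. In UTF-8, which is far more common today, the same three characters take three bytes each, so “double byte” is best read as a description of the older national encodings.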


This all reminds me of a session on MT and patent search at the Information Retrieval Facility Symposium (IRFS) in November, which I blogged about at the time. There can be several ways to express the colour red in Korean. The spacing of a Chinese word in a sentence can alter its meaning significantly, so when translating to a differently structured language there are added complications to contend with. The speakers at the event detailed these differences exquisitely, highlighting exactly what is required to get around them and what kind of work still needs to be done. Admittedly, in the patent case, filing the patent data consistently was important in achieving good results.


Dr Barrou Diallo, head of research at the European Patent Office (EPO) and the lead speaker on MT at IRFS, gave quite a detailed breakdown of the issues involved. His department develops various tools that focus on helping examiners with their information retrieval searches.


Diallo admits that there are plenty of hurdles to overcome in the EPO’s five-year MT research initiative. He believes that, thanks to the lessons learned from translating European patent information and the technology developed along the way, existing translated material is already mature enough to improve the efficiency of patent workers. It also leaves the initiative better placed to deal with the complex differences between European and East Asian information sources.


The nice thing about Diallo’s presentation is that the quite complex difficulties that MT needs to overcome are explained in depth. They are challenging (especially to the layperson) and highlight the mountain still to climb. But the positive efforts of his European, Korean and Chinese contemporaries in such a testing area of research are worth listening to.


Links to Diallo’s twenty(ish)-minute presentation, as well as the others, are here, along with the accompanying slides.
