The use of machine translation tools for crosslingual text

  • About 5.29 thousand characters
  • About 15 pages
  • Published 2017-03-09 in Shanghai


Kernel Canonical Correlation Analysis (Language Independent Document Representation)
Blaz Fortuna, Marko Grobelnik, Dunja Mladenić
Jozef Stefan Institute, Ljubljana

Outline
  • What is KCCA – intuition and theory
  • Preliminary results for AC corpora
  • Applications of KCCA
  • Related approaches

What is KCCA about?
  • KCCA makes it possible to represent documents in a “language neutral” way.
  • Intuition behind KCCA: given a parallel corpus (such as Acquis),
      • first, we automatically identify language-independent semantic concepts from the text,
      • then, we re-represent the documents in terms of the identified concepts,
      • finally, we can perform cross-language statistical operations (such as retrieval, classification, clustering…).

Input for KCCA
  • The input is a set of aligned documents: for each document we have a version in each language.
  • Documents are represented as bag-of-words vectors.

The Output from KCCA
  • The goal: find pairs of semantic dimensions that co-occur with high correlation in documents and their translations.
  • A semantic dimension is a weighted set of words.
  • Each pair consists of two vectors, one from, e.g., the English bag-of-words space and one from the German bag-of-words space.

The Algorithm – Theory (1/2)
  • Formally, KCCA solves: $\max_{x,y} \operatorname{Corr}\big(\langle x, d_E \rangle, \langle y, d_G \rangle\big)$
  • $x$, $y$ – semantic directions for English and German
  • $(d_E, d_G)$ is a pair of aligned documents

The Algorithm – Theory (2/2)

Examples of Semantic Dimensions from the Acquis corpus: English-French (1/2)
  • Most important words from semantic dimensions automatically generated from 2000 documents:

Examples of Semantic Dimensions from the Acquis corpus: English-Slovene (2/2)
  • Most important words from semantic dimensions automatically generated from 2000 documents:

Applications of KCCA
  • Cross-lingual document retrieval: the retrieved documents depend only on the meaning of the query, not on its language.
  • Automatic document categorization: a single classifier is learned rather than a separate classifier for each language.
  • Document clustering: documents s
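
The pipeline described in the slides (aligned bag-of-words input, paired semantic directions, cross-lingual retrieval) can be illustrated with a minimal sketch. It makes several assumptions that are not in the slides: a plain regularized linear CCA is swapped in for the kernel version to keep the code short, the six-document English-German corpus is invented for illustration, and the names `bag_of_words`, `inv_sqrt`, `fit_cca`, `retrieve` and the parameter `reg` are hypothetical helpers, not the authors' implementation.

```python
import numpy as np

def bag_of_words(docs, vocab):
    """Return an (n_docs, n_words) count matrix for whitespace-tokenised docs."""
    index = {w: j for j, w in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(docs):
        for w in doc.split():
            if w in index:              # ignore out-of-vocabulary words
                X[i, index[w]] += 1
    return X

def inv_sqrt(C):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def fit_cca(X, Y, k, reg=0.1):
    """Regularised linear CCA (assumed stand-in for KCCA).

    Returns the column means, k paired directions for each language,
    and the corresponding (regularised) correlations.
    """
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # Singular vectors of the whitened cross-covariance give the paired directions.
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    return mx, my, Wx @ U[:, :k], Wy @ Vt.T[:, :k], s[:k]

# Invented toy parallel corpus: documents 0-2 are about fishing, 3-5 about banking.
english = ["fish fishing vessel", "fishing quota fish", "vessel fish quota",
           "bank credit loan", "credit bank interest", "loan interest bank"]
german = ["fisch fischerei schiff", "fischerei quote fisch", "schiff fisch quote",
          "bank kredit darlehen", "kredit bank zins", "darlehen zins bank"]

en_vocab = sorted({w for d in english for w in d.split()})
de_vocab = sorted({w for d in german for w in d.split()})
X = bag_of_words(english, en_vocab)
Y = bag_of_words(german, de_vocab)

mx, my, A, B, corr = fit_cca(X, Y, k=2)
german_concepts = (Y - my) @ B          # German documents in concept space

def retrieve(query):
    """Index of the German document closest (cosine) to an English query."""
    q = ((bag_of_words([query], en_vocab) - mx) @ A)[0]
    sims = german_concepts @ q / (
        np.linalg.norm(german_concepts, axis=1) * np.linalg.norm(q))
    return int(np.argmax(sims))

print("correlations:", corr)
print("best German match for 'fishing vessel':", german[retrieve("fishing vessel")])
```

Because both the query and the German documents are compared in the shared concept space, the English query never has to be translated. On a real corpus the vocabulary is far larger than the number of documents, which is one reason a kernel (dual) formulation is attractive there.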
