A Ten-Year Dream of an American Translation Expert Bears Fruit on the Huangpu Today
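Recently, Jost Zetzsche of the United States (author of the Tool Box Journal and an internationally recognized authority on translation technology) got in touch with Zhang Jing, CEO of Tmxmall, through his friend in China, Zhou Xinghua. While learning about Tmxmall in detail, Jost Zetzsche also reflected on the work he did twelve years ago for the cause of corpus sharing and said he was gratified by Tmxmall's current progress.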
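Jost Zetzsche is a leading figure in translation technology in the United States. Twelve years ago, he and Donna Parrish of MultiLingual founded a company called TM Marketplace, whose purpose was to connect owners of corpus data with those who could use it, so that the data could be applied to translation work. They put a great deal of work into this innovative concept. Unfortunately, they misread the market and the state of technology at the time: corpus data was used in a very narrow way, with translation memories searched only locally for exact and fuzzy matches, and only within CAT tools. Together with uncertainty over the intellectual property rights to corpus data and the value of that data, this caused the plan to fail, and the business closed a few years later. Others have since built on the ideas of Jost and his colleagues and continued to explore corpus sharing, among them the Translation Automation User Society (TAUS).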
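Today, corpus sharing is flourishing in China, and its progress has gone far beyond what Mr. Zetzsche expected. Tmxmall's user base of as many as 30,000 corpus users and its more than 500,000 corpus API searches per day came as a real surprise to him. Translators in the United States and China, he remarked, seem to live in parallel universes: the idea of corpus sharing that originated in the United States has yet to bear fruit there, while China is already applying corpus data to translation on a large scale. In fact, the rapid growth of massive translation memories (translation big data) in China owes mainly to the translation community's consensus on translation big data and to the spread of internet cloud computing. Together these make it possible to run cloud-based pre-translation or fuzzy matching in milliseconds against translation memories holding hundreds of millions of sentence pairs.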
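To make the fuzzy matching mentioned above concrete, here is a minimal sketch of looking up a sentence in a tiny in-memory translation memory. It is not Tmxmall's method. The `TM` data, the `similarity` and `best_match` functions, the 75% threshold, and the choice of Python's `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration. A linear scan like this would not scale to hundreds of millions of sentence pairs, where some form of indexing would be needed.

```python
from difflib import SequenceMatcher

# Hypothetical toy translation memory: (source, target) sentence pairs.
TM = [
    ("Click the Save button.", "单击“保存”按钮。"),
    ("Click the Cancel button.", "单击“取消”按钮。"),
    ("Restart the computer.", "重新启动计算机。"),
]

def similarity(a: str, b: str) -> float:
    """Return a 0-100 score based on a character-level ratio (an assumed measure)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_match(query, tm=TM, threshold=75.0):
    """Return (score, source, target) for the closest entry, or None if below threshold."""
    scored = [(similarity(query, src), src, tgt) for src, tgt in tm]
    score, src, tgt = max(scored, key=lambda item: item[0])
    return (score, src, tgt) if score >= threshold else None
```

Calling `best_match("Click the Save button")` (without the final period) returns the "Save" entry with a score just under 100, which a CAT tool would present as a fuzzy match, while an identical sentence scores exactly 100.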
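Because both sides are committed initiators, practitioners, and explorers of corpus sharing, Jost Zetzsche needed only a brief introduction from Zhang Jing to grasp Tmxmall's philosophy, services, and products accurately. He summed up Tmxmall as a provider of a family of products and services built around translation memories, including smart alignment, a translation memory exchange platform, a translation memory management and quality evaluation system, translation memory output APIs and plugins, and a translation memory marketplace.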
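He was eager to know when Tmxmall would offer an English interface for these services and products. In response, Tmxmall has added the internationalization of its services and products to this year's work plan, as a warm reply to this kindred spirit overseas. Once that is done, the two parallel universes Jost Zetzsche described can be connected, and more translators abroad will be able to use translation memory services and products built on big data and cloud computing.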
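Language is a tool people use at every moment, and speakers of different languages constantly rely on translation to communicate. In the process, they also repeat a great deal of work. If translation data produced in the past could be recovered efficiently and put to effective use, everyone working in translation would save considerable time and cost. That is the dream corpus sharers pursue. Just how much repeated work does translation involve? To find out, Tmxmall analyzed how its large corpus is used and found an average repetition rate of 20% to 30%. That figure is striking. If this repeated work could be avoided, then leaving aside the gains in translation quality and speed that translation memory technology brings, the cost savings alone would be enormous on a global scale.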
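Moreover, corpus data accumulated in real translation work is a basic prerequisite for improving machine translation and a piece of foundational infrastructure for artificial intelligence. It is fair to say, then, that corpus sharing is a necessary step toward truly rebuilding the Tower of Babel. People of vision in China and abroad are working hard and exploring continuously toward this grand cause, and perhaps that day will arrive in the not too distant future.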
Jost Zetzsche's article about Tmxmall in the Tool Box Journal follows.
Twelve years ago, I partnered with Donna Parrish from MultiLingual to launch TM Marketplace, a company that was built around the idea of creating connections between owners of translation memory data and users who could put that data to new and productive use. It was an innovative idea, and we did our homework as far as the legal structuring of the business and the amount of energy we would need to invest (lots). Unfortunately, we failed to read the market correctly, or the current state of technology for that matter. At that point, TMs were used in very one-dimensional ways (essentially just as containers of perfect and fuzzy matches and only in the context of translation environment or CAT tools), and there was even greater uncertainty on questions like proprietary rights and value of data. Our plan didn't work out, and we closed our business a few years down the road.
Others did pick up some of our ideas, in particular TAUS, and you can benefit from that up to the present day by using the terminology services of the TAUS Data Cloud.
More recently, a company in China has come up with their own offering, and it might just work for them. The company is called TMXMall (the website as of now is only in Chinese, but there will be an English version in "two or three months"), and I conversed a bit with their CEO Zhang Jing.
TMXMall is essentially a collection of services and products that all revolve around translation memories:
A "smart aligner" that is able to provide better results because it is run with a large corpus in the background which is used to verify alignments.
A TM exchange platform where one can retrieve two translation units for every translation unit uploaded (for the uploaded data, they offer an anonymization service where any identifying phrase can be replaced).
A TM management and verification system that assesses and improves the quality of TM data (again with the help of the corpus).
API and plugins to a few translation environment tools, including Trados and memoQ, to access the corpus.
TM marketplace (hurts a little bit to write that...) where users can upload documents, search for matches, and purchase those.
An (upcoming) peer-to-peer system where users can sell and trade data with each other.
Data sold on the TM marketplace costs approximately $1.50 per 1,000 words with a 100% match and then drops in several stages to about $0.45 per 1,000 words with a 75-84% match. The money goes to the owner of the data (if it is owned by a third party), with TMXMall receiving 10-20% from the transaction fees.
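The pricing above can be worked through with a little arithmetic. The sketch below is illustrative only. It uses just the two rates quoted (about $1.50 per 1,000 words for 100% matches and about $0.45 per 1,000 words for 75-84% matches) and omits the intermediate stages because their prices are not given. It also assumes the 10-20% that TMXMall receives is a simple share of the sale price, which the text does not spell out. The function names and the sample word counts are hypothetical.

```python
from decimal import Decimal

# Approximate rates per 1,000 words quoted in the article (USD).
# Intermediate tiers exist but their prices are not given, so they are omitted.
RATE_PER_1000 = {
    "100%": Decimal("1.50"),
    "75-84%": Decimal("0.45"),
}

def purchase_cost(words_by_tier):
    """Total price for matched words, given a mapping of tier -> word count."""
    return sum(
        (RATE_PER_1000[tier] * Decimal(words) / Decimal(1000)
         for tier, words in words_by_tier.items()),
        Decimal("0"),
    )

def split_revenue(total, platform_share):
    """Split a sale into (platform cut, data owner's share).

    platform_share is a fraction; the article gives a range of 0.10 to 0.20.
    Treating it as a share of the sale price is an assumption.
    """
    platform_cut = total * Decimal(str(platform_share))
    return platform_cut, total - platform_cut

# Hypothetical order: 10,000 words at 100% match, 4,000 words at 75-84%.
total = purchase_cost({"100%": 10_000, "75-84%": 4_000})
low = split_revenue(total, 0.10)
high = split_revenue(total, 0.20)
```

For this hypothetical order the total comes to $16.80, of which TMXMall would keep $1.68 at a 10% share or $3.36 at a 20% share, with the remainder going to the data owner.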
Here are some usage numbers that Jing supplied me with: There is a total number of about 30,000 users with 500,000 API calls every day. Here's what I find most amazing about these numbers: These are substantial numbers by any stretch of the imagination, and I can almost bet that -- unless you're a Chinese translator -- you haven't even heard about this, let alone realized that this service is processing some real data. It seems a little bit like a case of parallel universes.
The service is used not only by individual translators but by translation companies and, as you won't be surprised to hear, machine translation companies that use the data to train their engines.
The vast majority of the data is in English<>Chinese, but there is also data available in Chinese<>Japanese, Korean, French, German, Hindi, and Spanish.




