Uplug is a collection of tools for linguistic corpus processing, word alignment, and term extraction from parallel corpora.
Version: 0.2.0c
Operating System: Linux
Several tools have been integrated in Uplug. Pre-processing tools include a sentence splitter, a tokenizer, and external part-of-speech taggers and shallow parsers. The following external tools are used: the Grok system for English (tagging and chunking) and the morphological analyzer ChaSen for Japanese.
Other tools such as the TreeTagger can easily be added. Translated documents can be sentence-aligned using the length-based approach of Gale & Church. Words and phrases can be aligned using the clue alignment approach and GIZA++, a toolbox for training statistical alignment models.
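To give an idea of what length-based sentence alignment involves, here is a minimal Python sketch of the idea behind the Gale & Church approach. It is not Uplug's own code. The function names (match_cost, align) are hypothetical, and the priors and the constants C and S2 are the values commonly quoted for the 1993 paper, so treat them as assumptions. The sketch scores each candidate grouping of sentences by how well the character lengths match, then uses dynamic programming to find the cheapest sequence of groupings.

```python
import math

# Prior probability of each bead type: (source sentences, target sentences).
# Values as commonly quoted for Gale & Church (1993); treat them as assumptions.
PRIORS = {
    (1, 1): 0.89,
    (1, 0): 0.0099,
    (0, 1): 0.0099,
    (2, 1): 0.089,
    (1, 2): 0.089,
    (2, 2): 0.011,
}
C = 1.0   # assumed expected number of target characters per source character
S2 = 6.8  # assumed variance of that ratio


def match_cost(len_src, len_tgt, prior):
    """Negative log probability of pairing len_src with len_tgt characters."""
    mean = (len_src + len_tgt / C) / 2.0
    if mean == 0:
        delta = 0.0  # both sides empty: lengths match trivially
    else:
        delta = (len_tgt - len_src * C) / math.sqrt(mean * S2)
    # Two-sided tail probability of a standard normal variable.
    tail = max(math.erfc(abs(delta) / math.sqrt(2.0)), 1e-300)
    return -math.log(tail) - math.log(prior)


def align(src, tgt):
    """Return a list of (source_indices, target_indices) beads."""
    n, m = len(src), len(tgt)
    src_len = [len(s) for s in src]
    tgt_len = [len(t) for t in tgt]
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Forward dynamic programming over prefixes of both sentence lists.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = match_cost(sum(src_len[i:ni]), sum(tgt_len[j:nj]), prior)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (di, dj)

    # Trace back from the end to recover the cheapest sequence of beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        beads.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    beads.reverse()
    return beads


if __name__ == "__main__":
    english = ["Hello world.", "How are you today?"]
    german = ["Hallo Welt.", "Wie geht es dir heute?"]
    for src_ids, tgt_ids in align(english, german):
        print(src_ids, "<->", tgt_ids)
```

Because only sentence lengths are compared, a sketch like this needs no dictionary or language-specific resources, which is the main appeal of the length-based approach. A practical tool would also align paragraph by paragraph and estimate C and S2 from the language pair.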
What's New in This Release:
· robust conversion of encodings in tag.pl/toktag.pl/chunk.pl
· added treetagger startup scripts for es and nl, replaced "nbsp" with " "
· robust conversion between encodings in bitext-indexer.pl/opus-indexer.pl
· added startup scripts for Spanish and Dutch TreeTagger models
· updated startup scripts for other treetagger models according to latest TreeTagger distribution
· fixed hunalign (bug in converting alignment output to XML)
· added missing ';' at line 40 in Uplug.pm