Uplug provides a set of tools for processing linguistic corpora, including word alignment and term extraction from parallel corpora.
Uplug integrates several external tools, for example the Grok system for English tagging and chunking and the morphological analyzer ChaSen for Japanese. Users can add further tools, such as the popular TreeTagger.
The software also supports sentence alignment using a length-based approach. Words and phrases can be aligned with the clue alignment approach and with statistical alignment models trained using GIZA++.
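To illustrate the general idea behind length-based sentence alignment, here is a minimal Python sketch. It is not Uplug's implementation: the function name align_by_length, the bead penalties, and the cost function (absolute difference in character length) are assumptions chosen for illustration, and the probabilistic cost used by established length-based aligners is replaced here by this much simpler measure.

```python
from typing import List, Tuple

# Allowed alignment "beads" as (source sentences, target sentences),
# each with a hypothetical fixed penalty that favours 1-1 alignments.
BEADS = {(1, 1): 0, (1, 0): 10, (0, 1): 10, (2, 1): 5, (1, 2): 5}


def align_by_length(src: List[str], tgt: List[str]) -> List[Tuple[List[int], List[int]]]:
    """Align sentences by dynamic programming over character lengths.

    Returns a list of beads; each bead holds the indices of the source
    sentences and of the target sentences that are aligned to each other.
    """
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                # Simplified cost: absolute difference in character length.
                len_s = sum(len(s) for s in src[i:ni])
                len_t = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + abs(len_s - len_t) + penalty
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)

    # Follow the back-pointers from the end to recover the cheapest path.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads
```

Calling align_by_length with two lists of sentences returns index pairs that can be used to look up the aligned text. The penalties would need tuning on real data; the values above are placeholders.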
This release of Uplug brings several improvements. Encoding conversion in tag.pl, toktag.pl and chunk.pl is now more robust. New TreeTagger startup scripts have been added for Spanish (es) and Dutch (nl), and the startup script for other TreeTagger models has been updated to match the latest TreeTagger distribution.
Other changes include a bug fix in the conversion of alignment output to XML when using hunalign and the addition of a missing semicolon at line 40 of Uplug.pm.
Overall, Uplug is a capable toolkit for anyone who needs to extract linguistic data from parallel corpora.
Version 0.2.0c: N/A