Mguesser is a component of libmnogosearch that enables the identification of a text file's character set and language. It is available as a standalone application.
The mguesser package consists of N-gram based guessing algorithms written in C, together with a set of language and character set maps. You can find these maps in the "maps" directory of the package. The software supports various languages and character sets, all of which are listed in the package. The latest version adds the "-d" command-line option for loading language maps from a non-default directory and the "-t" option for specifying how many of the top n-grams to print to the output map. About 30 new maps have also been added in the latest release.
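To make the N-gram idea concrete, here is a minimal Python sketch of how a character n-gram profile can be built from a text and compared with a stored map. It is a generic illustration only: the function names, the n-gram lengths and the overlap score are assumptions made for this example and are not taken from mguesser's C sources or its map file format.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top=5):
    """Return the `top` most frequent character n-grams of length 1..max_n."""
    # Collapse whitespace runs so that word boundaries become a single space.
    normalized = " ".join(text.lower().split())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(normalized) - n + 1):
            counts[normalized[i:i + n]] += 1
    return counts.most_common(top)


def overlap_score(sample_profile, map_profile):
    """Fraction of the sample's n-grams that also occur in the map (0 to 1).

    Hypothetical scoring rule for illustration; mguesser may score differently.
    """
    sample = {gram for gram, _ in sample_profile}
    known = {gram for gram, _ in map_profile}
    return len(sample & known) / len(sample) if sample else 0.0
```

A guesser built this way would compute one such score per language map and report the maps with the highest scores first.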
To use mguesser, feed plain text data to its STDIN. It's important to note that other "almost text" formats like HTML may not give accurate results. If need be, you can add a command-line switch to tell the software that the input is HTML. Mguesser works best on texts of 500 bytes or more. Shorter texts are usually not guessed accurately.
To guess the language and character set of a text file, run the following command: "mguesser < text_file". The software will display how well your file matches each language map, best match first. Each match is scored with a value between 0 and 1. To limit the number of results displayed, use the "-n" command-line switch. For instance, "mguesser -n3 < text_file" will display the top 3 results.
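If you want to call mguesser from a script, the invocation above can be wrapped roughly as follows. This is a sketch under assumptions: it expects an executable named "mguesser" on the PATH, the helper names are made up for this example, and the output is returned unparsed because its exact layout is not described here.

```python
import subprocess


def build_guess_command(top_results=None, executable="mguesser"):
    """Build the argument list; "-nN" limits the output to the top N results."""
    command = [executable]
    if top_results is not None:
        command.append(f"-n{top_results}")
    return command


def guess(path, top_results=3):
    """Feed the file at `path` to mguesser's STDIN and return its raw output."""
    with open(path, "rb") as handle:
        completed = subprocess.run(
            build_guess_command(top_results),
            stdin=handle,          # equivalent to "mguesser -n3 < text_file"
            capture_output=True,
            check=True,            # raise if mguesser exits with an error
        )
    return completed.stdout.decode("utf-8", errors="replace")
```

For example, guess("text_file", top_results=3) would run the same command as "mguesser -n3 < text_file".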
To make mguesser load language maps from a non-default directory, use the "-d/path/to/maps/" option. You can also load language maps from multiple directories by giving a colon-separated list. To create a new language map, run "mguesser -p -c charset -l language < text_file". With this command, mguesser creates a new map based on text_file and prints it to STDOUT. For best results, use a high-quality source text, usually around 500 KB in size.
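The map options can be scripted in the same way. The sketch below is an illustration, not part of mguesser: the helper names and the output file path are hypothetical, and it assumes that the colon-separated directory list is attached directly to "-d" in the same way as a single path.

```python
import subprocess


def maps_dir_option(directories):
    """Return the "-d" option for one or more map directories."""
    # Assumption: several directories are joined with ":" right after "-d".
    return "-d" + ":".join(directories)


def build_map_command(charset, language, executable="mguesser"):
    """Build the argument list for creating a new language map."""
    return [executable, "-p", "-c", charset, "-l", language]


def create_map(text_path, map_path, charset, language):
    """Build a map from `text_path` and save mguesser's STDOUT to `map_path`."""
    with open(text_path, "rb") as source, open(map_path, "wb") as target:
        subprocess.run(
            build_map_command(charset, language),
            stdin=source,
            stdout=target,
            check=True,
        )
```

The file name passed as map_path is the caller's choice in this sketch; check the existing files in the "maps" directory for the naming that mguesser expects.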
Finally, you can include mguesser in your own applications. Check the main() function in the guesser.c file to see the order of the guesser function calls. Given enough plain text, mguesser guesses language and character set accurately.
Version 0.4: N/A