Lingo - A full-featured automatic indexing system
VERSION
This documentation refers to Lingo version 1.8.3
DESCRIPTION
Lingo is an open source indexing system for research and teaching. The main functions of Lingo are:
- identification of (i.e. reduction to) basic word form by means of dictionaries and suffix lists
- algorithmic decomposition
- dictionary-based synonymisation and identification of phrases
- generic identification of phrases/word sequences based on patterns of word classes
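To make the first of these functions more concrete, here is a minimal sketch of reducing a token to its basic word form with a dictionary and a suffix list. It is not Lingo's implementation: the constants DICTIONARY and SUFFIXES, their contents, and the method base_form are made up for illustration.

```ruby
# Toy dictionary of known base forms and a toy suffix list.
# Both are invented for this sketch; they are not Lingo's data.
DICTIONARY = %w[index word class library].freeze
SUFFIXES = [
  ['ies', 'y'], # libraries -> library
  ['es',  ''],  # classes   -> class
  ['s',   ''],  # words     -> word
  ['ing', ''],  # indexing  -> index
].freeze

# Returns the base form of token, or nil if none can be identified.
def base_form(token)
  word = token.downcase
  # A token that is already a dictionary entry is its own base form.
  return word if DICTIONARY.include?(word)

  # Otherwise strip each known suffix in turn and accept the first
  # candidate that the dictionary confirms.
  SUFFIXES.each do |suffix, replacement|
    next unless word.end_with?(suffix)

    candidate = word[0...-suffix.length] + replacement
    return candidate if DICTIONARY.include?(candidate)
  end

  nil
end
```

The dictionary check after each suffix removal is what keeps the sketch from over-stemming: a suffix is only stripped if the remainder is a known word.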
Introduction
If you want to perform linguistic analysis on some text, Lingo is there to support your endeavour with all its flexibility and extensibility. Lingo enables you to assemble a network of practically unlimited functionality from modules with limited functions. This network is defined by configuration files. Here's a minimal example:
  meeting:
    attendees:
      - text_reader: { files: 'README' }
      - debugger:    { eval: 'true', ceval: 'cmd!="EOL"', prompt: '<debug>: ' }
Lingo is told to invite two attendees, and it wants them to talk to each other, hence the name Lingo (= the technical language).
The first attendee is the text_reader (Lingo::Attendee::TextReader). It can read files (as well as standard input) and communicate their content to other attendees. For this purpose the text_reader is given an output channel, and everything it has to say goes through this channel. It does nothing until Lingo tells it to speak. Then the text_reader opens the file README (files parameter) and babbles the content to the world via its output channel.
The second attendee, the debugger (Lingo::Attendee::Debugger), does nothing but print everything that arrives on its input channel to the console (standard error, actually). If you save the example configuration above to the file readme.cfg and then run lingo -c readme -l en, the result will look something like this:
<debug>: *FILE('README')
<debug>: "= Lingo - [...]"
...
<debug>: "If you want to perform linguistic analysis on some text, [...]"
<debug>: "support your endeavour with all its flexibility and [...]"
...
<debug>: *EOF('README')
What we see are lines with an asterisk (*) and lines without. That's because Lingo distinguishes between commands and data. The text_reader did not only read the content of the file; it also sent commands to signal when a file begins and when it ends. This can (and will) be an important piece of information for other attendees that will be added later.
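The distinction between commands and data can be illustrated with a small stand-alone sketch. It only imitates the behaviour described above and does not use Lingo's classes: Command, read_text and debug_lines are hypothetical names, and the input is an in-memory string rather than the real README.

```ruby
require 'stringio'

# An item on a channel is either a command (FILE, EOF) or a line of data.
Command = Struct.new(:name, :param)

# Hypothetical reader: emits a FILE command, every line of the input
# as data, and finally an EOF command.
def read_text(name, io)
  items = [Command.new('FILE', name)]
  io.each_line { |line| items << line.chomp }
  items << Command.new('EOF', name)
end

# Hypothetical debugger: commands are shown with a leading asterisk,
# data as quoted strings; every line starts with the prompt.
def debug_lines(items, prompt: '<debug>: ')
  items.map do |item|
    if item.is_a?(Command)
      "#{prompt}*#{item.name}('#{item.param}')"
    else
      "#{prompt}#{item.inspect}"
    end
  end
end

input  = StringIO.new("= Lingo\nSome text\n")
output = debug_lines(read_text('README', input))
output.each { |line| warn line } # like the debugger, write to standard error
```

Because commands travel on the same channel as data, an attendee further down the line can react to file boundaries without knowing anything about the reader itself.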
To try out Lingo's functionality without installing it first, have a look at Lingo Web. There you can enter some text and see the debug output Lingo generates, including tokenization, word identification, decomposition, etc.
Attendees
Available attendees that can be used for solving a specific problem (for more information see each attendee's documentation):
text_reader | Reads files and puts their content into the channels line by line. (Lingo::Attendee::TextReader)
tokenizer | Dissects lines into defined character strings, i.e. tokens. (Lingo::Attendee::Tokenizer)
abbreviator | Identifies abbreviations and produces the long form if listed in a dictionary. (Lingo::Attendee::Abbreviator)
word_searcher | Identifies tokens and turns them into words for further processing. To do this right it looks into the dictionary. (Lingo::Attendee::WordSearcher)
decomposer | Tests any character strings not identified by the word_searcher for being compounds. (Lingo::Attendee::Decomposer)
synonymer | Extends words with synonyms. (Lingo::Attendee::Synonymer)
noneword_filter | Filters out everything and lets through only those tokens that are unknown. (Lingo::Attendee::NonewordFilter)
vector_filter | Filters out everything and lets through only those tokens that are considered useful for indexing. (Lingo::Attendee::VectorFilter)
object_filter | Similar to the vector_filter. (Lingo::Attendee::ObjectFilter)
text_writer | Writes anything that it receives into a file (or to standard output). (Lingo::Attendee::TextWriter)
formatter | Similar to the text_writer, but allows for custom output formats. (Lingo::Attendee::Formatter)
debugger | Shows everything for debugging. (Lingo::Attendee::Debugger)
variator | Tries to correct spelling errors and the like. (Lingo::Attendee::Variator)
dehyphenizer | Tries to undo hyphenation. (Lingo::Attendee::Dehyphenizer)
multi_worder | Identifies phrases (word sequences) based on a multiword dictionary. (Lingo::Attendee::MultiWorder)
sequencer | Identifies phrases (word sequences) based on patterns of word classes. (Lingo::Attendee::Sequencer)
Furthermore, it may be useful to have a look at the configuration files lingo.cfg and en.lang.
Filters
Lingo is able to read HTML, XML, and PDF in addition to plain text.
TODO: Examples.
Markup
Lingo is able to parse HTML/XML and MediaWiki markup.
TODO: Examples.
Inline annotation
Lingo is able to annotate input text inline, instead of printing results to external files.
TODO: Examples.
Plugins
Lingo has a plugin system that allows you to implement additional features (e.g. add new attendees) or modify existing ones. Just create a file named lingo_plugin.rb in your Gem's lib directory or any directory that's in $LOAD_PATH. You can also define an environment variable LINGO_PLUGIN_PATH (by default ~/.lingo/plugins) with additional directories to load plugins from (*.rb).
A dedicated API to support writing and integrating plugins will be added in the future.
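As a rough illustration of the two plugin locations described above, the following sketch collects candidate plugin files. It is not Lingo's loader: the method plugin_files is hypothetical, and the assumption that LINGO_PLUGIN_PATH may hold several directories joined by the platform's path separator is ours.

```ruby
# Collects candidate plugin files from the two locations the text names:
# a lingo_plugin.rb in any load-path directory, and every *.rb file in
# the plugin path directories.
# Assumption: plugin_path may list several directories, joined by the
# platform's path separator.
def plugin_files(load_path: $LOAD_PATH,
                 plugin_path: ENV.fetch('LINGO_PLUGIN_PATH') { File.expand_path('~/.lingo/plugins') })
  from_load_path = load_path
    .map { |dir| File.join(dir, 'lingo_plugin.rb') }
    .select { |file| File.file?(file) }

  from_plugin_path = plugin_path
    .split(File::PATH_SEPARATOR)
    .flat_map { |dir| Dir.glob(File.join(dir, '*.rb')).sort }

  from_load_path + from_plugin_path
end
```

Calling plugin_files without arguments inspects the current $LOAD_PATH and environment; each returned path could then be passed to require or load.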
EXAMPLE
TODO: Full-fledged example to show off Lingo's features and provide a basis for further discussion.
INSTALLATION AND USAGE
Since version 1.8.0, Lingo is available as a RubyGem. So a simple gem install lingo will install Lingo and its dependencies (you might want to run that command with administrator privileges, depending on your environment). Then you can call the lingo executable to process your data. See lingo --help for starters.
Please note that Lingo requires Ruby version 1.9.2 or higher to run (2.0.0 is the currently recommended version). If you want to use Lingo on Ruby 1.8, please refer to the legacy version (see below).
Since Lingo depends on native extensions, you need to make sure that development files for your Ruby version are installed. On Debian-based Linux platforms they are included in the package ruby1.9.1-dev; other distributions may have a similarly named package. On Windows those development files are currently not required.
Prior to version 1.8.0, Lingo expected to be run from its installation directory. This is no longer necessary. But if you prefer that use case, you can either download and extract an archive file or unpack the Gem archive (gem unpack lingo); or you can install the legacy version of Lingo (see below).
Dictionary and configuration file lookup
Lingo will search different locations to find dictionaries and configuration files. By default, these are the current directory, your personal Lingo directory (~/.lingo) and the installation directory (in that order). You can control this lookup path by either moving files up the chain (using the lingoctl executable) or by setting various environment variables.
With lingoctl you can copy dictionaries and configuration files from your personal Lingo directory or the installation directory to the current directory so you can modify them and they will take precedence over the original ones. See lingoctl --help for usage information.
To change the search path itself, you can define the LINGO_PATH environment variable as a whole or its individual parts LINGO_CURR (the local Lingo directory), LINGO_HOME (your personal Lingo directory), and LINGO_BASE (the system-wide Lingo directory).
Inside any of these directories, dictionaries and configuration files are typically organized in the following directory structure:
config | Configuration files (*.cfg).
dict | Dictionary source files (*.txt); in language-specific subdirectories (de, en, …).
lang | Language definition files (*.lang).
store | Compiled dictionaries, generated from source files.
For compatibility reasons, however, these naming conventions are not enforced.
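The lookup order described in this section can be sketched as follows. This is an illustration under stated assumptions, not Lingo's actual lookup code: search_path and find_file are hypothetical methods, install_dir is a hypothetical parameter standing in for the installation directory, and LINGO_PATH is assumed to hold directories joined by the platform's path separator.

```ruby
# Builds the search path from the environment variables named above.
# Assumptions: LINGO_PATH holds directories joined by the platform's
# path separator; install_dir (hypothetical) is the installation
# directory, which this sketch cannot determine by itself.
def search_path(env = ENV, install_dir:)
  return env['LINGO_PATH'].split(File::PATH_SEPARATOR) if env['LINGO_PATH']

  [
    env.fetch('LINGO_CURR') { Dir.pwd },
    env.fetch('LINGO_HOME') { File.expand_path('~/.lingo') },
    env.fetch('LINGO_BASE') { install_dir },
  ]
end

# Returns the first match for name below subdir (config, dict, lang or
# store) in the given directories, or nil if there is none.
def find_file(name, subdir, dirs)
  dirs.map { |dir| File.join(dir, subdir, name) }
      .find { |path| File.file?(path) }
end
```

Since the first match wins, a file copied into the current directory shadows the version in the personal or installation directory, which is the effect that moving files up the chain with lingoctl relies on.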
Legacy version
As Lingo 1.8 introduced some major breaking changes and no longer runs on Ruby 1.8, there is a maintenance branch for Lingo 1.7.x that will remain compatible with both Ruby 1.8 and the line of Lingo releases prior to 1.8. This branch will receive occasional bug fixes and minor feature updates. However, the bulk of the development efforts will be directed towards Lingo 1.8+.
To install the legacy version, download and extract the ZIP archive from RubyForge. No additional dependencies are required. This version of Lingo works with both Ruby 1.8 (1.8.5 or higher) and 1.9 (1.9.2 or higher).
The executable is named lingo.rb. It's located at the root of the installation directory and may only be run from there. See ruby lingo.rb -h for usage instructions.
Configuration and language definition files are also located at the root of the installation directory (*.cfg and *.lang, respectively). Dictionary source files are found in language-specific subdirectories (de, en, …) and are named *.txt. The compiled dictionaries are found beneath these subdirectories in a directory named store.
FILE FORMATS
Lingo uses three different types of files to determine its behaviour. Configuration files control the details of the indexing process. Language definitions specify grammar rules and dictionaries available for indexing. Dictionaries, finally, hold the vocabulary used in indexing the input text and producing the results.
Configuration
TODO…
Language definition
TODO…
Dictionaries
TODO…
ISSUES AND CONTRIBUTIONS
If you find bugs or want to suggest new features, please write to the mailing list or report them on GitHub. Include your Ruby version (ruby --version) and the version of Lingo you are using (typically lingo --version, provided it's new enough to support that flag).
If you want to contribute to Lingo, please fork the project on GitHub and submit a pull request (bonus points for topic branches) or clone the repository locally and send your formatted patch to the developer list.
To make sure that Lingo's tests pass, install hen (typically gem install hen) and all development dependencies (either gem install --development lingo or manually; see rake gem:dependencies). Then run rake test for the basic tests or rake test:all for the full test suite.
LINKS
Website |
|
Demo |
|
Documentation |
|
Source code |
|
RubyGem |
|
RubyForge project |
|
Mailing list |
|
Bug tracker |
LITERATURE
- Lepsky, K., Vorhauer, J.: Lingo: ein open source System für die automatische Indexierung deutschsprachiger Dokumente. (German) In: ABI Technik 26, 2006. p. 18-29.
- Gödert, W., Lepsky, K., Nagelschmidt, M.: Informationserschließung und Automatisches Indexieren: ein Lehr- und Arbeitsbuch. (German) Berlin etc.: Springer, 2012.
- Nohr, H.: Grundlagen der automatischen Indexierung: ein Lehrbuch. (German) Berlin: Logos, 2005.
- Hausser, R.: Grundlagen der Computerlinguistik. Mensch-Maschine-Kommunikation in natürlicher Sprache. (German) Berlin etc.: Springer, 2000.
- Allen, J.: Natural language understanding. (English) Redwood City, CA: Benjamin/Cummings, 1995.
- Grishman, R.: Computational linguistics: an introduction. (English) Cambridge: Cambridge Univ. Press, 1986.
- Salton, G., McGill, M.: Introduction to modern information retrieval. (English) New York etc.: McGraw-Hill, 1983.
- Porter, M.: An algorithm for suffix stripping. (English) In: Program 14, 1980. p. 130-137.
CREDITS
Lingo is based on a collective development by Klaus Lepsky and John Vorhauer.
Authors
- John Vorhauer <lingo@vorhauer.de>
- Jens Wille <jens.wille@gmail.com>
Contributors
- Klaus Lepsky <klaus@lepsky.de>
- Jan-Helge Jacobs <plancton@web.de>
- Thomas Müller <thomas.mueller@gesis.org>
- Yulia Dorokhova <jdorokhova@hse.ru>
LICENSE AND COPYRIGHT
Copyright (C) 2005-2007 John Vorhauer
Copyright (C) 2007-2013 John Vorhauer, Jens Wille
Lingo is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
Lingo is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License along with Lingo. If not, see <www.gnu.org/licenses/>.