This CRAN task view contains a list of packages useful for natural language processing.
Side-note on text mining: In recent years, we have elaborated a framework to be used in
packages dealing with the processing of written material: the package
tm.
Extension packages in this area are strongly encouraged to interface with tm's basic routines,
and developers are cordially invited to join in the discussion on further developments of this
framework package.
Phonetics and Speech Processing:
-
emu
is a collection of tools for the creation, manipulation, and analysis of speech databases. At the core of EMU is a database search engine which allows the researcher to find various speech segments based on the sequential and hierarchical structure of the utterances in which they occur. EMU includes an interactive labeller which can display spectrograms and other speech waveforms, and which allows the creation of hierarchical, as well as sequential, labels for a speech utterance.
Lexical Databases:
-
wordnet
provides an R interface
to
WordNet
, a large
lexical database of English.
Keyword Extraction and General String Manipulation:
-
R's base package already provides a rich set of character manipulation
routines. See
help.search(keyword = "character", package = "base")
for more information on these capabilities.
-
RKEA
provides an R interface to
KEA
(Version 5.0). KEA (for
Keyphrase Extraction Algorithm) allows for extracting keyphrases from
text documents. It can be used either for free indexing or for indexing
with a controlled vocabulary.
-
gsubfn
can be used for certain parsing tasks such as
extracting words from strings by content rather than by delimiters.
demo("gsubfn-gries")
shows an example of this in a natural language
processing context.
-
tau
contains basic string manipulation and analysis routines needed in text processing such as dealing with character encoding, language, pattern counting, and tokenization.
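As a rough illustration of the routines mentioned above, the following sketch uses only base R to tokenize a string in two ways: by splitting on delimiters and by extracting words by content. The sample sentence and all variable names are invented for this example; it does not use gsubfn or tau.

```r
# Base R only; the sample sentence is made up for illustration.
x <- "The quick brown fox jumps over the lazy dog; the dog sleeps."
x_lower <- tolower(x)

# Split on delimiters: anything that is not a lowercase letter separates tokens.
tokens_by_delim <- strsplit(x_lower, "[^a-z]+")[[1]]

# Extract by content: pull out every run of lowercase letters directly.
tokens_by_content <- regmatches(x_lower, gregexpr("[a-z]+", x_lower))[[1]]

# Pattern counting: how often does each token occur?
freq <- sort(table(tokens_by_content), decreasing = TRUE)
print(freq)
```

Both approaches yield the same tokens here. Extraction by content tends to be easier to adapt when the delimiters are irregular.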
Natural Language Processing:
-
openNLP
provides an R interface
to
OpenNLP
, a
collection of natural language processing tools including a
sentence detector, tokenizer, pos-tagger, shallow and full
syntactic parser, and named-entity detector, using the Maxent
Java package for training and using maximum entropy
models.
-
openNLPmodels.en
ships trained models for English and
openNLPmodels.es
for Spanish to be used
with
openNLP.
-
RWeka
is an interface
to
Weka
which is a collection of machine learning algorithms for data
mining tasks written in Java. Especially useful in the context
of natural language processing is its functionality for
tokenization and stemming.
-
Snowball
provides the Snowball stemmers, which include the Porter
stemmer and several other stemmers for different
languages. See
the
Snowball
webpage for details.
-
Rstem
is an alternative interface to a C version of Porter's word
stemming algorithm.
-
KoNLP
provides a collection of conversion routines (e.g. Hangul to Jamos),
stemming, and part-of-speech tagging by interfacing with Lucene's HanNanum analyzer.
As of version 0.0-8.0, the documentation is sparse and still needs work.
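To make the terms sentence detection and tokenization concrete, here is a deliberately naive base R sketch. It is not how openNLP or any other package listed above works (those rely on trained models); the regular expressions and the sample text are assumptions made for illustration only.

```r
# Made-up sample text for illustration.
text <- "R is widely used. It has many packages! Does it support NLP?"

# Naive sentence detection: a run of non-terminators followed by ., ! or ?.
sentences <- trimws(regmatches(text, gregexpr("[^.!?]+[.!?]", text))[[1]])

# Naive tokenization: every run of alphabetic characters is a token.
tokens <- regmatches(sentences, gregexpr("[[:alpha:]]+", sentences))
names(tokens) <- sentences

print(tokens)
```

A rule this simple breaks on abbreviations such as "e.g.", which is one reason model-based tools are usually preferred for real text.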
String Kernels:
-
kernlab
allows creating and computing with string kernels, like full string,
spectrum, or bounded range string kernels. It can directly use
the document format used
by
tm
as input.
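The idea behind a spectrum kernel can be sketched in a few lines of base R: two strings are compared by counting the substrings of length k they share. The helper names kmer_counts and spectrum_kernel are hypothetical, and this is an unnormalized toy version, not kernlab's implementation.

```r
# Hypothetical helper: counts of all contiguous substrings of length k.
kmer_counts <- function(s, k) {
  n_starts <- max(nchar(s) - k + 1, 0)
  starts <- seq_len(n_starts)
  table(substring(s, starts, starts + k - 1))
}

# Unnormalized spectrum kernel: inner product of the two k-mer count vectors.
spectrum_kernel <- function(s, t, k = 2) {
  a <- kmer_counts(s, k)
  b <- kmer_counts(t, k)
  shared <- intersect(names(a), names(b))
  sum(as.numeric(a[shared]) * as.numeric(b[shared]))
}

print(spectrum_kernel("banana", "bandana", k = 2))
print(spectrum_kernel("banana", "banana", k = 2))
```

Dividing by the square root of the product of the two self-kernel values would give a normalized similarity between 0 and 1.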
Text Mining:
-
tm
provides a comprehensive text mining framework for
R. The
Journal of Statistical Software
article
Text Mining
Infrastructure in R
gives a detailed overview and presents
techniques for count-based analysis methods, text clustering,
text classification and string kernels.
-
RcmdrPlugin.TextMining
adds a new menu TextMining to R-Commander, a graphical user interface for R.
-
lsa
provides routines for performing a latent semantic analysis with R.
The basic idea of latent semantic analysis (LSA) is
that texts have a higher-order (= latent semantic) structure which,
however, is obscured by word usage (e.g. through the use of synonyms
or polysemy). By using conceptual indices that are derived statistically
via a truncated singular value decomposition (a two-mode factor analysis)
over a given document-term matrix, this variability problem can be overcome.
The article
Investigating
Unstructured Texts with Latent Semantic Analysis
gives a detailed overview and demonstrates the use of the package
with examples from the area of technology-enhanced learning.
-
topicmodels
provides an interface to the C code for Latent Dirichlet Allocation (LDA) models and Correlated Topic Models (CTM) by David M. Blei and co-authors and the C++ code for fitting LDA models using Gibbs sampling by Xuan-Hieu Phan and co-authors.
-
RTextTools
is a machine learning package for automatic
text classification. It implements nine different algorithms (svm, slda,
boosting, bagging, rf, glmnet, tree, nnet, and maxent) and routines supporting
the evaluation of accuracy.
-
textcat
provides support for n-gram based text categorization.
-
corpora
offers utility functions for the statistical analysis of corpus frequency data.
-
languageR
provides data sets and functions exemplifying statistical methods, and some
facilitatory utility functions used in the book by R. H. Baayen: "Analyzing Linguistic Data: a Practical
Introduction to Statistics Using R", Cambridge University Press, 2008.
-
zipfR
offers some statistical models for word frequency distributions. The
utilities include functions for loading, manipulating and visualizing word frequency data and
vocabulary growth curves. The package also implements several statistical models for the
distribution of word frequencies in a population. (The name of this package derives from the
most famous word frequency distribution, Zipf's law.)
-
maxent
is a maximum entropy implementation that minimises memory consumption on very large data sets.
-
TextRegression
predicts valued outputs based on an input matrix and assesses predictive power ('the bag-of-words oracle').
-
wordcloud
provides a visualisation similar to the well-known Wordle word clouds: it distributes features horizontally and vertically in a pleasing layout, with the font size scaled by frequency.
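The truncated singular value decomposition behind latent semantic analysis, as described for lsa above, can be illustrated with base R's svd(). The document-term matrix below is invented, and the sketch does not use the lsa package itself; it only shows how two documents that use different words for the same concept move closer together in the reduced space.

```r
# Toy document-term matrix (rows = documents, columns = terms); counts are invented.
dtm <- matrix(
  c(2, 1, 0, 0,
    1, 2, 0, 0,
    0, 0, 1, 2,
    0, 0, 2, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4),
                  c("car", "automobile", "flower", "blossom"))
)

k <- 2          # number of latent dimensions to keep
s <- svd(dtm)

# Rank-k approximation: keep only the k largest singular values.
dtm_k <- s$u[, 1:k] %*% diag(s$d[1:k], nrow = k) %*% t(s$v[, 1:k])
dimnames(dtm_k) <- dimnames(dtm)

# Document coordinates in the latent space.
doc_coords <- s$u[, 1:k] %*% diag(s$d[1:k], nrow = k)
rownames(doc_coords) <- rownames(dtm)

cosine <- function(a, b) sum(a * b) / sqrt(sum(a * a) * sum(b * b))

# Similarity on raw counts versus similarity in the latent space.
print(cosine(dtm["doc1", ], dtm["doc2", ]))
print(cosine(doc_coords["doc1", ], doc_coords["doc2", ]))
print(cosine(doc_coords["doc1", ], doc_coords["doc3", ]))
```

In this toy case doc1 and doc2, which weight "car" and "automobile" differently, become more similar after truncation, while doc1 and doc3 stay unrelated.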
Import Filters and Data Handling:
-
tm.plugin.dc
allows for distributing corpora across storage devices (local files or Hadoop Distributed File System).
-
tm.plugin.mail
helps with importing mail messages from archive files such as those used by Thunderbird (mbox, eml).