The use of Wiktionary in Natural Language Processing

Torsten Zesch

In contrast to Wikipedia, Wiktionary has received little attention from
the NLP research community so far.

I know of its use for subjectivity and polarity classification (Chesley
et al., 2006), and for diachronic phonology (Bouchard et al., 2007).

Alexandre Bouchard, Percy Liang, Thomas Griffiths, and Dan Klein. 2007.
  A probabilistic approach to diachronic phonology.
  In Proceedings of EMNLP-CoNLL, pages 887–896.

Paula Chesley, Bruce Vincent, Li Xu, and Rohini Srihari. 2006.
  Using verbs and adjectives to automatically classify blog sentiment.
  In Proceedings of AAAI-CAAW-06, the Spring Symposia on Computational
  Approaches to Analyzing Weblogs.

If anybody knows of other papers that describe the use of Wiktionary in
NLP, I would be happy to hear about them.

At UKP Lab, we have recently used Wiktionary as a lexical semantic resource for
computing semantic relatedness.

Our main findings are:
* Wiktionary offers an astonishing amount of lexical semantic
  information, but also poses new challenges: because it is constructed
  collaboratively, some entries are incomplete or inconsistent.

* Wiktionary can be used as a substitute for traditional semantic networks
  like Princeton WordNet for some tasks, for example computing semantic
  relatedness. Somewhat surprisingly, it outperforms traditional wordnets
  as well as Wikipedia on this task.

Some recent publications devoted to this issue are:

Zesch, T.; Mueller, C. & Gurevych, I.
  Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary.
  In Proceedings of the Conference on Language Resources and Evaluation
  (LREC), 2008

Recently, collaboratively constructed resources such as Wikipedia and
Wiktionary have been discovered as valuable lexical semantic knowledge
bases with a high potential in diverse Natural Language Processing (NLP)
tasks. Collaborative knowledge bases however significantly differ from
traditional linguistic knowledge bases in various respects, and this
constitutes both an asset and an impediment for research in NLP. This paper
addresses one such major impediment, namely the lack of suitable
programmatic access mechanisms to the knowledge stored in these large
semantic knowledge bases. We present two application programming interfaces
for Wikipedia and Wiktionary which are especially designed for mining the
rich lexical semantic information dispersed in the knowledge bases, and
provide efficient and structured access to the available knowledge. As we
believe them to be of general interest to the NLP community, we have made
them freely available for research purposes.


Zesch, T.; Mueller, C. & Gurevych, I.
  Using Wiktionary for Computing Semantic Relatedness.
  In Proceedings of AAAI, 2008


We introduce Wiktionary as an emerging lexical semantic resource that can be
used as a substitute for expert-made resources in AI applications. We evaluate
Wiktionary on the pervasive task of computing semantic relatedness for English
and German by means of correlation with human rankings and solving word choice
problems. For the first time, we apply a concept vector based measure to a set
of different concept representations like Wiktionary pseudo glosses, the first
paragraph of Wikipedia articles, English WordNet glosses, and GermaNet pseudo
glosses. We show that: (i) Wiktionary is the best lexical semantic resource in
the ranking task and performs comparably to other resources in the word choice
task, and (ii) the concept vector based approach yields the best results on all
datasets in both evaluations.
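
To illustrate the general idea behind a concept vector based measure, here
is a minimal Python sketch. It is not the implementation evaluated in the
paper. It assumes that each word is represented by its tf-idf weight in every
concept text and that relatedness is the cosine of two such vectors. The
glosses in CONCEPTS and the names concept_vector and relatedness are
hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy concept texts. In practice these could be Wiktionary
# pseudo glosses or the first paragraphs of Wikipedia articles.
CONCEPTS = {
    "car": "a motor vehicle with wheels used to drive on a road",
    "truck": "a large motor vehicle used to carry goods on a road",
    "banana": "a long yellow fruit that grows on a tropical plant",
    "apple": "a round fruit that grows on a tree",
}


def concept_vector(word, concepts=CONCEPTS):
    """Map a word to its tf-idf weight in each concept text it occurs in."""
    tokenized = {name: text.lower().split() for name, text in concepts.items()}
    # Document frequency: number of concept texts containing the word.
    df = sum(1 for tokens in tokenized.values() if word in tokens)
    if df == 0:
        return {}
    # Plain idf; a word that occurs in every concept gets weight zero.
    idf = math.log(len(concepts) / df)
    vector = {}
    for name, tokens in tokenized.items():
        tf = tokens.count(word) / len(tokens)
        if tf > 0:
            vector[name] = tf * idf
    return vector


def relatedness(word1, word2, concepts=CONCEPTS):
    """Cosine similarity of the two concept vectors (0.0 if either is empty)."""
    v1 = concept_vector(word1, concepts)
    v2 = concept_vector(word2, concepts)
    dot = sum(v1[c] * v2[c] for c in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


if __name__ == "__main__":
    for pair in [("vehicle", "road"), ("vehicle", "wheels"), ("vehicle", "fruit")]:
        print(pair, round(relatedness(*pair), 3))
```

With these toy glosses, "vehicle" and "road" occur in the same concept texts
and therefore score higher than "vehicle" and "fruit", which share none. A
real setup would need proper tokenization, stopword handling, and a much
larger concept inventory.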


UKP Lab is working on the release of a freely available Java-based API to
access the lexical semantic information contained in Wiktionary.
The release is scheduled for June 2008 at

There is also a new release of the Java-based API for Wikipedia.
It is much faster now and contains a MediaWiki markup parser that
can be used to analyze the contents of a Wikipedia page. The parser
can also be used stand-alone to analyze other web pages written in
MediaWiki markup.
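
As a rough illustration of what analyzing MediaWiki markup involves, the
following Python sketch extracts internal links of the form [[Target]] and
[[Target|label]] from a string. It is not the parser shipped with the API and
does not use it. The function name extract_links is hypothetical, and a
single regular expression like this one does not handle templates, nested
markup, or image links, which is why a real parser is needed.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]].
# Group 1 is the link target, group 2 the optional display label.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]*))?\]\]")


def extract_links(markup):
    """Return (target, label) pairs for simple internal links in the markup."""
    links = []
    for match in WIKILINK.finditer(markup):
        target = match.group(1).strip()
        # Without an explicit label, the target itself is displayed.
        label = (match.group(2) or target).strip()
        links.append((target, label))
    return links


if __name__ == "__main__":
    sample = "A [[dog]] is a [[mammal|furry mammal]]; see [[Dog#Etymology|origin]]."
    for target, label in extract_links(sample):
        print(target, "->", label)
```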


