On Friday I posted an updated version of Syuzhet (1.0.4) to CRAN. This version has been available on GitHub for a while now. Version 1.0.4 adds support for sentiment detection in several languages by using the expanded NRC lexicon from Saif Mohammad. The lexicon includes sentiment values for 13,901 words in each of the following languages: Arabic, Basque, Bengali, Catalan, Chinese_simplified, Chinese_traditional, Danish, Dutch, English, Esperanto, Finnish, French, German, Greek, Gujarati, Hebrew, Hindi, Irish, Italian, Japanese, Latin, Marathi, Persian, Portuguese, Romanian, Russian, Somali, Spanish, Sudanese, Swahili, Swedish, Tamil, Telugu, Thai, Turkish, Ukranian, Urdu, Vietnamese, Welsh, Yiddish, Zulu.

At the time of this release, however, Syuzhet only works with languages that use Latin character sets. This effectively means that “Arabic”, “Bengali”, “Chinese_simplified”, “Chinese_traditional”, “Greek”, “Gujarati”, “Hebrew”, “Hindi”, “Japanese”, “Marathi”, “Persian”, “Russian”, “Tamil”, “Telugu”, “Thai”, “Ukranian”, “Urdu”, “Yiddish” are not supported, even though these languages are part of the extended NRC dictionary and can be accessed via the get_sentiment_dictionary() function. Several of my students who are native speakers of languages other than English, and a few others on Twitter, have told me that the German, French, and Spanish results seem to be good. Your mileage may vary. For details on the lexicon, see NRC Emotion Lexicon.
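As a rough sketch of how one of the supported languages might be used, the snippet below scores two French sentences with the NRC method and then pulls the French portion of the dictionary. It assumes that get_sentiment() and get_sentiment_dictionary() both take a language argument whose values are the lowercase language names; the sample text is made up for illustration.

```r
library(syuzhet)

# Hypothetical sample text (not from the package); accent-free on purpose.
my_french_text <- "J'adore ce livre magnifique. Je redoute la guerre et la haine."

# Split the text into sentences before scoring.
french_sentences <- get_sentences(my_french_text)

# Score each sentence with the NRC lexicon, using the French word list.
# Assumption: languages are named in lowercase, e.g. "french".
french_values <- get_sentiment(french_sentences, method = "nrc", language = "french")
french_values

# Inspect the French entries of the NRC dictionary directly.
french_dict <- get_sentiment_dictionary("nrc", language = "french")
head(french_dict)
```

The same pattern should apply to any of the other Latin-alphabet languages by changing the language value.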

Also in this release is support for user-created lexicons. To use one, create your own custom lexicon as a data frame with at least two columns named “word” and “value.” Here is a simplified example:
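A minimal sketch of such a lexicon in use, assuming get_sentiment() accepts method = "custom" together with a lexicon argument, as described in the package documentation. The sample text and the four word values are made up for illustration.

```r
library(syuzhet)

# Hypothetical sample text for illustration.
my_text <- "I love when I see something beautiful. I hate it when ugly feelings creep into my head."
char_v <- get_sentences(my_text)

# A custom lexicon is a data frame with (at least) "word" and "value" columns.
custom_lexicon <- data.frame(
  word  = c("love", "hate", "beautiful", "ugly"),
  value = c(1, -1, 1, -1),
  stringsAsFactors = FALSE
)

# Score each sentence against the custom lexicon instead of a built-in one.
my_custom_values <- get_sentiment(char_v, method = "custom", lexicon = custom_lexicon)
my_custom_values
```

Words that do not appear in the custom lexicon contribute nothing to a sentence's score, so a small lexicon like this one only reacts to the four listed words.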

With contributions from Philip Bulsink, support for parallel processing was added, so one can now call get_sentiment() and pass it a cluster created with parallel::makeCluster() to get results more quickly on systems with multiple cores.
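A minimal sketch of the parallel call, assuming the cluster is passed through an argument named cl and that the package is installed for the worker processes. The two-worker cluster and the sample text are arbitrary choices for illustration; on a real corpus you would size the cluster to your machine.

```r
library(syuzhet)
library(parallel)

# Hypothetical sample text for illustration.
my_text <- "I love when I see something beautiful. I hate it when ugly feelings creep into my head."
char_v <- get_sentences(my_text)

# Start a small cluster; two workers is an arbitrary choice for this sketch.
cl <- makeCluster(2)

# Pass the cluster to get_sentiment() so sentences are scored in parallel.
# Assumption: the cluster argument is named `cl`.
par_values <- get_sentiment(char_v, cl = cl)

# Always release the worker processes when finished.
stopCluster(cl)

par_values
```

For a text this short the cluster start-up cost will outweigh any gain; the benefit should only show up on long texts.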

Thanks also to Jennifer Isasi, Tyler Rinker, “amrrs,” and Oliver Keyes for recent suggestions/contributions/QA.

Examples of how to use these new functions and languages are in the updated vignette.