% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/sentiment_engines.R
\name{compute_sentiment}
\alias{compute_sentiment}
\title{Compute document-level sentiment across features and lexicons}
\usage{
compute_sentiment(x, lexicons, how = "proportional", tokens = NULL,
  nCore = 1)
}
\arguments{
\item{x}{either a \code{sentocorpus} object created with \code{\link{sento_corpus}}, a \pkg{quanteda}
\code{\link[quanteda]{corpus}} object, or a \code{character} vector. The latter two do not incorporate a
date dimension. In case of a \code{\link[quanteda]{corpus}} object, the \code{numeric} columns from the
\code{\link[quanteda]{docvars}} are considered as features over which sentiment will be computed. In
case of a \code{character} vector, sentiment is only computed across lexicons.}

\item{lexicons}{a \code{sentolexicons} object created using \code{\link{sento_lexicons}}.}

\item{how}{a single \code{character} string defining how aggregation within documents should be performed. For the currently
available options, see \code{\link{get_hows}()$words}.}

\item{tokens}{a \code{list} of tokenized documents, to specify your own tokenization scheme. Can result from
\pkg{quanteda}'s \code{\link[quanteda]{tokens}} function, the \pkg{tokenizers} package, or another source. Make sure the
tokens are constructed from (the texts from) the \code{x} argument, are unigrams, and are preferably set to lowercase;
otherwise, results may be spurious and errors could occur. By default set to \code{NULL}.}

\item{nCore}{a positive \code{numeric} that will be passed on to the \code{numThreads} argument of the
\code{\link[RcppParallel]{setThreadOptions}} function, to parallelize the sentiment computation across texts. A
value of 1 (default) implies no parallelization. Parallelization is expected to improve speed of the sentiment
computation only for sufficiently large corpora.}
}
\value{
If \code{x} is a \code{sentocorpus} object, a \code{sentiment} object, i.e., a sentiment scores
\code{data.table} with an \code{"id"}, a \code{"date"} and a \code{"word_count"} column,
and all lexicon--feature sentiment scores columns. A \code{sentiment} object can be used for aggregation into
time series with the \code{\link{aggregate.sentiment}} function.

If \code{x} is a \pkg{quanteda} \code{\link[quanteda]{corpus}} object, a sentiment scores
\code{data.table} with an \code{"id"} and a \code{"word_count"} column, and all lexicon--feature
sentiment scores columns.

If \code{x} is a \code{character} vector, a sentiment scores
\code{data.table} with a \code{"word_count"} column, and all lexicon--feature sentiment scores columns.
}
\description{
Given a corpus of texts, computes (net) sentiment per document using the bag-of-words approach
based on the lexicons provided and a choice of aggregation across words per document.
}
\details{
For a separate calculation of positive (resp. negative) sentiment, one has to provide distinct positive (resp.
negative) lexicons. This can be done using the \code{do.split} option in the \code{\link{sento_lexicons}} function, which
splits out the lexicons into a positive and a negative polarity counterpart. All \code{NA}s are converted to 0, under the
assumption that this is equivalent to no sentiment. If \code{tokens = NULL} (as per default), texts are tokenized as
unigrams using the \code{\link[tokenizers]{tokenize_words}} function. Punctuation and numbers are removed, but not
stopwords. The number of words for each document is computed based on that same tokenization. All tokens are converted
to lowercase, in line with what the \code{\link{sento_lexicons}} function does for the lexicons and valence shifters.
}
\section{Calculation}{

If the \code{lexicons} argument has no \code{"valence"} element, the sentiment computed corresponds to simple unigram
matching with the lexicons [\emph{unigrams} approach]. If valence shifters are included in \code{lexicons} with a
corresponding \code{"y"} column, they modify the polarity of a word detected from the lexicon when they
appear right before that word (examples: not good, very bad or can't defend) [\emph{bigrams} approach]. If the valence
table contains a \code{"t"} column, valence shifters are searched for in a cluster around a detected polarity
word [\emph{clusters} approach]. The latter approach is similar to the one used by the \pkg{sentimentr} package,
but simplified. A cluster spans four words before and two words after a polarity word. A cluster never overlaps with
a preceding one. Roughly speaking, the polarity of a cluster is calculated as \eqn{n(1 + 0.80d)S + \sum s}. The polarity
score of the detected word is \eqn{S}, \eqn{s} represents the polarities of any other sentiment words in the cluster, and \eqn{d} is
the difference between the number of amplifiers (\code{t = 2}) and the number of deamplifiers (\code{t = 3}). If there
is an odd number of negators (\code{t = 1}), \eqn{n = -1} and amplifiers are counted as deamplifiers, else \eqn{n = 1}.
All scores, whether per unigram, per bigram or per cluster, are summed within a document, before the scaling defined
by the \code{how} argument is applied. The \code{how = "proportionalPol"} option divides each document's sentiment
score by the number of detected polarized words (counting words that appear multiple times by their frequency), instead
of by the total number of words, as the \code{how = "proportional"} option does. The \code{how = "counts"} option
applies no normalization. See the vignette for more details.
}

\examples{
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")

l1 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")])
l2 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")], list_valence_shifters[["en"]])
l3 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")],
                     list_valence_shifters[["en"]][, c("x", "t")])

# from a sentocorpus object, unigrams approach
corpus <- sento_corpus(corpusdf = usnews)
corpusSample <- quanteda::corpus_sample(corpus, size = 200)
sent1 <- compute_sentiment(corpusSample, l1, how = "proportionalPol")

# from a character vector, bigrams approach
sent2 <- compute_sentiment(usnews[["texts"]][1:200], l2, how = "counts")

# from a corpus object, clusters approach
corpusQ <- quanteda::corpus(usnews, text_field = "texts")
corpusQSample <- quanteda::corpus_sample(corpusQ, size = 200)
sent3 <- compute_sentiment(corpusQSample, l3, how = "counts")
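
# illustrative sketch of the clusters formula n(1 + 0.80d)S + sum(s) from the Calculation
# section, written in base R to make the arithmetic concrete; cluster_polarity() and its
# arguments are hypothetical names, not part of the package, and the handling of negation is
# one reading of the description rather than the package's internal implementation
cluster_polarity <- function(S, s = numeric(0), nNeg = 0, nAmp = 0, nDeamp = 0) {
  oddNegators <- (nNeg - 2 * floor(nNeg / 2)) == 1
  n <- if (oddNegators) -1 else 1
  # with an odd number of negators, amplifiers are counted as deamplifiers
  if (oddNegators) {
    nDeamp <- nDeamp + nAmp
    nAmp <- 0
  }
  d <- nAmp - nDeamp # amplifiers minus deamplifiers
  n * (1 + 0.80 * d) * S + sum(s)
}
cluster_polarity(S = 1, nAmp = 1) # e.g., "very good"
cluster_polarity(S = 1, nNeg = 1, nAmp = 1) # e.g., "not very good"
cluster_polarity(S = -1, s = 1, nDeamp = 1) # deamplified negative word near a positive word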

# from an already tokenized corpus, using the 'tokens' argument
toks <- as.list(quanteda::tokens(corpusQSample, what = "fastestword"))
sent4 <- compute_sentiment(corpusQSample, l1[1], how = "counts", tokens = toks)
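
# illustrative sketch of the three scalings set by the 'how' argument, computed by hand in
# base R on a toy document; the tokens and lexicon scores are made up for illustration, and
# only simple unigram matching is assumed (no valence shifters)
toyTokens <- c("the", "results", "were", "good", "but", "the", "outlook", "is", "bad", "bad")
toyLexicon <- c(good = 1, bad = -1)
toyScores <- unname(toyLexicon[toyTokens]) # NA for tokens not in the lexicon
toyScores[is.na(toyScores)] <- 0 # NAs are converted to 0, i.e., no sentiment
nPolarized <- sum(is.element(toyTokens, names(toyLexicon))) # repeated words count each time
sum(toyScores) # as with how = "counts"
sum(toyScores) / length(toyTokens) # as with how = "proportional"
sum(toyScores) / nPolarized # as with how = "proportionalPol"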

}
\author{
Samuel Borms
}
