% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/textcleaner.R
\name{textcleaner}
\alias{textcleaner}
\title{Text Cleaner}
\usage{
textcleaner(data, miss = 99, partBY = c("row", "col"),
  dictionary = NULL, tolerance = 1)
}
\arguments{
\item{data}{Matrix or data frame.
A dataset of text data.
Participant IDs should be set as the row or
column names, depending on whether participants
are by row or by column (see argument \code{partBY}).
If no IDs are provided, then their order in the corresponding
rows (or columns) is used.
A message will notify the user how IDs were assigned}

\item{miss}{Numeric or character.
Value for missing data.
Defaults to \code{99}}

\item{partBY}{Character.
Are participants by row or column?
Set to \code{"row"} for by row.
Set to \code{"col"} for by column}

\item{dictionary}{Character vector.
Can be a vector of a corpus or any text for comparison.
Dictionary to be used for more efficient text cleaning.
Defaults to \code{NULL}, which will use \code{\link[SemNetDictionaries]{general.dictionary}}

Use \code{dictionaries()} or \code{find.dictionaries()} for more options
(See \code{\link{SemNetDictionaries}} for more details)}

\item{tolerance}{Numeric.
The distance tolerance set for automatic spell-correction purposes.
This function uses the function \code{\link[stringdist]{stringdist}}
to compute the \href{https://en.wikipedia.org/wiki/Damerau-Levenshtein_distance}{Damerau-Levenshtein}
(DL) distance, which is used to determine potential best guesses.

If exactly one word (i.e., \emph{n} = 1) is within the distance tolerance, then it is
automatically output as the best guess response, which is then passed through
\code{\link[SemNetCleaner]{word.check.wrapper}}. If more than one word
is within the distance tolerance, then these words are provided as potential
options.

The recommended and default distance tolerance is \code{tolerance = 1},
which only spell corrects a word if there is only one word with a DL distance of 1.}
}
\value{
This function returns a list containing the following objects:

\item{binary}{A matrix of responses where each row represents a participant
and each column represents a unique response. A response that a participant has provided is a '\code{1}'
and a response that a participant has not provided is a '\code{0}'}

\item{responses}{A list containing two objects:

\itemize{

\item{clean.resp}{A response matrix that has been spell-checked and de-pluralized with duplicates removed.
This can be used as a final dataset for analyses (e.g., fluency of responses)}

\item{orig.resp}{The original response matrix with white space before and
after each response removed and all upper-case letters converted to lower case}

}

}

\item{spellcheck}{A list containing three objects:

\itemize{

\item{\code{full}}
{All responses regardless of spell-checking changes}

\item{\code{unique}}
{Only responses that were changed during spell-check (includes
correct responses that were changed to singular form and lower case)}

\item{\code{auto}}
{Only the incorrect responses that were changed during spell-check}

}

}

\item{removed}{A list containing two objects: 

\itemize{

\item{\code{rows}}
{Identifies removed participants by their row (or column) location in the original data file}

\item{\code{ids}}
{Identifies removed participants by their ID (see argument \code{data})}

}

}

\item{partChanges}{A list in which each element corresponds to a participant and contains each
response that has been changed. Participants are identified by their ID (see argument \code{data}).
This can be used to replicate the cleaning process and to keep track of changes more generally.
Participants with \code{NA} did not have any changes from their original data
and participants with missing data are removed (see \code{removed$ids})}
}
\description{
An automated cleaning function for spell-checking, de-pluralizing,
removing duplicates, and binarizing text data
}
\details{
When working through the menu options in \code{\link[SemNetCleaner]{textcleaner}},
you may make mistakes, for instance, selecting \code{REMOVE} for a response when
you meant to \code{RENAME} it. There are two options:

RECOMMENDED

1. You can make a note in your \code{R} script for the change you wanted
to make (you can keep moving through the cleaning process).
After the cleaning process is through, you can check the \code{spellcheck$unique}
output of \code{\link[SemNetCleaner]{textcleaner}} to see what changes
you made. To correct any changes you made in the cleaning process,
you can use the \code{\link[SemNetCleaner]{corr.chn}} function

NOT RECOMMENDED

2. You can use \code{esc} to exit out of a menu selection process.
This is NOT recommended because you will lose all changes that
you've made up to that point
}
\examples{
# Toy example
raw <- open.animals[c(1:10),-c(1,2)]

# Clean and preprocess data
clean <- textcleaner(raw, partBY = "row", dictionary = "animals")
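
# Sketch of how the distance tolerance narrows down best guesses.
# Illustration only: the toy dictionary is hypothetical, and method = "dl"
# is an assumption about how textcleaner calls stringdist internally
if (requireNamespace("stringdist", quietly = TRUE)) {
  toy.dictionary <- c("cat", "cow", "dog", "horse")
  # Damerau-Levenshtein distance from a misspelled response to each word
  dl <- stringdist::stringdist("hrose", toy.dictionary, method = "dl")
  # Words within tolerance = 1; a single candidate would be the best guess
  candidates <- toy.dictionary[dl <= 1]
}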
\donttest{
# Full test
clean <- textcleaner(open.animals[,-1], partBY = "row", dictionary = "animals")
}
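
# Sketch of the layout of the 'binary' output, built by hand in base R.
# The toy responses and IDs are hypothetical, and this is not necessarily
# how textcleaner constructs the matrix internally
toy.resp <- list(ID1 = c("cat", "dog"), ID2 = c("dog", "horse"))
toy.words <- sort(unique(unlist(toy.resp)))
# One row per participant, one column per unique response (1 = provided)
toy.binary <- t(sapply(toy.resp, function(x) as.numeric(is.element(toy.words, x))))
colnames(toy.binary) <- toy.words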

}
\references{
Hornik, K., & Murdoch, D. (2011).
Watch your spelling!
\emph{The R Journal}, \emph{3}(2), 22-28.
doi:\href{https://doi.org/10.32614/RJ-2011-014}{10.32614/RJ-2011-014}
}
\author{
Alexander Christensen <alexpaulchristensen@gmail.com>
}
