Text Data Preparation

Import text data into MATLAB® and preprocess it for analysis

Text Analytics Toolbox™ includes tools for processing raw text from sources such as equipment logs, news feeds, surveys, operator reports, and social media. Use these tools to extract text from popular file formats, preprocess raw text, extract individual words or multiword phrases (n-grams), convert text into numerical representations, and build statistical models. For an example showing how to get started, see Prepare Text Data for Analysis.

Text Analytics Toolbox supports English, Japanese, German, and Korean. Most Text Analytics Toolbox functions also work with text in other languages. For more information, see Language Considerations.
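As a rough illustration of the preprocessing workflow described above, the following sketch tokenizes and cleans a few strings using functions from this page. The sample sentences are hypothetical stand-ins for imported text, and the thresholds (2 and 15 characters) are illustrative assumptions rather than recommended settings.

```matlab
% Hypothetical sample text standing in for data imported from files.
textData = [
    "The pump failed twice during the night shift."
    "Operators reported a loud noise from the pump."
    "No faults were logged on the conveyor today."];

% Split each string into tokens (words and punctuation marks).
documents = tokenizedDocument(textData);

% Add part-of-speech tags so that lemmatization can use them.
documents = addPartOfSpeechDetails(documents);

% Remove common stop words such as "the".
documents = removeStopWords(documents);

% Reduce words to their dictionary (lemma) form.
documents = normalizeWords(documents,'Style','lemma');

% Erase punctuation, then remove words with 2 or fewer characters
% and words with 15 or more characters.
documents = erasePunctuation(documents);
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);

% Join the remaining tokens back into one string per document for inspection.
cleanText = joinWords(documents)
```

The order of the steps matters in this sketch: part-of-speech tagging runs on the unmodified tokens, before stop-word removal and lemmatization change the sentence structure.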

Live Editor Tasks

Preprocess Text Data - Preprocess and clean up text data for analysis

Functions

extractFileText - Read text from PDF, Microsoft Word, HTML, and plain text files
extractHTMLText - Extract text from HTML
readPDFFormData - Read data from PDF forms
pdfinfo - PDF file information
writeTextDocument - Write documents to text file
htmlTree - Parsed HTML tree
findElement - Find elements in HTML tree
getAttribute - Read HTML attribute of root node of HTML tree
ismissing - Find HTML trees without values
string - Convert parsed HTML tree to string
tokenizedDocument - Array of tokenized documents for text analysis
erasePunctuation - Erase punctuation from text and documents
eraseTags - Erase HTML and XML tags from text
eraseURLs - Erase HTTP and HTTPS URLs from text
removeStopWords - Remove stop words from documents
removeShortWords - Remove short words from documents or bag-of-words model
removeLongWords - Remove long words from documents or bag-of-words model
removeWords - Remove selected words from documents or bag-of-words model
normalizeWords - Stem or lemmatize words
replaceWords - Replace words in documents
replaceNgrams - Replace n-grams in documents
splitSentences - Split text into sentences
splitParagraphs - Split text into paragraphs
stopWords - List of stop words
decodeHTMLEntities - Convert HTML and XML entities into characters
lower - Convert documents to lowercase
upper - Convert documents to uppercase
context - Search documents for word or n-gram occurrences in context
tokenDetails - Details of tokens in tokenized document array
addSentenceDetails - Add sentence numbers to documents
addPartOfSpeechDetails - Add part-of-speech tags to documents
addLemmaDetails - Add lemma forms of tokens to documents
addLanguageDetails - Add language identifiers to documents
addEntityDetails - Add entity tags to documents
addDependencyDetails - Add grammatical dependency details to documents
addTypeDetails - Add token type details to documents
corpusLanguage - Detect language of text
abbreviations - Table of common abbreviations
topLevelDomains - List of top-level domains
bagOfWords - Bag-of-words model
bagOfNgrams - Bag-of-n-grams model
addDocument - Add documents to bag-of-words or bag-of-n-grams model
removeDocument - Remove documents from bag-of-words or bag-of-n-grams model
removeInfrequentWords - Remove words with low counts from bag-of-words model
removeInfrequentNgrams - Remove infrequently seen n-grams from bag-of-n-grams model
removeNgrams - Remove n-grams from bag-of-n-grams model
removeEmptyDocuments - Remove empty documents from tokenized document array, bag-of-words model, or bag-of-n-grams model
topkwords - Most important words in bag-of-words model or LDA topic
topkngrams - Most frequent n-grams
encode - Encode documents as matrix of word or n-gram counts
tfidf - Term Frequency–Inverse Document Frequency (tf-idf) matrix
join - Combine multiple bag-of-words or bag-of-n-grams models
correctSpelling - Correct spelling of words
editDistance - Find edit distance between two strings or documents
editDistanceSearcher - Edit distance nearest neighbor searcher
knnsearch - Find nearest neighbors by edit distance
rangesearch - Find nearest neighbors by edit distance range
splitGraphemes - Split string into graphemes
docfun - Apply function to words in documents
containsWords - Check if word is member of documents
containsNgrams - Check if n-gram is member of documents
contains - Check if pattern is substring in documents
plus - Append documents
replace - Replace substrings in documents
regexprep - Replace text in words of documents using regular expression
doclength - Length of documents in document array
doc2cell - Convert documents to cell array of string vectors
joinWords - Convert documents to string by joining words
string - Convert scalar document to string vector
textanalytics.unicode.nfc - Unicode composed normalized form (NFC)
textanalytics.unicode.nfd - Unicode decomposed normalized form (NFD)
textanalytics.unicode.nfkc - Unicode compatibility composed normalized form (NFKC)
textanalytics.unicode.nfkd - Unicode compatibility decomposed normalized form (NFKD)
textanalytics.unicode.UTF32 - Unicode UTF-32 string representation
characterCategories - Unicode character categories
hex - Convert UTF-32 representation to hexadecimal values
string - Convert UTF-32 representation to string
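
To show how some of the functions listed above fit together when converting text into a numerical representation, here is a minimal sketch that builds a bag-of-words model and a tf-idf matrix. The three short documents are hypothetical sample data chosen to keep the counts easy to check by hand.

```matlab
% Hypothetical sample documents, already reduced to plain lowercase words.
documents = tokenizedDocument([
    "pump failure pump noise"
    "conveyor noise"
    "pump restart"]);

% Count how often each word occurs in each document.
bag = bagOfWords(documents);

% Compute the tf-idf matrix: one row per document, one column per word.
M = tfidf(bag);

% List the single most frequent word across all documents.
tbl = topkwords(bag,1)
```

Here the vocabulary contains five distinct words, so M is a 3-by-5 sparse matrix, and tbl reports "pump", which occurs three times in total.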

Topics

Import

Preprocessing

Language Support