This package describes and validates formats for storing common object arising in text analysis as native R objects. Representations of a text corpus, document term matrix, and tokenized text are included. The tokenized text format is extensible to include other annotations. There are two versions of the corpus and tokens objects; packages should accept both and return or coerce to at least one of these.corpus (data frame) - A valid corpus data frame object is a data frame with at least two columns. The first column is called doc_id and is a character vector with UTF-8 encoding. Document ids must be unique. The second column is called text and must also be a character vector in UTF-8 encoding. Each individual document is represented by a single row in the data frame. Addition document-level metadata columns and corpus level attributes are allowed but not required.