| Title: | Chinese Text Segmentation, POS Tagging, and Keyword Extraction for R |
|---|---|
| Description: | Provides fast Chinese text segmentation, keyword extraction via 'TF-IDF' and 'TextRank', and part-of-speech tagging, powered by a 'Rust' backend ('jieba-rs'). Supports custom dictionaries, user words, stop words, IDF files, and HMM models, with parallel batch processing of multiple strings. Serves as a modern, maintained replacement for the 'jiebaR' package. |
| Authors: | Hao Cheng [aut, cre, cph] |
| Maintainer: | Hao Cheng <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.0 |
| Built: | 2026-07-02 16:45:03 UTC |
| Source: | https://github.com/Yousa-Mirage/jiebaRS |
Count contiguous n-grams from a segmented character vector or from each element of a list of segmented character vectors.
This function is the recommended replacement for jiebaR::get_tuple(), which
is deprecated in jiebaRS. See Details for more information.
count_ngrams(x, ..., n = 2, sep = " ", sort = TRUE, format = c("data.frame", "vector"))
x |
A character vector of tokens or a list of character vectors. |
... |
Must be empty. This enforces that optional arguments such as |
n |
A positive integer or integer vector giving the n-gram sizes to
count. The default is |
sep |
Separator inserted between tokens when constructing the n-gram
label. The default is |
sort |
Whether to sort results by descending frequency. The default
is |
format |
Output format. |
The original jiebaR::get_tuple() interface has several design problems:
Its n-gram extraction behavior does not match the most obvious reading of
the argument name: size = n counts contiguous n-grams of every size from
2 to n, not just the exact size n.
Its documentation says it accepts list input, but the original exported implementation does not reliably support lists.
It concatenates tokens without a separator, which makes tuple boundaries ambiguous.
count_ngrams() addresses these issues with more explicit and more
flexible parameters. It is also about 1.3x to 2.0x faster than
jiebaR::get_tuple().
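The difference between counting one exact size and counting a range of sizes can be sketched in base R. This is only an illustration of contiguous n-gram counting: `sketch_ngrams` and its `ngram`/`count` column names are hypothetical and are not taken from the jiebaRS implementation.

```r
# Illustrative base-R sketch of contiguous n-gram counting.
# `sketch_ngrams` is a hypothetical helper, not part of jiebaRS.
sketch_ngrams <- function(tokens, n = 2, sep = " ") {
  # Build the labels for every requested size k in `n`.
  labels <- unlist(lapply(n, function(k) {
    if (length(tokens) < k) return(character(0))
    starts <- seq_len(length(tokens) - k + 1)
    vapply(
      starts,
      function(i) paste(tokens[i:(i + k - 1)], collapse = sep),
      character(1)
    )
  }))
  if (length(labels) == 0) {
    return(data.frame(ngram = character(0), count = integer(0)))
  }
  counts <- table(labels)
  out <- data.frame(
    ngram = names(counts),
    count = as.integer(counts),
    stringsAsFactors = FALSE
  )
  # Sort by descending frequency.
  out[order(-out$count), , drop = FALSE]
}

tokens <- c("a", "b", "b", "b", "a")
exact_bigrams <- sketch_ngrams(tokens, n = 2)    # only size 2
bi_and_tri    <- sketch_ngrams(tokens, n = 2:3)  # sizes 2 and 3 together
```

Passing `n = 2:3` here corresponds to what the legacy `size = 3` reading implies, while `n = 3` alone would count only trigrams.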
N-gram counts in the requested format.
count_ngrams(c("\u6211", "\u7231", "R"), n = 2)
count_ngrams(c("\u6211", "\u7231", "R"), n = 1:2, format = "data.frame")
count_ngrams(c("a", "b", "b", "b", "a"), n = 1, sort = FALSE)
count_ngrams(list(c("a", "b", "c"), c("a", "b")), n = 2)
Remove selected words from a segmented character vector or from each element of a list of segmented character vectors.
filter_segment(input, filter_words, keep_na = TRUE)
input |
A character vector or a list of character vectors. |
filter_words |
A character vector of words to remove. |
keep_na |
Whether to keep |
This is a modern reimplementation of jiebaR::filter_segment() with the
same core filtering behavior under the default settings.
In the reproducible benchmark, this version is about 110x to 140x
faster than jiebaR::filter_segment() on the tested workloads.
An object with the same shape as input, with matching words
removed.
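The shape-preserving behavior described above can be sketched in base R. This is a rough illustration only: `filter_sketch` is a hypothetical helper, and it assumes that `keep_na` decides whether `NA` elements survive the filter, which the truncated argument description does not confirm.

```r
# Illustrative base-R sketch; `filter_sketch` is hypothetical and only
# mimics the documented shape of the result (vector in, vector out;
# list in, list out).
filter_sketch <- function(input, filter_words, keep_na = TRUE) {
  filter_one <- function(x) {
    drop <- x %in% filter_words
    # Assumption: keep_na = FALSE additionally drops NA elements.
    if (!keep_na) drop <- drop | is.na(x)
    x[!drop]
  }
  if (is.list(input)) lapply(input, filter_one) else filter_one(input)
}

vec_result  <- filter_sketch(c("a", NA, "b", "a"), "b", keep_na = FALSE)
list_result <- filter_sketch(list(c("x", "y"), c("y", "z")), "y")
```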
filter_segment(c("abc", "def", " ", "."), c("abc"))
filter_segment(c("a", NA, "b", "a"), c("b"), keep_na = FALSE)
input <- list(
  c("\u6211", "\u662f", "\u6d4b\u8bd5"),
  c("\u6d4b\u8bd5", "\u6587\u672c", "\u6211")
)
filter_segment(input, "\u6211")
Count the frequency of each word in a character vector.
freq(x, ..., sort = FALSE)
x |
A character vector of words. |
... |
Must be empty. This enforces that optional arguments such as
|
sort |
Whether to sort the result by descending frequency. The default
|
A data frame with char and freq columns.
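For orientation, a comparable result can be built with base R's `table()`. This is a sketch of the documented output shape only; the row order that freq() uses when `sort = FALSE` is not stated above, so the alphabetical order produced by `table()` here is an artifact of the sketch.

```r
# Illustrative base-R sketch of a word-frequency table with the same
# column names (char, freq) as the documented return value.
words <- c("b", "a", "b", "c", "a")
counts <- table(words)
freq_sketch <- data.frame(
  char = names(counts),
  freq = as.integer(counts),
  stringsAsFactors = FALSE
)
# Descending-frequency ordering, as sort = TRUE is described to do.
freq_sorted <- freq_sketch[order(-freq_sketch$freq), , drop = FALSE]
```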
freq(c("b", "a", "b", "c", "a"))
freq(c("b", "a", "b", "c", "a"), sort = TRUE)
Generate an IDF dictionary from a list of documents.
get_idf(x, stop_word = NULL, stop_word_file = NULL, path = NULL)
x |
a list of character vectors. Each vector represents a document of already-segmented words. |
stop_word |
Optional character vector of stop words supplied directly. |
stop_word_file |
Optional file path containing one stop word per line. |
path |
Optional output file path. When |
The input must be a list of character vectors, each holding the words of one document.
Stop words are removed from the result.
If path is not NULL, the result is written to that file.
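As background, inverse document frequency is commonly defined as the logarithm of the number of documents divided by the number of documents containing a term. The sketch below uses that textbook definition in base R; the exact formula, smoothing, and output columns used by get_idf() are not specified above and may differ.

```r
# Illustrative base-R sketch of a textbook IDF: log(N / document frequency).
# This is an assumption for illustration, not the jiebaRS formula.
docs <- list(
  c("abc", "def"),
  c("abc", "ghi"),
  c("abc", "def", "the")
)
stop_words <- c("the")

# Each word counts at most once per document; stop words are dropped.
doc_terms <- lapply(docs, function(d) setdiff(unique(d), stop_words))
doc_freq <- table(unlist(doc_terms))

idf <- log(length(docs) / as.numeric(doc_freq))
names(idf) <- names(doc_freq)
```

A word that appears in every document ("abc" here) gets an IDF of zero under this definition, which is why very common words contribute little to TF-IDF keyword scores.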
A data frame with name and count columns, or a file path
(invisibly) when path is supplied.
get_idf(list(c("abc", "def"),c("abc", " ")))
jiebaR::get_tuple()
get_tuple() is kept only for compatibility with jiebaR. New code should
use count_ngrams() instead.
get_tuple(x, size = 2, dataframe = TRUE)
x |
A character vector of tokens or a list of character vectors. |
size |
A single integer >= 2. The compatibility semantics count all
contiguous n-grams from 2 up to |
dataframe |
Whether to return a data frame. If |
This function is deprecated and should not be used in new code.
It is provided only as a compatibility wrapper around count_ngrams()
and replicates the behavior of jiebaR::get_tuple().
Prefer count_ngrams() because the original jiebaR::get_tuple() interface
has several design problems:
Its n-gram extraction behavior does not match the most obvious reading of
the argument name: size = n counts contiguous n-grams of every size from
2 to n, not just the exact size n.
Its documentation says it accepts list input, but the original exported implementation does not reliably support lists.
It concatenates tokens without a separator, which makes tuple boundaries ambiguous.
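The boundary ambiguity from the last point is easy to reproduce with plain `paste()`. The token vectors below are made up for illustration.

```r
# Two different token pairs collapse to the same label when no
# separator is used, so the original token boundaries are lost.
left  <- c("ab", "c")
right <- c("a", "bc")

no_sep_left  <- paste(left, collapse = "")
no_sep_right <- paste(right, collapse = "")

# With a separator the two labels stay distinguishable.
sep_left  <- paste(left, collapse = " ")
sep_right <- paste(right, collapse = " ")
```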
If dataframe = TRUE, a data frame with name and count columns,
sorted by descending count. Otherwise, a named integer vector.
suppressWarnings(get_tuple(c("sd", "sd", "sd", "rd"), 2))
Extract TF-IDF keywords from a single in-memory string with a keyword worker
created by worker(). This is separate from textrank(), which uses
TextRank weighting.
keywords(code, jiebar, ..., format = c("vector", "data.frame", "legacy"))
code |
A single character string to analyze. |
jiebar |
A |
... |
Must be empty. This enforces that optional arguments such as
|
format |
Output format. |
Keyword results in the requested format.
Convenience wrapper around keywords() that always returns a data frame with
term and weight columns.
keywords_df(x, jiebar)
x |
A single character string to analyze. |
jiebar |
A |
A data frame with term and weight columns.
Add one or more custom words to a jieba worker.
new_user_word(worker, words, tags = "n", freq = NULL)
add_word(worker, words, tags = "n", freq = NULL)
worker |
A |
words |
A single string or a character vector of new words. |
tags |
A single tag or a character vector of tags. Defaults to |
freq |
Optional non-negative integer frequency or integer vector of
frequencies. Defaults to |
cutter <- worker()
segment("\u91cf\u5b50\u673a\u5668\u72d7", cutter)
new_user_word(cutter, "\u91cf\u5b50\u673a\u5668\u72d7", tags = "n", freq = 1000L)
segment("\u91cf\u5b50\u673a\u5668\u72d7", cutter)
cutter2 <- worker()
add_word(
  cutter2,
  c("\u8d85\u5bfc\u91cf\u5b50\u6bd4\u7279", "\u91cf\u5b50\u673a\u5668\u72d7"),
  tags = c(NA, "n"),
  freq = c(NA, 1000L)
)
segment("\u8d85\u5bfc\u91cf\u5b50\u6bd4\u7279", cutter2)
Segment one or more strings with a jieba_worker created by worker().
segment(code, jiebar, ..., mod = NULL, batch = c("list", "data.frame", "flatten"))
code |
A character vector to segment. |
jiebar |
A |
... |
Must be empty. This enforces that optional arguments such as
|
mod |
Deprecated Compatibility argument retained from |
batch |
Batch aggregation mode for multi-string input. Must be
one of |
For a single input string, segment() always returns a character vector of
segmented tokens.
In the current release benchmarks on the bundled Fortress Besieged and
Dream of the Red Chamber texts, jiebaRS::segment() is about 1.7x to
1.9x faster than jiebaR::segment() when each novel is segmented as one
long string. When the input is many short strings segmented in parallel,
jiebaRS::segment() reaches about 7x to 12x speedup over jiebaR.
For very long texts, splitting into about 32 to 128 chunks before segmentation is recommended for good throughput.
For multiple input strings, the argument batch controls how the
per-string token vectors are aggregated:
"list": one character vector per input string.
"data.frame": a data frame with doc_id and word columns.
"flatten": all token vectors concatenated into one character vector.
When batch is omitted, jiebaRS returns list output for multi-string
input.
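The relationship between the three aggregation modes can be sketched in base R from a hand-written list of token vectors. The tokens below are placeholders rather than real segmentation output, and the integer type of `doc_id` is an assumption.

```r
# Illustrative base-R sketch: how the three batch shapes relate to each
# other, starting from one token vector per input string.
per_string <- list(c("a", "b", "c"), c("d", "e"))

# "list": one character vector per input string.
as_list <- per_string

# "data.frame": one row per token, tagged with its source string.
as_df <- data.frame(
  doc_id = rep(seq_along(per_string), lengths(per_string)),
  word = unlist(per_string),
  stringsAsFactors = FALSE
)

# "flatten": all tokens concatenated, source string no longer recoverable.
as_flat <- unlist(per_string)
```

The data frame form is usually the most convenient starting point for per-document aggregation, since it keeps the document boundary that "flatten" discards.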
The mod argument from jiebaR::segment() is retained only as a deprecated
compatibility placeholder. In jiebaRS, segmentation behavior should be
controlled by the worker type itself (for example, worker(type = "mix") or
worker(type = "query")), not by mutating behavior at call time. When mod
is supplied, jiebaRS warns and ignores it.
Segmented tokens in the requested aggregation form.
seg <- worker()
text1 <- "\u5357\u4eac\u5e02\u957f\u6c5f\u5927\u6865"
text2 <- "\u8fd9\u662f\u4e00\u4e2a\u6d4b\u8bd5"
segment(text1, seg)
segment(c(text1, text2), seg, batch = "list")
segment(c(text1, text2), seg, batch = "data.frame")
Convenience wrapper around segment() for multi-string input. When
batch is omitted, segment_batch() will return list output by default.
segment_batch(texts, jiebar, ..., batch = c("list", "data.frame", "flatten"))
texts |
A character vector of strings to segment. |
jiebar |
A |
... |
Must be empty. This enforces that optional arguments such as
|
batch |
Batch aggregation mode. Must be one of |
segment_batch() is a convenience wrapper around segment() for explicit
batch processing. It always treats texts as multi-string input. The
returned object depends on batch:
"list": one character vector per input string.
"data.frame": a data frame with doc_id and word columns.
"flatten": one concatenated character vector.
In the current release benchmarks on the bundled Fortress Besieged and
Dream of the Red Chamber texts, batch segmentation reaches about 7x to
12x speedup over the comparable jiebaR workflow on many-string inputs.
For very long texts, splitting into about 32 to 128 chunks before calling
segment_batch() is recommended for good throughput.
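One simple way to prepare such chunks is to group the lines of a long text into a fixed number of roughly equal pieces. The helper below is a base-R sketch under that assumption: `chunk_lines` is hypothetical, and splitting on line boundaries is just one reasonable choice that avoids cutting through the middle of a sentence.

```r
# Illustrative base-R sketch; `chunk_lines` is a hypothetical helper,
# not part of jiebaRS.
chunk_lines <- function(lines, n_chunks = 64L) {
  # Never ask for more chunks than there are lines.
  n_chunks <- max(1L, min(n_chunks, length(lines)))
  # Assign each line to one of n_chunks consecutive groups.
  group <- ceiling(seq_along(lines) * n_chunks / length(lines))
  vapply(
    split(lines, group),
    paste,
    character(1),
    collapse = "\n",
    USE.NAMES = FALSE
  )
}

lines <- paste0("line", 1:10)
chunks <- chunk_lines(lines, n_chunks = 3L)
# `chunks` is a character vector that could then be passed as `texts`.
```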
Segmented tokens in the requested aggregation form.
seg <- worker()
texts <- c("\u5357\u4eac\u5e02\u957f\u6c5f\u5927\u6865", "\u8fd9\u662f\u4e00\u4e2a\u6d4b\u8bd5")
segment_batch(texts, seg)
segment_batch(texts, seg, batch = "flatten")
Tag one or more strings with a jieba_worker created by worker().
tagging(code, jiebar, ..., format = c("vector", "data.frame", "legacy"), batch = c("list", "flatten"))
code |
A non-empty character vector to tag. |
jiebar |
A |
... |
Must be empty. This enforces that optional arguments such as
|
format |
Output format for a single tagged string. Must be one of
|
batch |
Aggregation mode for multi-string input. Must be one of
|
format controls the shape of each single-string tagging result:
"vector": a named character vector with token names and tag values.
"data.frame": a data frame with term and tag columns.
"legacy": the old jiebaR layout with token values and tag names.
In the current release benchmarks on the bundled Fortress Besieged and
Dream of the Red Chamber texts, jiebaRS::tagging() is about 1.6x to
1.8x faster than jiebaR::tagging() when each novel is tagged as one long
string. When the same content is split into many strings and processed in
batch, jiebaRS::tagging() is about 2x to 5x faster than jiebaR.
For very long texts, splitting before tagging is usually faster than sending one huge string. In the same release benchmarks, the best results appeared around 32 to 128 chunks, while much finer splitting still helped but was no longer optimal.
When code contains multiple strings, batch controls how the per-string
results are aggregated:
"list": one single-string result per input string.
"flatten": concatenate all results into one. The shape is decided by
format: "vector"/"legacy" produce a named character vector, while
"data.frame" produces a combined data frame with a doc_id column.
When batch is omitted, jiebaRS returns "vector" for single-string
input and "list" for multi-string input.
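The three single-string layouts named by format differ only in where the tokens and tags are placed. The base-R sketch below builds each layout from hand-written tokens and tags; the values are placeholders and not real tagger output.

```r
# Illustrative base-R sketch of the three layouts for one tagged string.
tokens <- c("this", "is", "test")
tags   <- c("r", "v", "n")

# "vector": names are tokens, values are tags.
as_vector <- setNames(tags, tokens)

# "data.frame": one row per token with term and tag columns.
as_df <- data.frame(term = tokens, tag = tags, stringsAsFactors = FALSE)

# "legacy": names are tags, values are tokens (the old jiebaR layout).
as_legacy <- setNames(tokens, tags)
```

Because tags repeat far more often than tokens, the "legacy" layout tends to have duplicated names, which makes name-based lookup less convenient than in the "vector" layout.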
Tagging results in the requested format.
tagger <- worker(type = "tag")
text1 <- "\u8fd9\u662f\u4e00\u4e2a\u6d4b\u8bd5"
text2 <- "\u518d\u6765\u4e00\u6b21"
tagging(text1, tagger)
tagging(c(text1, text2), tagger)
tagging(c(text1, text2), tagger, format = "data.frame", batch = "flatten")
Convenience wrapper around tagging() for multi-string input. When batch
is not supplied, tagging_batch() always returns list output.
tagging_batch(texts, jiebar, ..., format = c("vector", "data.frame", "legacy"), batch = c("list", "flatten"))
texts |
A non-empty character vector to tag. |
jiebar |
A |
... |
Must be empty. This enforces that optional arguments such as
|
format |
Output format for each single tagged result. Must be one of
|
batch |
Aggregation mode. Must be one of |
tagging_batch() is a convenience wrapper for explicit multi-string input.
The returned object depends on both format and batch:
batch = "list": returns one single-string tagging result per input
string.
batch = "flatten": concatenates all results into one. The shape is
decided by format: "vector"/"legacy" produce a named character
vector, while "data.frame" produces a combined data frame with a
doc_id column.
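Flattening per-string data frames into one combined data frame can be sketched in base R as follows. The rows are placeholders, and both the position and the integer type of the `doc_id` column are assumptions made for the illustration.

```r
# Illustrative base-R sketch: combine one data frame per input string
# into a single data frame that records the source string in doc_id.
per_string <- list(
  data.frame(term = c("a", "b"), tag = c("n", "v"), stringsAsFactors = FALSE),
  data.frame(term = "c", tag = "n", stringsAsFactors = FALSE)
)

with_ids <- Map(
  function(df, id) data.frame(doc_id = id, df, stringsAsFactors = FALSE),
  per_string,
  seq_along(per_string)
)
flat <- do.call(rbind, with_ids)
```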
In the current release benchmarks on the bundled Fortress Besieged and
Dream of the Red Chamber texts, batch tagging is about 2x to 5x faster
than the comparable jiebaR workflow on many-string inputs. For very long
texts, the best throughput was usually reached by splitting into about 32
to 128 chunks, while much finer splitting still helped but was no longer
optimal.
Tagging results in the requested format.
tagger <- worker(type = "tag")
texts <- c("\u8fd9\u662f\u4e00\u4e2a\u6d4b\u8bd5", "\u518d\u6765\u4e00\u6b21")
tagging_batch(texts, tagger)
tagging_batch(texts, tagger, format = "legacy", batch = "flatten")
Extract TextRank-ranked keywords from a single in-memory string with a
TextRank worker created by worker(). This is separate from keywords(),
which uses TF-IDF weighting.
textrank(code, jiebar, ..., format = c("vector", "data.frame", "legacy"))
code |
A single character string to analyze. |
jiebar |
A |
... |
Must be empty. This enforces that optional arguments such as
|
format |
Output format. |
TextRank results in the requested format.
Convenience wrapper around textrank() that always returns a data frame with
term and weight columns.
textrank_df(x, jiebar)
x |
A single character string to analyze. |
jiebar |
A |
A data frame with term and weight columns.
Initialize a jiebaRS worker. See Details for more information.
worker(
  type = c("mix", "mp", "hmm", "full", "query", "tag", "keywords", "textrank"),
  stop_word = NULL,
  stop_word_file = NULL,
  hmm = TRUE,
  topn = 5L,
  idf = NULL,
  dict = NULL,
  user = NULL,
  symbol = FALSE,
  bylines = FALSE
)
type |
Worker type. Supported values are |
stop_word |
Optional character vector of stop words supplied directly. |
stop_word_file |
Optional file path containing one stop word per line. |
hmm |
Logical scalar or character scalar. If logical, controls whether
to enable HMM fallback for unknown terms. If character, must be a path to a
custom HMM model file compatible with |
topn |
Integer. The number of terms returned by |
idf |
Optional character scalar. A path to a custom IDF dictionary
file for |
dict |
Optional character scalar. A path to a custom main dictionary
file that replaces the embedded dictionary. Each line should be
|
user |
Optional character scalar. A path to a user dictionary file
whose entries are appended to the main dictionary. Same line format as
|
symbol |
Logical. Whether to keep symbol-like tokens in the sentence. Default is |
bylines |
Deprecated compatibility argument retained from |
The qmax argument is not supported. Although jiebaR documented
qmax for query workers, the value was never actually passed to the
underlying segmentation call. Similarly, the jieba-rs backend implements
search-mode segmentation without a configurable query threshold. To avoid
user confusion, jiebaRS omits the qmax argument entirely rather than
retaining a no-op parameter.
jieba-rs does not expose dedicated public implementations for mp or
hmm workers. jiebaRS therefore maps mp to cut(..., false) and hmm
to cut(..., true). This is a compatibility approximation rather than a
byte-for-byte reimplementation of jiebaR, and jiebaRS warns once per R
session when either type is requested.
tag workers use jieba-rs tagging on top of the default mixed
segmentation path, which is the closest public behavior to jiebaR.
stop_word and stop_word_file can be supplied together. The two sources
are merged, and the combined stop-word list is then normalized.
In jiebaRS, hmm accepts either a logical scalar or a file path. A
logical value controls whether the underlying jieba-rs
segmentation/tagging pipeline may fall back to HMM for unknown terms. A
character scalar is interpreted as a path to a custom HMM model file and
enables HMM fallback with that model. The flag affects mix and query
workers directly, tag workers through the underlying mixed tagging path,
and keywords workers through TF-IDF keyword extraction. mp, hmm, and
full workers ignore the flag because their jieba-rs code paths do not
consult it.
dict and user load dictionary files at worker creation time. dict
replaces the embedded main dictionary entirely; user appends entries
to whatever main dictionary is in place (default or custom dict). Both
files use the same line format: word [freq] [tag], whitespace-separated,
one entry per line. freq is an integer word frequency (default 0 if
omitted); tag is a part-of-speech tag string (default empty if omitted).
For user files, a word with no freq is assigned frequency 0.
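The line format can be made concrete with a small base-R parser. This is a sketch of the described format only: `parse_dict_line` is a hypothetical helper, it assumes the fields are strictly positional (a second field is always read as freq), and it is not how jieba-rs itself reads dictionary files.

```r
# Illustrative base-R sketch of the "word [freq] [tag]" line format.
# `parse_dict_line` is hypothetical and assumes positional fields.
parse_dict_line <- function(line) {
  parts <- strsplit(trimws(line), "[[:space:]]+")[[1]]
  list(
    word = parts[1],
    # Documented defaults: frequency 0 and an empty tag when omitted.
    freq = if (length(parts) >= 2) as.integer(parts[2]) else 0L,
    tag  = if (length(parts) >= 3) parts[3] else ""
  )
}

dict_lines <- c(
  "\u91cf\u5b50\u673a\u5668\u72d7 1000 n",  # word, freq, and tag
  "\u8d85\u5bfc\u91cf\u5b50\u6bd4\u7279"    # word only
)
entries <- lapply(dict_lines, parse_dict_line)
```

Lines like these, saved one entry per line in a text file, are the kind of content the dict and user arguments are described as loading.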
A jieba_worker S3 object.