# prose

`prose` is a Go library for text processing (primarily English at the moment) that supports tokenization, part-of-speech tagging, named-entity extraction, and more. The library's functionality is split into subpackages designed for modular use. See the documentation for more information.
```console
$ go get github.com/jdkato/prose/...
```
## Tokenizing

Word, sentence, and regexp tokenizers are available. Every tokenizer implements the same interface, which makes it easy to customize tokenization in other parts of the library.
```go
package main

import (
	"fmt"

	"github.com/jdkato/prose/tokenize"
)

func main() {
	text := "They'll save and invest more."
	tokenizer := tokenize.NewTreebankWordTokenizer()
	for _, word := range tokenizer.Tokenize(text) {
		// [They 'll save and invest more .]
		fmt.Println(word)
	}
}
```
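As an illustration of how a regexp tokenizer can produce the Treebank-style output above, here is a minimal standard-library sketch (a toy for illustration, not prose's implementation); the pattern splits off leading-apostrophe contractions and punctuation as separate tokens:

```go
package main

import (
	"fmt"
	"regexp"
)

// tokenPattern matches, in order of preference: a contraction starting
// with an apostrophe ('ll, 's, ...), a run of word characters, or a
// single punctuation character. An illustrative sketch only.
var tokenPattern = regexp.MustCompile(`'\w+|\w+|[^\w\s]`)

// Tokenize returns all non-overlapping matches as tokens.
func Tokenize(text string) []string {
	return tokenPattern.FindAllString(text, -1)
}

func main() {
	fmt.Println(Tokenize("They'll save and invest more."))
	// [They 'll save and invest more .]
}
```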
## Tagging

The `tag` package includes a port of Textblob's "fast and accurate" POS tagger. Below is a comparison of its performance against NLTK's implementation of the same tagger on the Treebank corpus:

| Library | Accuracy | 5-Run Average (sec) |
|---------|----------|---------------------|
| NLTK    | 0.893    | 7.224               |
| prose   | 0.961    | 2.538               |

(See `scripts/test_model.py` for more information.)
```go
package main

import (
	"fmt"

	"github.com/jdkato/prose/tag"
	"github.com/jdkato/prose/tokenize"
)

func main() {
	text := "A fast and accurate part-of-speech tagger for Golang."
	words := tokenize.NewTreebankWordTokenizer().Tokenize(text)
	tagger := tag.NewPerceptronTagger()
	for _, tok := range tagger.Tag(words) {
		fmt.Println(tok.Text, tok.Tag)
	}
}
```
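The tagger is an averaged perceptron, whose prediction step can be sketched as follows. This is a simplified illustration of the idea, not the `tag` package's actual code; `predict`, the feature names, and the weights below are all hypothetical:

```go
package main

import "fmt"

// predict scores every candidate tag by summing the per-tag weights of
// each active feature, then returns the highest-scoring tag. This is
// the core of a perceptron tagger's inference step (sketch only).
func predict(weights map[string]map[string]float64, features []string) string {
	scores := map[string]float64{}
	for _, feat := range features {
		for tag, w := range weights[feat] {
			scores[tag] += w
		}
	}
	best, bestScore, first := "", 0.0, true
	for tag, s := range scores {
		if first || s > bestScore {
			best, bestScore, first = tag, s, false
		}
	}
	return best
}

func main() {
	// Hypothetical learned weights: "suffix=ing" votes strongly for VBG.
	weights := map[string]map[string]float64{
		"suffix=ing": {"VBG": 2.0, "NN": 0.5},
		"prev=is":    {"VBG": 1.0, "JJ": 0.8},
	}
	fmt.Println(predict(weights, []string{"suffix=ing", "prev=is"})) // VBG
}
```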
## Transforming

The `transform` package currently has only one function: converting strings to title case. Unlike `strings.Title`, `transform` adheres to common guidelines, including styles for both the AP Stylebook and The Chicago Manual of Style. Additionally, you can easily add your own custom style by defining an `IgnoreFunc` callback.

Inspiration and test data taken from python-titlecase and to-title-case.
```go
package main

import (
	"fmt"
	"strings"

	"github.com/jdkato/prose/transform"
)

func main() {
	text := "the last of the mohicans"
	tc := transform.NewTitleConverter(transform.APStyle)
	fmt.Println(strings.Title(text)) // The Last Of The Mohicans
	fmt.Println(tc.Title(text))      // The Last of the Mohicans
}
```
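The ignore-list idea that separates style-aware title casing from `strings.Title` can be sketched with the standard library alone. This is a simplified AP-style illustration; `smallWords` and `TitleCase` are hypothetical names, not the `transform` package's API:

```go
package main

import (
	"fmt"
	"strings"
)

// smallWords lists words that AP style leaves lowercase mid-title
// (illustrative subset, not the library's actual list).
var smallWords = map[string]bool{
	"a": true, "an": true, "and": true, "as": true, "at": true,
	"but": true, "by": true, "for": true, "in": true, "nor": true,
	"of": true, "on": true, "or": true, "the": true, "to": true,
}

// TitleCase capitalizes every word except small words, while always
// capitalizing the first and last word of the title.
func TitleCase(s string) string {
	words := strings.Fields(strings.ToLower(s))
	for i, w := range words {
		if i == 0 || i == len(words)-1 || !smallWords[w] {
			words[i] = strings.ToUpper(w[:1]) + w[1:]
		}
	}
	return strings.Join(words, " ")
}

func main() {
	fmt.Println(TitleCase("the last of the mohicans")) // The Last of the Mohicans
}
```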
## Summarizing

The `summarize` package includes functions for computing standard readability and usage statistics. It's among the most accurate implementations available due to its reliance on legitimate tokenizers (whereas others, like readability-score, rely on naive regular expressions). It also includes a TL;DR algorithm for condensing text into a user-specified number of paragraphs.
```go
package main

import (
	"fmt"

	"github.com/jdkato/prose/summarize"
)

func main() {
	doc := summarize.NewDocument("This is some interesting text.")
	fmt.Println(doc.SMOG(), doc.FleschKincaid())
}
```
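For reference, the Flesch-Kincaid grade level is a fixed formula over word, sentence, and syllable counts; the sketch below shows the published formula itself. The hard part, which accurate implementations handle with real tokenizers, is deriving the three counts from raw text:

```go
package main

import "fmt"

// FleschKincaid computes the standard Flesch-Kincaid grade level:
// 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
func FleschKincaid(words, sentences, syllables float64) float64 {
	return 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
}

func main() {
	// E.g., 100 words across 5 sentences with 130 syllables:
	fmt.Printf("%.2f\n", FleschKincaid(100, 5, 130)) // 7.55
}
```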
## Chunking

The `chunk` package implements named-entity extraction from pre-tagged input, using a regular expression that describes the chunks you're looking for.
```go
package main

import (
	"fmt"

	"github.com/jdkato/prose/chunk"
	"github.com/jdkato/prose/tag"
	"github.com/jdkato/prose/tokenize"
)

func main() {
	words := tokenize.TextToWords("Go is an open-source programming language created at Google.")
	regex := chunk.TreebankNamedEntities
	tagger := tag.NewPerceptronTagger()
	for _, entity := range chunk.Chunk(tagger.Tag(words), regex) {
		fmt.Println(entity) // [Go Google]
	}
}
```
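The mechanics of matching a regular expression against a tag sequence can be sketched in plain Go. The minimal illustration below extracts runs of proper nouns; `Token`, `nnpRun`, and `chunkProperNouns` are hypothetical names, not the `chunk` package's API or its actual pattern:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Token pairs a word with its POS tag.
type Token struct {
	Text, Tag string
}

// nnpRun matches one or more consecutive proper-noun (NNP) tags in a
// space-joined tag string (illustrative sketch only).
var nnpRun = regexp.MustCompile(`(?:NNP )+`)

// chunkProperNouns returns each run of consecutive NNP-tagged words,
// mapping regex match offsets back to token indices by counting spaces.
func chunkProperNouns(toks []Token) []string {
	tags := make([]string, len(toks))
	for i, t := range toks {
		tags[i] = t.Tag
	}
	tagStr := strings.Join(tags, " ") + " "

	var out []string
	for _, loc := range nnpRun.FindAllStringIndex(tagStr, -1) {
		start := strings.Count(tagStr[:loc[0]], " ")       // tokens before the match
		n := strings.Count(tagStr[loc[0]:loc[1]], " ")     // tokens in the match
		words := make([]string, n)
		for i := 0; i < n; i++ {
			words[i] = toks[start+i].Text
		}
		out = append(out, strings.Join(words, " "))
	}
	return out
}

func main() {
	toks := []Token{
		{"Go", "NNP"}, {"is", "VBZ"}, {"a", "DT"}, {"language", "NN"},
		{"created", "VBN"}, {"at", "IN"}, {"Google", "NNP"}, {".", "."},
	}
	fmt.Println(chunkProperNouns(toks)) // [Go Google]
}
```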
## License

If not otherwise specified (see below), the source files are distributed under the MIT License found in the LICENSE file.

Additionally, the following files contain their own license information:
- `tag/aptag.go`: MIT © Matthew Honnibal
- `tokenize/punkt.go`: MIT © Eric Bower
- `tokenize/pragmatic.go`: MIT © Kevin S. Dias