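Ansj Chinese Word Segmentation

##### Documentation: http://nlpchina.github.io/ansj_seg/

##### Online demo: http://demo.ansj.org

Summary

This is a Java implementation of Chinese word segmentation based on the Google semantic model plus a conditional random field (CRF) model.

Segmentation runs at roughly 2 million characters per second (measured on a MacBook Air), with accuracy above 96%.

Currently implemented: Chinese word segmentation, Chinese personal-name recognition, and user-defined dictionaries.

It can be applied to natural language processing and similar tasks, and suits any project that demands high segmentation quality.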

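Download the jar
  • Visit http://maven.nlpcn.org/org/ansj/ and, preferably, download the latest version from ansj_seg/
    • If you use a 1.x version, you also need to download tree_split.jar
    • If you use a 2.x version, you also need to download nlp-lang.jar
    • If you use version 3.x or later, the single ansj_seg-[version]-all-in-one.jar is all you need.
  • Import it into Eclipse and start writing your program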
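Maven
  1. Clone this project with git:
git clone https://github.com/NLPchina/ansj_seg
  2. Enter the ansj_seg directory and install the project with Maven:
mvn clean install -Dmaven.test.skip=true
  3. Paste the following into pom.xml, placing the dependency inside the dependencies tag (use the latest release for the version):
	<!-- Add the extra Maven repository -->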
	<repositories>
		<repository>
			<id>mvn-repo</id>
			<url>http://maven.nlpcn.org/</url>
		</repository>
	</repositories>


    <dependencies>
        ....
        
        <dependency>
            <groupId>org.ansj</groupId>
            <artifactId>ansj_seg</artifactId>
            <version>3.3</version>
        </dependency>
        ....
    </dependencies>
Demo

If you have just downloaded the library and only want to try out the results, you can call this simple interface:


 import org.ansj.splitWord.analysis.ToAnalysis;

 String str = "欢迎使用ansj_seg,(ansj中文分词)在这里如果你遇到什么问题都可以联系我.我一定尽我所能.帮助大家.ansj_seg更快,更准,更自由!" ;
 System.out.println(ToAnalysis.parse(str));
 
 [欢迎/, 使用/, ansj/, _/, seg/, ,/, (/, ansj/, 中文/, 分词/, )/, 在/, 这里/, 如果/, 你/, 遇到/, 什么/, 问题/, 都/, 可以/, 联系/, 我/, ./, 我/, 一定/, 尽/, 我/, 所/, 能/, ./, 帮助/, 大家/, ./, ansj/, _/, seg/, 更/, 快/, ,/, 更/, 准/, ,/, 更/, 自由/, !/]

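## Milestones

# January 14, 2016

It has been a very long time. As you can see, the last entry here was written a year and a half ago. There is not much to say: my own irresponsibility left this project stalled for a long while. This round brings many improvements, made on and off, roughly as follows:

  • Greatly improved the accuracy of NlpAnalysis
  • Removed the CRF model (crfmodel) from the jar; DownLibrary is provided to download it
  • Added an option to give user-defined dictionary entries priority (forgive my earlier stubbornness); segment with UserDefineAnalysis to use it
  • Changed the jar's target version back to JDK 6
  • Added keyword handling to keyword extraction; it now supports English, more or less
  • Various other changes. Wishing everyone good health and all the best, and the same to myself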
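
The idea of letting user-defined words take priority can be pictured with a small, self-contained sketch. It is not ansj's implementation and does not call the ansj API: the class `UserDictFirstSegmenter`, its two in-memory dictionaries, and the plain forward maximum matching strategy are all assumptions made for illustration. At each position it tries the longest user-dictionary word first and falls back to the core dictionary only when no user word matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: forward maximum matching in which user-dictionary hits win. */
class UserDictFirstSegmenter {
    private final Set<String> userDict;  // user-defined words, checked first
    private final Set<String> coreDict;  // stand-in for a core dictionary
    private final int maxWordLength;     // longest candidate word to try

    UserDictFirstSegmenter(Set<String> userDict, Set<String> coreDict, int maxWordLength) {
        this.userDict = userDict;
        this.coreDict = coreDict;
        this.maxWordLength = maxWordLength;
    }

    List<String> segment(String text) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // User words take priority over core words at every position.
            String word = longestMatch(text, i, userDict);
            if (word == null) {
                word = longestMatch(text, i, coreDict);
            }
            if (word == null) {
                // Nothing matched: emit a single character and move on.
                word = text.substring(i, i + 1);
            }
            words.add(word);
            i += word.length();
        }
        return words;
    }

    /** Returns the longest dictionary word starting at 'start', or null if none matches. */
    private String longestMatch(String text, int start, Set<String> dict) {
        int end = Math.min(text.length(), start + maxWordLength);
        for (int e = end; e > start; e--) {
            String candidate = text.substring(start, e);
            if (dict.contains(candidate)) {
                return candidate;
            }
        }
        return null;
    }
}
```

With a user dictionary containing 中文分词, a core dictionary containing 中文, 分词 and 工具, and a maximum word length of 4, `segment("中文分词工具")` yields 中文分词 followed by 工具, rather than the three separate core words.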

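# June 13, 2014

Well, today is Friday the 13th. I am busy, in an orderly way, upgrading to ansj 2.0. Every 2.0x release is a preview, and stability is not guaranteed, so unless you like chasing versions, do not follow these updates. The main changes this time are:

  • Fixed an offset error in the reader stream, an unreadable-input error caused mainly by \n\r
  • Fixed misuse of the get methods in the Term class, mainly so that Term objects serialize to JSON more cleanly
  • Refactored the core dictionary.
  • Dropped tree-split.jar and moved to nlp-lang.jar. nlp-lang is a new project of mine with plenty in it; if you are interested, take a look at https://github.com/nlpchina/nlp-lang.
  • Refactored the CRF module so that anyone can train their own CRF model. Added documentation.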
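
To see why line terminators matter for offsets, here is a minimal sketch, written for illustration only and not taken from ansj. The class `OffsetLineReader` and its `Line` type are hypothetical. It reads a `Reader` one character at a time and records the absolute offset at which each line starts, counting `\r\n` as two characters, so an offset can be mapped back to the original text.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: split a stream into lines while keeping absolute offsets. */
class OffsetLineReader {

    /** A line of text plus the absolute character offset at which it starts. */
    static final class Line {
        final int offset;
        final String text;

        Line(int offset, String text) {
            this.offset = offset;
            this.text = text;
        }
    }

    static List<Line> readLines(Reader reader) {
        List<Line> lines = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        int pos = 0;             // absolute offset of the next character
        int lineStart = 0;       // absolute offset where the current line began
        boolean lastWasCr = false;
        try {
            int c;
            while ((c = reader.read()) != -1) {
                if (c == '\n' && lastWasCr) {
                    // Second half of "\r\n": the line was already emitted,
                    // but the character still counts toward the offset.
                    pos++;
                    lineStart = pos;
                    lastWasCr = false;
                    continue;
                }
                lastWasCr = false;
                if (c == '\r' || c == '\n') {
                    lines.add(new Line(lineStart, buf.toString()));
                    buf.setLength(0);
                    pos++;
                    lineStart = pos;
                    lastWasCr = (c == '\r');
                } else {
                    buf.append((char) c);
                    pos++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (buf.length() > 0) {
            lines.add(new Line(lineStart, buf.toString()));
        }
        return lines;
    }
}
```

A reader that drops terminators without counting them would report later lines at offsets that are too small, which is the kind of drift that breaks anything relying on positions in the source text.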

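# March 10, 2014

This update fixes an offset error that had been lurking in tree-split for ages. I fixed it because I wanted to highlight keywords in red. Naturally I went on to build summarization as well, and added some rules to keyword extraction. Summarization is still at the alpha stage.

Frankly, summarization alone is enough material for a doctoral thesis, yet summarizing open-domain text does not necessarily need sophisticated algorithms. What is here is usable in engineering work: it works and it is simple, but it will not be best in class. Any application that wants to be refined and polished has to be built around its own requirements and business. So the philosophy behind these basic components is to deliver 80% of the functionality with 20% of the work. That reminds me of an advertising slogan: whether it is any good depends on the results. Er... I see I have wandered off topic.

To sum up, this release adds summarization, query-based summarization, and article highlighting, and improves keyword extraction. I would very much like to build an open-source NLP toolkit covering summarization, keyword extraction, sentiment analysis, topic discovery, and more; that will be a separate project rather than part of this one. If you are interested, get in touch. Keep going!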
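
Keyword highlighting depends on knowing exactly where each keyword sits in the text. The sketch below is a simplified illustration, not the project's highlighter: the class `KeywordHighlighter` and its parameters are hypothetical, and it finds keywords by direct string matching rather than from segmenter offsets. At each position it wraps the longest matching keyword in the given open and close markers.

```java
import java.util.Collection;

/** Hypothetical sketch: wrap every keyword occurrence in open/close markers. */
class KeywordHighlighter {

    static String highlight(String text, Collection<String> keywords, String open, String close) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            // Pick the longest keyword that starts at position i, if any.
            String best = null;
            for (String k : keywords) {
                if (!k.isEmpty() && text.startsWith(k, i)
                        && (best == null || k.length() > best.length())) {
                    best = k;
                }
            }
            if (best == null) {
                out.append(text.charAt(i));
                i++;
            } else {
                out.append(open).append(best).append(close);
                i += best.length();
            }
        }
        return out.toString();
    }
}
```

The sketch does no HTML escaping, so text destined for a web page would need to be escaped separately.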

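# February 10, 2014

The documentation is finally complete. Time was short, I am rather lazy, and I do not enjoy writing, so it came together in fits and starts, but the result is decent. What pleases me most is that I found a way of writing documentation: write only Markdown files, render them by calling marked through Ajax, and finally mark up the code with a highlight plugin. :-) It is all done in JavaScript. Thanks to those two excellent open-source projects, mom no longer has to worry about my documentation. :-)

I also used the New Year holiday to make extensive changes to the segmenter. After much inner struggle I gave up some approaches that produced better segmentation results and brought the segmenter down to under 50 MB. Users who do not need new-word discovery, for example for search suggestions, can use the ansj_min build, which is only about 4 MB: handy for mobile, energy-saving, and eco-friendly.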

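# January 21, 2014

Added parsing of CRF models, and CRF is now used to recognize out-of-vocabulary words, with good results. Long words are now analyzed further, taking granularity down to the finest level. The side effect is that the segmentation jar became too large, over 500 MB, so it cannot be published smoothly to git or a Maven repository. I tried oschina's Maven repository and that did not work either. Unless a good solution turns up, ansj will drop Maven support; to those who need it, I can only say I am very sorry. I do not want worries about the project's size to make me timid. The jar may instead be published through a cloud drive. If you use ansj for search, I do not recommend following this update, because index segmentation has not changed much. Best wishes.

For the rest of this (lunar) year I have a few plans: refactor the code, optimize the key algorithms, and complete the documentation. We will see how it goes.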

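# December 12, 2013

Added character-based word formation to the segmenter, which greatly improves the handling of out-of-vocabulary words. Recognition of foreign personal names has been specifically optimized. This is currently being tested. Also added an HTTP server console, so segmentation results can be requested directly and conveniently.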
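
Character-based word formation is commonly framed as tagging each character with its position in a word and then reassembling words from the tags. The sketch below shows only that reassembly step, under assumptions of my own: the B/M/E/S tag set (begin, middle, end, single) and the class `CharTagDecoder` are illustrative and are not confirmed to match what ansj uses internally. In a real system the tags would come from a trained model such as a CRF.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: rebuild words from per-character B/M/E/S tags. */
class CharTagDecoder {

    static List<String> decode(String text, String tags) {
        if (text.length() != tags.length()) {
            throw new IllegalArgumentException("text and tags must have the same length");
        }
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char tag = tags.charAt(i);
            // A new word starts at B or S; flush anything left unfinished.
            if ((tag == 'B' || tag == 'S') && current.length() > 0) {
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(text.charAt(i));
            // A word is complete at E or S.
            if (tag == 'E' || tag == 'S') {
                words.add(current.toString());
                current.setLength(0);
            }
        }
        // Tolerate a tag sequence that ends in the middle of a word.
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }
}
```

For example, the text 中文分词 tagged as BEBE decodes to the two words 中文 and 分词. Because words are assembled character by character, this approach can produce words that appear in no dictionary, which is why it helps with out-of-vocabulary words.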

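# September 26, 2013

I have finished the update that is current as of this post. I tinkered with the core dictionary. This version is more like the earlier versions and keeps up the fine tradition in segmentation granularity. Users focused on search, in particular, should definitely update.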

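# August 28, 2013

After protests from countless users, ansj finally supports Maven. My thanks to the guy who helped convert the project to Maven. I cannot find your QQ anymore, and I have forgotten your name too.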

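# Improvements

I revised countless versions on and off. On CSDN's search system, ansj stood the test of retrieval and analysis over 12 years of historical data. But users and I found many shortcomings in the project, so I got to work. Along the way I met many more friends, too many to count, including you who are reading this. Thank you all for supporting this little tool. I will not list you one by one here, mainly because looking up your names is a hassle, and I am a very lazy person.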

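# Breakdown

As for most open-source authors, the project brought a lot of burden.

For example, while you are working or thinking, other people interrupt your train of thought, raising questions or bugs over QQ or email. Most of these are, of course, friendly and very worthwhile suggestions. On one hand they strengthened my resolve to do this open-source segmenter well; on the other they took some toll on the efficiency of my work and life. I answer most questions and try to stay as patient as I can, but if I have ever been neglectful, I apologize to everyone here.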

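# Birth

On September 7, 2012, Ansj Chinese word segmentation was finally finished after a full night of struggle, and I do mean a full night: I was writing away, looked up, and it was dawn. I will not grumble here about the joys and sorrows along the way.

I @-mentioned 52nlp on Weibo, hoping he would help me spread the word. With his help, ansj made many friends. After sending the mention I went to bed, and it was a restless night. When I woke up in the afternoon, many people had @-mentioned me on Weibo. I joked to cq: I am famous.

I also @-mentioned my first mentor, Zhang Huaping (张华平), who expressed his support. My thanks to him.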

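About

ansj segmentation: a true Java implementation of ICT. Both its segmentation quality and its speed exceed those of the open-source version of ICT. Features: Chinese word segmentation, personal-name recognition, part-of-speech tagging, and user-defined dictionaries.

Language: Java
License: Apache-2.0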
