Write Your Own Chinese Word Segmenter: A Complete Tutorial, with Discussion and Solutions for the Problems Encountered (complete C# code, DLL files, and txt files available for download)
There are many Chinese word segmentation plugins, each with its own strengths and weaknesses. Having only recently started working with natural language processing, this is my first hands-on attempt at Chinese word segmentation.
First, thanks to harry.guo for the learning resources in his post at http://www.cnblogs.com/harryguo/archive/2007/09/26/906965.html, which this article builds on and explores further.
Now, on to the main content. Experts passing by, go easy on me; this is a beginner's practice project. The complete project source download is at the end of the article.
Since the segmenter is written on top of Lucene.Net, create a new project named Lucene.China and add a reference to Lucene.Net.dll. (Attached resource: Lucene.Net.rar)
Unlike English, Chinese has no spaces between words, which makes Chinese segmentation somewhat more complex than English.
Step one: build the tree-shaped dictionary. Under the project's bin/Debug directory, create a folder named data (skip this if it already exists), then put sDict.txt into the data folder.
(Attached resource: sDict.rar; unzip it to get sDict.txt and place it in the folder above.)
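The loading code below reads the dictionary as a UTF-8 text file with one word per line, stopping at the first blank line. The entries below are a hypothetical illustration of that format, not the actual contents of sDict.txt:

```text
中文
分词
词库
天安门
```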
The code that builds the tree-shaped dictionary is as follows:
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

namespace Lucene.China
{
    /// <summary>
    /// Dictionary class: builds the tree-shaped (trie) word dictionary.
    /// </summary>
    public class WordTree
    {
        /// <summary>
        /// Path to the dictionary file.
        /// </summary>
        //private static string DictPath = Application.StartupPath + "\\data\\sDict.txt";
        private static string DictPath = Environment.CurrentDirectory + "\\data\\sDict.txt";

        /// <summary>
        /// Cached dictionary object: each key is one character, and each value
        /// is a nested Hashtable of the characters that may follow it.
        /// </summary>
        public static Hashtable chartable = new Hashtable();

        /// <summary>
        /// Whether the dictionary file has been loaded.
        /// </summary>
        private static bool DictLoaded = false;

        /// <summary>
        /// Time spent loading the dictionary file, in seconds.
        /// </summary>
        public static double DictLoad_Span = 0;

        /// <summary>
        /// Regular expressions for character classes.
        /// </summary>
        public string strChinese = "[\u4e00-\u9fa5]";
        public string strNumber = "[0-9]";
        public string strEnglish = "[a-zA-Z]";

        /// <summary>
        /// Gets the type of a character.
        /// </summary>
        /// <param name="Char"></param>
        /// <returns>
        /// 0: Chinese, 1: English, 2: digit, -1: other
        /// </returns>
        public int GetCharType(string Char)
        {
            if (new Regex(strChinese).IsMatch(Char))
                return 0;
            if (new Regex(strEnglish).IsMatch(Char))
                return 1;
            if (new Regex(strNumber).IsMatch(Char))
                return 2;
            return -1;
        }

        /// <summary>
        /// Loads the dictionary file (only once).
        /// </summary>
        public void LoadDict()
        {
            if (DictLoaded)
                return;
            BuildDictTree();
            DictLoaded = true;
        }

        /// <summary>
        /// Builds the trie: one nested Hashtable level per character.
        /// </summary>
        private void BuildDictTree()
        {
            long dt_s = DateTime.Now.Ticks;
            string char_s;
            StreamReader reader = new StreamReader(DictPath, System.Text.Encoding.UTF8);
            string word = reader.ReadLine();
            while (word != null && word.Trim() != "")
            {
                Hashtable t_chartable = chartable;
                for (int i = 0; i < word.Length; i++)
                {
                    char_s = word.Substring(i, 1);
                    if (!t_chartable.Contains(char_s))
                    {
                        t_chartable.Add(char_s, new Hashtable());
                    }
                    t_chartable = (Hashtable)t_chartable[char_s];
                }
                word = reader.ReadLine();
            }
            reader.Close();
            DictLoad_Span = (double)(DateTime.Now.Ticks - dt_s) / (1000 * 10000);
            System.Console.Out.WriteLine("Dictionary load time: " + DictLoad_Span + "s");
        }
    }
}

Note two fixes relative to the original listing: the path strings need escaped backslashes ("\\data\\sDict.txt"), and the Chinese character class needs proper Unicode escapes ("[\u4e00-\u9fa5]"); as originally written, the regex would never match a Chinese character.
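To see why the nested-Hashtable layout is useful, here is a minimal self-contained sketch (not part of the original project) of forward maximum matching against such a trie. The toy dictionary and the MaxMatch helper are illustrative assumptions of mine, not code from the article; note also that because WordTree stores no end-of-word flag, this matches the longest *path* in the trie, which is not necessarily a complete dictionary word.

```csharp
using System;
using System.Collections;

class TrieDemo
{
    // Walk the nested Hashtables, returning the length of the longest prefix
    // of text (starting at start) that traces a path through the trie.
    static int MaxMatch(Hashtable root, string text, int start)
    {
        Hashtable node = root;
        int matched = 0;
        for (int i = start; i < text.Length; i++)
        {
            string ch = text.Substring(i, 1);
            if (!node.Contains(ch))
                break;
            node = (Hashtable)node[ch];
            matched = i - start + 1;
        }
        return matched > 0 ? matched : 1;  // fall back to a single character
    }

    static void Main()
    {
        // Build a toy trie the same way WordTree.BuildDictTree does.
        Hashtable root = new Hashtable();
        foreach (string word in new[] { "中文", "分词" })
        {
            Hashtable node = root;
            foreach (char c in word)
            {
                string s = c.ToString();
                if (!node.Contains(s)) node.Add(s, new Hashtable());
                node = (Hashtable)node[s];
            }
        }

        // Greedily segment a phrase, longest match first.
        string text = "中文分词";
        int pos = 0;
        while (pos < text.Length)
        {
            int len = MaxMatch(root, text, pos);
            Console.WriteLine(text.Substring(pos, len));
            pos += len;
        }
        // Prints "中文" then "分词", each on its own line.
    }
}
```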
Step two: build an analyzer that supports Chinese.
It needs a stop-word list, String[] CHINESE_ENGLISH_STOP_WORDS; the code below only constructs a simple one. (Attached resource: a fairly complete stop-word list, stopwords.rar)
The implementation is as follows:
using System;
using System.Collections.Generic;
using System.Text;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

namespace Lucene.China
{
    /// <summary>
    /// An Analyzer for mixed Chinese/English text.
    /// </summary>
    public class ChineseAnalyzer : Analyzer
    {
        //private System.Collections.Hashtable stopSet;

        public static readonly System.String[] CHINESE_ENGLISH_STOP_WORDS = new System.String[]
        {
            "a", "an", "and", "are", "as", "at", "be", "but", "by",
            "for", "if", "in", "into", "is", "it",
            "no", "not", "of", "on", "or", "s", "such",
            "t", "that", "the", "their", "then", "there", "these",
            "they", "this", "to", "was", "will", "with",
            "我", "我们"
        };

        /// <summary>
        /// Constructs a {@link ChineseTokenizer} filtered by a {@link StandardFilter},
        /// a {@link LowerCaseFilter} and a {@link StopFilter}.
        /// </summary>
        public override TokenStream TokenStream(System.String fieldName, System.IO.TextReader reader)
        {
            TokenStream result = new ChineseTokenizer(reader);
            result = new StandardFilter(result);
            result = new LowerCaseFilter(result);
            result = new StopFilter(result, CHINESE_ENGLISH_STOP_WORDS);
            return result;
        }
    }
}
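Once the ChineseTokenizer (built later in the project, not shown in this section) is in place, the analyzer can be driven like any other Lucene analyzer. The following is only a sketch, assuming the old Lucene.Net 2.x-era API that this 2007-era article targets (Token.Next() returning null at end of stream, Token.TermText()); it will not compile without Lucene.Net.dll and the rest of the project:

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.China;

class AnalyzerDemo
{
    static void Main()
    {
        // Load the trie dictionary from bin/Debug/data/sDict.txt before tokenizing.
        new WordTree().LoadDict();

        Analyzer analyzer = new ChineseAnalyzer();
        TokenStream stream = analyzer.TokenStream("content",
            new StringReader("我爱北京天安门"));

        // Print each token; stop words such as "我" are filtered out by the StopFilter.
        Token token;
        while ((token = stream.Next()) != null)
        {
            Console.WriteLine(token.TermText());
        }
        stream.Close();
    }
}
```

The exact tokens printed depend on the dictionary contents and on the ChineseTokenizer implementation covered in the rest of the project.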