【Naive Bayes】Naive Bayes in Practice: Code Implementation, Data and Interfaces

Next, we move on to the code implementation.

【Sample format】

First, we define the format of the input samples used to train the naive Bayes model:

1:9242 13626 28005 41622 41623 34625 36848 5342 51265
0:16712 49100 2933 65827 6200
1:53396 3675 43979 25739
0:17347 61515 53679 59426
1:32712 39134 63265 65430

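Each line is one sample. The line begins with the index of the class the sample belongs to; class indices start at 0. When classes are given as labels, a mapping between class labels and class indices can be built by a separate program, which is straightforward. A colon (":") separates the class index from the input features that follow, which are given as feature-word indices separated by spaces. The mapping between feature-word strings and indices can be built in the same way. Using indices has two benefits: 1. the classifier algorithm becomes more general; 2. integers are processed more efficiently than strings.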
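To make the sample format concrete, here is a minimal sketch of how one line could be parsed. ParseSampleLine is a hypothetical helper introduced only for illustration; it is not part of the original code, and the author's own parser may handle errors differently.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the original code): parses one sample line
// such as "1:9242 13626 28005" into a class index and a list of feature indices.
// Returns false if the segmenter is missing or any index is malformed.
// A line with a class index but no features is accepted with an empty list.
bool ParseSampleLine(const std::string& sLine, const std::string& sSegmenter,
                     int& iClassId, std::vector<int>& FeaIdVec)
{
    assert(!sSegmenter.empty());  // precondition: a separator must be given

    const std::string::size_type pos = sLine.find(sSegmenter);
    if (pos == std::string::npos || pos == 0)
        return false;

    // Class index: the text before the segmenter must be a non-negative integer.
    std::istringstream clsStream(sLine.substr(0, pos));
    if (!(clsStream >> iClassId) || iClassId < 0)
        return false;
    clsStream >> std::ws;
    if (!clsStream.eof())
        return false;  // trailing junk after the class index

    // Feature indices: whitespace-separated non-negative integers.
    FeaIdVec.clear();
    std::istringstream feaStream(sLine.substr(pos + sSegmenter.size()));
    int iFeaId = 0;
    while (feaStream >> iFeaId) {
        if (iFeaId < 0)
            return false;
        FeaIdVec.push_back(iFeaId);
    }

    // If reading stopped before the end of the input, a token was not an integer.
    return feaStream.eof();
}
```

Taking the segmenter as a string rather than a single character mirrors the sSegmenter parameter of the interfaces described later in this article.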

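【Data structures】

Next come the data structures. The article "Naive Bayes in Practice: Basic Principles" already analyzed the model's parameter space, so I give the data-structure code directly: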

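#include <vector>

// Node that stores a feature and its conditional probability P(feature = 1 | class).
// Note: the feature value is assumed to be 1 here.
struct FeaProbNode
{
	int iFeadId;	// feature index
	double dProb;	// P(feature = 1 | class)
};

// Node that stores a class, its prior probability, and the conditional
// probabilities of the features under that class.
struct ClassFeaNode
{
	int iClassId;
	double dClassProb;	// the prior probability of a class
	std::vector<FeaProbNode> FeaVec;
};

// The information above, stored for every class.
std::vector<ClassFeaNode> ClassFeaVec;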

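FeaProbNode ties a feature directly to its conditional probability. As noted in the previous article, a feature can take different values, and each value has its own conditional probability within each class. In our text classification task the features are usually words, and, as mentioned before, a word takes one of two values, 1 or 0, indicating whether it occurs. The probability stored in FeaProbNode is the probability that the word takes value 1 in the given class; the probability of value 0 follows from normalization, P(x = 0 | c) = 1 - P(x = 1 | c).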
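The following sketch shows how the stored probabilities and the normalization condition could be combined to score a class. It assumes a Bernoulli-style model in which absent features contribute log(1 - p), and it assumes every stored probability lies strictly between 0 and 1 (for example, after smoothing). LogScore and PredictClass are hypothetical names; the article does not say whether the author's PredictByInputFeas scores absent features this way.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <set>
#include <vector>

struct FeaProbNode { int iFeadId; double dProb; };
struct ClassFeaNode { int iClassId; double dClassProb; std::vector<FeaProbNode> FeaVec; };

// Hypothetical helper: log P(c) plus, for every feature stored for the class,
// log p if the feature is present and log(1 - p) if it is absent.
double LogScore(const ClassFeaNode& node, const std::set<int>& PresentFeas)
{
    assert(node.dClassProb > 0.0);
    double dScore = std::log(node.dClassProb);
    for (std::size_t i = 0; i < node.FeaVec.size(); ++i) {
        const FeaProbNode& fea = node.FeaVec[i];
        assert(fea.dProb > 0.0 && fea.dProb < 1.0);  // assumption: smoothed probabilities
        const bool bPresent = PresentFeas.count(fea.iFeadId) > 0;
        dScore += std::log(bPresent ? fea.dProb : 1.0 - fea.dProb);
    }
    return dScore;
}

// Hypothetical helper: returns the class index with the highest log score,
// or -1 if the model holds no classes.
int PredictClass(const std::vector<ClassFeaNode>& ClassFeaVec,
                 const std::vector<int>& FeaIdVec)
{
    const std::set<int> present(FeaIdVec.begin(), FeaIdVec.end());
    int iBest = -1;
    double dBest = 0.0;
    for (std::size_t i = 0; i < ClassFeaVec.size(); ++i) {
        const double dScore = LogScore(ClassFeaVec[i], present);
        if (iBest < 0 || dScore > dBest) {
            iBest = ClassFeaVec[i].iClassId;
            dBest = dScore;
        }
    }
    return iBest;
}
```

Working in log space avoids underflow when many small probabilities are multiplied together.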

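【Function interfaces】

The function interfaces are simple. They cover model training, model prediction, and model serialization (saving and loading):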

#include <string>
#include <vector>

// The format of input samples:
//		ClassLabelIndex segmenter(not whitespace) ItemOneIndex whitespace ItemTwoIndex......
//	sFileSample: the text file of training samples
//	iClassNum: the number of class label indices, [0, iClassNum-1]
//	iFeaTypeNum: the number of feature type indices, [0, iFeaTypeNum-1]
//	sSegmenter: the separator between the class label index and the item indices
//	iFeaExtractNum: the number of features to select
//	sFileModel: the text file the model is written to
//	bCompactModel: true (default) to omit extra debugging info from the model file
//
// The format of compact model parameters
//	1. the number of classes
//	2. the prior probability of each class
//	3. the conditional probability p(item|class)
bool Train (const char * sFileSample, int iClassNum, int iFeaTypeNum,
	std::string & sSegmenter, int iFeaExtractNum, const char * sFileModel, bool bCompactModel = true);

// Load the naive Bayes model
bool LoadNaiveBayesModel (const char * sFileModel);

// Predict the class from the input features
bool PredictByInputFeas (std::vector<int> & FeaIdVec, int & iClassId);

// Predict on a test corpus whose format is the same as the training corpus
bool PredictFrmTstCorpus (const char * sFileTestCorpus, std::string & sSegmenter, const char * sFileOutput);

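Train takes quite a few parameters. First, the text file of input samples, in the format described above. Next, the number of classes (which determines the size of ClassFeaVec) and the number of feature types (usually the vocabulary size). Then the separator between the class index and the features, ":" in our example. Next, the number of features to keep in the final model. In this program I put feature selection and parameter training together; the benefit is that the samples are scanned only once, which is efficient, and the drawback is that the code is too tightly coupled and hard to extend. Last come the output model file and bCompactModel, a flag that controls whether extra information is written for debugging the model; it defaults to true, so the extra information is off by default.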
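The interface comments list the compact model contents as the number of classes, the class priors, and the conditional probabilities. As a rough sketch of what saving and loading such a model might look like, the code below uses an assumed text layout; the author's actual file format is not shown in this article, and WriteCompactModel and ReadCompactModel are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

struct FeaProbNode { int iFeadId; double dProb; };
struct ClassFeaNode { int iClassId; double dClassProb; std::vector<FeaProbNode> FeaVec; };

// Assumed text layout (an illustration, not the author's actual format):
//   line 1: number of classes
//   line 2: prior probability of each class, in class-index order
//   then one line per class: feature count, followed by "featureId probability" pairs
void WriteCompactModel(std::ostream& out, const std::vector<ClassFeaNode>& ClassFeaVec)
{
    out.precision(17);  // enough digits to round-trip a double
    out << ClassFeaVec.size() << '\n';
    for (std::size_t i = 0; i < ClassFeaVec.size(); ++i) {
        // Class ids are not stored, so they must equal their position.
        assert(ClassFeaVec[i].iClassId == static_cast<int>(i));
        out << (i ? " " : "") << ClassFeaVec[i].dClassProb;
    }
    out << '\n';
    for (std::size_t i = 0; i < ClassFeaVec.size(); ++i) {
        const std::vector<FeaProbNode>& feas = ClassFeaVec[i].FeaVec;
        out << feas.size();
        for (std::size_t j = 0; j < feas.size(); ++j)
            out << ' ' << feas[j].iFeadId << ' ' << feas[j].dProb;
        out << '\n';
    }
}

// Reads the layout written above. Returns false on malformed input and
// leaves ClassFeaVec untouched in that case.
bool ReadCompactModel(std::istream& in, std::vector<ClassFeaNode>& ClassFeaVec)
{
    std::size_t nClass = 0;
    if (!(in >> nClass))
        return false;
    std::vector<ClassFeaNode> model(nClass);
    for (std::size_t i = 0; i < nClass; ++i) {
        model[i].iClassId = static_cast<int>(i);
        if (!(in >> model[i].dClassProb))
            return false;
    }
    for (std::size_t i = 0; i < nClass; ++i) {
        std::size_t nFea = 0;
        if (!(in >> nFea))
            return false;
        model[i].FeaVec.resize(nFea);
        for (std::size_t j = 0; j < nFea; ++j)
            if (!(in >> model[i].FeaVec[j].iFeadId >> model[i].FeaVec[j].dProb))
                return false;
    }
    ClassFeaVec.swap(model);
    return true;
}
```

Reading into a temporary vector and swapping at the end means a failed load does not leave a half-filled model behind.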

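All of the above actually lives inside a single class, NaiveBayes; I split it up here for ease of exposition. The next article covers the training process.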
