Looking back at the previous article on aggregation models: three cobblers together are worth one Zhuge Liang. Out of that idea come blending, bagging, boosting, and stacking. Blending can be uniform or non-uniform, and stacking is the conditional variant; AdaBoost (within boosting) learns its models and their linear weights at the same time, while bagging learns its models while combining them uniformly. A decision tree is the case that learns its models while combining them conditionally. As the figure below shows, this completes the aggregation-model picture:
Decision trees are a very traditional algorithm that appeared very early. For example, as shown below, a decision is made according to the time you get off work, whether you have a date, and how close the submission deadline is, which is very similar to how a person reasons:
Strengths and Weaknesses
Strengths: the model is intuitive and easy to understand; it is widely applicable, simple, and easy to implement; training and prediction are fast; prediction accuracy is high.
Weaknesses: there is little solid theory behind it, and the VC dimension given later does not rest on a very complete argument; finding a good tree takes extra effort; there are relatively few representative decision-tree algorithms.
Based on the formula above, the basic algorithm is:
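The formula and the basic algorithm referred to here appear only as images in the original post, so here is a short restatement of the standard formulation from the course (my own wording, not the original figure). A tree hypothesis G(x) uses a branching criterion b(x) to pick one of C branches and then delegates to the sub-tree of that branch:

    G(x) = \sum_{c=1}^{C} [ b(x) = c ] \cdot G_c(x)

The basic algorithm is the matching recursion: learn a branching criterion b(x) from the data, split the data into C parts according to b(x), recursively build a sub-tree G_c on each part, and return their conditional combination; when a termination condition is met, return a constant leaf g(x) instead.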
The CART algorithm adds a few restrictions on top of the generic decision-tree algorithm:
① C = 2: the number of branches is fixed at two, so the tree is a binary tree. ② In the spirit of keeping g(x) simple, each leaf g(x) is just a constant, i.e. a single class (or value). ③ Each splitting condition is chosen by minimizing Ein; the purity measure is essentially entropy, which quantifies disorder. A few common criteria: ID3 (information gain), C4.5 (information gain ratio), and the Gini index.
ID3 is grounded in information theory: it uses information entropy and information gain as the splitting criterion to classify the data.
The improvement on it is C4.5, which no longer uses the raw information gain but the information gain ratio.
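A minimal sketch of these two criteria (my own illustration, not code from the post; the names entropy, info_gain and gain_ratio are made up), assuming the same data layout used later in the post, a list of samples whose last element is the label, and a binary split of the form x[fea] >= value:

import math
from collections import Counter

def entropy(data):
    '''Shannon entropy of the labels (the last column) in data.'''
    counts = Counter(x[-1] for x in data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain(data, fea, value):
    '''ID3-style information gain of splitting data on x[fea] >= value.'''
    first = [x for x in data if x[fea] >= value]
    second = [x for x in data if x[fea] < value]
    n = len(data)
    conditional = (len(first) / n) * entropy(first) + (len(second) / n) * entropy(second)
    return entropy(data) - conditional

def gain_ratio(data, fea, value):
    '''C4.5-style gain ratio: information gain normalized by the entropy of the split
    itself, which penalizes splits that shave off very unbalanced pieces.'''
    p = sum(1 for x in data if x[fea] >= value) / len(data)
    if p in (0.0, 1.0):                    # degenerate split, nothing to normalize
        return 0.0
    split_info = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return info_gain(data, fea, value) / split_info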
CART itself uses the Gini index, but it is still worth explaining how Ein is optimized when fitting a decision tree.
For a regression problem, the first impurity that comes to mind is the mean squared error:
Here ȳ is simply the mean of the yₙ in the node.
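The impurity in the (image-only) formula above is just the within-node squared error; a small sketch of how it could be computed (my own illustration, the name regression_error is made up), again assuming the label is the last element of each sample:

def regression_error(data):
    '''Regression impurity of a node: mean squared error of the labels
    (last column) around their own mean.'''
    if len(data) == 0:
        return 0.0
    ys = [x[-1] for x in data]
    y_bar = sum(ys) / len(ys)
    return sum((y - y_bar) ** 2 for y in ys) / len(ys)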
For classification:
You can see that the Gini index does not differ much from entropy and can, to a degree, stand in for it.
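For reference, if p_k denotes the fraction of samples in a node that belong to class k, the two impurity measures being compared are (standard formulas, restated here because the original image is not reproduced):

    Gini(node) = 1 - \sum_{k} p_k^2
    Entropy(node) = - \sum_{k} p_k \log p_k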
As for CART's terminal conditions, there are naturally two: 1. all the yₙ in the node belong to a single class, so there is nothing left to separate; 2. all the xₙ are identical, so no further split is possible.
The basic procedure:
You can see that CART is very simple and practical for binary classification and regression problems, and it also handles multi-class classification very easily.
But note one problem: since the tree keeps splitting whenever there are still errors, it ends up as a fully grown binary tree and Ein is guaranteed to be 0, which means overfitting. To deal with overfitting we introduce pruning, i.e. regularization. This is also where, as mentioned earlier, decision-tree theory is not especially complete: in practice we simply take NumberOfLeaves ≈ Ω, using the number of leaves directly as the complexity Ω, because more leaves mean a more complex tree.
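Written out, the pruning objective alluded to here is the standard regularized one (my restatement, not a formula from the post; λ would be chosen by validation), where G ranges over the trees obtained by removing leaves from the fully grown tree:

    G^* = \arg\min_{G} \; E_{in}(G) + \lambda \, \Omega(G), \qquad \Omega(G) = \text{NumberOfLeaves}(G)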
Prediction with a decision tree also runs into another problem: when some feature values are missing, the corresponding split cannot be evaluated and no branch can be chosen. A common remedy is the surrogate branch: find an alternative feature that behaves like the missing one. How do we decide that two features are similar? During training, for each chosen split, look for other features whose splits partition the data in roughly the same way; if a candidate feature splits the data in a similar way with a similar result, it is considered similar and is stored alongside the node as a surrogate. At prediction time, when the original feature is missing, the surrogate feature is used to make the branch decision instead. A rough sketch of the idea follows.
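A minimal sketch of how a surrogate could be chosen for one node (entirely my own illustration, not code from the post; find_surrogate and the agreement score are made-up names), assuming the same list-of-samples layout and the same split convention used later (x[fea] >= value goes to the first set):

def find_surrogate(data, best_fea, best_value):
    '''For the chosen split (best_fea, best_value), search the other features for the
    split that sends samples to the same side most often, and return it.'''
    primary = [x[best_fea] >= best_value for x in data]     # True = first (right) set
    best = None                                             # (agreement, fea, value)
    feature_num = len(data[0]) - 1                          # last column is the label
    for fea in range(feature_num):
        if fea == best_fea:
            continue
        for value in set(x[fea] for x in data):
            candidate = [x[fea] >= value for x in data]
            agreement = sum(p == c for p, c in zip(primary, candidate)) / len(data)
            if best is None or agreement > best[0]:
                best = (agreement, fea, value)
    return best    # stored with the node; used when best_fea is missing at prediction time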
It looks rather like AdaBoost!
Finally, a summary:
This part covers building the tree, prediction, and visualizing the tree; there is not much prose, mostly code walkthrough.
First, a function that computes the Gini index:

def cal_gini(data):
    '''calculate the gini index
    input: data(list)
    output: gini(float)
    '''
    total_sample = len(data)
    if total_sample == 0:
        return 0
    label_count = label_uniqueness(data)
    gini = 0
    for label in label_count:
        gini = gini + pow(label_count[label], 2)
    gini = 1 - float(gini) / pow(total_sample, 2)
    return gini
It takes a list, counts how many samples carry each label, and returns the Gini index computed from those counts.
There is also a helper that counts the number of samples per label, which the Gini function above uses:

def label_uniqueness(data):
    '''Counting the number of different labels in the dataset
    input: dataset
    output: number of samples per label
    '''
    label_uniq = {}
    for x in data:
        label = x[len(x) - 1]
        if label not in label_uniq:
            label_uniq[label] = 0
        label_uniq[label] += 1
    return label_uniq
Both of these live in the tool file.
Create the tree node class:

class node:
    '''Tree node
    '''
    def __init__(self, fea=-1, value=None, results=None, right=None, left=None):
        '''initialization function
        :param fea: column index of the splitting feature
        :param value: split value
        :param results: the class counts stored at a leaf
        :param right: right subtree
        :param left: left subtree
        '''
        self.fea = fea
        self.value = value
        self.results = results
        self.right = right
        self.left = left
fea is the dimension the node currently splits on, value is the split value, results holds the label(s) at a leaf, right is the right subtree, and left is the left subtree.
Next is the main class that builds the tree:

import tool            # provides cal_gini and label_uniqueness (the tool file above)
# the node class defined above is assumed to be in scope (or imported) as well

class decision_tree(object):
    def build_tree(self, data):
        '''Create decision tree
        input: data
        output: root
        '''
        if len(data) == 0:
            return node()
        currentGini = tool.cal_gini(data)
        bestGain = 0.0
        bestCriterria = None        # store the optimal cutting point (fea, value)
        bestSets = None             # store the two datasets produced by that split
        feature_num = len(data[0]) - 1   # number of features (last column is the label)
        for fea in range(0, feature_num):
            feature_values = {}
            for sample in data:
                feature_values[sample[fea]] = 1   # collect the values taken in dimension fea
            for value in feature_values.keys():
                (set_first, set_second) = self.split_tree(data, fea, value)
                nowGini = float(len(set_first) * tool.cal_gini(set_first)
                                + len(set_second) * tool.cal_gini(set_second)) / len(data)
                gain = currentGini - nowGini
                if gain > bestGain and len(set_first) > 0 and len(set_second) > 0:
                    bestGain = gain
                    bestCriterria = (fea, value)
                    bestSets = (set_first, set_second)
        if bestGain > 0:
            right = self.build_tree(bestSets[0])
            left = self.build_tree(bestSets[1])
            return node(fea=bestCriterria[0], value=bestCriterria[1], right=right, left=left)
        else:
            return node(results=tool.label_uniqueness(data))

    def split_tree(self, data, fea, value):
        '''split the dataset according to dimension fea and value
        input: data
        output: two datasets
        '''
        set_first = []
        set_second = []
        for x in data:
            if x[fea] >= value:
                set_first.append(x)
            else:
                set_second.append(x)
        return (set_first, set_second)

    def predict(self, sample, tree):
        '''prediction
        input: sample, the tree which has been built
        output: the label counts stored at the leaf
        '''
        if tree.results != None:
            return tree.results
        else:
            val_sample = sample[tree.fea]
            branch = None
            if val_sample >= tree.value:
                branch = tree.right
            else:
                branch = tree.left
            return self.predict(sample, branch)

    def predcit_samples(self, samples, tree):
        predictions = []
        for sample in samples:
            predictions.append(self.predict(sample, tree))
        return predictions
It is actually quite simple: just split recursively by feature and value. I forget whether this counts as a preorder or postorder construction; I wrote it by following the binary-tree material I learned in my first year, most of which I had forgotten half a year later.
Let's see how the prediction does. The data is still the iris dataset. Visualizing the data would require dimensionality reduction, which is a hassle, so I visualized the tree instead, which turned out to be even more of a hassle:

# Imports that this script needs (not shown in the original snippet);
# load_data is assumed to be sklearn's iris loader, since the post says the iris dataset is used.
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris as load_data
from sklearn.model_selection import train_test_split
import decisionTree   # the module containing the decision_tree class above

if __name__ == '__main__':
    print('load_data......')
    dataSet = load_data()
    data = dataSet.data
    target = dataSet.target
    dataframe = pd.DataFrame(data=data, dtype=np.float32)
    dataframe.insert(4, 'label', target)
    dataMat = np.mat(dataframe)

    '''test and train
    '''
    X_train, X_test, y_train, y_test = train_test_split(dataMat[:, 0:-1], dataMat[:, -1],
                                                        test_size=0.3, random_state=0)
    data_train = np.hstack((X_train, y_train))
    data_train = data_train.tolist()
    X_test = X_test.tolist()

    tree = decisionTree.decision_tree()
    tree_root = tree.build_tree(data_train)
    predictions = tree.predcit_samples(X_test, tree_root)

    pres = []
    for i in predictions:
        pres.append(list(i.keys()))
    y_test = y_test.tolist()
    accuracy = 0
    for i in range(len(y_test)):
        if y_test[i] == pres[i]:
            accuracy += 1
    print('Accuracy : ', accuracy / len(y_test))
The accuracy is pretty high.
First we need the number of leaves in the tree, computed recursively as usual:

def getNumLeafs(myTree):
    if myTree == None:
        return 0
    elif myTree.right == None and myTree.left == None:
        return 1
    else:
        return getNumLeafs(myTree.right) + getNumLeafs(myTree.left)
Then the depth:
def getDepth(myTree):
    if myTree == None:
        return 0
    right = getDepth(myTree.right)
    left = getDepth(myTree.left)
    return max(right + 1, left + 1)
After that comes drawing the nodes; the depth and leaf count are only there so the tree can be spread out more evenly according to its depth when drawn.
There is also a small class that stores the coordinates of the parent node:

class TreeNode(object):
    def __init__(self, x, y, parentX=None, parentY=None):
        self.x = x
        self.y = y
        self.parentX = parentX
        self.parentY = parentY
Finally, the main drawing code:
import matplotlib.pyplot as plt   # needed by the drawing functions below

def drawNode(x, y, parent, color, marker, myTree, position):
    if myTree.results == None or len(list(myTree.results.keys())) > 1:
        plt.scatter(x, y, c=color, marker=marker, s=200)
    if myTree.right == None and myTree.left == None:
        results = list(myTree.results.keys())
        plt.annotate(s='label == ' + str(results[0]), xy=(x - 15, y))
        if results[0] == 0.0:
            plt.annotate(s='label == 0.0', xy=(x, y))
            plt.scatter(x, y, c='orange', marker='H', s=100)
        if results[0] == 1.0:
            plt.scatter(x, y, c='pink', marker='8', s=100)
        if results[0] == 2.0:
            plt.scatter(x, y, c='r', marker='+', s=100)
    if myTree.value != None and myTree.fea != None:
        po = 5
        if position == 'right':
            plt.annotate(s='dimension' + str(myTree.fea) + '>' + str(round(myTree.value, 2)),
                         xy=(x - 25 - po, y))
        else:
            plt.annotate(s='dimension' + str(myTree.fea) + '>' + str(round(myTree.value, 2)),
                         xy=(x - 25 + po, y))
    if parent != None:
        plt.plot([x, parent.x], [y, parent.y], color='gray', alpha=0.5)

def draw(myTree, parent=None, x=100, y=100, color='r', marker='^', position=None):
    NumberLeaf = getNumLeafs(myTree)
    Depth = getDepth(myTree)
    delta = (NumberLeaf + Depth)
    drawNode(x, y, parent, color, marker, myTree, position)
    if myTree.right != None:
        draw(myTree.right, parent=TreeNode(x, y), x=x + 5 * delta, y=y - 5 - delta,
             color='b', marker='x', position='right')
    if myTree.left != None:
        draw(myTree.left, parent=TreeNode(x, y), x=x - 5 * delta, y=y - 2 - delta,
             color='g', marker='o', position='left')
The extra plt.annotate(s='label == 0.0', xy=(x, y)) is there because that particular annotation refused to show up no matter what, probably because it was covered by something else. The main work is in draw; drawNode only does the plotting, and all the conditionals in it are just for placing annotations. Here is the resulting figure:
If only I had used Python back when I was learning data structures!
All the code is on GitHub:
Reposted from: http://tdosl.baihongyu.com/