April 2017

Apr 30

Logistic Regression

# encoding: utf-8

from numpy import *

Predict the class value (sigmoid)

def sigmoid(inX):

return 1.0 / (1 + exp(-inX))

Gradient ascent: compute the regression coefficients

def gradAscent(dataMatIn,classLabels):

dataMatrix = mat(dataMatIn)
# convert the labels to a NumPy matrix and transpose them into a column vector
labelMat = mat(classLabels).transpose()
m,n = shape(dataMatrix)
# step size
alpha = 0.001
# number of iterations
maxCycles = 500
weights = ones((n,1))
for k in range(maxCycles):
    # predicted class probabilities for the current weights
    h = sigmoid(dataMatrix * weights)
    # difference between the true class and the prediction
    error = (labelMat - h)
    # adjust the regression coefficients in the direction of the error
    weights = weights + alpha * dataMatrix.transpose() * error
return weights
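
A quick sketch of the math this loop implements, assuming the standard logistic log-likelihood (the notation here is an aside, not part of the code): with labels y, data matrix X and weights w,

\ell(w) = \sum_i \bigl[ y_i \log \sigma(x_i w) + (1 - y_i) \log (1 - \sigma(x_i w)) \bigr],
\qquad \nabla \ell(w) = X^{\top} \bigl( y - \sigma(Xw) \bigr),
\qquad w \leftarrow w + \alpha \, X^{\top} \bigl( y - \sigma(Xw) \bigr),

which is exactly error = labelMat - h followed by weights = weights + alpha * dataMatrix.transpose() * error.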

Stochastic gradient ascent

def stocGradAscent0(dataMatrix,classLabels,numIter = 150):

m,n = shape(dataMatrix)
weights = ones(n)
for j in range(numIter):
    dataIndex = range(m)
    for i in range(m):
        # start with a larger step size so the weights move quickly, then decay it on every update
        alpha = 4 / (1.0 + j + i) + 0.01
        # pick one of the remaining samples at random; indexing through dataIndex means the
        # del() below really removes the chosen sample, so each example is used once per pass
        randIndex = int(random.uniform(0,len(dataIndex)))
        h = sigmoid(sum(dataMatrix[dataIndex[randIndex]] * weights))
        error = classLabels[dataIndex[randIndex]] - h
        weights = weights + alpha * error * dataMatrix[dataIndex[randIndex]]
        del(dataIndex[randIndex])
return weights

Classify an input vector

def classifyVector(inX,weights):

prob = sigmoid(sum(inX * weights))
if prob > 0.5:
    return 1
else:
    return 0

def colicTest():

frTrain = open('horseColicTraining.txt')
frTest = open('horseColicTest.txt')
trainingSet = []; trainLabels = []
for line in frTrain.readlines():
    currLine = line.strip().split('\t')
    lineArr = []
    for i in range(21):
        lineArr.append(float(currLine[i]))
    trainingSet.append(lineArr)
    trainLabels.append(float(currLine[21]))
trainWeights = stocGradAscent0(array(trainingSet),trainLabels,500)
errorCount = 0; numTestVec = 0.0
for line in frTest.readlines():
    numTestVec += 1.0
    currLine = line.strip().split('\t')
    lineArr = []
    for i in range(21):
        lineArr.append(float(currLine[i]))
    if int(classifyVector(array(lineArr),trainWeights)) != int(currLine[21]):
        errorCount += 1
errorRate = (float(errorCount) / numTestVec)
print "the error rate of this test is: %f" %errorRate
return errorRate

def multiTest():

numTests = 10; errorSum = 0.0
for k in range(numTests):
    errorSum += colicTest()
print "after %d iterations the average error rate is : %f" %(numTests, errorSum / float(numTests))

def plotBestFit(wei):

import matplotlib.pyplot as plt
weights = wei
dataMat ,labelMat = loadDataSet()
dataArr = array(dataMat)
n = shape(dataArr)[0]
xcord1 = []; ycord1 = []
xcord2 = []; ycord2 = []
for i in range(n):
    if int(labelMat[i]) == 1:
        xcord1.append(dataArr[i,1]); ycord1.append(dataArr[i,2])
    else:
        xcord2.append(dataArr[i,1]); ycord2.append(dataArr[i,2])
fig = plt.figure()
ax = fig.add_subplot(111)
# scatter the data points
ax.scatter(xcord1, ycord1, s = 30, c = 'red', marker = 's')
ax.scatter(xcord2, ycord2, s = 30, c = 'green')
# plot the separating line
x = arange(-3.0, 3.0, 0.1)
# the boundary is where w0*x0 + w1*x1 + w2*x2 = 0 with x0 fixed at 1; solve for x2 in terms of x1
y = (-weights[0] - weights[1] * x) / weights[2]
ax.plot(x,y)

plt.xlabel('X1')
plt.ylabel('X2')
plt.show()

def loadDataSet():

dataMat = []; labelMat = []
fr = open('testSet.txt')
for line in fr.readlines():
    lineArr = line.strip().split()
    dataMat.append([1.0,float(lineArr[0]),float(lineArr[1])])
    labelMat.append(int(lineArr[2]))
return dataMat,labelMat

if __name__ == '__main__':

dataArr,labelArr = loadDataSet()
# batch gradient ascent
weights = gradAscent(dataArr, labelArr)
print weights
# plotBestFit(weights.getA())
# stochastic gradient ascent
weights = stocGradAscent0(array(dataArr), labelArr,80)
# plotBestFit(weights)
# test mortality prediction on the horse colic data
print 'dead logistics'
multiTest()
Apr 29

itertools Iterators

Python's built-in itertools module provides a set of very useful functions for working with iterables.

First, let's look at a few of the "infinite" iterators itertools provides:

>>> import itertools
>>> natuals = itertools.count(1)
>>> for n in natuals:
...     print n
...
1
2
3
...
Because count() creates an infinite iterator, the code above prints the natural numbers and never stops; the only way out is Ctrl+C.

cycle() repeats the elements of a sequence you pass in, forever:

>>> import itertools
>>> cs = itertools.cycle('ABC')   # note that a string is also a kind of sequence
>>> for c in cs:
...     print c
...
'A'
'B'
'C'
'A'
'B'
'C'
...
It never stops either.

repeat() repeats a single element forever, but a second argument limits the number of repetitions:

>>> ns = itertools.repeat('A', 10)
>>> for n in ns:
...     print n
...
(prints 'A' 10 times)

An infinite sequence only iterates endlessly when you loop over it with for. Merely creating the iterator object does not generate the infinite elements up front; in fact it would be impossible to hold infinitely many elements in memory.

Although an infinite sequence can be iterated forever, we usually use a function such as takewhile() to cut out a finite sequence based on a condition:

>>> natuals = itertools.count(1)
>>> ns = itertools.takewhile(lambda x: x <= 10, natuals)
>>> for n in ns:
...     print n
...
(prints 1 through 10)

The iterator-manipulating functions itertools provides are even more useful:

chain()

chain() concatenates a group of iterables into one larger iterator:

>>> for c in itertools.chain('ABC', 'XYZ'):
...     print c

The iteration yields: 'A' 'B' 'C' 'X' 'Y' 'Z'

groupby()

groupby() picks out adjacent repeated elements in an iterator and groups them together:

>>> for key, group in itertools.groupby('AAABBBCCAAA'):
...     print key, list(group)   # why do we need list() here?
...
A ['A', 'A', 'A']
B ['B', 'B', 'B']
C ['C', 'C']
A ['A', 'A', 'A']
The grouping rule is actually a key function: two elements are put in the same group whenever the function returns equal values for them, and that return value becomes the group's key. So to group while ignoring case, simply make 'A' and 'a' map to the same key:

>>> for key, group in itertools.groupby('AaaBBbcCAAa', lambda c: c.upper()):
...     print key, list(group)
...
A ['A', 'a', 'a']
B ['B', 'B', 'b']
C ['c', 'C']
A ['A', 'A', 'a']
imap()

imap() differs from map() in that imap() can be applied to infinite sequences, and if the input sequences have different lengths, the shorter one determines when iteration stops.

>>> for x in itertools.imap(lambda x, y: x * y, [10, 20, 30], itertools.count(1)):
...     print x
...
10
40
90
Note that imap() returns an iterator, whereas map() returns a list. By the time map() returns, the result has already been computed:

>>> r = map(lambda x: x*x, [1, 2, 3])
>>> r   # r has already been computed
[1, 4, 9]
When you call imap(), no computation happens at all:

>>> r = itertools.imap(lambda x: x*x, [1, 2, 3])
>>> r
<itertools.imap object at 0x103d3ff90>

r is just an iterator object.

You have to iterate over r with a for loop; only then is the next element computed on each pass:

>>> for x in r:
...     print x
...
1
4
9
This shows that imap() implements "lazy evaluation": it computes only when a result is actually needed. Functions like imap() that evaluate lazily can therefore handle infinite sequences:

>>> r = itertools.imap(lambda x: x*x, itertools.count(1))
>>> for n in itertools.takewhile(lambda x: x < 100, r):
...     print n
...

What is the result?

What happens if you replace imap() with map() on an infinite sequence?

>>> r = map(lambda x: x*x, itertools.count(1))

What is the result?
ifilter()

Not much needs saying here: ifilter() is the lazy version of filter().
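
For example, a minimal sketch combining ifilter() with the takewhile() shown earlier so that the output stays finite:

>>> odds = itertools.ifilter(lambda x: x % 2 == 1, itertools.count(1))
>>> for n in itertools.takewhile(lambda x: x < 10, odds):
...     print n
...
1
3
5
7
9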

Summary

Every function the itertools module provides operates on iterators: the return values are not lists but iterator objects, and the actual computation happens only when you iterate over them with a for loop.

Apr 29

collections

namedtuple

We know that a tuple can represent an immutable collection; for example, the 2-D coordinates of a point can be written as:

>>> p = (1, 2)

But looking at (1, 2), it is hard to tell that this tuple is meant to represent a coordinate.

Defining a class for this would be overkill; this is where namedtuple comes in handy:

>>> from collections import namedtuple
>>> Point = namedtuple('Point', ['x', 'y'])
>>> p = Point(1, 2)
>>> p.x
1
>>> p.y
2

namedtuple is a function that creates a custom tuple type: it fixes the number of elements the tuple holds and lets you refer to an element by attribute name instead of by index.

This way, namedtuple gives us a convenient data type that keeps the immutability of a tuple but can also be accessed through attributes, which is very handy.

We can verify that the Point objects it creates are a subclass of tuple:

>>> isinstance(p, Point)
True
>>> isinstance(p, tuple)
True

Similarly, to represent a circle by its center coordinates and radius, we can define it with namedtuple as well:

namedtuple('name', [list of attribute names]):

>>> Circle = namedtuple('Circle', ['x', 'y', 'r'])
deque

When data is stored in a list, access by index is fast, but inserting and deleting elements is slow: a list is stored linearly, so insertion and deletion become inefficient once the data set is large.

deque is a double-ended list designed for efficient insertion and deletion, and it is well suited for queues and stacks:

>>> from collections import deque
>>> q = deque(['a', 'b', 'c'])
>>> q.append('x')
>>> q.appendleft('y')
>>> q
deque(['y', 'a', 'b', 'c', 'x'])
Besides list's append() and pop(), deque also supports appendleft() and popleft(), so elements can be added to or removed from the head very efficiently.
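
Continuing the session above, a quick sketch of the two pop operations:

>>> q.popleft()   # removes and returns the leftmost element
'y'
>>> q.pop()       # removes and returns the rightmost element
'x'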

defaultdict

With a regular dict, referencing a key that does not exist raises KeyError. If you want a default value to be returned for missing keys, use defaultdict:

>>> from collections import defaultdict
>>> dd = defaultdict(lambda: 'N/A')
>>> dd['key1'] = 'abc'
>>> dd['key1']   # key1 exists
'abc'
>>> dd['key2']   # key2 does not exist, so the default value is returned
'N/A'

Note that the default value is produced by calling a function, and that function is passed in when the defaultdict object is created.

Apart from returning a default value for missing keys, a defaultdict behaves exactly like an ordinary dict.

OrderedDict

The keys of a dict are unordered, so when iterating over a dict we cannot rely on any particular key order.

To keep the keys in order, use OrderedDict:

>>> from collections import OrderedDict
>>> d = dict([('a', 1), ('b', 2), ('c', 3)])
>>> d   # a regular dict's keys are unordered
{'a': 1, 'c': 3, 'b': 2}
>>> od = OrderedDict([('a', 1), ('b', 2), ('c', 3)])
>>> od  # an OrderedDict's keys are ordered
OrderedDict([('a', 1), ('b', 2), ('c', 3)])

Note that an OrderedDict orders its keys by insertion order, not by sorting the keys themselves:

>>> od = OrderedDict()
>>> od['z'] = 1
>>> od['y'] = 2
>>> od['x'] = 3
>>> od.keys()   # returned in insertion order
['z', 'y', 'x']
With an OrderedDict you can implement a FIFO (first-in, first-out) dict that, once its capacity is exceeded, deletes the earliest-added key first:

from collections import OrderedDict

class LastUpdatedOrderedDict(OrderedDict):

def __init__(self, capacity):
    super(LastUpdatedOrderedDict, self).__init__()
    self._capacity = capacity

def __setitem__(self, key, value):
    containsKey = 1 if key in self else 0
    if len(self) - containsKey >= self._capacity:
        last = self.popitem(last=False)
        print 'remove:', last
    if containsKey:
        del self[key]
        print 'set:', (key, value)
    else:
        print 'add:', (key, value)
    OrderedDict.__setitem__(self, key, value)
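
A short usage sketch of the class above (a capacity of 2 is an arbitrary choice for illustration):

>>> d = LastUpdatedOrderedDict(2)
>>> d['a'] = 1
add: ('a', 1)
>>> d['b'] = 2
add: ('b', 2)
>>> d['c'] = 3          # over capacity, so the oldest key 'a' is removed first
remove: ('a', 1)
add: ('c', 3)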

Counter

Counter is a simple counter; for example, to count how many times each character occurs:

>>> from collections import Counter
>>> c = Counter()
>>> for ch in 'programming':
...     c[ch] = c[ch] + 1
...
>>> c
Counter({'g': 2, 'm': 2, 'r': 2, 'a': 1, 'i': 1, 'o': 1, 'n': 1, 'p': 1})
Counter is in fact a subclass of dict; the result above shows that the characters 'g', 'm', and 'r' each appear twice and the other characters once.
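
Incidentally, a Counter can also be built straight from an iterable, which gives the same counts as the loop above:

>>> Counter('programming') == c
True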

Apr 29

Naive Bayes

# encoding: utf-8

from numpy import *

Build the vocabulary list

def createVocabList(dataSet):

vocabSet = set([])
for document in dataSet:
    vocabSet = vocabSet | set(document)
return list(vocabSet)

Mark whether each vocabulary word appears in the input (set-of-words model)

def setOfWords2Vec(vocabList,inputSet):

resultVec = [0] * len(vocabList)
for word in inputSet:
    if word in vocabList:
        resultVec[vocabList.index(word)] = 1
    else:
        print "the word: %s is not in my Vocabulary!" % word
return resultVec

Count how many times each vocabulary word appears in the input (bag-of-words model)

def bagOfWords2Vec(vocabList,inputSet):

resultVec = [0] * len(vocabList)
for word in inputSet:
    if word in vocabList:
        resultVec[vocabList.index(word)] += 1
    else:
        print "the word: %s is not in my Vocabulary!" % word
return resultVec
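
A quick illustration of the difference between the two models, using a made-up three-word vocabulary:

vocab = ['dog', 'my', 'stupid']
print setOfWords2Vec(vocab, ['my', 'dog', 'my'])   # [1, 1, 0]  -- only records presence
print bagOfWords2Vec(vocab, ['my', 'dog', 'my'])   # [1, 2, 0]  -- records how many times each word occurs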

Training: estimate each word's conditional probability within each class

def trainNBO(trainMatrix,trainCategory):

# number of training documents
numTrainDocs = len(trainMatrix)
# vocabulary size
numWords = len(trainMatrix[0])
# fraction of the training documents that are abusive (class 1)
pAbusive = sum(trainCategory) / float(numTrainDocs)
# initialize the counts to 1 and the denominators to 2 (Laplace smoothing),
# so that a word never seen in a class does not force its probability to 0
p0Num = ones(numWords) #zeros(numWords)
p1Num = ones(numWords) #zeros(numWords)
p0Denom = 2.0 #0.0
p1Denom = 2.0 #0.0
# loop over the training documents
for i in range(numTrainDocs):
    # if the document is abusive, add its word counts to the class-1 totals, otherwise to the class-0 totals
    if trainCategory[i] == 1:
        p1Num += trainMatrix[i]
        p1Denom += sum(trainMatrix[i])
    else:
        p0Num += trainMatrix[i]
        p0Denom += sum(trainMatrix[i])
# take the natural logarithm (base e ~ 2.71828) so that the products of many tiny
# probabilities used later do not underflow to 0  ========== important ==========
p1Vect = log(p1Num / p1Denom)
p0Vect = log(p0Num / p0Denom)
return p0Vect,p1Vect,pAbusive

Naive Bayes classification

def classifyNB(vec2Classify,p0Vec,p1Vec,pClassl):

p1 = sum(vec2Classify * p1Vec) + log(pClassl)
p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClassl)
if p1 > p0:
    return 1
else:
    return 0
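
A sketch of why classifyNB only needs to sum logarithms (assuming the usual naive-Bayes independence assumption): for a document whose word-count vector is n, the log posterior of class c is

\log p(c \mid \text{doc}) \;\propto\; \sum_i n_i \log p(w_i \mid c) + \log p(c),

so sum(vec2Classify * p1Vec) + log(pClassl) and sum(vec2Classify * p0Vec) + log(1.0 - pClassl) are the two class scores being compared.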

Parse English text into a list of words

def textParse(bigStr):

import re
b = re.split(r'\W+',bigStr)
b = [tok for tok in b if len(tok) > 2]
return b

def loadDataSet():

postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
             ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
             ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
             ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
             ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
             ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
classVec = [0,1,0,1,0,1]    #1 is abusive, 0 not
return postingList,classVec

def spamTest():

print '============================================='
docList = []
classList = []
fullList = []
for i in range(1,26):
    # training data: spam emails
    wordList = textParse(open('spam/%d.txt' %i).read())
    docList.append(wordList)
    fullList.extend(wordList)
    classList.append(1)
    # training data: normal (ham) emails
    wordList = textParse(open('ham/%d.txt' %i).read())
    docList.append(wordList)
    fullList.extend(wordList)
    classList.append(0)
# build the vocabulary list
vocabList = createVocabList(docList)
# 50 emails in total (25 spam and 25 ham); randomly hold out 10 to estimate the error rate
trainingSet = range(50)
testSet = []
for i in range(10):
    rangeIndex = int(random.uniform(0,len(trainingSet)))
    testSet.append(trainingSet[rangeIndex])
    del(trainingSet[rangeIndex])
# build the naive Bayes training matrix
trainMat = []
trainClasses = []
for docIndex in trainingSet:
    trainMat.append(setOfWords2Vec(vocabList,docList[docIndex]))
    trainClasses.append(classList[docIndex])
p0V,p1V,pSpam = trainNBO(array(trainMat),array(trainClasses))
# classify the 10 held-out emails
errorCount = 0
for docIndex in testSet:
    wordVector = setOfWords2Vec(vocabList,docList[docIndex])
    if classifyNB(array(wordVector),p0V,p1V,pSpam) != classList[docIndex]:
        errorCount += 1
print 'the error rate is:',float(errorCount) / len(testSet)

Find the most frequent words: high-frequency and auxiliary words can be removed with this function; a stop-word list such as http://www.ranks.nl/stopwords also helps

def calcMostFreq(vocabList,fullText):

import operator
freqDict = {}
for token in vocabList:
    freqDict[token] = fullText.count(token)
sortedFreq = sorted(freqDict.iteritems(),key=operator.itemgetter(1),reverse=True)
# keep the top 30 high-frequency words; how many of them are removed noticeably affects the error rate
sortedFreq = sortedFreq[:30]
stopword = "a about above after again against all am an and any are aren't as at be because been before being below between both but by can't cannot could couldn't did didn't do does doesn't doing don't down during each few for from further had hadn't has hasn't have haven't having he he'd he'll he's her here here's hers herself him himself his how how's i i'd i'll i'm i've if in into is isn't it it's its itself let's me more most mustn't my myself no nor not of off on once only or other ought our ours    ourselves out over own same shan't she she'd she'll she's should shouldn't so some such than that that's the their theirs them themselves then there there's these they they'd they'll they're they've this those through to too under until up very was wasn't we we'd we'll we're we've were weren't what what's when when's where where's which while who who's whom why why's with won't would wouldn't you you'd you'll you're you've your yours yourself yourselves"
stopword = stopword.split()
stopword = [ (i,) for i in stopword]
# return the top-30 high-frequency words together with the stop words; the caller strips all of them from the vocabulary
sortedFreq.extend(stopword)
return sortedFreq

def localWords(feed1,feed0):

import feedparser
docList = []; classList = []; fullText = []
minLen = min(len(feed1['entries']),len(feed0['entries']))
for i in range(minLen):
    wordList = textParse(feed1['entries'][i]['summary'])
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(1)
    wordList = textParse(feed0['entries'][i]['summary'])
    docList.append(wordList)
    fullText.extend(wordList)
    classList.append(0)
# build the vocabulary list
vocabList = createVocabList(docList)
# find the high-frequency / auxiliary words and stop words
top3Words = calcMostFreq(vocabList,fullText)
# remove them from the vocabulary
for pairW in top3Words:
    if pairW[0] in vocabList:
        vocabList.remove(pairW[0])
# training indices: twice the length of the shorter feed
trainingSet = range(2 * minLen)
testSet = []
# randomly hold out 20 documents for testing
for i in range(20):
    # random.uniform(x, y) returns a random float between x and y
    randIndex = int(random.uniform(0,len(trainingSet)))
    testSet.append(trainingSet[randIndex])
    del(trainingSet[randIndex])
trainMat = [];trainClasses = []
for docIndex in trainingSet:
    trainMat.append(bagOfWords2Vec(vocabList,docList[docIndex]))
    trainClasses.append(classList[docIndex])
# train the classifier
p0v,p1v,pSpam = trainNBO(array(trainMat),array(trainClasses))
errorCount = 0
# measure the error rate on the held-out documents
for docIndex in testSet:
    wordVector = bagOfWords2Vec(vocabList,docList[docIndex])
    if classifyNB(array(wordVector),p0v,p1v,pSpam) != classList[docIndex]:
        errorCount += 1
print 'the error rate is :',float(errorCount) / len(testSet)
return vocabList,p0v,p1v

def testRss():

import feedparser
print 'testRss........'
# parse locally saved copies of the feeds
ny = feedparser.parse('newyork.rss')
sf = feedparser.parse('sfbay.rss')
# or fetch them directly from the web:
# ny = feedparser.parse('https://newyork.craigslist.org/search/stp?format=rss')  #
# sf = feedparser.parse('https://sfbay.craigslist.org/search/stp?format=rss')    #
vocabList,pSF,pNY = localWords(ny,sf)
#print 'vocabList,pSF,pNY:',vocabList,pSF,pNY

if __name__ == '__main__':

# load the toy training data
postingList,classVec = loadDataSet()
# build the vocabulary
wordDict = createVocabList(postingList)
print 'dict:',wordDict
# convert an input document into its vocabulary vector
markList = setOfWords2Vec(wordDict,postingList[0])
print 'mark:',markList
# markList = setOfWords2Vec(wordDict, postingList[3])
# print 'mark:', markList
trainMat = []
for doc in postingList:
    trainMat.append(setOfWords2Vec(wordDict,doc))
# per-word conditional probabilities for each class
p0v,p1v,pa = trainNBO(trainMat,classVec)
print p0v
print p1v
print 'NBO:' , pa
# classify a couple of test sentences
testEntry = ['love','my','dalmation']
thisDoc = array(setOfWords2Vec(wordDict,testEntry))
print 'thisDoc:',thisDoc
print testEntry,'classified as:',classifyNB(thisDoc,p0v,p1v,pa)
testEntry = ['stupid','you','me']
thisDoc = array(setOfWords2Vec(wordDict,testEntry))
print 'thisDoc:',thisDoc
print testEntry,'classified as:',classifyNB(thisDoc,p0v,p1v,pa)
str = 'Yii Framework is a good choice for developing new high quality web-applications in rapid time. The well designed foundation with excellent documentation helped us developing Chives remarkable user experience and functionality in very short time'
arrStr = textParse(str)
thisDoc = array(bagOfWords2Vec(wordDict,arrStr))
print thisDoc,'classified as:',classifyNB(thisDoc,p0v,p1v,pa)
spamTest()
testRss()
Apr 25

Decision Tree (ID3)

# encoding: utf-8

from math import log
import operator
import matplotlib.pyplot as plt
try:

import cPickle as pickle

except ImportError:

import pickle

Compute the Shannon entropy

def calcShannonEnt(dataSet):

numEntries = len(dataSet)
labelCounts = {}
for featVec in dataSet:
    currentLabel = featVec[-1]
    if currentLabel not in labelCounts.keys():
        labelCounts[currentLabel] = 0
    labelCounts[currentLabel] += 1
shannonEnt = 0.0
for key in labelCounts:
    # frequency of this class
    prob = float(labelCounts[key])/numEntries
    # entropy contribution of this class: -p * log2(p)
    shannonEnt -= prob * log(prob,2)
return shannonEnt
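
For reference, the quantity computed above is the Shannon entropy of the class labels,

H(S) = -\sum_k p_k \log_2 p_k,

where p_k is the fraction of examples belonging to class k; the more classes there are and the more evenly they are mixed, the higher the entropy.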

Split the data set on a given feature

def splitDataSet(dataSet,axis,value): # 1. data set to split  2. index of the feature to split on  3. feature value to keep

retDataSet = []
for featVec in dataSet:
    if featVec[axis] == value:
        reducedFeatVec = featVec[:axis]
        reducedFeatVec.extend(featVec[axis + 1:])
        retDataSet.append(reducedFeatVec)
return retDataSet

Choose the best feature for splitting the data set

def chooseBestFeatureToSplit(dataSet):

# number of features (the last column is the class label)
numFeatures = len(dataSet[0]) - 1
# entropy of the whole data set
baseEntropy = calcShannonEnt(dataSet)
bestInfoGain = 0.0 ; bestFeature = -1
for i in range(numFeatures):
    # all values taken by this feature
    featList = [example[i] for example in dataSet]
    # unique values only
    uniqueValues = set(featList)
    newEntropy = 0.0
    for value in uniqueValues:
        subDataSet = splitDataSet(dataSet,i,value)
        prob = len(subDataSet) / float(len(dataSet))
        # weighted entropy of the subset for this value
        newEntropy += prob * calcShannonEnt(subDataSet)
    # information gain: base entropy minus the entropy after splitting on this feature
    infoGain = baseEntropy - newEntropy
    # remember the feature with the largest information gain
    if (infoGain > bestInfoGain):
        bestInfoGain = infoGain
        bestFeature = i
return bestFeature
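
The loop above selects the feature A with the largest information gain, i.e. the standard ID3 criterion:

\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} H(S_v),

where S_v is the subset produced by splitDataSet for value v, |S_v|/|S| is prob, and the weighted sum is newEntropy.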

Return the most common class label (majority vote)

def majorityCnt(classList):

classCount = {}
for vote in classList:
    if vote not in classCount.keys():classCount[vote] = 0
    classCount[vote] += 1
sortedClassCount = sorted(classCount.iteritems(),key=operator.itemgetter(1),reverse=True)
return sortedClassCount[0][0]

Build the decision tree

def createTree(dataSet,labels):

# class labels of the current data set
classList = [example[-1] for example in dataSet]
# if all examples have the same class, return that class
if classList.count(classList[0]) == len(classList):
    return classList[0]
# if there are no features left to split on, return the majority class
if len(dataSet[0]) == 1:
    return majorityCnt(classList)
# best feature of the current data set
bestFeat = chooseBestFeatureToSplit(dataSet)
bestFeatLabel = labels[bestFeat]
# current node
myTree = {bestFeatLabel:{}}
del(labels[bestFeat])
# values taken by the best feature
featValues = [example[bestFeat] for example in dataSet]
uniqueValues = set(featValues)
for value in uniqueValues:
    # slice to copy the list; plain assignment would only copy the reference (compare id(object))
    subLabels = labels[:]
    myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet,bestFeat,value),subLabels)
return myTree

Number of leaf nodes

def getNumLeafs(myTree):

numLeafs = 0
firstStr = myTree.keys()[0]
secondDict = myTree[firstStr]
for key in secondDict.keys():
    if type(secondDict[key]).__name__ == 'dict':
        numLeafs += getNumLeafs(secondDict[key])
    else:
        numLeafs += 1
return numLeafs

Depth of the tree

def getTreeDepth(myTree):

maxDepth = 0
firstStr = myTree.keys()[0]
secondDict = myTree[firstStr]
for key in secondDict.keys():
    if type(secondDict[key]).__name__ == 'dict':
        thisDepth = 1 + getTreeDepth(secondDict[key])
    else:
        thisDepth = 1
    if thisDepth > maxDepth:
        maxDepth = thisDepth
return maxDepth

Classify with the decision tree

def classify(inputTree,featLabels,testVec):

# feature name at this node
firstStr = inputTree.keys()[0]
# subtree keyed by that feature's values
secondDict = inputTree[firstStr]
# position of this feature in the feature-label list
featIndex = featLabels.index(firstStr)
for key in secondDict.keys():
    # follow the branch whose value matches the input; recurse if it is a subtree, otherwise it is a class label
    if testVec[featIndex] == key:
        if type(secondDict[key]).__name__ == 'dict':
            classLabel = classify(secondDict[key],featLabels,testVec)
        else:
            classLabel = secondDict[key]
return classLabel

Save the decision tree

def storeTree(inputTree,fileName):

import pickle
with open(fileName,'w') as f:
    data = pickle.dumps(inputTree)
    f.write(data)

Load the decision tree

def grabTree(fileName):

import pickle
with open(fileName,'r') as f:
    tmpData = f.read()
    data = pickle.loads(tmpData)
return data
Plotting =========================

decisionNode = dict(boxstyle = 'sawtooth', fc = '0.8')
leafNode = dict(boxstyle = 'round4', fc = '0.8')
arrow_args = dict(arrowstyle='<-')

def plotNode(nodeTxt, centerPt,parentPt, nodeType):

createPlot.ax1.annotate(nodeTxt,xy = parentPt,xycoords='axes fraction',xytext=centerPt,textcoords='axes fraction',va='center',ha='center',bbox=nodeType,arrowprops=arrow_args)

def createPlot():   # simple demo version; the createPlot(inTree) defined further below supersedes it

fig = plt.figure(1,facecolor='white')

fig.clf()

createPlot.ax1 = plt.subplot(111,frameon=False)

plotNode('a decision node',(0.5,0.1),(0.1,0.5),decisionNode)

plotNode('a leaf node',(0.8,0.1),(0.3,0.8),leafNode)

plt.show()

def plotMidText(cntrPt,parentPt,txtString):

xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
createPlot.ax1.text(xMid,yMid,txtString)

def plotTree(myTree,parentPt,nodeTxt):

numLeafs = getNumLeafs(myTree)
depth = getTreeDepth(myTree)
firstStr = myTree.keys()[0]
cntrPt = (plotTree.xOff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW,\
          plotTree.yOff)
plotMidText(cntrPt,parentPt,nodeTxt)
plotNode(firstStr,cntrPt,parentPt,decisionNode)
secondDict = myTree[firstStr]
plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalD
for key in secondDict.keys():
    if type(secondDict[key]).__name__  == 'dict':
        plotTree(secondDict[key],cntrPt,str(key))
    else:
        plotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalW
        plotNode(secondDict[key],(plotTree.xOff,plotTree.yOff),cntrPt,leafNode)
        plotMidText((plotTree.xOff,plotTree.yOff),cntrPt,str(key))
plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD

def createPlot(inTree):

fig = plt.figure(1,facecolor='white')
fig.clf()
axprops = dict(xticks=[],yticks=[])
createPlot.ax1 = plt.subplot(111,frameon = False,**axprops)
plotTree.totalW = float(getNumLeafs(inTree))
plotTree.totalD = float(getTreeDepth(inTree))
plotTree.xOff = -0.5 / plotTree.totalW
plotTree.yOff = 1.0
plotTree(inTree,(0.5,1.0),'')
plt.show()
Test code

Two ready-made trees

def retrieveTree(i):

listOfTree = [
    {'no surfacing':{0:'no',1:{'flippers':{0:'no',1:'yes'}}}},
    {'no surfacing':{0:'no',1:{'flippers':{0:{'head':{0:'no',1:'yes'}},1:'no'}}}}
]
return listOfTree[i]

def createDataSet():

dataSet = [
    [1,1,'Yes'],
    [1,1,'Yes'],
    [1,0,'No'],
    [0,1,'No'],
    [0,1,'No']
]
labels = ['no surfacing','flippers']
return dataSet,labels

if __name__ == '__main__':

dataSet , labels = createDataSet()
print dataSet
print calcShannonEnt(dataSet)
# the more classes there are, the higher the entropy
dataSet[0][-1] = 'maybe'
print dataSet
print calcShannonEnt(dataSet)
# test the split function
print 'test the split function'
dataSet[0][-1] = 'Yes'
print splitDataSet(dataSet, 0, 1)
print splitDataSet(dataSet, 1, 1)
# index of the best feature to split on (largest information gain)
print chooseBestFeatureToSplit(dataSet)
# build the tree
print 'tree'
tmpTree = createTree(dataSet,labels[:])
print tmpTree

print 'number of leaves:', getNumLeafs(retrieveTree(0))
print 'depth:', getTreeDepth(retrieveTree(0))
print 'number of leaves:', getNumLeafs(retrieveTree(1))
print 'depth:', getTreeDepth(retrieveTree(1))
print retrieveTree(1)
# draw the decision tree
#createPlot(tmpTree)
print classify(tmpTree,labels,[1,1])
# serialize and deserialize
print 'serialize'
print pickle.dumps(tmpTree)
print pickle.loads(pickle.dumps(tmpTree))
storeTree(tmpTree,'a.txt')
print grabTree('a.txt')
# contact-lens example
fr = open(u'lenses.txt','r')
lenses = [inst.strip().split('\t') for inst in fr.readlines()]
lensesLabels = ['age','prescript','astigmatic','tearRate']
lensesTree = createTree(lenses,lensesLabels)
print lensesTree
#createPlot(lensesTree)

The difference between list's extend and append:

a = [1,2,3]
b = [4,5,6]
a.extend(b)   # a is now [1,2,3,4,5,6]
a.append(b)   # starting again from a = [1,2,3], a becomes [1,2,3,[4,5,6]]

Data (lenses.txt)
young myope no reduced no lenses
young myope no normal soft
young myope yes reduced no lenses
young myope yes normal hard
young hyper no reduced no lenses
young hyper no normal soft
young hyper yes reduced no lenses
young hyper yes normal hard
pre myope no reduced no lenses
pre myope no normal soft
pre myope yes reduced no lenses
pre myope yes normal hard
pre hyper no reduced no lenses
pre hyper no normal soft
pre hyper yes reduced no lenses
pre hyper yes normal no lenses
presbyopic myope no reduced no lenses
presbyopic myope no normal no lenses
presbyopic myope yes reduced no lenses
presbyopic myope yes normal hard
presbyopic hyper no reduced no lenses
presbyopic hyper no normal soft
presbyopic hyper yes reduced no lenses
presbyopic hyper yes normal no lenses