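Below is a named entity recognition (NER) example implemented with Python's pycrfsuite library. I ran it while first surveying NER, to get a feel for what the task actually involves, and I am recording it here for future reference.
Example overview:
Task: NER with a CRF on the general-purpose CoNLL2002 corpus (locations, organizations, and person names; the corpus also carries a MISC class, which shows up in the results below).
Tools: Anaconda2
Corpus: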
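- Corpus: CoNLL2002 (general-purpose)
- Language: Spanish
- Training set: 8323 sentences
- Test set: 1517 sentences
- Format: three columns giving the word, its part-of-speech (POS) tag, and its entity label. The labels use the BIO tag set adopted in the Bakeoff-3 evaluation: B-PER and I-PER mark the first and non-first tokens of a person name, B-LOC and I-LOC the first and non-first tokens of a location name, B-ORG and I-ORG the first and non-first tokens of an organization name, and O marks a token that is not part of any named entity.
For example: EFE NC B-ORG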
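To make the three-column format and the BIO scheme concrete, here is a minimal sketch that parses such rows and groups the labels into entity spans. The helper names `parse_conll_line` and `bio_to_spans` are hypothetical, and apart from the `EFE NC B-ORG` row the sample rows (including their POS tags) are made up for illustration rather than taken from the corpus.

```python
def parse_conll_line(line):
    """Split one 'word POS label' row into its three fields."""
    word, postag, label = line.split()
    return word, postag, label


def bio_to_spans(labels):
    """Group BIO labels into (entity_type, start, end) spans, end exclusive."""
    spans = []
    start, current = None, None
    for i, label in enumerate(labels):
        if label.startswith('I-') and current == label[2:]:
            continue  # the open entity continues
        if current is not None:
            # Any other label closes the entity that was open
            spans.append((current, start, i))
            start, current = None, None
        if label.startswith('B-') or label.startswith('I-'):
            # B- opens a new entity; a stray I- is treated as an opening too
            start, current = i, label[2:]
    if current is not None:
        spans.append((current, start, len(labels)))
    return spans


# Illustrative rows only; just the last one appears in the text above.
rows = [
    "Juan NP B-PER",
    "Carlos NP I-PER",
    "visita VMI O",
    "Madrid NP B-LOC",
    "EFE NC B-ORG",
]
parsed = [parse_conll_line(r) for r in rows]
labels = [label for _, _, label in parsed]
spans = bio_to_spans(labels)
print(spans)
```

Each span gives the entity type and the token range it covers, so `('PER', 0, 2)` corresponds to the two tokens "Juan Carlos".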
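Feature extraction:
The main features used are:
- the lowercase form of the current word
- the suffix of the current word
- whether the current word is all uppercase (isupper)
- whether the current word starts with a capital letter followed by lowercase letters (istitle)
- whether the current word is a number (isdigit)
- the POS tag of the current word
- the prefix of the current word's POS tag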
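As a quick illustration of how these features behave, the sketch below applies the same string operations to a few sample tokens. The function name `token_shape` and the sample tokens and POS tags are assumptions chosen for illustration; the script further down builds its features as strings rather than as a dict.

```python
def token_shape(word, postag):
    """Return the per-token features listed above as a plain dict."""
    return {
        'lower': word.lower(),        # lowercase form
        'suffix3': word[-3:],         # last three characters
        'suffix2': word[-2:],         # last two characters
        'isupper': word.isupper(),    # all cased characters are uppercase
        'istitle': word.istitle(),    # capital first letter, rest lowercase
        'isdigit': word.isdigit(),    # consists only of digits
        'postag': postag,             # full POS tag
        'postag2': postag[:2],        # POS tag prefix
    }


# Sample (word, POS) pairs; the POS tags are illustrative.
samples = [('EFE', 'NC'), ('Melbourne', 'NP'), ('1996', 'Z'), ('visita', 'VMI')]
for word, postag in samples:
    print(word, token_shape(word, postag))
```

Note that an all-uppercase token such as `EFE` is not title-cased, so `isupper` and `istitle` carry different information.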
Algorithm: CRF
Results:
              precision    recall  f1-score   support

       B-LOC       0.78      0.75      0.76      1084
       I-LOC       0.66      0.60      0.63       325
      B-MISC       0.69      0.47      0.56       339
      I-MISC       0.61      0.49      0.54       557
       B-ORG       0.79      0.81      0.80      1400
       I-ORG       0.80      0.79      0.80      1104
       B-PER       0.82      0.87      0.84       735
       I-PER       0.87      0.93      0.90       634

 avg / total       0.77      0.76      0.76      6178
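The "avg / total" row can be reproduced from the per-class rows. The sketch below assumes it is the support-weighted average of each column, which is how older scikit-learn versions labelled their weighted average; it starts from the rounded values printed above, so it only checks agreement to two decimals.

```python
# Per-class rows copied from the report: (precision, recall, f1, support)
rows = {
    'B-LOC':  (0.78, 0.75, 0.76, 1084),
    'I-LOC':  (0.66, 0.60, 0.63, 325),
    'B-MISC': (0.69, 0.47, 0.56, 339),
    'I-MISC': (0.61, 0.49, 0.54, 557),
    'B-ORG':  (0.79, 0.81, 0.80, 1400),
    'I-ORG':  (0.80, 0.79, 0.80, 1104),
    'B-PER':  (0.82, 0.87, 0.84, 735),
    'I-PER':  (0.87, 0.93, 0.90, 634),
}

total_support = sum(r[3] for r in rows.values())


def weighted_average(column):
    """Average one metric column, weighting each class by its support."""
    return sum(r[column] * r[3] for r in rows.values()) / float(total_support)


avg_precision = round(weighted_average(0), 2)
avg_recall = round(weighted_average(1), 2)
avg_f1 = round(weighted_average(2), 2)
print(avg_precision, avg_recall, avg_f1, total_support)
```

Because the weights are the supports, the frequent B-ORG and I-ORG classes pull the average up more than the weaker MISC classes pull it down.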
Script:
"""
@author:
@contact:
@time:
@context: makes a simple example of NER.
"""
from itertools import chain

import nltk
import pycrfsuite
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelBinarizer

# Download the corpus into an explicit directory (see the "Error" note at the end).
nltk.download("conll2002", "E:/nltk_data/")
print(nltk.corpus.conll2002.fileids())

# Each sentence is a list of (word, POS tag, entity label) tuples.
train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))

"""
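Feature extraction. The main features used are:
- the lowercase form of the current word
- the suffix of the current word
- whether the current word is all uppercase (isupper)
- whether the current word starts with a capital letter followed by lowercase letters (istitle)
- whether the current word is a number (isdigit)
- the POS tag of the current word
- the prefix of the current word's POS tag
- the same kind of features for the previous and next words (similar to defining a feature template)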
"""
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    # Features of the current token
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.istitle=%s' % word.istitle(),
        'word.isdigit=%s' % word.isdigit(),
        'postag=' + postag,
        'postag[:2]=' + postag[:2],
    ]
    if i > 0:
        # Features of the previous token
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.extend([
            '-1:word.lower=%s' % word1.lower(),
            '-1:word.istitle=%s' % word1.istitle(),
            '-1:word.isupper=%s' % word1.isupper(),
            '-1:postag=%s' % postag1,
            '-1:postag[:2]=%s' % postag1[:2],
        ])
    else:
        features.append('BOS')  # beginning of sentence
    if i < len(sent)-1:
        # Features of the next token
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.extend([
            '+1:word.lower=%s' % word1.lower(),
            '+1:word.istitle=%s' % word1.istitle(),
            '+1:word.isupper=%s' % word1.isupper(),
            '+1:postag=%s' % postag1,
            '+1:postag[:2]=%s' % postag1[:2],
        ])
    else:
        features.append('EOS')  # end of sentence
    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

X_train = [sent2features(s) for s in train_sents]
Y_train = [sent2labels(s) for s in train_sents]
X_test = [sent2features(s) for s in test_sents]
Y_test = [sent2labels(s) for s in test_sents]
print(len(Y_test))
print(type(Y_test))

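trainer = pycrfsuite.Trainer(verbose=False)
for xseq, yseq in zip(X_train, Y_train):
    trainer.append(xseq, yseq)

trainer.set_params({
    'c1': 1.0,  # coefficient for the L1 penalty
    'c2': 1e-3,  # coefficient for the L2 penalty
    'max_iterations': 50,  # stop after at most 50 iterations
    # also generate transitions that are possible but not observed in the data
    'feature.possible_transitions': True
})

# Train the model and write it to the file that the tagger opens next.
trainer.train('conll2002-esp.crfsuite')
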
tagger = pycrfsuite.Tagger()
tagger.open('conll2002-esp.crfsuite')

example_sent = test_sents[0]

def bio_classification_report(y_true, y_pred):
    # Flatten the per-sentence label lists and binarize the labels
    lb = LabelBinarizer()
    y_true_combined = lb.fit_transform(list(chain.from_iterable(y_true)))
    y_pred_combined = lb.transform(list(chain.from_iterable(y_pred)))

    # Report on every tag except O, ordered by entity type and then by B/I
    tagset = set(lb.classes_) - {'O'}
    tagset = sorted(tagset, key=lambda tag: tag.split('-', 1)[::-1])
    class_indices = {cls: idx for idx, cls in enumerate(lb.classes_)}

    return classification_report(
        y_true_combined,
        y_pred_combined,
        labels=[class_indices[cls] for cls in tagset],
        target_names=tagset,
    )

Y_pred = [tagger.tag(xseq) for xseq in X_test]
print(type(Y_pred))
print(type(Y_test))
print(bio_classification_report(Y_test, Y_pred))
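Error
Downloading the data raised an error; adding the line nltk.download("conll2002", "E:/nltk_data/") to the script, which passes the download directory explicitly, resolved it. Reference: material on how to resolve errors when downloading the data.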
References:
1. [Python] How to use CRFSuite? (2)
2. Let's use CoNLL 2002 data to build a NER system