I've recently been working on Weibo sentiment classification based on the Naive Bayes algorithm. I won't repeat the derivation of Naive Bayes here. Classification generally involves the following steps:
1. Segment the corpus (the training text) into words
2. Compute TF-IDF over the segmented text (for an introduction to TF-IDF see http://blog.csdn.net/yqlakers/article/details/70888897; the exact formulas used here are sketched right after this list)
3. Use the computed TF-IDF scores to classify new text
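Before the code, here is a compact sketch of the TF-IDF formulas that the implementation below uses. The class, method, and parameter names are mine, for illustration only; they are not part of the project code.

package tfidf;

public class TfidfFormula {
    // tf(w, d)    = count of w in d / total number of words in d
    // idf(w)      = log10((1 + D) / Dt), where D is the number of corpus files
    //               and Dt the number of files containing w
    // tfidf(w, d) = tf(w, d) * idf(w)
    public static float tfidf(int countInDoc, int wordsInDoc, int docsWithWord, int totalDocs) {
        float tf = (float) countInDoc / wordsInDoc;
        float idf = (float) (Math.log((1.0 + totalDocs) / docsWithWord) / Math.log(10));
        return tf * idf;
    }
}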
Below I'll go through this process directly in code.
Tokenizer:
package tfidf;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class MMAnalyzer {

    public MMAnalyzer() {
    }

    // Segments splitText into words separated by single spaces.
    // Note: the second parameter is not used inside this method.
    public String segment(String splitText, String str) throws IOException {
        // Load the stop-word list, one word per line.
        BufferedReader reader = new BufferedReader(new FileReader(new File("F:/Porject/Hadoop project/TDIDF/TFIDF/TFIDF/stop_words.txt")));
        String line = reader.readLine();
        String wordString = "";
        while (line != null) {
            wordString = wordString + " " + line;
            line = reader.readLine();
        }
        reader.close();
        String[] self_stop_words = wordString.split(" ");

        CharArraySet cas = new CharArraySet(Version.LUCENE_46, 0, true);
        for (int i = 0; i < self_stop_words.length; i++) {
            cas.add(self_stop_words[i]);
        }

        @SuppressWarnings("resource")
        SmartChineseAnalyzer sca = new SmartChineseAnalyzer(Version.LUCENE_46, cas);

        // The analyzer produces a TokenStream that holds the segmentation result;
        // the individual tokens are read from it below.
        TokenStream ts = sca.tokenStream("field", splitText);
        // Reset the stream to the beginning before consuming it.
        ts.reset();

        // Walk through all tokens: incrementToken() advances the stream to the next token.
        String words = "";
        while (ts.incrementToken()) {
            String word = ts.getAttribute(CharTermAttribute.class).toString();
            System.out.println(word);
            words = words + word + ' ';
        }
        ts.end();
        ts.close();
        return words;
    }
}
The segmentation uses Lucene's SmartChineseAnalyzer, plus a stop-word list I found online; that is how the Weibo text is tokenized.
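As a quick sanity check, the segmenter can be run on its own. This small demo is mine (the sample sentence is made up, and the exact output depends on the analyzer and the stop-word list):

package tfidf;

import java.io.IOException;

public class SegmentDemo {
    public static void main(String[] args) throws IOException {
        MMAnalyzer analyzer = new MMAnalyzer();
        // Tokens come back joined by single spaces, e.g. "今天 天气 不错".
        String segmented = analyzer.segment("今天天气不错", " ");
        System.out.println(segmented);
    }
}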
Reading the corpus and computing TF-IDF:
package tfidf;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReadFiles {

    private static List<String> fileList = new ArrayList<String>();
    private static HashMap<String, HashMap<String, Float>> allTheTf = new HashMap<String, HashMap<String, Float>>();
    private static HashMap<String, HashMap<String, Integer>> allTheNormalTF = new HashMap<String, HashMap<String, Integer>>();

    // Collect the absolute paths of all files under a directory (recursively).
    public static List<String> readDirs(String filepath) throws FileNotFoundException, IOException {
        try {
            File file = new File(filepath);
            if (!file.isDirectory()) {
                System.out.println("Please input the name of the file:");
                System.out.println("filepath: " + file.getAbsolutePath());
            } else if (file.isDirectory()) {
                String[] filelist = file.list();
                for (int i = 0; i < filelist.length; i++) {
                    File readfile = new File(filepath + "\\" + filelist[i]);
                    if (!readfile.isDirectory()) {
                        fileList.add(readfile.getAbsolutePath());
                    } else if (readfile.isDirectory()) {
                        readDirs(filepath + "\\" + filelist[i]);
                    }
                }
            }
        } catch (FileNotFoundException e) {
            System.out.println(e.getMessage());
        }
        return fileList;
    }

    // Read a whole file into one string (UTF-8).
    public static String readFiles(String file) throws FileNotFoundException, IOException {
        StringBuffer sb = new StringBuffer();
        InputStreamReader is = new InputStreamReader(new FileInputStream(file), "utf-8");
        BufferedReader br = new BufferedReader(is);
        String line = br.readLine();
        while (line != null) {
            sb.append(line).append("\r\n");
            line = br.readLine();
        }
        br.close();
        return sb.toString();
    }

    // Segment the content of a file into words.
    public static String[] cutWord(String file) throws IOException {
        String[] cutWordResult = null;
        String text = ReadFiles.readFiles(file);
        MMAnalyzer analyzer = new MMAnalyzer();
        String tempCutWordResult = analyzer.segment(text, " ");
        cutWordResult = tempCutWordResult.split(" ");
        return cutWordResult;
    }

    // Normalized term frequency: count of each word divided by the total word count of the file.
    public static HashMap<String, Float> tf(String[] cutWordResult) {
        HashMap<String, Float> tf = new HashMap<String, Float>();
        int wordNum = cutWordResult.length;
        int wordtf = 0;
        for (int i = 0; i < wordNum; i++) {
            wordtf = 0;
            for (int j = 0; j < wordNum; j++) {
                if (!cutWordResult[i].equals(" ") && i != j) {
                    if (cutWordResult[i].equals(cutWordResult[j])) {
                        cutWordResult[j] = " ";
                        wordtf++;
                    }
                }
            }
            if (!cutWordResult[i].equals(" ")) {
                tf.put(cutWordResult[i], (new Float(++wordtf)) / wordNum);
                cutWordResult[i] = " ";
            }
        }
        return tf;
    }

    // Raw term frequency: plain counts per word.
    public static HashMap<String, Integer> normalTF(String[] cutWordResult) {
        HashMap<String, Integer> tfNormal = new HashMap<String, Integer>();
        int wordNum = cutWordResult.length;
        int wordtf = 0;
        for (int i = 0; i < wordNum; i++) {
            wordtf = 0;
            if (!cutWordResult[i].equals(" ")) {
                for (int j = 0; j < wordNum; j++) {
                    if (i != j) {
                        if (cutWordResult[i].equals(cutWordResult[j])) {
                            cutWordResult[j] = " ";
                            wordtf++;
                        }
                    }
                }
                tfNormal.put(cutWordResult[i], ++wordtf);
                cutWordResult[i] = " ";
            }
        }
        return tfNormal;
    }

    // Normalized TF for every file in the directory.
    public static Map<String, HashMap<String, Float>> tfOfAll(String dir) throws IOException {
        List<String> fileList = ReadFiles.readDirs(dir);
        for (String file : fileList) {
            HashMap<String, Float> dict = new HashMap<String, Float>();
            dict = ReadFiles.tf(ReadFiles.cutWord(file));
            allTheTf.put(file, dict);
        }
        return allTheTf;
    }

    // Raw TF for every file in the directory.
    public static Map<String, HashMap<String, Integer>> NormalTFOfAll(String dir) throws IOException {
        List<String> fileList = ReadFiles.readDirs(dir);
        for (int i = 0; i < fileList.size(); i++) {
            HashMap<String, Integer> dict = new HashMap<String, Integer>();
            dict = ReadFiles.normalTF(ReadFiles.cutWord(fileList.get(i)));
            allTheNormalTF.put(fileList.get(i), dict);
        }
        return allTheNormalTF;
    }

    // IDF of each word: log10((1 + D) / Dt), where D is the number of documents
    // and Dt the number of documents containing the word.
    public static Map<String, Float> idf(String dir) throws FileNotFoundException, UnsupportedEncodingException, IOException {
        Map<String, Float> idf = new HashMap<String, Float>();
        List<String> located = new ArrayList<String>();
        Map<String, HashMap<String, Integer>> allTheNormalTF = ReadFiles.NormalTFOfAll(dir);
        float Dt = 1;
        float D = allTheNormalTF.size();
        List<String> key = fileList;
        Map<String, HashMap<String, Integer>> tfInIdf = allTheNormalTF;
        for (int i = 0; i < D; i++) {
            HashMap<String, Integer> temp = tfInIdf.get(key.get(i));
            for (String word : temp.keySet()) {
                Dt = 1;
                if (!(located.contains(word))) {
                    for (int k = 0; k < D; k++) {
                        if (k != i) {
                            HashMap<String, Integer> temp2 = tfInIdf.get(key.get(k));
                            if (temp2.keySet().contains(word)) {
                                located.add(word);
                                Dt = Dt + 1;
                                continue;
                            }
                        }
                    }
                    idf.put(word, Log.log((1 + D) / Dt, 10));
                }
            }
        }
        return idf;
    }

    // TF-IDF of each word in each file: tf * idf.
    public static Map<String, HashMap<String, Float>> tfidf(String dir) throws IOException {
        Map<String, Float> idf = ReadFiles.idf(dir);
        Map<String, HashMap<String, Float>> tf = ReadFiles.tfOfAll(dir);
        for (String file : tf.keySet()) {
            Map<String, Float> singelFile = tf.get(file);
            for (String word : singelFile.keySet()) {
                singelFile.put(word, (idf.get(word)) * singelFile.get(word));
            }
        }
        return tf;
    }
}

Logarithm helper:
package tfidf;

public class Log {
    public static float log(float value, float base) {
        return (float) (Math.log(value) / Math.log(base));
    }
}
Classification:
package tfidf;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class Classifier {

    // Segment the input text, sum the TF-IDF scores of its words in each class corpus,
    // and return the file (class) with the higher total score.
    public static String classify(String text, Map<String, HashMap<String, Float>> tfidf) throws IOException {
        String[] result = null;
        MMAnalyzer cutTextToWords = new MMAnalyzer();
        String tmpResult = cutTextToWords.segment(text, " ");
        result = tmpResult.split(" ");
        int len = result.length;
        double[] finalScores = new double[2];
        String[] name = new String[2];
        int k = 0;
        for (String fileName : tfidf.keySet()) {
            double scores = 0;
            HashMap<String, Float> perFile = tfidf.get(fileName);
            for (int i = 0; i < len; i++) {
                if (perFile.containsKey(result[i])) {
                    scores += perFile.get(result[i]);
                }
            }
            finalScores[k] = scores;
            name[k] = fileName;
            System.out.println(name[k] + " score is: " + finalScores[k]);
            k++;
        }
        if (finalScores[0] >= finalScores[1]) {
            return name[0];
        } else {
            return name[1];
        }
    }
}
One point worth stressing: my sentiment classifier has only two classes, positive and negative, so the corpus directory contains just two files, along the lines of neg.txt and pos.txt. The TF-IDF score of every word is computed for each corpus file; for a sentence to be classified, its segmented words are looked up in each class, the TF-IDF scores are summed per class, and the class with the higher total is the result.
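For example (all numbers here are made up): if an input post segments into 开心 and 极了, and 开心 has a TF-IDF of 0.8 in pos.txt but only 0.1 in neg.txt, while 极了 appears in neither file, then the positive score is 0.8 and the negative score is 0.1, so the post is labeled with pos.txt.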
Main:

package tfidf;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class Main {

    public static void main(String[] args) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(new File("h:/test.txt")));
        BufferedWriter writer = new BufferedWriter(new FileWriter(new File("h:/testResult.txt"), true));
        String line = reader.readLine();
        // Build the TF-IDF model from the labeled corpus directory.
        Map<String, HashMap<String, Float>> tfidf = ReadFiles.tfidf("F:/Porject/Hadoop project/TDIDF/TFIDF/TFIDF/dir");
        while (line != null) {
            String bestFile = Classifier.classify(line, tfidf);
            writer.write(bestFile + "\t" + line);
            writer.flush();
            writer.newLine();
            System.out.println(bestFile + "\t" + line);
            line = reader.readLine();
        }
        reader.close();
        writer.close();
        System.out.println("FINISHED!!!!");
    }
}

test.txt holds the posts to classify, and testResult.txt stores the classification results. F:/Porject/Hadoop project/TDIDF/TFIDF/TFIDF/dir is the directory holding the labeled corpus; it contains neg.txt and pos.txt, which store the already-labeled negative and positive Weibo posts respectively.
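To make the expected inputs concrete, the file layout the main method assumes looks like this (the paths are the ones hard-coded above; adjust them to your own machine):

    F:/Porject/Hadoop project/TDIDF/TFIDF/TFIDF/dir/
        neg.txt        <- labeled negative Weibo posts (each file is treated as one document)
        pos.txt        <- labeled positive Weibo posts
    h:/test.txt        <- posts to classify, one per line
    h:/testResult.txt  <- output: predicted class file name, a tab, then the original post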
After testing it myself I found the classification results are not very good. The main reasons, as I see them, are the following:
First, a more complete stop-word list would give better segmentation. Second, and more importantly, the quality of the Weibo corpus used to train the classifier directly determines how well it classifies. This follows from how Naive Bayes works: the model estimates prior probabilities from the training data, computes posterior probabilities, and classifies accordingly. These two areas are where the results can be improved.
One more note: this is an on-the-fly classifier. It computes the statistics from the corpus and then immediately uses them to classify; it does not compute them once, store them, and simply load the stored file on later runs. To avoid recomputing TF-IDF over the same corpus every time you classify, you can save the computed TF-IDF model and read it back when needed, which improves efficiency.
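A minimal sketch of that idea, using plain Java serialization of the map returned by ReadFiles.tfidf (the class name TfidfCache and any cache path are hypothetical, not part of the project):

package tfidf;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;
import java.util.Map;

public class TfidfCache {

    // Save the computed model once (HashMap and Float are both Serializable).
    public static void save(Map<String, HashMap<String, Float>> tfidf, String path) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(path))) {
            out.writeObject(tfidf);
        }
    }

    // Load the model on later runs instead of recomputing it from the corpus.
    @SuppressWarnings("unchecked")
    public static Map<String, HashMap<String, Float>> load(String path) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(path))) {
            return (Map<String, HashMap<String, Float>>) in.readObject();
        }
    }
}

Main could then try TfidfCache.load(...) first and fall back to ReadFiles.tfidf(...) followed by TfidfCache.save(...) only when the cache file does not exist yet.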
