Example - filtering mobile phone spam with the Naive Bayes algorithm
Step 1 - collecting data
>sms_raw<- read.csv("sms_spam.csv", stringsAsFactors=FALSE)
Step 2 - exploring and preparing the data
>str(sms_raw)
The output is shown in Figure 1.
Convert the target feature (type) to a factor:
>sms_raw$type<- factor(sms_raw$type)
>str(sms_raw$type)
Factor w/ 2 levels "ham","spam": 1 1 1 2 2 1 1 1 2 1 ...
>table(sms_raw$type)
ham spam
4812 747
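As an extra check (not in the original post), the class proportions can be computed directly; with the counts above this works out to roughly 86.6% ham and 13.4% spam:
>round(prop.table(table(sms_raw$type)) * 100, digits = 1)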
Data preparation - cleaning and standardizing text data
Use the tm (text mining) package to build a corpus from the SMS messages:
>library(tm)
>sms_corpus<- VCorpus(VectorSource(sms_raw$text))
1) Convert all text in the corpus to lowercase:
>sms_corpus_clean<- tm_map(sms_corpus, content_transformer(tolower))
2) Remove numbers:
>sms_corpus_clean<- tm_map(sms_corpus_clean, removeNumbers)
3) Remove stop words:
>sms_corpus_clean<- tm_map(sms_corpus_clean, removeWords, stopwords())
4) Remove punctuation:
>sms_corpus_clean<- tm_map(sms_corpus_clean, removePunctuation)
5) Stem words to their root form, e.g. learning, learned, and learns are all variants of learn:
>sms_corpus_clean<- tm_map(sms_corpus_clean, stemDocument)
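Note that stemDocument relies on the SnowballC package; a quick way to see what stemming does (a minimal sketch, assuming SnowballC is installed) is:
>library(SnowballC)
>wordStem(c("learn", "learned", "learning", "learns"))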
6) Strip extra whitespace:
>sms_corpus_clean<- tm_map(sms_corpus_clean, stripWhitespace)
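To confirm the cleaning worked as intended (an optional check, not in the original post), you can compare a few messages before and after:
>lapply(sms_corpus[1:3], as.character)
>lapply(sms_corpus_clean[1:3], as.character)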
Data preparation - splitting text documents into words (building the document-term matrix)
>sms_dtm<- DocumentTermMatrix(sms_corpus_clean)
If you have not already applied the cleaning steps above, you can perform the same transformations while building the document-term matrix, with the same effect:
>sms_dtm<- DocumentTermMatrix(sms_corpus, control = list(tolower = TRUE, removeNumbers = TRUE, stopwords = TRUE, removePunctuation = TRUE, stemming = TRUE))
Data preparation - creating training and test datasets
>sms_dtm_train<- sms_dtm[1:4169,]
>sms_dtm_test<- sms_dtm[4170:5559,]
>sms_train_labels<- sms_raw[1:4169,]$type
>sms_test_labels<- sms_raw[4170:5559,]$type
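To verify that the split preserved the spam/ham proportions (an extra check, not in the original post):
>prop.table(table(sms_train_labels))
>prop.table(table(sms_test_labels))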
Visualizing text data - word clouds
>library(wordcloud)
>wordcloud(sms_corpus_clean, min.freq = 50, random.order = FALSE)
The output is shown in Figure 2.
You can also visualize the spam and ham messages separately:
>spam<- subset(sms_raw, type == "spam")
>ham<- subset(sms_raw, type == "ham")
>wordcloud(spam$text, max.words = 40, scale = c(3, 0.5))
>wordcloud(ham$text, max.words = 40,scale = c(3, 0.5))
The outputs are shown in Figures 3 and 4.
Data preparation - creating indicator features for frequent words (use the most frequent words as classification features)
>sms_freq_words<- findFreqTerms(sms_dtm_train, 5)
>sms_dtm_freq_train<- sms_dtm_train[, sms_freq_words]
>sms_dtm_freq_test<- sms_dtm_test[, sms_freq_words]
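findFreqTerms() returns a character vector of the terms appearing at least 5 times; a quick look at how many terms survived the cutoff (an optional check, not in the original post):
>str(sms_freq_words)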
Define a function that converts word counts to "Yes"/"No" indicators, then apply it to each column (MARGIN = 2) of the training and test matrices:
>convert_counts<- function(x){
x<- ifelse(x > 0, "Yes", "No")
}
>sms_train<- apply(sms_dtm_freq_train, MARGIN = 2, convert_counts)
>sms_test<- apply(sms_dtm_freq_test, MARGIN = 2, convert_counts)
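The result is a character matrix of "Yes"/"No" indicators; you can peek at a corner of it to confirm the conversion (hypothetical indices, just for illustration):
>sms_train[1:4, 1:5]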
Step 3 - training a model on the data
>library(e1071)
>sms_classifier<- naiveBayes(sms_train, sms_train_labels)
Step 4 - evaluating model performance
>library(gmodels)
>sms_test_pred<- predict(sms_classifier, sms_test)
>CrossTable(sms_test_pred, sms_test_labels, prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE, dnn = c('predicted', 'actual'))
The output is shown in Figure 5.
From Figure 5, 6 + 30 = 36 of the 1,390 test messages are misclassified, an error rate of about 2.6%: 6 ham messages were labeled as spam and 30 spam messages were labeled as ham.
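The same error rate can also be computed directly from the predictions (a one-line sketch, not in the original post):
>1 - mean(sms_test_pred == sms_test_labels)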
Step 5 - improving model performance
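A common refinement at this step (a sketch, assuming the standard e1071 naiveBayes API; not part of the original post) is to retrain with Laplace smoothing so that words seen in only one class do not force zero probabilities:
>sms_classifier2<- naiveBayes(sms_train, sms_train_labels, laplace = 1)
>sms_test_pred2<- predict(sms_classifier2, sms_test)
>CrossTable(sms_test_pred2, sms_test_labels, prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE, dnn = c('predicted', 'actual'))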