#写在前面的话:此教程主要是用R连接了DB2数据库,并进行文本分析,制作了词图
#教程为markdown编写
---
title: "网站留言分析"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, error = FALSE, warning = FALSE,
                      message = FALSE, comment = "")
```

```{r}
# Connect to the DB2 database and read the comment text (column 4).
library(RODBC)
options(scipen = 200)
# FIX: pass the password through the named `pwd` argument.  The original
# relied on "1" falling positionally into `pwd`, which is fragile and
# unreadable.
conn <- odbcConnect(dsn = "DB2", uid = "tjfx", pwd = "1")
myfile <- sqlQuery(conn, "select * from T_from_last_year_liuyan")
close(conn)  # release the ODBC handle once the query is done
myfile <- myfile[, 4]
# head(myfile)

# Word-segmentation libraries.
library("rJava")
library("Rwordseg")
# Word-cloud plotting libraries.
library("RColorBrewer")
library("wordcloud")

# Pre-processing: drop blank records so segmentation has real text to chew on.
myfile.res <- myfile[myfile != " "]
```

```{r, echo=FALSE, error=FALSE, warning=FALSE, message=FALSE, comment="", results='hide'}
# Load the custom user dictionary into memory.
installDict("ciku.txt", dictname = "my dict")
```

```{r}
# Inspect loaded dictionaries with listDict(); remove one with uninstallDict().
myfile.res <- as.character(myfile.res)

# Segment every record and flatten the results into one character vector.
myfile.words <- unlist(lapply(X = myfile.res, FUN = segmentCN))

# Strip URLs, newlines and spaces; add further gsub() calls here for any
# other unwanted patterns.
myfile.words <- gsub(pattern = "http:[a-zA-Z\\/\\.0-9]+", "", myfile.words)
myfile.words <- gsub("\n", "", myfile.words)
myfile.words <- gsub(" ", "", myfile.words)

# Remove stop words.  Read the list as a plain character column and filter
# with a single vectorized %in% test -- replaces the original pair of loops
# (one growing the vector with c(), one subset() per stop word).
data_stw <- read.table(file = "mystopword.txt", colClasses = "character")
stopwords_CN <- data_stw[[1]]
myfile.words <- myfile.words[!(myfile.words %in% stopwords_CN)]

# Drop single-character tokens.
myfile.words <- myfile.words[nchar(myfile.words) > 1]

# Count word frequencies, most frequent first.
myfile.freq <- sort(table(myfile.words), decreasing = TRUE)
# BUG FIX: the original passed the table object itself as `freq`, which made
# data.frame() expand it into columns `freq.Var1` / `freq.Freq`.  The later
# `myfile.freq$freq` then partial-matched ambiguously to NULL, so the
# frequency filter silently dropped every row.  Coerce the counts to integer
# so the data frame really has a `freq` column.
myfile.freq <- data.frame(word = names(myfile.freq),
                          freq = as.integer(myfile.freq),
                          stringsAsFactors = FALSE)

# Keep words that occur at least 5 times; adjust the threshold as needed.
myfile.freq2 <- subset(myfile.freq, freq >= 5)

# Draw the word cloud.
mycolors <- brewer.pal(8, "Dark2")              # colour palette
windowsFonts(myFont = windowsFont("微软雅黑"))  # font that renders Chinese
set.seed(123)                                   # reproducible layout
wordcloud(myfile.freq2$word, myfile.freq2$freq,
          random.order = FALSE, random.color = FALSE,
          colors = mycolors, family = "myFont")

# Emit the frequency table (no row names, no quoting).
write.table(myfile.freq2, row.names = FALSE, quote = FALSE)
```
#结果图:
输出的词频文档删除了行号,详见代码: