---
title: "Exam R"
author: "Camilla Andersen"
date: "1/3/2020"
output:
  pdf_document: default
  html_document: default
---
### Working directory and data import
```{r}
setwd("~/Desktop/NLP/LDA")
dTroll=read.csv("fulldata.csv", sep = ",")
dNonTroll=read.csv("NotTrolls.csv", sep = ",")
```
### Labels and subsetting
```{r}
#Add labels to the troll (1) and non-troll (0) data
dTroll["label"] <- 1
dNonTroll["label"] <- 0
#Rename column X0 to text
names(dNonTroll)[names(dNonTroll) == "X0"] <- "text"
#Keep only the text and label columns
dNonTroll1 <- subset(dNonTroll, select = c("text", "label"))
dTroll1 <- subset(dTroll, select = c("text", "label"))
#Bind the two datasets together (dTroll1 and dNonTroll1 must exist first)
data = rbind(dTroll1,dNonTroll1)
#Read the cleaned, combined data used in the chunks below
AllData=read.csv("AllDataCleaned.csv", sep = ",")
```
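`rbind()` on data frames matches columns by name, which is why the `X0` column has to be renamed to `text` before the two sets are stacked. The chunk below is a minimal sketch of that step on invented rows; `toy_a`, `toy_b` and `toy_all` are hypothetical names and are not part of the analysis data.

```{r}
# Invented example rows (assumption: stand-ins for the troll and non-troll files)
toy_a <- data.frame(text = c("first tweet", "second tweet"), stringsAsFactors = FALSE)
toy_b <- data.frame(X0 = c("third tweet"), stringsAsFactors = FALSE)

# Label the two sources: 1 for the first set, 0 for the second
toy_a["label"] <- 1
toy_b["label"] <- 0

# Rename X0 so both data frames share the same column names
names(toy_b)[names(toy_b) == "X0"] <- "text"

# Stack the two sets, keeping only the shared columns
toy_all <- rbind(toy_a[, c("text", "label")], toy_b[, c("text", "label")])
toy_all
```

Without the rename, `rbind()` would stop with an error because the column names of the two data frames do not match.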
### Split data according to label
```{r}
#Split data into the two groups
AllDataTroll=subset(AllData, label == 1)
AllDataNotTroll=subset(AllData, label == 0)
```
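After a split like this, it can be worth confirming that every row ended up in exactly one group, since `subset()` silently drops rows whose condition evaluates to `NA`. A minimal sketch on a toy data frame follows; `toy`, `toy_troll` and `toy_nontroll` are made-up names and values used only for illustration.

```{r}
# Toy data frame standing in for AllData (values are invented)
toy <- data.frame(
  text = c("a", "b", "c", "d", "e"),
  label = c(1, 0, 1, 0, 0)
)

# Same split as above, applied to the toy data
toy_troll <- subset(toy, label == 1)
toy_nontroll <- subset(toy, label == 0)

# Group sizes, and a check that the two subsets together cover all rows
table(toy$label)
stopifnot(nrow(toy_troll) + nrow(toy_nontroll) == nrow(toy))
```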
### t-tests and effect size r
```{r}
#Welch two-sample t-test (R's default) on the number of characters per tweet
#The two groups are independent, so the test is unpaired
Model= t.test(AllDataTroll$tweet_length, AllDataNotTroll$tweet_length)
Model
t = Model$statistic[[1]]
t
df = Model$parameter[[1]]
df
r = sqrt(t^2/(t^2+df))
r
#Welch two-sample t-test (R's default) on the number of words per tweet
#The two groups are independent, so the test is unpaired
Model1= t.test(AllDataTroll$tweet_word_count, AllDataNotTroll$tweet_word_count)
Model1
t1 = Model1$statistic[[1]]
t1
df1 = Model1$parameter[[1]]
df1
r1 = sqrt(t1^2/(t1^2+df1))
r1
```
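The conversion from the t statistic to the effect size, r = sqrt(t^2 / (t^2 + df)), is repeated for both tests and could be wrapped in a small helper. The chunk below is only a sketch: `effect_size_r` is a hypothetical helper name, and `sim_troll` and `sim_nontroll` are simulated vectors, not the tweet data.

```{r}
# Hypothetical helper (not part of the original analysis):
# converts a t.test result into the effect size r = sqrt(t^2 / (t^2 + df))
effect_size_r <- function(test) {
  t_val <- unname(test$statistic)
  df_val <- unname(test$parameter)
  sqrt(t_val^2 / (t_val^2 + df_val))
}

# Simulated data standing in for the two groups (assumption, illustration only)
set.seed(1)
sim_troll <- rnorm(200, mean = 16, sd = 5)
sim_nontroll <- rnorm(250, mean = 14, sd = 5)

# Two-sample t-test with default settings (Welch, unequal variances)
sim_test <- t.test(sim_troll, sim_nontroll)
sim_r <- effect_size_r(sim_test)
sim_r
```

The helper takes the whole test object, so the statistic and the degrees of freedom always come from the same test.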
### Boxplots
```{r}
#Color palette; col=c(2,4) below picks its 2nd and 4th entries
palette(c("red", "#4682B4", "#00008B", "darkgreen"))
#Plots (label 0 = NotTrolls, label 1 = Trolls):
plot1=boxplot(tweet_word_count ~ label,
              data = AllData,
              names=c("NotTrolls","Trolls"),
              ylab="Number of words per tweet",
              col=c(2,4))
plot2=boxplot(tweet_length ~ label,
              data = AllData,
              names=c("NotTrolls","Trolls"),
              ylab="Number of characters per tweet",
              col=c(2,4))
```
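`boxplot()` returns the statistics it draws, so objects such as `plot1` and `plot2` can also be inspected numerically. The sketch below shows this on invented values; `toy_len` and `bp` are hypothetical names, and `plot = FALSE` is used so that only the statistics are computed.

```{r}
# Invented word counts for two groups of four tweets each (illustration only)
toy_len <- data.frame(
  label = rep(c(0, 1), each = 4),
  tweet_word_count = c(10, 12, 14, 16, 18, 20, 22, 24)
)

# Compute the boxplot statistics without drawing the plot
bp <- boxplot(tweet_word_count ~ label, data = toy_len, plot = FALSE)

# One column per group: lower whisker, lower hinge, median, upper hinge, upper whisker
bp$stats
# Number of observations per group and the group names
bp$n
bp$names
```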