Tweeather

Tweeather is a machine learning project that correlates Twitter sentiment with European weather.

I was inspired by a study that used user behaviour on Twitter to build a predictive model of income: "Studying User Income through Language, Behaviour and Affect in Social Media". I decided it was the perfect opportunity to venture into the world of Big Data, so I learned Spark and Hadoop.

Requirements
Setup
Scripts
Conclusion
Tips
Suggestions
Downloads

Requirements

You need these tools to be able to run the project:

Java 1.7+
scala-sbt

And these clusters:

Apache Spark 1.6 (required for everything)
HDFS (required only for multi-machine Spark clusters)

And the following hardware:

one or more machines
- preferably a single, powerful one so you don't need Hadoop
at least 16GB of RAM
- it should work with less, but make sure you edit the executorHighMem size in SparkSubmit.scala – see Setup
at least 8 logical cores
- it should work with fewer, but the more you have, the faster the scripts will run
- to run the Collector scripts, you need to have at least <number of Twitter apps> + 1 cores

Setup

Configuration

The project has 2 configuration files:

src/main/resources/com/aluxian/tweeather/res/twitter.properties
- used to provide Twitter credentials for the Collector scripts
src/main/resources/com/aluxian/tweeather/res/log4j.properties
- used to configure logging

Make sure you copy and configure these before running the scripts:

$ cp src/main/resources/com/aluxian/tweeather/res/log4j-template.properties \
    src/main/resources/com/aluxian/tweeather/res/log4j.properties
$ cp src/main/resources/com/aluxian/tweeather/res/twitter-template.properties \
    src/main/resources/com/aluxian/tweeather/res/twitter.properties

Env vars

Make sure you set these up before running a script:

TW_SPARK_MASTER
- the URL of the Spark master to connect to
- if not specified, scripts will run on local[*]
- by default, scripts will run in client deploy mode
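
The fallback described above can be sketched as follows. This is an illustration in plain Python, not code from the project (which is written in Scala); the function name is hypothetical, and treating an empty value as unset is an assumption.

```python
import os

def spark_master(env=os.environ):
    """Return the Spark master URL, falling back to local[*]."""
    # TW_SPARK_MASTER is the variable documented above.
    # Assumption: an empty value is treated the same as an unset one.
    return env.get("TW_SPARK_MASTER") or "local[*]"

# A plain dict stands in for the process environment here.
print(spark_master({"TW_SPARK_MASTER": "spark://host:7077"}))
print(spark_master({}))
```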

System properties

Tweeather supports the following custom system properties:

tw.streaming.timeout
- the period, in seconds, after which streaming should stop
- default: unlimited
tw.streaming.interval
- the duration, in seconds, for each streaming batch
- default: 300 (5 minutes)

Submit tasks

If your Spark cluster doesn't have at least 14GB of RAM, edit executorHighMem in project/SparkSubmit.scala.

Scripts

The project has 3 sets of scripts.

1. Sentiment140 scripts

I used these scripts to train a naive Bayes sentiment analyser with the Sentiment140 dataset. Nothing fancy here. The resulting model has an accuracy of 80%.
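
For readers unfamiliar with naive Bayes, here is a minimal multinomial sketch in plain Python. It is not the project's code (the project is written in Scala and runs on Spark); the function names, the smoothing value and the toy training data are all illustrative assumptions.

```python
import math
from collections import Counter

def train_nb(docs, alpha=1.0):
    """docs: list of (tokens, label) pairs, label 0.0 (negative) or 1.0 (positive)."""
    label_counts = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in label_counts}
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return label_counts, word_counts, vocab, alpha

def predict_positive(model, tokens):
    """Return the estimated probability that the tokens are positive."""
    label_counts, word_counts, vocab, alpha = model
    total_docs = sum(label_counts.values())
    log_scores = {}
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # class prior
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                count = word_counts[label][w]
                # Laplace-smoothed word likelihood
                score += math.log((count + alpha) / (total_words + alpha * len(vocab)))
        log_scores[label] = score
    # Normalise in log space to avoid underflow on long inputs.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    return weights.get(1.0, 0.0) / sum(weights.values())

# Toy data, made up for illustration only.
toy = [
    (["i", "love", "this"], 1.0),
    (["great", "day"], 1.0),
    (["i", "hate", "this"], 0.0),
    (["awful", "day"], 0.0),
]
model = train_nb(toy)
print(predict_positive(model, ["love", "great"]))
```

The returned value plays the same role as the predicted column in the example table further down: values near 1 lean positive, values near 0 lean negative.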

Processing

The same processing steps were taken as for the Emo scripts.

Running

To run the experiment:

# Download the datasets
$ sbt submit-Sentiment140Downloader

# Parse the datasets
$ sbt submit-Sentiment140Parser

# Train the sentiment analyser
$ sbt submit-Sentiment140Trainer

# Test the analyser
$ sbt submit-Sentiment140Repl

Example

Here are some example predictions on the Sentiment140 test dataset:

|actual|predicted         |raw_text                                                                                                                                 |
|1.0   |0.9906351122104053|Obama's got JOKES!! haha just got to watch a bit of his after dinner speech from last night... i'm in love with mr. president ;)         |
|0.0   |0.3665541605686164|LEbron james got in a car accident i guess..just heard it on evening news...wow i cant believe it..will he be ok ? http://twtad.com/69750|
|1.0   |0.4466193778555213|is it me or is this the best the playoffs have been in years oh yea lebron and melo in the finals                                        |
|1.0   |0.5779917305269712|@khalid0456 No, Lebron is the best                                                                                                       |
|1.0   |0.7991567513906502|@the_real_usher LeBron is cool.  I like his personality...he has good character.                                                         |
|1.0   |0.5027089309407685|Watching Lebron highlights. Damn that niggas good                                                                                        |
|1.0   |0.1677252771491839|@Lou911 Lebron is MURDERING shit.                                                                                                        |
|1.0   |0.2183415128849378|@uscsports21 LeBron is a monsta and he is only 24. SMH The world ain't ready.                                                            |
|1.0   |0.9184651111073650|@cthagod when Lebron is done in the NBA he will probably be greater than Kobe. Like u said Kobe is good but there alot of 'good' players.|
|1.0   |0.6414672757144448|KOBE IS GOOD BT LEBRON HAS MY VOTE                                                                                                       |
|0.0   |0.7777182849481007|Kobe is the best in the world not lebron .                                                                                               |
|1.0   |0.5581821154365963|@asherroth World Cup 2010 Access?? Damn, that's a good look!                                                                             |

You can download a file with more examples from the downloads section.

2. Emo scripts

I used these scripts to train a naive Bayes sentiment analyser with tweets I collected myself. The resulting model had an accuracy of 75% and is available for download in the downloads section.

Collection

To collect the tweets, I used Twitter's Streaming API with multiple apps configured. Throughput averaged 325 tweets/sec, and I collected over 100M tweets in 4 days. However, after removing all the duplicates, only 8.4M remained.

The stream of tweets I received from the Twitter API was filtered by emoji characters. Tweets that contained positive emojis were classified as positive, and tweets that contained negative emojis were classified as negative. Tweets that contained both types of emojis were excluded.
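
The labelling rule can be sketched as below. This is plain Python for illustration, not the project's Scala code. The two emoji sets are hypothetical stand-ins (the actual sets used by the collector are not listed here), and the 1.0/0.0 label values are an assumption.

```python
# Hypothetical emoji sets, chosen only for this example.
POSITIVE = {"\U0001F600", "\U0001F60A"}  # grinning face, smiling face with smiling eyes
NEGATIVE = {"\U0001F622", "\U0001F620"}  # crying face, angry face

def label_tweet(text):
    """Return 1.0 (positive), 0.0 (negative), or None when the tweet is unusable."""
    has_pos = any(ch in POSITIVE for ch in text)
    has_neg = any(ch in NEGATIVE for ch in text)
    if has_pos == has_neg:
        # Either no emoji from the sets, or both kinds at once: excluded.
        return None
    return 1.0 if has_pos else 0.0

print(label_tweet("good morning \U0001F600"))
print(label_tweet("mixed feelings \U0001F600\U0001F622"))
```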

This method allowed me to gather a fairly large dataset of labelled tweets, while the accuracy of the model didn't seem to suffer.

Processing

Before training the analyser, the tweets were pre-processed:

the feature space was reduced
- urls were replaced with URL
- @username mentions were replaced with USERNAME
- repeated letters were replaced with just 2 occurrences of the letter
text was sanitized
- punctuation was removed
- multiple white spaces were replaced with just one
stop words like "not", "is", "less", and "or" were removed
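
The steps above can be sketched with a few regular expressions. This is an illustrative Python version, not the project's Scala implementation; the exact patterns, the order of the steps and the case-insensitive stop word match are assumptions, and the stop word set holds only the four examples named above.

```python
import re

# Only the four examples given above; the full list is not reproduced here.
STOP_WORDS = {"not", "is", "less", "or"}

def preprocess(text):
    """Return the cleaned tokens of a tweet."""
    text = re.sub(r"https?://\S+", "URL", text)        # urls -> URL
    text = re.sub(r"@\w+", "USERNAME", text)           # @mentions -> USERNAME
    text = re.sub(r"([A-Za-z])\1{2,}", r"\1\1", text)  # runs of 3+ letters -> 2
    text = re.sub(r"[^\w\s]", "", text)                # remove punctuation
    text = re.sub(r"\s+", " ", text).strip()           # collapse white space
    return [t for t in text.split(" ") if t and t.lower() not in STOP_WORDS]

print(preprocess("@khalid0456 No, Lebron is the best"))
print(preprocess("sooooo   good!!! http://t.co/abc"))
```

URLs and mentions are replaced before punctuation is stripped; doing it the other way round would destroy the `://` and `@` markers the first two patterns rely on.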

Training

I used 90% of the tweets for training and the remaining 10% for testing.
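
A 90/10 split can be sketched as follows. This is plain Python for illustration, not the project's Scala code; the shuffle and the fixed seed are assumptions made so that the example is reproducible.

```python
import random

def split_dataset(rows, train_fraction=0.9, seed=42):
    """Shuffle rows and split them into (training, testing) lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded so the split is repeatable
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

train, test = split_dataset(range(100))
print(len(train), len(test))
```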

Running

To run the experiment:

# Collect tweets; leave this running for a few hours
$ sbt submit-TwitterHoseEmoCollector

# Parse the collected tweets
$ sbt submit-TwitterHoseEmoParser

# Train the sentiment analyser
$ sbt submit-TwitterHoseEmoTrainer

# Test the analyser
$ sbt submit-TwitterHoseEmoRepl

Screenshot

Here's a screenshot of my collector running for almost 4 days.

[Screenshot: Emo Collector]

Example

Here's an example of some tweets and their predicted polarity:

|lat       |lon       |polarity          |raw_text                                                                                                                                                                                                                                                                      |
|33.8733655|35.8495145|0.5220677359995693|@onikashabibi IM HOWLING                                                                                                                                                                                                                                      |
|33.8733655|35.8495145|0.7705892813840320|am i the only one attracted to Hyde                                                                                                                                                                                                                   |
|33.8733655|35.8495145|0.6705864102035392|nicki and gaga better release a track one day or imma cut a bitch                                                                                                                                                     |
|33.8733655|35.8495145|0.0851292224547309|@elissamk_ yeah I don't know.  Maybe schedule or major conflict                                                                                                                                                           |
|33.8733655|35.8495145|0.4380899284568244|@TWlSTEDFANTASY by the show? it had worse seasons.                                                                                                                                                                                    |
|33.8733655|35.8495145|0.8572554520257938|@ElieRustom well deserved.                                                                                                                                                                                                                                    |
|33.8733655|35.8495145|0.0683891442591605|Wanted to wake up at 8 am for a morning jog but here I am at 3:37 am scanning Twitter for what I've missed                                                                    |
|33.8733655|35.8495145|0.5289639040745624|@ayaalhakim_ @_NiZS lol same  between its potential and reality. I think its almost inescapably useless. No matter how relevant the content.|
|33.8733655|35.8495145|0.0462046697117165|You get a temporary high as you watch life pass you by Every single day you want to cry Can we wish the tears a fond goodbye #TroubledSoul    |
|33.8733655|35.8495145|0.9536491693812428|I love Yoda so much                                                                                                                                                                                                                                                   |
|33.8733655|35.8495145|0.9381788967926521|The world is completely fucked  99%  completely fucked..But what are we without our fantasies of causes  heroes  and grand battles.                   |
|33.8733655|35.8495145|0.6512862750937641|@toogucciforyou someone's keeping me up ??                                                                                                                                                                                                    |
|33.8733655|35.8495145|0.8849331055255223|@toogucciforyou Aflanne that's why ma bjarreb ektob Swedish?                                                                                                                                                              |
|33.8733655|35.8495145|0.8364178901299960|Creativity is where you find who you are!                                                                                                                                                                                                     |
|33.8733655|35.8495145|0.6532908315602334|@KrewdPoet all the better id say                                                                                                                                                                                                                      |
|33.8733655|35.8495145|0.4084283943748655|Putin Lists U.S. As One Of The Threats To Russia's National Security https://t.co/jd93Drk4N2 https://t.co/l7LKUN4A5C                                              |
|33.8733655|35.8495145|0.7771673205629203|A Clip on projector  WiGig and LTE all into a single lightweight Lenovo ThinkPad X1 Tablet. WoW! @lenovo  https://t.co/A32Q9RSl6A via @CNET   |
|33.8733655|35.8495145|0.9596987698902406|@Gulan_A thank you. I wish you all the best too                                                                                                                                                                                           |
|33.8733655|35.8495145|0.2695436743946154|AFP: Saudi police shot at in home village of executed cleric                                                                                                                                                              |
|33.8733655|35.8495145|0.2694895125816703|Which Countries Are the Most Expensive for Tourists? https://t.co/hgzR6kPIMH https://t.co/lQJHkFyq24                                                                              |
|33.8733655|35.8495145|0.8311372110468187|Good Morning !????                                                                                                                                                                                                                                                    |
|33.8733655|35.8495145|0.1222478140991693|How do some not even feel the tiniest bit sorry for what they put others through? Curious                                                                                                     |
|33.8733655|35.8495145|0.7039707972753375|@JoumanaGebara @JosephLF @alhayat_ksa joke of the day...                                                                                                                                                                      |
|33.8733655|35.8495145|0.0709410869696622|With laws preventing me from smoking while there are kids in the car  this will be me as a parent https://t.co/wwa2XiOaVW                                     |
|33.8733655|35.8495145|0.1955959507977803|ok im gonna go back to sleep now                                                                                                                                                                                                                      |

You can download the file with 1000 complete rows from the downloads section.

Sentiment across Europe

I collected tweets geo-localised in Europe created between 2015-12-26 and 2016-12-04. I ran them through the sentiment analyser, and this is the result:

[Image: Happiness Levels in Europe]

The change in the number of data points seems to depend more on the time of day than on weather conditions. To draw a meaningful conclusion about the correlation between weather conditions and sentiment, a larger dataset of tweets is required, spread across more than just a week.

3. Fire scripts

I used these scripts to train an artificial neural network that predicted the sentiment polarity given 3 weather variables: temperature, pressure and humidity.

Collection

Tweets were collected using Twitter's Streaming API, filtered by location (Europe) and language (English).

Processing

After they were collected, tweets were run through the sentiment analyser to get their polarity. The parser script used a NOAA-provided weather dataset to extract the temperature, pressure and humidity for each tweet's location.
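
One simple way to attach weather readings to a tweet is a nearest grid point lookup, sketched below in plain Python. This is not how the project's parser is known to work: the grid structure, the function name and the sample values are hypothetical, and the distance measure ignores the Earth's curvature.

```python
def nearest_reading(grid, lat, lon):
    """grid maps (lat, lon) points to (temperature_K, pressure_Pa, humidity_pct)."""
    # Squared distance in degrees: a simplification, adequate only for a dense grid.
    key = min(grid, key=lambda p: (p[0] - lat) ** 2 + (p[1] - lon) ** 2)
    return grid[key]

# Made-up readings on a 0.5 degree grid, for illustration only.
grid = {
    (48.5, 2.0): (278.0, 101300.0, 80.0),
    (48.5, 2.5): (279.0, 101200.0, 75.0),
    (49.0, 2.5): (277.0, 101100.0, 85.0),
}
print(nearest_reading(grid, 48.86, 2.35))
```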

Training

After processing the tweets, I used them to train a multilayer perceptron. The 3 weather variables were the input nodes and the polarity was the output node. 90% of the dataset was used for training and the remaining 10% for testing.
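
The shape of such a network can be sketched as a forward pass in plain Python. Only the 3 inputs and the single output come from the description above; the single hidden layer, its size, the sigmoid activation and the weights are assumptions made for this example, and the project itself is written in Scala.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """Forward pass: 3 weather inputs -> hidden layer -> 1 polarity output."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_w, hidden_b)
    ]
    return sigmoid(sum(w * h for w, h in zip(out_w, hidden)) + out_b)

# Inputs are assumed to be scaled to [0, 1]: temperature, pressure, humidity.
# The weights are arbitrary placeholders, not trained values.
polarity = forward(
    [0.4, 0.7, 0.8],
    hidden_w=[[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]],
    hidden_b=[0.0, 0.1],
    out_w=[1.2, -0.7],
    out_b=0.05,
)
print(polarity)
```

Training would adjust the weights to reduce the error between this output and the polarity produced by the sentiment analyser.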

Running

To run the experiment:

# Collect tweets; leave this running for a few hours
$ sbt submit-TwitterHoseFireCollector

# Parse the collected tweets
# This parser uses the "emo" sentiment analyser, make sure
#   you've trained it first or edit the script to use the other model
$ sbt submit-TwitterHoseFireParser

# Train the neural network
$ sbt submit-TwitterHoseFireTrainer

# Test the network
$ sbt submit-TwitterHoseFireRepl

Screenshot

Here's a screenshot of my collector running for almost 9 days.

[Screenshot: Fire Collector]

Conclusion

TODO

Tips

Keep these in mind:

use a powerful machine (otherwise, a cluster might be required)
use the Spark UI to watch your script's progress on http://localhost:4040
you can download my trained models from the downloads section; just place them where the trainers would save them (see the scripts' source code) and you're good to go

Suggestions

A few suggestions to improve the project:

The sentiment analyser doesn't recognise negated words. Sentences like "i am not happy" are incorrectly classified as positive. Use a POS-tagger to merge words with their negations before using them for training (e.g. that sentence would become "i am not_happy")
Use fuzzy matching with an English dictionary to correct spelling mistakes and further reduce the feature space
Other suggestions are welcome!
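
The first two suggestions can be sketched in plain Python as a starting point. This is a simplification, not a finished design: the negation merge below uses a fixed word list instead of a POS-tagger, and the spelling correction uses the standard library's `difflib` with a tiny hypothetical dictionary and an arbitrary cutoff.

```python
import difflib

# Hypothetical negation list; a POS-tagger would be more robust.
NEGATIONS = {"not", "no", "never"}

def merge_negations(tokens):
    """Join each negation with the word that follows it."""
    merged, i = [], 0
    while i < len(tokens):
        if tokens[i] in NEGATIONS and i + 1 < len(tokens):
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def correct_spelling(token, dictionary, cutoff=0.8):
    """Return the closest dictionary word, or the token itself if none is close."""
    matches = difflib.get_close_matches(token, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else token

print(merge_negations("i am not happy".split()))
print(correct_spelling("hapy", ["happy", "sad", "weather"]))
```

Since "not" is currently removed as a stop word, the merge would have to run before stop word removal for this to have any effect.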

Downloads

I uploaded some files from my project to the releases page:

happiness.csv – a csv with lat, lon and polarity extracted from tweets
weather.csv – a csv with lat, lon, created_at, temperature (K), pressure (Pa) and humidity (%) extracted from tweets and their locations' forecast
model-sentiment-140.tar.gz – my Sentiment140 sentiment analyser model
model-emo-140.tar.gz – my emo sentiment analyser model
examples-sentiment-140.txt – prediction examples done with the Sentiment140 analyser
examples-emo-140.txt – prediction examples done with the emo analyser
