Twitter is massive: around 350,000 tweets are sent every minute by tens of thousands of people around the world. I became interested in the underlying characteristics of Twitter after the NewMR social media study. Taking part in that study left me with a corpus of nearly 400,000 tweets, collected over a 24-hour period from 1394 unique users. Using this corpus I did some analysis of what underlies the structure of tweets.
To use any data source effectively, knowing the structure of that data is critical to avoid bias and inaccuracy. No data is pure; it all comes with its own biases and hidden structures.
Tweets, it seems, are mostly about URLs. In my sample 89% of all tweets had embedded URLs. Given that percentage, Twitter looks like a broadcast medium for sharing URLs.
Retweeting is not that common. Low-frequency tweeters, those who posted 10 or fewer times in the 24-hour period, retweeted at a slightly higher rate, 16%, than high-frequency tweeters, who retweeted 12% of the time. The idea of sharing tweets with your followers seems to be a canard.
Roughly 10% of all tweets had links to images. I'm not sure yet how many of these images were of cats; I'm working on that one.
73% of tweets didn't have a hashtag. Around 10% had one hashtag, 8% had two hashtags and just under 5% had three. The largest number of hashtags in a single tweet was 17; this is what it looked like:
1★IAM∞ADDRESSING #PETRONAS #LNG #BCPOLI #MALAYSIA #TPP #FINTECH★#iOT★#VANPOLI #INM★#FTSE∞#Grexit★#DTES #NYSE #LSE #BEIJING #OIL #MOSCOW vi…
Hashtags, at least in this corpus, provide little information about the content of a tweet.
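As an illustration of how figures like these can be pulled out, here is a minimal sketch in Python. It assumes the corpus is stored as one tweet JSON object per line in the classic Twitter API format; the filename tweets.jsonl is a placeholder, and the retweet check is a simple heuristic rather than anything definitive.

# Sketch: tally URLs, hashtags and retweets in a line-delimited JSON corpus.
# Assumes each line is one tweet object with the standard entities payload.
import json
from collections import Counter

url_count = Counter()      # tweets keyed by number of embedded URLs
hashtag_count = Counter()  # tweets keyed by number of hashtags
retweets = 0
total = 0

with open("tweets.jsonl") as f:
    for line in f:
        tweet = json.loads(line)
        total += 1
        entities = tweet.get("entities", {})
        url_count[len(entities.get("urls", []))] += 1
        hashtag_count[len(entities.get("hashtags", []))] += 1
        # Heuristic: native retweets carry retweeted_status; manual ones start "RT @".
        if "retweeted_status" in tweet or tweet["text"].startswith("RT @"):
            retweets += 1

print("tweets with a URL: %.0f%%" % (100 * (total - url_count[0]) / total))
print("tweets with no hashtag: %.0f%%" % (100 * hashtag_count[0] / total))
print("retweets: %.0f%%" % (100 * retweets / total))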
The users in the corpus tweeted anywhere from once to 5768 times during the 24-hour period. Tweets from users who tweeted 10 times or fewer numbered 1381, just 0.35% of the corpus. In terms of accounts, 27% of accounts tweeted 10 times or fewer during the 24 hours. There was some indication that low-frequency posters included fewer URLs in their tweets.
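The per-account frequencies come from a similar pass over the corpus; again, a sketch under the same assumptions about the file format:

# Sketch: per-account tweet frequency over the collection window.
import json
from collections import Counter

per_user = Counter()
with open("tweets.jsonl") as f:
    for line in f:
        tweet = json.loads(line)
        per_user[tweet["user"]["screen_name"]] += 1

low = sum(1 for n in per_user.values() if n <= 10)
print("accounts posting 10 times or fewer: %.0f%%" % (100 * low / len(per_user)))
print("most active account posted %d times" % max(per_user.values()))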
Listen Carefully
This corpus was derived from users who had sent tweets containing the words "market research" in a previous study phase. In this respect it is probably biased towards financial news Twitter accounts that constantly send out reports on businesses and market sectors. It shows that you need to be careful about who you are listening to. Hyperactive Twitter accounts may be giving different information than quieter accounts. How active an account is makes an important metric to monitor.
Hashtags and Retweets: Not so much
Hashtags can't be relied upon to measure themes on Twitter; there are simply not enough of them produced. Even if the figures in this corpus are skewed low (which I have no reason to believe), hashtags occur in only a small proportion of tweets. Retweeting doesn't seem to be a huge activity either; Twitter seems to be more broadcast than sharing.
Rise of the Robots
It's clear that this corpus contains a lot of robots (bots), automated tweeting systems; that is the only way an account can post thousands of tweets per day. I suspect that because I have picked up business-reporting bots, they are over-represented in this corpus. However, they are a part of the Twitter landscape that has to be considered. Hyperactivity is not necessarily a bad thing, but it does mean the tweets are not coming from an individual. All users are not equal.
URLs are the content?
The high incidence of URLs in this corpus suggests that the real content of a tweet is the web page or image it links to. URLs can't be ignored, and the content behind them has to be captured.
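One way to start capturing that content is to follow each link (tweets usually carry shortened t.co style URLs) and keep whatever the destination page gives up, such as its title. A minimal sketch of the idea, using the third-party requests library and a placeholder URL:

# Sketch: resolve a shortened tweet URL and capture the destination page title.
import re
import requests

def capture(url):
    # Follow t.co / bit.ly style redirects to the final destination.
    resp = requests.get(url, timeout=10, allow_redirects=True)
    match = re.search(r"<title[^>]*>(.*?)</title>", resp.text, re.IGNORECASE | re.DOTALL)
    title = match.group(1).strip() if match else ""
    return resp.url, title

final_url, title = capture("https://example.com/")  # placeholder URL
print(final_url, title)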
The Signal and the Noise
So what is the meaning in a tweet? Hashtags are too sparse to be reliable markers of content. The text is obviously important, but in this corpus the incidence of URLs in tweets is so high that they have to carry the primary message. The next stage is to analyse the URLs, which leads to the use of automated processing.

This corpus contains 388,127 tweets. Assuming it takes 15 seconds to manually process and digest each tweet, it would take a human about 67 days, working 24 hours a day, to read them all, and that is without looking at the URLs. We have to sample tweets or automatically process them to extract meaning. What criteria we should use for sampling tweets is unclear at the moment. Even if we obtain a sample small enough to be processed by humans, it's pretty clear that it will always be a relatively tiny amount and hence prone to error, as all samples are. I don't think humans or computers are the complete answer to processing large amounts of text, but at least computers can process the whole rather than a part.
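For anyone who wants to check the arithmetic, the calculation is short (the 15 seconds per tweet is, of course, an assumption):

# Back-of-envelope check on the manual reading time.
tweets = 388_127
seconds_per_tweet = 15                      # assumed time to read and digest one tweet
hours = tweets * seconds_per_tweet / 3600
print("%.0f hours, or about %.0f days of round-the-clock reading" % (hours, hours / 24))
# prints: 1617 hours, or about 67 days of round-the-clock reading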
We may be starting a battle of the robots: automated tweeting versus automated analysis of tweets.
O brave new world.