Friday, September 4, 2015

Social media sampling and automated analytics.





So, what is a valid sample for analysis of social media posts, or of any document corpus?

How do we structure the sample? What characteristics of the population do we use to generate it? How big should a sample be? I'm not being rhetorical here, I'd really like to hear some thoughts on these questions.

The issue here is not the difficulty of getting the raw information, since thousands of social media postings can be obtained easily and very quickly. The sampling issue comes into play as a consequence of trying to analyse 100,000 comments or postings. Physically reading them all is expensive, time consuming and impractical; it is impossible to read and summarize 100,000 postings in any reasonable time. 100,000 posts would be the equivalent of a novel of 700,000 words or more, and that is a very conservative estimate. According to Amazon the median length of a novel is about 64,000 words, so 700,000 words is more than 10 novels to read and summarize.
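
Spelled out, the arithmetic is trivial; here is a quick sketch in Python (the seven-words-per-post average is my deliberately conservative assumption):

    # Rough scale check: how much reading is 100,000 posts?
    posts = 100000
    words_per_post = 7                              # assumption: a deliberately low average
    total_words = posts * words_per_post            # 700,000 words
    median_novel_words = 64000                      # Amazon's figure for a median novel
    print(total_words / float(median_novel_words))  # a little under 11 novels' worth of reading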

The problem grows when there are embedded links in the posts to websites and images; the tendrils of social media posts can be vast. Some sort of sampling therefore seems logical: take a subset of the posts, controlled in some way, and analyse the subset. A tenth of 100,000 posts is 10,000, but that is still a large number to code and analyse, and it seems a pity to ignore the other 90,000 posts.

Sampling in the survey world is a consequence of the difficulty of obtaining data. What is hardly ever done is discarding completed survey data once we have it. We analyse all completed (and sometimes incomplete) survey responses, so why not analyse all the social media postings? Unlike survey data, the problem with social media is not the acquisition of the data; it is the analysis of the data that is challenging.

I think the way forward is to look at social media as a behavioural data stream. Posting is a behaviour, and it should be analysed and quantified like any other form of behavioural data. We need a theory of why people post; it can't be random. Throwing away the vast majority of the social media posts collected simply because of a lack of analytical tools doesn't seem like a good idea to me. Automated metrics and analyses can be generated for 100,000 posts fairly easily; coding 100,000 posts is expensive and of limited value.
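
As a minimal illustration of what I mean by automated metrics, here is a sketch in Python. It assumes the posts have already been collected into a (non-empty) list of plain strings; the variable and function names are hypothetical:

    from collections import Counter
    import re

    def corpus_metrics(posts):
        """Cheap, automated summary statistics for any number of posts."""
        word_counts = Counter()
        posts_with_urls = 0
        for text in posts:
            word_counts.update(re.findall(r"[\w']+", text.lower()))
            if "http://" in text or "https://" in text:
                posts_with_urls += 1
        return {
            "posts": len(posts),
            "total_words": sum(word_counts.values()),
            "share_with_urls": posts_with_urls / float(len(posts)),
            "top_terms": word_counts.most_common(20),
        }

Numbers like these are not an interpretation, of course, but they can be produced for 100,000 posts in seconds rather than weeks.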

There is nothing magical about words. Languages are structured systems, systems of signs, and they are amenable to summary and analysis in the same way survey data is. The automated methods may be very different from the ones used for survey data, but they are available. This is not an attempt to remove human analysts from their role, though. Statistics need to be interpreted; they are just results. In the same way, automated analyses of social media data need interpretation.

I've often read about how human interpretation of social media data is the only way, applying qualitative techniques to social media as if it were some vast focus group. Writers, rightly, talk about how powerful our minds are when it comes to the interpretation of language and culture. What no one seems to mention are the limitations and biases that the human mind brings to any analysis of text. Charles Peirce, a seminal thinker in the field of semiotics, talked of the understanding or effect of the linguistic sign. He was saying that signs, the elements of language, can have different meanings and effects on the interpreter of the sign. The limits of human memory mean it's not possible to read the equivalent of "War and Peace", about 560,000 words, in a couple of days and summarize all of it. And that is a small amount of text; a corpus of 1.5 million words is entirely possible.

The more we learn about the structure of social media posts, the better we can segment and analyse them. Automated analysis of text is a great way to build a foundation for better understanding the sea of text we are now faced with. After all, it's not as if we can read it all.



Thursday, August 6, 2015

Deconstructing Twitter

Twitter is massive. There are about 350,000 tweets sent per minute, from tens of thousands of people around the world. I became interested in the underlying characteristics of Twitter after the NewMR social media study. As a consequence of taking part in that study I had a corpus of nearly 400,000 tweets available for analysis. The tweets were collected over a 24-hour period from 1394 unique users. Using this corpus I did some analysis of what underlies the structure of tweets.

To use any data source effectively, knowing the structure of that data is critical to avoiding bias and inaccuracy. No data source is pure; they all come with their biases and hidden structures.

Tweets, it seems, are mostly about URLs. In my sample 89% of all tweets had embedded URLs. Given that percentage, Twitter looks like a broadcast medium for sharing URLs.
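
That percentage is the product of nothing more sophisticated than a pattern match; something along these lines, where the list of tweet texts is a hypothetical input:

    import re

    URL_PATTERN = re.compile(r"https?://\S+")

    def share_with_urls(tweets):
        """Percentage of tweets containing at least one embedded URL."""
        return 100.0 * sum(1 for t in tweets if URL_PATTERN.search(t)) / len(tweets)

    # share_with_urls(corpus) came out at roughly 89% for this corpus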

Retweeting is not that common. Low-frequency tweeters, those who posted 10 or fewer times in the 24-hour period, had a slightly higher rate of retweeting, at 16%, than high-frequency tweeters, who retweeted 12% of the time. The idea of Twitter as a place for sharing tweets with your followers seems to be a canard.
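
For those interested in the mechanics, the split was produced along these lines: a sketch that assumes each record is a (user_id, text) pair and treats any tweet whose text starts with "RT @" as a retweet.

    from collections import Counter

    def retweet_rates(tweets, cutoff=10):
        """Compare retweet rates for low- vs high-frequency posters.

        tweets: a list of (user_id, text) pairs (hypothetical structure).
        """
        per_user = Counter(user for user, _ in tweets)
        low = [text for user, text in tweets if per_user[user] <= cutoff]
        high = [text for user, text in tweets if per_user[user] > cutoff]

        def rate(texts):
            return 100.0 * sum(t.startswith("RT @") for t in texts) / len(texts)

        return rate(low), rate(high)   # roughly 16% vs 12% in this corpus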

Roughly 10% of all tweets had links to images. I'm not sure at the moment how many of these images were of cats, I'm working on this one.

73% of tweets didn't have a hashtag. Around 10% had one hashtag, 8% had two hashtags and just under 5% had three. The most hashtags in a single tweet was 17; this is what it looked like:

1★IAM∞ADDRESSING #PETRONAS #LNG #BCPOLI #MALAYSIA #TPP #FINTECH★#iOT★#VANPOLI #INM★#FTSE∞#Grexit★#DTES #NYSE #LSE #BEIJING #OIL #MOSCOW vi…

Hashtags, at least in this corpus, are not providing much information as to the content of a tweet.
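
The hashtag figures above are also easy to automate; a sketch, again assuming a hypothetical list of tweet texts:

    import re
    from collections import Counter

    HASHTAG = re.compile(r"#\w+")

    def hashtag_distribution(tweets):
        """Return {number of hashtags: percentage of tweets} for a list of tweet texts."""
        counts = Counter(len(HASHTAG.findall(t)) for t in tweets)
        return {n: 100.0 * c / len(tweets) for n, c in sorted(counts.items())}

    # In this corpus: 0 hashtags ~73%, 1 ~10%, 2 ~8%, 3 just under 5%, maximum 17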

These users tweeted anywhere from once to 5768 times during the 24-hour period. The number of tweets that came from users who tweeted 10 times or fewer was 1381, or 0.35% of the corpus. In terms of accounts, 27% of accounts tweeted 10 times or fewer during the 24 hours. There was some indication that low-frequency posters included fewer URLs in their tweets.
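
The per-account figures are just a frequency table over user ids; a sketch, using the same hypothetical (user_id, text) structure as above:

    from collections import Counter

    def account_activity(tweets, cutoff=10):
        """Summarize how concentrated posting is across accounts."""
        per_user = Counter(user for user, _ in tweets)
        low_users = [u for u, n in per_user.items() if n <= cutoff]
        low_tweet_count = sum(per_user[u] for u in low_users)
        return {
            "busiest_account_tweets": max(per_user.values()),                      # 5768 here
            "pct_tweets_from_low_users": 100.0 * low_tweet_count / len(tweets),    # ~0.35%
            "pct_low_frequency_accounts": 100.0 * len(low_users) / len(per_user),  # ~27%
        }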

Listen Carefully

This corpus was derived from users who sent tweets containing the words "market research" in a previous study phase. In this respect it is probably biased towards financial news Twitter accounts that are constantly sending out reports on businesses and market sectors. It shows that you need to be careful about who you are listening to. Hyperactive Twitter accounts may be giving different information than quieter accounts, so how active an account is is an important metric to monitor.

Hashtags and Retweets: Not so much

Hashtags can't be relied upon to measure themes on Twitter; there are simply not enough of them produced. Even if the figures in this corpus are skewed low (which I have no reason to believe), they occur in only a small number of tweets. Retweeting doesn't seem to be a huge activity either; Twitter seems to be more broadcast than sharing.

Rise of the Robots

It's clear that in this corpus there are a lot of robots (bots) - automated tweeting systems. This is the only way users are able to post thousands of tweets per day. I suspect that because I have picked up business reporting bots in this corpus they are over-represented. However, they are a part of the Twitter landscape that has to be considered. Hyperactivity is not necessarily a bad thing, but it does mean that those tweets are not from an individual. All users are not equal.

URLs are the content?

The high incidence of URLs in this corpus points to the content of a tweet being the web page or image that is included in it. URLs can't be ignored, and the content they hold has to be captured.

The Signal and the Noise. 

So what is the meaning in a tweet? Hashtags are too sparse to be reliable markers of content. The text is obviously important, but in this corpus the incidence of a URL in a tweet is so high that the URL has to be the primary message. The next stage is to analyse the URLs, which leads to the use of automated processing. This corpus has 388,127 tweets in it. Assuming that it takes 15 seconds to manually process and digest each tweet, it would take 67 days working 24 hours a day for a human to read all the tweets, and that is without looking at the URLs. We have to sample tweets or automatically process them to extract meaning. What criteria we should use for sampling tweets is unclear at the moment. Even if we obtain a sample that is capable of being processed by humans, it's pretty clear that it will always be a relatively tiny amount and hence prone to error, as are all samples. I don't think humans or computers are the complete answer to processing large amounts of text, but at least computers can process the whole rather than a part.
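
For the record, the 67-day figure is simple arithmetic; spelled out so the assumption is explicit:

    tweets_in_corpus = 388127
    seconds_per_tweet = 15                                  # assumption: time to read and digest one tweet
    total_hours = tweets_in_corpus * seconds_per_tweet / 3600.0
    print(total_hours / 24)                                 # about 67 days of non-stop reading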

We may be starting a battle of the robots, automated tweeting versus automated analysis of tweets. 

O brave new world.



Wednesday, April 15, 2015

Buy Twitter? If you can...

Over the past few days Twitter announced that it is terminating its relationships with resellers of the main “fire hose” Twitter data stream. It’s termed a fire hose because of the sheer volume of tweets generated; there are something like 6,000 tweets generated per second by Twitter users, a lot of data by anyone’s standards. Twitter offers a variety of other ways of getting a sample of tweets through its APIs, Application Programming Interfaces that enable software such as R, Python and many other tools to access Twitter data. However, this is a sample of tweets, not a real-time stream of all tweets. You can also access Twitter from user accounts and usually find what you want, but by no means can you guarantee that you get all the tweets that you want. That’s fine though; these streams and accounts are free, and it is hard to complain about something that is free. You can read more about these facilities at https://dev.twitter.com/ .
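
To give a flavour of what the free, sampled access looks like in practice, here is a minimal sketch using the tweepy Python library's streaming interface (tweepy 3.x style; the credential strings are placeholders you would obtain from dev.twitter.com):

    import tweepy

    # Placeholder credentials from a Twitter developer account
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

    class SampleListener(tweepy.StreamListener):
        def on_status(self, status):
            # Each status is one tweet from Twitter's public sample, not the fire hose
            print(status.text)

    stream = tweepy.Stream(auth, SampleListener())
    stream.sample()   # the sampled stream; the full fire hose is a separate, paid product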

Some years ago a company called Gnip, now acquired by Twitter, was reselling fire hose access, along with Datasift (datasift.com) and NTT Data (nttdata.com), which provided Japanese-language tweets only. These companies provided a framework for access to the Twitter fire hose for a fee. Now these companies cannot resell the Twitter fire hose data. Datasift has many social media streams available, including information from Facebook, but losing Twitter access has to be a blow. A couple of weeks ago I decided to try to get access to the Twitter fire hose, related to a social media project I will be working on. I had used a free Gnip account some years ago when they first started, I now had access to the public Twitter API, and I wanted to see what the fire hose would cost. I thought it would be logical to ask Twitter, as they had acquired Gnip recently, and also to ask Datasift. I dutifully went through the sales contact process for Twitter and eventually had a call with what I thought was a sales person and his supervisor. I say “thought” because it proved very hard to get any sort of structured price out of them. The conversation with Twitter was very obtuse, and I only received a vague idea of the price. In the end I gave up asking. Oh well.

I decided to call Datasift next, and I will confess I was expecting something like the same treatment. Perhaps Twitter is very strict about who buys its data and the “Kafkaesque” lead qualification process was required by Twitter to get access to the fire hose; after all, it is Twitter’s data, and they can do as they please. Happily, Datasift proved to be extremely efficient, and I quickly got exactly the information I needed. I got a firm price for Twitter fire hose access, and I was told what other facilities I would need from Datasift and what they would cost. My sales person, James Johnson, was the epitome of helpfulness and professionalism. As it turned out the cost was too high for the current project, but within bounds if I were starting any sort of social media analysis business. Datasift do have a range of other social media sources such as Reddit, Tumblr etc. According to the CEO of Datasift, Nick Halstead, Datasift will soon be providing Facebook topic data. Datasift has an easy-to-use sign-up process for accounts, and it costs nothing to experiment a little with their data feeds. If you are interested in raw social media data I recommend Datasift.


I am truly disappointed that Twitter has seen fit to cut Datasift off from the fire hose. I’m looking forward to the Facebook topic data from Datasift; I am sure I will be told clearly the price and conditions of usage. As for Twitter, who knows? Maybe one day they will make a profit…..

Monday, March 30, 2015

Here comes the sun...time to launch a survey!



We all feel happier in the sun, well most of us anyway. And it’s no stretch to imagine that we change our behavior when we are happy. We might go outside more, we may talk more, probably sunbathe more. It also seems, according to a paper by Guéguen and Jacob [1], that we are more likely to respond to requests for an interview when it is sunny. They tested the hypothesis that respondents would be more likely to comply with a request for a face-to-face survey when it was sunny rather than when it was cloudy. They controlled for outside temperature and interviewer gender, and found that they had more completed interviews when it was sunny. They also noted that male respondents were more likely to complete surveys when the interviewer was female, but there were no other interactions. Of course this was a personal intercept situation, not a web survey. It may be that the interviewers were happier being outside in sunny weather and this made their invitations to take a survey more attractive. Either way, weather had an effect on respondents’ co-operation. It would be interesting to see the variability of responses to web surveys in relation to the weather. I can make a guess that being stuck indoors on a beautiful day may not help your recall of shampoo products used in the last few months.

The sun is a good example of an environmental influence on respondent behavior. Our genetic make-up can also influence how we behave. I found the paper by Hatemi and McDermott [2] on “The genetics of politics: discovery, challenges, and progress” utterly fascinating. Geneticists have developed analytical techniques to parse out what part of a behavior is genetically derived, environmentally derived or “uniquely” environmentally derived. It is all based on identical twins, meaning the genetic material is exactly the same in two individuals. Using some fancy statistics they can get indications as to how much a behavior may be hereditary (genetic), derived from the general experience of the person, or derived from their unique experience as an individual. Hatemi and McDermott [2] collated studies on political attitudes from twin and kinship studies over a period of some 30 years. According to them, “political knowledge and sophistication” is nearly 60% determined by genetics. On the other hand, “political party affiliation” is less than 5% determined by genetics. “Participation and voter turnout” is over 40% determined by genetics. It seems our politics grow outside the womb.

The studies Hatemi and McDermott [2] reviewed all dealt with politically oriented characteristics. A more survey-interview-oriented study that used twins, by Littvay, Popa and Fazekas [3], attempted to validate measures of survey response propensity. There is always the question of whether non-responders differ from responders on the characteristics the survey wants to measure. Non-responders represent a possible bias; they can be fundamentally different from responders. As part of their study of propensity variables, Littvay et al [3] identified, within a larger study, a number of monozygotic (identical) and dizygotic (non-identical) twins. The idea was to see if genetic variability was related to the validity of measures used for propensity scoring. An interesting fact is that twins, of any kind, have a tendency to respond to surveys more. Littvay et al [3] found a couple of interesting effects. First, non-response in a panel or follow-up situation seems to be highly heritable; that is, there is a strong genetic component to it. Second, non-response to requests for information about close friends or the respondent’s social security number is mediated by environmental rather than genetic influences.

As usual, why respondents respond or don’t respond is complicated. Some people just don’t like answering survey questions, it’s a genetic thing. It does seem that if you want to ask a respondent about their friends though, pick a sunny day…..



[1] Nicolas Guéguen, Céline Jacob. 2014. “‘Here comes the sun’: Evidence of the Effect of the Weather Conditions on Compliance to a Survey Request”. Survey Practice, Vol 7, #5.

[2] Peter K. Hatemi, Rose McDermott. 2012. “The genetics of politics: discovery, challenges, and progress”. Trends in Genetics, Vol. 28, Issue 10, p525–533.

[3] Levente Littvay, Sebastian Adrian Popa, Zoltán Fazekas. 2013. “Validity of Survey Response Propensity Indicators: A Behavior Genetics Approach”. Social Science Quarterly, Vol 94, Issue 2, p569–589.

Monday, March 9, 2015

Safe Harbor: Is it safe?

Safe Harbor is a US government program, in co-operation with the EU and Swiss governments, providing self-certification for companies concerning the security of data gathered outside of the USA but residing on servers within the USA. It tells the overseas participants, the EU and Switzerland, that the data will be kept private and secure within the USA. Norway, Iceland and Liechtenstein have also agreed to be bound by this agreement. You can find out if a company is Safe Harbor compliant on the Safe Harbor website, http://www.export.gov/safeharbor/ .

The Safe Harbor framework is vital for any company in the US that carries out data collection (data import, in Safe Harbor terms) in Europe using computer systems based in the USA. Without it, the nightmare of having to comply with 30 countries’ differing security requirements would be crippling to data collection activities.

The introduction by CASRO (casro.org) of a Safe Harbor assistance program is a tremendous help to US-based MR or survey companies that carry out research in Europe. This program makes it easier for CASRO members to become Safe Harbor certified and also provides a mediation channel for dispute resolution, a requirement for Safe Harbor compliance.

So all is right in the world. Become Safe Harbor compliant and you are now all set to collect data from Europe without violating any security requirements of European countries!

The problem is that this isn’t quite true.

There is a threat to Safe Harbor, and it raises the specter of a world without a substantial Safe Harbor system. This threat started in Düsseldorf, Germany in 2010. Germany has a federal system of regional government; each of the 16 states within the German federation has significant legal powers. In April 2010 the “Düsseldorf Circle” met. This was an informal group of data protection officials from each of the 16 states within Germany. They passed a resolution stating that they no longer accepted membership of the Safe Harbor agreement as reliable enough to allow data collection by US entities within the German states. They stated that there was a requirement for further due diligence on the part of German companies “exporting” data to the US, beyond what Safe Harbor requires. In short, German companies need to undertake their own due diligence with the US data importer, and the onus is on them to make sure they are satisfied that the US importer is secure enough.

In practice this means that when you agree a deal with a multinational European company to collect data from all their companies in Europe, you not only have to be a member of the Safe Harbor program but often also have to sign a separate agreement with the German subsidiary company because of German federal law. It also applies to global US-based companies; the German subsidiary will often require an agreement of their own. This agreement is often based on the EU directive on data storage, a sort of re-affirmation that the data will be kept safe while in the US. Sometimes the German company simply decides not to be part of the global master agreement and to use local facilities to store German data so it never crosses the shores of the USA.

So far this seems only to be happening with Germany, but it represents a crack in the Safe Harbor system. The United Kingdom has some very strict laws regarding data collection and privacy. For instance, you have to actively agree to allow websites to use cookies on your computer, and all UK websites will ask for this permission when you first visit them. Very often UK companies will require that data collected within the UK resides on servers in the UK and is not exported to the USA. This trend is becoming more common; companies want their data in their own country. It may only be a matter of time before other European countries follow Germany’s lead and require data exporters to have their own agreements, outside of Safe Harbor, with US data importers.

After the controversy surrounding the revelations by Edward Snowden concerning USA government spying, the USA is unfortunately regarded with suspicion in much of Europe when it comes to data security. Earlier last year the French and German governments held talks regarding an Internet communications system that would avoid data (mainly email) passing through the USA, to shield it from USA government spying. This shows the level of concern in Europe about USA data security. It is not in anyone’s interest to go back to having agreements with each nation within the EU concerning data exporting to the USA; it would be very time consuming and chaotic, and would only serve to stifle business for US companies that want to collect data globally.

Companies such as Amazon can provide one possible technical solution to local country storage requirements. Amazon, along with selling anything you could possibly think of, also sells cloud-computing resources via “Amazon Web Services” (AWS). AWS is also able to localize the cloud services so that your data can be in a specific place, for instance Frankfurt or Ireland. It could be a solution for US based companies gathering data but needing the data to be stored in another country. But it is by no means simple to split data storage across facilities in this way, so while it sounds like a solution, implementing it could be harder than it looks.
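
As a small, concrete illustration of the region point, keeping data in a particular location is a one-line choice when the storage is created; a sketch using the boto3 library (the bucket name is a placeholder):

    import boto3

    # Create an S3 bucket whose contents stay in the Frankfurt (eu-central-1) region
    s3 = boto3.client("s3", region_name="eu-central-1")
    s3.create_bucket(
        Bucket="example-eu-survey-data",   # placeholder name
        CreateBucketConfiguration={"LocationConstraint": "eu-central-1"},
    )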


Safe Harbor is very much in the interest of global MR client companies. It allows streamlined data collection operations from a single US source, rather than having to have data collected in many different countries individually. It makes data collection much more efficient and hence more economic, not to mention cutting down the time taken to implement data collection agreements. Safe Harbor is vital to US data collection companies and needs to be kept safe.