Friday, September 4, 2015

Social media sampling and automated analytics.

So, what is a valid sample for analysis of social media posts, or of any document corpus?

How do we structure the sample? Which characteristics of the population do we use to generate the sample? How big should a sample be? I'm not being rhetorical here; I'd really like to hear some thoughts on these questions.

The issue here is not the difficulty of getting the raw information, since thousands of social media postings can be obtained easily and very quickly. The sampling issue comes into play as a consequence of trying to analyse 100,000 comments or postings. Physically reading them all is expensive, time-consuming and impractical; it is impossible to read and summarize 100,000 postings in any reasonable time. 100,000 posts would be the equivalent of a novel of 700,000 words or more, and that is a very conservative estimate. According to Amazon, the median length of a novel is about 64,000 words, so 700,000 words is more than 10 novels to read and summarize.
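The arithmetic behind that estimate can be checked in a few lines. This is only a back-of-envelope sketch: the figure of seven words per post is an assumption implied by dividing 700,000 words by 100,000 posts, and the variable names are mine.

```python
# Back-of-envelope reading load for a corpus of social media posts.
POSTS = 100_000
WORDS_PER_POST = 7            # assumption: implied by 700,000 words / 100,000 posts
MEDIAN_NOVEL_WORDS = 64_000   # median novel length quoted above

total_words = POSTS * WORDS_PER_POST
novels = total_words / MEDIAN_NOVEL_WORDS

print(f"{total_words:,} words is roughly {novels:.1f} median-length novels")
# prints: 700,000 words is roughly 10.9 median-length novels
```

Real posts are often longer than seven words, which is why the estimate is a conservative one.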

The problem grows when there are embedded links in the posts to websites and images. The tendrils of social media posts can be vast. Therefore some sort of sampling seems logical: take a subset of the posts, controlled in some way, and analyse the subset. A tenth of 100,000 posts is 10,000, but that is still a large number to code and analyse. It also seems a pity to ignore the 90,000 other posts.

Sampling in the survey world is a consequence of the difficulty of obtaining data. What is hardly ever done is discarding completed survey data once we have it. We analyse all completed (and sometimes incomplete) survey responses, so why not analyse all the social media postings? Unlike survey data, the problem with social media is not the acquisition of the data; it is the analysis of the data that is challenging.

I think the way forward is to look at social media as a behavioural data stream. Posting is a behaviour, and it should be analysed and quantified like any other form of behavioural data. We need a theory of why people post; it can't be random. Throwing away the vast majority of the social media posts collected, simply because analytical tools are not being used, doesn't seem like a good idea to me. Automated metrics and analyses can be generated for 100,000 posts fairly easily, whereas manually coding 100,000 posts is expensive and of limited value.
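To give a feel for how cheap basic automated metrics are, here is a minimal sketch in Python using only the standard library. The function name `summarize_posts`, the choice of metrics and the three sample posts are all illustrative assumptions of mine, not a recommended method; a real analysis would need proper tokenization, stop-word handling and much more.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")   # crude link detector (assumption)
WORD_RE = re.compile(r"[a-z']+")       # crude English word tokenizer (assumption)

def summarize_posts(posts, top_n=5):
    """Return simple corpus-level metrics for a list of post strings."""
    word_counts = Counter()
    total_words = 0
    posts_with_links = 0
    for post in posts:
        if URL_RE.search(post):
            posts_with_links += 1
        # Strip links before counting words so URLs do not pollute the counts.
        words = WORD_RE.findall(URL_RE.sub(" ", post).lower())
        total_words += len(words)
        word_counts.update(words)
    n = len(posts)
    return {
        "posts": n,
        "total_words": total_words,
        "mean_words_per_post": total_words / n if n else 0.0,
        "posts_with_links": posts_with_links,
        "top_words": word_counts.most_common(top_n),
    }

# Made-up posts, purely for illustration.
sample = [
    "Loving the new phone, battery lasts all day",
    "Battery died again http://example.com/review",
    "New phone, same old battery problems",
]
summary = summarize_posts(sample, top_n=1)
print(summary)
```

The same loop runs over 100,000 posts without any change, and the numbers it produces still need a human to interpret them.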

There is nothing magical about words. Languages are structured systems of signs, and they are amenable to summary and analysis in the same way as survey data. The automated methods may be very different from the ones used for survey data, but they are available. This is not an attempt to remove human analysts from their role, though. Statistics are just results; they need to be interpreted. In the same way, automated analyses of social media data need interpretation.

I've often read that human interpretation of social media data is the only way, applying qualitative techniques to social media as if it were some vast focus group. Writers, rightly, talk about how powerful our minds are when it comes to the interpretation of language and culture. What no one seems to mention is the limitations and biases that the human mind brings to any analysis of text. Charles Peirce, a seminal thinker in the field of semiotics, talked of the understanding or effect of the linguistic sign. He was saying that signs, the elements of language, can have different meanings for, and effects on, the interpreter of the sign. The limits of human memory mean it's not possible to read the equivalent of “War and Peace”, about 560,000 words, in a couple of days and summarize all of it. And this is a small amount of text; 1.5 million words in a corpus is entirely possible.

The more we learn about the structure of social media posts, the better we can segment and analyse them. Automated analysis of text is a great way to build a foundation for better understanding the sea of text we are now faced with. After all, it's not like we can read it all.