How do we structure the sample? Which characteristics of the
population do we use to generate it? How big should a sample be? I'm
not being rhetorical here; I'd really like to hear some thoughts
on these questions.

The issue here is not the difficulty of getting the raw information,
since thousands of social media postings can be obtained easily and
very quickly. The sampling issue comes into play as a consequence of
trying to analyse 100,000 comments or postings. Physically reading
and summarizing them all in any reasonable time is expensive, time
consuming and impractical. At a very conservative estimate, 100,000
posts would be the equivalent of a novel of 700,000 words or more.
According to Amazon the median length of a novel is about 64,000
words, so 700,000 words is more than 10 novels to read and summarize.
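
To make the arithmetic explicit, here is a rough back-of-envelope sketch. The figure of 7 words per post is my assumption, implied by equating 100,000 posts with 700,000 words; real averages will differ by platform and are probably higher.

```python
# Back-of-envelope reading load for a batch of social media posts.
# WORDS_PER_POST is an assumed, deliberately conservative average.
POSTS = 100_000
WORDS_PER_POST = 7
MEDIAN_NOVEL_WORDS = 64_000  # median novel length quoted in the text

total_words = POSTS * WORDS_PER_POST
novels = total_words / MEDIAN_NOVEL_WORDS

print(f"{total_words:,} words is about {novels:.1f} median-length novels")
```

Changing WORDS_PER_POST to a more realistic value only makes the reading load worse.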

The problem grows when the posts contain embedded links to websites
and images. The tendrils of social media posts can be vast. Some sort
of sampling therefore seems logical: take a subset of the posts,
controlled in some way, and analyse that subset. A tenth of 100,000
posts is 10,000, but that is still a large number to code and
analyse, and it seems a pity to ignore the other 90,000 posts.
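
For illustration only, the simplest version of "take a subset" is a simple random sample. The sketch below assumes the posts are already held in a Python list; the function name sample_posts and the fixed seed are my own choices, and a sample "controlled in some way" (stratified by author, date or topic, say) would need more than this.

```python
import random

def sample_posts(posts, fraction=0.1, seed=42):
    """Draw a simple random sample of posts without replacement.

    A fixed seed makes the draw reproducible; this is plain random
    sampling, not a stratified or otherwise controlled design.
    """
    if not posts:
        return []
    rng = random.Random(seed)
    k = min(len(posts), max(1, round(len(posts) * fraction)))
    return rng.sample(posts, k)

# Hypothetical stand-in data: 100,000 placeholder posts.
posts = [f"post {i}" for i in range(100_000)]
subset = sample_posts(posts)
print(len(subset), "posts sampled out of", len(posts))
```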

Sampling in the survey world is a consequence of the difficulty of
obtaining data. What we hardly ever do is discard completed survey
data once we have it. We analyse all completed (and sometimes
incomplete) survey responses, so why not analyse all the social media
postings? Unlike survey data, the problem with social media is not
acquiring the data; it is the analysis that is challenging.

I think the way forward is to look at social media as a behavioural
data stream. Posting is a behaviour, and it should be analysed and
quantified like any other form of behavioural data. We need a theory
of why people post; it can't be random. Throwing away the vast
majority of the social media posts collected simply because
analytical tools are not being used doesn't seem like a good idea to
me. Automated metrics and analyses can be generated for 100,000 posts
fairly easily, whereas coding 100,000 posts by hand is expensive and
of limited value.
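
As a hedged example of what "automated metrics" might mean at the most basic level, the sketch below counts posts, words and the most frequent terms using only the Python standard library. The function name summarize, the crude tokenizer and the three example posts are all made up for illustration; a real analysis would at least handle stop words, hashtags, emoji and links.

```python
import re
from collections import Counter

# Crude tokenizer: runs of lowercase letters, digits and apostrophes.
TOKEN = re.compile(r"[a-z0-9']+")

def summarize(posts, top_n=5):
    """Return simple descriptive metrics for a list of post strings."""
    term_counts = Counter()
    total_words = 0
    for post in posts:
        tokens = TOKEN.findall(post.lower())
        total_words += len(tokens)
        term_counts.update(tokens)
    n = len(posts)
    return {
        "posts": n,
        "total_words": total_words,
        "mean_words_per_post": total_words / n if n else 0.0,
        "top_terms": term_counts.most_common(top_n),
    }

# Invented example posts, not real data.
example = [
    "Love the new phone",
    "The battery on the new phone is awful",
    "new phone, who dis",
]
result = summarize(example, top_n=3)
print(result)
```

The same loop runs over 100,000 posts in seconds, which is the point: the numbers are cheap to produce, and the effort goes into interpreting them.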

There is nothing magical about words. Languages are structured
systems of signs, and they are amenable to summary and analysis in
the same way as survey data. The automated methods may be very
different from the ones used on survey data, but they are available.
This is not an attempt to remove human analysts from their role,
though. Statistics are just results; they need to be interpreted. In
the same way, automated analyses of social media data need
interpretation.

I've often read that human interpretation of social media data is the
only way, applying qualitative techniques to social media as if it
were some vast focus group. Writers, rightly, talk about how powerful
our minds are when it comes to the interpretation of language and
culture. What no one seems to mention is the limitations and biases
that the human mind brings to any analysis of text. Charles Peirce,
a seminal thinker in the field of semiotics, talked of the
understanding or effect of the linguistic sign. He was saying that
signs, the elements of language, can have different meanings for, and
effects on, the interpreter of the sign.
The limits of human memory mean it's not possible to read the
equivalent of “War and Peace”, about 560,000 words, in a couple of
days and summarize all of it. And this is a small amount of text; a
corpus of 1.5 million words is entirely possible.

The more we learn about the structure of social media posts, the
better we can segment and analyse them. Automated analysis of text is
a great way to build a foundation for better understanding the sea of
text we are now faced with. After all, it's not like we can read it
all.