Not all social data has value.
The time for entering a simple keyword search into a social listening tool and seeing what comes back has come and gone. The businesses winning at social data analysis have a more sophisticated approach – from asking the right questions and choosing the right sources, to filling in the ‘context holes’ and analysing and interpreting the data with rigour and consistency.
Rob Sullivan, data analyst at Social Chain, recently caught our attention with his LinkedIn posts about his approach to social data analysis. We caught up with Rob to get his take on what he calls ‘alternative data’, why it’s useful, how he selects the best data sources for his projects and how he analyses the returned data.
The Need for Alternative Data
The one thing that comes up time and time again when questioning the legitimacy of social data analysis is ‘sample bias’.
Rob tells us that he looks at information from across the whole spectrum of sources because “social data doesn’t cover everything”. He emphasised this with the example that “Twitter and Instagram US users are not the US census”.
Chris Hansen of Janys Analytics might disagree: in his 2015 TEDxMileHigh presentation, he argued that a sample of US and Western European Twitter users was a close representation of the rest of the population.
Of course, it’s important to bear in mind that people’s use of social networks changes over time, perhaps making Hansen’s findings outdated. After all, it wasn’t that long ago that there was a public outcry and a dip in usage of Snapchat when it changed its UX, with a single post by Kylie Jenner reportedly wiping $1.3 billion off its stock value.
While we’ve never been reliant on Snapchat data to mine for customer insight, it goes to show how perceptions of and participation in social networks change over time.
Rob emphasised that “2015 was a while back in social terms”, citing other studies looking at Twitter users versus the UK Census and the breadth of differences between the samples (for example, Mellon and Prosser).
He believes choosing the right social data source is more industry dependent.
“If you are Converse and want to assess a product launch, Twitter is ideal because people will gossip all day about the Miley or Tyler release. But, if you’re a B2B software provider, maybe not! Twitter is still lacking in capturing the silent majority, where Google Trends can assist.”
And, let’s not forget the fact that social data analysis is reliant on the networks providing API access. In the weeks after the Cambridge Analytica scandal we saw a change to Instagram data access, which resulted in a slew of changes in the social listening tools themselves. The most restrictive change we’ve experienced is no longer being able to access the comments on Instagram posts within the social listening tools.
Even before the changes in the Instagram API, there was the quiet death of new tools developed by Pulsar and Spredfast around Facebook Creative Insights, only a month or so after they launched. Adweek reported this change to be down to the fact that Facebook was moving to supply data directly to marketers instead of going through a middleman.
To counter these challenges, Rob also explores “alternative data sources”.
He believes that alternative data is the most interesting new area for research. Computing power and the rise of machine learning make it easier to use new (and sometimes untapped) data sources to understand consumer behaviours.
Rob includes reviews, forums and Google Trends data as examples of alternative data sources. As well as these he’s also interested in:
Satellite Data: for example, assessing agricultural yields, oil rig activity or car counts in parking lots. He says providers like Orbital Insight and Descartes Labs are doing really interesting things in this field, with investment banks and hedge funds interested in trading on their data. Rob says commodities are a natural focus here.
Web Scraping: something that Rob has done before. For example, scraping PistonHeads, the car website, to run a multiple regression and see if Mercedes has a brand effect in the used market vs. Audi and BMW. Other good analyses he’s seen involve scraping IMDb and AngelList.
PDF Parsing: Rob says that if you have a lot of PDFs lying around, like patents or SEC filings, you can use an OCR tool like Tesseract, or R, to get the text from thousands of physical or web documents. Have a look and see if there is a trend in what is being filed in a certain industry, and across all the different forms, e.g. insider trading disclosures. Rob gives the example of Wolfe Research, which uses text mining on corporate filings really well. He points out that if the language in a standardised part of an annual report suddenly changes, it may signal that the company is having some big problems.
Footfall Traffic: you can track volunteered geographic information through GPS, WiFi or CDMA signals. With this data you can try to determine footfall traffic in and around store locations. Rob says Foursquare famously did this with Chipotle a few years ago and correctly called earnings weakness.
Selecting the Best Data Source
If you already analyse social data, you’ll know that there is no one-stop shop for answers. It all depends on the question and the topic.
And, it’s a lot more than tapping a general search query into a social listening tool. You need to know the question that you want to answer and select the appropriate data sources [and metrics] to use.
If all of this is still a bit murky for you, Rob gave us a couple of examples to help get you started…
To look at trends in consumer electronics over the last 10 years, he would look at a combination of Google Trends (including shopping queries), Twitter, Instagram and annual reports.
If he wanted to explore marketing opportunities in nutrition, Rob would do things a little differently. He’d collate academic literature such as the randomised controlled trials and scientific papers, and compare the public interest to the scientific (e.g. Google hits vs. citations or sentiment around findings in papers).
Rob says “it’s the gap between scientific and public interest that’s the gold”.
He gives the example of St John’s Wort for depression, where scientific interest is really high but public interest is low. By contrast, black tea for cancer prevention has high public interest but little scientific evidence supporting it.
In selecting data sources for this, he looks at where the high quality conversations are. For example, he might look at forums like Bodybuilding.com if the brand is going after high lifetime value customers or a specific niche.
Product Launch Analysis
If Rob is trying to assess the success of a product launch for someone like Converse or Nike, Twitter may be a sufficient primary source. Activity on “sneakerhead” subreddits may also be relevant.
Unsurprisingly, the best data source is industry and topic specific. The best rule of thumb is to spend some time thinking about the most valuable data sources – the sources where people are talking in-depth about a particular topic.
Rob tends to use more than one data source in his analysis. These sources are chosen strategically: they complement the question being answered and help to fill in ‘context holes’.
Of all of Rob’s LinkedIn articles on his approach to social data analysis, one post that particularly caught our eye was about review data for Philips.
He tells us that he was using this Philips example to showcase how to get quantitative metrics from a qualitative source. Philips have a lot of product reviews across the web, with a lot on Chinese websites such as JD.com and Tmall.com, as well as Amazon and other eCommerce sites.
In his past experience with review data, Rob has found that many people only analyse star ratings and other structured data, but he warns that this isn’t where the most useful insight for brands lies.
And, we agree!
Rob explains that, for this example, he trained the Crimson Hexagon algorithm to look for things related to people feeling their money was wasted. He would, however, have to run the analysis across many of the negative themes detected to come to a clearer conclusion before making recommendations.
He also adds that it would be useful to compare the analysis with actual product revenues. If he and Social Chain worked with Philips on their reviews, they would most likely be looking at reporting significance. For example:
“x issue is common and tightly dispersed in y product group and it is impacting the bottom line independent of other factors. Not just x is common and in y product group”.
Rob, rightly, argues that you need to understand the significance. This is something he approaches strategically, looking at normalisation and a concept that he calls ‘dispersion’.
Rob says that in the Philips example with customer reviews, it is easy to normalise the data. The process he uses with Crimson Hexagon includes training an algorithm to recognise a type of topic in the text.
He gives an example: if 5% of 2 million posts come back on that topic, he would download those 100,000 posts, put them into Excel or SQL [or whatever is convenient], and count the occurrences of each unique URL. This gives the number of times the issue occurs for each unique product. Finally, divide that count by the total number of reviews for the product to get the normalised occurrences of the issue.
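The normalisation step described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Rob's actual workflow: the function name, the example URLs and the review totals are all hypothetical, and it assumes each flagged post carries the URL of the product it reviews.

```python
from collections import Counter


def normalised_issue_rates(flagged_urls, total_reviews_by_url):
    """Return issue mentions per review for each product URL.

    flagged_urls: one product URL per review flagged with the issue.
    total_reviews_by_url: total number of reviews for each product URL.
    """
    issue_counts = Counter(flagged_urls)  # occurrences of the issue per product
    rates = {}
    for url, count in issue_counts.items():
        total = total_reviews_by_url.get(url)
        if not total:
            continue  # skip products with unknown or zero review totals
        rates[url] = count / total
    return rates


# Hypothetical export: three flagged reviews across two made-up products.
flagged = ["shop.example/kettle", "shop.example/kettle", "shop.example/shaver"]
totals = {"shop.example/kettle": 40, "shop.example/shaver": 10}
rates = normalised_issue_rates(flagged, totals)
```

In this made-up example the kettle has more flagged reviews in absolute terms, but the shaver has the higher rate once review volume is taken into account, which is the point of normalising.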
Rob also looks at what he calls ‘dispersion’ and believes it’s important to consider dispersion when analysing reviews. Dispersion is when you look at the spread of the topic across multiple segments and the corresponding value of the normalised metric.
He asks: if there were 200 products with very low normalised counts, would the issue be seen as such a problem compared with 3 products with high counts?
Rob says that:
“As a company, it’s quite important to note if issues are only plaguing one type of product or if they occur over a few products. That has real cost implications for redesigning and remarketing products. To measure it you can use something like skewness or just basic percentiles and ranks or a ratio. For example, occurrences divided by the number of unique products”.
Taking the Philips example above, there were 715 occurrences of the “waste of money” theme across 317 products, at a ratio of 2.26.
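As a rough sketch of the two measures Rob mentions, the ratio and skewness, the Python below uses only the standard library. The function names and the list of per-product rates are hypothetical; only the 715 and 317 figures come from the Philips example.

```python
from statistics import mean, pstdev


def dispersion_ratio(occurrences, unique_products):
    """Occurrences of an issue divided by the number of unique products affected."""
    return occurrences / unique_products


def skewness(values):
    """Population skewness: positive when a few products sit far above the rest."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return 0.0  # every product has the same value, so there is no skew
    return mean(((v - mu) / sigma) ** 3 for v in values)


# Figures from the Philips example: 715 occurrences across 317 products.
ratio = dispersion_ratio(715, 317)

# Hypothetical normalised rates: three quiet products and one outlier.
example_rates = [0.01, 0.01, 0.02, 0.30]
skew = skewness(example_rates)
```

A ratio close to 1 would suggest the issue is spread thinly over many products, while a strongly positive skew in the normalised rates would suggest it is concentrated in a few of them.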
Can General Social Data Tell Us Anything?
We think you’ll agree that alternative data has a massive place in social analysis. But Rob is keen to remind us that general social data is still useful; it all depends on the use case.
When it comes to using alternative data sources Rob believes that we are only limited by our own imaginations.
The question is, what alternative data sources can you use to better understand your customers’ behaviours?
Do you use alternative data sources? Share your experiences with our Writers Account.