Hayley Lepps is a Researcher at NatCen Social Research
Earlier in the year I was invited to attend a two-day workshop hosted by the University of Aberdeen, as part of their ESRC-funded project “Social Media, Privacy and Risk: Towards More Ethical Research Methodologies”.
The workshop was attended by some of the leading thinkers and practitioners in social media research. Drawing on our experiences and research (including my own work on NatCen's “Research using Social Media: Users’ Views” project), we aimed to develop a set of flexible guidelines for the ethical use of social media data in research, intended for researchers, students, ethics committees, funding bodies, and anyone else with an interest in the ethics of working with social media data. Following the workshop, the outputs were pulled together by the team at Aberdeen, and the draft guidelines were discussed at the AcSS/NSMNSS conference on ethics and social media research.
I was very happy to see the guidelines published last week. I would very much recommend that anyone interested in conducting ethical social media research give them a read, and I hope they provide a useful contribution to ongoing debates and best practice in ethical social media research in the social sciences. You can find the final version of the guidelines here: www.dotrural.ac.uk/socialmediaresearchethics.pdf
Wednesday, 13 July 2016
Thursday, 7 July 2016
Curtis Jessop is a Senior Researcher at NatCen Social Research and is the Network Lead for the NSMNSS network
On Wednesday 29th June I attended a roundtable hosted by our network partners SAGE on using big data to solve social science problems. It was a great day, with contributions from leading researchers and lots of discussion of some of the key issues of working with big data in social science.
Jane Elliott began with an overview of the ESRC’s Big Data Network. She identified the difficulties with data access that earlier phases had faced, but also highlighted key challenges that big data social science currently faces:
- Can we apply the same qualitative techniques and statistical inferences we have used in the past?
- Are social scientists falling behind in using machine learning and algorithms? What are the implications of these methods?
- Making sure we use big data to answer pertinent social science questions, and not just focus on methods
- Working ethically with big data - data security, anonymity, informed consent & data ownership
- What are the implications of a ‘big data society’/algorithm-led decision making?
New methods, tools and techniques for big data research
Giuseppe Veltri outlined how data-driven science differs from ‘traditional’ social science research as it generates hypotheses and insights from the data, rather than theory, combining abductive, inductive & deductive approaches. Further, Phillip Brooker identified a tension in big data analysis between wanting to use qualitative research approaches with data of a scale that requires numerical treatment. As a result, social scientists need to work with ‘unfamiliar’ techniques and software.
Tools for Big Data analysis
It was generally agreed that existing software is not fit for addressing academic and social science research questions. Moreover, tools offered by commercial companies are often ‘black boxes’, whereas social scientists need to be transparent about the algorithms they use, as these form part of the methodology.
Many at the roundtable have therefore developed their own tools (e.g. COSMOS, Textonics, Chorus, and Method52 from CASM) to enable them to conduct analysis in the manner they wanted. However, it was felt there was still some way to go: many of these tools are ‘in-house’, and ongoing funding and support are needed to develop something more stable, well-supported, and ‘outward facing’.
One approach to addressing the challenges of big data analysis is working in interdisciplinary teams (in particular, linking between social and computer science departments). Luke Sloan and Mark Carrigan identified that the key challenge of this at a ‘human level’ is ensuring a common understanding of language; once that was established, it was easy to have an open discussion and there were rarely disagreements. Mark argued that the key was not necessarily making sure that everyone had the same definitions, but that there was an understanding that different fields may have different perspectives.
Mark Kennedy, based on his experiences at the Data Science Institute, emphasised the importance of ‘getting excited’ about the right research question, not just focusing on the technology, and then building a team with the skills needed to answer it.
However, attendees felt that there were structural barriers to interdisciplinary working in academia – departmental silos, geography, navigating different funding bodies, finding journals to publish in, and demonstrating value for the REF were all recognised as problems, although it was also mentioned that funding increasingly supported this approach.
Training in the social sciences
Quite early in the discussion, the question was raised: if there is such a clear skills gap in the social sciences, why have universities not responded to it?
Although it was accepted that training needed to address big data methods, there were differing opinions on how feasible this might be. Adding new techniques into methods courses was welcomed, but to what extent is this achievable when those courses are already packed with ‘traditional’ methods? Further, given the relative rarity of established social scientists with this skill-set, who would provide the teaching?
Although it was felt that new students are open to using Python or R and to new statistical techniques, the scarcity of trainers with the skills to teach both programming and its application within the social sciences was again identified as a problem. Giving students (and academics) access to data science training materials that are framed by social science problems, with relevant dummy data to work with, was suggested as a way to start addressing this.
Answering social science questions with Big Data
While discussing his own research, Slava Mikhaylov highlighted that a good way to make an impact is, rather than starting with a research question, to aim to solve a problem. This was echoed by Carl Miller, who outlined some principles that Demos follow for making an impact:
- Look beyond academic funders – if research is funded by a government department, they’re going to have to listen to it!
- Ask the right question – what is interesting to a researcher vs. a policy maker
- Answer quickly – policy interests change, and research won’t make an impact if everyone’s moved on
- Diversify outputs – can they be real-time, interactive, engaging?
- Networking – who are the champions of big data research?
Carl emphasised that this was just the approach Demos used, and that it may not be appropriate for all research or audiences. He also mentioned that, in a new discipline, you need to work hard to be responsible and transparent about what your research doesn’t do or say.
Ethics of research using Big Data
Anne Alexander differentiated between the ethics of research using big data and the ethics of doing research in a networked world.
On the latter, Anne felt that there has not been enough reflection on the implications of the ‘datafication’ of human interaction, and that we need to de-mystify these processes and consider what the use of machine learning/algorithms means for society (e.g. their potential for discrimination).
Anne emphasised the need to take into consideration the public’s views on this when considering Big Data research, a point reinforced by Steve Ginnis, whose work at Ipsos Mori on developing ethical guidelines for social media research drew on public ethics, existing industry guidelines, and legal frameworks.
Steve’s research found that the public have both low awareness of, and little enthusiasm for, their social media data being used for research. This was not just due to concerns about privacy and anonymisation: people were uncomfortable with being profiled and with its possible implications.
That said, participants were willing to weigh up the risks and benefits, and context (who is doing the research and why) was important. Nonetheless, the ‘fundamentals’ (consent, what information, anonymization, etc.) played a much larger role in whether they felt research using social data was appropriate.
Both Anne and Steve emphasised that ethics is an ongoing process, not a one-off event at the start of a project: it needs to be considered at the collection, analysis, and publication stages of the research cycle.
Some concluding thoughts
Carl Miller identified that in the context of pressure for evidence-based policy, digital by default, and the open data initiative, there has never been a better time for social scientists to make impact with big data research.
Wednesday’s session demonstrated how far big data analysis in the social sciences has come over recent years and it is impressive to hear how much work has been put into developing the tools and methods to mould this rich, but novel, form of data into social insights.
However, the session also showed that there are a number of areas that still need to be addressed if we are to make the most of big data:
- Access to large data sets continues to be an issue, be they proprietary, public, or administrative. We need to bargain collectively to talk to large, often global, actors and argue for academic access.
- There is a skills gap among social scientists for analysing big data, and support is needed to help develop the required methodological and programming skills.
- The interdisciplinary working required for big data analysis can be challenging, and we need to work to enable effective collaboration.
- Developing an ethical approach to big data analysis is challenging given its novelty, variety, and changing nature. Any framework needs to provide practical guidance to researchers while remaining flexible and responsive to changing contexts.
- Available tools for big data analysis can be expensive, lack transparency, or be inappropriate for social science research. A maintained central library of available tools, with appropriate documentation and guidance, could be extremely useful.
Friday, 1 July 2016
Joe Murphy is a senior survey methodologist with over 17 years of research and project management experience. Mr. Murphy has extensive experience developing and applying new technologies and modes of communication to improve the quality, relevance, and efficiency of survey research. His recent work has centered on the use and analysis of social media to supplement survey data, with a detailed focus on Twitter. Mr. Murphy also investigates optimal designs for multi-mode data collection platforms, data visualization, crowdsourcing, and social research in virtual worlds. Mr. Murphy is a demographer by training and survey methodologist by practice. His significant research experience includes the substantive topics of energy, hospitals and health care, and substance use and mental health. Mr. Murphy is also a proficient SAS programmer, experienced in the analysis and manipulation of large, complex data sets. @joejohnmurphy
1. Twitter is like a giant opt-in survey with one question.
Twitter started in 2006 with a simple prompt for its users: “what are you doing?” From a survey methodologist’s perspective, this isn’t really optimal question design. How people actually use Twitter is so varied, there might as well be no question at all. We aren’t used to working with answers to a question no one asked, and Twitter is a good example of what has been described as "organic data" – it just appears without our having designed for it. Tweets are limited to 140 characters in length. Pretty short, but a Tweet can capture a lot of information, and include links to other websites, photos, videos, and conversations.
2. Twitter is massive.
Every day, half a billion Tweets are posted. Half a billion! That means by the time you finish reading this, there will be approximately one million new Tweets. And the pace is only growing. With Twitter’s application programming interface (API) you can pull from a random 1% of Tweets. To get at all Tweets, or the Firehose (100% of Tweets), you need to go through one of a few vendors, for a fee, though the Library of Congress is working on providing access in the future.
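To put those numbers in perspective, a quick back-of-the-envelope calculation (taking the half-billion-per-day figure at face value) shows how fast tweets accumulate:

```python
# Back-of-the-envelope: how fast do tweets accumulate,
# assuming roughly 500 million tweets per day?
TWEETS_PER_DAY = 500_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

tweets_per_second = TWEETS_PER_DAY / SECONDS_PER_DAY
print(f"~{tweets_per_second:,.0f} tweets per second")  # ~5,787

# Time for one million new tweets to appear -- roughly
# the few minutes it takes to read this post:
minutes_for_a_million = 1_000_000 / tweets_per_second / 60
print(f"~{minutes_for_a_million:.1f} minutes per million tweets")  # ~2.9
```

So "a million tweets while you read this" holds up: at that rate, a million new tweets appear roughly every three minutes.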
3. Twitter is increasingly popular on mobile devices like smartphones and tablets.
You’ll see people tweeting at events, as news is happening right in front of them, or where you don’t really expect or want to see them tweeting, like while they’re driving. Many use Twitter on mobile devices with another screen on at the same time. That’s called multiscreening: for example, when people tweet while watching television, in a backchannel discussion with friends and fans of their favourite shows.
4. The user-base is large, but it doesn’t exactly reflect the general population.
It would be kind of weird if it did, honestly. There are surely many factors that influence the likelihood of adoption, and wouldn’t it be surprising if we saw no differences by demographics? The Pew Research Center estimates 16% of online Americans now use Twitter, and about half of those do so on a typical day. Users are younger, more urban, and disproportionately black non-Hispanic compared to the general population. This is interesting when thinking about new approaches for sometimes hard-to-reach populations.
5. It is made up of more than just people.
Twitter is not cleanly defined with one account per person or even just one person behind every account. Some people have multiple accounts and some accounts are inactive. Groups and organizations use Twitter to promote products and inform followers. They can purchase “promoted Tweets” that show up in users’ streams like a commercial. And watch out for robots! Some software applications run automated tasks to query or Retweet content, making it extra challenging when trying to interpret the data.
6. There are research applications beyond trying to supplant survey estimates.
Think about the survey lifecycle and where there may be needs for a large, cheap, timely source of data on behaviours and opinions or a standing network of users to provide information. In the design phase of a survey, can we use Twitter to help identify items to include? Can we identify and recruit subjects for a study using Twitter? How about a diary study when we need a more continuous data collection and want to let people work with a system they know instead of trying to train them to do something unfamiliar? Can Twitter be used to disseminate study results? What about network analysis? Is there information that can be gleaned from someone’s network of friends and followers, or the spread of tweets from one (or few) users to many? We often think of public opinion as characterizing sentiment at a specific place and time, but are there insights to be had from Twitter on opinion formation and influence?
7. Twitter is cheap and fast, but making sense of it may not be.
What’s the unit of analysis? Can we apply or adapt the total survey error framework when looking at Twitter? What does it mean when someone tweets as opposed to gives a response in a survey? Beyond demographics, how do Twitter users differ from other populations? How can we account for Twitter’s exponential growth when analysing the data? The best answer to each right now is “it depends” or “more research is needed.” We need a more solid understanding and some common metrics as we look to use Twitter for research. Work on this front is beginning but has a long way to go.
8. Naïve and general text mining methods for tweets can be severely lacking in quality.
The brevity of tweets, inclusion of misnomers, misspellings, slang, and sarcasm make sentiment analysis a real challenge. We’ve found the off-the-shelf systems pretty bad and inconsistent when coding sentiment on tweets. If you’re going to do automated sentiment analysis, be sure to account for nuances of your topic or population as much as possible and have a human coding component for validation. One approach we’ve found to be promising is to use crowdsourcing for human coding of tweet content.
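As an illustration of the problem (not a depiction of any particular off-the-shelf system), here is a deliberately naive lexicon-based scorer; the word list and example tweet are invented for this sketch. It shows how easily word-counting approaches are fooled by sarcasm:

```python
# A deliberately naive lexicon-based sentiment scorer, to illustrate
# why off-the-shelf word-counting approaches struggle with tweets.
# The lexicon and example tweet below are invented for illustration.
LEXICON = {"great": 1, "love": 1, "awesome": 1,
           "terrible": -1, "hate": -1, "broken": -1}

def naive_sentiment(tweet: str) -> int:
    """Sum the scores of any lexicon words found in the tweet."""
    cleaned = tweet.lower().replace("!", "").replace(".", "").replace(",", "")
    return sum(LEXICON.get(word, 0) for word in cleaned.split())

# A sarcastic complaint scores as positive: the scorer sees
# "great" and "love" but has no notion of sarcasm or context.
sarcastic = "Oh great, my train is delayed again. Love it!"
print(naive_sentiment(sarcastic))  # 2 -- scored positive, clearly wrong
```

A human coder would instantly read that tweet as a complaint, which is exactly why human validation (or crowdsourced coding, as suggested above) matters for tweet-level sentiment work.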
9. Beware of the curse of Big Data and the file cabinet effect.
Searching for patterns in trillions of data points, you’re bound to find coincidences with no predictive power or that can’t be replicated. The file cabinet effect is when researchers publish exciting results about Twitter but hide away their null or negative findings.
10. Surveys aren’t perfect either.
Surveys are getting harder to complete with issues like declining response rates and reduced landline coverage. Twitter isn’t a fix-all but it may be able to fill some gaps. It’ll take some focused study and creative thinking to get there.