The use of digital data in social science is one of the leading research directions in Computational Social Science. Over the last twenty years, the availability of digital data sources has increased significantly. These sources are very diverse. They include, among others, data collected by sensors and digital devices, massive digital transaction databases (e.g., banks, health, government), digitized content from earlier eras (e.g., old newspaper articles, birth records), and content published online. A significant part of the latter is user-generated: posts, images, comments, and other activities on social media. This social media content is particularly exciting from a social science point of view because it allows direct observation of people's behavior, which offers a new perspective compared to survey research, which mostly focuses on attitudes.
However, accessing social media data is not a trivial task. In the early 2010s, access was mainly through so-called APIs (Application Programming Interfaces), which allowed researchers to access large amounts of social media content quickly and relatively cheaply. APIs help computers exchange data with each other by opening a gateway to a specific part of a database through authenticated channels. Public APIs provide easy access to large amounts of data, but both the quality and the amount of data offered vary from platform to platform. For some platforms, such as Twitter, this remains one of the most efficient ways to access data. For other social networking sites, such as Facebook or Instagram, the platform owners have shut down API access or made it drastically more difficult (Breuer et al. 2021). The closure of APIs is mainly a consequence of the Cambridge Analytica scandal, although the tightening data protection environment has also pushed platforms to restrict data access regardless. In this context, Freelon (2018) wrote that Computational Social Science has entered the "post-API" era, and Bruns (2019) called the situation the "APIcalypse". Others, such as Tromble (2021) or Puschmann (2019), highlight the positive impact of this process, which finally ended the "wild west of social media research".
This challenging data access environment calls for new models of accessing digital data.
A study published by the NetGain Partnership (Shapiro et al., 2021) distinguishes between two broad strands of digital data access: those approaches that work with platforms and those that do not.
The platform collaboration models are:
- Differential Privacy (DP)
- Platforms sharing data directly with publication restrictions
- Access in a controlled environment
Platform-independent data collection methods include:
- Web scraping
- App scraping
- Data donation
We started pilot research on data donation in 2018. In the data donation model, participants are asked to share the data that a platform stores about them. To comply with the GDPR, large platforms must give their users the possibility to access and download the data stored about them in data download packages (DDPs). Large Western platforms such as Google, Facebook, Instagram, WhatsApp, or Netflix provide a user-friendly way for users to access and download their data. The user can download this data and then share it with researchers. This allows researchers to access social media data in a completely legal environment.
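To make the idea of a data download package more concrete, the following minimal Python sketch shows how a donated DDP could be inspected before analysis. It assumes that the package arrives as a ZIP archive containing JSON files. The folder and file names (`posts/your_posts.json`, `comments/comments.json`) and the function names are hypothetical and chosen only for illustration; real packages differ between platforms and change over time, and this is not the pipeline used in the project.

```python
import io
import json
import zipfile


def summarize_ddp(archive):
    """Count top-level records in every JSON file of a data download package.

    `archive` is a path or a file-like object pointing at a ZIP file.
    Returns a dict mapping each JSON file name to its number of records.
    """
    summary = {}
    with zipfile.ZipFile(archive) as zf:
        for name in sorted(zf.namelist()):
            if not name.endswith(".json"):
                continue  # skip media files, HTML pages, and directories
            payload = json.loads(zf.read(name).decode("utf-8"))
            # A JSON list counts its items; any other JSON value counts as one record.
            summary[name] = len(payload) if isinstance(payload, list) else 1
    return summary


def build_example_ddp():
    """Create a tiny in-memory ZIP that mimics a DDP (hypothetical layout)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("posts/your_posts.json", json.dumps([{"text": "a"}, {"text": "b"}]))
        zf.writestr("comments/comments.json", json.dumps([{"text": "c"}]))
        zf.writestr("photos/photo_1.jpg", b"not real image data")
    buffer.seek(0)
    return buffer


example_summary = summarize_ddp(build_example_ddp())
print(example_summary)
```

An overview of this kind could help decide which parts of a package are relevant to a study before any personal content is read or stored.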
Our research started from the problem presented here: the previous API-based data access no longer works for Facebook. Therefore, new ways of collecting data need to be explored. At the outset of the project, we set out to conduct a 150-person pilot Facebook study with three aims: to test the technical feasibility of a data donation approach, to develop a technical and content framework suitable for analyzing highly diverse social media data, and to explore how to link data generated in the digital space with survey data. We also identified two additional research objectives. The first was to explore how textual data, which makes up a significant proportion of social media data, can be effectively processed and integrated into the focus of the study. The second was to determine how social media data purchased from external actors can be incorporated into this analytical framework.
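The linkage of survey data with donated digital data mentioned above can be illustrated with a short Python sketch. It assumes that both data sources carry a shared pseudonymous identifier, called `participant_id` here; this field, the other column names, and the example values are hypothetical and do not describe the project's actual data model.

```python
def link_survey_and_donations(survey_rows, donation_rows, key="participant_id"):
    """Attach donation summaries to survey responses via a shared key.

    Every survey respondent is kept (a left join); respondents without a
    donated package are flagged with has_donation = False.
    """
    donations_by_id = {row[key]: row for row in donation_rows}
    linked = []
    for survey in survey_rows:
        donation = donations_by_id.get(survey[key])
        merged = dict(survey)  # copy so the input rows are not modified
        merged["has_donation"] = donation is not None
        if donation is not None:
            # Add all donation fields except the key, which is already present.
            merged.update({k: v for k, v in donation.items() if k != key})
        linked.append(merged)
    return linked


# Hypothetical example data: two survey respondents, one donated package.
survey_rows = [
    {"participant_id": "p001", "trust_in_media": 4},
    {"participant_id": "p002", "trust_in_media": 2},
]
donation_rows = [
    {"participant_id": "p001", "n_posts": 12, "n_comments": 40},
]

linked = link_survey_and_donations(survey_rows, donation_rows)
for row in linked:
    print(row)
```

Keeping respondents without a donation in the linked data makes it possible to compare donors and non-donors on their survey answers.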