Exploring the limits of social media transparency data

apparebit/diaphanous

Diaphanous: Transparency Disclosures About the Sexual Exploitation of Minors

This repository curates quantitative transparency disclosures about the online sexual exploitation of minors, i.e., people under the age of eighteen, in machine-readable form. It also includes a 4,400-line Python library for validating and tidying the data, as well as Python and R notebooks with the analysis for the corresponding report, Putting the Count Back Into Accountability: An Analysis of Transparency Data About the Sexual Exploitation of Minors, which is also available through this repository.

Please cite as: Robert Grimm. Diaphanous: Transparency Disclosures About the Sexual Exploitation of Minors. Zenodo, 12 Dec. 2024, DOI.

The Code

To run the code in this repository, you'll need the following tools:

  • According to vermin, the minimum required Python version is 3.11.
  • The analysis/platform.ipynb notebook is written in Python and R. The necessary bindings are provided by the rpy2 Python package. The package is installed like other Python packages as described in the next bullet point. But it does require a working R installation (e.g., brew install r).
  • Required Python packages are listed in the repository's pyproject.toml. The simplest way of installing the project's dependencies is to create a local clone of this repository and then install it thusly:
    $ python -m venv .venv   # Create virtual environment
    $ . .venv/bin/activate   # Activate virtual environment
    $ pip install -e .       # Install diaphanous as editable
    Thanks to the -e option, pip install creates a so-called editable install, i.e., it makes the Python code in the diaphanous package executable without copying it. It also installs all necessary dependencies.
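
Before installing, it may help to confirm that the prerequisites listed above are in place. The following is only a minimal sketch and not part of the diaphanous package; the check_environment() helper is a hypothetical name, and the sketch assumes that R is reachable as an executable called R on the PATH.

```python
import importlib.util
import shutil
import sys

MIN_PYTHON = (3, 11)  # minimum version stated in this README


def check_environment() -> dict[str, bool]:
    """Report which prerequisites appear to be available (hypothetical helper)."""
    return {
        "python>=3.11": sys.version_info >= MIN_PYTHON,
        # find_spec() returns None when a top-level package is not installed.
        "rpy2 installed": importlib.util.find_spec("rpy2") is not None,
        # which() returns None when no such executable is on the PATH;
        # the executable name "R" is an assumption about the local setup.
        "R on PATH": shutil.which("R") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

A "missing" entry for rpy2 is expected before running pip install -e ., since that step installs the Python dependencies.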

Building the report requires additional tools, notably a working LaTeX installation, though the necessary incantations are scripted.

The Data

While a few CSV files contain tidy data, others are decidedly untidy with, for example, individual columns combining two variables. The organization of a dataset usually reflects that of the original disclosure and helps ensure the correctness of data transcription. The Python package includes several examples of how to tidy up such data.

Dataset 1: CyberTipline Reports per Year (1998 onward)

The CyberTipline reports per year dataset captures the number of reports NCMEC received on its CyberTipline since inception in March 1998, largely based on the table included in Appendix A of its 2022 and 2023 transparency reports to the Office of Juvenile Justice and Delinquency Prevention at the Department of Justice.

Simon Kemp's Digital 2024: Global Overview Report includes statistics on the global number of social media user identities. They are an effective denominator for normalizing the CyberTipline reports per year.
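
The normalization can be sketched as a simple per-year ratio. This is an illustration only, not the code used in the notebooks: the function name is hypothetical, the scaling to "per 1,000 user identities" is an arbitrary choice, and all numbers are made-up placeholders rather than actual counts.

```python
def reports_per_thousand_users(
    reports: dict[int, int], users: dict[int, int]
) -> dict[int, float]:
    """Return reports per 1,000 user identities for years present in both inputs."""
    return {
        year: 1000 * reports[year] / users[year]
        for year in sorted(reports.keys() & users.keys())
        if users[year] > 0  # skip years without a usable denominator
    }


# Placeholder values for illustration only; these are NOT real counts.
example_reports = {2001: 50, 2002: 120, 2003: 300}
example_users = {2002: 1_000, 2003: 2_000, 2004: 4_000}

rates = reports_per_thousand_users(example_reports, example_users)
print(rates)
```

Only years covered by both series survive, which matters here because the report counts reach back to 1998 while user statistics start later.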

Dataset 2: CyberTipline Report Contents and Recipients (2020 onward)

The CyberTipline report contents and recipients dataset breaks down the reports NCMEC received by:

  • the category of sexual exploitation, e.g., whether a report concerns child pornography, misleading words/images, online enticement, child sex trafficking, obscene material sent to a child, misleading domain names, child sexual molestation, or child sex tourism;
  • the kind of attachments, e.g., photos, videos, or other;
  • the uniqueness of attachments as determined by a precise hash (MD5) and a perceptual hash (PhotoDNA, Videntifier);
  • their level of detail, i.e., whether they are actionable or only informational;
  • their recipients in dedicated units, local, federal, or international law enforcement.

Labels for the uniqueness classification use "unique" for precisely hashed attachments and "similar" for perceptually hashed ones. The dataset combines several tables from NCMEC's 2022 and 2023 transparency reports to the Office of Juvenile Justice and Delinquency Prevention at the Department of Justice.
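
To illustrate what "unique" means under a precise hash, here is a minimal sketch that counts distinct MD5 digests with Python's standard hashlib module. It is not NCMEC's pipeline, and the helper names and byte strings are made up. Perceptual hashes such as PhotoDNA are not publicly available, so the "similar" classification is not shown.

```python
import hashlib


def md5_digest(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of the given bytes."""
    return hashlib.md5(data).hexdigest()


def count_unique(attachments: list[bytes]) -> tuple[int, int]:
    """Return (total, unique), where unique counts distinct MD5 digests."""
    digests = {md5_digest(blob) for blob in attachments}
    return len(attachments), len(digests)


# Stand-in byte strings; the last one differs from the first by one byte.
files = [b"alpha", b"alpha", b"beta", b"alpha "]
total, unique = count_unique(files)
print(total, unique)
```

Because a precise hash changes with any change to the input, the trailing space in the last entry makes it count as a separate unique attachment. A perceptual hash is instead designed to treat such near-duplicates as similar.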

Dataset 3: CyberTipline Reports per Platform (2019 onward)

The CyberTipline reports per platform dataset is the project's main dataset. It collects:

  • disclosures about child sexual exploitation by major non-Chinese social networks and other large service providers;
  • corresponding disclosures about service providers' reporting by NCMEC.

The above linked JSON format is automatically generated from a Python module. Both formats have the same structure and contain the same information.

The dataset incorporates information about these platforms:

  • Amazon (owns Twitch)
  • Apple
  • Automattic (owns Tumblr and WordPress)
  • Aylo (née MindGeek)
  • Discord
  • Facebook (Meta)
  • GitHub (Microsoft)
  • Google (owns YouTube)
  • Instagram (Meta)
  • LinkedIn (Microsoft)
  • Meta (owns Facebook, Instagram, and WhatsApp)
  • Microsoft (owns GitHub and LinkedIn)
  • MindGeek (now Aylo)
  • Omegle
  • Pinterest
  • Pornhub (Aylo)
  • Quora
  • Reddit
  • Snap
  • Telegram
  • TikTok
  • Tumblr (Automattic)
  • Twitch (Amazon)
  • Twitter (now X)
  • WhatsApp (Meta)
  • Wikimedia
  • WordPress (Automattic)
  • X (née Twitter)
  • YouTube (Google)

Surveyed organizations fall into at least one of the following categories:

  • Social media based on Buffer's list of top social media sites,
  • Popular platforms based on the European Commission's list of very large online platforms,
  • Platforms with considerable reported child sexual exploitation activity based on NCMEC's transparency disclosures.

A separate codebook documents the JSON and Python formats. Basically, they consist of a top-level object that maps organization names to an object with the data about that organization. Since platforms vary widely in what metrics they disclose, the format necessarily is rather generic and collects all of a platform's quantitative disclosures within one table:

  • Since platforms make transparency disclosures for quarter, half, and full years, each table also organizes metrics into time periods with the same granularity.

  • To faithfully capture disclosures, time periods may vary within a table. They may also overlap, both to capture several partial disclosures and to capture several redundant disclosures. A flag clearly marks the latter entries.

  • Where possible, the table uses standard labels for equivalent metrics:

    • reports tallies CyberTipline reports to NCMEC;
    • pieces tallies instances of CSAM such as pictures and videos;
    • accounts tallies user registrations implicated and terminated for CSAM.

    Instead of "account termination," many platforms use a euphemism such as "permanent suspension." User registrations thusly impacted are included under accounts. However, temporarily impacted registrations are not.
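
The handling of mixed and overlapping time periods can be sketched as follows. The codebook remains the authoritative description of the format; everything here is an assumption made for illustration, including the period label syntax ("2023", "2023 H1", "2023 Q3"), the "period" and "redundant" field names, and the counts.

```python
import re
from datetime import date

# Hypothetical label syntax: "2023" (full year), "2023 H1", "2023 Q3".
_PERIOD = re.compile(r"^(\d{4})(?: ([HQ])(\d))?$")


def period_bounds(label: str) -> tuple[date, date]:
    """Return the first day of a period and its exclusive end date."""
    match = _PERIOD.match(label)
    if match is None:
        raise ValueError(f"unrecognized period label: {label!r}")
    year = int(match.group(1))
    kind, index = match.group(2), match.group(3)
    if kind is None:
        return date(year, 1, 1), date(year + 1, 1, 1)
    months = 6 if kind == "H" else 3
    number = int(index)
    if not 1 <= number <= 12 // months:
        raise ValueError(f"period index out of range: {label!r}")
    start_month = (number - 1) * months + 1
    end_month = start_month + months  # 13 means January of the next year
    end = date(year + 1, 1, 1) if end_month > 12 else date(year, end_month, 1)
    return date(year, start_month, 1), end


def overlaps(first: str, second: str) -> bool:
    """Check whether two period labels cover at least one common day."""
    a_start, a_end = period_bounds(first)
    b_start, b_end = period_bounds(second)
    return a_start < b_end and b_start < a_end


# Made-up rows: two half-year disclosures plus a redundant full-year one.
rows = [
    {"period": "2023 H1", "reports": 10, "redundant": False},
    {"period": "2023 H2", "reports": 14, "redundant": False},
    {"period": "2023", "reports": 24, "redundant": True},
]

# Skipping flagged rows avoids counting the same reports twice.
total_reports = sum(row["reports"] for row in rows if not row["redundant"])
print(total_reports)
```

Treating each period as a half-open date range keeps the overlap test to a single comparison, regardless of whether the periods are quarters, half years, or full years.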

Comparable CyberTipline report counts and per-provider comparable CyberTipline report counts are materialized views onto the same data. Both views are in long format and only include rows for counts that were disclosed by both electronic service provider and NCMEC.

The latter, more precise view has year, observer, count, and topic columns, with the topic column enabling the grouping of rows with service provider and NCMEC as observers. The former, simplified view has only id, observer, and count columns, with the ID column effectively combining the other view's year and topic columns and the observer column only distinguishing between a generic ServiceProvider and NCMEC.
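
The relationship between the two views can be sketched in plain Python. The column names follow the description above, but the platform name, the counts, and the exact format of the combined id are placeholders chosen for illustration; the actual files may encode the id differently.

```python
from collections import defaultdict

# Placeholder rows in the more precise long format; names and counts are made up.
precise_view = [
    {"year": 2001, "observer": "ExamplePlatform", "topic": "ExamplePlatform", "count": 100},
    {"year": 2001, "observer": "NCMEC", "topic": "ExamplePlatform", "count": 90},
    {"year": 2002, "observer": "ExamplePlatform", "topic": "ExamplePlatform", "count": 150},
    {"year": 2002, "observer": "NCMEC", "topic": "ExamplePlatform", "count": 155},
]


def generic_observer(observer: str) -> str:
    """Map every observer other than NCMEC to a generic ServiceProvider."""
    return "NCMEC" if observer == "NCMEC" else "ServiceProvider"


def simplify(rows: list[dict]) -> list[dict]:
    """Collapse year and topic into one id and make the observer generic."""
    return [
        {
            "id": f"{row['topic']} {row['year']}",  # hypothetical id format
            "observer": generic_observer(row["observer"]),
            "count": row["count"],
        }
        for row in rows
    ]


def differences(rows: list[dict]) -> dict[str, int]:
    """Map each id to the service provider's count minus NCMEC's count."""
    by_id: dict[str, dict[str, int]] = defaultdict(dict)
    for row in simplify(rows):
        by_id[row["id"]][row["observer"]] = row["count"]
    return {
        key: counts["ServiceProvider"] - counts["NCMEC"]
        for key, counts in by_id.items()
        if len(counts) == 2  # keep only ids disclosed by both observers
    }


print(differences(precise_view))
```

Grouping by topic and year is what makes the two observers' counts for the same platform and period directly comparable.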

Dataset 4: CyberTipline Reports per Country (2019 onward)

CyberTipline reports per country collects NCMEC's per-country breakdown of CyberTipline reports for 2019, 2020, 2021, 2022, and 2023 in machine-readable form. The CSV table is mostly straightforward: its first two columns contain the country name and ISO three-letter code, followed by one column per covered year.

To preserve all information from NCMEC's disclosures, the table includes rows for the Netherlands Antilles (ANT), "Europe" (EEE), Bouvet Island (BVT), and "No Country Listed" (no code). NCMEC does not explain why it includes Europe in addition to individual European countries, or the Netherlands Antilles in addition to its 2010 successors Bonaire, Sint Eustatius, and Saba (BES), Curaçao (CUW), and Sint Maarten (SXM). Nor does it explain the inclusion of Bouvet Island; the subantarctic dependency of Norway is an uninhabited nature reserve and hence rather unlikely to be the actual location of internet users.
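
When computing per-country statistics, it can be useful to set these special rows aside. The following minimal sketch uses only the standard library. The column names ("country", "iso3") and all counts are placeholders rather than the actual CSV contents, "Exampleland" is a fictional country, and treating Bouvet Island as a special row is a judgment call.

```python
import csv
import io

# Placeholder table; column names and counts are illustrative, not real data.
SAMPLE = """country,iso3,2019,2020
Exampleland,EXA,10,20
Europe,EEE,3,4
Netherlands Antilles,ANT,1,0
Bouvet Island,BVT,2,1
No Country Listed,,7,9
"""

# Codes of the rows discussed above; the empty string stands for "no code".
SPECIAL_CODES = {"ANT", "EEE", "BVT", ""}


def split_rows(text: str) -> tuple[list[dict], list[dict]]:
    """Split CSV rows into regular countries and special entries."""
    regular, special = [], []
    for row in csv.DictReader(io.StringIO(text)):
        target = special if row["iso3"] in SPECIAL_CODES else regular
        target.append(row)
    return regular, special


regular, special = split_rows(SAMPLE)
print([row["country"] for row in regular])
print([row["country"] for row in special])
```

Keeping the special rows in a separate list rather than dropping them preserves the totals NCMEC disclosed while keeping them out of per-country comparisons.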

This repository's Python package includes code that enriches this dataset with population counts, geometries, and region/continent information. It leverages the following data:

The following choropleths using the Equal Earth projection visualize CyberTipline reports per year per country per capita:

CyberTipline reports per capita per country per year

Dataset 5: Platform Data (2020 onward)

Discord, Meta, Microsoft, and TikTok have released (some) data in machine-readable form. This dataset contains the corresponding files. Discord's and Meta's data is in CSV format, Microsoft's in Excel format, and TikTok's in Excel and later in CSV format. Meta's and TikTok's files include historical data whereas Discord's and Microsoft's do not. Since Meta re-uses the same URL every quarter, files released before Q2 2022 were retrieved from the Internet Archive's snapshots.

Dataset 6: Relationship between Offender and Victim

The CSAM pieces by relationship to victim dataset captures the relationship between suspected offenders and victims as determined by law enforcement agencies and tabulated by NCMEC. It is included in NCMEC's 2022 and 2023 transparency reports to the Office of Juvenile Justice and Delinquency Prevention at the Department of Justice.

Since the number of victims in NCMEC's database seems to be very small, I pulled in two more datasets characterizing relationships as well. The first stems from OJJDP's Statistical Briefing Book and covers years 2018 and 2019. The data was originally extracted from the FBI's National Incident-Based Reporting System Master Files. Note that all counts are relative to "typical 1,000 sexual assaults." The second stems from LEARCAT and covers the year 2016. It also draws on the FBI's National Incident-Based Reporting System. While the Briefing Book data is helpful indeed, the choice of relationship bins for the LEARCAT data renders it close to useless in this context.

Other Data

The data directory contains a few more tables, including one with global population sizes also provided by the UN Population Division and one with Meta's daily and monthly active people, which captures the number of users who logged into Facebook, Instagram, Messenger, or WhatsApp at least once over a day or month. Both tables are used to calculate Meta's daily and monthly active people as a fraction of the world population.

Repository Layout

In addition to the data, this repository also contains the Python code for analyzing it as well as resulting figures. In particular:

  • The analysis directory contains notebooks with the high-level analysis code. The index.ipynb notebook includes almost all other notebooks.
  • The diaphanous directory contains the Python library code used by the notebooks.
    • The remaining code in diaphanous.main should be refactored into notebooks.
    • The show() function in diaphanous.show is more generally useful. Most of this functionality should be up-streamed to Pandas because it significantly improves on the default table format.
  • The figure directory contains SVG figures.
  • The stubs directory contains typing stubs.
  • The report directory contains the LaTeX sources for the article discussing the work.

Acronyms

  • CSAM: Child Sexual Abuse Material
  • CSE: Child Sexual Exploitation
  • NCMEC: National Center for Missing and Exploited Children
  • OCSE: Online Child Sexual Exploitation
  • OJJDP: Office of Juvenile Justice and Delinquency Prevention (at the US Department of Justice)

Licensing

The code in this repository is ©️ 2023–2024 by Robert Grimm and has been released under the Apache 2.0 open source license. The datasets in this repository combine disclosures by electronic service providers as well as the National Center for Missing and Exploited Children (NCMEC) and make this data more easily accessible in machine-readable form. The data has been released under the CC BY 4.0 license.