data dredging post
rlaker committed Feb 16, 2024
1 parent d8cb123 commit abf9407
Showing 2 changed files with 26 additions and 0 deletions.
26 changes: 26 additions & 0 deletions _posts/2024-02-16-hilarious-example-of-data-dredging.md
@@ -0,0 +1,26 @@
---
title: "Spreading Love and Margarine: the link between margarine consumption and the divorce rate"
layout: single
excerpt: "a hilarious demonstration of data dredging"
tags: [til, statistics]
---

![](https://tylervigen.com/spurious/correlation/image/5920_per-capita-consumption-of-margarine_correlates-with_the-divorce-rate-in-maine.svg)

Who knew that margarine consumption is correlated with the divorce rate in Maine? There is even a *very scientific* [paper](https://tylervigen.com/spurious/research-papers/5920_spreading-love-and-margarine-an-examination-of-the-butter-splitter-correlation-in-maine.pdf) on the subject.

This is just one of thousands of [spurious correlations](https://tylervigen.com/spurious-correlations) from [Tyler Vigen's](https://tylervigen.com/) hilarious demonstration of data dredging (his figures are even included on the associated [Wikipedia page](https://en.wikipedia.org/wiki/Data_dredging)). Data dredging is when you compare many variables against each other, say the 25,237 on his website, and blindly accept whichever correlations come out as statistically significant. With that many comparisons, some pairs will pass a significance test purely by chance.
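To see how easily this happens, here is a minimal sketch that dredges pure noise. It is not how Vigen's site works: the data are independent Gaussian random numbers rather than real datasets, the names `pearson` and `dredge` are made up for illustration, and the threshold of 0.632 is the two-tailed 5% critical value of Pearson's r for 10 data points.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def dredge(n_candidates=1000, n_points=10, critical_r=0.632, seed=0):
    """Correlate one noise series against many others and keep the 'hits'.

    Every series is independent random noise (an assumption of this sketch),
    so any correlation that clears the threshold is spurious by construction.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    hits = []
    for i in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        r = pearson(target, candidate)
        if abs(r) >= critical_r:
            hits.append((i, r))
    return hits


if __name__ == "__main__":
    hits = dredge()
    print(f"{len(hits)} of 1000 pure-noise variables look 'significant'")
```

Roughly 5% of the noise variables should clear the bar, which is exactly what a 5% significance level promises when there is nothing to find. Report only those hits and you have a "result".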

Turns out this is a major problem in the more statistics-heavy sciences, so much so that there is now a [pre-registration format](https://en.wikipedia.org/wiki/Preregistration_(science)#Registered_reports) in which researchers describe what a study will investigate before looking at any data.

This project also provides a great example of generating realistic-looking content, in the form of *scientific* papers, with LLMs. Each paper shows the sequence of prompts that were used to create it.

<a href = "https://tylervigen.com/spurious/research-papers/5920_spreading-love-and-margarine-an-examination-of-the-butter-splitter-correlation-in-maine.pdf">
<img src="/files/spreading-love-and-margarine-an-examination-of-the-butter-splitter-correlation-in-maine.pdf.png" alt="AI-generated paper for the relationship between margarine consumption and divorce rates in Maine" style="height: 600px; width:auto;">
</a>

The author does point out that:
> The silliness of the papers is an artifact of me (1) having fun and (2) acknowledging that realistic-looking AI-generated noise is a real concern for academic research (peer reviews in particular).
> The papers could sound more realistic than they do, but I intentionally prompted the model to write papers that _look_ real but _sound_ silly.

Although, I'm sure you could convince some people that [Anne Hathaway films are responsible for the number of votes for Republican senators](https://tylervigen.com/spurious/correlation/5866_the-number-of-movies-anne-hathaway-appeared-in_correlates-with_republican-votes-for-senators-in-tennessee)...