About the Data

Steph Buongiorno edited this page Aug 9, 2022 · 6 revisions

What is the Congressional Record?

The Congressional Record is the official record of the proceedings and debates of the United States Congress, covering 1873 to the present. It is published online daily when Congress is in session.

Which Data Does CDS Collect?

Congress produces several different versions of its data, which can be confusing. This section gives a high-level overview of the data collected by CDS.

CDS collects the plain text versions of the Daily Edition of the U.S. Congressional Record from "congress.gov." It can collect data from the day it is run back through the year 1995. CDS does not collect the Bound Edition, which is a PDF version of the record. Consequently, it does not collect data from before 1995, because that data is served exclusively in PDF form, whereas the Daily Edition is served both in PDF form and as plain text embedded in the HTML of a record’s web page.

There are a few minor differences between the Daily Edition and the Bound Edition. These differences concern pagination, name prefixing, and similar conventions. For a full description of the differences between the two versions, see Gov Info.

What Does it Mean that the Exported Data is "Analysis Ready"?

The Daily Editions on "congress.gov" are made up of large blocks of text embedded within the HTML of the site. No distinction is made between key sections: for example, the text making up a speech is not separated from the text making up a speaker's name. Therefore, if the data were scraped from "congress.gov" without additional post-processing, the user would simply be handed one large block of text.

CDS identifies key fields in the data and formats them to make the data more meaningful for analysis. The exported data is a .csv file with fields for url, date, title (of record), speaker, and text.
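
As a rough illustration of what "analysis ready" means in practice, the sketch below reads a CDS-style export with Python's standard csv module and counts rows per speaker. The column names come from the field list above, but their order, the date format, and every value in the sample are assumptions made up for this example; the helper names load_records and speeches_per_speaker are hypothetical and are not part of CDS.

```python
import csv
import io
from collections import Counter

# Hypothetical two-row sample standing in for a CDS export. The column names
# follow the field list above; the column order, date format, and all values
# are invented for illustration.
SAMPLE_CSV = """url,date,title,speaker,text
https://example.org/record/1,2022-08-09,EXAMPLE TITLE,Mr. DOE,"First remarks, with a comma."
https://example.org/record/1,2022-08-09,EXAMPLE TITLE,Ms. ROE,Second remarks.
"""


def load_records(handle):
    """Return the rows of a CDS-style export as a list of dicts keyed by field name."""
    return list(csv.DictReader(handle))


def speeches_per_speaker(records):
    """Count how many rows each speaker contributes."""
    return Counter(row["speaker"] for row in records)


# An in-memory file is used here so the sketch runs without any export on disk.
records = load_records(io.StringIO(SAMPLE_CSV))
counts = speeches_per_speaker(records)
print(counts.most_common())
```

For an actual export, the in-memory sample would be replaced by a file handle, for instance one opened with open(path, newline="", encoding="utf-8"), where path points at the exported .csv file. Because each row already separates the speaker from the text, this kind of grouping needs no further parsing of the raw record.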