Final project for the Stanford class CS224U: Natural Language Understanding.
We analyzed thematic and stylistic trends in a corpus of 355 popular and critically acclaimed 20th-century English-language novels. First, we applied and adapted the "semantic cohort method", a vector space model, originally proposed by the Stanford Literary Lab, that is meant to surface thematically similar words in a corpus. Next, we studied trends in (1) the occurrence of words in these cohorts and (2) stylistic traits of novels. Our goal was to demonstrate that quantitative analysis is a useful tool both for enriching existing literary scholarship and for surfacing new patterns in literature.
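
As a rough illustration of the vector space idea behind word cohorts, the sketch below represents each word by its relative frequency in each document and ranks other words by cosine similarity to a seed word. It is not the Stanford Literary Lab's method or this project's code. The function names (`word_doc_vectors`, `cosine`, `cohort`), the whitespace tokenization, and the choice of cosine similarity are all assumptions made for illustration.

```python
import math
from collections import Counter


def word_doc_vectors(docs):
    """Map each word to a vector of relative frequencies, one entry per document."""
    # Assumption: naive lowercase + whitespace tokenization, for illustration only.
    counts = [Counter(doc.lower().split()) for doc in docs]
    totals = [sum(c.values()) for c in counts]
    vocab = set().union(*counts)
    return {
        word: [c[word] / t if t else 0.0 for c, t in zip(counts, totals)]
        for word in vocab
    }


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cohort(seed, vectors, k=3):
    """Return the k words whose frequency vectors are most similar to the seed's."""
    target = vectors[seed]
    scored = [
        (cosine(target, vec), word)
        for word, vec in vectors.items()
        if word != seed
    ]
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in scored[:k]]
```

Given a list of document strings `docs`, calling `cohort("war", word_doc_vectors(docs))` would return the three words whose distribution across the documents most resembles that of "war". A real analysis would need proper tokenization, stop-word handling, and a corpus far larger than a toy example.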