Project Context: This project was developed as part of my work in LING 581: Computational Linguistics, in the M.S. in Human Language Technology program. It explored how stylometric techniques—quantitative measures of linguistic style—can be used to attribute authorship of nonfiction writing samples.

Project Summary: The primary goal was to evaluate whether textual metrics such as average word length, lexical diversity, and part-of-speech frequency could distinguish between authors like David Foster Wallace, Joan Didion, and Zadie Smith. I applied NLP tools such as NLTK and spaCy to analyze manually chunked essay data, using Python to clean and tokenize the text, extract linguistic features, and visualize author-specific patterns.
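
To make the feature set concrete, here is a minimal sketch of the kind of per-chunk extraction described above, assuming spaCy's small English model (en_core_web_sm); the function and feature names are illustrative, not the project's actual code.

    import spacy
    from collections import Counter

    nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

    def extract_features(chunk_text):
        """Compute simple stylometric features for one essay chunk."""
        doc = nlp(chunk_text)
        words = [t for t in doc if t.is_alpha]
        n = len(words)
        pos_counts = Counter(t.pos_ for t in words)
        return {
            "avg_word_length": sum(len(t.text) for t in words) / n,
            "lexical_diversity": len({t.text.lower() for t in words}) / n,
            "noun_freq": pos_counts["NOUN"] / n,  # relative POS frequencies
            "adj_freq": pos_counts["ADJ"] / n,
        }

    print(extract_features("The essay opens with one long, looping sentence."))

Features like these can be collected into a pandas DataFrame, one row per chunk, which is what the data-wrangling and visualization steps below operate on.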

Technologies Used

  • Python (NLTK, spaCy, Matplotlib, seaborn)
  • Pandas & NumPy for data wrangling
  • Jupyter Notebooks for exploratory analysis and visualizations

Outcomes and Skills Applied

By leveraging skills developed across HLT courses like LING 508 and 578, I was able to:

  • Design and implement a custom text preprocessing pipeline (sketched after this list)
  • Apply stylometric principles to classify text authorship
  • Generate meaningful insights using data visualizations and statistical summaries
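
As a rough illustration of that pipeline, the sketch below cleans, lowercases, and tokenizes an essay with NLTK and splits it into equal-sized word chunks; the chunk size and helper name are assumptions, not the project's exact implementation.

    import re
    import nltk

    for pkg in ("punkt", "punkt_tab"):  # tokenizer data; punkt_tab is needed on newer NLTK
        nltk.download(pkg, quiet=True)

    def preprocess(raw_text, chunk_size=1000):
        """Clean and tokenize an essay, then split it into fixed-size word chunks."""
        text = re.sub(r"\s+", " ", raw_text).strip()    # collapse stray whitespace
        tokens = nltk.word_tokenize(text.lower())       # lowercased word tokens
        words = [t for t in tokens if t.isalpha()]      # drop punctuation and numbers
        # equal-sized chunks keep per-chunk metrics comparable across authors
        return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]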

Key Takeaways

  • Stylometric markers like word and sentence length, POS tag frequency, and lexical diversity provide useful signals for distinguishing between authors.
  • Even basic text metrics can reveal stylistic fingerprints, especially when normalized for text length.
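
As a simple illustration of that normalization point: raw counts grow with how much text an author contributes, so frequencies are easier to compare when expressed per 1,000 tokens. The numbers below are hypothetical.

    def per_thousand(count, total_tokens):
        """Normalize a raw count to a rate per 1,000 tokens."""
        return 1000 * count / total_tokens

    # hypothetical example: 312 nouns in a 2,450-token chunk
    print(round(per_thousand(312, 2450), 1))  # 127.3

Lexical diversity measures such as the type-token ratio are similarly sensitive to text length, which is one reason comparing chunks of similar size is preferable to comparing whole essays.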