Textual Analysis using WordNet and NLTK for Authorship Identification
Jennifer Haliewicz / HLT Portfolio

Project Context: This project was developed as part of my work in LING 581: Computational Linguistics, in the M.S. in Human Language Technology program. It explored how stylometric techniques—quantitative measures of linguistic style—can be used to attribute authorship of nonfiction writing samples.

Project Summary: The primary goal was to evaluate whether textual metrics such as average word length, lexical diversity, and part-of-speech frequency could distinguish between authors such as David Foster Wallace, Joan Didion, and Zadie Smith. I applied a range of NLP tools (e.g., NLTK, spaCy) to analyze manually chunked essay data, using Python to clean, tokenize, and process the text, extract linguistic features, and visualize author-specific patterns.
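
As a rough illustration of the metrics named above, the sketch below computes average word length, lexical diversity (type-token ratio), and average sentence length for a single text chunk. It is not the project's actual code: to stay self-contained it uses simple regular expressions in place of NLTK or spaCy tokenizers, and the function names (`tokenize`, `split_sentences`, `stylometric_features`) are hypothetical.

```python
import re


def tokenize(text):
    """Return lowercase word tokens (assumption: a regex stand-in for a real tokenizer)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def split_sentences(text):
    """Naively split on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def stylometric_features(text):
    """Compute three basic stylometric measures for one chunk of text."""
    words = tokenize(text)
    sentences = split_sentences(text)
    if not words or not sentences:
        raise ValueError("text must contain at least one word")
    return {
        # Mean number of characters per word token.
        "avg_word_length": sum(len(w) for w in words) / len(words),
        # Type-token ratio: distinct words divided by total words.
        "lexical_diversity": len(set(words)) / len(words),
        # Mean number of word tokens per sentence.
        "avg_sentence_length": len(words) / len(sentences),
    }


sample = "The cat sat. The cat ran away!"
features = stylometric_features(sample)
print(features)
```

Type-token ratio tends to fall as a text gets longer, so it is most comparable across chunks of similar size.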

Technologies Used

Python (NLTK, spaCy, Matplotlib, seaborn)
Pandas & NumPy for data wrangling
Jupyter Notebooks for exploratory analysis and visualizations

Outcomes and Skills Applied

By leveraging skills developed across HLT courses like LING 508 and 578, I was able to:

Design and implement a custom text preprocessing pipeline
Apply stylometric principles to classify text authorship
Generate meaningful insights using data visualizations and statistical summaries
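
A preprocessing pipeline of the kind listed above might look something like the following minimal sketch. The specific cleaning steps and the fixed-size chunking are assumptions made for illustration (the essays in this project were chunked manually), and `clean_text` and `chunk_words` are hypothetical helper names.

```python
import re
import unicodedata


def clean_text(raw):
    """Normalize Unicode, straighten curly apostrophes, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    text = text.replace("\u2019", "'").replace("\u2018", "'")
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def chunk_words(text, chunk_size=500):
    """Split text into consecutive chunks of at most chunk_size words.

    Assumption: fixed-size chunks stand in for the manual chunking
    described in the project summary.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


raw = "It\u2019s   a  test.\n\nSecond   paragraph here."
cleaned = clean_text(raw)
chunks = chunk_words(cleaned, chunk_size=3)
print(chunks)
```

Keeping chunks at a similar word count makes per-chunk features easier to compare across authors.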

Key Takeaways

Stylometric markers like word and sentence length, POS tag frequency, and lexical diversity provide useful signals for distinguishing between authors.
Even basic text metrics can reveal stylistic fingerprints, especially when normalized.
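
To show why normalization matters, here is a small sketch that turns raw part-of-speech counts into proportions of all tokens, so chunks of different lengths can be compared directly. Reading "normalized" as "divided by token count" is an assumption, as is the helper name `pos_proportions`. The input is a list of `(word, tag)` pairs, the shape returned by `nltk.pos_tag`, but the tags here are hard-coded so the example runs without any NLTK models.

```python
from collections import Counter


def pos_proportions(tagged_tokens):
    """Map each POS tag to its share of all tokens in the chunk."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}


# Hand-tagged toy data (Penn Treebank style tags), not project data.
short_chunk = [("the", "DT"), ("cat", "NN"), ("sat", "VBD"), ("quietly", "RB")]
long_chunk = short_chunk * 10  # same style, ten times the length

print(pos_proportions(short_chunk))
print(pos_proportions(long_chunk))
```

The raw counts of the two chunks differ by a factor of ten, while the proportions are identical, which is the property that makes normalized features usable as a stylistic signal.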
