Textual Analysis using WordNet and NLTK for Authorship Identification
Project Context: This project was developed as part of my work in LING 581: Computational Linguistics, in the M.S. in Human Language Technology program. It explored how stylometric techniques—quantitative measures of linguistic style—can be used to attribute authorship of nonfiction writing samples.
Project Summary: The primary goal was to evaluate whether textual metrics such as average word length, lexical diversity, and part-of-speech frequency could distinguish between authors like David Foster Wallace, Joan Didion, and Zadie Smith. I applied a range of NLP tools (e.g., NLTK, spaCy) to analyze manually chunked essay data. I used Python to clean, tokenize, and process the text, extract linguistic features, and visualize author-specific patterns.
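The metrics above can be illustrated with a small sketch. This is not the project's actual code (the function name and tokenization are my own simplifications, using plain regular expressions in place of the NLTK/spaCy tokenizers), but it shows what "average word length," "average sentence length," and "lexical diversity" mean in practice for a chunk of text:

```python
import re
from statistics import mean

def stylometric_features(text: str) -> dict:
    """Toy per-chunk style metrics: average word length, average
    sentence length (words per sentence), and lexical diversity
    (unique words / total words). A simplified stand-in for the
    NLTK-based extraction used in the project."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "avg_word_len": mean(len(w) for w in words),
        "avg_sent_len": len(words) / len(sentences),
        "lexical_diversity": len({w.lower() for w in words}) / len(words),
    }

feats = stylometric_features("The dog barked. The dog ran away quickly.")
# Two 4-word sentences; 6 unique lowercase types out of 8 tokens.
```

In the project itself, features like these were computed per chunk and then compared across authors.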
Technologies Used
- Python (NLTK, spaCy, Matplotlib, seaborn)
- Pandas & NumPy for data wrangling
- Jupyter Notebooks for exploratory analysis and visualizations
Outcomes and Skills Applied
By applying skills developed across HLT courses such as LING 508 and LING 578, I was able to:
- Design and implement a custom text preprocessing pipeline
- Apply stylometric principles to classify text authorship
- Generate meaningful insights using data visualizations and statistical summaries
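A minimal sketch of the kind of preprocessing pipeline described above: clean the raw text, tokenize it, and split it into fixed-size chunks (the function name, regex tokenizer, and 500-token chunk size are illustrative assumptions, not the project's actual choices):

```python
import re

def preprocess(raw: str, chunk_size: int = 500) -> list:
    """Illustrative pipeline: clean -> tokenize -> chunk.
    Mirrors the chunked-essay setup described above; chunk_size
    is a hypothetical default, not the project's value."""
    cleaned = re.sub(r"\s+", " ", raw).strip()           # collapse whitespace
    tokens = re.findall(r"[A-Za-z']+", cleaned.lower())  # lowercase word tokens
    # Fixed-size chunks make per-chunk metrics comparable across essays.
    return [tokens[i:i + chunk_size]
            for i in range(0, len(tokens), chunk_size)]
```

Chunking each essay into equal-size pieces gives every author multiple comparable samples instead of one long, length-confounded document.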
Key Takeaways
- Stylometric markers like word and sentence length, POS tag frequency, and lexical diversity provide useful signals for distinguishing between authors.
- Even basic text metrics can reveal stylistic fingerprints, especially when normalized for text length.
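The normalization point matters because a raw type-token ratio falls as a text gets longer, so it is only comparable across equal-length samples. A hedged sketch of one common workaround (averaging the ratio over fixed-size windows; the names and window size here are my own, not the project's):

```python
def ttr(tokens: list) -> float:
    """Type-token ratio: unique types / total tokens.
    Only comparable between samples of the same length."""
    return len(set(tokens)) / len(tokens)

def mean_windowed_ttr(tokens: list, window: int = 100) -> float:
    """Average TTR over non-overlapping fixed-size windows, so the
    score no longer depends on overall text length."""
    windows = [tokens[i:i + window]
               for i in range(0, len(tokens) - window + 1, window)]
    return sum(ttr(w) for w in windows) / len(windows)
```

Comparing authors on windowed (or otherwise length-normalized) scores avoids mistaking a longer essay for a less diverse one.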