In 2009, we began to experiment with automated text classification tools to lower the cost of labeling thousands of documents by topic. Several software tools have been developed as part of this process (see below).
Generally speaking, we have found that supervised machine learning methods can reduce the number of cases to be manually labeled for topic (about 225 topics in all) by as much as 70-80%. However, results vary depending on the data and the size of the training sample. Thus, if many training examples are available and there are thousands of events to be labeled, automated approaches can be efficient and reliable.
The main cost reductions derive from ensemble learning and active learning. When multiple algorithms (an ensemble) make the same topic prediction, human coders can have high confidence that the event has been properly classified. Active learning refers to a human-centered process of identifying cases where the system is not performing as well, and intervening with additional training to reduce or eliminate similar mistakes in future rounds. Several publications listed below provide more information about the process and outcomes.
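As a rough illustration of the ensemble idea, the Python sketch below accepts a topic label automatically when the classifiers agree and sends the remaining cases to human coders. It is not the RTextTools implementation: the function name `triage_by_agreement`, the `min_agreement` threshold, and the example topic codes are all hypothetical and chosen only for demonstration.

```python
from collections import Counter


def triage_by_agreement(predictions, min_agreement=1.0):
    """Split documents into auto-accepted labels and cases for human review.

    predictions: dict mapping a document id to a list of topic codes,
        one code per classifier in the ensemble.
    min_agreement: share of classifiers that must back the most common
        label before it is accepted without review (1.0 = unanimous).
    """
    accepted = {}
    needs_review = []
    for doc_id, labels in predictions.items():
        # Most common label and the number of classifiers that chose it.
        top_label, votes = Counter(labels).most_common(1)[0]
        if votes / len(labels) >= min_agreement:
            accepted[doc_id] = top_label
        else:
            needs_review.append(doc_id)
    return accepted, needs_review


# Hypothetical output of three classifiers on four documents
# (the topic codes are made up for this example).
predictions = {
    "doc1": [3, 3, 3],
    "doc2": [3, 12, 3],
    "doc3": [20, 20, 20],
    "doc4": [1, 6, 15],
}

accepted, needs_review = triage_by_agreement(predictions)
print("Accepted without review:", accepted)
print("Sent to human coders:", needs_review)
```

In an active learning workflow, the cases sent to human coders would be labeled by hand and added to the training sample before the next round. Lowering `min_agreement` accepts more cases automatically, at the risk of more errors.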
Several software packages of general benefit have been developed, as described below:
RTextTools is a free, open source machine learning package for automatic text classification that makes it simple for both novice and advanced users to get started with supervised learning. The package includes nine algorithms for ensemble classification (svm, slda, boosting, bagging, random forests, glmnet, decision trees, neural networks, maximum entropy), comprehensive analytics, and thorough documentation. The package was developed by Timothy P. Jurka at UC Davis, Loren Collingwood at University of Washington, Amber E. Boydstun at UC Davis, Emiliano Grossman at Sciences Po Paris, and Wouter van Atteveldt at Vrije Universiteit Amsterdam.
The beta release was unveiled at the 4th Annual Conference of the Comparative Policy Agendas Project on June 24, 2011. The full release is available on the installation page.
The RTextTools repository is available via Google Code, and the help mailing list is on Google Groups.
RTextTools builds on software that Paul Wolfgang (Temple University) developed as part of the Pennsylvania Policy Project. The latest version can be found at: http://www.cis.temple.edu/~wolfgang/ .
User Manual and Introduction to Automated Classification Using Text Tools
Jonathan Moody (Penn State University) has prepared additional documentation, demonstration datasets, and template files to help researchers learn how to use the Text Tools environment. These provide step-by-step instructions for preparing datasets, operating Text Tools, and analyzing the results. Documentation: Text Tools Documentation and Templates (.rar)
Individual components of the Text Tools package can also be accessed separately:
- Using the Iterative Process to Code Virgin Text (.doc)
- Instructions for Performing Accuracy Testing (.doc)
- Demonstration Training Dataset (.mdb)
- Demonstration Coding Dataset (.mdb)
- Accuracy Testing Syntax (.bat)
- Accuracy Testing Analysis Do File (.do)
- Accuracy Testing Importing Do File (.do)
- Accuracy Testing Analysis Spreadsheet Template (.xls)
SLTK Auto-coding and Supervised Coding
Hans Then of Pythea has also developed a software tool for automated coding and supervised coding, based on the Text Tools package put together by Paul Wolfgang. Link: SLTK tool
Research in this area offers many tips for improving the accuracy of automated methods for a given sample size. We have conducted some experiments with our data and report them in the following papers:
Collingwood and Wilkerson, “Tradeoffs in Accuracy and Efficiency in Supervised Learning Methods” 2012. Journal of Information Technology and Politics, 9(3): 298-318.
Hillard, Purpura and Wilkerson, “Automated Text Classification for Mixed Methods Social Science Research” Journal of Information Technology and Politics, June 2008
Breeman, Then, Kleinnijenhuis, van Atteveldt, Timmermans, “Strategies for Improving Semi-automated Topic Classification of Media and Parliamentary Documents”
See also: Cardie and Wilkerson, Text Annotation for Political Science (Editor’s Introduction), Journal of Information Technology and Politics, August 2008