Conceptualization of an authorship attribution pipeline for blog articles
Summary
The question of which author wrote a given text has been around ever since people began to put their ideas and thoughts into writing. Recent advances in statistics and the capabilities of powerful machine learning algorithms have given rise to a research field known as authorship attribution. This field is of great significance for many applications in the humanities, journalism and law. For instance, authorship attribution can be used to detect fraudulent product reviews on popular online platforms.
The company codecentric AG, as an innovator in agile software development, is constantly interested in state-of-the-art technologies and best practices. This bachelor thesis project is concerned with the conceptualization of an automated authorship attribution pipeline that contributes to the product portfolio of codecentric AG.
This project systematically analyzes the main components of machine-learning-based authorship attribution by conducting a literature review. Furthermore, it defines comparison criteria for assessing how well machine learning models identify the author of a text. Two attribution approaches and a set of different stylistic markers are compared empirically in experiments on a real-world blog article dataset.
Finally, the insights from the literature review and the experiments are integrated into a reusable authorship attribution library. This project specifies the requirements of the library from a functional and a non-functional perspective. Several design issues that affect the reusability and extensibility of the library are discussed and resolved using popular software engineering design patterns. A prototype of the library is implemented that unifies modern natural language processing technologies in the Python ecosystem.
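To make the idea of attributing a text from stylistic markers concrete, a minimal sketch in Python follows. It is not the library described above, and it does not reproduce either of the approaches compared in the thesis. The marker set (relative frequencies of a few English function words), the cosine similarity measure, and the names `FUNCTION_WORDS`, `stylistic_markers`, `cosine` and `attribute` are all assumptions made purely for illustration.

```python
from collections import Counter
import math
import re

# Hypothetical marker set: a handful of common English function words.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is", "it", "for")


def stylistic_markers(text):
    """Return the relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty input
    return [counts[word] / total for word in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def attribute(text, known_texts):
    """Pick the author whose concatenated known texts are most similar to `text`.

    `known_texts` maps an author name to a list of texts by that author.
    """
    query = stylistic_markers(text)
    profiles = {
        author: stylistic_markers(" ".join(texts))
        for author, texts in known_texts.items()
    }
    return max(profiles, key=lambda author: cosine(query, profiles[author]))
```

Calling `attribute(unknown_text, {"author_a": [...], "author_b": [...]})` returns the name of the closest author profile. A realistic pipeline would use a much richer set of markers and a trained classifier; this sketch only shows the overall shape of the task.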
Organisation | Fontys |
Programme | Software Engineering en Business Informatics |
Department | Fontys Techniek en Logistiek |
Partner | codecentric AG, Solingen |
Date | 2017-06-12 |
Type | Bachelor |
Language | English |