
Topic Modeling of Palantir Patents and List of Palantir Contracts

Acker, Amelia
Genre
Dataset
Date
2021-08-18
Department
Department of Media Studies and Production
DOI
http://dx.doi.org/10.34944/dspace/6787
Abstract
Palantir is one of the most secretive and understudied surveillance firms in the US. The company supplies information technology (IT) solutions to governments, nonprofits, and corporations, focusing on data integration and surveillance services. We begin by sketching Palantir’s company history and contract network, followed by an explanation of key terms associated with Palantir’s area of technology specialization and a description of the firm’s platform ecosystem. We then provide a summary of current scholarship on Palantir’s continuing role in policing, intelligence, and security operations. Our primary contribution is a computational topic modeling analysis of a purposive sample (n=155) of Palantir’s surveillance patents, including their topics and themes. This approach follows recent literature that uses patents as primary data for researching the surveillance imaginaries and capabilities of IT firms. We end by discussing the concept of infrastructuring to understand Palantir as a surveillance platform, where information standards like administrative metadata are theorized as phenomena for structuring entities in and through access to digital information.
Description
For this study, we scraped all of Palantir’s patents that contained the word “ontology” (as of 08/25/20) from Google Patents. This produced a purposive sample (n=155) of Palantir patents, consisting of 5197 pages, over 2.5 million words, and over 18.5 million characters. We then prepared the data set for processing by stripping all the metadata and special features, converting formats, compressing, and collating the patents together. We imported several Python libraries used for data processing (Pandas, Matplotlib, NumPy, and Seaborn) and used Google Colaboratory to assemble the patent data, which was then loaded in a textual paragraph format. Preprocessing was then carried out, including removal of punctuation, null values, and stop words, as well as lemmatization, lowercase conversion, and tokenization, which resulted in a preprocessed data set. Part-of-speech (POS) tagging was performed, and the tokens were tagged according to their corresponding POS based on context and definition (this produced the most frequent nouns, verbs, etc.). Next, named entity recognition was performed to locate and classify entities in the text into predefined categories such as persons, organizations, locations, times, quantities, monetary values, percentages, etc. Topic modeling was performed using a bag-of-words model and Latent Dirichlet Allocation. We also downloaded a list of US Government contracts with Palantir from the Federal Procurement Data System (as of 08/04/21).
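The preprocessing and topic modeling steps described above can be sketched in a few lines of standard-library Python. This is a minimal illustration and not the authors' code: the sample sentences, the short stop word list, the function names (`preprocess`, `bag_of_words`, `lda_gibbs`), and the hyperparameters are all assumptions made for the example, and the Latent Dirichlet Allocation step is fitted here with a simple collapsed Gibbs sampler rather than whichever library the study used. Lemmatization, POS tagging, and named entity recognition are omitted because they require an NLP library.

```python
import random
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption; real pipelines use a fuller list).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "from", "with"}


def preprocess(text):
    """Lowercase, replace non-letter characters with spaces, tokenize, drop stop words."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def bag_of_words(docs):
    """Build a vocabulary and one Counter of word-id counts per document."""
    vocab = {w: i for i, w in enumerate(sorted({t for d in docs for t in d}))}
    return vocab, [Counter(vocab[t] for t in d) for d in docs]


def lda_gibbs(docs, vocab, n_topics=2, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return topic-word and doc-topic counts."""
    rng = random.Random(seed)
    n_words = len(vocab)
    ids = [[vocab[t] for t in d] for d in docs]
    # Start from a random topic assignment for every token.
    z = [[rng.randrange(n_topics) for _ in d] for d in ids]
    doc_topic = [[0] * n_topics for _ in ids]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    for d, doc in enumerate(ids):
        for n, w in enumerate(doc):
            k = z[d][n]
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(ids):
            for n, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][n]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + n_words * beta)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][n] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word, doc_topic


# Hypothetical sample sentences, not taken from the patent data set.
raw_docs = [
    "The ontology maps data objects to a database schema.",
    "A database query returns objects from the ontology.",
    "The user interface displays a map and a graph.",
    "Graph and map views update in the user interface.",
]
docs = [preprocess(text) for text in raw_docs]
vocab, bows = bag_of_words(docs)
topic_word, doc_topic = lda_gibbs(docs, vocab)

# Print the three highest-count words per topic.
words = sorted(vocab, key=vocab.get)
for k, counts in enumerate(topic_word):
    top = sorted(range(len(words)), key=lambda i: -counts[i])[:3]
    print(k, [words[i] for i in top])
```

On a corpus the size of the one described here (over 2.5 million words), a pure-Python sampler like this would be slow; an optimized library implementation would be the practical choice, and the sketch is only meant to make the bag-of-words and LDA steps concrete.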
Citation
Iliadis, A., & Acker, A. (2021). Topic Modeling of Palantir Patents and List of Palantir Contracts. Temple University.
ADA compliance
For Americans with Disabilities Act (ADA) accommodation, including help with reading this content, please contact scholarshare@temple.edu