On 5 May, I attended Lucia Michielin’s workshop on text analysis and humanities, hosted by the University of Edinburgh, which introduced us to the technical promises and epistemic stakes of applying NLP tools to historical corpora. Designed as a practical session for beginners, the workshop guided us through cleaning, tokenising, and visualising text data using Python. However, what struck me most was not the toolkit itself, but the unspoken assumptions underlying our choices: what kinds of texts we process, and why.
The dataset used in the workshop was the Statistical Account of Scotland, a large textual archive commissioned by the government in the 18th and 19th centuries. At first glance, it seemed ideal: rich, structured, and already digitised. Yet, as we explored word frequencies and created comparative word clouds across different regions, I found myself thinking not only about what was present in the data, but also about what might be absent. Who authored these records? Who was recorded, and who was left out? Which worldviews are rendered statistically visible, and which ways of life are fleeting or erased?
This issue of visibility resonates deeply with my current KIPP research. My dissertation critiques algorithmic bias from feminist and posthumanist perspectives, particularly examining how certain epistemologies gain privilege within data-driven systems. The technical skills I practised during the workshop, from stopword removal to KWIC (keyword-in-context) analysis, will be directly applicable to my project, especially as I intend to audit existing NLP pipelines through a feminist lens. For example, when conducting contextual bias analysis on historical gender discourses, the ability to trace co-occurring terms across time periods and geographic locations can help illuminate patterns of exclusion or linguistic disciplining.
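To make the KWIC idea concrete, here is a minimal sketch in plain Python. It is not the workshop's own code: the function names `tokenize` and `kwic`, the window size, and the sample sentence are all assumptions made for illustration, and the sample text is invented rather than taken from the Statistical Account.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for every hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

# Hypothetical sample sentence, written for this example only.
sample = ("The women of the parish spin flax. "
          "The men fish, and the women mend the nets.")
tokens = tokenize(sample)

for left, kw, right in kwic(tokens, "women", window=3):
    print(f"{left:>20} [{kw}] {right}")
```

Stopwords are deliberately left in here: for a concordance view, small words such as “of” and “the” are part of the context that shows how a term is framed.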
Yet beyond the tools themselves, the workshop prompted deeper critical reflection on the politics of digital preservation. Before attending, I had always felt a vague ambivalence towards digital humanities as an emerging field of scholarship. On the one hand, I recognised its potential as a technical gateway into posthuman AI ethics. On the other, I could not ignore a lingering suspicion: might this emerging field also serve as a new form of knowledge colonialism?
Though I was impressed by the practical utility of digital humanities tools (imagine a scholar researching encyclopaedias or knowledge graphs, able to achieve in days what once took decades), I remained unsettled throughout the session. All digitisation projects are, by their nature, selective. They rely on funding, institutional endorsement, and infrastructural capacity to preserve archives over time. What becomes “digital data” is never neutral; the act of digitising knowledge is always political. Who is preserved in code? Whose voices are rendered machine-readable, and whose fade into silence and decay?
We often say things like “digitise literary works” or “convert historical data into analysable corpora.” These phrases sound neutral, but they mask a series of choices. I began to ask:
- Who decides which texts are worth digitising?
- Why are non-standard languages, oral traditions, or ambiguous narratives from certain historical contexts so often excluded and discarded?
- How do resources (time, computing power, human labour, funding) influence what gets preserved?
These questions are not peripheral to my work; they extend directly from the core concerns of my research: the ethics of inquiry and the construction of knowledge. In the workshop, we were taught how to train language models to detect emotional tones and implicit attitudes. But I kept returning to a more foundational concern: before any model can identify meaning, a system must already be in place to determine what is learnable, and what counts as knowledge worth preserving.
Looking at digital humanities projects across the Chinese-speaking world, such as those in Beijing or Taiwan, it is striking how the works prioritised for digitisation are those already deemed canonical: Jin Yong’s novels, Dunhuang studies, pre-Qin philosophy. These are texts that have historically survived under the epistemic regimes of elite, often patriarchal, knowledge. But what about marginal and anonymous writings by women in ancient China? Are they included? What about lower-tier knowledge: ancient Chinese cosmetic atlases, women’s clothing diagrams, or the nüshu (women’s script)? These are often dismissed as trivial, unserious, or too fragmentary to be preserved. And so they are left to decay.
The structure of digital preservation deepens this divide: some forms of knowledge are rendered eternal, inscribed permanently in code and resistant to entropy, while others remain vulnerable to loss and oblivion. This reminds me of Gilbert Simondon’s (2017) philosophy of technology, in which he writes: “The machine is something which fights against the death of the universe; it slows down, as life does, the degradation of energy, and becomes a stabilizer of the world.” I read this as implying that the privilege of mastering machine principles, and thereby of countering entropy, belongs to certain advantaged groups who possess the means to use machines in this way. Digital humanities, unless critically interrogated, risk reinforcing this asymmetry. While I believe digital humanities hold immense potential, uncritical digital humanities are insufficient: not only unworthy of celebration, but unworthy of study.
Too often, digital tools are presented as neutral facilitators, “enhancing access” or “expanding understanding.” But this framing conceals their function as filters. We are not simply translating data into a processable format; we are enacting assumptions about what should be kept, standardised, or erased. The future discipline I envision is a critical digital humanities, one that is more reflective and, paradoxically, more humanistic.
From this perspective, text analysis becomes something other than the interpretation of existing data. It becomes a method of examining how data comes to be: how knowledge is made visible, legible, and governable. As I continue developing my KIPP proposal, I hope to retain this dual awareness, combining computational skill with persistent scrutiny of the colonial, gendered, and infrastructural legacies embedded within the data and tools themselves.
Under this framework, I have begun to consider how my own research might appropriate computational tools, not to strengthen classification systems, but to disturb them. For instance:
- Using text analysis to highlight the ambiguous, fuzzy, and unclassifiable portions of a dataset, rather than the most frequent keywords;
- Creating a map of omission, tracing where data could not be digitised, and investigating the politics behind those absences;
- Employing co-occurrence analysis to reveal the entanglement of subjects and concepts usually considered unrelated.
These tools are not meant to make my artefact more efficient. They are meant to surface a hidden ethical question: not “What can we extract from data?”, but “Why is the data like this? Why is it structured in this way?”
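As a rough illustration of the first and third ideas in the list above, the sketch below surfaces words that occur only once (instead of the most frequent ones) and counts which words appear near each other. The function names `hapaxes` and `cooccurrences`, the window size, and the sample sentence are hypothetical choices made for this example, not methods taught in the workshop; treating once-only words as a proxy for the “unclassifiable” is itself an assumption that would need scrutiny.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def hapaxes(tokens):
    """Return tokens that occur exactly once, in order of appearance."""
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] == 1]

def cooccurrences(tokens, window=2):
    """Count unordered pairs of distinct tokens at most `window` apart."""
    pairs = Counter()
    for i, tok in enumerate(tokens):
        for other in tokens[i + 1:i + 1 + window]:
            if other != tok:
                pairs[tuple(sorted((tok, other)))] += 1
    return pairs

# Hypothetical sample sentence, written for this example only.
sample = ("Spinning and weaving occupy the women; "
          "fishing and farming occupy the men.")
tokens = tokenize(sample)

print("Occur only once:", hapaxes(tokens))
for pair, n in cooccurrences(tokens, window=2).most_common(3):
    print(pair, n)
```

Sorting each pair makes the count symmetric, so (“occupy”, “the”) and (“the”, “occupy”) are treated as the same relation. A map of omission cannot be sketched this way, since it depends on records of what was never digitised.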
Ultimately, I imagine a digital humanities practice oriented in the reverse direction: not replicating past taxonomies, but opening up their inconsistencies, the parts that resist logic, categorisation, or computational clarity. This approach may seem inefficient or unproductive by conventional standards. But it is precisely this inefficiency, instability, and non-linearity that holds the potential to create new epistemic space. After all, my goal is not to make data more transparent, but to create space for us to ask why we ever wanted it to be transparent in the first place.
This workshop affirmed for me that I am not against digital tools, but I reject their portrayal as apolitical infrastructure. They should not merely be means to an end, but should themselves be understood as knowledge practices, and therefore open to critique.
This critical awareness will be central to the design of my KIPP artefact, not as a functional feature, but as a disruptive node. It will mark the boundary where epistemic systems falter, and where alternative forms of inquiry may emerge.
As Lucia said during the workshop: “Coding is 90% learning how to Google, and how to deal with what you find on Google.” That may be true. But perhaps the deeper question is this: Whose knowledge is being Googled? And whose is not?
References
Simondon, G. (2017). On the Mode of Existence of Technical Objects.

