On 5 May, I attended Lucia Michielin’s workshop on text analysis and humanities, hosted by the University of Edinburgh, which introduced us to the technical promises and epistemic stakes of applying NLP tools to historical corpora. Designed as a practical session for beginners, the workshop guided us through cleaning, tokenising, and visualising text data using Python. However, what struck me most was not the toolkit itself, but the unspoken assumptions underlying our choices: what kinds of texts we process, and why.

The dataset used in the workshop was the Statistical Account of Scotland, a large textual archive commissioned by the government in the 18th and 19th centuries. At first glance, it seemed ideal: rich, structured, and already digitised. Yet, as we explored word frequencies and created comparative word clouds across different regions, I found myself thinking not only about what was present in the data—but also what might be absent. Who authored these records? Who was recorded, and who was left out? Which worldviews are rendered statistically visible, and which ways of life are fleeting or erased?

This issue of visibility resonates deeply with my current KIPP research. My dissertation critiques algorithmic bias from feminist and posthumanist perspectives, examining in particular how certain epistemologies gain privilege within data-driven systems. The technical skills I practised during the workshop, from stopword removal to KWIC (keyword-in-context) analysis, will be directly applicable to my project, especially as I intend to audit existing NLP pipelines through a feminist lens. For example, when conducting contextual bias analysis on historical gender discourses, the ability to trace co-occurring terms across time periods and geographic locations can help illuminate patterns of exclusion or linguistic disciplining.
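
As a rough illustration of the two techniques named above, a minimal sketch in plain Python might look like the following. The sample sentence, the tiny stopword list, and the function names are my own illustrative assumptions; none of them come from the workshop materials or from the Statistical Account itself.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (real lists are much longer).
STOPWORDS = {"the", "of"}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop very common function words before counting frequencies."""
    return [t for t in tokens if t not in stopwords]

def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for every hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

# Made-up sample text, used only to exercise the functions.
sample = "The women of the parish spin flax. The men of the parish fish."
tokens = tokenize(sample)

# Frequencies are counted on the filtered tokens ...
frequencies = Counter(remove_stopwords(tokens))

# ... but KWIC runs on the unfiltered tokens so the context stays readable.
concordance = kwic(tokens, "parish", window=2)
for left, kw, right in concordance:
    print(f"{left:>12} | {kw} | {right}")
```

Even in a toy like this, the stopword list is a choice about which words count as noise, which is exactly the kind of assumption an audit would need to surface.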

Yet beyond the tools themselves, the workshop prompted deeper critical reflection on the politics of digital preservation. Before attending the Text Analysis and Humanities workshop, I had always felt a vague ambivalence towards digital humanities as an emerging field of scholarship. On the one hand, I recognised its potential as a technical gateway into posthuman AI ethics. On the other, I could not ignore a lingering suspicion: might this emerging field also serve as a new form of knowledge colonialism?

Though I was impressed by the practical utility of digital humanities tools (imagine a scholar researching encyclopaedias or knowledge graphs, able to achieve in days what once took decades), I remained deeply unsettled throughout the session. All digitisation projects are, by their nature, selective. They rely on funding, institutional endorsement, and infrastructural capacity to preserve archives over time. What becomes “digital data” is never neutral; the act of digitising knowledge is always political. Who is preserved in code? Whose voices are rendered machine-readable, and whose fade into silence and decay?

We often say things like “digitise literary works” or “convert historical data into analysable corpora.” These phrases sound neutral, but they mask a series of choices. I began to ask:

  • Who decides which texts are worth digitising?
  • Why are non-standard languages, oral traditions, or ambiguous narratives from certain historical contexts so often excluded and discarded?
  • How do resources—time, computing power, human labour, funding—influence what gets preserved?

These questions are not peripheral to my work; they extend directly from the core concerns of my research: the ethics of inquiry and the construction of knowledge. In the workshop, we were taught how to train language models to detect emotional tones and implicit attitudes. But I kept returning to a more foundational concern: before any model can identify meaning, a system must already be in place to determine what is learnable, and what counts as knowledge worth preserving.

Looking at digital humanities projects across the Chinese-speaking world, such as those in Beijing or Taiwan, it is striking that the works prioritised for digitisation are those already deemed canonical: Jin Yong’s novels, Dunhuang studies, pre-Qin philosophy. These are texts that have historically survived under the epistemic regimes of elite, often patriarchal, knowledge. But what about marginal and anonymous writings by women in ancient China? Are they included? What about so-called lower-tier knowledge: ancient Chinese cosmetic atlases, women’s clothing diagrams, or nüshu (women’s script)? These are often dismissed as trivial, unserious, or too fragmentary to be preserved. And so they are left to decay.

The structure of digital preservation deepens this divide: some forms of knowledge are rendered eternal, inscribed permanently in code and resistant to entropy, while others remain vulnerable to loss and oblivion. This reminds me of a passage in Gilbert Simondon’s (2017) philosophy of technology: “The machine is something which fights against the death of the universe; it slows down, as life does, the degradation of energy, and becomes a stabilizer of the world.” Read in this context, the passage suggests that the privilege of mastering machine principles, and thereby achieving anti-entropy, belongs to certain advantaged groups who possess the ability to use machines to counter entropy. Digital humanities, unless critically interrogated, risk reinforcing this asymmetry. While I believe digital humanities hold immense potential, uncritical digital humanities are insufficient: not only unworthy of celebration, but unworthy of study.

Too often, digital tools are presented as neutral facilitators—“enhancing access” or “expanding understanding.” But this framing conceals their function as filters. We are not simply translating data into a processable format; we are enacting assumptions about what should be kept, standardised, or erased. The future discipline I envision is a critical digital humanities—one that is more reflective, and paradoxically, more humanistic.

From this perspective, text analysis becomes something other than the interpretation of existing data. It becomes a method of examining how data comes to be: how knowledge is made visible, legible, and governable. As I continue developing my KIPP proposal, I hope to retain this dual awareness, combining computational skill with persistent scrutiny of the colonial, gendered, and infrastructural legacies embedded within those very tools.

Under this framework, I have begun to consider how my own research might appropriate computational tools—not to strengthen classification systems, but to disturb them. For instance:

  • Using text analysis to highlight the ambiguous, fuzzy, and unclassifiable portions of a dataset, rather than the most frequent keywords;
  • Creating a map of omission, tracing where data could not be digitised, and investigating the politics behind those absences;
  • Employing co-occurrence analysis to reveal the entanglement of subjects and concepts usually considered unrelated.
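
To make the first and third ideas a little more concrete, here is a minimal sketch, assuming sentence-level co-occurrence and treating words that appear only once as a crude stand-in for the "unclassifiable". The sample text and all function names are hypothetical, and a real project would need far more careful tokenisation and a much richer notion of ambiguity.

```python
import re
from collections import Counter
from itertools import combinations

def split_sentences(text):
    """Naive sentence splitter on ., ! and ? (an assumption, not a standard)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def cooccurrences(text, stopwords=frozenset()):
    """Count how often each unordered word pair shares a sentence."""
    pairs = Counter()
    for sentence in split_sentences(text):
        vocab = sorted(set(tokenize(sentence)) - stopwords)
        pairs.update(combinations(vocab, 2))
    return pairs

def rare_words(text, stopwords=frozenset()):
    """Return words that occur exactly once, instead of the most frequent."""
    counts = Counter(t for t in tokenize(text) if t not in stopwords)
    return sorted(w for w, c in counts.items() if c == 1)

# Made-up sample text, used only to exercise the functions.
sample = "Women spin flax. Women spin wool. Men fish."
pairs = cooccurrences(sample)
rare = rare_words(sample)

print(pairs.most_common(1))
print(rare)
```

Sorting each sentence's vocabulary before pairing means every pair is stored in a single alphabetical order, so ("spin", "women") and ("women", "spin") are not counted separately.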

These tools are not meant to make my artefact more efficient. They are meant to surface a hidden ethical question—not “What can we extract from data?”, but “Why is the data like this? Why is it structured in this way?”

Ultimately, I imagine a digital humanities practice that works in the reverse direction: not replicating past taxonomies, but opening up their inconsistencies, the parts that resist logic, categorisation, or computational clarity. This approach may seem inefficient or unproductive by conventional standards. But it is precisely this inefficiency, instability, and non-linearity that holds the potential to create new epistemic space. After all, my goal is not to make data more transparent, but to create space for us to ask why we ever wanted it to be transparent in the first place.

This workshop affirmed for me that I am not against digital tools—but I reject their portrayal as apolitical infrastructure. They should not merely be means to an end, but should themselves be understood as knowledge practices, and therefore open to critique.

This critical awareness will be central to the design of my KIPP artefact—not as a functional feature, but as a disruptive node. It will mark the boundary where epistemic systems falter, and where alternative forms of inquiry may emerge.

As Lucia said during the workshop: “Coding is 90% learning how to Google, and how to deal with what you find on Google.” That may be true. But perhaps the deeper question is this: Whose knowledge is being Googled? And whose is not?

References

Simondon, G. (2017). On the Mode of Existence of Technical Objects.