Datasets and Sources

Which kinds of texts should I include in my dataset? Obviously this raises some ethical considerations–besides my own personal information, much of the text I’ve written and saved will include others’ data and, in some cases, intellectual property, where I’m responding to or quoting heavily from another text.

My impulse currently is not to draw a strict line on sensitive information except in extreme cases, at least where my own data is concerned. This model is going to be personal–that’s the whole point–so the best way to protect my information from widespread visibility and potential reuse is not to sanitize the data but to restrict access to the model. If I store my model privately and carefully consider which of its outputs are appropriate for dissemination, I can make sure that any sensitive data in its training is not exposed beyond my comfort levels or in any way that puts others’ personal information at risk.

Still, there are a few boundaries I need to define around the kinds of input I want to use:

  • Document Type. Any primarily textual document, written in my own words, is fair game here. Because I am a student and writer, many of the longer texts will be fiction, poetry, and essays. However, I would like to include informal writing as well–journal entries, emails (only my messages, scrubbed of others’ identifying details), text messages, etc. Some of this material is not yet digitized, and how much of it I can include will depend on the time I have to create the dataset; transcribing (or learning to automate the transcription of) more than a few select notebooks, journals, or handwritten papers will be incredibly time-consuming.
  • Timespan. I would like to very intentionally include texts dating back as far into my childhood as I can. Because of my age, I will have very few digitally authored texts prior to the 2010s, and most files older than five years will require me to sort through shared folders on family computers in my parents’ home. However, I have at least a sampling of my creative writing, journaling, and schoolwork in handwritten, physical form safely stored back in the States, and I should be able to have them sent to me in time to incorporate into this project. While this will make my dataset much less consistent, I do think that all of these texts are important aspects of my data identity. In the digital age, information does not die or fade as predictably as when it was tied to physical storage–a web search can so easily pull up posts from our teenage years that no longer represent our voices or our views and present them as if they were brand-new and untarnished. So including every version of myself, past and present, in this data doppelganger might yield interesting discussion on the way digital identity is shaped by this constant resurfacing and rebirthing of older content.
  • Representativeness. I am inclined to simply throw all of the personal textual data I can find into my model, without worrying too much about balancing across formal/informal writing or writing from different time periods. This will mean that the resulting model is representative not of all the content I produce, but of all the content I have produced and saved in ways that let me access it later. The questions I can pose to the research, then, will be less about “how do I present myself in writing?” and more directly and accurately about “how can I retrieve my ‘self’ from my writing, and how does that ‘self’ align with or differ from my personal identity?”
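
As a rough sketch of the scrubbing step mentioned under Document Type, something like the following could strip the most obvious identifying details from email text before it enters the dataset. The function name `scrub`, the two regular expressions, and the placeholder tokens are all assumptions made for illustration. They only catch email addresses and phone-like numbers; names and other identifying details would need separate handling.

```python
import re

# Illustrative patterns only (assumptions, not a complete PII filter).
# The phone pattern is deliberately loose and may also match other
# long digit sequences, so its output should be spot-checked.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scrub(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


if __name__ == "__main__":
    sample = "Write to jane.doe@example.com or call 555-123-4567."
    print(scrub(sample))
```

Placeholder tokens, rather than outright deletion, keep the sentence structure intact, which may matter if the scrubbed text is later used to train a language model.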
