From image to text: experience of an Optical Character Recognition Intern

Since April, PhD student Ash Charlton has been an Intern with the University of Edinburgh’s Cultural Heritage Digitisation Service (CHDS) and the Centre for Data, Culture and Society (CDCS), looking into text extraction processes at the University through Optical Character Recognition (OCR).

My position as OCR intern entailed exploring text extraction in library practice and thinking about how it is applied and taught within digital scholarship. This work built nicely on a project I contributed to for the CDCS last year through another internship, creating a Training Pathway for Managing Digitised Documents, which included text extraction, and it was nice to return to working with the CDCS.

My work on the CDCS side has involved designing a workshop on text extraction, focusing on an introduction to text recognition and its history, important considerations, and how you can convert images of text to machine-readable text yourself. The workshop introduces what the process is and how text recognition plays into our everyday lives, not just how it can be applied in research: key aspects to understand before delving into text extraction with programming. It leads into a second workshop run by the CDCS that will focus on using programming approaches and the OCR engine Tesseract to recognise text and create a usable output.

I also designed an asynchronous lesson alongside the workshop, which covers the foundational principles of text extraction and includes several activities for engaging with digital text recognition and prompting creative thinking about processes and workflows. There are lots of variables to consider in text recognition, from the quality or format of the original materials to the type of software used, and ultimately what the text is being created for: for example, searchable PDFs to provide access to materials, or downloadable datasets for textual analysis. With so many aspects to consider, there is not always a straightforward one-size-fits-all solution, so being able to think critically about the process is crucial to getting the most out of your text recognition projects, and the workshop and resource encourage this.

The other part of my internship was working with the CHDS, examining their past and current uses of OCR software and what their solutions may be for the future in the Library and University Collections (L&UC). This gave me the opportunity to talk to staff across the Digitisation Service and wider library, as well as undertake independent research to produce a report with recommendations for OCR processes and considerations for the future.

My internship with CDCS and CHDS has been a fantastic opportunity to work directly with teams in the cultural heritage sector, who create the text, and on the digital scholarship teaching and learning side, looking at how we engage with the texts or carry out text extraction ourselves as researchers. Creating the training resources and thinking about how to present information clearly, concisely and accessibly is a valuable skill in many contexts, and it’s something I hope to carry forward into my PhD research and wider teaching at the University. These types of opportunities, working with both industry and academic partners, are useful in demonstrating a wide skill set and adaptability to different working environments, and have enhanced my overall University experience. As I work with machine-readable texts in my PhD research, this internship has tied in well with my understanding and application of these text recognition methods. I would like to thank the CDCS and CHDS for their support throughout the internship, and for a wonderful learning opportunity with two excellent teams at the University of Edinburgh.

More details on the workshop will follow via the CDCS events page as we move into the 2023/24 academic year.

Find out more about the University of Edinburgh’s Cultural Heritage Digitisation Service internship here.