Name: Jo
Year of study & degree: Year 3 Artificial Intelligence and Computer Science
Internship: Log Analysis Intern
Team within ISG: Geoservices
Hi there! I’m Jo, and I’m working as EDINA’s Log Analysis Intern this summer. This is my first foray into work as a professional data scientist and I have already learned a lot I would like to share with you here.
Digimap is an online service allowing users to browse a huge variety of maps and download geographic data, useful for purposes such as academic research. The service produces gigabytes a day of logs, recording server requests and users’ interaction with the website. These come in many different forms, including Apache web logs (recording API requests to external services to provide data to Digimap, and stored in massive JSON files), system logs (recorded as the website’s functions are used, and stored in CSV files) and summary statistics (calculated at regular time intervals, and also stored in CSV files). These are all anonymised to remove personal information. My role is to work with this log data to analyse how Digimap is used, presenting my findings in a written report answering different research questions.
The volume of data available and the open-ended nature of the study mean I have significant freedom in how I approach this work (and more possible directions than I have time for in 12 weeks!). A key lesson I’ve learned thus far is the importance of substance over style. It initially seemed appealing to dive into producing complex visualisations across many attributes, which I thought I could then use directly in my report. However, these lacked real direction and did not convey meaningful trends, because I did not yet know the data well enough to judge what would be useful to focus on. On the advice of my manager, I switched to beginning with some simple exploratory plots, such as bar charts, of basic usage metrics. Doing this immediately revealed some key figures and trends, which can now be built upon for purposeful, more advanced analysis.
This role is my first time working with such large volumes of data, and so I’ve needed to learn new techniques to do this efficiently. The most significant new tool has been Python’s pandas library, which offers support for working with CSV files as dataframes (tables). Operations can be performed on single cells or, more efficiently, across entire columns, allowing useful functions (for instance, shortening a timestamp down to its calendar date) to be applied across all the data at once. I can then export the data I need from a processed dataframe as a new CSV, to be plotted in R with ggplot.
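As a rough illustration of the column-wise approach described above, here is a minimal sketch. The column names (timestamp, user_id, action) and the sample rows are made up for the example and are not Digimap's real log format.

```python
import io

import pandas as pd

# Hypothetical sample standing in for a system-log CSV;
# the real logs will have different columns and far more rows.
raw = io.StringIO(
    "timestamp,user_id,action\n"
    "2023-06-01 09:15:02,u1,view_map\n"
    "2023-06-01 17:40:55,u2,download\n"
    "2023-06-02 08:01:10,u1,view_map\n"
)

# Parse the timestamp column into datetimes while loading.
logs = pd.read_csv(raw, parse_dates=["timestamp"])

# One column-wide operation shortens every timestamp to its calendar date,
# with no explicit loop over rows.
logs["date"] = logs["timestamp"].dt.date

# Count events per day: a simple usage metric suited to a bar chart.
daily = logs.groupby("date").size().reset_index(name="requests")

# Export the processed table as CSV text; passing a file path instead
# would write a file that R can read for plotting.
csv_text = daily.to_csv(index=False)
print(csv_text)
```

Working on the whole column at once is usually much faster than looping over rows in Python, which matters more as the files grow.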
To save loading time and memory, it is possible to specify which columns are loaded when reading a CSV file. However, larger CSVs can still take a substantial amount of time to load and so, for repeated loading, it can be quicker to use pandas functions to convert dataframes to binary files (which load quickly but take significant storage space) or Parquet files (an alternative space- and time-efficient data format). It also saves space to store large log files only once, in an organised file location, and access them whenever they are needed, rather than keeping duplicate copies. I’ve recently started using a GPU workstation, a very powerful remotely accessed computer, to help process large Apache logs, and may in future investigate multithreading, a programming technique where different parts of the data are operated on at the same time, to further increase processing speed.
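The loading and caching ideas above might look something like the sketch below. It assumes the "binary files" are pandas pickles, which is one common option but only a guess at the format meant here. The column names and sample rows are again invented, and the Parquet step only runs if an optional engine such as pyarrow is installed.

```python
import io
import os
import tempfile

import pandas as pd

# Hypothetical sample log; real files would be read from disk.
raw = io.StringIO(
    "timestamp,user_id,action,bytes_sent\n"
    "2023-06-01 09:15:02,u1,view_map,512\n"
    "2023-06-02 08:01:10,u2,download,20480\n"
)

# Load only the columns needed for the analysis.
subset = pd.read_csv(raw, usecols=["timestamp", "action"])

with tempfile.TemporaryDirectory() as tmp:
    # Cache as a pickle (assumed binary format): quick to reload.
    pickle_path = os.path.join(tmp, "logs.pkl")
    subset.to_pickle(pickle_path)
    from_pickle = pd.read_pickle(pickle_path)

    # Cache as Parquet: a compact columnar format.
    # pandas needs an optional engine (pyarrow or fastparquet) for this.
    parquet_path = os.path.join(tmp, "logs.parquet")
    try:
        subset.to_parquet(parquet_path)
        from_parquet = pd.read_parquet(parquet_path)
    except ImportError:
        from_parquet = None  # no Parquet engine installed

print(from_pickle)
```

Which format works best depends on the data, so it is worth timing a load of each on a real log file before committing to one.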
For anyone considering applying for an internship, I would absolutely encourage you to go for it, no matter your current level of experience. As I’ve found, the application process is a very good time investment, regardless of whether you ultimately get a position (for example, you gain practice making applications, and so can apply more quickly and effectively in future, as well as knowledge of different subfields you may be interested in working in). As I applied for internships within the UK, I found it helped me to focus on applying for positions I was especially interested in and tailor my application (CV, cover letter etc.) to each of them. I’d also like to champion the support of the university’s careers service, as well as the advice in information sessions offered by ISG for internships with us. The fact you’re reading a blog post like this demonstrates your drive and interest; continue to push forward, and please don’t rule yourself out of opportunities.