Web Archiving

Approaches and Considerations Towards Archiving Digital Islam Across Europe

Anna Grasso (University of Wales Trinity Saint David)

Gary R. Bunt (University of Wales Trinity Saint David)

Digital Islam Across Europe (DIGITISLAM) is a multi-country research project analysing how Online Islamic Environments (OIEs) shape Muslims’ social and religious practices in diverse European contexts. The project is funded by the Collaboration of Humanities and Social Sciences in Europe (CHANSE) and involves research teams working across five European countries: the United Kingdom, Poland, Sweden, Spain, and Lithuania.[1]

One of the project outputs focuses on website data collection and analysis.[2] This is a coordinated effort between the country teams (Spain, Poland, Lithuania, Sweden, United Kingdom), headed by UK Co-Investigator Prof Gary Bunt (University of Wales Trinity Saint David) and Postdoctoral Researcher Dr Anna Grasso (University of Wales Trinity Saint David). The objective is to identify, catalogue and analyse Muslim organisations’ and actors’ websites (Online Islamic Environments, or OIEs) from these countries. Examining these collections will shed light on the similarities and differences in the online presence of different Muslim groups within these contexts (e.g., how actors and organisations present themselves online, who chooses which online channel to transmit their ideas, and what content is posted online). The results of this strand of the research programme will be combined with the two other project outputs (quantitative and qualitative data collection and analysis).

The first part of this work package is identifying and cataloguing websites from the five countries. To archive these URLs, the Internet Archive’s Archive-It was selected as the most suitable and sustainable choice.[3] One long-term aim of this work package is to publish a curated online archive of websites, accessible to researchers, stakeholders, and the public through the project website.

The lead team faced two main challenges. Firstly, the research involved substantial self-training and contact with the Archive-It support team; it built on Prof. Bunt’s archiving experience in the parallel Digital British Islam project.[4] Secondly, each country team has its own specificities (languages, demographics, online presence, local organisations, transnational ties, ideological and religious diversity) affecting the quantity and typology of online data.

The website data collection process began in May 2023. Meetings were held with each country team to present the work package’s objectives and discuss each team’s view of the online presence of Muslim organisations and actors in their respective contexts. Suggestions were given on the types of websites to look for in order to portray the diverse Muslim groups present in each country.[5] We also asked teams to keep an eye on whether critical current events (for instance, the recent controversy around Quran burning) had an impact on the organisations and actors they listed, or whether a relevant webpage risked being taken down. In both cases, country teams were to alert the archiving team so that capture of the actor or organisation pages involved could be set up.

Each team was tasked with producing a spreadsheet listing all the Muslim organisations’ and actors’ websites by the start of June 2023. We shared a model to follow, asking each team to list at least 20 websites, a more difficult task for countries with proportionally smaller Muslim populations, such as Poland and Lithuania.

Excel spreadsheets became tools for thematic categorisation, particularly relevant for countries with many websites (mainly the UK and Spain). Each team could set up relevant subdivisions (e.g., education, identities, Sufi, Shi’a). Nonetheless, the final objective will be to harmonise the categories around the project’s main themes (Gender, Authority, Online/Offline interactions, Transnational connections). Besides the names and URLs of the different organisations and actors, social media pages were also listed.[6] Furthermore, we examined whether websites were dormant. This practice not only gives us clues about the online activity of specific groups but also determines how many times we archive such websites on Archive-It.[7] Links to external social media were also provided; however, these cannot be crawled and captured due to capacity and ethical considerations. Finally, the teams compiled information on the main topics covered by the URLs, alongside descriptors. This approach is relevant for setting up the archived material’s metadata.
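As a rough illustration, a spreadsheet row and the planned harmonisation step can be sketched in Python. The `Seed` fields and the `THEME_MAP` entries below are invented examples, not the project’s actual columns or category scheme:

```python
from dataclasses import dataclass, field

# Hypothetical mapping from team-specific spreadsheet subdivisions to the
# project's shared themes; these entries are illustrative only.
THEME_MAP = {
    "education": "Authority",
    "identities": "Gender",
    "sufi": "Transnational connections",
}

@dataclass
class Seed:
    """One row of a country team's URL spreadsheet (fields are assumed)."""
    name: str
    url: str
    country: str
    category: str                  # team-specific subdivision
    dormant: bool = False          # no recent updates observed
    social_media: list = field(default_factory=list)  # listed, never crawled

def harmonise(seeds):
    """Group seeds under the project's shared themes."""
    themed = {}
    for seed in seeds:
        theme = THEME_MAP.get(seed.category.lower(), "Uncategorised")
        themed.setdefault(theme, []).append(seed)
    return themed
```

Country-specific subdivisions that have no agreed theme simply fall into an `Uncategorised` group for later review, which mirrors how harmonisation is an end goal rather than a starting constraint.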

After receiving the various spreadsheets, the next step consisted of setting up ‘test crawls’. Test crawls are a valuable feature of Archive-It, as they simulate an official ‘crawl’ (the process of saving web pages). Unlike official crawls, test crawls retain webpages for only 60 days and are deleted automatically unless saved. Using Archive-It requires purchasing a storage allowance, which official crawls consume; particular care therefore needs to be taken when selecting the websites to crawl.
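Because official crawls draw on a finite purchased allowance, deciding which test-crawled seeds to save can be thought of as a simple budgeting exercise. The sketch below assumes hypothetical test-crawl sizes and a greedy smallest-first policy purely for illustration; Archive-It imposes no such algorithm, and in practice curators weigh relevance, not just size:

```python
def select_for_official_crawl(test_sizes_gb, budget_gb):
    """Choose seeds whose combined test-crawl size fits the storage budget.

    test_sizes_gb: dict mapping seed URL -> archived size (GB) observed in
    the test crawl. Smallest seeds are admitted first until the purchased
    allowance would be exceeded.
    """
    chosen, used = [], 0.0
    for url, size in sorted(test_sizes_gb.items(), key=lambda item: item[1]):
        if used + size <= budget_gb:
            chosen.append(url)
            used += size
    return chosen, used
```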

Each Excel URL spreadsheet became a ‘collection’ (a defined set of websites, called ‘seeds’) on Archive-It. Some social media and video-sharing pages (linked from within the websites) were blocked from the crawl to avoid collecting too much data and for ethical reasons. A test crawl takes approximately one day to archive a given collection.

Once all the pages were test-archived, we organised a new round of meetings with each team to show the initial results. After each meeting, we set up country team accounts on Archive-It and shared instructions on reviewing the archived material. We asked each team to check whether sufficient data had been saved or whether further links needed to be included.[8] For non-dormant websites, each team was invited to monitor how frequently new information was published in order to calibrate the future official crawl schedule.
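The calibration step can be sketched as a simple rule mapping observed publishing frequency to a crawl schedule. The thresholds and schedule labels below are illustrative assumptions, not project policy; in Archive-It the schedule is set per seed by the curating team:

```python
def crawl_schedule(days_between_updates, dormant=False):
    """Suggest an official-crawl frequency from how often a site publishes.

    Thresholds and labels are invented for illustration.
    """
    if dormant:
        return "one-time"          # archive once, no recurring crawl
    if days_between_updates <= 7:
        return "monthly"           # very active sites
    if days_between_updates <= 30:
        return "quarterly"
    return "semiannual"            # rarely updated sites
```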

The next step, on which we are currently working, is descriptors and metadata. On the one hand, country teams will oversee writing a short but more detailed description for each website in English and in their official language(s). On the other hand, we have set up a list of key terms in a shared Excel document (to which everyone can contribute by adding general or country-specific terms in different languages), which will be used to tag and categorise the different websites in the final results. Categorisation models generated by the metadata tags will also be used in data visualisation processes. These elements will feature in future pieces on this website.
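The multilingual key-term tagging can be illustrated with a minimal sketch. The vocabulary below is invented; in practice the tags and their variants come from the shared key-term spreadsheet:

```python
def tag_website(description, key_terms):
    """Return tags whose variants (in any language) appear in a description.

    key_terms: dict mapping a tag to its variants in different languages,
    mirroring the shared multilingual key-term document. Matching is a
    naive case-insensitive substring check, for illustration only.
    """
    text = description.lower()
    return sorted(
        tag
        for tag, variants in key_terms.items()
        if any(variant.lower() in text for variant in variants)
    )
```

A tag fires if any of its language variants appears in the description, so a single shared vocabulary can cover descriptions written in each team’s official language(s).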

Our short-term objective is to publish the initial collections and descriptions on the project website by May/June 2024. This platform will form the basis of later analysis on the website and in journals and other publications. It forms the foundation of a potentially significant legacy for future analysis and study, capturing a pivotal period in online Islamic developments across European contexts, and it enables future research in this growing area of social and academic interest.

Endnotes

[1] CHANSE, DIGITISLAM: Digital Islam across Europe – Understanding Muslims’ Participation in Online Islamic Environments https://chanse.org/digitislam/, accessed 13 September 2023

[2] Digital Islam Across Europe, https://blogs.ed.ac.uk/digitalislameurope/, accessed 13 September 2023

[3] Archive-It, FAQs, https://archive-it.org/products-and-services/archive-it-faqs/, accessed 13 September 2023

[4] Digital British Islam, https://digitalbritishislam.com/, accessed 13 September 2023

[5] One example is websites offering religious text translations. The analysis includes determining translation sources, their potential audiences, and their quality.

[6] This will not be part of the public archive but will be used for informational purposes. Snapshots of posts could also be used for publications under ethics regulations.

[7] When saving data on Archive-It, the organisation can set the frequency at which a website is captured.

[8] When launching a test crawl, there is an option to determine the depth and type of data saved from a given URL. The team decided to crawl every website with the same option (One Page +). Nonetheless, this option is not appropriate for all types of URLs. Hence, it is essential to meticulously review every test-archived URL and calibrate these options before the official crawl. Our parameters exclude data-heavy digital objects, such as lengthy videos and audio files, although some may form part of later crawls.