Institute of Genetics and Cancer

Statistics and reproducibility: guides through your research

Figure 1: Dr Catalina Vallejos – Chancellor’s Fellow (Photo: MRC HGU, IGMM)

Over six months ago, the world plunged into lockdown and virtual seminars popped up like mushrooms in autumn. Here at the MRC Institute of Genetics and Molecular Medicine, the IGMM Statistical Seminar Series was set up to give researchers more insight into the discipline of statistics. This brainchild of Dr Ailith Ewing and Dr Catalina Vallejos, chaired by Dr Ewing, was supported by volunteers from all across the IGMM. I spoke to Catalina to find out more about the importance of statistics and reproducibility, and why you should definitely check out these seminars!

Catalina opened the IGMM Statistical Seminar Series with the crucial statement: “Statistics is at the core of modern research.” During our interview she elaborated: “A lot of current research generates data. Basically, people go into the lab and run experiments, and they have hypotheses. But the way to test those hypotheses is to generate some data: to run RNA sequencing, qPCR or some other kind of experiment that will, in many cases, generate quantitative data. You need to be able to extract robust information from that data because you collected the data for a purpose, and that’s where statistics plays an important role.”

Moreover, Catalina underlines that statistics starts with experimental design: “If you don’t plan an experiment well, you can preclude yourself from answering the question that you had. There is this cliché phrase that ‘contacting a statistician after conducting an experiment is like asking for a post mortem and they [statisticians] will tell you what went bad’”. Fortunately, the seminar series was set up to raise awareness of exactly this. Hopefully the increasing use of statistical programming languages like R will help too.

Figure 2: Making use of directions before making your next move is important in real life and in research (Picture: https://www.publicdomainpictures.net/)

Statistics is useful for more than just validating hypotheses. Catalina explains that statistics may not directly reveal the true biological signals within data, but it can tell you where to look next. You use statistics to guide you through your data: “Imagine GWAS results, or you have very large datasets and you’re looking for signals in millions of regions. You couldn’t really manually look at each of them and validate them, but you can use a statistical approach to prioritize.” Additionally, statistical tests make assumptions, so it can be dangerous to trust p-values blindly, particularly when those assumptions don’t hold. Using the correct test is crucial, as Catalina explains with an analogy: “t-tests, for example, are widely used. They have certain assumptions and there are many situations where the assumptions are not met. I’m not blaming the test in that situation, I’m saying that the t-test is not being used in the appropriate context. It’s like if you have reagents that work in certain conditions, and when you use them in the wrong condition the experiment won’t work.”
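To give a flavour of what “using a statistical approach to prioritize” could look like in practice, here is a minimal Python sketch that ranks regions by Benjamini-Hochberg adjusted p-values. Catalina did not name a specific method in our interview, so the choice of Benjamini-Hochberg, the function names, the region labels and the 5% threshold are all assumptions made purely for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest so that the
    # adjusted values never decrease as the raw p-values increase.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def prioritise(region_p_values, fdr=0.05):
    """List (region, adjusted p-value) pairs passing the threshold, best first.

    `region_p_values` is a hypothetical dict mapping a region label to its
    raw p-value; `fdr` is an illustrative false discovery rate threshold.
    """
    names = list(region_p_values)
    adjusted = benjamini_hochberg([region_p_values[n] for n in names])
    hits = [(n, q) for n, q in zip(names, adjusted) if q <= fdr]
    return sorted(hits, key=lambda pair: pair[1])
```

Calling `prioritise({"region_A": 0.0001, "region_B": 0.5, "region_C": 0.02})` would return `region_A` and `region_C`, in that order, as the places to look first. In keeping with Catalina’s warning, this procedure has assumptions of its own: it is designed for tests that are independent or positively dependent, so it is a guide for where to look, not proof of a biological signal.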

Figure 3: Using version control and intermediate steps in code will make your work reproducible (Free stock images: Pexels)

Importantly, Catalina also makes clear that reproducibility is essential in research: “I think reproducibility is just so important, and for many reasons. The selfish reason, first of all: imagine you submitted a paper to a journal and then, after a few months of review, you get asked to rerun some analysis or to extend your analysis. If you haven’t worked in a way that is reproducible, there is a chance that when you try to reproduce your own analysis you will not be able to do it. Another thing is that science is incremental in many situations and people will build from your research: you can publish something and I may want to run a subsequent analysis from that data. If I’m not able to reproduce your initial analysis, how can I do mine? So I think reproducibility is important for oneself, but for others in the scientific community as well.”

Luckily for us, Catalina has some tips on working reproducibly: “I use version control [Git, hosted on GitHub or GitLab] for everything. Version control has been a lifesaver many times because when you are coding something, sometimes you break that code. With version control you have the option to go back.” She also mentions that tools like Docker (containers) or Conda (environments) can help you cope with evolving software versions. Lastly, Catalina creates intermediate “check-point” steps in her code to guide herself through the analysis. Her advice to early career researchers is sound: “invest time now trying to learn [to work reproducibly] because I think that journals, funders and research institutions are starting to recognize that this is very important”.
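One way to picture an intermediate “check-point” step is a small helper that saves the result of a slow step to disk and reloads it on the next run. The sketch below is a minimal Python illustration, not Catalina’s actual workflow: the function name `checkpoint`, the choice of JSON as the file format and the example path are all hypothetical.

```python
import json
from pathlib import Path


def checkpoint(path, compute):
    """Load a saved intermediate result, or compute and save it.

    `path` is where the result is stored (hypothetical location) and
    `compute` is a zero-argument function that runs the slow step.
    The result must be JSON-serialisable (numbers, strings, lists, dicts).
    """
    path = Path(path)
    if path.exists():
        # A previous run already produced this step: reuse it.
        return json.loads(path.read_text())
    result = compute()
    # Create the folder if needed, then store the result for next time.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result))
    return result
```

A call such as `checkpoint("results/step1_summary.json", summarise_counts)`, where `summarise_counts` stands in for your own analysis function, runs the step once and reads the saved file afterwards. Deleting the file forces the step to run again, which is worth doing whenever the code or the input data for that step changes.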

Exposing yourself to new environments, such as new kinds of statistical analysis, can be eye-opening. Catalina said the aim of the seminar series is “to introduce the [statistical] concepts and make people aware of the different types of methods, their limitations and the challenges of what to look for when applying them”. Catalina divides her time between the IGMM and the Alan Turing Institute. When she started at the Alan Turing Institute, she came into contact with computer scientists working on a wide range of subjects. “It was really, really multidisciplinary and that was really an enriching experience and I ended up collaborating with people that I wouldn’t have collaborated with if I stayed in a more homogeneous environment.” Learning new statistical methods is one thing, but the IGMM Statistical Seminar Series can also give you the skills to communicate and start collaborating with colleagues from other disciplines.

There you have it: a short overview of how statistics is at the core of our research and how both statistical tests and working reproducibly guide us when doing research. Catalina’s enthusiasm for these subjects is infectious. All our research stands to benefit from the excellent work done by all the volunteers who participated in the IGMM Statistical Seminar Series!

The IGMM Statistical Seminar Series is available on Media Hopper Create (University of Edinburgh only)

Recommendations from Catalina:
  1. “Five selfish reasons to work reproducibly”, Florian Markowetz, https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0850-7
  2. ‘The Turing Way’ – A handbook for reproducible data science, https://www.turing.ac.uk/research/research-projects/turing-way-handbook-reproducible-data-science
  3. Modern Statistics for Modern Biology, Susan Holmes and Wolfgang Huber, https://www.huber.embl.de/msmb/index.html
  4. R for Reproducible Scientific Analysis, Software Carpentry, https://swcarpentry.github.io/r-novice-gapminder/
