Open science helps researchers make their work more accessible to the public. This also means that students can use research to support their own learning! In this blog post Greta discusses how open science can benefit data science learners and how to take advantage of the best data and code sharing practices.
You read the abstract of an academic article that sounds like just what you need for your essay or report, you click “download”, only to find out that it’s behind a paywall! The Warwick University institutional login isn’t helpful – the university hasn’t subscribed to the journal. You either pay or do without it… Does this situation sound familiar to you? Open science aims to address this problem by encouraging open access publishing. However, it is much more than this. In this blog post I will discuss how you can take advantage of open science in the context of data science.
What is open science?
There is no single definition of open science (or open research). For example, UK Research and Innovation (UKRI) identifies four key features: transparency, reproducibility, openness, and verification. The Alan Turing Institute (the UK’s national institute for data science and artificial intelligence), in its Turing Way handbook, describes open research as reproducible, transparent, reusable, collaborative, accountable, and accessible to society.
McKiernan and colleagues (2016) reviewed the literature to systematise how open science helps researchers to succeed. They noted that open access publications and shared data tend to receive more citations and attract more media coverage. This sounds auspicious from a researcher’s perspective, but what if you’re a learner? How can you benefit from open science besides reading publications?
Data Sharing
Making research data open is one of the good practices in open research. This is for a good reason – data is essential to reproduce research findings. Open data can be reused by others, including you! Therefore, if you’re working on a data science project, you don’t always have to create artificial data or work with the common datasets, such as the ones found on Kaggle. Of course, Kaggle is great and, indeed, most of its top-ranked datasets, such as “Housing Cost in New York”, should be reliable. Nevertheless, data quality should be even less of a problem in academic research that is used to advance scientific understanding of a phenomenon or inform policy decision-making.
Open research data often comes with requirements that make it easier to reuse. For example, UKRI-funded research must follow various research data policies. One of the criteria is to provide high-quality metadata supporting the understanding of the data. A good grasp of what the dataset is about – what it does and doesn’t contain – is crucial for modelling and for interpreting the results.
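In practice, that first look at a dataset can be done in a few lines. Here is a minimal sketch in Python using pandas – the column names and values are made up for illustration, standing in for a downloaded open dataset:

```python
import pandas as pd

# Stand-in for an open dataset you've just downloaded (hypothetical values)
df = pd.DataFrame({
    "borough": ["A", "B", "C"],
    "median_rent": [1500, None, 2100],
})

# Check what the dataset does and doesn't contain before modelling:
print(df.dtypes)        # column names and types
print(df.isna().sum())  # missing values per column
print(df.describe())    # summary statistics for numeric columns
```

Comparing what you see here against the dataset’s metadata (units, coverage, known gaps) is exactly where that high-quality documentation pays off.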
Data sharing can be practised even in the absence of funding requirements because it reflects a researcher’s values. For example, I worked on a project whose lead supervisor encouraged open science, so one of the project outputs was the release of “data packs”. My colleague created a document to explain and visualise them. By doing this we aimed to improve access to the data by allowing effortless downloading and exploration in one’s preferred software.
“Data sharing can be practised even in the absence of funding requirements because it reflects a researcher’s values”
Finally, because of the growing interest in open data sharing practices, there is an increasing variety of datasets that you can explore and use. The journal Nature provides a comprehensive list of data repositories grouped by academic discipline. Something I’ve discovered recently is the Living with Machines project, which has released a number of datasets. So, if you’re interested in digital humanities, history and/or natural language processing, check it out!
Code Sharing
Let’s say you’ve found a dataset for the question you want to answer. You’ve clarified what you want to do and what your program should do, including the steps it should follow. There’s one little problem: you know what needs to be done, but not how exactly to implement it, i.e. translate it into a programming language. This is a situation in which open notebooks might help.
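That gap between a plan and its implementation is often smaller than it looks: each plain-English step usually maps onto a line or two of code. As a minimal sketch (with made-up example values, not from any real dataset):

```python
# Plan in plain English:
#   1. Keep only the values that aren't missing.
#   2. Compute the mean of what remains.

values = [3.0, None, 4.5, 6.0]  # hypothetical measurements

# Step 1: filter out missing entries
clean = [v for v in values if v is not None]

# Step 2: compute the mean
mean = sum(clean) / len(clean)
print(mean)  # 4.5
```

When a step resists translation like this, that is precisely the moment an open notebook showing someone else’s working solution becomes useful.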

One of the advantages of open notebooks is that they support learning from colleagues about what worked and what did not. In my previous blog post, I mentioned that I relied a lot on my supervisor’s GitHub account to learn spatial analysis in R. The good (or not) thing is that, most likely, someone else has already had your problem and solved it, so your task is to find their solution, test it, apply it, and acknowledge the work of others!
I’d say the two best platforms for finding open code are GitHub and GitLab (plus Stack Overflow for various coding questions and code snippets). You can find projects of interest by searching either platform. Academic publications will also often have a formal note on data/code availability, or a URL in the text or a footnote. The latter is the approach my colleagues and I chose for a conference paper a year ago, though today I would go for the former, if the submission format permitted.
Finally, I would like to note that reading others’ code is hard, so just being able to do this is a small accomplishment in itself. It might seem that you haven’t done any work beyond copy-pasting, yet it can be a great way of learning if it’s done intentionally, with the purpose of understanding what has been done.
Open science is not, and will not become, a silver bullet for learning programming. However, it can accompany you on your way to becoming a better programmer. Just don’t forget to acknowledge those whose knowledge sharing practices helped you on this journey – be it in the form of data or code.
Have you ever published open access research? Have you ever benefitted from using it? Let us know by tweeting us @researchex, messaging us on Instagram @warwicklibrary, or emailing us at libraryblogs@warwick.ac.uk
If you’re interested in reading more about coding and programming, check out Greta’s previous blog posts here and here.
Want the latest PhD Life posts direct to your inbox? Subscribe below.