Data access restrictions reduce diversity in scientific research, study finds

The easy availability of Landsat images, such as this photo of the Dzhugdzhur Mountains in Russia’s Far East, sparked a wave of new research from around the globe. (Image courtesy of NASA Goddard Space Flight Center and U.S. Geological Survey)

New technologies have allowed governments and other organizations to collect large, high-quality datasets that can be used across scientific research, from economics to biology to astronomy. Yet high costs and access restrictions can limit both the diversity of researchers who can use this valuable data and the range of research undertaken with it.

That’s the conclusion of a new paper co-authored by Berkeley Haas Assistant Professors Abhishek Nagaraj and Mathijs de Vaan and published in the journal Proceedings of the National Academy of Sciences.

“Scientific progress heavily relies on data access,” de Vaan says. “By restricting data access to only well-funded elite scientists, new research topics that have the potential to have a lot of impact may not materialize.”

Along with doctoral student Esther Shears of UC Berkeley’s Energy and Resources Group, de Vaan and Nagaraj explored whether reducing barriers to data access increased research output, as well as the diversity of researchers and topics.

The researchers focused on the NASA Landsat program, which produces satellite imagery often used to study environmental and demographic changes. In 1985, the U.S. government sold the Landsat data to a private organization, which charged researchers over $4,000 per satellite image for the proprietary data. In 1995, the government reacquired the images, lowered the costs, and allowed scientists to share the data with one another.

Democratization of data

To measure the effects of the democratization of the Landsat data, Nagaraj, Shears, and de Vaan compiled a dataset of academic publications that referenced Landsat from 1975 to 2005. Using machine-learning algorithms, they then geocoded both the area under study and the location of the scientists for each publication.

They found that after 1995, publications using the Landsat data increased sharply, both in overall volume and in the number of publications in top journals. To test whether this effect was driven by the access change, they split the sample by the geographic areas covered by the satellite imagery. Some areas in the Landsat data had fewer images because of factors such as cloud cover. This allowed the researchers to compare areas with a high number of images against areas with little coverage. If some other factor had led to the 1995 rise, it would presumably have affected both groups. Instead, they found that areas with many images had three times as many publications after the access barriers were removed as during the proprietary period, while areas with little coverage showed no change.
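The logic of that comparison can be sketched in a few lines of Python. This is only an illustration of the before-and-after comparison across two coverage groups, not the authors' actual analysis: the publication counts, the `records` layout, and the helper names are all invented for the example.

```python
from collections import defaultdict

# Hypothetical records: (coverage_group, year, publication_count).
# These numbers are invented for illustration and are NOT from the study.
records = [
    ("high", 1990, 10), ("high", 1994, 12), ("high", 1996, 30), ("high", 2000, 36),
    ("low", 1990, 4), ("low", 1994, 5), ("low", 1996, 4), ("low", 2000, 5),
]

POLICY_YEAR = 1995  # year the access barriers were lowered, per the article


def mean_by_period(rows, policy_year=POLICY_YEAR):
    """Average yearly publications per (group, period), period = 'pre' or 'post'."""
    totals = defaultdict(lambda: [0, 0])  # (group, period) -> [sum, number of years]
    for group, year, count in rows:
        period = "post" if year >= policy_year else "pre"
        totals[(group, period)][0] += count
        totals[(group, period)][1] += 1
    return {key: total / n for key, (total, n) in totals.items()}


def post_to_pre_ratio(means, group):
    """How many times larger the post-period average is than the pre-period one."""
    return means[(group, "post")] / means[(group, "pre")]


def difference_in_differences(means):
    """Change in the high-coverage group minus change in the low-coverage group."""
    high_change = means[("high", "post")] - means[("high", "pre")]
    low_change = means[("low", "post")] - means[("low", "pre")]
    return high_change - low_change


means = mean_by_period(records)
print("high-coverage ratio:", post_to_pre_ratio(means, "high"))
print("low-coverage ratio:", post_to_pre_ratio(means, "low"))
print("difference in differences:", difference_in_differences(means))
```

If something unrelated to data access had driven the rise, both ratios would be expected to move together. In this toy data only the high-coverage group changes, which mirrors the pattern the article describes.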

New researchers from around the world

The researchers also mapped the location of scientists using the data before and after the 1995 transition. Before the change, the data was mostly used in the U.S. and parts of Western Europe. After 1995, usage increased sharply in South America, Africa, Eastern Europe, and parts of Asia. In addition to this increase in regional diversity, the diversity of academic institutions also increased. The researchers found that the post-1995 growth came mostly from authors at lower-ranked institutions rather than from those at institutions ranked in the top 50. They viewed these results as evidence that the reduction in costs allowed researchers with fewer resources to use the satellite imagery.

They next examined whether opening data access also resulted in a diversification of research areas. As the easier access led to more researchers from developing countries using the data, they also saw an increase in research using study locations in those countries, suggesting that the new scientists were conducting research in their local contexts.

Likewise, the researchers found more diversity in the topics being studied. They analyzed the text of paper abstracts and found an increase in unique words after 1995, which suggests that the expanded access led to more diverse research topics employing the Landsat data. Scientists using the dataset for the first time used 38% more unique words than scientists who already had access to the data.
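A rough sketch of how such a vocabulary comparison might work is shown below. It is an assumption-laden illustration rather than the method used in the paper: the abstracts are made up, words are split with a simple regular expression, and the function names are hypothetical.

```python
import re

# Invented abstracts for illustration only; they are NOT from the study.
incumbent_abstracts = [
    "Landsat imagery maps forest cover change",
    "Forest cover change from Landsat imagery",
]
newcomer_abstracts = [
    "Landsat imagery tracks urban growth in river deltas",
    "Glacier retreat measured with Landsat imagery",
]


def unique_words(abstracts):
    """Return the set of distinct lowercase words across a list of abstracts."""
    words = set()
    for text in abstracts:
        # Naive tokenizer (an assumption): runs of ASCII letters only.
        words.update(re.findall(r"[a-z]+", text.lower()))
    return words


def percent_more(new_count, base_count):
    """Percentage by which new_count exceeds base_count."""
    return (new_count - base_count) / base_count * 100


incumbent_vocab = unique_words(incumbent_abstracts)
newcomer_vocab = unique_words(newcomer_abstracts)
print(len(incumbent_vocab), len(newcomer_vocab))
print(round(percent_more(len(newcomer_vocab), len(incumbent_vocab)), 1))
```

A real analysis would likely need to control for the number and length of abstracts in each group, since a larger body of text naturally contains more distinct words.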

Furthermore, de Vaan noted that the new papers were cited at a rate similar to that of research performed before 1995, implying that the new research topics were of high scientific interest.

“Selling data results in direct income for data providers, but it may stifle scientific progress and innovation,” de Vaan said. “In making decisions about data access, data providers should weigh the downstream consequences of restricting access to scientists without substantial resources.”
