The limitations of the rTransparent tool are described in Serghiou et al. (2021): https://doi.org/10.1371/journal.pbio.3001107
Summary of Data Sharing (from file S2 Text):
Upon manual inspection of a random sample of 100 research articles of 2015-2019 from PMC that were labelled as actively sharing data, 93 were indeed found to actively share data and 7 were not: 4 used publicly available data, 2 referred to an inaccessible URL, and 1 claimed that all raw data were in the text, where none could be found. Similarly, out of 89 research articles labelled as not actively sharing data, 84 indeed did not share data, but 5 did: 3 made their data available as supplements, 1 referred to a GSE number, and 1 contained a primer sequence. Assuming similar proportions across all 6017 articles in our sample, in terms of active data sharing in research articles, this algorithm has an accuracy of 94.2% (95% CI, 89.7-97.99%), a sensitivity of 75.8% (95% CI, 61.4-93.9%) and a specificity of 98.6% (95% CI, 97.6-99.5%). Applying this algorithm across research articles of 2015-2019 on PMC is likely to underestimate the true proportion of data sharing by an absolute value of 3.6% (i.e. for every 4794 random PMC research articles of 2015-2019, we expect this algorithm to label 764 as positive, whereas 935 are actually positive).
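The corpus-level figures above follow from scaling the validation-sample error rates up to the labelled counts in the full sample; the same logic applies to the other summaries below. A minimal sketch of that arithmetic in Python, using only the counts quoted above (the paper's exact weighting may differ slightly):

    # Extrapolate validation-sample error rates to the full sample of articles.
    ppv = 93 / 100             # labelled-positive articles that truly shared data
    miss_rate = 5 / 89         # labelled-negative articles that actually shared data

    n_articles = 4794          # "for every 4794 random PMC research articles ..."
    n_labelled_pos = 764       # articles the algorithm labels as sharing data

    tp = n_labelled_pos * ppv                        # expected true positives
    fn = (n_articles - n_labelled_pos) * miss_rate   # expected false negatives
    fp = n_labelled_pos - tp                         # expected false positives
    tn = (n_articles - n_labelled_pos) - fn          # expected true negatives

    accuracy = (tp + tn) / n_articles   # ~0.942
    sensitivity = tp / (tp + fn)        # ~0.758
    specificity = tn / (tn + fp)        # ~0.986
    actually_positive = tp + fn         # ~937, close to the quoted 935

    print(round(accuracy, 3), round(sensitivity, 3),
          round(specificity, 3), round(actually_positive))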
Summary of Code Sharing (from file S2 Text):
Upon manual inspection of all 110 research articles of 2015-2019 from PMC that were labelled as actively sharing code, 97 indeed shared at least some code, whereas 4 did not: all 4 of these mentioned using publicly available code, e.g. “local realignment and variation call were analyzed using Samtools10 Picard (http://broadinstitute.github.io/picard/) and GATK”. Similarly, out of 181 articles labelled as not sharing code, 177 indeed did not share code, but 4 did. Three of these also shared data and had uploaded their code to BitBucket or OSF or referred to it as R syntax; none of these was identified by the algorithm. The fourth did not share data and was missed because it had made its code available as a supplement: “Our methods were easy to implement in R, and the code is presented in Supplementary Table 1, http://links.lww.com/MD/B200.” This final mistake had a major impact on the estimated sensitivity because the majority of articles did not share data (5253/6017), so this one article was dramatically overweighted. Assuming similar proportions across all of the estimated 4825 research articles, this algorithm has an accuracy of 98.3% (95% CI, 96.0-99.6%), a sensitivity of 58.7% (95% CI, 34.0-93.7%) and a specificity of 99.7% (95% CI, 99.6-99.9%). Applying this algorithm across the whole of PMC is likely to underestimate the true proportion of code sharing by an absolute value of 1.1% (i.e. for every 4825 random PMC research articles of 2015-2019, we expect this algorithm to label 110 as positive, whereas 164 are actually positive).
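The remark that this one article was "dramatically overweighted" reflects stratified estimation: an error found in a sparsely sampled stratum is scaled up by that stratum's share of the corpus. The Python sketch below only illustrates that mechanism; the corpus split (5253 vs 764 articles) and the 3-vs-1 split of false negatives come from the text above, while the per-stratum validation sample sizes are hypothetical, since the excerpt does not report them.

    # Why a single false negative can dominate a stratified estimate.
    corpus          = {"shared data": 764, "did not share data": 5253}
    sampled         = {"shared data": 150, "did not share data": 31}   # hypothetical
    false_negatives = {"shared data": 3,   "did not share data": 1}    # from the text

    for stratum in corpus:
        weight = corpus[stratum] / sampled[stratum]   # articles each sampled one stands for
        scaled = false_negatives[stratum] * weight
        print(f"{stratum}: weight {weight:.0f} per sampled article, "
              f"{false_negatives[stratum]} FN scale to ~{scaled:.0f}")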
Summary of Conflict of Interest (from file S2 Text):
Upon manual inspection of 100 articles labelled positive for COI, all 100 indeed reported a COI disclosure. Similarly, out of 225 articles labelled negative, 218 were indeed negative, but 7 were positive. Of these 7, 1 was in Spanish, 1 was an unsuccessful conversion of PDF to text (had it been successful, it would have been identified), and 4 of the remaining 5 used non-standard language to describe COIs (e.g. “No benefits in any form have been received or will be received from a commercial party.”). Running the COI algorithm within the sample of 499 articles from PubMed, we identified 7 articles that had previously been missed by the two reviewers. Assuming similar proportions across all 6017 articles, our algorithm has an accuracy of 99.3% (95% CI, 98.8-99.7%), a sensitivity of 99.2% (95% CI, 98.6-99.7%) and a specificity of 99.5% (95% CI, 98.5-100.0%). Applying this algorithm across the whole of PMCOA is likely to underestimate the true proportion of COIs by an absolute value of 0.5% (i.e. for every 6017 random PMCOA articles, we expect this algorithm to label 4840 as positive, whereas 4873 are actually positive).
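The excerpt does not state how the 95% confidence intervals were computed. As an illustration only, and not necessarily the method used in the paper, a percentile bootstrap over the quoted validation counts yields an interval for the corpus-level sensitivity of roughly the quoted scale:

    import numpy as np

    rng = np.random.default_rng(0)

    # Validation counts quoted above: 100/100 labelled-positive articles truly
    # report a COI; 7/225 labelled-negative articles actually report one.
    labelled_pos_truth = np.ones(100)
    labelled_neg_truth = np.concatenate([np.ones(7), np.zeros(218)])

    n_total, n_labelled_pos = 6017, 4840

    sens = []
    for _ in range(10_000):
        ppv = rng.choice(labelled_pos_truth, 100).mean()   # resampled precision
        fnr = rng.choice(labelled_neg_truth, 225).mean()   # resampled miss rate
        tp = n_labelled_pos * ppv
        fn = (n_total - n_labelled_pos) * fnr
        sens.append(tp / (tp + fn))                        # corpus-level sensitivity

    print(np.percentile(sens, [2.5, 97.5]))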
Summary of Funding (from file S2 Text):
Upon manual inspection of 100 articles labelled positive for Funding, all 100 indeed had an explicit Funding disclosure. The algorithm was then calibrated on PMC articles that were initially labelled negative and tested on a random unseen sample. Of 226 test articles initially labelled negative, the algorithm correctly predicted 218. Of the remaining 8, 4 were falsely predicted negative (FN) and 4 were falsely predicted positive (FP). Of the 4 FNs, 1 was an unsuccessful conversion of PDF to text (had it been successful, it would have been predicted positive) and 3 used uncommon language (e.g. “We would like to thank the U.S. Embassy in Addis Ababa, Ethiopia, Jigjiga University, and Texas Tech University for funding this research”). Of the 4 FPs, 2 referred to funds received by another study and 2 contained words commonly seen in funding disclosures (“support” and “scholar”). Running the Funding algorithm within the sample of 520 articles from PubMed, we identified 7 articles previously mislabelled negative and 2 previously mislabelled positive. Assuming similar proportions across all 6017 articles, our algorithm has an accuracy of 99.4% (95% CI, 99.0-99.8%), a sensitivity of 99.7% (95% CI, 99.3-99.9%) and a specificity of 98.1% (95% CI, 96.1-99.5%). Applying this algorithm across the whole of PMCOA is expected to recover the true proportion of funding disclosures (i.e. for every 6017 random PMCOA articles, we expect this algorithm to label 5084 as positive, with 5084 actually positive). This algorithm can be improved in the future by (a) using a probabilistic model (e.g. a random forest) to predict the outcome on the basis of the exported features and (b) improving the quality of the text extracted from PDFs.
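Suggestion (a) could look roughly like the Python sketch below. The feature names and toy data are hypothetical stand-ins for whatever features the tool exports, and scikit-learn's RandomForestClassifier is used only as one example of a probabilistic model.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    # Hypothetical per-article indicator features plus a manual label; the real
    # exported features and column names will differ.
    df = pd.DataFrame({
        "mentions_grant_number": [1, 0, 1, 0, 1, 0, 1, 1],
        "mentions_funder_name":  [1, 0, 1, 0, 0, 0, 1, 1],
        "mentions_support":      [1, 1, 0, 0, 1, 0, 1, 0],
        "has_funding_statement": [1, 0, 1, 0, 1, 0, 1, 1],   # manual label
    })
    X = df.drop(columns="has_funding_statement")
    y = df["has_funding_statement"]

    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

    # Per-article probability of a funding disclosure, instead of a hard rule.
    print(clf.predict_proba(X)[:, 1])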
Summary of Protocol Registration (from file S2 Text):
Upon manual inspection of 161/261 articles labelled positive for a protocol registration statement, we found 5 FNs and 4 FPs. Of the 5 FNs, 4 were grammatical failures of the algorithm to understand that the registration statement referred to the current study and not some other study (e.g. "This registered study on www.clinicaltrials.gov (NCT01375270) was approved ...") and 1 was a statement contained within the financial disclosures. Of the 4 FPs, 2 were mentions of registration in the references, 1 referred to the registration of a study whose data the article was using, and 1 used a registered study as an example. Similarly, of 147/5657 articles initially labelled negative, there were 11 FPs and 2 FNs; note that the large number of errors occurred because of sampling from articles in which the algorithm was more likely to underperform (see Methods). Of the 11 FPs, most errors occurred because of registrations that did not refer to open protocol registrations (e.g. approval by a medical ethics committee) and because of references to other kinds of registries (e.g. a patient registry). Of the 2 FNs, 1 was a grammatical failure of the algorithm to understand that the registration statement referred to this study and 1 did not mention anything about registration other than the NCT number (“EDITION 2 (NCT01499095) was a randomized, 6-month, multicenter, open-label, two-arm, phase IIIa study investigating ….”). Running the Registration algorithm within the sample of 499 articles from PubMed, we identified 2 positive and 1 negative articles that had previously been labelled erroneously. Assuming similar proportions across all 6017 articles, our algorithm has an accuracy of 99.5% (95% CI, 99.3-99.7%), a sensitivity of 95.6% (95% CI, 92.0-98.6%) and a specificity of 99.7% (95% CI, 99.5-99.8%). Applying this algorithm across the whole of PMCOA is likely to underestimate the true proportion of protocol registration statements by an absolute value of -0.14%, i.e. a slight overestimate: for every 6017 random PMCOA articles, we expect this algorithm to label 249 as positive, whereas 241 are actually positive.
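The failure modes described above, registration statements that concern another study and trial identifiers mentioned without any registration wording, are characteristic of pattern matching over short text spans. The Python sketch below is a generic illustration of that limitation, not the tool's actual rules:

    import re

    NCT_ID = re.compile(r"\bNCT\d{8}\b")   # ClinicalTrials.gov identifiers

    examples = [
        # Registration statement about the article's own study (quoted above):
        "This registered study on www.clinicaltrials.gov (NCT01375270) was approved ...",
        # Trial identifier mentioned without registration wording (quoted above):
        "EDITION 2 (NCT01499095) was a randomized, 6-month, multicenter, open-label, "
        "two-arm, phase IIIa study investigating ...",
    ]

    for text in examples:
        # Both sentences contain an NCT number; deciding whether the current study
        # was registered needs the grammatical context the excerpt describes,
        # which a bare pattern cannot capture.
        print(bool(NCT_ID.search(text)), NCT_ID.findall(text))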