As a crude way to look at analytical method metrics, the research literature will be text mined at three levels, based on:

  • title-only search
  • abstract-only search
  • full-text search

Metrics Based on Title Search (9/18/14)

The simplest way to start this part of the project is to look at one piece of metadata that is freely available for all research articles: the title. This is still not as easy as it might seem, because the context of the metrics needs to be narrowed to analytical chemistry to make the analysis even remotely useful. The two available options for evaluating title data were SciFinder Scholar and Crossref's Metadata API. Because SciFinder does not have an API that can be used to search its database from scripting languages, Crossref's API was chosen.

In the last year Crossref has moved into the area of APIs and text mining to leverage the database of DOIs and metadata that it holds. A web-based search interface is available at http://search.crossref.org/ and an API to the same data (returning JSON) is available at http://api.crossref.org/. Crossref still considers the API to be 'alpha', but it has good documentation and allows access to all the current DOI records (~69,000,000).

To limit the search to analytical chemistry, the journals currently in Elsevier's Scopus product were extracted from the most recent journal title list. 117 journals and book series were filtered from this list of over 30,000 and their metadata transferred to a MySQL database. Misidentified titles were then removed from the list, leaving 76 analytical journals.

In a separate table in MySQL, the twelve analytical metrics from the metrics poll were added with definitions and synonyms (details here). The synonyms were used as the search terms for the Crossref database. A PHP script was written that automated the process of searching all 41 synonyms via the Crossref API for each journal (using its ISSN). An example API URL for searching is:

http://api.crossref.org/works?query="limit+of+detection"&filter=issn:"0003-2700"
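As a rough illustration of this kind of request, the sketch below builds a similar URL and reads the hit count from a response. It is written in Python rather than PHP and is not the author's script. The function names are made up for the example, the `https` scheme and the unquoted ISSN in the filter are choices made here, and the count is assumed to sit in the `total-results` field of the response's `message` object.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the works endpoint under the API root mentioned in the text.
API_ROOT = "https://api.crossref.org/works"


def build_query_url(term: str, issn: str) -> str:
    """Return a works URL that searches for `term` within one journal."""
    params = {
        "query": f'"{term}"',      # quoted phrase, as in the example URL
        "filter": f"issn:{issn}",  # restrict the search to one journal by ISSN
    }
    return f"{API_ROOT}?{urlencode(params)}"


def extract_total_results(payload: str) -> int:
    """Pull the hit count out of a JSON response string."""
    data = json.loads(payload)
    return int(data["message"]["total-results"])


def count_hits(term: str, issn: str) -> int:
    """Run one search over the network and return the number of hits."""
    with urlopen(build_query_url(term, issn), timeout=30) as response:
        return extract_total_results(response.read().decode("utf-8"))
```

Looping `count_hits` over every synonym and every ISSN would reproduce the overall shape of the search; pausing between requests would be a sensible courtesy to the service.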

The JSON string returned from each query was converted to a PHP array in the script and the 'total-results' parameter extracted. The data were saved to a third MySQL table along with the journal, metric, and synonym data. Of the 76 analytical journals, 63 were found to have articles in Crossref, and as a result over 2700 searches were run. Summary data is shown below and by-journal data is available here.

Metric                          # Articles   Percent Articles
Coefficient of Determination            12   0.002%
Limit of Detection                     505   0.088%
Limit of Linearity                       3   0.001%
Limit of Quantitation                   25   0.004%
Linear Dynamic Range                   106   0.018%
Repeatability                           36   0.006%
Reproducibility                        218   0.038%
Selectivity                           1500   0.261%
Sensitivity                           1609   0.280%
Spike Recovery                           1   <0.001%
Sample Size                             39   0.007%
Sample Throughput                       11   0.002%
TOTAL                               575488   -
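The percentage column appears to be each article count divided by the TOTAL row. That reading is an assumption, but it reproduces the listed values. A minimal Python check, with a helper name invented for the example:

```python
TOTAL_ARTICLES = 575488  # TOTAL row of the table, assumed to be the denominator


def percent_of_total(count: int, total: int = TOTAL_ARTICLES) -> float:
    """Return `count` as a percentage of `total`."""
    return 100.0 * count / total


# Three rows copied from the table above.
counts = {"Limit of Detection": 505, "Selectivity": 1500, "Sensitivity": 1609}
for metric, n in counts.items():
    print(f"{metric}: {percent_of_total(n):.3f}%")
```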


Looking at the data, the main metrics that show up in article titles are sensitivity, selectivity, and limit of detection. These are probably not surprising, although it should be pointed out that selectivity was definitely more prevalent as a metric in chromatography journals, for obvious reasons. It will be interesting to see how this compares with the abstract-based data.

Metrics Based on Abstract Search (11/13/14)

To continue the evaluation of analytical metrics, searching the abstracts of articles was the next step. For this I searched two data sources that contain abstracts of analytical papers: i) RSC's Analytical Abstracts (AA) database (commercial) at http://www.rsc.org/Publishing/CurrentAwareness/AA/ and ii) the Flow Analysis Database (FAD) (free) at http://www.fia.unf.edu. RSC was kind enough to give me access to AA because of the ChAMP project, and I am the developer of the FAD website and its backend MySQL database.

The FAD has 17,310 papers through 2007 and, although the abstracts are not available on the website, I have collected over 99.9% of the abstracts in the database. AA has almost 500,000 articles; searches for the key terms were done online and the abstracts subsequently downloaded into a MySQL database. Using this process a subset of 187,224 abstracts was collected for subsequent analysis.

In order to compare the data between both sets of abstracts, the AA dataset was cleaned using a process developed on the FAD dataset. First, HTML tags and character entities were removed, special characters (CR and LF) deleted, and misspellings corrected. The last step is achieved by creating a MySQL full-text index on the abstract field of the database, exporting it to a text file, and importing it into Excel. The Excel spreadsheet is then used to search for misspelled words (a slow process) and corrections are applied to the database using the MySQL command

UPDATE `citations` SET abs = REPLACE(abs, '<term>', '<corrected>') WHERE MATCH (absft) AGAINST ('<term>');
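The same cleanup steps can be sketched outside the database. The Python below is an illustration only, not the process actually used: the function name and the corrections dictionary are invented for the example. Line breaks are replaced with a space rather than deleted outright so that adjacent words do not fuse. Corrections are applied on whole words only, which is stricter than the substring replacement that the SQL `REPLACE` performs.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # crude matcher for HTML tags


def clean_abstract(text: str, corrections: dict[str, str]) -> str:
    """Strip tags, decode entities, remove CR/LF, and fix known misspellings."""
    text = TAG_RE.sub("", text)                         # drop HTML tags
    text = html.unescape(text)                          # decode character entities
    text = text.replace("\r", " ").replace("\n", " ")   # remove CR and LF
    text = re.sub(r"\s+", " ", text).strip()            # collapse leftover whitespace
    for wrong, right in corrections.items():
        # Whole-word replacement of each known misspelling.
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text)
    return text


raw = "<p>The limt of detection\r\nwas &lt; 5 ng/mL &amp; stable.</p>"
print(clean_abstract(raw, {"limt": "limit"}))
```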

For this work, effort focused primarily on the words in the terms that would subsequently be searched. Some examples of the misspellings found for keywords are given below.

Determination     Limit     Quantitation    Quantitative
deteermination    limita    quanitation     quantative
detemination      limitat   quanititation   quanitative
detemrination     limmit    quantitaton     quanititative
deter (abbrev)    limt      quantition      quantiative (synonym)
deterination      limts     quantivation    quantificative
determ (abbrev)   linit     quantiation     quantitiative
determation                 quatitation     quantitive (synonym)
determiantion                               quantititative
determinafion                               quantizative
determinaiton                               quatitative
determinatiion
determinatin
determinatiom
determinaton
determinatuion
determinination
determintation
determintin
determintion
detmermination
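Variants like these were found by hand in Excel. One way the slow manual search might be narrowed is to flag vocabulary words that closely resemble a keyword. The Python sketch below is not part of the original workflow: it uses `difflib.get_close_matches` from the standard library, and the function name and the 0.8 cutoff are arbitrary choices for illustration. Legitimate forms such as plurals will also be flagged, so the output still needs manual review.

```python
from difflib import get_close_matches

# The four keywords from the examples above.
KEYWORDS = ["determination", "limit", "quantitation", "quantitative"]


def flag_possible_misspellings(vocabulary, keywords=KEYWORDS, cutoff=0.8):
    """Group vocabulary words under the keyword they most closely resemble."""
    flagged = {}
    for word in vocabulary:
        if word in keywords:
            continue  # correctly spelled keyword, nothing to flag
        match = get_close_matches(word, keywords, n=1, cutoff=cutoff)
        if match:
            flagged.setdefault(match[0], []).append(word)
    return flagged


# A small vocabulary, as might be exported from the full-text index.
vocab = ["determinaton", "limt", "quantitaton", "quantiative", "sample", "limit"]
print(flag_possible_misspellings(vocab))
```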

 

During the cleanup phase, alternate search terms were identified and added to the list of those to be searched for the respective metrics (see below). Correlation coefficients were added to coefficient of determination because the two are related (for a simple linear fit the coefficient of determination is the square of the correlation coefficient, r) and authors more often report the correlation coefficient.

[Image: metric terms]

A summary of the statistics found is shown below (an Excel spreadsheet with the data will be available shortly). In general the two datasets agree on the most frequently seen metrics and the most frequently used term for each metric. The percentages of each metric found in the whole database are quite different, which might be explained by the slightly different perspectives: AA covers all of analytical chemistry, both quantitative and qualitative, whereas FAD is specific to flow analysis, which is almost exclusively quantitative.

[Image: metric summary]

Looking at the data in the AA set in terms of the number of metrics found in each paper, the majority of papers report only one metric and nearly 76% of the papers report 1-3 metrics.  Interestingly, 2% of the papers in the AA dataset did not have a metric even though the papers were downloaded by searching for the metrics.  A deeper analysis of this is planned in the next month.

[Image: metrics per paper]
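A count of metrics per paper like the one discussed above could be produced along the following lines. This Python sketch is an assumption about the general approach, not the analysis actually run: the term lists are a small illustrative subset rather than the full synonym table, and the function names are invented. Matching is by plain substring, so a term such as "sensitivity" would also match inside "insensitivity".

```python
from collections import Counter

# Illustrative subset only; the real synonym list is held in the MySQL table.
METRIC_TERMS = {
    "limit of detection": ["limit of detection", "detection limit"],
    "sensitivity": ["sensitivity"],
    "selectivity": ["selectivity"],
    "reproducibility": ["reproducibility"],
}


def metrics_in_abstract(abstract, metric_terms=METRIC_TERMS):
    """Return the set of metrics with at least one term present in the abstract."""
    text = abstract.lower()
    return {
        metric
        for metric, terms in metric_terms.items()
        if any(term in text for term in terms)
    }


def metrics_per_paper(abstracts, metric_terms=METRIC_TERMS):
    """Tally how many papers mention 0, 1, 2, ... distinct metrics."""
    return Counter(len(metrics_in_abstract(a, metric_terms)) for a in abstracts)


abstracts = [
    "The detection limit was 1 ng/mL with good sensitivity.",
    "Selectivity was improved.",
    "A new flow manifold is described.",
]
print(metrics_per_paper(abstracts))
```

Papers that land in the zero bucket of such a tally would be the natural starting point for checking why an abstract retrieved by a metric search shows no metric after cleaning.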

This analysis has provided clear insight into how analytical metrics are reported in the literature, and a lot of information about how to approach the full-text analysis planned next.

Metrics Based on Fulltext Searching

This has not been completed yet.