Guidelines for Representation of Metadata Fields
Data included in any metadata item that is based on ChAMP shall conform to the following guidelines for interoperability, interchange, and/or comparison.
To make numeric metadata items as useful as possible, numeric values should report i) an estimate of the precision of the value and ii) a unit of measure. Numeric values should be provided in scientific notation wherever possible. Because reporting a numeric value can involve several pieces of information (value, error, unit), authors should group these logically when they are reported.
If possible, authors of metadata should explicitly include a numeric value for the error in the reported value. This might be (from worst to best):
- 12.34 (no stated error - implied error of 0.01)
- 12.34 ± 0.02 (reported rounded error)
- 12.34(56) ± 0.0223 (reported unrounded error)
- Separation of a numeric value into logical parts: mantissa (1.23456), exponent (+1), error (0.00223 - relative to the mantissa), and significant digits (4)
Any integer values reported should be identified as such, for example by including a qualifier like 'exact' or by indicating 0 significant digits (as a replacement for infinite). In addition, inclusion of the error type (e.g. absolute, SD, CI) is strongly encouraged.
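The grouping of value, error, and unit described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class name NumericValue, its field names, and the unit "mg/L" are hypothetical and are not defined by ChAMP. The numbers reuse the 12.3456 example from the list above.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class NumericValue:
    """Hypothetical container that keeps the parts of a reported value together."""
    mantissa: Decimal                 # e.g. 1.23456
    exponent: int                     # power of ten, e.g. +1
    error: Optional[Decimal]          # relative to the mantissa, e.g. 0.00223
    sig_digits: int                   # 0 is used here to mark an exact value
    unit: str                         # ideally a reference to a standardized unit
    error_type: Optional[str] = None  # e.g. "absolute", "SD", "CI"

    @property
    def value(self) -> Decimal:
        # Shift the mantissa by the power of ten to recover the plain value.
        return self.mantissa.scaleb(self.exponent)

    @property
    def absolute_error(self) -> Optional[Decimal]:
        # The error is stored relative to the mantissa, so it is shifted the same way.
        if self.error is None:
            return None
        return self.error.scaleb(self.exponent)

    @property
    def is_exact(self) -> bool:
        return self.sig_digits == 0


# The 12.3456 example, with an illustrative unit and error type.
reported = NumericValue(
    mantissa=Decimal("1.23456"),
    exponent=1,
    error=Decimal("0.00223"),
    sig_digits=4,
    unit="mg/L",
    error_type="SD",
)

# An integer count flagged as exact through 0 significant digits.
count = NumericValue(
    mantissa=Decimal("6"),
    exponent=0,
    error=None,
    sig_digits=0,
    unit="count",
)
```

Decimal is used instead of float so that the digits an author reported are preserved exactly rather than rounded to a binary approximation.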
All units should use the International System of Units (SI) wherever possible. Authors are highly encouraged to reference standardized representations of units, such as UnitsML or SWEET, to potentially allow interconversion of numeric values into other equivalent units. Authors also need to specify units unambiguously so that values can be compared appropriately. For example, reporting a value in ppm (parts per million) is ambiguous because it could mean mass/volume, mass/mass, or frequency/frequency.
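One way to picture the ppm ambiguity is to record the basis of the ratio next to the unit symbol. The sketch below is hypothetical: RatioUnit and directly_comparable are illustrative names, not part of UnitsML, SWEET, or ChAMP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RatioUnit:
    """Hypothetical ratio unit that states what is being divided by what."""
    symbol: str       # e.g. "ppm"
    numerator: str    # quantity kind of the numerator, e.g. "mass"
    denominator: str  # quantity kind of the denominator, e.g. "volume"

    def describe(self) -> str:
        return f"{self.symbol} ({self.numerator}/{self.denominator})"


def directly_comparable(a: RatioUnit, b: RatioUnit) -> bool:
    # Two values can be compared without conversion only if the symbol
    # and both quantity kinds match.
    return a == b


ppm_mass_volume = RatioUnit("ppm", "mass", "volume")
ppm_mass_mass = RatioUnit("ppm", "mass", "mass")
```

With the basis stored explicitly, the two ppm units above share a symbol but are not treated as the same unit.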
Textual (String) Values
As the majority of likely representation formats are text based, textual data should be encoded in UTF-8. Although it is not encouraged, if data in any field needs to be Base64 encoded, the raw text must first be encoded as UTF-8 and the Base64 encoding applied to those bytes. Best practices for representing metadata items in any data format should be derived from the specification of the format being used, for example:
- XML-based formats (http://www.w3.org/TR/xml-i18n-bp/)
- JSON-LD (http://www.w3.org/TR/json-ld/#string-internationalization)
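The order of operations for Base64 fields (UTF-8 first, Base64 second) can be sketched as follows. The function names encode_field and decode_field are hypothetical helpers for illustration; only the Python standard library is used.

```python
import base64


def encode_field(text: str) -> str:
    """Encode text as UTF-8 bytes first, then Base64 encode those bytes."""
    raw = text.encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def decode_field(encoded: str) -> str:
    """Reverse the steps: Base64 decode, then interpret the bytes as UTF-8."""
    return base64.b64decode(encoded).decode("utf-8")


original = "12.34 ± 0.02 µmol"
round_tripped = decode_field(encode_field(original))
```

Fixing the byte encoding before Base64 matters because non-ASCII characters such as ± produce different bytes in different encodings, so a reader who assumes the wrong one would recover different text.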
Date-Time Values and Ranges
- UTC format shall be used for date/time points and date/time periods (http://www.iso.org/iso/catalogue_detail?csnumber=40874)
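A minimal sketch of producing UTC date/time points and periods is shown below. It assumes ISO 8601 notation with a trailing "Z" for UTC and a "/" separator between the start and end of a period; the helper names to_utc_iso and utc_period are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_utc_iso(dt: datetime) -> str:
    """Convert a time-zone-aware datetime to a UTC string such as 2015-03-09T16:30:00Z."""
    if dt.tzinfo is None:
        # A naive datetime carries no time zone, so it cannot be converted safely.
        raise ValueError("datetime must be time-zone aware")
    utc = dt.astimezone(timezone.utc)
    return utc.isoformat(timespec="seconds").replace("+00:00", "Z")


def utc_period(start: datetime, end: datetime) -> str:
    """Format a date/time period as start/end, both in UTC."""
    return f"{to_utc_iso(start)}/{to_utc_iso(end)}"


# Illustrative local time at a UTC-6 offset.
local = datetime(2015, 3, 9, 10, 30, tzinfo=timezone(timedelta(hours=-6)))
point = to_utc_iso(local)
period = utc_period(local, local + timedelta(hours=2))
```

Rejecting naive datetimes is a design choice of this sketch: guessing a time zone would silently shift the recorded time.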
In the development of ChAMP it will be necessary to develop controlled vocabularies for some important metadata fields. As these are identified, they will be added to this page, along with links to ongoing development in the forums.
Below are presentations related to the ChAMP project
- Poster presented at the Dial-a-Molecule Annual Meeting in Birmingham, England - July 2014 (PDF)
- Paper presented at the 248th ACS Meeting in San Francisco, CA - August 2014 (Slideshare)
- Status update presentation - December 2014 (Screencast)
- Paper presented at Pittcon 2015 in New Orleans, LA - March 2015 (PDF)
- Presentation given at NLM, Bethesda, MD - March 2015 (PDF)
- Paper presented at the 249th ACS Meeting in Denver, CO - March 2015 (Slideshare)
Here is a list of the ontologies important to ChAMP
- Chemical Entities of Biological Interest (ChEBI)
- Chemical Information Ontology (CHEMINF)
See also: Michel Dumontier. "The chemical information ontology: Provenance and disambiguation for chemical data on the biological semantic web." PLoS ONE 6, e25513+ (2011) doi:10.1371/journal.pone.0025513
- Eagle-i resource ontology (ERO)
- Ontology for Biomedical Investigations (OBI)
- Chemical Methods Ontology (CHMO or CMO)
And then there are the vocabularies...
- Medical Subject Headings (MeSH)
- IUPAC Gold Book (http://goldbook.iupac.org/)
- IUPAC Orange Book (http://iupac.org/publications/analytical_compendium/)