Learning Objectives

By the end of this module students will:

Basics of Computer Systems

For a course about chemical information we start at the lowest level - how computers deal with information internally. Take for example a Word document that contains the following


Humans of course can very easily identify that this is a telephone number, but to a computer it looks like the following in its basic format – binary (more info on this later…)

00111001, 00110000, 00110100, 00101101,
00110110, 00110010, 00110000, 00101101,
00110001, 00110001, 00110011, 00111000

Whether it be on a regular hard disk drive (magnetic coating on a disk) or a flash/SSD (memory chips) all data on a computer is stored as 1’s and O’s  (see,2817,2404258,00.asp). As a result, all data on a computer must be represented in binary notation.  Each recorded 1 or 0 is a ‘bit’ and eight bits make a ‘byte’. A byte is the basis of presenting data because eight bits, being either 1 or 0, can together represent numbers up to 255.  This is because for each bit there are two possible permutations and eight bits gives 28 combinations - 256 values, or 0 thru 255.

In the early days of computer systems, a single byte was used as the way to represent text characters.  The American Standard Code for Information Interchange (ASCII – see  Initially, only the numbers 0-127 where used to represent letters, numbers, punctuation marks and symbols (32-127), and non-printed characters (0-31).  Subsequently, extended ASCII was introduced which added accented characters, other punctuation marks, and symbols (128-255).

Looking back on the example of the telephone number above, we can now translate the binary into the telephone number:

Binary Decimal ASCII Character



































Although we still ‘use’ ASCII today, in reality we use something called UTF-8 today.  This is easier to say than how it is derived - Universal Coded Character Set + Transformation Format - 8-bit.  Unicode (see started in 1987 as an effort to create a universal character set that would encompass characters from all languages and defined 16-bits, two bytes -> 216 -> 256 x 256 = 65536 possible characters – or code points. Today, the first 65536 characters are considered the “Basic Multilingual Plane”, and in addition there are sixteen other planes for representing characters giving a total of 1,114,112 code points.  Thankfully, we don’t need to worry because if something is UTF-8 encoded it is backward compatible with the first 128 ASCII characters.

It’s worth pointing out at this stage that the development of Unicode is a good thing for science.  We speak our own language and have special symbols that we use in many different situations (how about the equilibrium symbol?  ⇌ ) and so publishers in science and technology have developed fonts for reporting scientific research.  Check out and install STIX fonts ( which would not be possible without Unicode.

Information Data Types

Looking back at the telephone number in the previous section our identification of the characters is based off of two things – the fact that there are only digits and hyphens, and the pattern of digits and hyphens.  Again though the computer knows nothing of this it just knows that there is a string of 12 characters. So, in the context of humans representing information in computers it is not just important for use to the characters, we also need to know generically what they represent.  This is critical to everything we do on a computer because we need to know how to process the characters.

In a broad sense characters represent either text, numbers, or special formats.  In the context of discussing the common ‘data types’ we are going to reference those that are used in the relational database software ‘MySQL’.  More about what MySQL actually is later, for now we will look at the ways in which the program represents information that it stores (see for more details).

Numeric types

Representation of numeric values is important because it is dependent on how they are used.  Choosing the right type is critical to making sure there are no inconsistencies, especially when making comparisons or rounding in calculations.

String Types

The differences between string types is primarily based on how much space they take up and how easy they are to search.  Thinking about the application of the string is important especially deciding if a string should be exclusive or inclusive.  Also, realize that there are situations where you might want to store a number as a string.

Other Types

There are many other types of information that are stored, some common ones are:

Common Computer Files and Formats

The mention of binary files in the previous sections highlights that there are differences in how data is stored by computers.  Traditionally, programs or applications are stored in binary format because they contain code that is used to allow the program to run and thus should not be displayed as text (its not readable anyway).  In addition, there are many non-text based files, images, audio, video etc. that need to store the content in a format that can is efficient – a place where UTF-8 (ASCII) is not needed.

Historically though the most important reason for applications and files being stored as binary is that it gave the developer a way to protect their work and make money off of it.  If the application they have written stores the files in a proprietary format (one that only the application can read and write) then no one else could work with the file without licensing the software used to create it.

In the early years, without the presence of the Internet this model worked well.  But once we all became connected and more and more armchair software developers started developing free software, a significant push back started with this model.  People wanted to be able to share documents and not have to pay high licensing fees just to be able to read a letter they sent out to their relatives.  Although a lot of progress has been made so make file formats open and based on community standards there are still a lot of issues will proprietary file formats, especially in the sciences.  If you think about it, any data you get off of an instrument is stored in a proprietary format – if you want to get the data into Microsoft Excel for instance you have to export it.  In a lot of cases you are not getting all the information that was gathered – but more about that later.

Microsoft is a good example of a company that used a proprietary binary file format for saving documents from all its applications, that has transitioned to a open standard text based format.  In 2000 Microsoft started producing versions of its files (starting with Excel) in what was to become the Open Office eXtensible Markup Language (OOXML), a standard released by ECMA in December 2006 (see   XML is a text based markup language published by the World Wide Web consortium ( that is used to annotation text strings using tags.  HyperText Markup Language (HTML) used in web browsers is a limited version of XML primarily used for presentation.

Comparing the binary Word document format (.doc file) with the new OOXML (.docx file) format and plain text (.txt file) we can see the differences.


The Word .doc File Format


The Word .docx File Format

We see that the .doc file is unintelligible in a text reader and has over 22K characters considering the file contains the text ‘Chemical Informatics’.  The .docx file is also unreadable and slightly larger in size – but wasn’t this supposed to be in the OOXML text based format?  It actually is, however the .docx file is actually a folder of XML files that is ZIP compressed into a single file.  If you change the extension of a .docx file to .zip and un-compress it you see the folder of XML documents (that take up a lot of space – a problem with XML).


Finally, if you look inside the ‘document.xml’ file you find the text content.  The other files in the folder are used to store information about the font used, the Word settings, the Styles in use and a thumbnail of the file used by the operating system to display a thumbnail of the file content.  The ‘document.xml’ contains the text of the file and other ‘markup’ – the XML – that provides the structure of the file , the layout of the page, and other important information


Of course if you just need the text (i.e. no formatting etc.) then you could use the .txt format.  Its got the same size 21 bytes, as the length of the text (including the carriage return as the end of the sentence).  Sometimes less, is more…


Computer Languages for Information Science

While this class is not a course in programming, it is useful to know what it is out there that can be used to manipulate information, perform calculations, and organize data. As this an introduction we will just touch on three that scientists/web developers use but there are many more.  This discussion does not include the hard code application development platforms like C++, C#, etc.

Desktop Application (Windows, Mac) – Microsoft Excel

A lot of people think that Microsoft Excel is really for business people and not scientists.  Certainly, Microsoft caters to the business community in terms of the table formats, graphics, and cell functions.  Nonetheless, Excel has functionality that can be used in the context of informatics and as you will see later in the course can talk to webservers to pull done data.

Excel has programming built-in in two ways. First, there are the in-cell functions. These are useful for text manipulation and basic math operations (see  There are also some statistical functions like mean (AVERAGE), mode (MODE), median (MEDIAN) standard deviation (STDEV), as well as some more sophisticated functions related to regression (SLOPE and INTERCEPT).


The there is the much more sophisticated Visual Basic for Applications, (VBA), which sits behind Excel and is used to record the Macro’s run in Excel.  VBA is much more a true programming language allowing for declaration of variables, loop structures, if-then-else conditionals and user defined functions.  The script below is a quick example of how to check for strings to be a particular format – in this case the International Chemical Identifier, or InChI string (more later on this).  This script tries to match the pattern that is defined to the text in each cell to tell if the string ‘looks’ like a valid InChI string.  Whether it is would have to be tested in other ways…



Command Line Application (Any platform) – R

R (yes just the letter R) is a very popular statistical computing and graphics environment (see  R is sophisticated and extremely powerful so there is no way to show off its capabilities in a brief introduction but to give you a flavor here are some things you can do.  R is natively a command line application, so it does not have a graphical user interface (GUI) built-in that means it launches like other applications – you run it from the command line in the terminal.  There are however many GUI’s for R – for example RStudio (

In order to get data into R to do something with it you must either enter it or import it from a text file. If we look back at the linear regression example from Excel we see we have two arrays of data, the x data points (concentration) and the y data points (absorbance).  If we wanted to enter these into R we would do it like this.

sh-3.2# R (run R)
> xpoints <- c(0.00,1.00,2.00,3.00,4.00)
> ypoints <- c(0.0000,0.1204,0.2423,0.3671,0.4901)
> lsfit(xpoints,ypoints)
Intercept         X
-0.00140   0.12269

[1]  0.00140 -0.00089 -0.00168  0.00043  0.00074

We enter the x and y data points as matrixes (you can also think of it as arrays) and then use them as inputs to the lsfit (least squares fit) function that is defined in R.  It immediately spits out the slope, intercept and residuals (the differences between the y values and the calculated y values from the equation) as well as a lot of other details not shown above.

Browser Application (Any platform) – PHP

In the early days of the Internet every webpage you accessed was static, that is, it was a page of text in a file that had to be updated manually to change it.  Today, the majority of websites are built using scripts that dynamically created the page when it is requested.  This is because we expect sophistication (features), information (content), and immediacy (its up to date).  Therefore, the web server needs to things to create the web pages: content (typically stored in a database – see the next section) and a scripting language that can logically build the page based on a set of commands.  Current the most popular language to do this is called Pre-Hypertext Processor, or PHP (see

Just like R, it is impossible to show how capable PHP is in generating web pages on the fly, but a brief example script below shows how to calculate some statistics in a web page, with the output shown below it.

$data = [0.0000,0.1204,0.2423,0.3671,0.4901];
echo "Data: ".implode(", ",$data)."<br />";
echo "Total: ".array_sum($data)."<br />";
echo "Minimum: ".min($data)."<br />";
echo "Maximum: ".max($data)."<br />";
echo "Mean: ".$mean."<br />";
foreach($data as $point) {
echo "Std. Dev: ".sqrt($sumsquares/(count($data)-1));


Storing Information in Databases

A database is a system that stores information in a logical, organized format.  Many applications on your computer (contacts, calendar, mail reader) rely on databases as the places to store the information that needs to be displayed in a way that allows the application to easily go grab the data.

While there are a number of styles of database the most common currently are relational.  This means that data in different ‘tables’ is related together by the use of a unique id field, also called a foreign key.  Each row of a table is given a unique value and this is then entered in a field in another table for any rows of data that are related to the information in the first table.  Sounds complicated but its easy to see it in action in the upcoming examples.

Desktop Database

Microsoft Access has been a staple part of the Microsoft Office suite of applications since 1992.  It often goes overlooked as there are a wider variety of uses for Word, Excel, and PowerPoint, but it is a powerful relational database that has found great use in business.

Users can define multiple tables of data (one entry is a row with multiple fields (columns)) and then link them together using unique ids that are automatically created on each row.  Columns can be different data types (text, numeric, date/time, currency, etc.) or relational (lookup).  The example below shows a small set of artists, albums, and songs that are related together via fields defined in the tables.  The relationships can be see in the last image with the ‘Album’ field of the ‘Songs’ table being linked to the ID field of the ‘Albums’ table, and the ‘Band’ field of the ‘Albums’ table being linked to the ID field of the ‘Artists’ table.  As can be seen in the figures, entering data in the linked fields is done through dropdown menu’s (to control what text is in the field) making it easier to create the table and providing the relationships.





Web-based Databases

Currently, the world of relational database is primarily geared toward the use of the Structured Query Language (SQL).  SQL is a special programming language for retrieving data out of databases originally developed at IBM in the early 1970s. Relational Software (now called Oracle) created the first commercial software to implement SQL in the late 70s – called Oracle V2, and has subsequently become a major player in the database industry.  Currently, there are many implementations of SQL databases, both commercial and open source (free), and most of them are accessible from scripting languages so they can be considered ‘Web-based’ databases.

Although it grew out of the relational database model, SQL has deviated somewhat to include features and provide functionality needed for database applications.  The basic structure of SQL is an English language sentence comprised of multiple parts using defined keywords.  ‘SELECT’ is used to indicate that data is to be retrieved, ‘FROM’ is used to identify the table(s) from which the data is to be obtained, and ‘WHERE’ defines conditions on which the command searches.  So the following SQL query (left) selects all the substance that have chloride (‘%' is used as a wildcard) in the there name and returns the names and formulas of those entries found (rather than all the fields (columns) in the substances table).


The more complicated query (right) adds the ‘JOIN’ syntax that is used to create relationships between tables on the fly so that related data can be retrieved in a single query.  The designation LEFT JOIN means that all the data from the ‘LEFT’ (first) table is including in the results even if there is no related data in the ‘RIGHT’ (second) table.  The keyword ‘AS’ is used to create an alias to a table to make the query shorter, and the keyword ‘ON’ is used to define the fields to join between the tables.

One current trend of note is the movement toward ‘NoSQL’ databases.  This is somewhat of a misnomer as the class of databases this commonly referred to are really non-relational in nature and as a result there is no need to use SQL queries (although some do).  The database model types lumped into the NoSQL classification are: Column, Document, Key-Value, Graph, and Multi-model based.  Many of these have only been developed in the last few years as a response to interest in big data – applications where very fast processing of large (distributed) datasets is required.  Graph based databases are of interest in many situations do to their lack of a data model and representation of ‘semantic’ data or subject-predicate-object triples.  These databases have their own query language, SPARQL which is the interestingly recursive acronym the standards for ‘SPARQL Protocol and RDF Query Language’.

Accessing Data on Websites and Application Programming Interfaces (APIs)

To bring this module to a close it is important to discuss the ramifications of all that we have covered in the previous sections.  By now you will hopefully have come to the realization that i) there is a lot of information/data in the world, ii) organizing it and identifying it is a important and hugh undertaking, and iii) accessing chemistry related data in this environment cannot be done by a Google search.  A lot of what chemical informatics is about is providing ways to deal with lots of data, describing best practices of how to identify and store information, and developing tools that allow scientists to find and retrieve quality information that can be used for teaching and research.

One of the fundamental parts of this is understanding where good chemistry data is, working out how to efficiently search it – based on chemical terms – and how to download it so that you can use it.  In the previous sections you have seen examples of how data can be processed and stored in databases, but how do get the data. It turns out that bringing SQL databases and scripting languages together to build webpages is only one facet of their usefulness.  Increasingly websites are being built to serve up data, that is, they are designed to allow the user to search databases via well-defined standards and return the results for both human and computer use.

Over the last few years the design of websites that follow the Representational State Transfer (REST) is a software design pattern that allows websites to be services.  What that means is of you know how to construct the URLs for pages on these websites you can anticipate the information that you are going to see on the page that is returned.  This concept is so much easier to see in practice.

One website that has a large amount of chemical metadata (information about chemicals – not chemical data like melting points etc.) is the NIH Chemical Identifier Resolver (CIR) website (see The website allows you to get information by writing it in the general format below."structure identifier"/"representation"

In this context “Structure Identifier” means any of the following: chemical names, SMILES, InChI Strings and Keys, and NIH identifiers like FICTS and FICuS.  What you get back from a search is determined by the “representation” part of the URL and includes all of those above and ‘sdf’ (otherwise known as MOL file), ‘formula’ and ‘image’.  So, using the URL below: acid/names

prints out the names of this compound that are in the CIR system – for humans to read.  If you use a computer to access the site (i.e. via a scripting language like PHP) you can also request the data be sent to you as XML using. acid/names/xml

which returns the following document.

<request string="arsinic acid" representation="names">
   <data id="1" resolver="name_by_opsin" notation="arsinic acid">
    <item id="1" classification="pubchem_iupac_name">arsinic acid</item>
    <item id="2" classification="pubchem_substance_synonym">CHEBI:29840</item>
    <item id="3" classification="pubchem_substance_synonym">HAsH2O2</item>
    <item id="4" classification="pubchem_substance_synonym">[AsH2O(OH)]</item>

    <item id="5" classification="pubchem_substance_synonym">arsinic acid</item>
    <item id="6" classification="pubchem_substance_synonym">dihydridohydroxidooxidoarsenic</item>

This is useful because it contains additional information (metadata) about the type of name that an entry is and it makes it easy for a script to take the data and do something with it – like put it in another database.

There are many sites that chemists can use to get chemical data, ChemSpider (, PubChem (, and the NIST Webbook (  Each has a different set of data but all have a standard way to interact with their website.  To encourage users to take full advantage of the search features many sites will publish the Application Programming Interface (API) that delineates what you can search for and how to do it.  Facebook has an API, Twitter has an API, Instagram… the list goes on.  API’s are the way in which websites can ‘’talk” to other websites to integrate data, or present ‘widgets’ that allow you to see what is going on on one site when you are on another.  A good example of a well documented and sophisticated API for chemistry data is the one at PubChem (see which has a nice tutorial at  With over 68 million compounds to search you are sure to find something!