Wednesday, June 18, 2008


Simplified molecular input line entry specification [SMILES]

SMILES is a specification for describing molecules.

SMILES:: The simplified molecular input line entry specification or SMILES is a specification for unambiguously describing the structure of chemical molecules using short ASCII strings.


ASCII (American Standard Code for Information Interchange, generally pronounced ass-key) is a character set and a character encoding based on the Roman alphabet as used in modern English. It is most commonly used by computers and other communication equipment to represent text, and by control devices that work with text.
ASCII specifies a correspondence between digital bit patterns and the symbols/glyphs of a written language, and is used on nearly all common computers, especially personal computers and workstations.
ASCII is, strictly, a seven-bit code: it uses the bit patterns representable with seven binary digits (a range of 0 to 127 decimal) to represent character information. ASCII is one of the most successful software standards ever.
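
As a small illustration of the seven-bit claim, the sketch below checks that a SMILES string uses only code points 0 to 127 and prints the bit pattern of each character. The helper name is_seven_bit_ascii is made up for this example.

```python
def is_seven_bit_ascii(text: str) -> bool:
    """Return True if every character has a code point in the range 0..127."""
    return all(ord(ch) <= 127 for ch in text)


ethanol = "CCO"

# Decimal ASCII codes of the characters in the SMILES string.
codes = [ord(ch) for ch in ethanol]             # [67, 67, 79]

# The same codes written as seven binary digits each.
bits = [format(code, "07b") for code in codes]  # ['1000011', '1000011', '1001111']

print(codes, bits, is_seven_bit_ascii(ethanol))
```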


SMILES strings can be imported by most molecule editors for conversion back into two-dimensional drawings or three-dimensional models of the molecules.

The original SMILES specification was developed by Arthur Weininger and David Weininger in the late 1980s. It has since been modified and extended by others. SMILES is generally considered to have the advantage of being slightly more human-readable than InChI.

The IUPAC International Chemical Identifier (InChI), developed by IUPAC and NIST, is a digital equivalent of the IUPAC name for any particular covalent compound. Chemical structures are expressed in terms of five layers of information — connectivity, tautomeric, isotopic, stereochemical, and electronic.

TYPES OF SMILES:: Canonical SMILES and Isomeric SMILES

The term Canonical SMILES refers to the version of the SMILES specification that includes rules for ensuring that each distinct chemical molecule has a single unique SMILES representation. A common application of Canonical SMILES is for indexing and ensuring uniqueness of molecules in a database.
The term Isomeric SMILES refers to the version of the SMILES specification that includes extensions to support the specification of isotopes, chirality, and configuration about double bonds. A notable feature of these rules is that they allow rigorous partial specification of chirality.

Graph-based definition of SMILES

In terms of a graph-based computational procedure, SMILES is a string obtained by printing the symbol nodes encountered in a depth-first tree traversal of a chemical graph.
(Depth-first search is a graph traversal algorithm: it follows each branch as far as possible before backtracking.)
The chemical graph is first trimmed to remove hydrogen atoms and cycles are broken to turn it into a spanning tree. Where cycles have been broken, numeric suffix labels are included to indicate the connected nodes. Parentheses are used to indicate points of branching on the tree.
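
The procedure described above can be sketched in a few lines of Python. This is only an illustrative toy, not the algorithm used by any particular toolkit: the function name write_smiles and the input format (a list of heavy-atom symbols plus a list of bonds) are assumptions made for the example. It assumes hydrogens are already trimmed, all atoms belong to the organic subset, the graph is connected, and there are fewer than ten ring closures.

```python
def write_smiles(atoms, bonds, start=0):
    """Write a SMILES string for a hydrogen-suppressed molecular graph.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j, symbol), symbol being "" (single), "=" or "#"
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j, symbol in bonds:
        neighbors[i].append((j, symbol))
        neighbors[j].append((i, symbol))

    visited = set()
    ring_bonds = set()                      # bonds removed to break cycles
    children = {i: [] for i in neighbors}   # edges of the spanning tree
    labels = {i: [] for i in neighbors}     # ring-closure labels per atom

    def explore(atom, parent):
        """Depth-first search that separates tree bonds from ring bonds."""
        visited.add(atom)
        for other, symbol in neighbors[atom]:
            if other == parent:
                continue
            if other not in visited:
                children[atom].append((other, symbol))
                explore(other, atom)
            elif frozenset((atom, other)) not in ring_bonds:
                ring_bonds.add(frozenset((atom, other)))
                number = str(len(ring_bonds))
                labels[other].append(number)          # ring opens here
                labels[atom].append(symbol + number)  # and closes here

    def emit(atom, symbol=""):
        """Print the atom, its ring labels, then its branches."""
        text = symbol + atoms[atom] + "".join(labels[atom])
        kids = children[atom]
        for kid, kid_symbol in kids[:-1]:
            text += "(" + emit(kid, kid_symbol) + ")"   # side branch
        if kids:
            text += emit(*kids[-1])                     # main chain continues
        return text

    explore(start, None)
    return emit(start)


cyclohexane = write_smiles(["C"] * 6, [(i, (i + 1) % 6, "") for i in range(6)])
print(cyclohexane)  # C1CCCCC1
```

Starting the traversal at a different atom gives a different but equally valid string, which is why a separate canonicalisation step is needed when a unique representation is required.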


Atoms are represented by the standard abbreviation of the chemical elements, in square brackets, such as [Au] for gold. The hydroxide anion is [OH-]. Brackets can be omitted for the "organic subset" of B, C, N, O, P, S, F, Cl, Br, and I. All other elements must be enclosed in brackets. If the brackets are omitted, the proper number of implicit hydrogen atoms is assumed.

· For instance, the SMILES for water is simply O and that for ethanol is CCO (i.e. the hydrogens are trimmed off).

· The double-bonded carbon dioxide is represented as O=C=O and the triple-bonded hydrogen cyanide as C#N.
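
To make the implicit-hydrogen rule concrete, here is a minimal sketch that counts the hydrogens implied for each atom. It handles only unbranched, ring-free strings of organic-subset atoms (enough for the examples above), and the table of normal valences and the name implicit_hydrogens are assumptions of this example rather than part of any library.

```python
import re

# Usual valences assumed for bracket-free organic-subset atoms.
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
    "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}
BOND_ORDER = {"-": 1, "=": 2, "#": 3}
TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[-=#]")


def implicit_hydrogens(smiles):
    """Return [(symbol, implicit H count), ...] for a linear SMILES."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("unsupported SMILES for this sketch: " + smiles)

    symbols, used = [], []   # used[i] = sum of bond orders on atom i
    pending = 1              # a bond is single unless a symbol says otherwise
    for token in tokens:
        if token in BOND_ORDER:
            pending = BOND_ORDER[token]
            continue
        if symbols:
            used[-1] += pending      # bond from the previous atom...
            used.append(pending)     # ...to this one
        else:
            used.append(0)
        symbols.append(token)
        pending = 1

    result = []
    for symbol, bonds in zip(symbols, used):
        # Fill up to the smallest normal valence that fits the explicit bonds.
        target = next((v for v in NORMAL_VALENCES[symbol] if v >= bonds), bonds)
        result.append((symbol, target - bonds))
    return result


print(implicit_hydrogens("O"))    # [('O', 2)], i.e. H2O
print(implicit_hydrogens("CCO"))  # [('C', 3), ('C', 2), ('O', 1)], i.e. CH3-CH2-OH
print(implicit_hydrogens("C#N"))  # [('C', 1), ('N', 0)], i.e. H-C#N
```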

Branches are described with parentheses, as in propionic acid and fluoroform:

· CCC(=O)O for propionic acid

· C(F)(F)F for fluoroform, which could also be described by the non-canonical formula FC(F)F.

· Cyclohexane is represented as C1CCCCC1, the idea being that the two 'number ones' label the same position in the molecule, thus forming a ring with six carbons. Note that the label is the numeral (in this case the 1) rather than the combination 'C1'.

· Aromatic C, O, S and N atoms are shown in lower case as 'c', 'o', 's' and 'n' respectively. Bonds in an aromatic cycle are rarely marked explicitly except in SMARTS search patterns. Thus benzene is c1ccccc1.
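
The branch and ring rules above can be read back into a graph with a short parser. The sketch below is a simplified illustration, not a full SMILES reader: it accepts only bracket-free atoms, single-digit ring labels, and the bond symbols - = # :, and the names parse_smiles and neighborhood_signature are invented for this example. The signature is a quick necessary (not sufficient) check that two strings describe the same molecule.

```python
import re

TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]|[-=#:]|[()]|\d")
BOND_SYMBOLS = ("-", "=", "#", ":")


def parse_smiles(smiles):
    """Parse a bracket-free SMILES into (atoms, bonds).

    atoms: list of symbols; bonds: list of (i, j, symbol), where symbol
    is "" when no bond symbol was written.
    """
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("unsupported SMILES for this sketch: " + smiles)

    atoms, bonds = [], []
    previous = None      # atom the next atom will be attached to
    branch_points = []   # atoms to return to when a ")" is reached
    open_rings = {}      # ring label -> (atom index, bond symbol)
    pending = ""         # bond symbol waiting for its second atom

    for token in tokens:
        if token == "(":
            branch_points.append(previous)
        elif token == ")":
            previous = branch_points.pop()
        elif token in BOND_SYMBOLS:
            pending = token
        elif token.isdigit():
            if token in open_rings:                 # second use closes the ring
                partner, symbol = open_rings.pop(token)
                bonds.append((partner, previous, symbol or pending))
            else:                                   # first use opens it
                open_rings[token] = (previous, pending)
            pending = ""
        else:
            atoms.append(token)
            index = len(atoms) - 1
            if previous is not None:
                bonds.append((previous, index, pending))
            previous = index
            pending = ""

    if branch_points or open_rings:
        raise ValueError("unbalanced parentheses or ring labels")
    return atoms, bonds


def neighborhood_signature(smiles):
    """Sorted list of (atom, its bonded neighbours); order-independent."""
    atoms, bonds = parse_smiles(smiles)
    around = [[] for _ in atoms]
    for i, j, symbol in bonds:
        around[i].append(symbol + atoms[j])
        around[j].append(symbol + atoms[i])
    return sorted((a, tuple(sorted(n))) for a, n in zip(atoms, around))


ring_atoms, ring_bonds = parse_smiles("C1CCCCC1")
print(len(ring_atoms), len(ring_bonds))   # 6 6
print(neighborhood_signature("C(F)(F)F") == neighborhood_signature("FC(F)F"))  # True
```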


SMARTS is a modification of SMILES that allows, in addition to the SMILES elements, the specification of wildcard (*) atoms and bonds. This is used in specifying search structures and is widely used in chemical database search applications. This practice has led to a common misconception that chemical substructure search is achieved computationally by matching SMILES/SMARTS strings, when, in fact, it is achieved by the computationally more intensive search for subgraph isomorphism in the graphs reconstructed from the SMILES representations.
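
The difference between string matching and graph matching can be shown with a brute-force sketch. It tries every assignment of pattern atoms to molecule atoms, which is only practical for tiny graphs; real search systems use far more efficient subgraph-matching algorithms. The function name find_substructure and the graph representation are assumptions of this example, and only the atom wildcard (*) is supported.

```python
from itertools import permutations


def find_substructure(pattern_atoms, pattern_bonds, atoms, bonds):
    """Return one mapping {pattern index: molecule index}, or None.

    Atoms are element symbols ("*" in the pattern matches any atom);
    bonds are (i, j) index pairs.
    """
    bonded = {frozenset(bond) for bond in bonds}
    for candidate in permutations(range(len(atoms)), len(pattern_atoms)):
        # Every pattern atom must match the symbol of its assigned atom.
        labels_ok = all(
            p in ("*", atoms[m]) for p, m in zip(pattern_atoms, candidate)
        )
        # Every pattern bond must exist between the assigned atoms.
        if labels_ok and all(
            frozenset((candidate[i], candidate[j])) in bonded
            for i, j in pattern_bonds
        ):
            return dict(enumerate(candidate))
    return None


# Ethanol, written CCO: atoms 0 and 1 are carbons, atom 2 is the oxygen.
ethanol_atoms = ["C", "C", "O"]
ethanol_bonds = [(0, 1), (1, 2)]

# The pattern "OC" (an oxygen bonded to a carbon) is not a substring of "CCO"...
print("OC" in "CCO")                                                      # False
# ...but it is a subgraph of ethanol.
print(find_substructure(["O", "C"], [(0, 1)], ethanol_atoms, ethanol_bonds))  # {0: 2, 1: 1}
```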




Some example SMARTS patterns:

Pattern           Meaning
cc                any pair of attached aromatic carbons
c:c               aromatic carbons joined by an aromatic bond
c-c               aromatic carbons joined by a single bond (e.g. biphenyl)
O                 any aliphatic oxygen
[O;H1]            simple hydroxy oxygen
[O;D1]            1-connected (hydroxy or hydroxide) oxygen
[O;D2]            2-connected (etheric) oxygen
[F,Cl,Br,I]       the first four halogens
[N;R]             must be aliphatic nitrogen AND in a ring
[c,n&H1]          any aromatic carbon OR H-pyrrole nitrogen
[c,n;H1]          (aromatic carbon OR aromatic nitrogen) AND exactly one H
*!@*              two atoms connected by a non-ring bond
*@;!:*            two atoms connected by a non-aromatic ring bond
[C,c]=,#[C,c]     two carbons connected by a double or triple bond
[CH2]             aliphatic carbon with two hydrogens (methylene carbon)
[!C;R]            (NOT aliphatic carbon) AND in a ring
[n;H1]            H-pyrrole nitrogen

All SMILES expressions are also valid SMARTS expressions, but the semantics change because SMILES describes molecules whereas SMARTS describes patterns. The molecule represented by a SMILES string is usually, but not always, matched by the same string when used as a SMARTS.

Other 'linear' notations include:

1. Wiswesser Line Notation (WLN)

2. ROSDAL

3. SYBYL Line Notation (SLN), from Tripos Inc.

4. InChI, recently introduced by IUPAC as a standard for formula representation.


21:30 18th may
