Regular expressions

When regular expressions are useful

We've already encountered a situation where regular expressions would have been very useful. Remember when we wanted to get the gene name from a line in a GTF file? The line with the gene name had a bunch of fields separated by semicolons.

gene_id "ENSG00000081386.8"; [ ... ] gene_name "ZNF510"; [ ... ]

At the time, we approached this using splitting. (I'm going to snip out a lot of the other fields to make it more readable, but the process was the same.)

info = 'gene_id "ENSG00000081386.8"; gene_name "ZNF510"'
gene_name_field = info.split(';')[1]
gene_name = gene_name_field.split()[1]
gene_name = gene_name[1:-1]  # strip out quotes
print(gene_name)
ZNF510

That's a lot of work just to get to the gene name. Regular expressions provide a nice solution to this problem because they allow you to match a pattern and pull out the information that you're interested in.

Defining a regular expression

In order to pull out the gene name, we need to define the match pattern. Defining regular expressions can be a little confusing because the syntax is dense. We are not going to go into the details of the syntax; we'll just cover what we need to extract the gene name. This will give you an idea of what regular expressions can do, so that you recognize situations where they may be helpful.

If we look at the string above, the thing that distinguishes the gene name is the text "gene_name". We will create a pattern that matches this.

pattern = r'gene_name "\w+"'

The first part of that pattern should make sense: we are creating a match for "gene_name". \w is a special sequence that means match a word character, which for plain ASCII text is defined as upper and lower case letters, digits, and the underscore. (See the documentation for Python's re module for a list of the special sequences.) This should cover the possible elements of a gene name (except for maybe a hyphen). The + means match one or more instances of the thing it follows (in this case, \w). A related symbol is *, which means match zero or more instances of the thing it follows.
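
To make the difference between + and * concrete, here is a small sketch. The test strings (an empty pair of quotes and the hyphenated gene name HLA-DRB1) are made up for illustration and are not taken from the GTF file, and the character class [\w-] is just one possible way to allow hyphens.

```python
import re

# + needs at least one word character between the quotes; * accepts none.
empty = 'gene_name ""'
plus_match = re.search(r'gene_name "\w+"', empty)
star_match = re.search(r'gene_name "\w*"', empty)
print(plus_match)            # None
print(star_match.group(0))   # gene_name ""

# \w does not match a hyphen, so \w+ cannot span a hyphenated name.
hyphenated = 'gene_name "HLA-DRB1"'
word_only = re.search(r'gene_name "\w+"', hyphenated)
# Assumption for illustration: a character class that adds the hyphen to \w.
with_hyphen = re.search(r'gene_name "[\w-]+"', hyphenated)
print(word_only)             # None
print(with_hyphen.group(0))  # gene_name "HLA-DRB1"
```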

The only thing left to explain in the pattern is the r that comes before the string. This marks the string as a raw string and tells Python not to interpret \ as special. For example, recall that a newline is represented as "\n". Even though you type two characters, the \ indicates that Python should treat the n as a newline, so the string consists of one newline character. If you instead wrote an r before the string, the \ would no longer have its special meaning, and the string would consist of two characters: a \ and an n. This is useful with regular expressions because the syntax makes heavy use of \. Putting r before the string prevents Python from trying to do anything special to the \ characters before they are passed to the regular expression function, which will then interpret them and apply any special meaning to the characters.
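
A quick way to convince yourself of the difference is to compare the lengths of the two strings. This is only a small sketch; the variable names are made up for illustration.

```python
# Without the r prefix, Python turns \n into a single newline character.
escaped = "\n"
# With the r prefix, the backslash and the n are kept as two characters.
raw = r"\n"

print(len(escaped))   # 1
print(len(raw))       # 2
# A raw string is the same as writing the backslash twice in a normal string.
print(raw == "\\n")   # True
```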

Matching patterns

Now that we have defined that pattern, we can search for it in the string.

import re

match = re.search(pattern, info)
print(match)
<re.Match object; span=(29, 47), match='gene_name "ZNF510"'>

The search function returns a match object (or None if a match isn't found). From the output, we see that we have an object, but how do we access the match? If you are in IPython, you can type match.<TAB> to see your options. Use a ? after some of them to get more information. The description of match.group? seems promising. It tells us that giving 0 as an argument will return the entire match. Let's try.

print(match.group(0))
gene_name "ZNF510"

Well, still not exactly what we want. It would be nice to get the gene name that is inside the quotes.

Pulling out groups

The trick is putting parentheses around the text we want to capture.

pattern = r'gene_name "(\w+)"'

Now we can access the captured text as a group. We know from the previous matching example that passing 0 to match.group gives the entire matched string. After this default group, there are as many additional groups as there are pairs of parentheses in the pattern. The first pattern had no parentheses, so it had no additional groups. The second pattern defines one group with parentheses, so there is one more group in addition to the default group that corresponds to the entire match.

match = re.search(pattern, info)
print(match.group(0))
print(match.group(1))
gene_name "ZNF510"
ZNF510
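
A pattern can contain more than one set of parentheses. As a sketch (this two-group pattern is only an illustration, not something we need for the gene name), each set gets the next group number, counted from left to right, and match.groups() returns all of the captured groups at once.

```python
import re

field = 'gene_name "ZNF510"'
# Two sets of parentheses: group 1 is the field name, group 2 is the value.
two_groups = re.search(r'(\w+) "(\w+)"', field)
print(two_groups.group(0))  # gene_name "ZNF510"
print(two_groups.group(1))  # gene_name
print(two_groups.group(2))  # ZNF510
print(two_groups.groups())  # ('gene_name', 'ZNF510')
```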

Here is the complete example of extracting the gene name.

import re

info = 'gene_id "ENSG00000081386.8"; gene_name "ZNF510"'
pattern = r'gene_name "(\w+)"'
gene_name = re.search(pattern, info).group(1)
print(gene_name)
ZNF510
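
Because re.search returns None when the pattern is not found, calling .group(1) directly on the result fails with an AttributeError for a line that has no gene name. Here is a minimal sketch of one way to guard against that. The function name get_gene_name and the choice to return None for a missing name are assumptions made for illustration.

```python
import re

GENE_NAME_PATTERN = r'gene_name "(\w+)"'


def get_gene_name(info):
    """Return the gene name found in info, or None if there is none."""
    match = re.search(GENE_NAME_PATTERN, info)
    if match is None:
        # No gene_name field in this string, so there is no group to read.
        return None
    return match.group(1)


print(get_gene_name('gene_id "ENSG00000081386.8"; gene_name "ZNF510"'))  # ZNF510
print(get_gene_name('gene_id "ENSG00000081386.8"'))  # None
```

Whether to return None, raise an error, or skip the line depends on what the rest of your program expects.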

Tasks

Extract Ensembl IDs from a GTF file

Write a program that uses a regular expression to extract the gene ID from the lines in data/gencode-v10-50random.gtf. Only extract the part of the gene ID before the decimal point. You should wrap the ID extraction in a function and write tests to show that it works. You can decide whether you would prefer to print the extracted gene IDs to the screen or write them to an output file.

Find each base that follows an insertion

Write a function that uses a regular expression to find each base that follows an insertion in the sequence. Write tests to demonstrate that your function works.

In [1]: bases_after_insert('AACT-CGGCA-AGAT')
['C', 'A']

Released under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
