Regular expressions
When regular expressions are useful
We've already encountered a situation where regular expressions would have been very useful. Remember when we wanted to get the gene name from a line in a GTF file? The line with the gene name had a bunch of fields separated by ";".
gene_id "ENSG00000081386.8"; [ ... ] gene_name "ZNF510"; [ ... ]
At the time, we approached this using splitting. (I'm going to snip out a lot of the other fields to make it more readable, but the process was the same.)
info = 'gene_id "ENSG00000081386.8"; gene_name "ZNF510"'
gene_name_field = info.split(';')[1]
gene_name = gene_name_field.split()[1]
gene_name = gene_name[1:-1]  # strip out quotes
print(gene_name)
ZNF510
That's a lot of work just to get to the gene name. Regular expressions provide a nice solution to this problem because they allow you to match a pattern and pull out the information that you're interested in.
Defining a regular expression
To pull out the gene name, we need to define the match pattern. Defining regular expressions can be a little confusing because the syntax for creating them is dense. We are not going to go into the details of the syntax; we'll just cover what we need to extract the gene name. This will give you an idea of what regular expressions can do, so that you recognize situations where they may be helpful.
If we look at the string above, the thing that distinguishes the gene name is the text "gene_name". We will create a pattern that matches this.
pattern = r'gene_name "\w+"'
The first part of that pattern should make sense: we are creating a
match for "gene_name". \w is a special sequence that means match a
word character, which covers upper and lower case letters, digits,
and the underscore. (The documentation for Python's re module lists
the other special sequences.) This
should cover the possible elements of a gene name (except for maybe a
hyphen). The + means match one or more instances of the thing it
follows (in this case, \w). A related symbol is *, which means
match zero or more instances of the thing it follows.
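To make the difference between + and * concrete, here is a small sketch. It uses re.search, which is introduced in the next section, and a helper called matches that is made up for this illustration. The empty and hyphenated strings are test inputs chosen for the example, not lines from the GTF file.

```python
import re

pattern_plus = r'gene_name "\w+"'  # one or more word characters inside the quotes
pattern_star = r'gene_name "\w*"'  # zero or more word characters inside the quotes


def matches(pattern, text):
    """Hypothetical helper: True if the pattern is found anywhere in text."""
    return re.search(pattern, text) is not None


print(matches(pattern_plus, 'gene_name "ZNF510"'))    # True
print(matches(pattern_plus, 'gene_name ""'))          # False: + needs at least one character
print(matches(pattern_star, 'gene_name ""'))          # True: * accepts an empty name
print(matches(pattern_plus, 'gene_name "HLA-DRB1"'))  # False: the hyphen is not a word character
```

The last line shows the hyphen caveat in practice: a name containing a hyphen is not matched by \w+ alone.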
The only thing left to explain in the pattern is the r that comes
before the string. This classifies the string as a raw string and
tells Python not to interpret \ as special. For example, recall that
a newline is represented as "\n". Even though you type two
characters, the \ tells Python to treat the n as a
newline, so the string consists of a single newline character. If
you instead wrote an r before the string, the \ would no longer
have its special meaning, and the string would consist of two
characters: a \ and an n. This is useful with regular expressions
because the syntax makes heavy use of \. Putting r before the
string prevents Python from trying to do anything special to the \
characters before they are passed to the regular expression function,
which will then interpret them and apply any special meaning to the
characters.
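A quick way to convince yourself of this is to compare the lengths of the two strings. The variable names below are just for illustration.

```python
plain = '\n'   # Python turns \n into a single newline character
raw = r'\n'    # the r prefix keeps the backslash and the n as two characters

print(len(plain))     # 1
print(len(raw))       # 2
print(raw == '\\n')   # True: without r, the backslash itself has to be escaped
```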
Matching patterns
Now that we have defined that pattern, we can search for it in the string.
import re
match = re.search(pattern, info)
print(match)
<_sre.SRE_Match object; span=(29, 47), match='gene_name "ZNF510"'>
The search function returns a match object (or None if
a match isn't found). From the output, we see that we have an object
(Python 3.7 and later label it re.Match rather than _sre.SRE_Match),
but how do we access the match? If you are in IPython, you can type
match.<TAB> to see your options. Use a ? after some of them to get
more information. The description of match.group? seems promising.
It tells us that giving 0 as an argument will return the entire
match. Let's try.
print(match.group(0))
gene_name "ZNF510"
Well, still not exactly what we want. It would be nice to get the gene name that is inside the quotes.
Pulling out groups
The trick is putting parentheses around the text we want to capture.
pattern = r'gene_name "(\w+)"'
Now we can access the captured text as a group. We know from the
previous matching example that passing 0 to match.group gives the
entire matched string. After this default group, there are as many
additional groups as there are sets of parentheses in the pattern. So in the first
pattern, there were no additional groups. In the second pattern we created, we
defined a group with parentheses, so there is one more group in addition
to the default group that corresponds to the entire match.
match = re.search(pattern, info)
print(match.group(0))
print(match.group(1))
gene_name "ZNF510"
ZNF510
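The same idea extends to more than one set of parentheses. The sketch below uses a made-up string with a second field, gene_type, purely for illustration; the names info2, pattern2, and match2 are not part of the earlier example.

```python
import re

# Illustrative string with two fields (not the line used above).
info2 = 'gene_name "ZNF510"; gene_type "protein_coding"'

# Two sets of parentheses, so two groups in addition to group 0.
pattern2 = r'gene_name "(\w+)"; gene_type "(\w+)"'

match2 = re.search(pattern2, info2)
print(match2.group(0))  # the entire match
print(match2.group(1))  # ZNF510
print(match2.group(2))  # protein_coding
print(match2.groups())  # ('ZNF510', 'protein_coding')
```

Groups are numbered from left to right in the order their opening parentheses appear, and match.groups() returns all of the captured groups as a tuple.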
Here is the complete example for extracting the gene name.
import re
info = 'gene_id "ENSG00000081386.8"; gene_name "ZNF510"'
pattern = r'gene_name "(\w+)"'
gene_name = re.search(pattern, info).group(1)
print(gene_name)
ZNF510
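One thing to watch for: because re.search returns None when nothing matches, calling .group(1) directly on its result raises an AttributeError for lines without a gene name. A slightly more defensive sketch wraps the lookup in a function. The name get_gene_name and the choice to return None for missing names are assumptions made for this example.

```python
import re

GENE_NAME_PATTERN = r'gene_name "(\w+)"'


def get_gene_name(info):
    """Return the gene name found in info, or None if there isn't one."""
    match = re.search(GENE_NAME_PATTERN, info)
    if match is None:
        # No gene_name field in this string.
        return None
    return match.group(1)


print(get_gene_name('gene_id "ENSG00000081386.8"; gene_name "ZNF510"'))  # ZNF510
print(get_gene_name('gene_id "ENSG00000081386.8"'))                      # None
```

Wrapping the extraction in a function also makes it easier to test, which will be handy for the tasks below.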
Tasks
Extract Ensembl IDs from a GTF file
Write a program that uses a regular expression to extract the gene ID from the lines in data/gencode-v10-50random.gtf. Only extract the part of the gene ID before the decimal point. You should wrap the ID extraction in a function and write tests to show that it works. You can decide whether you would prefer to print the extracted gene IDs to the screen or write them to an output file.
Find each base that follows an insertion
Write a function that uses a regular expression to find each base that follows an insertion (marked by a "-") in the sequence. Write tests to demonstrate that your function works.
In [1]: bases_after_insert('AACT-CGGCA-AGAT')
['C', 'A']