Functions
Table of Contents
The advantage of functions
Functions allow you to created reusable pieces of code. In their purest form, functions take some input, do some calculation, and then return an output.
Let's write a function that, given a gene's start position, returns a set of "promoter" coordinates. We'll coarsely define this as 1,000 bases upstream and 500 bases downstream of the start position.
So, if we have a set of coordinates for a gene of interest (chr7:73,095,248-73,097,781), we can get the start of the promoter by subtracting 1,000 from 73,095,248 and the end the promoter by adding 500 to the same value. But we also need to consider the strand. Because this gene is on the negative strand, we should add 1,000 to 73,097,781 to get the start position of the promoter and subtract 500 from the same value to get the stop position.
In code, this would look like this:
## details for the specific gene genestart = 73097781 ## promoter calculation (for case of negative strand) promstart = genestart + 1000 promend = genestart - 500
Functions allow us to separate out this calculation logic from the details of a specific case, give a name to it, and then apply it throughout the code by the name.
Defining functions
Let's create a function named get_promoter. This is done using def.
def get_promoter(): pass ## Placeholder until we fill in the code
At this point, the function doesn't do anything. It doesn't take any arguments and it doesn't return any values.
A good place to start when defining a function is the docstring, which serves as documentation for what the function does.
def get_promoter(): """Return promoter of gene.""" pass ## Placeholder until we fill in the code
Next, we should declare what arguments the function receives.
def get_promoter(start, strand): """Return promoter of gene. Parameters ---------- start : int Start coordinate of gene strand : str Strand direction indicated by + or - """ pass
The arguments are specified between the parentheses (everything added between the triple quotes is documentation that, while often helpful, is not required).
Calling functions
We can use the genestart defined above and the strand as input to our new function.
get_promoter(genestart, '-')
We have called the function, but it doesn't do anything yet. Let's add in the calculation.
def get_promoter(start, strand): """Return promoter of gene. Parameters ---------- start : int Start coordinate of gene strand : str Strand direction indicated by + or - """ if strand == '+': promstart = start - 1000 promend = start + 500 elif strand == '-': promstart = start + 1000 promend = start - 500 else: print('Strand not recognized.') # REV: Why is handling a bad argument like this problematic? # What could be done instead? print(promstart, promend) # Now call it. promoter = get_promoter(genestart, '-') print(promoter)
73098781 73097281 None
Looks like it is performing the calculation, but at this point it is
only printing out the values. We know this because when we print the
return value (promoter), it is None.
To get back the result, we can modify the function to include a return statement.
def get_promoter(start, strand): """Return promoter of gene. Parameters ---------- start : int Start coordinate of gene strand : str Strand direction indicated by + or - """ if strand == '+': promstart = start - 1000 promend = start + 500 elif strand == '-': promstart = start + 1000 promend = start - 500 else: print('Strand not recognized.') # REV: Why is handling a bad argument like this problematic? # What could be done instead? return promstart, promend # Now call it. promoter = get_promoter(genestart, '-') print(promoter)
(73098781, 73097281)
When we call the new function, the result is returned (because we've
added the return statement at the end) and is no longer printed out
(because we removed the print statement). The variable promoter now
contains the returned results.
Note: Due to how the return statement was specified, a tuple has
been returned. Any type can be returned. If square brackets were given
around the results (return [promstart, promend]), a list would have
been returned instead. For now, tuples can be thought of as very
similar to lists, though there are some very important differences
(particularly in terms of mutability, which we have not discussed).
Keyword arguments
In the above example, we have two required arguments: start and
strand. If we call the function without an argument, we get an
error.
get_promoter(genestart)
Traceback (most recent call last): ... TypeError: get_promoter() missing 1 required positional argument: 'strand'
Required arguments are a good thing, but, in some cases, it is useful to allow optional arguments. For example, the 1,000 bases upstream and 500 bases downstream used to define the promoter are parameters that should probably be tunable, so we can provide them as default values that can be overridden when the function is called.
def get_promoter(start, strand, upstream=1000, downstream=500): """Return promoter of gene. Parameters ---------- start : int Start coordinate of gene strand : str Strand direction indicated by + or - upstream, downstream : int Number of bases upstream/downstream of `start` used to define promoter """ if strand == '+': promstart = start - upstream promend = start + downstream elif strand == '-': promstart = start + upstream promend = start - downstream else: print('Strand not recognized.') # REV: Why is handling a bad argument like this problematic? # What could be done instead? return promstart, promend
Now we can call the function as before and get the same results, as we do in the first call below. If we choose to, we can override the default values. For example, in the second call, we bring the upstream bases down to 500. In the third call, both optional arguments are tweaked at the same time.
print(get_promoter(genestart, '-')) print(get_promoter(genestart, '-', upstream=500)) print(get_promoter(genestart, '-', upstream=500, downstream=400))
(73098781, 73097281) (73098281, 73097281) (73098281, 73097381)
In the above examples, we provide the keywords with the value, but they can also be given without the keywords as long as the order is preserved.
print(get_promoter(genestart, '-', 500, 400))
(73098281, 73097381)
In this case, however, leaving out the keywords makes the function less readable.
Done
We now have a function that is reusable throughout the code. In addition to simplifying our code, this makes it easier to modify the code later if needed because the code only has to be changed in one place. This function can also be shared with other python files. We'll talk about this more when we go over modules.
Tasks
GC content
Write a function that calculates the GC content of a sequence. Then
explore how to use assert to demonstrate that your function works as
you expect.
In [1]: calculate_gc_ratio('GCAA')
0.5
Reverse complement function
Write a function that takes the reverse complement of a DNA sequence.
It should work something like this:
In [1]: reverse_complement('ATAATCG')
CGATTAT
Amino acid translation function
Write a function that translates a DNA sequence to an amino acid sequence.
In [1]: translate('GTACCC')
VP
It's up to you to decide how you want to deal with stop codons. I've
already added a codon to amino acid dictionary in codon_mapping.py.
You can access this in your script by importing it.
Write tests (either with plain assert statements or py.test) to show that your code is working properly.