Introduction to Python (Part III): True/False Statements and Conditions

Sorry for the delay between posts. I really intend to post every week, but it's summer and I have a lot of important drinking while sunbathing by the pool to do!
But don't worry, this post is totally worth the wait!

Ok, on to the python and bioinformatics. This post builds on my last post, so I would recommend checking it out here before reading this one.

True/False Statements
In python “True” and “False” (not strings; no quotes when using them in code) are Boolean (or bool) values. Booleans are a third type of data value. In my last post I introduced strings and numbers as data types. Booleans can only be True or False.

You can store Booleans in variables
>>>A=True
>>>B=False

Boolean values are really useful in conditions and loops (if and while statements, coming up!).

Boolean Operations (also not strings- don’t use quotes)
-“or” checks if at least one of the arguments is true. “or” will only evaluate the second argument if the first argument is False (short-circuit operator)
>>>X or Y
-“and” checks to see if both statements are true. “and” will only evaluate the second argument if the first argument is True (also a short-circuit operator)
>>>X and Y
-“not” gives the opposite of the statement
>>>not X
 
Booleans also have an order of operations (like math): “not” is evaluated first, followed by “and”, and lastly “or”.
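Here is a minimal sketch of the short-circuit behaviour of “or” and “and” described above. The names x, safe_or and safe_and are made up for this example, and print is written with parentheses so the lines run in both Python 2.7 and Python 3.

```python
x = 0

# 1 / x would raise a ZeroDivisionError if python ever evaluated it.
# "or" sees that the first argument is True and skips the second one.
safe_or = (x == 0) or (1 / x > 1)

# "and" sees that the first argument is False and skips the second one.
safe_and = (x != 0) and (1 / x > 1)

print(safe_or)   # True
print(safe_and)  # False
```

Neither line crashes, which shows that the second argument was never looked at.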

You can play all kinds of “fun” logic games with Boolean values and their operations
>>>True and not True or False
returns
False
“not” is evaluated first, not True becomes False
“and” is evaluated second, True and False becomes False
“or” is evaluated third, False or False remains False
For more examples check out the screenshot below.

Screen Shot 2014-07-11 at 2.33.55 PM
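If the order of operations is hard to keep straight, parentheses make it explicit. The sketch below (variable names are made up for illustration) shows that the logic game above gives the same answer with the default grouping spelled out, and that moving the parentheses can change the result.

```python
# The expression from the post, then the same thing with the default grouping shown.
implicit = True and not True or False
explicit = (True and (not True)) or False

# Parentheses are always evaluated first, so they can override the default order.
default_order = not True or True     # read as (not True) or True
forced_order = not (True or True)

print(implicit)       # False
print(explicit)       # False
print(default_order)  # True
print(forced_order)   # False
```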

Conditions
Conditions are if/else statements. They allow you to write a script that chooses between two or more actions. Conditionals span multiple lines, so they are easier to write in a script editor such as IDLE or TextWrangler than in the basic python shell.

This also brings up the importance of whitespace in python. Whitespace is the blank space (spaces and tabs) between words, and in python the whitespace at the start of a line (the indentation) is used to structure code (this is not true of all programming languages).
When your code is not indented correctly you will get an error such as:
IndentationError: expected an indented block

In python a conditional statement is written with the if statement first, ending with a colon (i.e. if a==5:). On the next line, indented one level, is what should occur if the if statement is true (i.e. print "a equals five"). On the next line, without indenting (so that it is even with the if statement), is the else statement followed by a colon (i.e. else:). On the next line, indented one level, is what should occur if the statement is not true (i.e. print "a does not equal five").
You can add other conditions between the if and the else statement. These are elif statements. So in the above example you could add an elif statement (i.e. elif a==6:). The elif statement should also be even with the if and else statements, and the next line should be indented one level (like the lines after if and else), saying what occurs if the elif statement is true (i.e. print "a equals six").

For Example

if fav_food=='candy':
    print "I love candy!"
elif fav_food=='fruit':
    print "I love fruit!"
elif fav_food=='vegetables':
    print "I'm a weirdo!"
else:
    print "I live on sunshine and air!"

Remember: one equals sign is used to assign a value to a variable, while two equals signs are used to test for equality.

***Just learned this in the process of writing/editing this post (shout out to the awesome Cassie Ettinger, @cassetron on Twitter)
You can also write conditionals with multiple if statements
for example

if x%2==0:
    print "x is even"
if x%3==0:
    print "x is divisible by 3"
if x%4==0:
    print "x is divisible by 4"
else:
    print "x is not divisible by 4"

The second if statement will be evaluated whether or not the first if statement is true. If it was replaced with an elif statement it would only be evaluated if the first statement was false. Likewise the final else statement is only evaluated if the statement directly above it (if x%4==0) is false.
In the above example, if x=18 the code would print x is even, x is divisible by 3, and x is not divisible by 4 (the else belongs only to the last if statement). In the example below (with elif statements), if x=18 the code would only print x is even. It would not evaluate the following statements. This distinction isn’t super important (I’ve successfully written multiple scripts without knowing it) so don’t worry if it is a little confusing.
Conditional with elif statements

if x%2==0:
    print "x is even"
elif x%3==0:
    print "x is divisible by 3"
elif x%4==0:
    print "x is divisible by 4"
else:
    print "x is not divisible by 4"
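To see the difference between the two versions side by side, here is a small sketch that collects the messages in a list instead of printing them one at a time. The function names separate_ifs and elif_chain are made up for this example (functions have not been covered in these posts yet), and print is written with parentheses so it runs in both Python 2.7 and Python 3.

```python
def separate_ifs(x):
    # Hypothetical helper: every "if" is checked, whatever happened before it.
    messages = []
    if x % 2 == 0:
        messages.append("x is even")
    if x % 3 == 0:
        messages.append("x is divisible by 3")
    if x % 4 == 0:
        messages.append("x is divisible by 4")
    else:
        messages.append("x is not divisible by 4")
    return messages


def elif_chain(x):
    # Hypothetical helper: python stops at the first condition that is true.
    messages = []
    if x % 2 == 0:
        messages.append("x is even")
    elif x % 3 == 0:
        messages.append("x is divisible by 3")
    elif x % 4 == 0:
        messages.append("x is divisible by 4")
    else:
        messages.append("x is not divisible by 4")
    return messages


print(separate_ifs(18))  # ['x is even', 'x is divisible by 3', 'x is not divisible by 4']
print(elif_chain(18))    # ['x is even']
print(elif_chain(9))     # ['x is divisible by 3']
```

With separate if statements one value of x can trigger several messages; with elif only the first matching branch runs.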

For more info on if/else statements you can check out the python documentation

http://anh.cs.luc.edu/python/hands-on/3.1/handsonHtml/ifstatements.html#if-else-statements

https://docs.python.org/2/tutorial/controlflow.html

or the wiki

http://en.wikibooks.org/wiki/Python_Programming/Conditional_Statements

or this other cool site

http://www.pythonforbeginners.com/basics/python-conditional-statements


Solution to the first bioinformatics problem!

The following problem was posted at the end of the last post (here)
Write a program to count the number of each base (ATCG) and the number of ambiguities(N) in the given nucleotide sequence

Here is my solution. Remember there are lots of different ways to get to the correct answer in programming; if your syntax doesn’t match mine it isn’t necessarily wrong.

nt='TGANGCCCAGCTTGCTGGGTGGATTAG….'
print 'A ' + str(nt.count('A'))
print 'T ' + str(nt.count('T'))
print 'C ' + str(nt.count('C'))
print 'G ' + str(nt.count('G'))
print 'N ' + str(nt.count('N'))

This returns
A 291
T 258
C 281
G 416
N 24

To count the number of bases and ambiguities in the sequence, I transformed the given sequence into a string by putting it between quotes and setting it equal to nt.

“print nt.count('A')” alone would have returned the number of 'A's in the sequence, but printing the 'A ' first documents which base I was counting. However, the 'A ' is a string while nt.count('A') is an integer, and a string and an integer can’t be concatenated, so I used the str() command to change the integer to a string.
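Here is a small sketch of why str() is needed. The sequence is a short made-up one (not the sequence from the problem), the try/except lines are only there so the failed attempt does not stop the script, and print uses parentheses so it runs in both Python 2.7 and Python 3.

```python
nt = "ATGCNNATGA"            # short made-up sequence for illustration

a_count = nt.count("A")      # an integer
label = "A " + str(a_count)  # str() turns the integer into a string first

error_seen = False
try:
    broken = "A " + a_count  # string + integer is not allowed
except TypeError:
    error_seen = True

print(label)       # A 3
print(error_seen)  # True
```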

***Edit 7/15

David States (@statesdj) pointed out something I overlooked in my code. “Real” sequence data has multiple codes for different types of ambiguities (R, Y, S etc.; for all of them check out the IUPAC nucleotide ambiguity code page). To check for these you can compare the sum of the base counts to the sequence length using the len() command.
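The check could look something like the sketch below. The sequence is made up for illustration (it is not the one from the problem) and the variable names counted and other are my own; print uses parentheses so it runs in both Python 2.7 and Python 3.

```python
nt = "ATGCNNATGARYS"   # made-up sequence; R, Y and S are IUPAC ambiguity codes

# Add up the five characters the original solution counted.
counted = (nt.count("A") + nt.count("T") + nt.count("C")
           + nt.count("G") + nt.count("N"))

# Anything left over must be some other character.
other = len(nt) - counted

print(len(nt))   # 13
print(counted)   # 10
print(other)     # 3
```

If other is anything but 0, the sequence contains characters besides A, T, C, G and N.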

New Bioinformatics problem!
Create a sequence that is the complement of the given sequence
Use the sequence in this file
nt_sequence_for_complementation
Good Luck!

Introduction to Python: Variables, Arithmetic and Lists

Python Terminology and Syntax

Variable-A variable is like a box storing a piece of data, giving a specific name to a value
A variable can be a string or a number
String-a string can contain letters, numbers and symbols and it MUST be within quotes (“a” or ‘a’)
A number can be an integer or a float
Integer-an integer is a number without a decimal point
Float-a float is a number with a decimal point
Numbers can be used in mathematical expressions and are not contained within quotes
To set a variable

String              >>>candy="Twix"
Integer           >>>a=7
Float               >>>b=7.3

To change the value of a variable you can reassign it the same way you originally assigned it
>>>candy="Snickers"
>>>a=3
>>>b=2.0

one equals sign (=) assigns a value to a variable
two equals signs (==) denotes equality
an exclamation point followed by an equals sign (!=) means not equal
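A quick sketch of the three symbols in action (the variable names are made up for this example, and print uses parentheses so it runs in both Python 2.7 and Python 3):

```python
a = 7                    # one equals sign: store the value 7 in a
is_seven = (a == 7)      # two equals signs: ask "is a equal to 7?"
is_not_three = (a != 3)  # !=: ask "is a different from 3?"

a = 3                    # reassign a
still_seven = (a == 7)

print(is_seven)      # True
print(is_not_three)  # True
print(still_seven)   # False
```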

Python Arithmetic
In python you can perform mathematical operations on integers and floats

Addition (+)
>>>2+2
Returns 4
>>>2.3+3.5
Returns 5.8

Subtraction(-)
>>>4-1
Returns 3
>>>3.3-2.2
Returns 1.1 (the shell may show 1.0999999999999996 because of floating point rounding)

Multiplication (*)
>>>7*3
Returns 21
>>>3.1*4.2
Returns 13.02

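Division (/) is special in Python 2 (the version I am using)
An integer divided by an integer returns the quotient as an integer (without the decimal places), rounded down. (In Python 3 the same division returns a float, i.e. 7/3 gives 2.3333333333333335.)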
>>>7/3
Returns 2
>>>15/4
Returns 3
A float divided by a float returns a float (with the decimal places)
>>>8.6/4.4
Returns 1.9545454545454544
>>>3.3/3.3
Returns 1.0
A float divided by an integer (or an integer divided by a float) returns a float (with the decimal places)
>>>3.3/2
Returns 1.65
>>>2/3.3
Returns 0.6060606060606061
To perform division with a float and return a whole number (still a float ending in .0, also rounded down) you can use (//)
>>>3.3//3
Returns 1.0
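Because plain integer division behaves differently in Python 2 and Python 3, here is a sketch of ways to ask for the kind of division you want that give the same answer in both. The variable names are made up, and float() is a built-in command (not covered above) that converts a number to a float.

```python
whole = 7 // 3             # rounded-down division, integer result
decimal = 7 / 3.0          # one float is enough to get a float back
converted = float(15) / 4  # float() turns the integer 15 into 15.0
floored = 3.3 // 3         # still a float, but rounded down

print(whole)      # 2
print(converted)  # 3.75
print(floored)    # 1.0
```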

Exponentials (**)
>>>6**3
Returns 216
>>>6**2.0
Returns 36.0

The Modulus (%) performs division and returns the remainder
>>>36%7
Returns 1
>>>100%80
Returns 20
>>>36%8.6
Returns 1.6 (the shell may show 1.6000000000000014 because of floating point rounding)
 
For more information on using python’s mathematical functions check out the python tutorial https://docs.python.org/2/tutorial/introduction.html#using-python-as-a-calculator
 
List-A list is a variable that can hold multiple pieces of data at one time
A list can hold integers, floats and strings
You can even make lists of lists!

To create a list
>>>list_name=['string1', 'string2', 'string3']
>>>Test_scores=[96, 83, 75]
>>>Celebs =['Beyoncé', 'Jay-Z', 'Kanye']

To access a particular list item use its index-a number indicating its place in the list
>>>print Celebs[2]

Kanye

Kanye on an elephant

I bet you expected it to be Jay-Z, but in python (and, I think, most computer languages) the index starts at 0 (this is called 0-based numbering), so to print Jay-Z you would need to write >>>print Celebs[1]
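A short sketch of 0-based indexing using the same list. The variable names first, second and last are made up, and the negative index on the last line is something not covered above: it counts from the end of the list. The first comment line just tells Python 2 that the file contains an accented character.

```python
# -*- coding: utf-8 -*-
Celebs = ['Beyoncé', 'Jay-Z', 'Kanye']

first = Celebs[0]    # index 0 is the first item
second = Celebs[1]   # index 1 is the second item
last = Celebs[-1]    # -1 counts from the end of the list

print(second)       # Jay-Z
print(last)         # Kanye
print(len(Celebs))  # 3 items, so the valid indexes are 0, 1 and 2
```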

To add an item to a list
>>>list_name.append('new_item')
Now
list_name=['string1', 'string2', 'string3', 'new_item']
The new item is appended to the end of the list
>>>Celebs.append('Kim')
Now
Celebs=['Beyoncé', 'Jay-Z', 'Kanye', 'Kim']

To remove a list item
>>>list_name.remove('new_item')
Now
list_name=['string1', 'string2', 'string3']
If 'new_item' occurred multiple times in the list, only the first occurrence would be removed
>>>Celebs.remove('Beyoncé')
Now
Celebs=['Jay-Z', 'Kanye', 'Kim']

To change or replace a list item
>>>list_name[1]='new_item'
Now
list_name=['string1', 'new_item', 'string3']
This will replace the second item in the list with new_item
>>>Celebs[0]='Kris'
Now
Celebs=['Kris', 'Kanye', 'Kim']
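Here are the three list edits above run back to back as one script, plus a made-up list called repeats that shows remove() only taking out the first match. print uses parentheses so the sketch runs in both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
Celebs = ['Beyoncé', 'Jay-Z', 'Kanye']

Celebs.append('Kim')      # add to the end of the list
Celebs.remove('Beyoncé')  # remove the first item equal to 'Beyoncé'
Celebs[0] = 'Kris'        # replace whatever is now at index 0 ('Jay-Z')

print(Celebs)   # ['Kris', 'Kanye', 'Kim']

# remove() only deletes the first occurrence of a repeated item.
repeats = ['a', 'b', 'a']
repeats.remove('a')
print(repeats)  # ['b', 'a']
```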

If you only want part of the list you can use list slicing
>>>list_name[a:b]
This will return items from a up to but not including b
>>>Celebs[0:2]
Would return ['Kris', 'Kanye'] (but not 'Kim')
If the first index is unspecified python will assume the slice begins at the beginning of the list
>>>Celebs[0:2]
will return the same thing as
>>>Celebs[:2]

If the second index is unspecified python will assume the slice ends at the end of the list
You can also include a third index
>>>list_name[a:b:c]
This tells python to include list items from a to b going by c (so if c was 2 it would include every other item)
For example if you had a list of numbers
>>>Numbers=[1,3,4,7,2,9,3,8,3,9,5,6,4,8,8]
and you wanted to return every other number from the 2nd through 8th positions
>>>Numbers[1:9:2]
Would return
[3,7,9,8]

You can also use a negative third index (the step) to go backwards through the list
>>>Numbers[::-1]
Would return
[8, 8, 4, 6, 5, 9, 3, 8, 3, 9, 2, 7, 4, 3, 1]

String slicing works exactly the same way as list slicing (just replace the list_name with the string_name)
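A sketch of the same slicing rules applied to a string, plus the case where the second index is left out. The sequence seq is made up for illustration, the variable names are my own, and print uses parentheses so it runs in both Python 2.7 and Python 3.

```python
seq = "ATGCCGTT"        # made-up nucleotide sequence

first_three = seq[:3]   # start is assumed to be 0
middle = seq[1:6]       # index 1 up to, but not including, index 6
every_other = seq[::2]  # whole string, going by 2
backwards = seq[::-1]   # whole string, going by -1

print(first_three)  # ATG
print(middle)       # TGCCG
print(every_other)  # AGCT
print(backwards)    # TTGCCGTA

Numbers = [1, 3, 4, 7, 2, 9, 3, 8, 3, 9, 5, 6, 4, 8, 8]
tail = Numbers[12:]     # end is assumed to be the end of the list
print(tail)         # [4, 8, 8]
```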

To count the number of times a particular item occurs in a list you can use the list_name.count() command
>>>Numbers.count(4)
Would return
2
This command also works for strings
>>>string_name.count('a')
would return the number of times 'a' occurs in the string

For more information on lists check out the python tutorial https://docs.python.org/2/tutorial/datastructures.html

Sometimes the formatting on wordpress gets screwed up with different window sizes so I am also including screen shots (at the bottom of this post) of these commands and their results in the python shell.

Because I am using this blog as a repository for my programming notes as well as an educational/community building tool, I have decided to provide more extensive documentation of python syntax than I had originally planned. However, because I want to keep the focus of this blog on bioinformatics, I am going to try to end each post with a bioinformatics problem that is solvable with the syntax that has already been introduced. I will then post my solution to the problem at the top of the next blog post.

The first bioinformatics problem!
Write a program to count the number of each base (ATCG) and the number of ambiguities(N) in the given nucleotide sequence
nt_sequence
Don’t forget python is case sensitive
Feel free to post your solution, or any questions in the comments section.

Good Luck!

Screen Shot 2014-06-19 at 2.10.53 PM

Screen Shot 2014-06-19 at 2.11.21 PM

Screen Shot 2014-06-19 at 2.11.42 PM

Introduction to Python Part I

Apologies for the length of time between posts. I have been trying to determine the best way to ensure the focus of this blog stays on bioinformatics and doesn’t get sidetracked by programming. On that note, this post will be the first in a two part introduction to Python. I will then begin introducing small bioinformatic programs I have written, along with explanations of what the program does, other ways of solving the same problem, and why I chose that solution structure. I will also include notes on key concepts or new syntax being introduced as well as anything I learned in the process of writing the program.

What is Python?

Python is a high level programming language* known for its clear syntax and readability

*a high level programming language is a language that is strongly abstracted from the computer. It employs natural language elements that make program development simpler, and more understandable (as opposed to machine language which can present as binary 010101).

Why Python?

Python, Perl and R are the three main languages used by the bioinformaticians I have interacted with. I began with Python because it has a reputation for being easier* to learn than other languages due to its clearer syntax. R is mainly for statistical data analysis (and I will be learning the basics at a workshop next week!). Eventually I also hope to learn Perl.

*Python may be easier but it is NOT easy! Sometimes people with lots of programming experience describe it as easy (because they have totally forgotten what it is like to be a beginner), which can be really discouraging to actual beginners. Even if it is easier to learn than other programming languages, learning how to think in a programming language can be frustrating and difficult. Nothing is more discouraging than being told you are struggling with something “easy”.

 

What you need

You will need to make sure Python is installed on your computer as well as IDLE. IDLE is Python’s Integrated Development Environment. It works as a source code editor and a python interpreter graphical user interface (GUI). It’s basically a text editor with a few special features that help with writing and executing code.

If you are working on a Mac you most likely have an older version of Python installed by default, but not IDLE. Go to the Python download page and download the 2.7.7 version appropriate for your platform and operating system. I recommend downloading Python version 2.7.7 because that is what I am using, and I will be explaining syntax specific to that version. Run the installer (by clicking on it in your downloads folder) and go through the prompts (the default options should be fine).

Now that you know all about the command line, open the terminal and check if you have Python (by typing “python” and hitting enter). If you have Python it will return the version of Python you have.

Next check to see if you have IDLE (by typing “IDLE”)

If you have IDLE a shell will open listing your python version, followed by

“>>>”

Sometimes the first time you try to open IDLE after installing Python you will get an error message, but a spaceship-looking object will pop up on your dock (if you hover over it it’ll say Python). Just double click on the spaceship and the Python shell should appear.

>>> is the equivalent of $ in the command line, it lets you know IDLE is ready to do your bidding!

However, we are using IDLE because we can write programs longer than one line in it (unlike in the command line)

In the shell open a new file (file > new file) and type the following

D='Green eggs'
E='and Ham'
print D + ' ' + E

Run the program by going to Run > Run Module

You will need to save the program in order to run it. I recommend creating a new folder (Python_programs) to contain all of the programs you are about to create!

If any of this was confusing feel free to check out my video

Congratulations you have just run your first program!

You may have noticed that as you were typing some words were a different color. This is one of the advantages of using IDLE over a text editor. The program recognizes various aspects of Python syntax and colors the words accordingly. This can be super helpful when you are writing code.

These are some additional resources that may be helpful:

http://anh.cs.luc.edu/python/hands-on/3.1/handsonHtml/execution.html

http://www.artofproblemsolving.com/Wiki/index.php/Getting_Started_With_Python_Programming

http://rosalind.info/problems/ini1/

https://software.rc.fas.harvard.edu/training/scraping/install/

In my next post I will explain some basic Python commands and how to think in programming languages.

Using the Command Line

I accidentally closed the tab containing the NEARLY FINISHED original post (and it did not magically save itself), so fair warning: this post will be shorter, crankier, and less clever than it was originally intended to be.

The terminal/command line allows the user to interact with their computer without using a GUI (graphical user interface) or the mouse. Experience and comfort working in the command line is essential to bioinformatic analysis. Using it for the first time can seem intimidating, but it is fairly simple once you get started. On Mac OS and Linux the command line uses Unix-style commands. In the following post I will show you how to access the command line and introduce a few simple commands.

A few quick definitions

command line – the command line is the place where you type commands in a terminal window. It looks something like this: (colors of text and background, may vary)

Screen Shot 2014-05-30 at 2.41.32 PM

script – a short computer program. Usually computer programs are called scripts when they perform a simple function or a small number of simple functions. Scripts are often strung together into a larger program called a pipeline. Scripts are typically only run from the command line.

directory – for practical purposes, this is just another name for a folder (the same type you would encounter on your desktop or in your Finder, if you’re working on Mac OS).

** These instructions are compatible with a computer running Mac OS or linux (although if you are running linux you probably already know how to access/use the command line). Future posts will mainly cover python syntax/coding either online or in a python shell so if you are a windows user you should keep following the blog!

I did find these tutorials/resources on using the command line with a computer running Windows

http://www.cs.princeton.edu/courses/archive/spr05/cos126/cmd-prompt.html

http://www.bleepingcomputer.com/tutorials/windows-command-prompt-introduction/

http://www.computerhope.com/issues/chusedos.htm

Locating the Terminal

I have included both videos and text on locating and using the terminal because while I personally hate learning things from videos, they can be helpful in transmitting information. I recommend at least watching the video on locating the terminal ( I tried to make these videos as nonirritating as possible–they may seem a little fast but when I slowed them down they became painful to watch).

To open the terminal

Go to Go>Applications>Terminal

Using the Terminal

In the terminal you move between directories (folders) using the command “cd” (change directory) followed by the path to the directory you want to be in.

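When you type in the terminal, what you type will appear after the $. In my prompt (dhcp16-gc1:~ Madison$) the part after the colon is the current directory (~ here) and the word just before the $ is the user name. Everything after the $ is the code you are executing.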

IMPORTANT SIDE NOTE

In unix (and many/most other coding languages) spaces are significant. This is because every unix command takes the basic form:

command[space]input

For example:

“cd directory_name”

Since spaces are used to separate important parts of the command, file names with spaces can be problematic. For this reason most programmers replace spaces with underscores (_) in file and directory names to avoid screwing up their scripts (i.e. file_name)


 

dhcp16-gc1:~ Madison$ cd Desktop/Workflow/Paper

In the above example I am currently in the Madison home directory. The ~ indicates it is the home directory. To change directories I use the cd command followed by the path I want to take. I move to the Paper directory via the Desktop and then the Workflow directories. I could move through the same path in three separate steps:

dhcp16-gc1:~ Madison$ cd Desktop/

dhcp16-gc1:Desktop Madison$ cd Workflow/

dhcp16-gc1:Workflow Madison$ cd Paper/

To move up a level you can use the “cd” command followed by “..”

dhcp16-gc1:Paper Madison$ cd ..

The above command moves from the Paper directory to the Workflow directory above it

dhcp16-gc1:Workflow Madison$ cd ..

The above command moves from the Workflow directory to the Desktop directory above it

To return directly to the home directory you can use the “cd ~” command

$ cd ~

If you aren’t sure where you are, type the command “pwd” (print working directory), which will display the path of the current directory

$ pwd

To see all of the files and directories in your current directory use the “ls” command

$ ls

To specify a file name you need to either be in the directory containing that file, or you need to specify the path to the file

If I wanted to specify a file in the Paper directory (from the above examples) from anywhere on my computer, the path would be

~/Desktop/Workflow/Paper/file_name

PRO TIP

Tab complete is the best! When typing the name of a file or directory after you have typed the first few letters hit tab and the computer will fill in the rest. If multiple files or directories begin with the same letters, tab complete will fill in the letters to the point at which they diverge. If you then double tap the tab button it will list all of the directories that begin with those characters.

$ less file_name

The “less” command will display the file (listed after the command) in the terminal window, “cat” and “more” perform a similar function

$ cp file1 file2

The “cp” command will make a copy of file1 named file2, if you want file2 to be located in a different directory you must specify the path before the new file name (i.e. cp file1 ~/Documents/Blog/file2)

$ mv file1 file2

The “mv” command is similar to cp, but instead of making a second copy of the file, it renames file1 to file2 (and moves it if you specify a path)

$ head file_name

The “head” command displays the first ten lines of the file

$ tail file_name

The “tail” command displays the last ten lines of the file

$ grep 'keyword' file_name

The “grep” command searches a file for the keyword and displays every line that contains it on the terminal screen (putting the keyword in single quotes, i.e. 'keyword', is the safest habit and is required if it contains spaces or special characters)

$ grep -c 'keyword' file_name

The “grep -c” command counts the number of lines in the file that contain the keyword and displays that number on the terminal screen (a line with the keyword in it twice is only counted once)
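One thing to watch for: grep -c reports the number of lines that contain the keyword, which can be lower than the number of times the keyword appears. The sketch below imitates that difference in Python (which is covered in later posts). The list of lines stands in for a made-up file, and all of the names are my own.

```python
# Made-up file contents, one string per line of the file.
lines = ["ATG ATG", "CCC", "ATG"]
keyword = "ATG"

line_count = 0    # what grep -c would report
occurrences = 0   # how many times the keyword appears in total

for line in lines:
    if keyword in line:
        line_count = line_count + 1
    occurrences = occurrences + line.count(keyword)

print(line_count)   # 2
print(occurrences)  # 3
```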

$ whatis command

The “whatis” command followed by a command (i.e. whatis ls) will return a brief description of the command

Here is a cheat sheet of Unix commands and their meanings

Screen Shot 2014-05-30 at 2.11.42 PM

Screen Shot 2014-05-30 at 2.12.17 PM

This tutorial walks you through different unix commands

http://www.ee.surrey.ac.uk/Teaching/Unix/

The Turner lab at University of Virginia also has a great unix/command line tutorial pdf on their blog Getting Genetics Done

Ian Korf at the UC Davis Genome Center also has an excellent unix and perl primer here

This also looked like a useful resource

http://lifehacker.com/5633909/who-needs-a-mouse-learn-to-use-the-command-line-for-almost-anything

Thanks to the awesome Hannah Holland-Moritz for her help writing and editing this post!

Hello World

Welcome to Bioinformatics for beginners!

As a beginner myself, I am creating this blog as a place to organize what I have learned (and am learning) about coding and as a resource for other novice coders. If there are any terms/jargon/concepts you don’t understand, feel free to contact me either by leaving a comment below or via twitter (@MDunitz)

Although I am sometimes shocked by how far I have come in such a short time, I am doing my best to direct this blog towards myself a year ago, a girl who didn’t know an operating system from a…well basically I didn’t know what an operating system was. In this post I am going to summarize what I have learned so far and how/where I learned it. If there are terms you don’t understand don’t worry! I promise I will explain them in future posts.

A little about me

I graduated with a degree in microbiology and political science from UC Davis in December of 2013. I have always been vaguely interested in coding/knowing more about computers but sort of in the same way I was interested in learning French–it would be useful and cool for someday.  In January I began working full time in the Eisen lab at the UC Davis Genome Center.  I work on a variety of fascinating (at least to me) projects using bioinformatics to study microbes in the built environment. For more information check out:

http://microbe.net/

http://phylogenomics.blogspot.com/

Currently I am working on a workflow/methods paper From Swab to Publication: a comprehensive workflow for microbial genome sequencing. The goal of this paper is to make sequencing and de novo assembly of genomes, as well as basic bioinformatics more accessible to undergraduates and smaller labs.

I am lucky enough to work with an amazing group of scientists. My coworkers range in “computer literacy” from novice bioinformaticians like myself to PhDs designing amazing bioinformatic tools/pipelines for the scientific community (such as A5 and phylosift) and all of them have been happy to assist me in my introduction to bioinformatics.

My experience with coding prior to this past winter was limited to the occasional analysis in STATA (a data analysis/statistics program) for political science classes. I began learning the command line in order to utilize QIIME (a tool for comparing and analyzing microbial communities). I also began learning python from Codecademy (which I highly recommend, although be warned it is insanely addictive; you may find yourself staying up until three or four in the morning for “just one more level”). I then did python village on Rosalind and I am currently working on the bioinformatics stronghold.

I am not entirely sure how I will organize this blog, but I am hoping to explain the basics of the command line, and then review the python I learned in Codecademy and Rosalind and potentially check out the python tutorial.

Good Luck!