Block 4: Exercise

Databases and SQL exercise

  • Before you start, make sure you have all the prerequisites listed in this class page 

Exercise has 3 options:

  1. Use provided data (see details bellow)
  2. Design your own database schema with your own data
  3. Load the data provided, or your own data, in a GUI program (MS Access , Open Office Base)

Option #1
Load data provided at the course into a postgres database

Remember the database you created before the class with the CREATE DATABASE command?

You will need to connect to that database ($psql -d my_database)

1. create a database schema for loading TAIR kinase genes from the file ‘TAIR10_kinases.txt’ (generated in the linux2 exercise) and for the probes and their matching TAIR identifiers from the file ‘A-AFFY-2.adf.txt’

First you will need to prepare your data files for loading into the database

    • Remove the version number from the TAIR ids in the kinases file, and print only the unique rows into a new file.
    • Load only columns 3 and 4 from the AFFY file starting from the rows which contain the arabidopsis gene identifiers. Replace the lower case ‘t’ and ‘g’ to upper case , to make it match the gene names in the AFFY file (use ‘grep’ for printing rows with TAIR identifiers, ‘sed’ for replacing lower case with upper, and ‘cut’ for printing only columns # 3 and 4) . print the output to a new file.
  • Log into the psql terminal
  • create one table for the TAIR gene identifiers and their function from the unique kinases file.
  • create a second table for TAIR gene identifiers and their matching probe names from the parsed AFFY file

2. Copy the data from the files into your database

  • use \copy command to populate each table from the relevant data file

3. Query your database

  • How many rows do you have in the kinase genes table?
  • How many rows do you have in the gene probes table?
  • How many rows in the kinase gene table have a gene-identifier match in the probes table?
  • How many rows in the probes table do not have a match in the kinase gene  table?
Optional – advanced
  1. Add new tables to your schema for the following data
    • The 12 experiment names and their description from the E-GEOD-30093.sdrf.txt file
    • The 12 expression data files, containing probe names, expression value, p-score.
    • Prepare the experiment names, and the expression data files  for loading. Since the files have a header row, we can remove it, or change the delimiter to comma instead of tab (the SQL COPY command can ignore a header row in CSV format (comma delimited)).
    • Prepare the expression data files for loading by adding the experiment name in the first column (do this for files GSM744714 – GSM744725)
  2.  After you created the necessary tables and data files, load the data from the files using \copy command.
    • See if you can add to the expression data table a foreign key pointing to the experiment table
  3.  Query the 2 new tables  to see how many data points exist for each experiment.
  4. Can you write a query to find matches between expression data and matching TAIR gene identifiers?

Option #2
Use your own data
    • Design a database schema to accomodate your own data. Format your data files for loading into the database.
    • Query your database to test your data (you can use the questions from option #1 as a reference)

Option #3
Use a GUI program (e.g. MS Access or Open Office Base)
  • Design a schema as described in options #1 or #2.
  • Load data into your database schema.
  • Do some queries to check the data.

Block 4: Exercise solution

  1. No comments yet.
  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: