Block 4 – Exercise solution

1. create a database schema for loading TAIR kinase genes from the file ‘TAIR10_kinases.txt’ (generated in the linux2 exercise) and for the probes and their matching TAIR identifiers from the file ‘A-AFFY-2.adf.txt’

First you will need to prepare your data files for loading into the database

  • Remove the version number from the TAIR ids in the kinases file, and print only the unique rows into a new file.

$ sed -r ‘s/\.[1-9]\t/\t/’ TAIR10_kinases.txt | sort -u > uniq_kinases.txt

  • Load only columns 3 and 4 from the AFFY file starting from the rows which contain the arabidopsis gene identifiers. Replace the lower case ‘t’ and ‘g’ to upper case , to make it match the gene names in the AFFY file (use ‘grep’ for printing rows with TAIR identifiers, ‘sed’ for replacing lower case with upper, and ‘cut’ for printing only columns # 3 and 4) . print the output to a new file.

$ grep ‘At[0-9]g’ A-AFFY-2.adf.txt | cut -f3,4 | sed -r ‘s/At/AT/’ | sed -r ‘s/g/G/’ > affy_probes.txt

  • Log into the psql terminal.

$bioinfo@ubuntu:~/bioinfo_course/Data$ psql -d my_database

  • create one table for the TAIR gene identifiers and their function from the unique kinases file.

CREATE TABLE gene (
gene_id serial PRIMARY KEY,
gene_name character varying (32) NOT NULL,
gene_function text
);

  • create a second table for TAIR gene identifiers and their matching probe names from the parsed AFFY file

CREATE TABLE gene_probe (
gene_probe_id serial PRIMARY KEY,
gene_name character varying (32) NOT NULL,
probe_name character varying (32) NOT NULL
);

2. Copy the data from the files into your database

  • use \copy command to populate each table from the relevant data file

my_database=> BEGIN;
my_database=> \copy gene (gene_name, gene_function) from ‘uniq_kinases.txt’

—check input:

SELECT * FROM gene;

—if this looks good do a commit
my_database=> COMMIT;

my_database=> BEGIN;
my_database=> \copy gene_probe (probe_name, gene_name) from ‘affy_probes.txt’

—check input:

SELECT * FROM gene_probe;

—if this looks good do a commit

my_database=> COMMIT;

3. Query your database

  • How many rows do you have in the kinase genes table?

my_database=> SELECT count(*) FROM gene ;

—Output should be:

count
——-
1351
(1 row)

  • How many rows do you have in the gene probes table?

my_database=> select count(*) from gene_probe;
count
——-
22591
(1 row)

  • How many rows in the kinase gene table have a gene-identifier match in the probes table?

my_database=> select count(*) from gene join gene_probe using (gene_name) ;
count
——-
1187
(1 row)

  • How many rows in the probes table do not have a match in the kinase gene table?

my_database=> SELECT COUNT(*) FROM gene_probe LEFT JOIN gene USING (gene_name) WHERE gene_id IS NULL;
count
——-
21406
(1 row)



Optional – advanced

1. Add new tables to your schema for the following data

  • The 12 experiment names and their description from the E-GEOD-30093.sdrf.txt file
  • The 12 expression data files, containing probe names, expression value, p-score.
  • Prepare the experiment names, and the expression data files for loading. Since the files have a header row, we can remove it, or change the delimiter to comma instead of tab (the SQL COPY command can ignore a header row in CSV format (comma delimited)).

$ cut -f1,2,3 E-GEOD-30093.sdrf.txt | sed -r ‘s/ 1\t/\t/’ | sed -r ‘s/\t/,/’ > experiment.csv

  • Prepare the expression data files for loading by adding the experiment name in the first column (do this for files GSM744714 – GSM744725)

$ awk ‘{ print “GSM744725\t”$0}’ GSM744725_sample_table.txt | sed -r ‘s/\t/,/g’> gsm744725.csv

$ awk ‘{ print “GSM744725\t”$0}’ GSM744724_sample_table.txt | sed -r ‘s/\t/,/g’> gsm744724.csv

–create the tables:
BEGIN:

CREATE TABLE experiment (
experiment_id serial PRIMARY KEY,
experiment_name character varying (64) NOT NULL,
sample_description text,
sample_source character varying (255)
);

CREATE TABLE expression (
expression_id serial PRIMARY KEY,
experiment_name character varying (64) NOT NULL,
probe_name character varying (64) NOT NULL,
expression_value numeric NOT NULL,
call_type character varying (8),
p_value real NOT NULL
);

COMMIT;


2. After you created the necessary tables, load the data from the files using \copy command.

BEGIN ; \copy experiment (experiment_name, sample_description, sample_source) FROM ‘experiment.csv’ CSV HEADER
COMMIT:

BEGIN;
\copy expression (experiment_name, probe_name, expression_value, call_type, p_value) FROM ‘gsm744725.csv’ CSV HEADER
COMMIT;

BEGIN;
\copy expression (experiment_name, probe_name, expression_value, call_type, p_value) FROM ‘gsm744724.csv’ CSV HEADER
COMMIT;

  • See if you can add to the expression data table a foreign key pointing to the experiment table

BEGIN;

—add the new column
ALTER TABLE expression ADD COLUMN experiment_id integer;

—populate the column with the matching experiment_ids
UPDATE expression SET experiment_id = experiment.experiment_id FROM experiment WHERE experiment.experiment_name = expression.experiment_name;

—Add a foreign key constraint to the expression.experiment_id column
ALTER TABLE expression ADD CONSTRAINT experiment_id_fk FOREIGN KEY (experiment_id) REFERENCES experiment (experiment_id);

–drop the expression.experiment_name column. It is not necessary now that we have the foreign key set

ALTER TABLE expression DROP COLUMN experiment_name;
COMMIT;

3. Query the 2 new tables to see how many data points exist for each experiment.

SELECT experiment_name, count(expression_id) FROM experiment JOIN expression USING (experiment_id) GROUP BY experiment_name;

4. Can you write a query to find matches between expression data and matching TAIR gene identifiers?

—This example query selects gene_name, average expression_value , maximum p_value, and the gene function from the expression, gene, and gene_probe tables. Output is ordered by gene_name. The output prints only gene names that have gene_function in the gene table (use LEFT JOIN to print non-matching gene names too).

SELECT gene_probe.gene_name, avg(expression_value), max(p_value) , gene_function
FROM gene_probe
JOIN expression USING (probe_name)
JOIN gene USING (gene_name)
GROUP BY gene_name, gene_function
ORDER BY gene_name ;

Advertisements
  1. No comments yet.
  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: