Useful functions and commands

Thursday, July 23, 2015

Notes on decrypting dbGaP datasets


1. Install aspera
    download from http://downloads.asperasoft.com/downloads (select aspera connect and then show all installers)
2. Download using aspera
    working exampole download gtex to tarbell 11/25/214 (this didn’t work, cri-syncmon.cri.uchicago.edu may work)
"/home/him1/.aspera/connect/bin/ascp" -QTr -l 300M -k 1 -i "/home/him1/.aspera/connect/etc/asperaweb_id_dsa.openssh" -W AB6D9CB19B62747EC89082FF7954E5F901C737E21A8D44CC3B2D4D2E94D0AA1E7C79C059D9E704B36AF81B523E9EA81C4D dbtest@gap-upload.ncbi.nlm.nih.gov:data/instant/haeky2/41316 .

3. download manifest

4. get repository key

5. Decrypt (modified on 4/15/2015)
    - download decryption tools from dbGaP
    - run ./vdb-config -i from the sratoolkit bin directory
    - press 4 and import repository key (prj_6981.ngc, this will create a working folder  /Users/haky/ncbi/dbGaP-6981/)
    - from the working directory created by the ugly vdb-config run vbd-decrypt
  > cd  /Users/haky/ncbi/dbGaP-6981/
  >  ~/bin/sratoolkit.2.4.5-2-mac64/bin/vdb-decrypt /Volumes/Pegasus/im-lab/nas40t2/haky/Data/dbGaP/GTEx/43830 

working examples:
5052 is the number of the project that requested the data access (5052=Phenotype Prediction IRB)
43831 is the download request id







>cd /Users/haky/ncbi/dbGaP-5052/
>~/bin/sratoolkit.2.4.5-2-mac64/bin/vdb-decrypt /Volumes/Pegasus/im-lab/nas40t2/Data/dbGaP/Framingham/44795/










I should clean it up but the instructions on decrypting were helpful.

Saturday, October 4, 2014


Print-to-PDF mysteries revealed

Wednesday, August 27, 2014

my linux cheat sheet


in random order

chown -R haky:imlab folder_name/

Sunday, August 24, 2014

How to get orthologous genes between two species


http://www.ensembl.info/blog/2009/01/21/how-to-get-all-the-orthologous-genes-between-two-species/

Tuesday, August 19, 2014

Datamash: simple stat functions directly to text file


Datamash examples
http://www.gnu.org/software/datamash/examples/

Friday, July 11, 2014

Adding letter head to latex file


\documentclass{article}
\usepackage{wallpaper}
\ULCornerWallPaper{1}{letterheadfile.pdf}
\begin{document}
  Your text on the letterhead
\end{document}

via http://tex.stackexchange.com/questions/837/pdf-letterhead-as-document-background

Thursday, July 3, 2014

Imputation QC and server



http://genome.sph.umich.edu/wiki/IMPUTE2:_1000_Genomes_Imputation_Cookbook

https://imputationserver.sph.umich.edu/start.html#!pages/help

Tuesday, July 1, 2014

R script to query UCSC database - RMySQL



## this will let you install R packages in a directory where you have access
.libPaths('~/Rlibs/') ## create directory ~/Rlibs if it doesnt exist
install.packages(RMySQL)

## call the RMySQL library
library(RMySQL) 

## connect to UCSC server
channel = dbConnect(MySQL(), user="genome", host="genome-mysql.cse.ucsc.edu")

## for a list of tables check url below
## http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/

## convenience function to run SQL queries
query <- function(...) dbGetQuery(channel, ...)

## use hg19 database in UCSC to use
query("USE hg19")

## see table content
query("DESCRIBE snpArrayAffy5")

## get number of rows from table
print( query("select count(*) from snpArrayAffy6") )

## get number of rows from table
tempo6 = query("select * from snpArrayAffy6")

Thursday, June 26, 2014

Analysis of PubMed search results using R


http://freshbiostats.wordpress.com/2013/12/03/analysis-of-pubmed-search-results-using-r/

Mining genomic databases with R


http://freshbiostats.wordpress.com/2013/02/23/mining-genomic-databases-with-r/

Tuesday, June 24, 2014

Statistical Genetics/Genomics Intro Material

http://genomicsclass.github.io/book/ (On github by Rafael Irizarry and Michael Love)

http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1002822

http://www.ploscollections.org/article/browseIssue.action?issue=info:doi/10.1371/issue.pcol.v03.i11

http://www.nature.com/nrg/series/gwas/index.html


Data analysis for genomics

https://www.edx.org/course/harvardx/harvardx-ph525x-data-analysis-genomics-1401#.U6m5uxappuY


Python
https://www.coursera.org/course/pythonlearn


http://www.mooc-list.com/course/6001x-introduction-computer-science-and-programming-using-python-edx


Linux Basic Commands 
https://wiki.uchicago.edu/display/CRIwksp/Introduction+to+Linux+Command+Line+for+Bioinformatics

R intro UCLA
http://www.ats.ucla.edu/stat/r/seminars/intro.htm

R Basic Vocabulary
http://adv-r.had.co.nz/Vocabulary.html

Coursera Data Specialization Courses
http://gettinggeneticsdone.blogspot.com/2014/01/coursera-specializations-data-science.html

Friday, June 20, 2014

Inverse normalization function


my.invnorm = function(x)
{
 res = rank(x)
 res = qnorm(res/(length(res)+0.5))
 return(res)
}

Thursday, May 15, 2014

drop repeated vars keeping first row for given var



## function
## drop repeated vars but keeping first row for given var
droprep = function(df,repvar,sortvar)
{
    tab = table(df[,repvar])
    nametab = names(tab)
    uniquelist = nametab[tab==1]
    replist =  nametab[tab>1]
    ind = df[,repvar] %in% uniquelist
    res = df[ind,]
    tempo = df[!ind,]
    tempo = tempo[order(tempo[,sortvar]),]
    for(vv in replist)
    {
        rowa = tempo[tempo[,repvar]==vv,][1,]
        res = rbind(res,rowa)
    }
    return(res)
}

Monday, March 24, 2014

How to create MySQL database with Entrez Gene Database Description and Files


http://jura.wi.mit.edu/entrez_gene/

"Entrez Gene is NCBI's repository for gene-specific information. Access to this information either through the Entrez Gene website or by flat files via NCBI's ftp site can be time consuming and limiting in regards to the number of and what questions you can ask about the data. A better solution for intense data mining is to create a relational database. 
We offer our MySQL based database and data loading script as an easy-to-implement solution to this problem. While the ER diagram describes the database we created, we also offer the SQL syntax for both the tables and indexes. The data loading script, which is written in Perl, will automatically download Entrez Gene data files, parse the data and load it into the MySQL database."

Sunday, March 23, 2014

How to handle gzip files without decompressing


gzip -cd file.gz | head

Thursday, March 20, 2014

How to start interactive session on tarbell


qsub -I

Tuesday, March 18, 2014

How to install virtualenv


http://iamzed.com/2009/05/07/a-primer-on-virtualenv/

Thursday, March 6, 2014

How to start shiny application


Fire up R/RStudio, make sure you have the shiny package installed, and type

   library(shiny)
   runApp('~/vcs/rcg/haky')

where you replace '~/vcs/rcg/haky' with the path to the directory containing the ui.R, server.R and screen.tiff files, and the data subdirectory.  It should fire right up; you can then stop it by sending R an interrupt.

Monday, February 24, 2014

How to post R packages on cran


Here's the procedure for reference: upload the .tar.gz file, using ‘anonymous’ as log-in name and your e-mail address as password, to ftp://CRAN.R-project.org/incoming/ (note: use ‘ftp’ and not ‘sftp’ to connect to this server, and passive ‘ftp’ is more often successful) and send a message to CRAN@R-project.org about it (with your package name and version in the subject line of the form “CRAN submission package version”, and please do not submit a package by email). For a new submission, please note in the message that you have read and agreed to the CRAN policies.

Wednesday, February 19, 2014

How to access dbgap authorized files

As of 2/19/2014, I could log in directly through https://dbgap.ncbi.nlm.nih.gov/ using the Era Commons ID

Before I had to do the following steps

1. Log into Era Commons

https://public.era.nih.gov/(you need an Era Commons ID)

2. Goto https://dbgap.ncbi.nlm.nih.gov/

3. Click log in

You should be in then.

Tuesday, July 24, 2012

ftp mget interactive mode off


> prompt
> mget * .

Friday, April 20, 2012

python script to convert plink bed/bim to qtdt format


import os

sdir = '/hapmap/'
basebedname = 'hapmap3'
ddir = '/erlotinib/'
temponame = 'tempo'
phenoname = 'erlotinib'

os.system('plink --bfile ' + sdir + basebedname + ' --recode12 --out ' + ddir + temponame )

## join tempo.ped with phenodata.txt 
## and generate pheno.ped file

shortfilename = ddir +  "phenodata.txt" ## FID,IID,dad,mom,sex,pheno1,...
longfilename = ddir +  temponame + ".ped" ## FID,IID,dad,mom,sex,pheno,snp1,...
outfilename = ddir + phenoname + ".ped"

index = {}
shortfile= open( shortfilename, "r" )
for row in shortfile:
    part = row.split(None,2)
    iid = part[1] ## IID is the second column in phenofile
    index[iid] = row.rstrip()
    ## rstrip gets rid of \n and other white spaces

shortfile.close()

outfile = open( outfilename, "w")
longfile= open( longfilename, "r" )

for row in longfile:
    part = row.split(None,6)
    iid = part[1]
    restofline = part[6]
    if iid in index:
        pheno = index[iid]
        outfile.write( pheno + ' ' + restofline)

longfile.close()
outfile.close()

## generate pheno.dat file

infilename = ddir + temponame + '.map'
outfilename = ddir + phenoname + '.dat'

infile = open(infilename,'r')
outfile = open(outfilename,'w')


## write covariates and phenotype names
outfile.write('C igrowth\n')
outfile.write('T logAUC\n')

## write one row per snp
for linea in infile:
 part = linea.split(None,2)
 newlinea = 'M ' + part[1] + '\n'
 outfile.write(newlinea)

infile.close()
outfile.close()

Friday, April 13, 2012

create new theme for ggplot

from: http://sape.inf.usi.ch/quick-reference/ggplot2/themes

theme_complete_bw <- function(base_size = 12) {
  structure(list(
    axis.line =         theme_blank(),
    axis.text.x =       theme_text(size = base_size * 0.8 , lineheight = 0.9, colour = "black", vjust = 1),
    axis.text.y =       theme_text(size = base_size * 0.8, lineheight = 0.9, colour = "black", hjust = 1),
    axis.ticks =        theme_segment(colour = "black"),
    axis.title.x =      theme_text(size = base_size, vjust = 0.5),
    axis.title.y =      theme_text(size = base_size, angle = 90, vjust = 0.5),
    axis.ticks.length = unit(0.15, "cm"),
    axis.ticks.margin = unit(0.1, "cm"),
 
    legend.background = theme_rect(colour=NA), 
    legend.key =        theme_rect(fill = NA, colour = "black", size = 0.25),
    legend.key.size =   unit(1.2, "lines"),
    legend.text =       theme_text(size = base_size * 0.8),
    legend.title =      theme_text(size = base_size * 0.8, face = "bold", hjust = 0),
    legend.position =   "right",
 
    panel.background =  theme_rect(fill = NA, colour = "black", size = 0.25), 
    panel.border =      theme_blank(),
    panel.grid.major =  theme_line(colour = "black", size = 0.05),
    panel.grid.minor =  theme_line(colour = "black", size = 0.05),
    panel.margin =      unit(0.25, "lines"),
 
    strip.background =  theme_rect(fill = NA, colour = NA), 
    strip.text.x =      theme_text(colour = "black", size = base_size * 0.8),
    strip.text.y =      theme_text(colour = "black", size = base_size * 0.8, angle = -90),
 
    plot.background =   theme_rect(colour = NA, fill = "white"),
    plot.title =        theme_text(size = base_size * 1.2),
    plot.margin =       unit(c(1, 1, 0.5, 0.5), "lines")
  ), class = "options")
}

Thursday, April 12, 2012

# cdf: cd's to frontmost window of Finder (by Yarko)

copy the following in ~/.bash_profile

# cdf: cd's to frontmost window of Finder
cdf ()
{
    currFolderPath=$( /usr/bin/osascript <<"    EOT"
        tell application "Finder"
            try
                set currFolder to (folder of the front window as alias)
            on error
                set currFolder to (path to desktop folder as alias)
            end try
            POSIX path of currFolder
        end tell
    EOT
    )
    # ignore the cd if it's my desktop
    if [ "$currFolderPath" != "$HOME/Desktop/" ]
    then
        echo "cd to \"$currFolderPath\""
        cd "$currFolderPath"
    else
        echo "no finder path to cd to..."
    fi
}

Friday, February 3, 2012

how to get linux command line behave like emacs

set -o emacs

Thursday, February 2, 2012

python script extract variants from large file based on smaller file

#small script to extract variants from a large file based on a smaller file (by Vasa)

largeChromosomes = {}
largePositions = {}
largeFile = open('ref1k.rs.snps.txt','r')
for line in largeFile:
       line = line.split('\t')
       chrm = line [0]
       rs = line[1]
       pos = line[2]
       largeChromosomes[rs] = chrm
       largePositions[rs] = pos

smallFile = open('reck.txt','r')
fileOut = open('reck.1kpositions.txt','w')
for line in smallFile:
       line = line.split('\t')
       rs = line[1]
       largeChrom = largeChromosomes[rs]
       largePos = largePositions[rs]
       lineOut = '     '.join( [rs, largeChrom, largePos] )
       print >> fileOut, lineOut



Below is a more stylish script that does the same:
create a file named select.py, make it excutable and run
> python select.py -s [name|number] selector_file -c [name|number] large_file -o output_file


#!/usr/bin/env python

###
# March 29th, 2011
# Yarko Tymciurak
# yarkot@uchicago.edu
#
# Research Computing Group
# Biostatistics Lab
# University of Chicago
####


'''
Given two files, select (output) lines of the second (data) file which contain one of
the data items in the first (selector) file.
'''
from __future__ import print_function
from collections import namedtuple
from optparse import OptionParser
import sys

if sys.version_info[0] < 3 and sys.version_info[0] == 2 and sys.version_info[1] < 6:
    print( 'This program needs a minimum of Python version 2.6 to run.  You are using:\n%s' % sys.version, file=sys.stderr)
    exit(1)
LINEEND = '\r\n'   # for rstrip
# FIELDSEP = None    # for split

error_log = sys.stderr

def main():
    useage = '''useage:
    %prog [-s [name|number] selector_file [-c [name|number] data_file [-o output_file]"
    
    Given two input files, the values of a specified column of
     a selector file will be used to select output.
    Data in the specified "compare" column of the data_file will
     be checked against the selector data.
    If a compare data item from the data_file exists in the selector set,
     then that data_file row (line) will be output.
    
    You may also specify an output file.
     
    - If a field name is given, the first line of the file
      is assumed to contain field names.
    - If a field number is given, the n-th field is used (left-to-right,
      beginning with 1); all lines are assumed to be data
    - Fields are assumed to be TAB separated.
    
    
    example:
    % %prog -s IID relationships.txt -c 2 some.ped -o filtered.ped
    
    This will use the "IID" column from relationships.  Any row in "some.ped"
    which has column-2 data matching an IID data will be output to "filtered.ped"
    
    This is equivalent:
    % %prog -sIID relationships.txt -c2 some.ped -ofiltered.ped
    
    '''
    parser = OptionParser(useage)

    parser.add_option( "-s", "--select-column", dest="select_column",
                       help="column from which to build selection matching data")
    parser.add_option( "-c", "--compare-column", dest="compare_column",
                       help="column to compare against selection criteria")
    parser.add_option( "-o", "--output", dest="output",
                       help="output file for selected data")
    parser.add_option( "-p", "--separator", dest="FIELDSEP",
                       help="set field separator (each string will be one separator).\n\
                       default separator: any successive whitespace")

    (options,args) = parser.parse_args()
    output = options.output
    select_column = proper_type(options.select_column)
    compare_column = proper_type(options.compare_column)
    compare_asindex = type(compare_column) is int
    
    global FIELDSEP
    
    FIELDSEP = options.FIELDSEP # I depend on this being 'None' when not set


    if (select_column is None) or (compare_column is None):
        error_log.write("Need to specify select column (-s) and compare column (-c).\nTry -h option for help.\n")
        exit(1)
    if len(args)<2:
        parser.print_help()
        exit(1)

    fname = args[0]
    fp = sys.stdin if fname == '-' else must_open(fname)
    fo = sys.stdout if output is None else must_open(output, 'w')
    selectors = get_selectors(fp, select_column)
    if type(select_column) is int:
        select_column -= 1  # adjust for zero-based array indexing

    for fname in args[1:]:
        global n
        n = 0
        fp = sys.stdin if fname == '-' else must_open(fname)
        line = fp.readline()
        # assume first line is a header line, if named data field:
        if not compare_asindex:
            dat = namedtuple("dat", line.rstrip(LINEEND).split(FIELDSEP))
            line = fp.readline()
        else:
            maxsplit = compare_column
            compare_column -= 1   # i.e. 2nd column => row[1]
        while not (line == ''):
            n+=1
            # split the line fields
            if compare_asindex:
                line_element = line.rstrip(LINEEND).split(FIELDSEP, maxsplit)
            else:
                drow = dat(*line.rstrip(LINEEND).split(FIELDSEP))
                line_element = drow._asdict()
            
            # output if the specified data field is in the selector
            try:
                if line_element[compare_column] in selectors:
                    fo.write(line)
            except Exception as e:
                print( e, "\nMaybe you need to specify an alternate field separator to process your data?",
                       file=sys.stdout);
                return
            line = fp.readline()


def get_selectors(fp, column):
    '''
    using the global option, select a column from the file, and return a dict (hash performance) with all the values
    '''
    selectors = {}
    global FIELDSEP
    # Get column option:
    # - if numeric, split the fields and use number as an index,
    # - else use named.tuples, and use the first line as the header
    asindex = type(column) is int
    if not asindex:
        # get the first line, which we expect to be the column header
        line = fp.readline()
        hdr = namedtuple("hdr", line.rstrip(LINEEND).split(FIELDSEP))
    else: # can only limit splits on numerics
        maxsplit = column
        column -= 1   # i.e. 2nd column => row[1]
    # now go thru the file:
    line = fp.readline()
    while not (line==''):
        # prepare so we index correctly either way:
        if asindex:
            row = line.rstrip(LINEEND).split(FIELDSEP, maxsplit)
        else:
            drow = hdr(*line.rstrip(LINEEND).split(FIELDSEP))
            row = drow._asdict()
        selectors[row[column]]=''   # we only care about key, not value
        line = fp.readline()
    return selectors

def must_open(f, c='r'):
    try:
        fp = open(f,c)
    except IOError as e:
        error_log.write('Unable to open %s : %s\n' % (f,e))
        exit(2)
    return fp

def proper_type(i):
    return int(i) if (i and i.isdigit()) else i

if __name__=='__main__':
    main()
##----------

Wednesday, January 11, 2012

Useful blog "getting genetics done"

http://gettinggeneticsdone.blogspot.com/

Friday, January 6, 2012

latexdiff - mark up differences between latex files

Determine and mark up significant differences between latex files.

http://www.ctan.org/tex-archive/support/latexdiff

This perl script will generate a latex file with changes marked up (similar to Word) from two latex files

latexdiff-so fileold.tex filenew.tex > dif.tex

ps2pdf to all ps files in current directory

for x in $(ls *.ps); do echo $x; ps2pdf $x $(echo $x|sed -e 's/ps$/pdf/');done

or more efficiently according to Yarko's comment

for x in *.ps; do
echo $x;
ps2pdf $x ${x/%ps/pdf};
done

Thursday, June 3, 2010

mixed effects model in R

Book
http://lme4.r-forge.r-project.org/book/

Slides
http://www.stat.wisc.edu/~bates/PotsdamGLMM/LMMD.pdf

Friday, March 19, 2010

awk: verify same number of columns

## prints number of columns if it changed from previous line

> awk -F"\t" 'NF != oldnf { print NF; oldnf = NF }' filename |head

|head added just in case file is very long

## using a script

> awk -f script.awk filename | head

script.awk:

BEGIN {
 FS="\t"
 oldnf = 1
}
{
  if(NF != oldnf) 
  {
 print NF
 oldnf = NF
  }
}

Other examples in 
http://sparky.rice.edu/~hartigan/awk.html

Friday, March 5, 2010

merge eps files into one

convert *.eps all.pdf

Tuesday, February 16, 2010

running R64 in aquamacs

http://www.matthewckeller.com/html/aquamacs.html

USING 64-BIT R FROM WITHIN AQUAMACS

1) Aquamacs is terrific because it comes bundled with about everything you need to
start using R with it. No need to separately download ESS (emacs speaks statistics)
or mess with the .ess-lisp files. Download Aquamacs from http://aquamacs.org/ and
install it.

2) Now any file with an “.R” extension that is opened with aquamacs will
automatically have syntax highlighting AND will allow you to start 64-bit R. I
highly recommend checking out this page, http://ess.r-project.org/, and downloading
the ESS reference card therein if you are new. Emacs has a bit of a learning curve,
but you’ll begin loving its capabilities as an aid to programming very soon.

3) To start 64-bit R from within aquamacs hit Control-u then Alt-x then “R” then
. Aquamacs will ask you for “Starting Args”. Here’s where you tell it you
want to run 64-bit R. Type “--arch=x86_64” then . A new window will open up
that is the R session.

4) That’s almost it. Just one more issue: graphing will crash (something to do with
the new R and X11). Aquamacs has X11 as the default (null) graphics device. To
change this, type in R: “options(device=”quartz”)”. Now your null graphics device is
quartz and it should all work out. If you really want to use X11 to write graphics,
you can use x11(type="Xlib")' for each new graph. What I do is to set
“options(device=”quartz”) within my Rprofile.site file.

Monday, February 15, 2010

trim white spaces

check ?gsub

Tuesday, December 15, 2009

pvalue plot vs uniform; FDR

qqunif  = function(p,BH=T,...)
{
  nn = length(p)
  xx =  -log10((1:nn)/(nn+1))
  plot( xx,  -sort(log10(p)),
     xlab=expression(Expected~~-log[10](italic(p))),
        ylab=expression(Observed~~-log[10](italic(p))),
       ... )
  abline(0,1,col='gray')
  if(BH)
    {
      lines( xx, -log10(0.05*(1:nn)/(nn+1)), col='red')
      lines( xx, -log10(0.10*(1:nn)/(nn+1)), col='orange')
      lines( xx, -log10(0.25*(1:nn)/(nn+1)), col='yellow')
      legend('bottomright', c("FDR = 0.05","FDR = 0.10","FDR = 0.25"),
             col=c('red','orange','yellow'),lty=1, cex=1.4)
    }
}

Tuesday, December 8, 2009

imagesc in CDAT

The imagesc function in MATLAB creates plots that have color bars scaled to a user-specified range. This allows you to simply create a whole series of plots that can be easily intercompared.

The following function imitates imagesc:

    import vcs
    def imagesc(x, mini, maxi):
        t = vcs.init()
        a = t.getboxfill('quick')
        a.level_1 = mini
        a.level_2 = maxi
        t.plot(x, a)

x is the input array and mini and maxi define the range in values to plot. The function establishes a VCS canvas, imports the "quick" boxfill graphics method, sets the level range to mini and maxi, and plots x using this customized graphics method. 


other tips for CDAT/Python check http://www.johnny-lin.com/cdat_tips/

Thursday, December 3, 2009

likelihood ratio test

> logLik.test
function(fit0,fit)
{
  l0 = logLik(fit0)
  l1 = logLik(fit)
  lratio = abs(as.numeric(l0)-as.numeric(l1))
  df = abs( attr(l1,'df') - attr(l0,'df') )
  pval = 1-pchisq( 2*lratio, df)
  return(pval)
}

log likelihood from lm, glm, polr


> fit = lm(y ~ x)
> logLik(fit)
'log Lik.' -36.4 (df=4)
> attr(logLik(fit),'df')
[1] 4

Wednesday, November 25, 2009

R - How to save graphs in files

" R-wiki on how to save graphs in file "

Tuesday, November 17, 2009

how to embed a pdf or ppt on a blog

get the code from

https://docs.google.com/viewer

Monday, October 26, 2009

plots boxplot, stripchart and mean

## CAN BE REPLACED BY qplot in ggplot2 package

qplot(x,y,geom=c("boxplot","jitter"))


## plots boxplot, stripchart and mean

boxjitter = function(x,y=NULL,border='gray',jitter=.1,vertical=T,...)
{
  if(!is.null(y))
  {
    boxplot(y~x,border=border,ylim = range(y,na.rm=T),outline=F,...)
    stripchart(y~x,vertical=vertical,method="jitter",jitter=jitter,add=T)
    meanvec = tapply(y,x,mean.na)
    points(1:length(meanvec),meanvec,pch="M",col='blue')
  } else
  {
    x = x[!is.na(x)]
    boxplot(x,border=border,ylim = range(x),outline=F,...)
    stripchart(x,vertical=vertical,method="jitter",jitter=jitter,add=T)
    points(1,mean.na(x),pch="M",col='blue')
  }
}

Thursday, October 22, 2009

calculate number of minor alleles from A G format

num.minal = function(a1,a2)
{
  aall = c(a1,a2)
  tab = table(aall)
  miname = names(tab)[tab==min(tab)]
  apply(cbind(a1==miname,a2==miname),1,sum)  
}

plink coding --recodeAD


plink --file data --recodeAD

which, assuming C is the minor allele, will recode genotypes as follows:

     SNP       SNP_A ,  SNP_HET
     ---       -----    -----
     A A   ->    0   ,   0
     A C   ->    1   ,   1
     C C   ->    2   ,   0
     0 0   ->   NA   ,  NA

R table, percentages and chi square test like in stata



table.all = function(...)
{
  args = list(...)
  dnn = names(args)
  x = args[[1]]
  y = args[[2]]
  print(table(x,y,dnn=dnn))
  print("percentage by row")
  print( round( prop.table(table(x,y,dnn=dnn),1) * 100 ) )
  print("percentage by col")
  print( round( prop.table(table(x,y,dnn=dnn),2) * 100 ) )
  print(chisq.test(table(x,y,dnn=dnn)) )
}

Friday, October 2, 2009

order data frames in R

say data has read and prog as colums

data[order(read, prog), ]

this will order first by read then by prog

data[order(-read), ]

- reverses the order

** there can be conflicts if read is a variable outside of the data frame. Use data[order(data[,1]),] assuming read is column 1

from: http://www.ats.ucla.edu/stat/R/faq/sort.htm

Thursday, October 1, 2009

list names from data that contain charname

## list names from data frames that contain charname

grepnames = function(charname,data)
  {
    names(data)[grep(charname,names(data))]
  }

Friday, September 25, 2009

power calculation in SAS

proc power; 
  twosamplefreq test=pchi 
  groupproportions = (.5034  .6459) 
  power = .
  groupns = ( 1494 373)
  alpha = 0.00005;
run;

proc power; 
      twosamplesurvival test=logrank 
         gexphs = 0.6733 | 0.4155
         accrualtime = .00001
   follouptime = 1 
         groupns = 200 | 200 
   grouplossexphazards = (0 0)
         power = .; 
proc power; 
      twosamplesurvival test=logrank 
         curve("Control")   = (1):(.45) 
         curve("Treatment") = (1):(.66) 
         groupsurvival = "Control" | "Treatment" 
   GROUPLOSSEXPHAZARDS = (.2231 .2231)
   /*grouplossexphazards = (0,0)*/
         accrualtime = .00001
         follouptime = 1 
         groupns = 120 | 120 
         power = .; 
run;

proc power; 
      twosamplesurvival test=logrank 
         curve("Control")   = (1):(.51) 
         curve("Treatment") = (1):(.66) 
         groupsurvival = "Control" | "Treatment" 
   GROUPLOSSEXPHAZARDS = (.2231 .2231)
   /*grouplossexphazards = (0,0)*/
         accrualtime = .00001
         follouptime = 1 
         groupns = 100 | 100 
         power = .; 
run;

proc power; 
      twosamplesurvival test=logrank 
         curve("Control")   = (1):(.45) 
         curve("Treatment") = (1):(.70) 
         groupsurvival = "Control" | "Treatment" 
   GROUPLOSSEXPHAZARDS = (.2231 .2231)
   /*grouplossexphazards = (0,0)*/
         accrualtime = .00001
         follouptime = 1 
         groupns = 200 | 200 
         power = .; 
run;

how to get rid of emac's automatic switching of _ into ->

Solution by Paul R. (thanks!)

M-x
set-variable 
ess-S-assign 
"-"  

That's it.

Monday, April 13, 2009



File Management in Python 
( Page 1 of 5 )

File management is a basic function, and an important part of many applications. Python makes file management surprisingly easy, especially when compared to other languages. Peyton McCullough explains the basics.
 
Introduction
 
The game you played yesterday uses files to store game saves. The order you placed yesterday was saved in a file. That report you typed up this morning was, obviously, stored in a file as well.
 
File management is an important part of many applications written in nearly every language. Python is no exception to this. In this article, we will explore the task of manipulating files using several modules. We'll read, write to, append and do other strange things to files. Let's get started.
 
Reading and Writing
 
The most basic tasks involved in file manipulation are reading data from files and writing data to files. This is a very simple task to learn. Let's open a file for writing: 

fileHandle = open ( 'test.txt', 'w' ) 

The "w" indicates that we will be writing to the file, and the rest is pretty simple to understand. The next step is to write data to the file: 

fileHandle.write ( 'This is a test.\nReally, it is.' ) 

This will write the string "This is a test." to the file's first line and "Really, it is." to the file's second line. Finally, we need to clean up after ourselves and close the file: 

fileHandle.close() 

As you can see, it's very easy, especially with Python's object orientation. Note that when you use the "w" mode to write to the file again, all its contents will be deleted. To get past this, use the "a" mode to append data to a file, adding data to the bottom: 

fileHandle = open ( 'test.txt', 'a' )
fileHandle.write ( '\n\n\nBottom line.' )
fileHandle.close() 

Now let's read our file and display the contents: 

fileHandle = open ( 'test.txt' )
print fileHandle.read()
fileHandle.close() 

This will read the entire file and print the data within it. We can also read a single line in the file: 

fileHandle = open ( 'test.txt' )
print fileHandle.readline() # "This is a test."
fileHandle.close() 

It is also possible to store the lines of a file into a list: 

fileHandle = open ( 'test.txt' )
fileList = fileHandle.readlines()
for fileLine in fileList:
   print '>>', fileLine
fileHandle.close() 

When reading a file, Python's place in the file will be remembered, illustrated in this example: 

fileHandle = open ( 'test.txt' )
garbage = fileHandle.readline()
fileHandle.readline() # "Really, it is."
fileHandle.close() 

Only the second line is displayed. We can, however, get past this by telling Python to resume reading from a different position: 

fileHandle = open ( 'test.txt' )
garbage = fileHandle.readline()
fileHandle.seek ( 0 )
print fileHandle.readline() # "This is a test."
fileHandle.close() 

In the above example, we tell Python to continue reading from the first byte in the file. Thus, the first line is printed. We can also request Python's place within the file: 

fileHandle = open ( 'test.txt' )
print fileHandle.readline() # "This is a test."
print fileHandle.tell() # "17"
print fileHandle.readline() # "Really, it is." 

It is also possible to read the file a few bytes at a time: 

fileHandle = open ( 'test.txt' )
print fileHandle.read ( 1 ) # "T"
fileHandle.seek ( 4 )
print FileHandle.read ( 1 ) # "T" 

When working with Windows and Macintosh, sometimes you are required to read and write files in binary mode, such as images or executional files. To do this, simply append "b" to the file mode: 

fileHandle = open ( 'testBinary.txt', 'wb' )
fileHandle.write ( 'There is no spoon.' )
fileHandle.close()
 
fileHandle = open ( 'testBinary.txt', 'rb' )
print fileHandle.read()
fileHandle.close() 


#write to file
f = open('c:/temp/workfile','w')
f.write('hallo')
f.close
 
#open from file
f = open('file.txt')
 
#read linewise
for i in f:
    print i
 
#read characterwise
for i in f.read():
    print i

Friday, July 13, 2007

myfunctions.r


## logit/expit
logit = function(x,b=1) log(x/(b-x))
expit = function(x,b=1) b/(1+exp(-x))

## matlab equivalent of repmat in R
repmat = function(X,m,n){
  ## tiles matrix X, m by n times
  mx = dim(X)[1]
  nx = dim(X)[2]
  matrix(t(matrix(X,mx,nx*n)),mx*m,nx*n,byrow=T)
}

acfi = function(x, lag.max = NULL,
         type = c("correlation", "covariance", "partial"),
         plot = TRUE,  demean = TRUE,...)
{
acf(x, lag.max = lag.max,
         type = type,
         plot = plot, na.action = na.pass, demean = TRUE, ...)
}

"%&%" <- function(a, b) paste(a, b, sep="")


rangi = function(x){
 range(x, na.rm = T)
}

cori = function(x,y=NULL){
cor(x, y = y, use = "complete.obs")
}


## myoptim
## minimizes fun using optim with nelder mead
## algorithm. 

myoptim = function(param,fun,epsi=1e-5,hess=F){
  dife = 10
  eini = fun(param)
  while(dife>epsi){
    res=optim(param,fun,hessian=hess)
    dife = abs(eini-res$value)
    param=res$par
    eini=res$value
  }
  res
}

##======================
## mean squared error sum((x-y)^2)/n
mse <- function(x,y) {
  n <- length(x)
  sum((x-y)^2)/n}
                                                                                
rmse <- function(x,y) sqrt(mse(x,y))
##======================
regress <- function(form,plot=F){
  fit=lm(form)
  fitsum=summary(fit);
  print(fitsum$coeff);
  print(c("R squared","residual se"))
  print(round(c(fitsum$r.squared,rmse(fit$res,0)),3))
  }

## Great circle distance from library fields, default is km.
gcdist <- function (loc1, loc2, miles = FALSE, R = NULL) 
{
    if (is.null(R)) {
        if (miles) 
            R <- 3963.34
        else R <- 6378.388
    }
    if (missing(loc2)) 
        loc2 <- loc1
    coslat1 <- cos((loc1[, 2] * pi)/180)
    sinlat1 <- sin((loc1[, 2] * pi)/180)
    coslon1 <- cos((loc1[, 1] * pi)/180)
    sinlon1 <- sin((loc1[, 1] * pi)/180)
    coslat2 <- cos((loc2[, 2] * pi)/180)
    sinlat2 <- sin((loc2[, 2] * pi)/180)
    coslon2 <- cos((loc2[, 1] * pi)/180)
    sinlon2 <- sin((loc2[, 1] * pi)/180)
    pp <- cbind(coslat1 * coslon1, coslat1 * sinlon1, sinlat1) %*% 
        t(cbind(coslat2 * coslon2, coslat2 * sinlon2, sinlat2))
    R * acos(ifelse(pp > 1, 1, pp))
}
## cord distance, default is km.
## loc1 = [lon, lat ]
cordist <- function (loc1, loc2, miles = FALSE, R = NULL) 
{
    if (is.null(R)) {
        if (miles) 
            R <- 3963.34
        else R <- 6378.388
    }
    if (missing(loc2)) 
        loc2 <- loc1
    coslat1 <- cos((loc1[, 2] * pi)/180)
    sinlat1 <- sin((loc1[, 2] * pi)/180)
    coslon1 <- cos((loc1[, 1] * pi)/180)
    sinlon1 <- sin((loc1[, 1] * pi)/180)
    coslat2 <- cos((loc2[, 2] * pi)/180)
    sinlat2 <- sin((loc2[, 2] * pi)/180)
    coslon2 <- cos((loc2[, 1] * pi)/180)
    sinlon2 <- sin((loc2[, 1] * pi)/180)
    pp <- cbind(coslat1 * coslon1, coslat1 * sinlon1, sinlat1) %*% 
        t(cbind(coslat2 * coslon2, coslat2 * sinlon2, sinlat2))
    R * sqrt( 2* (1-ifelse(pp > 1, 1, pp) ) )
}



## matern function with Handcock's parameterization
## hmatern
hmatern = function(u,range,smoo){
  if (is.vector(u)){
    names(u) <- NULL
    y = u
    y[]=1
  }
  if (is.matrix(u)) {
    dimnames(u) <- list(NULL, NULL)
    y = u
    y[,]=1
  }
  unonzero= u>0
  if(smoo<100) {
    urange = u*2*sqrt(smoo)/range
    y[unonzero] = (urange[unonzero]^smoo) * besselK(urange[unonzero],smoo)/
      (2^(smoo - 1))/ gamma(smoo)}
  else
    {y[unonzero] = exp(-u[unonzero]^2)}
  y
  }
##===========================================================
## simple useful functions
##===========================================================
suma <- function(x) {
  indnna = !is.na(x)
  if(sum(indnna)>0) res = sum(x[indnna]) else res=NA
  res
}
maxi <- function(x){
  indnna = !is.na(x)
  if(sum(indnna)>0) res=max(x[indnna]) else res=NA
  res}

mini <- function(x){
  indnna = !is.na(x)
  if(sum(indnna)>0) res=min(x[indnna]) else res=NA
  res}

meana <- function(x){
  indnna = !is.na(x)
  if(sum(indnna)>0) res=mean(x[indnna]) else res=NA
  res}

vari <- function(x){
  indnna = !is.na(x)
  if(sum(indnna)>0) res=var(x[indnna]) else res=NA
  res
}

longi <- function(data){length(data[,1])}

plotfit <- function(fit)
         {
           par( mfrow=c(2,2) )
           plot(fit)
           par( mfrow=c(1,1) )
         }

algunos <- function(x,cant=10) x[1:cant,]

## returns the row and column of an index matrix == 1
indmat <- function(x)
{
  n <- length(x[,1])
  tm1 <- seq(1,n)
  tm2 <- matrix(rep(1,n*n),n)
  rowmat <- diag(tm1)%*%tm2
  colmat <- tm2%*%diag(tm1)
  ro <- rowmat[x==1]
  co <- colmat[x==1]
  cbind(ro,co)
}