USE OF CHRONIC LYMPHOCYTIC LEUKEMIA RESEARCH CONSORTIUM DATA REPOSITORY AND GENE EXPRESSION OMNIBUS TO GENERATE AND TEST HYPOTHESES FOR BIOMARKER IDENTIFICATION AND DEVELOPMENT

THESIS

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By Kristin Chelsea Keen, B.A.

The Ohio State University
2009

Thesis Committee:
Professor Kun Huang, adviser
Professor Philip Payne

Pathology Graduate Program
Approved by Adviser

ABSTRACT

Chronic lymphocytic leukemia (CLL) is the most common adult leukemia in the United States. There is no known cure for CLL. While biomarkers such as CD38, IGHV, and ZAP-70 have been found to correlate with disease progression, there is a need for further validation of these biomarkers as well as new biomarker discovery. In this study, publicly available gene expression data from NCBI's Gene Expression Omnibus was used to identify genes with expression correlated to ZAP-70 and CD38 mRNA expression patterns. We also used the large volume of data from the Chronic Lymphocytic Leukemia Research Consortium (CRC) to search for novel correlations between clinical markers and disease progression and treatment outcome. We found several hundred genes with expression patterns correlated to ZAP-70. We also found several clinical, genetic, biologic, and immunologic CRC data fields that were significantly correlated, with associations ranging from weak to strong.

Dedicated to my family

ACKNOWLEDGEMENTS

I thank my adviser, Kun Huang, for his positive support and trust in me as a student and a researcher. I thank my committee member, Philip Payne, for his tangible enthusiasm and thoughtful advice. I thank Gulcin Ozer for contributing her time, effort, and experience to the correlation analysis of Aim 1. I thank Cenny Taslim for helping me with MATLAB when I was in a pinch.
Lastly, I thank my husband, Jared Circle, and mother, Valerie Keen, for their hours of time doing more than their share of caring for my daughter while I finished this project. I couldn't have done this without them.

VITA

2003: B.A., Life Sciences, Otterbein College

FIELDS OF STUDY

Major Field: Pathology

TABLE OF CONTENTS

Abstract
Dedication
Acknowledgements
Vita
List of Figures

Chapters:

1. Introduction
   Specific Aims
2. Background and Significance
   2.1 Chronic Lymphocytic Leukemia
   2.2 The CLL Research Consortium
   2.3 Bioinformatics and CLL
3. Methods
   3.1 Methods for Specific Aim 1
   3.2 Methods for Specific Aim 2
4. Results
   4.1 Results for Specific Aim 1
   4.2 Results for Specific Aim 2
5. Conclusions
   Discussion

References
Appendix A: Tables and Figures for CRC data repository and GEO analysis

LIST OF FIGURES

Figure 1: Valid and meaningful hypotheses from [1]
Figure 2: Flow chart for Aim 1 methods
Figure 3: Flow chart for Aim 2 methods
Figure 4: Fields queried from CRC research data repository for Aim
Figure 5: Fields analyzed and bins for these fields
Figure 6: Comparison pairs between CRC query datasheets and methodology for combining datasheets for analysis
Figure 7: GDS dataset information summary
Figure 8: Correlated CRC data fields, p 0.05, phi
Figure 9: Correlation gene lists for ZAP-70 and CD38 for GDS1388, GDS1454, and GDS2501, threshold
Figure 10: Randomness calculation for ZAP-70 and CD38 gene list intersections
Figure 11: Annotated IPA gene lists for correlated GDS2501 gene lists and intersected gene lists for ZAP-70 and CD
Figure 12: IPA pathways showing only connected genes, for combined gene lists for ZAP
Figure 13: IPA pathway showing only connected genes, for combined gene lists for CD38
Figure 14: Ohio State University Medical Center clinical reference ranges, November

CHAPTER 1
INTRODUCTION

Chronic lymphocytic leukemia (CLL) is the most common adult leukemia in the United States [2].
Although some leukemias and lymphomas can be cured, there is no known cure for CLL [3]. While some biomarkers, such as CD38, IGHV, and ZAP-70, have been found to correlate with disease progression, there is a need for further validation of these biomarkers as well as new biomarker discovery [4]. In this study, publicly available gene expression data from NCBI's Gene Expression Omnibus is used to identify genes with expression correlated to ZAP-70 and CD38 mRNA expression patterns. We also use the large volume of data from the Chronic Lymphocytic Leukemia Research Consortium (CRC) to search for novel correlations between clinical markers and disease progression and treatment outcome. Our aims are as follows:

Specific Aim 1: Test previously formed hypotheses, generated using a knowledge engineering (KE)-based approach, using CRC data and correlation analysis as a novel method of biomarker discovery.

Specific Aim 2: Identify genes whose mRNA expression is correlated with known CLL biomarkers ZAP-70 and CD38.

Based on previous studies using gene expression correlation and gene list intersection from multiple datasets, we expect expression of several genes to correlate with ZAP-70 and CD38 in multiple datasets [5, 6]. We expect that our correlation analysis of CRC data will confirm the KE-generated hypotheses and discover new biomarkers for disease progression.

CHAPTER 2
BACKGROUND AND SIGNIFICANCE

2.1 Chronic Lymphocytic Leukemia

Chronic lymphocytic leukemia (CLL) is the most common adult leukemia in the United States. Nearly 100,000 Americans live with CLL, most of them over fifty years old. Rates of CLL incidence are increasing, and there is no known cure [7]. CLL usually develops slowly, and its symptoms resemble those of many more common conditions. This makes diagnosis more difficult, and most patients find out that they have CLL only after a battery of tests.
CLL is diagnosed through blood tests including white blood cell count and complete blood count. Once a diagnosis is made, staging must be done to determine whether the disease is in the beginning, intermediate, or advanced stage. Some patients remain in the beginning stages of disease progression and are able to live long lives, never having to deal with many of the disease's worst symptoms [8, 9]. This results in two distinct groups of patients: those with advancing disease and those with disease that does not seem to progress. Those with the non-progressive manifestation of the disease seem not to need treatment until the disease begins progressing and they become more symptomatic [4].

Early determination of which group a patient belongs in, progressive or non-progressive CLL, would serve an important function. If this information could be determined in advance, it would potentially enable the development of a better course of action for disease management and treatment [10]. The end result would be an improvement in patients' conditions and possibly lives saved. Biomarkers have proven helpful in identifying patient groups for other diseases [9]. ZAP-70, CD38, and IGHV have been named in multiple studies as biomarkers for CLL disease progression [4, 11, 12]. A positive ZAP-70 test means that a patient would be placed in the progressive group. While this is progress toward earlier characterization of an individual's disease state, ZAP-70 testing only yields definitive results if conducted during later, symptomatic phases of disease progression [3]. A more useful approach would be to identify biomarkers or tests that can determine, early in the course of the disease, how likely a patient is to stop responding to treatment or to begin more rapid disease progression.
A large-scale study of thousands of patients with CLL has been done, and with it a database containing hundreds of data fields was created. This database has the potential to lead to the identification of new biomarkers or tests that can assist in determining disease progression, early disease state, refractory disease, and patient response to treatment.

2.2 The CLL Research Consortium

The CLL Research Consortium (CRC) is a multi-site research group funded by the National Cancer Institute whose primary function is to conduct studies of the genetic, biochemical, and immunologic origins of CLL [13, 14]. The CRC has conducted studies that have been responsible for important new insights into CLL pathophysiology and treatment as well as multi-site group data repository management [10, 13-16]. The CRC's goals are to pursue new treatments for CLL and to examine phenotypic and biomarker relationships specifically geared toward improving staging and predicting patient response to disease treatment. To assist in reaching these goals, the CRC maintains a data repository containing genetic, biochemical, immunologic, and clinical patient data for the more than 4,000 patients who have agreed to have their case information contributed to the CRC [17]. Because of the large volume of data compiled from thousands of patients in hundreds of separate data fields, the CRC data repository is the ideal initial resource to consult for potential biomarker discovery for CLL, provided a positive control for analysis can be effectively identified and used.

2.3 Bioinformatics and CLL

Conceptual knowledge acquisition methods involve combining basic units of information and meaningful relationships between those basic units [18, 19]. This information could come from virtually any verifiable source, including journal articles and abstracts, databases of information, or experts within a field.
One well-described method for acquiring, refining, and validating knowledge is called knowledge engineering (KE). In practice, there are many theories and methods for performing conceptual knowledge acquisition and KE. In our case, basic factual units from the CRC database were mapped to ontologies, or collections of domain-specific knowledge, to generate seventy hypotheses. These hypotheses were examined by a team of CLL field experts, who determined that nine of those hypotheses were valid and meaningful (Figure 1) [1].

In a large database such as the CRC data repository, having a starting point and a positive control when beginning an analysis can allow for more intelligent querying and a better perspective for viewing results. If these nine hypotheses are found to be true using the CRC data repository, this further supports the relevance of this ontology-anchored approach to conceptual knowledge engineering [1].

In order to claim a link between a biomarker and disease progression or patient response to treatment, it is essential to first detect a correlation between pairs of data fields [20-22]. A basic type of correlation is Spearman's correlation coefficient. When working with exceptionally large amounts of data, it becomes necessary to bin the data, or divide it into separate and more manageable groups for analysis. These bins provide the starting points for the analysis needed to determine a correlation; if a correlation is found, it is then possible to move forward into a more detailed examination of the relationship between the biomarker and disease progression in patients. These bioinformatic techniques can be applied to CLL to determine a link between any of a number of biomarkers and data fields representing disease progression or response to treatment.
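To make the idea of detecting a correlation between two binned data fields concrete, here is a minimal sketch for the simplest case, a 2x2 contingency table. The Methods chapter names contingency tables, the phi coefficient, and the Fisher exact test for this purpose but does not give formulas, so the standard textbook definitions used here are an assumption, and the function names and bin labels are hypothetical.

```python
from math import comb, sqrt


def two_by_two(pairs, bin_x, bin_y):
    """Count a 2x2 table from (x_bin, y_bin) pairs, one pair per patient.

    Returns (a, b, c, d): a = in bin_x and bin_y, b = in bin_x only,
    c = in bin_y only, d = in neither.
    """
    a = b = c = d = 0
    for x, y in pairs:
        if x == bin_x and y == bin_y:
            a += 1
        elif x == bin_x:
            b += 1
        elif y == bin_y:
            c += 1
        else:
            d += 1
    return a, b, c, d


def phi_2x2(a, b, c, d):
    """Phi coefficient of a 2x2 table (0.0 if any margin is empty)."""
    denom = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value with all margins held fixed.

    Sums the hypergeometric probabilities of every table that is no
    more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small tolerance so tables tied with the observed one are counted.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
```

A field pair would then count as correlated if the p-value and phi meet whatever thresholds the analysis sets. Tables with more than two bins per field need a generalization that this sketch does not cover.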
It is our eventual goal to evaluate this data for evidence of change over time to determine whether there is a correlation between that change (delta) and disease progression or response to treatment. This application is intended to lay a foundation for this work. However, technical issues come to the forefront when beginning a study such as this one. These limitations may include visit windowing due to non-aligning dates, use of a convenience sample, and novel hypotheses that are less common and thus more difficult to detect in a large sample such as the CRC data repository.

CHAPTER 3
METHODS

Flowcharts for Aim 1 and Aim 2 can be found in Figures 2 and 3 respectively.

3.1 Methods for Specific Aim 1

Specific Aim 1: Test previously formed hypotheses, generated using a knowledge engineering (KE)-based approach, using CRC data and correlation analysis as a novel method of biomarker discovery.

Data source

Data to be analyzed was taken from the CRC research data repository, maintained at University of California, San Diego, by the CRC Biomedical Informatics Program [17].

Data selection and preparation

A query was made for patients with multiple center visits for the 182 clinical, biological, genetic, and immunologic fields (Figure 4). The output was delivered in Excel format as five different files: BsData_table, Clinical_Tablesv3, cytogenetic, Facs_table, and Registration Table v2.

Positive controls

Nine valid, meaningful conceptual knowledge hypotheses, which were developed in a previous study using CRC research data repository fields [1], were used as a positive control for Aim 1 (Figure 1).

Binning process

For each data field, data was sorted using Microsoft Excel's data sort function to display in order of ascending values. Empty cells and undefined or unknown entries were removed from consideration. If the analyzed data was determined to be categorical, bins were created for each category of viable data.
In some cases, more than eight individual categories were present. In these cases, bins were created for ranges of categories. To further assist in defining parameters for the data sets, the numerical data was checked against The Ohio State University Medical Center (OSUMC) clinical data reference ranges (Figure 14). If the data field corresponded to an existing OSUMC reference range, this range was used to create bins for below normal, normal, and above normal. If the OSUMC reference range was gender-specific, the normal ranges for male and female were combined to create a larger normal range. If data contained very high values (defined as ten times the highest normal value), an additional bin was created for these very high values. If there was no OSUMC reference range, then numerical data was divided into four bins of as close to equal size as possible. A few exceptions were made where an extreme lack of variation in the data did not allow for four equal-sized bins. Figure 5 lists data fields analyzed and bins for each data field; exceptions are noted.

Combining data

Several factors made it prohibitive to create a single, combined data matrix with the assembled information. In some cases, dates did not align perfectly between different Excel files. The problem of visit windowing is significant for large-scale research data repositories such as the CRC data repository and warrants more in-depth examination in future studies. In addition, not all patients had entries for all data fields. These factors made it unreasonable to create one combined data matrix; however, it was possible to build a combined data matrix for each comparison pair of Excel files. Because five separate Excel files were used, ten pairwise comparisons of the data sets were needed (Figure 6).
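The numerical binning rules from the Binning process subsection can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually carried out in Excel: the function names are hypothetical, the normal range is treated as inclusive at both ends, and "very high" is read as at least ten times the upper normal limit.

```python
def combine_ranges(male, female):
    """Merge gender-specific (low, high) normal ranges into one wider range."""
    return (min(male[0], female[0]), max(male[1], female[1]))


def bin_by_reference(value, low, high, very_high_factor=10):
    """Bin one value against a reference range.

    Assumptions: the normal range includes both endpoints, and a value
    is "very high" at or above very_high_factor times the upper limit.
    """
    if value < low:
        return "below normal"
    if value <= high:
        return "normal"
    if value >= very_high_factor * high:
        return "very high"
    return "above normal"


def equal_size_bins(values, n_bins=4):
    """Assign bin indices 0..n_bins-1 so bin sizes differ by at most one.

    Used when no reference range exists. Tied values may land in
    different bins here; fields with little variation need manual
    handling, as the text notes for its exceptions.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins
```

For example, with a made-up reference range of 4.0 to 11.0, a value of 120 falls in the "very high" bin because it is at least ten times the upper limit.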
Using comparison pairs also made it possible to apply different rules for comparisons between different types of data (Figure 6) to accommodate misaligned dates.

Correlation analysis

Three measures were calculated for each comparison between bins and fields: residual, phi, and p-value. Contingency tables and the Fisher exact test were used to calculate these values.

3.2 Methods for Specific Aim 2

Specific Aim 2: Identify genes whose mRNA expression is correlated with known CLL biomarkers ZAP-70 and CD38.

Data selection

GEO DataSets was queried using the terms "chronic lymphocytic leukemia" [23]. Five GDS dataset results were generated: GDS2676, GDS2643, GDS2501, GDS1454, and GDS1388. Datasets with more than one cell type or comparison were eliminated from this study; the datasets left were GDS2501, GDS1454, and GDS1388. SOFT files for these GDS datasets were downloaded for correlation analysis. Figure 7 contains platform type and comparisons made within each selected GDS dataset.

Correlation

The Pearson correlation coefficient (PCC) was calculated for genes CD38 and ZAP-70 on the three selected GDS datasets using a MATLAB script. The script calculates the PCC for each gene based on the gene expression values of one selected gene (in this case, CD38 or ZAP-70). The resulting output *.txt file consists of a gene list containing only those genes found with a PCC absolute value of 0.4 or higher. The gene symbols are displayed next to their respective PCC in descending order. Because there were three GDS datasets and two selected genes of interest, it was necessary to conduct the PCC analysis six times.

Intersection

To determine which gene symbols could be found on multiple correlation gene lists, gene list intersection was performed using a MATLAB script. Genes with a PCC of less than 0.4 were excluded from the results [6]. The script generates an output *.txt file viewable in Microsoft Excel as a table.
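The correlation and intersection steps can be sketched in a few lines. The original work used MATLAB scripts that are not reproduced in the text, so this Python version is only an illustration: the function names, the dictionary layout (gene symbol mapped to a list of expression values across samples), and the choice to score constant genes as 0.0 are all assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    # A constant profile has no defined correlation; score it as 0.0.
    return cov / (sx * sy) if sx and sy else 0.0


def correlated_genes(expression, target, threshold=0.4):
    """Genes whose |PCC| with the target gene is at least the threshold.

    expression maps gene symbol -> expression values across samples.
    Returns {gene: PCC} ordered by PCC, highest first.
    """
    ref = expression[target]
    hits = {}
    for gene, values in expression.items():
        if gene == target:
            continue
        r = pearson(ref, values)
        if abs(r) >= threshold:
            hits[gene] = r
    return dict(sorted(hits.items(), key=lambda kv: kv[1], reverse=True))


def intersect_gene_lists(lists_by_dataset):
    """Keep genes that appear on more than one dataset's gene list.

    lists_by_dataset maps dataset id -> {gene: PCC}.
    Returns {gene: {dataset id: PCC}} for the shared genes only.
    """
    table = {}
    for dataset, genes in lists_by_dataset.items():
        for gene, r in genes.items():
            table.setdefault(gene, {})[dataset] = r
    return {g: row for g, row in table.items() if len(row) > 1}
```

Running `correlated_genes` once per dataset and per gene of interest, then passing the per-dataset results for one gene of interest to `intersect_gene_lists`, mirrors the six correlation runs and two intersection runs described in this section.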
The table organizes the data using gene symbols for row headers and GDS dataset numbers for column headers. Cells within the table contain either a PCC value or a -. The - symbol indicates that the gene was not correlated with the selected gene at PCC 0.4 or better. This analysis was performed two times, once for CD38 and once for ZAP-70.

Randomness Test for Intersection

A randomness test was done to calculate the number of genes on the intersection gene list that could be attributed to randomness. The percentage of genes on the gene chip that were correlated at a PCC of 0.4 or better was calculated. The percentages for intersected datasets GDS1388 and GDS1454 were multiplied to arrive at the percentage of genes that could be attributed to randomness.

Pathway Analysis and Gene Ontology

Pathway analysis was done using Ingenuity Pathway Analysis (IPA) [24]. Each gene list was copied and pasted into the Gene Search box at the top of the software interface. Only the best match was selected for each gene, which was most often a direct gene symbol match and rarely a synonym match. Selected genes were used to create a new gene list in IPA. Each gene list was annotated using IPA's annotate function. Each gene list was selected and placed in a new pathway and connected under standard settings using the Build and Connect functions. Then the Auto-Layout button was used, and all unconnected genes were removed from the pathway. This was done for each gene list and for combined ZAP-70 and CD38 intersected and correlated gene lists, for a total of six pathways.

Gene ontology annotations were gathered using GeneInfoViz's batch search function [25]. Each IPA gene list was pasted into the search box. Gene ontology annotations for each gene were copied and pasted into the IPA annotation tables.
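The randomness test for intersection described earlier in this chapter reduces to multiplying two fractions. A minimal sketch follows, with hypothetical function names; it assumes that the two datasets' correlated gene sets are independent of each other, which is what multiplying the percentages implies.

```python
def chance_overlap_fraction(n_correlated_a, n_genes_a, n_correlated_b, n_genes_b):
    """Fraction of genes expected on both lists by chance alone.

    Each ratio is the share of genes on a chip that passed the PCC
    threshold; their product assumes the two datasets are independent.
    """
    return (n_correlated_a / n_genes_a) * (n_correlated_b / n_genes_b)


def expected_chance_overlap(fraction, n_shared_genes):
    """Convert the chance fraction into an expected gene count."""
    return fraction * n_shared_genes
```

With made-up counts, if 5% of genes pass the threshold in one dataset and 2% in the other, about 0.1% of genes would be expected on the intersection list by chance. Comparing that expectation with the observed intersection size indicates how much of the overlap is unlikely to be random.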
CHAPTER 4
RESULTS

4.1 Results for Specific Aim 1

Specific Aim 1: Test previously formed hypotheses, generated using a knowledge engineering (KE)-based approach, using CRC data and correlation analysis as a novel method of biomarker discovery.

The following four dataset comparisons were made: Facs_table to Clinical_Tablesv3, Facs_table to cytogenetic, Clinical_Tablesv3 to cytogenetic, and Clinical_Tablesv3 to BsData_table data field comparison pairs had p-value less than or equal to data field comparison pairs from separate datasheets met the requirements set of p-value less than or equal to 0.05 and phi coefficient greater than or equal to 0.3 (Figure 8). None of the nine hypotheses were included in those results because, of the 63 possible comparisons between CRC fields cited in the hypotheses, only one was performed.

4.2 Results for Specific Aim 2

Specific Aim 2: Identify genes whose mRNA expression is correlated with known CLL biomarkers ZAP-70 and CD38.

GDS1388 and GDS1454 each