
Long interspersed element-1 (LINE-1 or L1) is an autonomous retrotransposon, which is capable of inserting into a new region of genome. Previous studies have reported that these elements lead to genomic variations and altered functions by affecting gene expression and genetic networks. Mounting evidence strongly indicates that genetic diseases or various cancers can occur as a result of retrotransposition events that involve L1s. Therefore, the development of methodologies to study the structural variations and interpersonal insertion polymorphisms by L1 element-associated changes in an individual genome is invaluable. In this study, we applied a systematic approach to identify human-specific L1s (i.e., L1Hs) through the bioinformatics analysis of high-throughput next-generation sequencing data. We identified 525 candidates that could be inferred to carry non-reference L1Hs in a Korean individual genome (KPGP9). Among them, we randomly selected 40 candidates and validated that approximately 92.5% of non-reference L1Hs were inserted into a KPGP9 genome. In addition, unlike conventional methods, our relatively simple and expedited approach was highly reproducible in confirming the L1 insertions. Taken together, our findings strongly support that the identification of non-reference L1Hs by our novel target enrichment method demonstrates its future application to genomic variation studies on the risk of cancer and genetic disorders.
Almost half of the human genome is derived from transposable elements (TEs) that are divided into two classes, DNA transposons and retrotransposons (Cordaux and Batzer, 2009). Retrotransposons consist of long interspersed elements (LINEs), short interspersed elements (SINEs), and endogenous retroviruses (ERVs) (Cordaux and Batzer, 2009). Retrotransposons have the ability to generate genomic variations because they can be inserted into another genomic location through RNA intermediates. These intermediates are reverse transcribed and mobilize TEs (O’Donnell and Burns, 2010). Previous studies focusing on TEs provide important insights into understanding of human genome evolution and diversity (Ayarpadikannan and Kim, 2014; Beck et al., 2011; Park et al., 2015; Schrader and Schmitz, 2018; Sotero-Caio et al., 2017). Among retrotransposons, LINE-1s (L1s) constitute a common large family of retrotransposons composed of approximately 17% of the human genome, which have resided in the mammalian genomes for over 150 million years (Lutz et al., 2003; Seleme et al., 2006). Full-length L1 is around 6 kb and consists of a 5′ untranslated region (UTR), two open reading frames (ORF1 and ORF2) and a 3′ UTR with a poly A terminal sequence. The ORF1 encodes a RNA-binding protein that functions as a nucleic acid chaperone. The ORF2 encodes endonuclease (EN) and reverse transcriptase (RT) that are crucial protein-complexes for self-mobilization (Lutz et al., 2003; Philippe et al., 2016; Seleme et al., 2006). The L1 is reverse transcribed and integrated into the genome through a mechanism known as target-primed reverse transcription (TPRT) (Philippe et al., 2016). During this process, the L1 endonuclease generates a single stranded nick in genomic DNA to expose a 3′-OH, which is used as a primer for reverse transcription of L1 RNA by the L1 reverse transcriptase (Brouha et al., 2003). The newly integrated L1 is flanked on both sides by identical direct repeats (7 to 20 bp) called “target-site duplications” (TSDs) flanking newly integrated elements (Fanning and Singer, 1987). These L1 proteins can also act in trans, to generate mobilization of noncoding retroelements such as
Recently, next-generation sequencing (NGS)-based applications such as whole-genome sequencing, transcriptome sequencing, exome sequencing, and microRNA profiling has the tremendous impact on genome research (Kenny et al., 2011; Valencia et al., 2012). Nevertheless, the identification of non-reference retrotransposons in specific region is not suitable by using these NGS techniques due to the limitations of resequencing. For example, when mapping on the human reference genome, resequencing the data of additional reads could be discarded by the bioinformatics algorithm. In addition, due to the repetitive sequence characteristics of the TEs included in the short reads, it is mapped to a paralogous region other than its original position (Ewing, 2015). Furthermore, the quality of sequencing reads is degraded due to the poly A tail sequence on 3′ UTR of L1, and sequencing data cannot be used effectively. Therefore, it was challenging to detect the newly inserted TEs and its polymorphisms using the NGS method (Collier et al., 2005; Iskow et al., 2010) even though there are several methods based on TE-priming system, such as ATLAS and SIMPLE, reported that are currently used for identifying non-reference L1Hs insertions from human individuals (Badge et al., 2003; Boissinot et al., 2000; Konkel et al., 2007; Streva et al., 2015).
Here, we introduce an improved method of L1Hs target sequencing library construction and bioinformatics analysis for detecting non-reference L1Hs insertions in the human genome through the target enrichment system. Using our system, we discovered 525 non-reference L1Hs insertion candidates from a Korean individual (called KPGP9). Furthermore, to confirm the sensitivity and specificity of our method, we randomly selected 40 out of the non-reference L1Hs insertions and performed experimental validation by PCR amplification and Sanger sequencing with the KPGP9 sample. We propose that L1Hs target enrichment system could be useful to explore the dynamics of L1Hs retrotransposition in human populations.
To design a probe specific for non-reference L1Hs insertions, we compared L1Hs consensus sequence with other L1 subfamilies (L1PA2~L1PA17). Their consensus sequences were collected from the Genetic Information Research Institute (GIRI) repbase browser (
Sample donors in this study signed a written informed consent to participate, and the Genome Research Foundation (IRB-20101202-001 for KPGP9) provided an approval for this study. The first stage in a standard genomic DNA library preparation is DNA fragmentation by sonication. The genomic DNA extracted from KPGP9 (a Korean individual) was used for this study. The genomic DNA (500 ng) was sheared using a Covaris S2 sonicator (Covaris, USA) to achieve typical size range of 400 to 700 bp and a target peak around 550 bp (settings: Duty Cycle, 10%; Intensity, 2; Peak Incident Power, 175; cycles per burst, 200; DNA treatment time, 45 s; water bath temperature, 4°C). The fragmented DNAs were quantified by Colibri Microvolume Spectrometer (Titertek-Berthold, Germany) and approximately 500 ng of the sample was run on a 2% TAE agarose gel along with the 1 KB Plus DNA Ladder (BioFACT, Korea) to verify the average fragment size of 500 bp.
The Ovation Target Enrichment System kit (NuGen, USA) was used to construct Illumina libraries with the fragmented DNAs for NGS on Illumina sequencing by synthesis platform. In brief, the following steps of DNA end repair, Illumina sequencing-based adaptor ligation and purification of the ligated gDNA were performed as described in the Ovation Target Enrichment System kit protocol. To repair both ends of fragmented DNAs, the End repair Master Mix (NuGen) was added to the fragmented DNA solution and incubated at 25°C for 30 min, followed by incubation at 70°C for 10 min. Next, the Ligation Adapter Master Mix (Nugen, USA) was added to end-repaired samples in order to ligate an adaptor and this mixture was incubated at 25°C for 30 min, followed by 70°C for 10 min. Immediately, adaptor dimers and unused reagents in the reaction were removed from the enriched libraries by using 0.8 volume per sample of the Agencourt RNAClean XP Beads (Beckman Coulter, USA). After ligation purification, the quality of each library was assessed by using a Bioanalyzer high Sensitivity DNA chip (Agilent Technologies, USA) (
The probe hybridization and extension step were performed by a thermal cycler (Bio-Rad Laboratories, USA) with the following conditions: 95°C for 5 min; 200 cycles of 80°C for 10 s, decrease 0.1°C each cycle; 60°C for 16 h. After the hybridization, we immediately mixed the Extension Enzyme (Nugen) into the sample for hybridization-based target elongation with the following conditions on the thermocycler: elongation at 72°C for 10 min and stabilization at 4°C. The increased DNA from target sequence purification was immediately carried out following the manufacturer’s instructions. After the hybridization-based target elongation, this library pool was amplified using the Library Amplification Master Mix (NuGen) with following program: one cycle of 37°C for 10 min; an initial denaturation step of 3 min at 95°C, followed by 15 cycles of PCR at 30 s of denaturation at 95°C, 15 s at the annealing temperature 62°C, and 20 s of extension at 72°C, followed by a final extension step of 3 min at 72°C, followed by a hold at 4°C. The amplified product was then cleaned using Agencourt RNAClean XP beads following the manufacturer’s protocol (NuGen). A library of genomic DNA fragments amplified using the Illumina’s primer set is called a target-enrichment library. The library had been sequenced by a NGS equipment of Illumina platform.
Cluster generation and sequencing were carried out on the amplified samples using Illumina HiSeq 2500 System (Illumina, USA) in 100 bp paired-end format according to the Illumina Paired-End Sequencing Platform Library protocol. The L1Hs-targeted sequencing was performed on a HiSeq equipment at the NGS and analysis facility, Theragen Etex Bio Institute (
HiSeq sequencing short reads were analyzed by custom pipelines to preprocess the raw data, which were contained with trimming the adaptor sequences, quality control in base calling, and verification of read mapping scores. The statistics analysis workflow was as follows.
Prior to trimming sequencing reads, the reads were tag-sorted and quality control using the FastQC (
Prior to alignment between paired-end reads and reference genome sequences, the adaptor sequences at the 5′ end of reads, poly A tail, ambiguous long N bases, and low complexity sequences were trimmed and removed using Cutadapt 1.1 version (
The IGV program (
To investigate the genomic location of target enrichment regions, we manually inspected the non-reference L1Hs candidate loci and annotated genes related to their insert sites using the UCSC Genome Browser.
The UCSC genome browser gateway (
PCR was performed using two different human genomic DNA samples (KPGP9 and NA10851 (Coriell Cell Repository, USA)) as templates. PCR amplification of each locus was performed in 20 ul reactions containing 20 ng of template DNA, 10 ul of 2X EF-Taq Premix4 (BioFACT), 10 nM of each oligo nucleotide primers, and nuclease-free water. Each PCR was subjected to initial denaturation step of 5 min at 95°C, followed by 35 cycles of 30 s at 95°C, 40 s of annealing at optimal annealing temperature, and a long extension step at 68°C for 7 min, followed by a final extension step at 68°C for 10 min. The PCR products were run on a 1% agarose gel electrophoresis with EcoDye (BioFACT) and visualized using Gel Doc (Bio-Rad, Germany). The PCR product was purified by the PCR purification kit (FAVORGEN, Taiwan) and some products were cloned with the Dr. TA TOPO cloning kit (Doctor Protein, Korea) according to the manual instructions. Target clone were confirmed by colony PCR, following plasmid preparation with the Exprep™ plasmid SV kit (GeneAll Biotechnology, Korea) and sequenced using chain-termination sequencing on an ABI 3500 Genetic Analyzer (Applied Biosystems, USA).
Previous studies on non-reference TEs have suggested that the amplification system was problematic because randomly fragmented DNA in different sizes was inserted in the library construction and the length of a producible read was insufficient to cover either the L1Hs-targeted region or the genomic region (Van den Broeck et al., 1998; Witherspoon et al., 2010). To overcome those issues, we used 100 bp paired-end sequencing using the Illumina platform and high-throughput targeted sequencing methods to identify L1Hs elements in the human genome. In addition, we added a fragment size selection step to selectively enrich the L1Hs-targeted reads. Our method is a developed system to detect L1Hs elements based on targeted high-throughput sequencing using a L1Hs target-specific probe (Ewing and Kazazian, 2010) and sample-specific indexing (Smith et al., 2009).
The workflow of the method for the L1Hs-targeted library is illustrated in Fig. 1 with additional explanations, and the detailed protocol is outlined in the Materials and Methods section. The extracted genomic DNAs from the KPGP9 (Korean individual) had been fragmented by using physical sonication with the Covaris system. During this time, genomic DNAs can be randomly generated to form either blunt- or sticky-end. To extract DNAs of suitable size (about 550 bp) for the Illumina HiSeq platform, we performed beads size selection on fragmented DNAs. AMPure XP Beads (Beckman Coulter, A63881) can select the desired size of DNAs based on its concentration. The selected DNAs in shapes were performed to repair for blunt ends, and adenosines was provided to each 3′ ends for TA ligation. Adaptors comprised of 3′ thymine oligonucleotide are compatible with the Illumina sequencing platform and ligated to the sheared DNAs.
As shown in Fig. 2, the L1Hs-specific targeted probe was designed in the specific region of L1Hs 3′ UTR and this probe underwent ligation with the Illumina adaptor sequence. To select fragment DNAs that contain an L1Hs region, we perform
The obtained raw sequencing data could be classified by each sample through “Demultiplexing step”, was generated by using the FastQC program. As shown in Fig. 3, there are bioinformatics analysis flowchart for the detection of non-reference L1Hs insertions. The Illumina’s adaptor sequence with L1Hs-targed sequence is not the L1Hs-targed sequence and should be removed using the Cutadapt1.1 version bioinformatics tool. The sequence data from which the L1Hs-targeted probe sequences had been removed were mapped to the human reference genome using Bowtie aligner program. The overlapped target regions were selected as the L1Hs insertion loci using two programs, HOMER and MACS2, which were developed to display the target region showing a significant degree of mapping within the aligned genomic sequence. There is no gold-standard for peak-calling step even though several developed approaches are used (Strino and Lappe, 2016). Because of the algorithm difference in the programs, false-positive results could be increased or the peaks that should actually be called is not selected. Nevertheless, it is known that the use of multiple replicate tools in the peak-calling programs has a high recall rate (Steinhauser et al., 2016).
As for a sequencing result, we obtained a total of approximately 30 million reads with 3.05 Gb using paired-end sequencing on the HiSeq2500 system. Each manufactured sequencing read contained partial L1Hs-targeted sequence (Read 2) and its flanking sequence (Read 1), respectively. However, in the case of Read 2 from beginning of the L1Hs 3′ UTR region, the poly A tail region is continuously recorded in the target library, and the quality score deteriorated from the signal detector of the HiSeq system. Furthermore, it was difficult to use because Read 2 containing poly A tail sequence was mapped to a number of paralogous region. Therefore, in bioinformatics analysis, the “Read 1” that is considered to contain the flanking sequence of L1Hs was used to identify the predicted region of non-reference L1Hs insertion (Table 1). The qualified reads after filtering out low-quality reads and trimming adaptors for Illumina platform were implicated in alignment to the human reference genome (Hg38). After mapping trimmed reads to human reference genome, we performed peak calling of mapped reads using two peak caller programs, HOMER and MACS, by which we find 2900 overlapped peak-callings (Lun and Smyth, 2016). Considering the normalized tag counts (number of tags found at the peak) and fold change (FC) value on the HOMER program, the low density peak and the annotated region (L1 subfamilies) were removed using the Bedtools command. The peak-callings with the same position as L1s present in the human reference genome and with small number of mapping reads (i.e, normalized tag counts < 4) were eliminated. As a result, a total of 525 non-reference L1Hs insertion candidates were selected (Table 2). In addition, we manually inspected the all loci by using the IGV program and the Custom Track with these positions on the UCSC browser (
Based on our data analysis, although we have found 525 non-reference L1Hs regions with 1.53 Gb data of single read, we expect to find more accurate and more L1Hs if data production is increased. However, as data production increases, the frequency of false positive is expected to increase. On the other hand, because our probe sequence is located at the 3′ UTR of L1Hs element, we could identify non-reference L1Hs integrated by a typical TPRT mechanism rather than targeting the 5′ UTR. Thus, this method is one way to find non-reference L1Hs elements that are mostly truncated in the 5′ end.
Through the manual inspection of 525 non-reference L1Hs insertions, we examined chromosomal distribution and genomic contribution. Chromosome 4 included the higher number of L1Hs insertions while we have not detected non-reference L1Hs insertions at chromosome 21 (
Among the 525 non-reference L1Hs insertion candidates, 40 L1Hs insertion loci were randomly selected and subjected to PCR verification (
Most of the L1s in the human genome are inactive. However, approximately 100 L1 copies still remain active in the human genome. By retrotransposition, they could influence on changing genomic structure, amplifying and/or disrupting gene expression, and leading genomic variations. Here, we aimed to identify non-reference L1Hs that still are retrotranspositionally competent in the human genome. Thus, we introduced the new L1 targeted-enrichment method based on 3′ UTR sequence to discover the typical L1Hs insertion and screened the non-reference L1Hs insertion loci from a Korean individual (KPGP9) by using Illumina platform. In conclusion, we propose that this fast and cost-effective method for L1Hs screening is useful to further investigate somatic mutations induced by TEs in cancer and genetic diseases.
Summary of the high-throughput sequencing data
Classification | Paired-end | Read 1 |
---|---|---|
Total reads | 30,228,074 | 15,114,037 |
Total bases | 3.05 Gb | 1.53 Gb |
GC contents (%) | 1,306,397,565 (42.79%) | 647,427,902 (60.93%) |
N zero reads (%) | 30,127,532 (99.67%) | 15,086,912 (99.82%) |
Q30 bases | 1,977,974,878 (64.79%) | 1,268,378,122 (83.09%) |
Read1 was the sequence that we used for the computational analysis.
Summary of non-reference L1Hs elements in the KPGP9 genome
Classification | No. of loci |
---|---|
2,900 | |
2,375 | |
525 | |
Intergenic regions | 261 |
Intronic regions | 247(40) |
Exonic regions | 17(1) |
Validation regions | 40 |
The number in parentheses indicates the number of predicted genes.