13916729435 posted on 2015-12-1 11:07:06

split function of Perl,Python,Awk

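  In daily work I mostly use Perl, Python, AWK, and R. I have also learned Java, C, C++, and Vala, but I just don't like them, so what can you do.
  Looks like I'm destined to write scripts for the rest of my life.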
  Perl
  @rray = split /PATTERN/, STRING, LIMIT
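  As you can see, split takes two required parts (PATTERN and STRING) plus an optional LIMIT. Whatever the language, split always comes down to the same things:
  the string you want to split, the delimiter to split on, and a place to store the result. Everything else is an optional extra.
  Let's start with a simple example:
  > cat test.txt (in the original post, yellow highlighting marked a <tab> and green marked one or more spaces, to keep the columns aligned)
  [We want to extract the numbers and words from this line]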
  14    yahoo  17:56 Ray---boring
> perl -e '$str="14    yahoo\t 17:56 Ray---boring";@num_word=split /[\s:\-]+/, $str;print "@num_word\n"'
14 yahoo 17 56 Ray boring
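  A slightly more complex example:
[We want to extract the part that was highlighted in light blue in the original post, i.e. Tgh_leaf_1_rRNA_removal_20140624_ATTCCT_L008]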
  
../Sample_Tgh_leaf_1_rRNA_removal_20140624/Tgh_leaf_1_rRNA_removal_20140624_ATTCCT_L008_R1_001.fastq.gz
  
../Sample_Tgh_leaf_1_rRNA_removal_20140624/Tgh_leaf_1_rRNA_removal_20140624_ATTCCT_L008_R2_001.fastq.gz
  
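> perl -e '$str="../Sample_Tgh_leaf_1_rRNA_removal_20140624/Tgh_leaf_1_rRNA_removal_20140624_ATTCCT_L008_R1_001.fastq.gz";$name = join "_", @{[split /_/, [split /\//, $str]->[2]]}[0..7];print "$name\n"'
Tgh_leaf_1_rRNA_removal_20140624_ATTCCT_L008
Let's break it down:
split /\//, $str splits the string into 3 pieces: "..","Sample_Tgh_leaf_1_rRNA_removal_20140624","Tgh_leaf_1_rRNA_removal_20140624_ATTCCT_L008_R1_001.fastq.gz"
[split /\//, $str] turns the result of split into an anonymous array
[split /\//, $str]->[2] takes the 3rd element of that anonymous array through the reference
[split /_/, [split /\//, $str]->[2]] turns the second split result into an anonymous array again (10 elements): "Tgh","leaf","1","rRNA","removal","20140624","ATTCCT","L008","R1","001.fastq.gz"
An anonymous array reference cannot be sliced through the arrow (->[0..7] does not work), so it has to be dereferenced explicitly with the @{} array marker
@{[split /_/, [split /\//, $str]->[2]]}[0..7] yields an array slice of length 8, which is still a list: "Tgh","leaf","1","rRNA","removal","20140624","ATTCCT","L008"
join "_", @rray joins these elements into a single string: Tgh_leaf_1_rRNA_removal_20140624_ATTCCT_L008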

[Note that the delimiter is a PATTERN, i.e. a regular expression, not a plain string]
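The same distinction shows up in Python, where re.split takes a pattern and str.split takes a literal string. A minimal sketch of the pitfall (the sample string `version` is made up for illustration and is not from the post):

```python
import re

version = "perl5.22.1"  # hypothetical sample string

# str.split treats "." as a literal character.
literal_parts = version.split(".")

# re.split treats "." as "any character", so every character is consumed
# as a delimiter and only empty strings are left.
regex_parts = re.split(".", version)

# re.escape makes the regex engine treat the delimiter literally again.
escaped_parts = re.split(re.escape("."), version)

print(literal_parts)   # ['perl5', '22', '1']
print(regex_parts)     # a list of empty strings
print(escaped_parts)   # ['perl5', '22', '1']
```

So whenever the delimiter contains characters such as . | * + or ?, either escape it or use the plain string split.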
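[You can actually call split on its own, with no arguments at all. The default PATTERN is then whitespace and the default STRING is whatever is in $_ (Perl's universal default variable). This is very handy inside functions, for loops, and anywhere else that works on $_ by default]
For example: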
A BC DEF GHIJ
KLMN OPQ RS T
U VWX YZ
> perl -e '@rray=("A BC DEF GHIJ","KLMN OPQ RS T","U VWX YZ");@all=();for (@rray){@tmp=split;push @all, @tmp};print "@all\n"'
A BC DEF GHIJ KLMN OPQ RS T U VWX YZ

Python
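  Python's built-in split is user-friendly but not very powerful; the more powerful re.split is provided separately, in the re module.
  Since everything in Python is an object, split is called as a method on the string itself.
  A simple example: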
  14    yahoo  17:56 Ray---boring
  Hmm, the split method of str cannot do the split above in a single call (it can be done if you think about it a little, see Attachment 1 below), so we turn to re.split:
  > python3 -c 'import re;stri="14    yahoo  17:56 Ray---boring";print(re.split(r"[\s:\-]+",stri));'
['14', 'yahoo', '17', '56', 'Ray', 'boring']
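To show why str.split alone falls short here, the sketch below runs it on the same line. The variable names are only illustrative. It also shows maxsplit, which plays roughly the role of Perl's LIMIT, except that it counts splits rather than resulting fields (so LIMIT 2 corresponds to maxsplit=1).

```python
line = "14    yahoo\t 17:56 Ray---boring"

# No argument: split on runs of any whitespace (spaces and tabs alike),
# much like Perl's bare split. ":" and "---" are left untouched.
by_whitespace = line.split()
print(by_whitespace)    # ['14', 'yahoo', '17:56', 'Ray---boring']

# An explicit " " is one literal space, so a run of spaces yields empty strings.
by_single_space = "a  b".split(" ")
print(by_single_space)  # ['a', '', 'b']

# maxsplit=1 performs at most one split and keeps the rest in one piece.
first_and_rest = line.split(maxsplit=1)
print(first_and_rest)   # ['14', 'yahoo\t 17:56 Ray---boring']
```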
The more complex example:
[We want to extract the part that was highlighted in light blue in the original post]
../Sample_Tgh_leaf_1_rRNA_removal_20140624/Tgh_leaf_1_rRNA_removal_20140624_ATTCCT_L008_R1_001.fastq.gz
../Sample_Tgh_leaf_1_rRNA_removal_20140624/Tgh_leaf_1_rRNA_removal_20140624_ATTCCT_L008_R2_001.fastq.gz
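> python3 -c 'stri="../Sample_Tgh_leaf_1_rRNA_removal_20140624/Tgh_leaf_1_rRNA_removal_20140624_ATTCCT_L008_R1_001.fastq.gz";print("_".join(stri.split("/")[2].split("_")[0:8]))'
Tgh_leaf_1_rRNA_removal_20140624_ATTCCT_L008
  Note that the slice end is exclusive in Python, so [0:8] is needed to get the same 8 fields as Perl's [0..7].
  Done in Python, this is very simple.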
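If the extraction is needed for many files, it can be wrapped in a small function. This is only a sketch: the names `sample_prefix` and `sample_prefix_rsplit` and the parameter `n_fields` are made up here, and it assumes the file name is always the last path component, so it takes index [-1] rather than a fixed index.

```python
def sample_prefix(path, n_fields=8):
    """Return the first n_fields underscore-separated fields of the file name."""
    filename = path.split("/")[-1]               # last path component
    return "_".join(filename.split("_")[:n_fields])


def sample_prefix_rsplit(path):
    """Alternative: drop the last two fields (read number and chunk) from the right."""
    filename = path.split("/")[-1]
    return filename.rsplit("_", 2)[0]


r1 = "../Sample_Tgh_leaf_1_rRNA_removal_20140624/Tgh_leaf_1_rRNA_removal_20140624_ATTCCT_L008_R1_001.fastq.gz"
r2 = "../Sample_Tgh_leaf_1_rRNA_removal_20140624/Tgh_leaf_1_rRNA_removal_20140624_ATTCCT_L008_R2_001.fastq.gz"

print(sample_prefix(r1))         # Tgh_leaf_1_rRNA_removal_20140624_ATTCCT_L008
print(sample_prefix_rsplit(r2))  # Tgh_leaf_1_rRNA_removal_20140624_ATTCCT_L008
```

Counting from the right with rsplit avoids hard-coding the number of leading fields, which helps when sample names contain a varying number of underscores.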
  
  
  AWK
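  split(string,array,sep) is a function of the very old-fashioned kind: it fills the array you pass in and returns the number of pieces, instead of returning the pieces themselves.
[We want to extract the part that was highlighted in light blue in the original post]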
../Sample_Tgh_leaf_1_rRNA_removal_20140624/Tgh_leaf_1_rRNA_removal_20140624_ATTCCT_L008_R1_001.fastq.gz
../Sample_Tgh_leaf_1_rRNA_removal_20140624/Tgh_leaf_1_rRNA_removal_20140624_ATTCCT_L008_R2_001.fastq.gz
> echo ../Sample_Tgh_leaf_1_rRNA_removal_20140624/Tgh_leaf_1_rRNA_removal_20140624_ATTCCT_L008_R1_001.fastq.gz | awk '{split($1,aa,"/");split(aa[3],bb,"_");} END{for(i=1;i<=8;i++){if(i<8){printf("%s_",bb[i])}else{print bb[i]}}}'
Or, letting -F do the first split (reading the same input):
awk -F'/' '{split($3,bb,"_");}END{for(i=1;i<=8;i++){if(i<8){printf("%s_",bb[i])}else{print bb[i]}}}'

  Attachment:
  1. python3 -c 'stri="14    yahoo  17:56 Ray---boring";orig=stri.split();part2=orig[2].split(":");part3=orig[3].split("---");print(orig[0:2]+part2+part3);'
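Another way to stay with str methods only, sketched here as an alternative rather than as the approach used in the attachment, is to turn the extra delimiters into spaces first and then let a bare split() collapse the runs. It assumes that ":" and "-" never occur inside a field you want to keep.

```python
stri = "14    yahoo  17:56 Ray---boring"

# Replace every extra delimiter with a space, then split on whitespace runs.
normalized = stri.replace(":", " ").replace("-", " ")
fields = normalized.split()

print(fields)  # ['14', 'yahoo', '17', '56', 'Ray', 'boring']
```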