awk, python, perl text-processing performance comparison (repost)

Posted by xxqyzsc on 2015-12-27 07:49:19

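Reposted. The results are not necessarily correct, and the design of the comparison is not necessarily rigorous.
The following three files are scripts written in python, awk and perl, in that order. All three do the same thing: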

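diff.sh f1 f2
The first field (space-separated) of each line in f1 and f2 is the key. If the key of a line in f2 does not exist in f1, that line of f2 is output.
For example:
a.dat contains
1 a
2 a
b.dat contains
1 b
3 b
Then diff.sh a.dat b.dat outputs
3 b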
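
The rule above can be checked against the sample data with a small in-memory sketch. The function name `lines_missing_from` is hypothetical and is not one of the three benchmarked scripts. Unlike them, it also skips blank lines, which is an assumption about how such input should be handled.

```python
def lines_missing_from(f1_lines, f2_lines):
    """Return the lines of f2 whose first field is not a first field in f1."""
    # The key is the first whitespace-separated field; blank lines are skipped.
    keys = {line.split()[0] for line in f1_lines if line.strip()}
    return [
        line
        for line in f2_lines
        if line.strip() and line.split()[0] not in keys
    ]


# Sample data from the post: a.dat and b.dat as lists of lines.
a_dat = ["1 a\n", "2 a\n"]
b_dat = ["1 b\n", "3 b\n"]

result = lines_missing_from(a_dat, b_dat)
print("".join(result), end="")  # prints "3 b"
```

Key 1 appears in both files, so only the line with key 3 survives.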
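Code:

#!/usr/bin/python
# diff.py: python version.
import sys

if len(sys.argv) != 3:
    print("Usage: " + sys.argv[0] + " file1 file2")
    sys.exit(-1)

file1 = sys.argv[1]
file2 = sys.argv[2]

# Record the key (first field) of every line in file1.
list1 = {}
for line in open(file1):
    list1[line.split()[0]] = 1

# Output every line of file2 whose key was not recorded.
for line in open(file2):
    key = line.split()[0]
    if key not in list1:
        sys.stdout.write(line)
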
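#!/bin/bash
# diff.sh: awk version. [[ ]] and "function" need bash; ARGIND needs GNU awk (gawk).
if [[ $# -lt 2 ]]; then
    echo "Usage: $0 file1 file2"
    exit 1
fi

function do_diff()
{
    if [[ $# -lt 2 ]]; then
        echo "Usage: $0 file1 file2"
        return 1
    fi
    if [[ ! -f $1 ]]; then
        echo "$1 is not a file"
        return 2
    fi
    if [[ ! -f $2 ]]; then
        echo "$2 is not a file"
        return 3
    fi
    awk '
        BEGIN { FS = OFS = " " }
        # First file: record every key.
        ARGIND == 1 {
            arr[$1] = 1;
        }
        # Second file: print lines whose key was not recorded.
        ARGIND == 2 {
            if (!($1 in arr)) {
                print $0;
            }
        }
    ' "$1" "$2"
}

do_diff "$1" "$2"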

#!/usr/bin/perl -w
# diff.pl: perl version.
exit if (1 > $#ARGV);    # two file arguments are required

# Record the first field of every line of the first file.
my %map_orig;
my $file_orig = shift @ARGV;
open(FH, '<', $file_orig) or die "can't open file: $file_orig: $!";
while (<FH>) {
    chomp;
    #$map_orig{$_} = 1;
    my ($field) = split /\s+/;
    $map_orig{$field} = 1;
}
close(FH);

# Print each line of the second file whose first field was not recorded.
my $file_diff = shift @ARGV;
open(FH, '<', $file_diff) or die "can't open file: $file_diff: $!";
while (<FH>) {
    chomp;
    my ($field) = split /\s+/;
    print "$_\n" if (!defined $map_orig{$field});
}
close(FH);

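Test method: time diff.xx f1 f2 > out
The test file f1 has 107375330 lines, each in the format:
key value (two fields)
The file size is 2.2G.
f2 has 473951 lines, each also in the format:
key value (two fields)
The file size is 5.9M.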
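
The original test files are not available, so input of the same two-field shape has to be generated to repeat the test. The sketch below is one way to do that. The helper names, the integer keys, the constant value "v" and the uniform key distribution are all assumptions, since the post does not describe what the keys looked like.

```python
import random


def key_value_lines(n_lines, key_space, seed=0):
    """Yield n_lines lines of the form 'key value' (hypothetical data shape)."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same file
    for _ in range(n_lines):
        # Assumption: integer keys drawn uniformly from [0, key_space).
        yield "%d v\n" % rng.randrange(key_space)


def write_key_value_file(path, n_lines, key_space, seed=0):
    """Write the generated lines to path, one 'key value' pair per line."""
    with open(path, "w") as out:
        out.writelines(key_value_lines(n_lines, key_space, seed))


# Small in-memory sample to show the line format.
sample = list(key_value_lines(5, 10, seed=1))
```

For a scaled-down run, something like write_key_value_file("f1.small", 1000000, 2000000, seed=1) and write_key_value_file("f2.small", 5000, 2000000, seed=2) keeps roughly the original ratio between the two files. How many f2 keys are missing from f1 depends on the chosen key_space.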

Test results:
diff.py took 3m24.687s ≈ 205s
diff.sh took 3m39.762s ≈ 220s
diff.pl took 5m49.478s ≈ 349s
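
The conversion to seconds and the relative slowdown can be reproduced with a few lines of Python. This is only arithmetic on the three timings reported above. The helper name `to_seconds` is hypothetical.

```python
def to_seconds(stamp):
    """Convert a time-style string such as '3m24.687s' to seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


# Timings as reported in the post.
timings = {
    "diff.py": "3m24.687s",
    "diff.sh": "3m39.762s",
    "diff.pl": "5m49.478s",
}

seconds = {name: to_seconds(stamp) for name, stamp in timings.items()}
fastest = min(seconds.values())

# One line per script: name, rounded seconds, ratio to the fastest run.
for name, secs in sorted(seconds.items(), key=lambda item: item[1]):
    print("%s %.0fs %.2fx" % (name, secs, secs / fastest))
```

Relative to diff.py, diff.sh comes out at about 1.07x and diff.pl at about 1.71x.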

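The results show that awk and python perform about the same, while perl is noticeably slower. It seems python's dict is very well optimized. That it can keep up with awk came as a real surprise to me.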
