
[Experience Sharing] Perl Huge XML Solution (1): Split Files and Multiple Threads


Posted on 2017-5-19 10:04:05

1. Upgrade Perl
>sudo yum install cpan

>sudo cpan
cpan>install Bundle::CPAN
cpan>reload cpan

cpan>upgrade
This failed with the error message:
make NO isa perl

Attempted solution:
> sudo yum install perl-Config*

That did not get the Perl upgrade working either, but the modules can still be installed one by one:
cpan> install Time::Piece
cpan> install Path::Class
cpan> install autodie
cpan> install Thread::Queue
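
To confirm that the modules listed above are now available, a small check along these lines can help. This is only a sketch: the helper name missing_modules is made up for illustration, and the module list simply mirrors the install commands above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Modules installed in the step above.
my @modules = qw(Time::Piece Path::Class autodie Thread::Queue);

# Hypothetical helper: try to load each module and return the names
# of those that cannot be loaded.
sub missing_modules {
    my @names = @_;
    my @missing;
    for my $name (@names) {
        ( my $file = "$name.pm" ) =~ s{::}{/}g;    # Foo::Bar -> Foo/Bar.pm
        push @missing, $name unless eval { require $file; 1 };
    }
    return @missing;
}

my @missing = missing_modules(@modules);
print @missing ? "Missing: @missing\n" : "All modules are available\n";
```

Any name reported as missing can then be installed from the cpan prompt as shown above.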

2. Split the File
split_hero.pl
#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;
use Time::Piece;
use Path::Class;
use autodie; # die if problem reading or writing a file

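my $OutputSize = 0;            # size written to the current chunk so far
my $OutputCount = 0;           # number of chunk files opened so far
my $MaxSize = 100_000_000;     # start a new chunk after roughly 100 MB
my $HugeFileName = "data/728"; # reads data/728.xml, writes into data/728/ (must already exist)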

print localtime->strftime('%Y-%m-%d %X') . "\n";

my $out;
open(my $in, '<', $HugeFileName . '.xml') or die "input: $!\n";
while(<$in>) {
    if(!$out) {
        $OutputCount++;
        $OutputSize = 0;
        open($out, '>', $HugeFileName . "/output$OutputCount.xml") or die "output: $!\n";
        unless($OutputCount==1) {
            print $out qq{<?xml version='1.0' encoding='UTF-8'?>\n};
            print $out qq{<source>\n};
        }
    }
    print $out $_;
    $OutputSize += length($_);
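    # only close a chunk right after a complete <job> record
    if(m|</job>|i) {
        if($OutputSize > $MaxSize) {
            print $out "</source>\n";
            close($out);
            $out = undef;
        }
    }
}
close($out) if $out; # the last chunk relies on the closing tag already present in the input
close($in);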

my @files = glob($HugeFileName . "/*.xml");

my $dir = dir($HugeFileName);
my $list_file = $dir->file("file_list");
my $list_file_handle = $list_file->open('>>');

foreach my $file (@files) {
   $list_file_handle->print($file . "\n");
   print "$file\n";
}
$list_file_handle->close;

print localtime->strftime('%Y-%m-%d %X') . "\n";
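
The chunking rule used by the script can be tried on a few in-memory lines. The sketch below is not the script itself: split_records is a hypothetical helper that returns the chunks as strings instead of writing files, and the sample records are invented. It follows the same rule, though: a chunk is closed only at a </job> line once it has grown past the size limit, and every chunk after the first gets its own XML declaration and <source> opening tag.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the splitting rule of split_hero.pl.
sub split_records {
    my ( $lines, $max_size ) = @_;
    my @chunks;
    my $current;     # undef means "no chunk is open"
    my $size = 0;
    for my $line (@$lines) {
        if ( !defined $current ) {
            $size = 0;
            # The first chunk reuses the header lines of the input itself.
            $current = @chunks
                ? qq{<?xml version='1.0' encoding='UTF-8'?>\n<source>\n}
                : '';
        }
        $current .= $line;
        $size += length $line;
        if ( $line =~ m{</job>}i && $size > $max_size ) {
            $current .= "</source>\n";
            push @chunks, $current;
            undef $current;
        }
    }
    push @chunks, $current if defined $current;    # final, partly filled chunk
    return @chunks;
}

# Invented sample input: three small <job> records.
my @lines = (
    "<?xml version='1.0' encoding='UTF-8'?>\n",
    "<source>\n",
    "<job><id>1</id></job>\n",
    "<job><id>2</id></job>\n",
    "<job><id>3</id></job>\n",
    "</source>\n",
);
my @chunks = split_records( \@lines, 60 );
print "--- chunk ---\n$_" for @chunks;
```

Because a chunk is only closed after a complete record, each output stays well formed as long as every </job> closing tag ends a record on its own line, which is also what the script assumes.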

3. Multiple Threads in Perl
#!/usr/bin/perl

use strict;
use warnings;

use threads;
use Thread::Queue;

my $nthreads = 5;

my $process_q = Thread::Queue->new();
my $failed_q  = Thread::Queue->new();

#worker() is an ordinary subroutine that runs 'as a thread'.
#when a thread starts, it inherits the program state 'as is'. E.g.
#the variable declarations above all apply - but later changes to
#values within the program are 'thread local' unless the
#variable is declared as 'shared'.
#Behind the scenes - a Thread::Queue is a 'shared' array.

sub worker {

    #NB - this will sit in a loop indefinitely until you close the queue
    #using $process_q->end().
    #we do this once we've queued all the things we want to process;
    #dequeue() then returns undef when the queue is empty,
    #and the sub completes and exits neatly.
    #however if you _don't_ end it, this will sit waiting forever.
    #'defined' keeps a false item such as "0" from ending the loop early.
    while ( defined( my $server = $process_q->dequeue() ) ) {
        chomp($server);
        print threads->self()->tid() . ": pinging $server\n";
        my $result = `/sbin/ping -c 1 $server`;
        if ($?) { $failed_q->enqueue($server) }
        print $result;
    }
}

#insert tasks into thread queue.
open( my $input_fh, "<", "server_list" ) or die $!;
my @tasks = <$input_fh>;
close($input_fh);
print "task list:\n", @tasks;
$process_q->enqueue(@tasks);

#we 'end' process_q - once we do, no more items may be inserted,
#and 'dequeue' returns undef when the queue is empty.
#this means our worker threads (in their 'while' loop) will then exit.
$process_q->end();

#start some threads
for ( 1 .. $nthreads ) {
    threads->create( \&worker );
}

#Wait for threads to all finish processing.
foreach my $thr ( threads->list() ) {
    $thr->join();
}

#collate results. ('synchronise' operation)
while ( defined( my $server = $failed_q->dequeue_nb() ) ) {
    print "$server failed to ping\n";
}
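
One detail of the dequeue loops is worth illustrating: a plain truth test on the dequeued item ends the loop as soon as an item is false (for example "0" after chomp), while a defined test only ends it when the queue is ended and empty. The sketch below shows the difference on a single ended queue without starting any threads. The helper names are invented for illustration, and it assumes a Thread::Queue version that provides end().

```perl
use strict;
use warnings;
use Thread::Queue;

# Build a queue that already holds its items and accepts no more.
sub ended_queue {
    my $q = Thread::Queue->new(@_);
    $q->end();
    return $q;
}

# Loop with a plain truth test: stops at the first false item.
sub drain_truthy {
    my ($q) = @_;
    my @seen;
    while ( my $item = $q->dequeue() ) { push @seen, $item }
    return @seen;
}

# Loop with a defined test: stops only when dequeue() returns undef.
sub drain_defined {
    my ($q) = @_;
    my @seen;
    while ( defined( my $item = $q->dequeue() ) ) { push @seen, $item }
    return @seen;
}

my @items  = ( 'a', '0', 'b' );
my @truthy = drain_truthy( ended_queue(@items) );
my @all    = drain_defined( ended_queue(@items) );
print "truth test saw:   @truthy\n";
print "defined test saw: @all\n";
```

Lines read from a file still carry their newline when they are enqueued, so "0\n" is true and the difference only matters for items that are chomped or generated before they enter the queue.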

I changed the worker slightly so that it calls a PHP script for each queue entry instead of pinging it:
my $result = `php src/import.php 728 $server`;
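
Backticks pass the whole command line through a shell, so an entry containing spaces or quotes would be split or misread. One possible alternative is the list form of a piped open, which passes each argument to the program unchanged. This is a sketch only: run_capture is a hypothetical helper, the current perl binary stands in for php so the example runs anywhere, the tricky file name is invented, and list-form pipe open is assumed to be available (it is on Unix-like systems).

```perl
use strict;
use warnings;

# Hypothetical helper: run a command given as a list (no shell involved),
# capture its standard output, and return the output and the exit code.
sub run_capture {
    my @cmd = @_;
    open( my $ph, '-|', @cmd ) or die "cannot run $cmd[0]: $!\n";
    my $output = do { local $/; <$ph> };    # slurp everything the child prints
    close($ph);                             # sets $? to the child's wait status
    return ( $output // '', $? >> 8 );
}

# Demonstration: the argument contains a space and a quote, which a
# shell would not pass through unchanged.
my $tricky = "data/728/output 1's.xml";
my ( $out, $status ) = run_capture( $^X, '-e', 'print $ARGV[0]', $tricky );
print "status=$status output=$out\n";
```

In the worker, the equivalent call would be run_capture('php', 'src/import.php', '728', $server), with a non-zero exit code taking the place of the $? check.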

4. Test Results
Splitting the huge XML file (4.5 GB) on a machine with 2 CPU cores and 4 GB of memory took 00:02:05:
start 04:17:24
end   04:19:29

Sending the data to Redis/SQS on a machine with 2 CPU cores and 4 GB of memory took 00:03:12:
start 04:23:46
end   04:26:58
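
The durations can be re-derived from the timestamps with Time::Piece, which the split script already uses. The sketch below is only a check of the arithmetic: the helper names are invented, and the throughput line assumes that "4.5G" means 4.5e9 bytes.

```perl
use strict;
use warnings;
use Time::Piece;

# Elapsed seconds between two HH:MM:SS clock readings on the same day.
sub elapsed_seconds {
    my ( $start, $end ) = @_;
    my $t0 = Time::Piece->strptime( $start, '%H:%M:%S' );
    my $t1 = Time::Piece->strptime( $end,   '%H:%M:%S' );
    return $t1->epoch - $t0->epoch;
}

# Format a number of seconds as HH:MM:SS.
sub hms {
    my ($s) = @_;
    return sprintf '%02d:%02d:%02d', int( $s / 3600 ), int( ( $s % 3600 ) / 60 ), $s % 60;
}

for my $run ( [ 'split', '04:17:24', '04:19:29' ], [ 'send', '04:23:46', '04:26:58' ] ) {
    my ( $name, $start, $end ) = @$run;
    print "$name: ", hms( elapsed_seconds( $start, $end ) ), "\n";
}

# Rough split throughput, assuming "4.5G" means 4.5e9 bytes.
printf "split throughput: %.0f MB/s\n",
    4.5e9 / elapsed_seconds( '04:17:24', '04:19:29' ) / 1e6;
```

Under that assumption the split step works out to roughly 36 MB per second.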


