Perl Huge XML Solution (1) Split Files and Multiple Threads
1. Upgrade Perl
>sudo yum install cpan
>sudo cpan
cpan>install Bundle::CPAN
cpan>reload cpan
cpan>upgrade
The upgrade failed with this error message:
make NO isa perl
Attempted solution:
>sudo yum install perl-Config*
That still did not let me upgrade Perl, but I could install the modules I needed one by one:
cpan> install Time::Piece
cpan> install Path::Class
cpan> install autodie
cpan> install Thread::Queue
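A quick way to confirm that the one-by-one installs worked is to try loading each module. The following is only a minimal sketch; the helper name module_version is made up for illustration and is not part of any of these modules.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the installed version of a module,
# or undef if the module cannot be loaded.
sub module_version {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;    # Time::Piece -> Time/Piece.pm
    return undef unless eval { require $file; 1 };
    my $version = eval { $module->VERSION };
    return defined $version ? $version : 'unknown';
}

# The four modules installed above.
for my $module (qw(Time::Piece Path::Class autodie Thread::Queue)) {
    my $version = module_version($module);
    printf "%-15s %s\n", $module, defined $version ? $version : 'NOT INSTALLED';
}
```

Any module reported as NOT INSTALLED can be installed again from the cpan shell.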
2. Split The File
split_hero.pl
#!/usr/bin/perl
use strict;
use warnings;
use Time::Piece;
use Path::Class;
use autodie; # die if problem reading or writing a file

my $OutputSize   = 0;
my $OutputCount  = 0;
my $MaxSize      = 100_000_000;   # close a chunk once it has grown past this size
my $HugeFileName = "data/728";    # reads data/728.xml, writes into data/728/ (must already exist)

print localtime->strftime('%Y-%m-%d %X') . "\n";

my $out;
open(my $in, '<', $HugeFileName . '.xml') or die "input: $!\n";
while (<$in>) {
    if (!$out) {
        # Start a new chunk file.
        $OutputCount++;
        $OutputSize = 0;
        open($out, '>', $HugeFileName . "/output$OutputCount.xml") or die "output: $!\n";
        # The first chunk gets its header from the input itself;
        # every later chunk needs the header written again.
        unless ($OutputCount == 1) {
            print $out qq{<?xml version='1.0' encoding='UTF-8'?>\n};
            print $out qq{<source>\n};
        }
    }
    print $out $_;
    $OutputSize += length($_);
    # Only split right after a closing </job> tag, so no job is cut in half.
    if (m|</job>|i) {
        if ($OutputSize > $MaxSize) {
            print $out "</source>\n";
            close($out);
            $out = undef;
        }
    }
}
close($in);
close($out) if $out;   # the last chunk is still open at end of input

# Write the list of chunk files for the worker threads to consume.
my @files = glob($HugeFileName . "/*.xml");
my $dir = dir($HugeFileName);
my $list_file = $dir->file("file_list");
my $list_file_handle = $list_file->open('>>');
foreach my $file (@files) {
    $list_file_handle->print($file . "\n");
    print "$file\n";
}
$list_file_handle->close;

print localtime->strftime('%Y-%m-%d %X') . "\n";
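The boundary rule used by the script (close a chunk only right after a </job> line, and only once the chunk has grown past the size limit) can be tried on a small in-memory sample. This is a sketch rather than the script itself: split_into_chunks and the sample data are made up for illustration, and it leaves out the XML header and <source> wrapper handling.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: groups lines into chunks. A chunk is closed only
# after a line containing </job>, and only once it is larger than $max_size.
sub split_into_chunks {
    my ($lines, $max_size) = @_;
    my @chunks;
    my ($current, $size) = ('', 0);
    for my $line (@$lines) {
        $current .= $line;
        $size    += length $line;
        if ($line =~ m{</job>}i && $size > $max_size) {
            push @chunks, $current;
            ($current, $size) = ('', 0);
        }
    }
    push @chunks, $current if length $current;    # trailing partial chunk
    return @chunks;
}

# Sample input: three small <job> records, one element per line.
my @lines  = map { ("<job>\n", "<id>$_</id>\n", "</job>\n") } 1 .. 3;
my @chunks = split_into_chunks(\@lines, 20);
printf "chunk %d: length %d\n", $_ + 1, length $chunks[$_] for 0 .. $#chunks;
```

Because the size check only happens on a </job> line, a chunk can end up somewhat larger than the limit, but a job record is never split across two files.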
3. Multiple Threads in Perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $nthreads  = 5;
my $process_q = Thread::Queue->new();
my $failed_q  = Thread::Queue->new();

# This is a subroutine, but it runs 'as a thread'.
# When it starts, it inherits the program state 'as is', e.g.
# the variable declarations above all apply, but changes to
# values within the program are 'thread local' unless the
# variable is defined as 'shared'.
# Behind the scenes, Thread::Queue objects are 'shared' arrays.
sub worker {
    # NB: this will sit in a loop indefinitely until you close the queue
    # using $process_q->end().
    # We do this once we've queued all the things we want to process,
    # and the sub then completes and exits neatly.
    # However, if you _don't_ end it, this will sit waiting forever.
    # defined() keeps a false item such as "0" from ending the loop early.
    while ( defined( my $server = $process_q->dequeue() ) ) {
        chomp($server);
        print threads->self()->tid() . ": pinging $server\n";
        my $result = `/sbin/ping -c 1 $server`;
        if ($?) { $failed_q->enqueue($server) }
        print $result;
    }
}

# Insert tasks into the thread queue.
open( my $input_fh, "<", "server_list" ) or die $!;
my @servers = <$input_fh>;
close($input_fh);
print "number of tasks = " . scalar(@servers) . "\n";
$process_q->enqueue(@servers);

# We 'end' process_q; once we do, no more items may be inserted,
# and 'dequeue' returns undef when the queue is emptied.
# This means our worker threads (in their 'while' loop) will then exit.
$process_q->end();

# Start some threads.
for ( 1 .. $nthreads ) {
    threads->create( \&worker );
}

# Wait for all threads to finish processing.
foreach my $thr ( threads->list() ) {
    $thr->join();
}

# Collate results ('synchronise' operation).
while ( defined( my $server = $failed_q->dequeue_nb() ) ) {
    print "$server failed to ping\n";
}
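The queue pattern above can also be tried without any network access. The following is a minimal sketch under my own assumptions (two workers that double small numbers, with the made-up names $work_q and $result_q); it needs a Perl built with thread support. It shows why testing the dequeued value with defined() matters: the first item is 0, which is false but is still a valid task.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $work_q   = Thread::Queue->new();
my $result_q = Thread::Queue->new();

# Worker: test with defined() so that a false item such as 0
# does not end the loop early; only undef (queue ended) does.
sub worker {
    while ( defined( my $item = $work_q->dequeue() ) ) {
        $result_q->enqueue( $item * 2 );
    }
}

$work_q->enqueue( 0 .. 4 );    # note: the first item is 0
$work_q->end();                # dequeue() returns undef once the queue is empty

my @workers = map { threads->create( \&worker ) } 1 .. 2;
$_->join() for @workers;

# Collect the results; the order depends on thread scheduling, so sort them.
my @results;
while ( defined( my $r = $result_q->dequeue_nb() ) ) {
    push @results, $r;
}
@results = sort { $a <=> $b } @results;
print "results: @results\n";
```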
I changed the worker slightly so that it calls a PHP script instead of ping:
my $result = `php src/import.php 728 $server`;
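Backticks pass the whole command line through the shell, so a file name containing spaces or shell metacharacters could break the call. One alternative is the list form of a piped open, which runs the command without a shell. This is a sketch only: run_capture is a made-up helper, and the demo runs the current Perl interpreter ($^X) as a stand-in because PHP and src/import.php may not be available.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: runs a command given as a list (no shell involved)
# and returns its captured output and exit code.
sub run_capture {
    my (@cmd) = @_;
    open( my $pipe, '-|', @cmd ) or die "cannot run $cmd[0]: $!";
    my $output = do { local $/; <$pipe> };    # slurp everything the child prints
    close($pipe);                             # sets $? to the child's wait status
    return ( $output, $? >> 8 );
}

# Stand-in command for the demo: the running perl itself ($^X).
# In the worker the call might look like (an assumption, not tested here):
#   run_capture('php', 'src/import.php', '728', $server)
my ( $output, $exit ) =
    run_capture( $^X, '-e', 'print "imported $ARGV[0]"', 'output1.xml' );
print "exit=$exit output=$output\n";
```

A non-zero exit code can then be used in the same way as $? in the worker, for example to put the file on the failed queue.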
4. Test Result
Splitting the huge XML file (4.5 GB) on a machine with 2 CPU cores and 4 GB of memory took 00:02:05:
start 04:17:24
end 04:19:29
Sending the data to Redis/SQS on a machine with 2 CPU cores and 4 GB of memory took 00:03:12:
start 04:23:46
end 04:26:58
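The durations follow from the two timestamps of each run. A small sketch of the arithmetic using Time::Piece, which the split script already loads; the helper name elapsed is made up, and it assumes both timestamps fall on the same day.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::Piece;

# Hypothetical helper: elapsed time between two HH:MM:SS stamps on the same day.
sub elapsed {
    my ( $start, $end ) = @_;
    # Subtracting two Time::Piece objects gives a Time::Seconds object.
    my $diff = Time::Piece->strptime( $end,   '%H:%M:%S' )
             - Time::Piece->strptime( $start, '%H:%M:%S' );
    my $seconds = $diff->seconds;
    return sprintf '%02d:%02d:%02d',
        int( $seconds / 3600 ), int( $seconds % 3600 / 60 ), $seconds % 60;
}

print 'split:  ', elapsed( '04:17:24', '04:19:29' ), "\n";    # 00:02:05
print 'import: ', elapsed( '04:23:46', '04:26:58' ), "\n";    # 00:03:12
```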
References:
http://sillycat.iteye.com/blog/1017590 file handler
http://sillycat.iteye.com/blog/2193773
Perl 1, 2, 3, 4, 6
http://sillycat.iteye.com/blog/1012882
http://sillycat.iteye.com/blog/1012923
http://sillycat.iteye.com/blog/1012940
http://sillycat.iteye.com/blog/1016428
http://sillycat.iteye.com/blog/1017632 string
http://sillycat.iteye.com/blog/1021197 web
http://sillycat.iteye.com/blog/1027282 queue client
http://sillycat.iteye.com/blog/1073593 browser info
Split XML File
http://stackoverflow.com/questions/11313852/split-one-file-into-multiple-files-based-on-delimiter
http://stackoverflow.com/questions/15503980/split-file-by-xml-tag
http://www.experts-exchange.com/Programming/Languages/Scripting/Perl/Q_24760607.html
https://metacpan.org/pod/XML::Twig#xml_split---cut-a-big-XML-file-into-smaller-chunks
http://code.izzid.com/2008/01/21/How-to-move-back-a-line-with-reading-a-perl-filehandle.html
Perl threads
http://stackoverflow.com/questions/26296206/perl-daemonize-with-child-daemons/26297240#26297240
http://stackoverflow.com/questions/6556976/how-to-use-perl-to-run-the-same-php-script-parallel
Perl Zip the File
http://perldoc.perl.org/IO/Compress/Zip.html