Frequent Advisor
Luke Morgan
Posts: 67
Registered: ‎11-06-2000
Message 1 of 9 (31 Views)
Accepted Solution

Script help. Perl perhaps ?

Hi.
I have written the script below to do a simple lookup but now I need to do it on a much larger datafile and it will take an age.
I suspect perl is the way to go to make it faster, but I don't know any perl :(

Could someone help me to translate this script please to save my server days of processing?
Thanks in advance.

The script looks up each line of a data file, compares certain fields with fields from a master file, and outputs an id value from the master file along with certain fields from the data file.

masterfile=/tmp/masterfile
datafile=/tmp/datafile

for b in `cat $datafile`
do

compare=`echo $b | awk 'BEGIN{FS="|"}{data = $2$5$6;print data}END{}' `

for a in `cat $masterfile`
do
mastercompare=`echo $a | awk 'BEGIN{FS="|"}{line = $2$5$6;print line}END{}'`
if [ $mastercompare = $compare ]
then
id=`echo $a | awk 'BEGIN{FS="|"}{print $1}END{}'`
output=`echo $b | awk 'BEGIN{FS="|";OFS="|"}{print $3,$7}END{}'`
echo $id"|"$output >> luke.out
fi
done

done
Honored Contributor
Peter Nikitka
Posts: 1,575
Registered: ‎02-10-2003
Message 2 of 9 (31 Views)

Re: Script help. Perl perhaps ?

Hi,

My suggestion reads the masterfile and the datafile only once and keeps the ids from the masterfile in an array, indexed by the compare fields (untested). This should be MUCH faster:

awk -F'|' 'BEGIN { OFS="|";
while ((getline < "/tmp/masterfile") == 1) id[$2$5$6] = $1 }
{ line=$2$5$6; if(id[line]) print (id[line],$3,$7) }' /tmp/datafile >luke.out
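
A minimal sketch of the same idea on made-up data, to show what the one-liner should produce. The sample records, the `lookup` function name and the use of `mktemp -d` are assumptions for illustration only; the real files will look different. It also tests `(key in id)` instead of `if (id[key])`, a small variation so that an id of 0 would still count as a match.

```shell
#!/bin/sh
# Hypothetical sample data: 7 pipe-separated fields, compare fields are $2, $5, $6.
tmpdir=$(mktemp -d)
cat > "$tmpdir/masterfile" <<'EOF'
ID01|1234|x|x|56|A|x
ID02|4321|x|x|65|B|x
EOF
cat > "$tmpdir/datafile" <<'EOF'
r1|1234|foo|x|56|A|bar
r2|9999|nope|x|00|Z|nope
r3|4321|baz|x|65|B|qux
EOF

# lookup MASTER DATA: read MASTER once into an array, then stream DATA past it.
lookup() {
    awk -F'|' -v master="$1" 'BEGIN { OFS = "|"
        while ((getline < master) > 0) id[$2 $5 $6] = $1
        close(master) }
    { key = $2 $5 $6; if (key in id) print id[key], $3, $7 }' "$2"
}

result=$(lookup "$tmpdir/masterfile" "$tmpdir/datafile")
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Each data line costs one array lookup instead of a full pass over the masterfile, which is where the speed-up comes from.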

mfG Peter
The Universe is a pretty big place, it's bigger than anything anyone has ever dreamed of before. So if it's just us, seems like an awful waste of space, right? Jodie Foster in "Contact"
Honored Contributor
Peter Nikitka
Posts: 1,575
Registered: ‎02-10-2003
Message 3 of 9 (31 Views)

Re: Script help. Perl perhaps ?

Hi,

to be clean, close the masterfile after reading:


awk -F'|' 'BEGIN { OFS="|"
while ((getline < "/tmp/masterfile") == 1) id[$2$5$6] = $1
close ("/tmp/masterfile") }
{ line=$2$5$6; if(id[line]) print (id[line],$3,$7) }' /tmp/datafile >luke.out

mfG Peter
Honored Contributor
H.Merijn Brand (procura
Posts: 6,185
Registered: ‎10-13-1997
Message 4 of 9 (31 Views)

Re: Script help. Perl perhaps ?

Yes, in this case perl would be very much faster.

--8<---
#!/usr/bin/perl

use strict;
use warnings;

my $masterfile = "/tmp/masterfile";
my $datafile = "/tmp/datafile";

open my $out, ">", "luke.out" or die "luke.out: $!\n";

my %master;
open my $mst, "<", $masterfile or die "$masterfile: $!\n";
while (<$mst>) {
    chomp;    # strip the newline so a key field at end of line compares cleanly
    my @mst = split /\|/, $_;
    $master{join "|", @mst[1,4,5]} = [ @mst[0,2,6] ];
}
close $mst;

open my $dta, "<", $datafile or die "$datafile: $!\n";
while (<$dta>) {
    chomp;    # otherwise $dta[6] may still carry the line's newline
    my @dta = split /\|/, $_;
    my $cmp = join "|", @dta[1,4,5];
    exists $master{$cmp} or next;

    my $id = $master{$cmp}[0];
    print $out "$id|$dta[2]|$dta[6]\n";
}
close $dta;
close $out;
-->8---

You could gain even more speed by changing the join "|" calls to pack, if you tell us the format of the fields.

Enjoy, Have FUN! H.Merijn
Honored Contributor
H.Merijn Brand (procura
Posts: 6,185
Registered: ‎10-13-1997
Message 5 of 9 (31 Views)

Re: Script help. Perl perhaps ?

Peter, I like your awk solution, which is pretty close to what I do in perl, but yours would be safer if you included the separator in the key.

As we were not told what the data looks like, your script would map both (12, 345, 6789) and (1, 23456, 789) to the same key, because both concatenate to 123456789.
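
A small sketch of that collision, assuming nothing about the real data: the `make_key` helper and its arguments are made up for the demonstration and only mimic how the awk script builds its key.

```shell
#!/bin/sh
# make_key A B C SEP: join three field values the way the awk key is built,
# with SEP (possibly empty) between them.
make_key() {
    awk -v a="$1" -v b="$2" -v c="$3" -v sep="$4" 'BEGIN { print a sep b sep c }'
}

plain1=$(make_key 12 345 6789 "")
plain2=$(make_key 1 23456 789 "")
safe1=$(make_key 12 345 6789 "|")
safe2=$(make_key 1 23456 789 "|")

echo "without separator: $plain1 vs $plain2"
echo "with separator:    $safe1 vs $safe2"
```

Without the separator the two different field triples give the same key; with it they stay distinct.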

Enjoy, Have FUN! H.Merijn
Frequent Advisor
Luke Morgan
Posts: 67
Registered: ‎11-06-2000
Message 6 of 9 (31 Views)

Re: Script help. Perl perhaps ?

Thank you both very much for your suggestions.
I have implemented Peter's script and the difference in speed is astonishing!

FYI, the format of the data is this:
$2 is a four digit number
$5 is a two digit number
$6 is a single character
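
With fixed widths like these the concatenated key cannot collide, but only as long as every record really has that shape. A rough sketch of a check, assuming the three rules above are the whole story; the `check_keys` name and the sample lines are invented for the example:

```shell
#!/bin/sh
# check_keys FILE: print the line number of every record whose compare fields
# are not "four digits | two digits | one character".
check_keys() {
    awk -F'|' '$2 !~ /^[0-9][0-9][0-9][0-9]$/ ||
               $5 !~ /^[0-9][0-9]$/ ||
               length($6) != 1 { print FNR }' "$1"
}

# Hypothetical sample: line 1 is fine, line 2 has a short $2, line 3 a long $6.
tmpfile=$(mktemp)
printf '%s\n' 'r1|1234|foo|x|56|A|bar' \
              'r2|123|foo|x|56|A|bar' \
              'r3|1234|foo|x|56|AB|bar' > "$tmpfile"

bad_lines=$(check_keys "$tmpfile")
printf 'records with unexpected key fields: %s\n' "$(echo $bad_lines)"
rm -f "$tmpfile"
```

If the check prints nothing for both files, the key without a separator is unambiguous for this data.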

Thanks again

Luke
Honored Contributor
H.Merijn Brand (procura
Posts: 6,185
Registered: ‎10-13-1997
Message 7 of 9 (31 Views)

Re: Script help. Perl perhaps ?

Please bear in mind my last remark about the keys generated in the awk solution!

--8<--- perl with pack
#!/usr/bin/perl

use strict;
use warnings;

my $masterfile = "/tmp/masterfile";
my $datafile = "/tmp/datafile";

open my $out, ">", "luke.out" or die "luke.out: $!\n";

my %master;
open my $mst, "<", $masterfile or die "$masterfile: $!\n";
while (<$mst>) {
    chomp;
    my @mst = split /\|/, $_;
    # two 16-bit numbers plus one character give a compact fixed-width key
    $master{pack "ssA", @mst[1,4,5]} = $mst[0];
}
close $mst;

open my $dta, "<", $datafile or die "$datafile: $!\n";
while (<$dta>) {
    chomp;    # otherwise $dta[6] may still carry the line's newline
    my @dta = split /\|/, $_;
    my $cmp = pack "ssA", @dta[1,4,5];
    exists $master{$cmp} or next;

    my $id = $master{$cmp};
    print $out "$id|$dta[2]|$dta[6]\n";
}
close $dta;
close $out;
-->8---

Enjoy, Have FUN! H.Merijn
Honored Contributor
Peter Nikitka
Posts: 1,575
Registered: ‎02-10-2003
Message 8 of 9 (31 Views)

Re: Script help. Perl perhaps ?

Hi,

Procura is totally correct in his remark. To include this in my awk solution, simply add the field separator to the key:

awk -F'|' 'BEGIN { OFS="|"
while ((getline < "/tmp/masterfile") == 1) id[$2$FS$5$FS$6] = $1
close ("/tmp/masterfile") }
{ line=$2$FS$5$FS$6; if(id[line]) print (id[line],$3,$7) }' /tmp/datafile >luke.out

mfG Peter
Honored Contributor
Peter Nikitka
Posts: 1,575
Registered: ‎02-10-2003
Message 9 of 9 (31 Views)

Re: Script help. Perl perhaps ?

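Oops,

normally I do not complain about additional dollars :-).
But here they are wrong: FS holds "|", so $FS is evaluated as $0, the whole input line. Use

id[$2FS$5FS$6] = $1
instead of
id[$2$FS$5$FS$6] = $1

and likewise line=$2FS$5FS$6 in the main block.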
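
A short sketch of the difference on a single made-up record (the record itself is an assumption, chosen only so that the compare fields sit in positions 2, 5 and 6):

```shell
#!/bin/sh
# One hypothetical record with the compare fields in positions 2, 5 and 6.
record='ID01|1234|x|x|56|A|x'

# With the dollar: FS is "|", which counts as 0 when used as a field number,
# so $FS means $0 and the whole record ends up inside the key.
wrong=$(printf '%s\n' "$record" | awk -F'|' '{ print $2$FS$5$FS$6 }')

# Without the dollar: FS is simply the separator string.
right=$(printf '%s\n' "$record" | awk -F'|' '{ print $2 FS $5 FS $6 }')

echo "with \$FS: $wrong"
echo "with FS:  $right"
```

With $FS the key contains the complete input line twice, so a key built from a masterfile line would essentially never equal one built from a datafile line.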

mfG Peter