Fast way to extract lines from a large file with 80 billion lines
I have a large file with 80 billion lines. I want to extract a few lines (around 10000) whose line numbers I know. What is the fastest way to do this? Your help is really appreciated.
Is it possible to extract those lines using another file that contains the line numbers? The line numbers in that file would not always be consecutive.
For example, the original file is:
0.1
0.2
0.3
0.4
...
the line number file:
1
3
4
the output:
0.1
0.3
0.4
linux large-files
If you expect to have to do this more than once, consider putting the lines into an SQL database or something of the sort.
– Nate Eldredge
Mar 14 at 3:36
Are they sequential 10000 lines or sporadic throughout the log file? If sequential and there's some unique pattern at the beginning you could just use grep -A 10000 <pattern> <filename>.
– kevlinux
Mar 14 at 3:48
2
Are line numbers in line number file sorted?
– JohnKoch
Mar 14 at 7:54
2
Are the lines expected to be extracted in the order of the line numbers in the smaller file?
– Kusalananda
Mar 14 at 8:03
1
This might help: stackoverflow.com/questions/6022384/…
– kevlinux
Mar 15 at 6:08
asked Mar 14 at 3:28 by user2842390
edited Mar 14 at 8:38 by Kusalananda
4 Answers
One liner, using sed:
sed -nf <(sed 's/$/p/' linenumberfile) contentfile
To keep the original order in linenumberfile, you can do
sed -nf <(sed 's/$/p/' linenumberfile) contentfile | paste <(nl linenumberfile | sort -n -k 2,2) - | sort -n -k 1,1 | cut -f 3-
Explanation:
sed 's/$/p/' linenumberfile
generates a sed script that prints the specified lines. That script is then fed to a second sed (with -n to suppress the default printing of the pattern space), which does the actual printing. Since sed processes the content file line by line, the output is in the same order as in the content file. This is a single pass over the file, so I would expect the speed to be acceptable.
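To speed this up, one can replace each p with a braced group that prints and then branches to the end of the script (p followed by b, inside { }), so that sed skips the remaining commands once a line has matched. The braces matter: in a plain Np;b the b would apply to every input line. sed should also quit once the largest requested line has been printed, for instance by using q instead of b in the group for that line number; a bare q at the end of the script would instead fire on the first line that is not listed.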
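A minimal sketch of one way to generate such a faster script, not necessarily what the answer's author had in mind. The function name extract_lines_fast is hypothetical, and the sketch assumes one positive integer per line in the number file, with no duplicates and no blank lines. It writes the generated script to a temporary file instead of using process substitution, puts every command group on separate lines so that any POSIX sed should accept it, and attaches the quit to the largest requested line number.

```shell
# Hypothetical helper: print the lines of file $2 whose numbers are listed in file $1.
# Output is in content-file order, as with the plain sed one-liner.
extract_lines_fast() {
    numbers=$1
    content=$2
    script=$(mktemp) || return 1
    # Largest requested line number; sed can stop reading after printing it.
    max=$(sort -n "$numbers" | tail -n 1)
    # For each number N emit a group "N{ p; b }" (print, skip the rest of the script).
    # For the largest number emit "N{ p; q }" (print, then quit).
    awk -v max="$max" '
        $1 == max { printf "%s{\np\nq\n}\n", $1; next }
                  { printf "%s{\np\nb\n}\n", $1 }
    ' "$numbers" > "$script"
    sed -n -f "$script" "$content"
    rm -f "$script"
}
```

Usage would be extract_lines_fast linenumberfile contentfile. Whether sed copes with line numbers beyond 32 bits is a separate question that this sketch does not address.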
To retain the order of the lines as they are in the line number file, nl is used to add "line numbers" to the line number file. So a line number file
4
5
2
would become
1 4
2 5
3 2
The first column records the original order in the line number file.
The file with "line numbers" is then sorted and pasted to the output of sed, to make
3 2 content_of_line2
1 4 content_of_line4
2 5 content_of_line5
then it is sorted using the 1st column as the key, to finally obtain
1 4 content_of_line4
2 5 content_of_line5
3 2 content_of_line2
Finally, cut is used to remove the 2 extra columns.
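The reordering steps described above can be wrapped into a small function, shown here as a sketch. The name reorder_to_list_order and the temporary file are illustrative assumptions; the pipeline itself is the nl, sort, paste, sort, cut chain from the answer. It assumes the extract file holds exactly one line per entry of the number file, in content-file order, and that the number file has no duplicates or blank lines.

```shell
# Hypothetical helper: reorder the extracted lines in file $2 (content-file order)
# into the order in which their numbers appear in file $1.
reorder_to_list_order() {
    numbers=$1
    extract=$2
    tagged=$(mktemp) || return 1
    # Column 1: position in the list, column 2: requested line number.
    # Sorting by column 2 puts the rows in content-file order.
    nl "$numbers" | sort -n -k 2,2 > "$tagged"
    # Attach each extracted line to its row, sort back by list position,
    # then drop the two helper columns.
    paste "$tagged" "$extract" | sort -n -k 1,1 | cut -f 3-
    rm -f "$tagged"
}
```

With a number file containing 4, 5, 2 and an extract holding lines 2, 4 and 5 of the content file, the function prints the contents of lines 4, 5 and 2, in that order.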
Benchmarking
It seems sed would do best for a few lines, but perl is the way to go for 10000 lines as specified in the question.
$ cat /proc/cpuinfo | grep -A 4 -m 1 processor
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Core(TM) i5-4590 CPU @ 3.30GHz
$ wc -l linenumber
10 linenumber
$ wc -l content
8982457 content
$ file content
content: ASCII text
$ time bash -c "sed -nf <(sed 's/$/p/' linenumber) content > /dev/null"
real 0m0.791s
user 0m0.661s
sys 0m0.133s
$ time bash -c "awk 'FNR==NR {seen[$0]++}; FNR!=NR && FNR in seen' linenumber content > /dev/null"
real 0m3.061s
user 0m2.908s
sys 0m0.152s
$ time bash -c "./ln.pl linenumber content > /dev/null"
real 0m1.706s
user 0m1.582s
sys 0m0.124s
$ ./genlinenumber.py 100 > linenumber
$ wc -l linenumber
100 linenumber
$ time bash -c "sed -nf <(sed 's/$/p/' linenumber) content > /dev/null"
real 0m3.326s
user 0m3.164s
sys 0m0.164s
$ time bash -c "awk 'FNR==NR {seen[$0]++}; FNR!=NR && FNR in seen' linenumber content > /dev/null"
real 0m3.055s
user 0m2.890s
sys 0m0.164s
$ time bash -c "./ln.pl linenumber content > /dev/null"
real 0m1.769s
user 0m1.604s
sys 0m0.165s
If the order of the line number file must be retained, the part of the pipeline after the first | can still be applied to the extracted lines; its run time is negligible.
$ ./genlinenumber.py 10000 > linenumber
$ wc -l linenumber
10000 linenumber
$ time bash -c "./ln.pl linenumber content > extract"
real 0m1.933s
user 0m1.791s
sys 0m0.141s
$ time bash -c "paste <(nl linenumber | sort -n -k 2,2) extract | sort -n -k 1,1 | cut -f 3- > /dev/null"
real 0m0.018s
user 0m0.012s
sys 0m0.005s
1
Done. Suggestions are welcome.
– Weijun Zhou
Mar 14 at 8:58
Do common versions of sed support line numbers larger than 32 bits?
– Nate Eldredge
Mar 14 at 14:40
1
You are right. I made a mistake in last edits.
– Weijun Zhou
Mar 14 at 20:22
@NateEldredge I'm not sure about that.
– Weijun Zhou
Mar 15 at 0:01
I would use a perl script for this. I came up with this:
#!/usr/bin/perl
# usage: thisscript linenumberslist.txt contentsfile
unless (open(IN, $ARGV[0])) {
    die "Can't open list of line numbers file '$ARGV[0]'\n";
}
# Remember every wanted line number as a hash key.
my %linenumbers = ();
while (<IN>) {
    chomp;
    $linenumbers{$_} = 1;
}
unless (open(IN, $ARGV[1])) {
    die "Can't open contents file '$ARGV[1]'\n";
}
# Reopening IN does not reset the line counter, so reset it by hand.
$. = 0;
while (<IN>) {
    print if defined $linenumbers{$.};
}
exit;
This first reads the list of line numbers that we're interested in into an associative array, where the line numbers are the keys. chomp removes the newline at the end of the line, and $_ is the line itself.
Next the data file is opened, and whenever the current line number is an existing key in the array of line numbers, the line is printed.
$. is perl's line number counter; it increments for every line read. Because the count carries over when the same filehandle is reopened, I reset it to zero before reading any lines of the data file.
This could probably be written much more in "perl" style, but I prefer to keep it a bit more readable.
If the list of lines you want to extract is very large, this may not be the most efficient way, but I find that perl is often amazingly efficient at these things.
If you require the lines to be extracted in the order that they are listed, i.e. not sequentially, then it becomes a lot more complicated...
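For a list of around 10000 numbers, one possible way to get the lines back in listed order is to hold the matched lines in memory and print them at the end. The following is a sketch using awk rather than the perl script above; the function name extract_in_list_order is hypothetical, and it assumes the numbers are written without leading zeros and that the awk in use handles line numbers of this magnitude.

```shell
# Hypothetical helper: print the lines of file $2 whose numbers are listed
# in file $1, in the order in which the numbers are listed.
extract_in_list_order() {
    awk '
        # First file: remember the listed order and the set of wanted numbers.
        FNR == NR { order[++n] = $1; want[$1]; next }
        # Second file: keep wanted lines in memory, keyed by line number.
        FNR in want { line[FNR] = $0 }
        # Finally print the kept lines in listed order.
        END {
            for (i = 1; i <= n; i++)
                if (order[i] in line) print line[order[i]]
        }
    ' "$1" "$2"
}
```

This only stays cheap because the number of extracted lines is small; the content file is still read in a single pass.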
Here are an alternative method and a bit of benchmarking, adding to that in Weijun Zhou's answer.
join
Assuming you have a data file you want to extract rows from and a line_numbers file that lists the numbers of the rows you want to extract, if the sorting order of the output is not important you can use:
join <(sort padded_line_numbers) <(nl -w 12 -n rz data) | cut -d ' ' -f 2-
This will number the lines of your data file, join it with the padded_line_numbers file on the first field (the default) and print out the common lines (excluding the join field itself, that is cut away).
join needs the input files to be sorted alphabetically. The aforementioned padded_line_numbers file has to be prepared by left-padding each line of your line_numbers file. E.g.:
while read rownum; do
    printf '%.12d\n' "$rownum"
done <line_numbers >padded_line_numbers
The -w 12 -n rz options instruct nl to output 12-digit numbers, right-justified with leading zeros.
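The padding and join steps can be combined into one function, sketched below under a few assumptions of my own. The name join_extract and the temporary file are hypothetical. Unlike the command above, the sketch passes -b a to nl so that empty lines are numbered too, and uses a tab as the join and cut delimiter so that spaces inside data lines are left alone. The padding matters because join compares keys as strings: without it, 10 would sort before 2.

```shell
# Hypothetical helper: print the lines of file $2 whose numbers are listed
# in file $1, in data-file order, using join on zero-padded line numbers.
join_extract() {
    numbers=$1
    data=$2
    tab=$(printf '\t')
    padded=$(mktemp) || return 1
    # Left-pad every number to 12 digits so that string order equals numeric order.
    while read -r n; do
        printf '%012d\n' "$n"
    done < "$numbers" | sort > "$padded"
    # Number every data line with 12 zero-padded digits, join on that key,
    # then cut the key away.
    nl -b a -w 12 -n rz "$data" | join -t "$tab" "$padded" - | cut -f 2-
    rm -f "$padded"
}
```

The sketch assumes the listed numbers have no leading zeros, since printf would read those as octal.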
If the sorting order of the output has to match that of your line_numbers file, you can use:
join -1 2 -2 1 <(nl padded_line_numbers | sort -k 2,2) \
    <(nl -w 12 -n rz data) |
    sort -k 2,2n |
    cut -d ' ' -f 3-
Where we are numbering the padded_line_numbers file, sorting the result alphabetically by its second field, joining it with the numbered data file and numerically sorting the result by the original sorting order of padded_line_numbers.
Process substitution is used here for convenience. If you cannot or do not want to rely on it and, as is likely, you are not willing to spend the storage needed for regular files holding the intermediate results, you can use named pipes:
mkfifo padded_line_numbers
mkfifo numbered_data
while read rownum; do
    printf '%.12d\n' "$rownum"
done <line_numbers | nl | sort -k 2,2 >padded_line_numbers &
nl -w 12 -n rz data >numbered_data &
join -1 2 -2 1 padded_line_numbers numbered_data | sort -k 2,2n | cut -d ' ' -f 3-
Benchmarking
Since the peculiarity of your question is the number of rows in your data file, I thought it could be useful to test alternative approaches with a comparable amount of data.
For my tests I used a 3.2 billion lines data file. Each line is just 2 bytes of garbage coming from openssl enc, hex-encoded using od -An -tx1 -w2 and with spaces removed with tr -d ' ':
$ head -n 3 data
c15d
061d
5787
$ wc -l data
3221254963 data
The line_numbers file has been created by randomly choosing 10,000 numbers between 1 and 3,221,254,963, without repetitions, using shuf from GNU Coreutils:
shuf -i 1-"$(wc -l <data)" -n 10000 >line_numbers
The testing environment was a laptop with an Intel i7-2670QM quad-core processor, 16 GiB of memory, SSD storage, GNU/Linux, bash 5.0 and GNU tools.
The only dimension I measured was the execution time, by means of the time shell builtin.
Here I'm considering:
- The sed solution from Weijun Zhou's answer.
- The awk solution from Micha's answer.
- The perl solution from wurtel's answer.
- The join solution above.
perl seems to be the fastest:
$ time perl_script line_numbers data | wc -l
10000
real 14m51.597s
user 14m41.878s
sys 0m9.299s
awk takes roughly twice as long as perl here, but stays within the same order of magnitude:
$ time awk 'FNR==NR {seen[$0]++}; FNR!=NR && FNR in seen' line_numbers data | wc -l
10000
real 29m3.808s
user 28m52.616s
sys 0m10.709s
join, too, appears to be comparable:
$ time join <(sort padded_line_numbers) <(nl -w 12 -n rz data) | wc -l
10000
real 28m24.053s
user 27m52.857s
sys 0m28.958s
Note that the sorted version mentioned above has almost no performance penalty over this one.
Finally, sed appears to be significantly slower: I killed it after approximately nine hours:
$ time sed -nf <(sed 's/$/p/' line_numbers) data | wc -l
^C
real 551m12.747s
user 550m53.390s
sys 0m15.624s
micha@linux-micha: /tmp
$ cat numbers.txt
1
2
4
5
micha@linux-micha: /tmp
$ cat sentences.txt
alpha
bravo
charlie
delta
echo
foxtrott
micha@linux-micha: /tmp
$ awk 'FNR==NR {seen[$0]++}; FNR!=NR && FNR in seen' numbers.txt sentences.txt
alpha
bravo
delta
echo
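A possible refinement of the awk approach above, offered as a sketch rather than as part of the original answer: stop reading as soon as every requested line has been printed, so the tail of a very large file is never scanned. The function name extract_until_done is hypothetical, and the sketch assumes the numbers are written without leading zeros.

```shell
# Hypothetical helper: print the lines of file $2 whose numbers are listed
# in file $1 (in file order) and stop once all of them have been found.
extract_until_done() {
    awk '
        # First file: collect the wanted line numbers, counting each one once.
        FNR == NR {
            if (!($1 in want)) { want[$1]; n++ }
            next
        }
        # Second file: print wanted lines and exit after the last one.
        FNR in want { print; if (--n == 0) exit }
    ' "$1" "$2"
}
```

The early exit helps most when the requested lines sit near the beginning of the file; if one of them is near the end, the whole file is still read.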
This one invokes awk many times and will be really slow if sentences.txt is a huge file.
– Weijun Zhou
Mar 14 at 23:59
W. Zhou is right. Thank you! So I'll think twice.
– Micha
Mar 15 at 0:18
1
So I edited my first awk approach; awk is now invoked only once. Maybe this is fast enough?
– Micha
Mar 15 at 1:05
Weijun Zhou's answer (sed): answered Mar 14 at 8:39, edited Mar 15 at 7:34
1
Done. Suggestions are welcome.
– Weijun Zhou
Mar 14 at 8:58
Do common versions ofsedsupport line numbers larger than 32 bits?
– Nate Eldredge
Mar 14 at 14:40
1
You are right. I made a mistake in last edits.
– Weijun Zhou
Mar 14 at 20:22
@NateEldredge I'm not sure about that.
– Weijun Zhou
Mar 15 at 0:01
add a comment |
I would use a perl script for this. I came up with this:
#!/usr/bin/perl
# usage: thisscript linenumberslist.txt contentsfile

# Read the wanted line numbers into a hash for fast lookup.
unless (open(IN, $ARGV[0])) {
    die "Can't open list of line numbers file '$ARGV[0]'\n";
}
my %linenumbers = ();
while (<IN>) {
    chomp;
    $linenumbers{$_} = 1;
}

# Scan the contents file and print the lines whose number is in the hash.
unless (open(IN, $ARGV[1])) {
    die "Can't open contents file '$ARGV[1]'\n";
}
$. = 0;
while (<IN>) {
    print if defined $linenumbers{$.};
}
exit;
This first reads the list of line numbers that we're interested in into an associative array (a hash), where the line numbers are the keys. chomp removes the newline at the end of the line; $_ is the line itself.
Next the data file is opened, and whenever the current line number is an existing key in the hash of line numbers, the line is printed.
$. is perl's line number counter; it increments for every line read. Because the same filehandle is reopened without being closed first, the counter carries over from the first file, so I reset it to zero before reading any lines of the data file.
This could probably be written much more in "perl" style, but I prefer to keep it a bit more readable.
If the list of lines you want to extract is very large, this may not be the most efficient way, but I find that perl is often amazingly efficient at these things.
If you require the lines to be extracted in the order that they are listed, i.e. not sequentially, then it becomes a lot more complicated...
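As a rough illustration of what extracting in listed order could look like, here is a sketch that swaps in awk instead of perl. It is not the script above: extract_in_listed_order is a hypothetical name, and the approach assumes the number of requested lines is small enough for the matched lines to be held in memory until the end.

```shell
# Hypothetical helper: extract_in_listed_order LINE_NUMBERS DATA
extract_in_listed_order() {
    awk '
        # First file: remember each requested number and its position in the list.
        FNR == NR { order[++n] = $0; want[$0]; next }
        # Second file: keep only the lines whose number was requested.
        FNR in want { line[FNR] = $0 }
        # Finally print the kept lines in the order the numbers were listed.
        END { for (i = 1; i <= n; i++) if (order[i] in line) print line[order[i]] }
    ' "$1" "$2"
}
```

The data file is still read only once and sequentially; only the final printing follows the order of the list.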
answered Mar 14 at 7:26
wurtel
Here are an alternative method and a bit of benchmarking, adding to that in Weijun Zhou's answer.
join
Assuming you have a data file you want to extract rows from and a line_numbers file that lists the numbers of the rows you want to extract, if the sorting order of the output is not important you can use:
join <(sort padded_line_numbers) <(nl -w 12 -n rz data) | cut -d ' ' -f 2-
This will number the lines of your data file, join it with the padded_line_numbers file on the first field (the default) and print out the common lines (excluding the join field itself, which is cut away).
join needs the input files to be sorted alphabetically. The aforementioned padded_line_numbers file has to be prepared by left-padding each line of your line_numbers file. E.g.:
while read rownum; do
    printf '%.12d\n' "$rownum"
done <line_numbers >padded_line_numbers
The -w 12 -n rz options instruct nl to output 12-digit numbers with leading zeros.
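To make the role of the padding concrete, the steps above can be put together on a tiny input. This is a sketch under assumptions: pad_line_numbers and join_extract are hypothetical names, a temporary file stands in for process substitution, and -b a is added to nl so that empty lines are numbered too (by default nl skips them).

```shell
# Zero-pad each line number read from stdin to 12 digits.
pad_line_numbers() {
    while read -r rownum; do
        printf '%.12d\n' "$rownum"
    done
}

# Hypothetical wrapper: join_extract LINE_NUMBERS DATA
# Runs in a subshell so the temporary file name does not leak.
join_extract() (
    padded=$(mktemp)
    pad_line_numbers <"$1" | sort >"$padded"
    # -b a numbers empty lines too (an assumption; nl skips them by default).
    nl -b a -w 12 -n rz "$2" | join "$padded" - | cut -d ' ' -f 2-
    rm -f "$padded"
)
```

Without padding, an alphabetical sort puts 10 and 100 before 9, which does not match the order in which nl numbers the data; with padding, alphabetical and numerical order coincide, which is what join relies on.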
If the sorting order of the output has to match that of your line_numbers file, you can use:
join -1 2 -2 1 <(nl padded_line_numbers | sort -k 2,2) \
    <(nl -w 12 -n rz data) |
    sort -k 2,2n |
    cut -d ' ' -f 3-
Here we number the padded_line_numbers file, sort the result alphabetically by its second field, join it with the numbered data file, and finally sort the result numerically by the original position in padded_line_numbers.
Process substitution is used here for convenience. If you cannot or do not want to rely on it and, as is likely, you do not want to spend storage on regular files for intermediate results, you can use named pipes:
mkfifo padded_line_numbers
mkfifo numbered_data
while read rownum; do
    printf '%.12d\n' "$rownum"
done <line_numbers | nl | sort -k 2,2 >padded_line_numbers &
nl -w 12 -n rz data >numbered_data &
join -1 2 -2 1 padded_line_numbers numbered_data | sort -k 2,2n | cut -d ' ' -f 3-
Benchmarking
Since the peculiarity of your question is the number of rows in your data file, I thought it could be useful to test alternative approaches with a comparable amount of data.
For my tests I used a data file of about 3.2 billion lines. Each line is just 2 bytes of garbage coming from openssl enc, hex-encoded using od -An -tx1 -w2 and with spaces removed with tr -d ' ':
$ head -n 3 data
c15d
061d
5787
$ wc -l data
3221254963 data
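A file in this format could be produced along the following lines. This is only a sketch and not the exact command used for the test: gen_data is a hypothetical name, /dev/urandom is swapped in as the source of random bytes instead of openssl enc, and -v is added so that od prints every line instead of replacing repeated lines with an asterisk. The -w option is specific to GNU od.

```shell
# Hypothetical helper: gen_data N
# Prints N lines, each holding 2 random bytes as 4 hex digits.
gen_data() {
    head -c "$(( $1 * 2 ))" /dev/urandom |   # 2 bytes per output line (assumed source)
        od -An -v -tx1 -w2 |                 # hex dump: no offsets, 2 bytes per line, no line collapsing
        tr -d ' '                            # remove the spaces od puts around the bytes
}
```

For example, gen_data 1000 >data would write a 1000-line sample file.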
The line_numbers file has been created by randomly choosing 10,000 numbers between 1 and 3,221,254,963, without repetitions, using shuf from GNU Coreutils:
shuf -i 1-"$(wc -l <data)" -n 10000 >line_numbers
The testing environment was a laptop with an Intel i7-2670QM quad-core processor, 16 GiB of memory, SSD storage, GNU/Linux, bash 5.0 and GNU tools.
The only dimension I measured was execution time, using the time shell builtin.
Here I'm considering:
- The sed solution from Weijun Zhou's answer.
- The awk solution from Micha's answer.
- The perl solution from wurtel's answer.
- The join solution above.
perl seems to be the fastest:
$ time perl_script line_numbers data | wc -l
10000
real 14m51.597s
user 14m41.878s
sys 0m9.299s
awk takes roughly twice as long, but stays in the same order of magnitude:
$ time awk 'FNR==NR {seen[$0]++} ; FNR!=NR && FNR in seen' line_numbers data | wc -l
10000
real 29m3.808s
user 28m52.616s
sys 0m10.709s
join performs about the same as awk:
$ time join <(sort padded_line_numbers) <(nl -w 12 -n rz data) | wc -l
10000
real 28m24.053s
user 27m52.857s
sys 0m28.958s
Note that the sorted version mentioned above has almost no performance penalty over this one.
Finally, sed appears to be significantly slower. I killed it after approximately nine hours:
$ time sed -nf <(sed 's/$/p/' line_numbers) data | wc -l
^C
real 551m12.747s
user 550m53.390s
sys 0m15.624s
edited 1 hour ago
answered 10 hours ago
fra-san
micha@linux-micha: /tmp
$ cat numbers.txt
1
2
4
5
micha@linux-micha: /tmp
$ cat sentences.txt
alpha
bravo
charlie
delta
echo
foxtrott
micha@linux-micha: /tmp
$ awk 'FNR==NR {seen[$0]++} ; FNR!=NR && FNR in seen' numbers.txt sentences.txt
alpha
bravo
delta
echo
This one invokes awk many times and will be really slow if sentences.txt is a huge file.
– Weijun Zhou
Mar 14 at 23:59
W. Zhou is right. Thank you! So I'll think twice.
– Micha
Mar 15 at 0:18
So, I edited my first awk approach. Now awk is invoked only once. Maybe this is fast enough?
– Micha
Mar 15 at 1:05
edited Mar 15 at 1:03
answered Mar 14 at 23:55
Micha
If you expect to have to do this more than once, consider putting the lines into an SQL database or something of the sort.
– Nate Eldredge
Mar 14 at 3:36
Are they sequential 10000 lines or sporadic throughout the log file? If sequential and there's some unique pattern at the beginning you could just use grep -A 10000 <pattern> <filename>.
– kevlinux
Mar 14 at 3:48
Are line numbers in line number file sorted?
– JohnKoch
Mar 14 at 7:54
Are the lines expected to be extracted in the order of the line numbers in the smaller file?
– Kusalananda
Mar 14 at 8:03
This might help: stackoverflow.com/questions/6022384/…
– kevlinux
Mar 15 at 6:08