Fast way to extract lines from a large file with 80 billion lines


I have a large file with 80 billion lines. I want to extract a few of those lines (around 10,000) whose line numbers I know. What is the fastest way to do this? Your help is really appreciated.

Is it possible to extract those lines using another file that contains the line numbers? The line numbers in that file would not always be consecutive.

For example, the original file is:

0.1
0.2
0.3
0.4
...

the line number file:

1
3
4

the output:

0.1
0.3
0.4

linux large-files

edited Mar 14 at 8:38

Kusalananda

136k17257425

asked Mar 14 at 3:28

user2842390

183

If you expect to have to do this more than once, consider putting the lines into an SQL database or something of the sort.

– Nate Eldredge
Mar 14 at 3:36

Are they sequential 10000 lines or sporadic throughout the log file? If sequential and there's some unique pattern at the beginning you could just use grep -A 10000 <pattern> <filename>.

– kevlinux
Mar 14 at 3:48

2

Are line numbers in line number file sorted?

– JohnKoch
Mar 14 at 7:54

2

Are the lines expected to be extracted in the order of the line numbers in the smaller file?

– Kusalananda
Mar 14 at 8:03

1

This might help: stackoverflow.com/questions/6022384/…

– kevlinux
Mar 15 at 6:08


4 Answers

One liner, using sed:

sed -nf <(sed 's/$/p/' linenumberfile) contentfile

To keep the original order in linenumberfile, you can do

sed -nf <(sed 's/$/p/' linenumberfile) contentfile | paste <(nl linenumberfile | sort -n -k 2,2) - | sort -n -k 1,1 | cut -f 3-

Explanation:

sed 's/$/p/' linenumberfile

generates a sed script that prints each of the specified lines (a line number N becomes the command Np). The script is then fed to a second sed (with -n to suppress the default printing of the pattern space), which does the actual printing. Since sed processes the content file line by line, the output is in the same order as in the content file. Note that this is a one-pass process, so I would expect the speed to be acceptable.
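As a small illustration of the two steps, here is a sketch that runs them on the sample data from the question. The temporary directory and the script.sed file name are only for this demonstration; they are not part of the one-liner above.

```shell
# Tiny demonstration files in a throwaway directory (hypothetical location)
demo_dir=$(mktemp -d)
printf '%s\n' 0.1 0.2 0.3 0.4 > "$demo_dir/contentfile"
printf '%s\n' 1 3 4 > "$demo_dir/linenumberfile"

# Step 1: turn every line number N into the sed command "Np"
sed 's/$/p/' "$demo_dir/linenumberfile" > "$demo_dir/script.sed"
cat "$demo_dir/script.sed"

# Step 2: run the generated script; -n suppresses the default printing
extracted=$(sed -n -f "$demo_dir/script.sed" "$demo_dir/contentfile")
printf '%s\n' "$extracted"

rm -r "$demo_dir"
```

Writing the generated script to a file instead of using process substitution makes it easy to inspect, at the cost of one small temporary file.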

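To accelerate the process, one can change p to {p;b}, so that a matched line skips the remaining tests, and use {p;q} instead for the largest line number, so that sed stops reading once the last wanted line has been printed. (A bare q at the end of the script would not work: sed would quit at the first line that is not in the list.)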
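One way to generate such an early-exit script is sketched below. This is my reading of the speed-up, not necessarily the exact script used for the timings; make_sed_script is a hypothetical helper name, and the commands are written one per line because I am not sure every sed accepts b followed directly by a closing brace.

```shell
# make_sed_script LIST
# Print a sed script that prints the lines whose numbers are listed in LIST
# (one number per line) and quits once the largest of them has been printed.
make_sed_script() {
    max=$(sort -n "$1" | tail -n 1)
    # The largest line number comes first: print the line, then quit.
    printf '%s{\np\nq\n}\n' "$max"
    # Every other requested line: print it, then branch to the end of the
    # script so that the remaining tests are skipped for this input line.
    while read -r n; do
        printf '%s{\np\nb\n}\n' "$n"
    done < "$1"
}
```

It would be used as `make_sed_script linenumberfile > script.sed` followed by `sed -n -f script.sed contentfile`. The early exit only helps when the largest wanted line is well before the end of the file.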

To retain the order of the lines as they are in the line number file, nl is used to add "line numbers" to the line number file. So a line number file

4
5
2

would become

1 4
2 5
3 2

The first column records the original order in the line number file.

The file with "line numbers" is then sorted and pasted to the output of sed, to make

3 2 content_of_line2
1 4 content_of_line4
2 5 content_of_line5

then it is sorted using the 1st column as the key, to finally obtain

1 4 content_of_line4
2 5 content_of_line5
3 2 content_of_line2

Finally, cut is used to remove the 2 extra columns.
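The reordering steps can be traced end to end with a small wrapper. This is only a sketch: extract_in_list_order is a hypothetical name, temporary files stand in for the process substitutions so that it also runs in a plain POSIX shell, and it assumes the line number file has no duplicates and no empty lines.

```shell
# extract_in_list_order LIST CONTENT
# Print the lines of CONTENT whose numbers are in LIST, in the order of LIST.
extract_in_list_order() {
    list=$1
    content=$2
    tmp=$(mktemp -d)

    # sed script that prints the wanted lines (in content-file order)
    sed 's/$/p/' "$list" > "$tmp/script.sed"

    # "original position <TAB> line number", sorted by line number so that
    # it lines up with the output of sed
    nl "$list" | sort -n -k 2,2 > "$tmp/order"

    sed -n -f "$tmp/script.sed" "$content" |
        paste "$tmp/order" - |   # position, line number, content
        sort -n -k 1,1 |         # back to the order of the list
        cut -f 3-                # drop the two helper columns

    rm -r "$tmp"
}
```

For example, `extract_in_list_order linenumberfile contentfile` with a list of 4, 5, 2 prints line 4, then line 5, then line 2.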

Benchmarking

It seems sed would do best for a few lines, but perl is the way to go for 10000 lines as specified in the question.

$ cat /proc/cpuinfo | grep -A 4 -m 1 processor
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Core(TM) i5-4590 CPU @ 3.30GHz

$ wc -l linenumber
10 linenumber

$ wc -l content
8982457 content

$ file content
content: ASCII text

$ time bash -c "sed -nf <(sed 's/$/p/' linenumber) content > /dev/null" 
real 0m0.791s
user 0m0.661s
sys 0m0.133s

$ time bash -c "awk 'FNR==NR { seen[\$0]++ }; FNR!=NR && FNR in seen' linenumber content > /dev/null"
real 0m3.061s
user 0m2.908s
sys 0m0.152s

$ time bash -c "./ln.pl linenumber content > /dev/null"
real 0m1.706s
user 0m1.582s
sys 0m0.124s

$ ./genlinenumber.py 100 > linenumber
$ wc -l linenumber
100 linenumber

$ time bash -c "sed -nf <(sed 's/$/p/' linenumber) content > /dev/null"
real 0m3.326s
user 0m3.164s
sys 0m0.164s

$ time bash -c "awk 'FNR==NR { seen[\$0]++ }; FNR!=NR && FNR in seen' linenumber content > /dev/null"
real 0m3.055s
user 0m2.890s
sys 0m0.164s

$ time bash -c "./ln.pl linenumber content > /dev/null"
real 0m1.769s
user 0m1.604s
sys 0m0.165s

If the order of the lines has to be retained, the part of the pipeline after the first | can still be used, since its running time is negligible.

$ ./genlinenumber.py 10000 > linenumber
$ wc -l linenumber
10000 linenumber

$ time bash -c "./ln.pl linenumber content > extract"
real 0m1.933s
user 0m1.791s
sys 0m0.141s

$ time bash -c "paste <(nl linenumber | sort -n -k 2,2) extract | sort -n -k 1,1 | cut -f 3- > /dev/null"
real 0m0.018s
user 0m0.012s
sys 0m0.005s

edited Mar 15 at 7:34

answered Mar 14 at 8:39

Weijun Zhou

1,575325

1

Done. Suggestions are welcome.

– Weijun Zhou
Mar 14 at 8:58

Do common versions of sed support line numbers larger than 32 bits?

– Nate Eldredge
Mar 14 at 14:40

1

You are right. I made a mistake in last edits.

– Weijun Zhou
Mar 14 at 20:22

@NateEldredge I'm not sure about that.

– Weijun Zhou
Mar 15 at 0:01


I would use a perl script for this. I came up with this:

#!/usr/bin/perl

# usage: thisscript linenumberslist.txt contentsfile

unless (open(IN, $ARGV[0])) {
    die "Can't open list of line numbers file '$ARGV[0]'\n";
}
# remember every wanted line number as a hash key
my %linenumbers = ();
while (<IN>) {
    chomp;
    $linenumbers{$_} = 1;
}

unless (open(IN, $ARGV[1])) {
    die "Can't open contents file '$ARGV[1]'\n";
}
# restart the line counter for the contents file
$. = 0;
while (<IN>) {
    print if defined $linenumbers{$.};
}

exit;

This first reads the list of line numbers that we're interested in into an associative array, where the line numbers are the key. chomp removes the newline at the end of the line, $_ is the line itself.

Next the data file is opened, and when the line number is an existing key in the array of line numbers, then the line is printed.

$. is perl's line number counter; it increments for every line read. As it keeps counting across files here, I reset it to zero before reading any lines of the data file.

This could probably be written much more in "perl" style, but I prefer to keep it a bit more readable.

If the list of lines you want to extract is very large, this may not be the most efficient way, but I find that perl is often amazingly efficient at these things.

If you require the lines to be extracted in the order that they are listed, i.e. not sequentially, then it becomes a lot more complicated...
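One possible way to get the lines in the order of the list, sketched here with awk wrapped in a shell function rather than in perl, is to keep only the wanted lines in memory and print them at the end. The function name extract_ordered_awk is hypothetical, the sketch assumes a non-empty list, and I have not verified how every awk implementation formats line numbers beyond 2^31 when they are used as array keys, so that is something to check at the 80-billion-line scale.

```shell
# extract_ordered_awk LIST CONTENT
# Print the lines of CONTENT whose numbers are in LIST, in the order of LIST.
extract_ordered_awk() {
    awk '
        # first file: remember the wanted numbers and their order
        FNR == NR { order[++n] = $0; want[$0]; next }
        # second file: keep only the wanted lines
        FNR in want { line[FNR] = $0 }
        # afterwards: print the kept lines in list order
        END {
            for (i = 1; i <= n; i++)
                if (order[i] in line)
                    print line[order[i]]
        }
    ' "$1" "$2"
}
```

With around 10,000 wanted lines the memory cost is small; the content file is still read exactly once.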

answered Mar 14 at 7:26

wurtel

11k11628


Here are an alternative method and a bit of benchmarking, adding to that in Weijun Zhou's answer.

`join`

Assuming you have a data file you want to extract rows from and a line_numbers file that lists the numbers of the rows you want to extract, if the sorting order of the output is not important you can use:

join <(sort padded_line_numbers) <(nl -w 12 -n rz data) | cut -d ' ' -f 2-

This will number the lines of your data file, join it with the padded_line_numbers file on the first field (the default) and print out the common lines (excluding the join field itself, that is cut away).

join needs the input files to be sorted alphabetically. The aforementioned padded_line_numbers file has to be prepared by left-padding each line of your line_numbers file. E.g.:

while read -r rownum; do
    printf '%.12d\n' "$rownum"
done <line_numbers >padded_line_numbers

The -w 12 -n rz options and arguments instruct nl to output 12 digits long numbers with leading zeros.
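Putting the padding, the numbering and the join together, a minimal sketch could look like the following. extract_with_join is a hypothetical name, the numbered data is piped into join on standard input instead of going through process substitution, and -b a is added because nl by default numbers only non-empty lines, which would shift the numbering if the data contained empty lines.

```shell
# extract_with_join LIST DATA
# Print the lines of DATA whose numbers are in LIST, in DATA order.
# Assumes line numbers without leading zeros and at most 12 digits.
extract_with_join() {
    list=$1
    data=$2
    padded=$(mktemp)

    # left-pad to 12 digits and sort lexically, as join expects
    while read -r n; do
        printf '%.12d\n' "$n"
    done < "$list" | sort > "$padded"

    # number every line of the data, join on the number, drop the number
    nl -b a -w 12 -n rz "$data" | join "$padded" - | cut -d ' ' -f 2-

    rm -f "$padded"
}
```

Note that join rewrites the separators between fields, so runs of blanks inside a data line come out as single spaces; for data without embedded blanks, as in the tests below, this does not matter.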

If the sorting order of the output has to match that of your line_numbers file, you can use:

join -1 2 -2 1 <(nl padded_line_numbers | sort -k 2,2) \
    <(nl -w 12 -n rz data) |
    sort -k 2,2n |
    cut -d ' ' -f 3-

Where we are numbering the padded_line_numbers file, sorting the result alphabetically by its second field, joining it with the numbered data file and numerically sorting the result by the original sorting order of padded_line_numbers.

Process substitution is used here for convenience. If you cannot or do not want to rely on it and, as is likely, you do not want to spend storage on regular files holding intermediate results, you can use named pipes:

mkfifo padded_line_numbers
mkfifo numbered_data

while read -r rownum; do
    printf '%.12d\n' "$rownum"
done <line_numbers | nl | sort -k 2,2 >padded_line_numbers &

nl -w 12 -n rz data >numbered_data &

join -1 2 -2 1 padded_line_numbers numbered_data | sort -k 2,2n | cut -d ' ' -f 3-

Benchmarking

Since the peculiarity of your question is the number of rows in your data file, I thought it could be useful to test alternative approaches with a comparable amount of data.

For my tests I used a 3.2 billion lines data file. Each line is just 2 bytes of garbage coming from openssl enc, hex-encoded using od -An -tx1 -w2 and with spaces removed with tr -d ' ':

$ head -n 3 data
c15d
061d
5787

$ wc -l data
3221254963 data
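For anyone who wants to reproduce a smaller version of this data set, here is a rough sketch. It reads /dev/urandom instead of the openssl enc stream described above, gen_hex_lines is a hypothetical name, and -v is added so that od does not collapse repeated lines into a single "*".

```shell
# gen_hex_lines N
# Print N lines, each holding 2 random bytes as 4 hex digits.
gen_hex_lines() {
    head -c "$(( $1 * 2 ))" /dev/urandom |
        od -An -v -tx1 -w2 |   # 2 bytes per output line, no offsets
        tr -d ' '              # remove the spaces od puts between bytes
}
```

For example, `gen_hex_lines 1000000 >data` creates a one-million-line file in the same format.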

The line_numbers file has been created by randomly choosing 10,000 numbers between 1 and 3,221,254,963, without repetitions, using shuf from GNU Coreutils:

shuf -i 1-"$(wc -l <data)" -n 10000 >line_numbers

The testing environment was a laptop with an Intel i7-2670QM quad-core processor, 16 GiB of memory, SSD storage, GNU/Linux, bash 5.0 and GNU tools.

The only thing I measured was the execution time, by means of the time shell builtin.

Here I'm considering:

The sed solution from Weijun Zhou's answer.

The awk solution from Micha's answer.

The perl solution from wurtel's answer.

The join solution above.

perl seems to be the fastest:

$ time perl_script line_numbers data | wc -l
10000

real 14m51.597s
user 14m41.878s
sys 0m9.299s

awk takes roughly twice as long:

$ time awk 'FNR==NR { seen[$0]++ }; FNR!=NR && FNR in seen' line_numbers data | wc -l
10000

real 29m3.808s
user 28m52.616s
sys 0m10.709s

join performs about the same as awk:

$ time join <(sort padded_line_numbers) <(nl -w 12 -n rz data) | wc -l
10000

real 28m24.053s
user 27m52.857s
sys 0m28.958s

Note that the version that preserves the order of line_numbers, mentioned above, has almost no performance penalty over this one.

Finally, sed appears to be significantly slower: I killed it after approximately nine hours:

$ time sed -nf <(sed 's/$/p/' line_numbers) data | wc -l
^C

real 551m12.747s
user 550m53.390s
sys 0m15.624s

edited 1 hour ago

answered 10 hours ago

fra-san

1,8611520


micha@linux-micha: /tmp
$ cat numbers.txt
1
2
4
5

micha@linux-micha: /tmp
$ cat sentences.txt
alpha
bravo
charlie
delta
echo
foxtrott

micha@linux-micha: /tmp
$ awk 'FNR==NR { seen[$0]++ }; FNR!=NR && FNR in seen' numbers.txt sentences.txt
alpha
bravo
delta
echo

edited Mar 15 at 1:03

answered Mar 14 at 23:55

Micha

973

This one invokes awk many times and will be really slow if sentences.txt is a huge file.

– Weijun Zhou
Mar 14 at 23:59

W. Zhou is right. Thank you! So I'll think twice.

– Micha
Mar 15 at 0:18

1

So, edited my 1st awk approach. Now awk is invoked 1x. May be this is fast enough?

– Micha
Mar 15 at 1:05


4 Answers
4

active

oldest

votes

4 Answers
4

active

oldest

votes

One liner, using sed:

sed -nf <(sed 's/$/p/' linenumberfile) contentfile

To keep the original order in linenumberfile, you can do

sed -nf <(sed 's/$/p/' linenumberfile) contentfile | paste <(nl linenumberfile | sort -n -k 2,2) - | sort -n -k 1,1 | cut -f 3-

Explanation:

sed 's/$/p/' linenumberfile

To accelerate the process, one can change p to p;b and add a q at the end of the generated sed script.

To retain the order of the lines as they are in the line number file, nl is used to add "line numbers" to the line number file. So a line number file

4
5
2

would become

1 4
2 5
3 2

The first column records the original order in the line number file.

The file with "line numbers" is then sorted and pasted to the output of sed, to make

3 2 content_of_line2
1 4 content_of_line4
2 5 content_of_line5

then it is sorted using the 1st column as the key, to finally obtain

1 4 content_of_line4
2 5 content_of_line5
3 2 content_of_line2

Finally, cut is used to remove the 2 extra columns.

Benchmarking

It seems sed would do best for a few lines, but perl is the way to go for 10000 lines as specified in the question.

$ cat /proc/cpuinfo | grep -A 4 -m 1 processor
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Core(TM) i5-4590 CPU @ 3.30GHz

$ wc -l linenumber
10 linenumber

$ wc -l content
8982457 content

$ file content
content: ASCII text

$ time bash -c "sed -nf <(sed 's/$/p/' linenumber) content > /dev/null" 
real 0m0.791s
user 0m0.661s
sys 0m0.133s

$ time bash -c "awk 'FNR==NR seen[$0]++ ; FNR!=NR && FNR in seen' linenumber content > /dev/null"
real 0m3.061s
user 0m2.908s
sys 0m0.152s

$ time bash -c "./ln.pl linenumber content > /dev/null"
real 0m1.706s
user 0m1.582s
sys 0m0.124s

$ ./genlinenumber.py 100 > linenumber
$ wc -l linenumber
100 linenumber

$ time bash -c "sed -nf <(sed 's/$/p/' linenumber) content > /dev/null"
real 0m3.326s
user 0m3.164s
sys 0m0.164s

$ time bash -c "awk 'FNR==NR seen[$0]++ ; FNR!=NR && FNR in seen' linenumber content > /dev/null"
real 0m3.055s
user 0m2.890s
sys 0m0.164s

$ time bash -c "./ln.pl linenumber content > /dev/null"
real 0m1.769s
user 0m1.604s
sys 0m0.165s

If it is required to retain the order of lines, the command after the first | can still be used since the time is negligible.

$ ./genlinenumber.py 10000 > linenumber
$ wc -l linenumber
10000 linenumber

$ time bash -c "./ln.pl linenumber content > extract"
real 0m1.933s
user 0m1.791s
sys 0m0.141s

$ time bash -c "paste <(nl linenumber | sort -n -k 2,2) extract | sort -n -k 1,1 | cut -f 3- > /dev/null"
real 0m0.018s
user 0m0.012s
sys 0m0.005s

edited Mar 15 at 7:34

answered Mar 14 at 8:39

Weijun Zhou

1,575325

1

Done. Suggestions are welcome.

– Weijun Zhou
Mar 14 at 8:58

Do common versions of sed support line numbers larger than 32 bits?

– Nate Eldredge
Mar 14 at 14:40

1

You are right. I made a mistake in last edits.

– Weijun Zhou
Mar 14 at 20:22

@NateEldredge I'm not sure about that.

– Weijun Zhou
Mar 15 at 0:01

add a comment |

One liner, using sed:

sed -nf <(sed 's/$/p/' linenumberfile) contentfile

To keep the original order in linenumberfile, you can do

sed -nf <(sed 's/$/p/' linenumberfile) contentfile | paste <(nl linenumberfile | sort -n -k 2,2) - | sort -n -k 1,1 | cut -f 3-

Explanation:

sed 's/$/p/' linenumberfile

To accelerate the process, one can change p to p;b and add a q at the end of the generated sed script.

To retain the order of the lines as they are in the line number file, nl is used to add "line numbers" to the line number file. So a line number file

4
5
2

would become

1 4
2 5
3 2

The first column records the original order in the line number file.

The file with "line numbers" is then sorted and pasted to the output of sed, to make

3 2 content_of_line2
1 4 content_of_line4
2 5 content_of_line5

then it is sorted using the 1st column as the key, to finally obtain

1 4 content_of_line4
2 5 content_of_line5
3 2 content_of_line2

Finally, cut is used to remove the 2 extra columns.

Benchmarking

It seems sed would do best for a few lines, but perl is the way to go for 10000 lines as specified in the question.

$ cat /proc/cpuinfo | grep -A 4 -m 1 processor
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Core(TM) i5-4590 CPU @ 3.30GHz

$ wc -l linenumber
10 linenumber

$ wc -l content
8982457 content

$ file content
content: ASCII text

$ time bash -c "sed -nf <(sed 's/$/p/' linenumber) content > /dev/null" 
real 0m0.791s
user 0m0.661s
sys 0m0.133s

$ time bash -c "awk 'FNR==NR seen[$0]++ ; FNR!=NR && FNR in seen' linenumber content > /dev/null"
real 0m3.061s
user 0m2.908s
sys 0m0.152s

$ time bash -c "./ln.pl linenumber content > /dev/null"
real 0m1.706s
user 0m1.582s
sys 0m0.124s

$ ./genlinenumber.py 100 > linenumber
$ wc -l linenumber
100 linenumber

$ time bash -c "sed -nf <(sed 's/$/p/' linenumber) content > /dev/null"
real 0m3.326s
user 0m3.164s
sys 0m0.164s

$ time bash -c "awk 'FNR==NR seen[$0]++ ; FNR!=NR && FNR in seen' linenumber content > /dev/null"
real 0m3.055s
user 0m2.890s
sys 0m0.164s

$ time bash -c "./ln.pl linenumber content > /dev/null"
real 0m1.769s
user 0m1.604s
sys 0m0.165s

If it is required to retain the order of lines, the command after the first | can still be used since the time is negligible.

$ ./genlinenumber.py 10000 > linenumber
$ wc -l linenumber
10000 linenumber

$ time bash -c "./ln.pl linenumber content > extract"
real 0m1.933s
user 0m1.791s
sys 0m0.141s

$ time bash -c "paste <(nl linenumber | sort -n -k 2,2) extract | sort -n -k 1,1 | cut -f 3- > /dev/null"
real 0m0.018s
user 0m0.012s
sys 0m0.005s

edited Mar 15 at 7:34

answered Mar 14 at 8:39

Weijun Zhou

1,575325

1

Done. Suggestions are welcome.

– Weijun Zhou
Mar 14 at 8:58

Do common versions of sed support line numbers larger than 32 bits?

– Nate Eldredge
Mar 14 at 14:40

1

You are right. I made a mistake in last edits.

– Weijun Zhou
Mar 14 at 20:22

@NateEldredge I'm not sure about that.

– Weijun Zhou
Mar 15 at 0:01

add a comment |

One liner, using sed:

sed -nf <(sed 's/$/p/' linenumberfile) contentfile

To keep the original order in linenumberfile, you can do

sed -nf <(sed 's/$/p/' linenumberfile) contentfile | paste <(nl linenumberfile | sort -n -k 2,2) - | sort -n -k 1,1 | cut -f 3-

Explanation:

sed 's/$/p/' linenumberfile

To accelerate the process, one can change p to p;b and add a q at the end of the generated sed script.

To retain the order of the lines as they are in the line number file, nl is used to add "line numbers" to the line number file. So a line number file

4
5
2

would become

1 4
2 5
3 2

The first column records the original order in the line number file.

The file with "line numbers" is then sorted and pasted to the output of sed, to make

3 2 content_of_line2
1 4 content_of_line4
2 5 content_of_line5

then it is sorted using the 1st column as the key, to finally obtain

1 4 content_of_line4
2 5 content_of_line5
3 2 content_of_line2

Finally, cut is used to remove the 2 extra columns.

Benchmarking

It seems sed would do best for a few lines, but perl is the way to go for 10000 lines as specified in the question.

$ cat /proc/cpuinfo | grep -A 4 -m 1 processor
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Core(TM) i5-4590 CPU @ 3.30GHz

$ wc -l linenumber
10 linenumber

$ wc -l content
8982457 content

$ file content
content: ASCII text

$ time bash -c "sed -nf <(sed 's/$/p/' linenumber) content > /dev/null" 
real 0m0.791s
user 0m0.661s
sys 0m0.133s

$ time bash -c "awk 'FNR==NR seen[$0]++ ; FNR!=NR && FNR in seen' linenumber content > /dev/null"
real 0m3.061s
user 0m2.908s
sys 0m0.152s

$ time bash -c "./ln.pl linenumber content > /dev/null"
real 0m1.706s
user 0m1.582s
sys 0m0.124s

$ ./genlinenumber.py 100 > linenumber
$ wc -l linenumber
100 linenumber

$ time bash -c "sed -nf <(sed 's/$/p/' linenumber) content > /dev/null"
real 0m3.326s
user 0m3.164s
sys 0m0.164s

$ time bash -c "awk 'FNR==NR seen[$0]++ ; FNR!=NR && FNR in seen' linenumber content > /dev/null"
real 0m3.055s
user 0m2.890s
sys 0m0.164s

$ time bash -c "./ln.pl linenumber content > /dev/null"
real 0m1.769s
user 0m1.604s
sys 0m0.165s

If it is required to retain the order of lines, the command after the first | can still be used since the time is negligible.

$ ./genlinenumber.py 10000 > linenumber
$ wc -l linenumber
10000 linenumber

$ time bash -c "./ln.pl linenumber content > extract"
real 0m1.933s
user 0m1.791s
sys 0m0.141s

$ time bash -c "paste <(nl linenumber | sort -n -k 2,2) extract | sort -n -k 1,1 | cut -f 3- > /dev/null"
real 0m0.018s
user 0m0.012s
sys 0m0.005s

edited Mar 15 at 7:34

answered Mar 14 at 8:39

Weijun Zhou

1,575325

One liner, using sed:

sed -nf <(sed 's/$/p/' linenumberfile) contentfile

To keep the original order in linenumberfile, you can do

sed -nf <(sed 's/$/p/' linenumberfile) contentfile | paste <(nl linenumberfile | sort -n -k 2,2) - | sort -n -k 1,1 | cut -f 3-

Explanation:

sed 's/$/p/' linenumberfile

To accelerate the process, one can change p to p;b and add a q at the end of the generated sed script.

To retain the order of the lines as they are in the line number file, nl is used to add "line numbers" to the line number file. So a line number file

4
5
2

would become

1 4
2 5
3 2

The first column records the original order in the line number file.

The file with "line numbers" is then sorted and pasted to the output of sed, to make

3 2 content_of_line2
1 4 content_of_line4
2 5 content_of_line5

then it is sorted using the 1st column as the key, to finally obtain

1 4 content_of_line4
2 5 content_of_line5
3 2 content_of_line2

Finally, cut is used to remove the 2 extra columns.

Benchmarking

It seems sed would do best for a few lines, but perl is the way to go for 10000 lines as specified in the question.

$ cat /proc/cpuinfo | grep -A 4 -m 1 processor
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Core(TM) i5-4590 CPU @ 3.30GHz

$ wc -l linenumber
10 linenumber

$ wc -l content
8982457 content

$ file content
content: ASCII text

$ time bash -c "sed -nf <(sed 's/$/p/' linenumber) content > /dev/null" 
real 0m0.791s
user 0m0.661s
sys 0m0.133s

$ time bash -c "awk 'FNR==NR seen[$0]++ ; FNR!=NR && FNR in seen' linenumber content > /dev/null"
real 0m3.061s
user 0m2.908s
sys 0m0.152s

$ time bash -c "./ln.pl linenumber content > /dev/null"
real 0m1.706s
user 0m1.582s
sys 0m0.124s

$ ./genlinenumber.py 100 > linenumber
$ wc -l linenumber
100 linenumber

$ time bash -c "sed -nf <(sed 's/$/p/' linenumber) content > /dev/null"
real 0m3.326s
user 0m3.164s
sys 0m0.164s

$ time bash -c "awk 'FNR==NR seen[$0]++ ; FNR!=NR && FNR in seen' linenumber content > /dev/null"
real 0m3.055s
user 0m2.890s
sys 0m0.164s

$ time bash -c "./ln.pl linenumber content > /dev/null"
real 0m1.769s
user 0m1.604s
sys 0m0.165s

If it is required to retain the order of lines, the command after the first | can still be used since the time is negligible.

$ ./genlinenumber.py 10000 > linenumber
$ wc -l linenumber
10000 linenumber

$ time bash -c "./ln.pl linenumber content > extract"
real 0m1.933s
user 0m1.791s
sys 0m0.141s

$ time bash -c "paste <(nl linenumber | sort -n -k 2,2) extract | sort -n -k 1,1 | cut -f 3- > /dev/null"
real 0m0.018s
user 0m0.012s
sys 0m0.005s

edited Mar 15 at 7:34

answered Mar 14 at 8:39

Weijun Zhou

1,575325

edited Mar 15 at 7:34

answered Mar 14 at 8:39

Weijun Zhou

1,575325

answered Mar 14 at 8:39

Weijun Zhou

1,575325

answered Mar 14 at 8:39

Weijun Zhou

1,575325

1

Done. Suggestions are welcome.

– Weijun Zhou
Mar 14 at 8:58

Do common versions of sed support line numbers larger than 32 bits?

– Nate Eldredge
Mar 14 at 14:40

1

You are right. I made a mistake in last edits.

– Weijun Zhou
Mar 14 at 20:22

@NateEldredge I'm not sure about that.

– Weijun Zhou
Mar 15 at 0:01


I would use a perl script for this. I came up with this:

#!/usr/bin/perl

# usage: thisscript linenumberslist.txt contentsfile

unless (open(IN, $ARGV[0])) {
    die "Can't open list of line numbers file '$ARGV[0]'\n";
}
# Store every requested line number as a hash key.
my %linenumbers = ();
while (<IN>) {
    chomp;
    $linenumbers{$_} = 1;
}

unless (open(IN, $ARGV[1])) {
    die "Can't open contents file '$ARGV[1]'\n";
}
# Reset the line counter, which would otherwise continue from the first file.
$. = 0;
while (<IN>) {
    print if defined $linenumbers{$.};
}

exit;

The script first reads the list of line numbers into a hash, using each line number as a key. Next, the data file is opened, and each line whose number is an existing key in that hash is printed.

$. is perl's line number counter; it increments for every line read. As it is counted across files, I reset it to zero before reading any lines of the data file.

This could probably be written much more in "perl" style, but I prefer to keep it a bit more readable.

If the list of lines you want to extract is very large, this may not be the most efficient way, but I find that perl is often amazingly efficient at these things.

If you require the lines to be extracted in the order that they are listed, i.e. not sequentially, then it becomes a lot more complicated...

answered Mar 14 at 7:26

wurtel


Here is an alternative method and a bit of benchmarking, adding to that in Weijun Zhou's answer.

`join`

join <(sort padded_line_numbers) <(nl -w 12 -n rz data) | cut -d ' ' -f 2-

join needs its input files to be sorted lexicographically on the join field. The padded_line_numbers file used above has to be prepared by left-padding each line of your line_numbers file with zeros. E.g.:

while read rownum; do
    printf '%.12d\n' "$rownum"
done <line_numbers >padded_line_numbers

The -w 12 -n rz options and arguments instruct nl to output 12-digit numbers with leading zeros.
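
To see why the padding matters, here is a minimal sketch comparing the sort order of padded and unpadded numbers. The helper name pad12 is hypothetical; it wraps the same printf loop so that it can read from a pipe.

```shell
# pad12: read numbers on stdin, write them left-padded to 12 digits.
pad12() {
    while read -r rownum; do
        printf '%.12d\n' "$rownum"
    done
}

# Unpadded, "10" sorts before "9" because sort compares character by character.
printf '9\n10\n' | sort
# Padded, string order and numeric order agree, which is what join relies on.
printf '9\n10\n' | pad12 | sort
```

Twelve digits are enough for line numbers up to 999,999,999,999, which covers a file with 80 billion lines.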

If the sorting order of the output has to match that of your line_numbers file, you can use:

join -1 2 -2 1 <(nl padded_line_numbers | sort -k 2,2) \
 <(nl -w 12 -n rz data) |
 sort -k 2,2n |
 cut -d ' ' -f 3-

With named pipes instead of process substitution, the same pipeline becomes:

mkfifo padded_line_numbers
mkfifo numbered_data

while read rownum; do
    printf '%.12d\n' "$rownum"
done <line_numbers | nl | sort -k 2,2 >padded_line_numbers &

nl -w 12 -n rz data >numbered_data &

join -1 2 -2 1 padded_line_numbers numbered_data | sort -k 2,2n | cut -d ' ' -f 3-
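
As a worked example, the order-preserving variant can be wrapped in a function and tried on a tiny file. This is a sketch under a few assumptions: the name join_extract_in_order is hypothetical, temporary files replace the process substitutions and named pipes so that it runs in a plain POSIX shell, and -b a is added to nl so that empty data lines are numbered too (by default nl skips them, which would shift the numbering).

```shell
# usage: join_extract_in_order LINE_NUMBERS DATA   (hypothetical helper name)
join_extract_in_order() {
    dir=$(mktemp -d) || return 1
    # Pad each requested number, tag it with its position in the list,
    # and sort on the padded number (field 2), as join requires.
    while read -r rownum; do
        printf '%.12d\n' "$rownum"
    done < "$1" | nl | sort -k 2,2 > "$dir/wanted"
    # Number every data line with 12 zero-padded digits.
    nl -b a -w 12 -n rz "$2" > "$dir/numbered"
    # Join on the padded number, restore the requested order (field 2),
    # then drop the padded number and the position.
    join -1 2 -2 1 "$dir/wanted" "$dir/numbered" | sort -k 2,2n | cut -d ' ' -f 3-
    rm -rf "$dir"
}
```

For a line_numbers file containing 5, 1 and 3 and a data file with the lines a to e, this prints e, a and c, in that order.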

Benchmarking

Since the peculiarity of your question is the number of rows in your data file, I thought it could be useful to test alternative approaches with a comparable amount of data.

For my tests I used a data file of about 3.2 billion lines. Each line is just 2 bytes of garbage coming from openssl enc, hex-encoded using od -An -tx1 -w2, with spaces removed by tr -d ' ':

$ head -n 3 data
c15d
061d
5787

$ wc -l data
3221254963 data

The line_numbers file has been created by randomly choosing 10,000 numbers between 1 and 3,221,254,963, without repetitions, using shuf from GNU Coreutils:

shuf -i 1-"$(wc -l <data)" -n 10000 >line_numbers

Here I'm considering:

The sed solution from Weijun Zhou's answer.

The awk solution from Micha's answer.

The perl solution from wurtel's answer.

The join solution above.

perl seems to be the fastest:

$ time perl_script line_numbers data | wc -l
10000

real 14m51.597s
user 14m41.878s
sys 0m9.299s

awk takes about twice as long as perl:

$ time awk 'FNR==NR{seen[$0]++};FNR!=NR && FNR in seen' line_numbers data | wc -l
10000

real 29m3.808s
user 28m52.616s
sys 0m10.709s

join performs about the same as awk:

$ time join <(sort padded_line_numbers) <(nl -w 12 -n rz data) | wc -l
10000

real 28m24.053s
user 27m52.857s
sys 0m28.958s

Note that the sorted version shown above incurs almost no performance penalty compared with this one.

Finally, sed appears to be significantly slower; I killed it after approximately nine hours:

$ time sed -nf <(sed 's/$/p/' line_numbers) data | wc -l
^C

real 551m12.747s
user 550m53.390s
sys 0m15.624s

edited 1 hour ago

answered 10 hours ago

fra-san


micha@linux-micha: /tmp
$ cat numbers.txt
1
2
4
5

micha@linux-micha: /tmp
$ cat sentences.txt
alpha
bravo
charlie
delta
echo
foxtrott

micha@linux-micha: /tmp
$ awk 'FNR==NR{seen[$0]++};FNR!=NR && FNR in seen' numbers.txt sentences.txt
alpha
bravo
delta
echo
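
The one-liner relies on the difference between NR (records read so far across all files) and FNR (records read from the current file). A commented sketch of the same idea follows; the function name extract_lines and the array name wanted are hypothetical, and it assumes the numbers file is not empty (otherwise FNR == NR would also hold while reading the data file).

```shell
# usage: extract_lines NUMBERS_FILE DATA_FILE   (hypothetical helper name)
extract_lines() {
    awk '
        # First file: NR and FNR are equal, so remember each requested number.
        FNR == NR { wanted[$0]; next }
        # Second file: FNR is the line number within the data file;
        # a pattern without an action prints the matching line.
        FNR in wanted
    ' "$1" "$2"
}
```

Called as extract_lines numbers.txt sentences.txt, it prints the same four lines as the command above, and awk is started only once however many line numbers are requested.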

edited Mar 15 at 1:03

answered Mar 14 at 23:55

Micha


This one invokes awk many times and will be really slow if sentences.txt is a huge file.

– Weijun Zhou
Mar 14 at 23:59

W. Zhou is right. Thank you! So I'll think twice.

– Micha
Mar 15 at 0:18

So, I edited my first awk approach. Now awk is invoked only once. Maybe this is fast enough?

– Micha
Mar 15 at 1:05
