cat a very large number of files together in correct order
I have about 15,000 files that are named file_1.pdb, file_2.pdb, etc. I can cat a few thousand of these in order by doing:
cat file_{1..2000}.pdb >> file_all.pdb
However, if I do this for 15,000 files, I get the error
-bash: /bin/cat: Argument list too long
I have seen this problem solved with find . -name xx -exec xx, but that wouldn't preserve the order in which the files are joined. How can I achieve this?
files find cat brace-expansion
3
What is the tenth file named as? (Or any file with more than a single digit numbered ordering.)
– roaima
Feb 26 '18 at 17:33
I (now) have 15,000 of these files in a directory and your cat file_{1..15000}.pdb construct works fine for me.
– roaima
Feb 26 '18 at 17:36
11
It depends on the system what the limit is. getconf ARG_MAX should tell.
– ilkkachu
Feb 26 '18 at 17:43
@ilkkachu new one to me - thank you
– roaima
Feb 26 '18 at 21:40
3
Consider changing your question to "thousands of " or "a very large number of" files. Might make the question easier to find for other people with a similar problem.
– msouth
Feb 27 '18 at 6:32
asked Feb 26 '18 at 17:25, edited Feb 27 '18 at 6:55
– sodiumnitrate
6 Answers
Using find, sort and xargs:
find . -maxdepth 1 -type f -name 'file_*.pdb' -print0 |
sort -zV |
xargs -0 cat >all.pdb
The find command finds all relevant files, then prints their pathnames to sort, which does a "version sort" to get them in the right order (if the numbers in the filenames had been zero-filled to a fixed width we would not have needed -V). xargs takes this list of sorted pathnames and runs cat on them in batches that are as large as possible.
This should work even if the filenames contain strange characters such as newlines and spaces. We use -print0 with find to give sort nul-terminated names, and sort handles these using -z. xargs too reads nul-terminated names with its -0 flag.
Note that I'm writing the result to a file whose name does not match the pattern file_*.pdb.
The above solution uses some non-standard flags for some utilities. These are supported by the GNU implementation of these utilities and at least by the OpenBSD and the macOS implementation.
The non-standard flags used are
-maxdepth 1, to make find only enter the top-most directory but no subdirectories. POSIXly, use find . ! -name . -prune ...
-print0, to make find output nul-terminated pathnames (this was considered by POSIX but rejected). One could use -exec printf '%s\0' {} + instead.
-z, to make sort take nul-terminated records. There is no POSIX equivalent.
-V, to make sort sort e.g. 200 after 3. There is no POSIX equivalent, but it could be replaced by a numeric sort on specific parts of the filename if the filenames have a fixed prefix.
-0, to make xargs read nul-terminated records. There is no POSIX equivalent. POSIXly, one would need to quote the file names in a format recognised by xargs.
If the pathnames are well behaved, and if the directory structure is flat (no subdirectories), then one could make do without these flags, except for -V with sort.
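To see why -V (or a numeric key) matters, here is a small sketch that sorts three sample names in three ways. The sample names are made up for illustration. `sort -V` is assumed to be the GNU/BSD extension discussed above, while the `-t _ -k 2n` variant relies only on the fixed `file_` prefix.

```shell
# Hypothetical sample names; not taken from the question's directory.
names='file_200.pdb
file_3.pdb
file_10.pdb'

# Plain lexical sort compares character by character, so "10" sorts before "3".
lexical=$(printf '%s\n' "$names" | LC_ALL=C sort)

# Version sort (non-standard -V) compares the digit runs as numbers.
version=$(printf '%s\n' "$names" | LC_ALL=C sort -V)

# Numeric sort on the text after the first underscore (standard options only).
numeric=$(printf '%s\n' "$names" | LC_ALL=C sort -t _ -k 2n)

printf 'lexical:\n%s\nversion:\n%s\nnumeric:\n%s\n' "$lexical" "$version" "$numeric"
```

The numeric-key form only works because every name starts with the same prefix; with pathnames such as ./file_3.pdb the key position is unchanged here because the underscore is still the first separator.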
1
You don't need nonstandard null termination for this. These filenames are exceedingly boring and the POSIX tools are entirely capable of handling them.
– Kevin
Feb 27 '18 at 0:52
6
You could also write this more succinctly with the asker's specification as printf 'file_%d.pdb\0' {1..15000} | xargs -0 cat, or even, with Kevin's point, echo file_{1..15000}.pdb | xargs cat. The find solution has considerably more overhead since it has to search the file system for those files, but it is more useful when some of the files may not exist.
– kojiro
Feb 27 '18 at 3:48
4
@Kevin while what you are saying is true, it's arguably better to have an answer that applies in more general circumstances. Of the next thousand people that have this question, it's likely that some of them will have spaces or whatever in their file names.
– msouth
Feb 27 '18 at 6:30
1
@chrylis A redirection is never part of a command's arguments, and it's xargs rather than cat that is redirected (each cat invocation will use xargs' standard output). If we had said xargs -0 sh -c 'cat >all.pdb' then it would have made sense to use >> instead of >, if that's what you're hinting at.
– Kusalananda♦
Feb 27 '18 at 7:14
1
It looks like sort -n -k1.6 would work (for the original, file_nnn filenames, or sort -n -k1.5 for the ones without the underscore).
– Scott
Feb 28 '18 at 8:32
With zsh (where that {1..15000} operator comes from):
autoload zargs # best in ~/.zshrc
zargs file_{1..15000}.pdb -- cat > file_all.pdb
Or for all file_<digits>.pdb files in numerical order:
zargs file_<->.pdb(n) -- cat > file_all.pdb
(where <x-y> is a glob operator that matches on decimal numbers x to y. With no x nor y, it's any decimal number. Equivalent to extendedglob's [0-9]## or kshglob's +([0-9]) (one or more digits)).
With ksh93, using its builtin cat command (so not affected by that limit of the execve() system call since there's no execution):
command /opt/ast/bin/cat file_{1..15000}.pdb > file_all.pdb
With bash/zsh/ksh93 (which support zsh's {x..y} and have printf builtin):
printf '%s\n' file_{1..15000}.pdb | xargs cat > file_all.pdb
On a GNU system or compatible, you could also use seq:
seq -f 'file_%.17g.pdb' 15000 | xargs cat > file_all.pdb
For the xargs-based solutions, special care would have to be taken for file names that contain blanks, single or double quotes or backslashes.
For instance, for -It's a trickier filename - 12.pdb, use:
seq -f "\"./-It's a trickier filename - %.17g.pdb\"" 15000 |
xargs cat > file_all.pdb
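As a hedged sketch of another way to take that "special care": where xargs supports the non-standard -0 option (GNU and the BSDs do), the names can be passed NUL-terminated so that no quoting is needed. The temporary directory, the three file names and their contents are invented for this demonstration.

```shell
# Work in a throwaway directory (hypothetical demo data).
demo=$(mktemp -d)
cd "$demo" || exit 1

# Create three files whose names contain blanks, a leading dash and a quote.
i=1
while [ "$i" -le 3 ]; do
    printf 'content %d\n' "$i" > "./-It's a trickier filename - $i.pdb"
    i=$((i + 1))
done

# Emit each name followed by a NUL byte, in numeric order, and let
# xargs -0 hand the names to cat unmodified.
i=1
while [ "$i" -le 3 ]; do
    printf '%s\0' "./-It's a trickier filename - $i.pdb"
    i=$((i + 1))
done | xargs -0 cat > all.pdb

cat all.pdb
```

The leading ./ keeps cat from treating the name as an option, which is the same reason it appears in the seq example above.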
The seq -f | xargs cat > is the most elegant and effective solution (IMHO).
– Hastur
Feb 27 '18 at 11:38
Check the trickier filename... maybe '"./-It'\''s a trickier filename - %.17g.pdb"'?
– Hastur
Feb 27 '18 at 11:41
@Hastur, oops! Yes, thanks, I've changed it to an alternative quoting syntax. Yours would work as well.
– Stéphane Chazelas
Feb 27 '18 at 11:56
A for loop is possible, and very simple.
for i in file_{1..15000}.pdb; do cat "$i" >> file_all.pdb; done
The downside is that you invoke cat a hell of a lot of times. But if you can't remember exactly how to do the stuff with find, and the invocation overhead isn't too bad in your situation, then it's worth keeping in mind.
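A variant of the same loop, as a sketch: redirecting the whole loop rather than each cat opens the output file only once, and a counter avoids expanding all 15,000 names up front. The function name cat_range is made up for this example, and it still runs cat once per file.

```shell
# cat_range FIRST LAST: write file_FIRST.pdb .. file_LAST.pdb to stdout, in order.
# (cat_range is a hypothetical helper name, not part of any tool.)
cat_range() {
    i=$1
    while [ "$i" -le "$2" ]; do
        cat "file_$i.pdb"
        i=$((i + 1))
    done
}

# Usage (commented out; assumes the files exist in the current directory):
# cat_range 1 15000 > file_all.pdb
```

Because the redirection uses > on the function call, rerunning it replaces file_all.pdb instead of appending to a previous result.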
I often add an echo $i; in the loop body as a "progress indicator".
– Rolf
Mar 6 '18 at 7:31
seq 1 15000 | awk '{print "file_"$0".pdb"}' | xargs cat > file_all.pdb
1
awk can do seq's job here and seq can do awk's job: seq -f file_%.10g.pdb 15000. Note that seq is not a standard command.
– Stéphane Chazelas
Feb 27 '18 at 9:31
Thanks Stéphane -- I think seq -f is a great way to do this; will remember that.
– LarryC
Feb 28 '18 at 17:53
Premise
You should not run into that error with only 15k files in that specific name format [1,2].
If you are running that expansion from another directory and have to add the path to each file, your command line gets longer, and then of course the error can occur.
Solution: run the command from that directory.
(cd That/Directory && cat file_{1..2000}.pdb >> file_all.pdb)
Best solution: if instead I guessed wrong and you are already running it from the directory that holds the files...
IMHO the best solution is Stéphane Chazelas' one:
seq -f 'file_%.17g.pdb' 15000 | xargs cat > file_all.pdb
with printf or seq. Tested on 15k files, each containing only its own number and already cached, it is even the fastest one (at present, and apart from the OP's own command run from the directory that holds the files).
A few more words
Your shell should be able to accept command lines longer than that.
Your command line is 213914 characters long and contains 15003 words:
echo cat file_{1..15000}.pdb " > file_all.pdb" | wc
Even adding 8 bytes for each word gives 333938 bytes (0.3M), far below the 2097142 (2.1M) reported by ARG_MAX on a 3.13.0 kernel, or the slightly smaller 2088232 reported as "Maximum length of command we could actually use" by xargs --show-limits.
Have a look at the output of the following on your system:
getconf ARG_MAX
xargs --show-limits
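The arithmetic above can be rechecked without brace expansion. This is a sketch that rebuilds the text of the command line with seq (assuming GNU seq for -f) and counts it with wc; the variable names are chosen for this example only.

```shell
# Rebuild the text of the command line: "cat", 15000 names, then " > file_all.pdb".
# tr turns seq's newlines into the single spaces that separate the words.
line="cat $(seq -f 'file_%.17g.pdb' 15000 | tr '\n' ' ') > file_all.pdb"

# Count characters (including the trailing newline, as echo | wc would) and words.
chars=$(( $(printf '%s\n' "$line" | wc -c) ))
words=$(( $(printf '%s\n' "$line" | wc -w) ))

# Add 8 bytes per word, the same allowance used in the estimate above.
padded=$(( chars + 8 * words ))

echo "$chars characters, $words words, $padded bytes with padding"
echo "ARG_MAX on this system: $(getconf ARG_MAX)"
```

Comparing the padded figure with the ARG_MAX line shows how much headroom a given system has; the values quoted in the answer are from the author's machine and will differ elsewhere.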
Laziness-guided solution
In cases like this I prefer to work in blocks, not least because that usually turns out to be time efficient.
The logic (if any) is that I'm far too lazy to write 1..1000, 1001..2000 and so on.
So I ask a script to do it for me.
Only after I have checked that the output is correct do I redirect it to a script.
... but laziness is a state of mind.
Since I'm allergic to xargs (I really should have used xargs here) and I do not want to check how to use it, I invariably end up reinventing the wheel, as in the examples below (tl;dr).
Note that since the file names are controlled (no spaces, newlines...) you can easily get by with something like the script below.
tl;dr
Version 1: the optional parameters are the first file number, the last file number, the block size and the output file.
#!/bin/bash
StartN=${1:-1}            # First file number
EndN=${2:-15000}          # Last file number
BlockN=${3:-100}          # Files in a block
OutFile=${4:-"all.pdb"}   # Output file name
CurrentStart=$StartN
# Each value from seq is the last file number of one block
for i in $(seq "$StartN" "$BlockN" "$EndN")
do
  CurrentEnd=$i
  cat $(seq -f file_%.17g.pdb "$CurrentStart" "$CurrentEnd") >> "$OutFile"
  CurrentStart=$(( CurrentEnd + 1 ))
done
# Here you may need to do a last iteration for the part cut from seq
[[ $EndN -ge $CurrentStart ]] &&
  cat $(seq -f file_%.17g.pdb "$CurrentStart" "$EndN") >> "$OutFile"
Version 2
Calling bash for the expansion (about 20% slower in my tests).
#!/bin/bash
StartN=${1:-1}            # First file number
EndN=${2:-15000}          # Last file number
BlockN=${3:-100}          # Files in a block
OutFile=${4:-"all.pdb"}   # Output file name
CurrentStart=$StartN
# Each value from seq is the last file number of one block
for i in $(seq "$StartN" "$BlockN" "$EndN")
do
  CurrentEnd=$i
  # The braces stay literal here; the child bash performs the brace expansion
  echo cat file_{$CurrentStart..$CurrentEnd}.pdb | /bin/bash >> "$OutFile"
  CurrentStart=$(( CurrentEnd + 1 ))
done
# Here you may need to do a last iteration for the part cut from seq
[[ $EndN -ge $CurrentStart ]] &&
  echo cat file_{$CurrentStart..$EndN}.pdb | /bin/bash >> "$OutFile"
Of course you can go further and get rid of seq [3] (from coreutils) completely, working directly with the variables in bash, or use python, or compile a C program to do it [4]...
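For comparison, the block logic of the two scripts can be sketched with xargs -n, which runs cat once per fixed-size batch of names. The function name cat_blocks is hypothetical, the default block size of 100 is carried over from the scripts above, and seq -f assumes GNU seq.

```shell
# cat_blocks FIRST LAST [BLOCK]: concatenate file_FIRST.pdb .. file_LAST.pdb to
# stdout, invoking cat on at most BLOCK names at a time (default 100).
# (cat_blocks is a hypothetical name; the file names must not contain
# blanks, quotes or backslashes, because xargs would split or unquote them.)
cat_blocks() {
    seq -f 'file_%.17g.pdb' "$1" "$2" | xargs -n "${3:-100}" cat
}

# Usage (commented out; assumes the files exist in the current directory):
# cat_blocks 1 15000 100 > all.pdb
```

Leaving out -n lets xargs pick the largest batch that fits, which is usually what is wanted; the explicit block size is shown only to mirror the scripts.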
Note that %g is short for %.6g. It would represent 1,000,000 as 1e+06 for instance.
– Stéphane Chazelas
Feb 27 '18 at 11:15
Really lazy people use the tools designed for the task of working around that E2BIG limitation, like xargs, zsh's zargs or ksh93's command -x.
– Stéphane Chazelas
Feb 27 '18 at 11:16
seq is not a bash builtin, it's a command from GNU coreutils. seq -f %g 1000000 1000000 outputs 1e+06 even in the latest version of coreutils.
– Stéphane Chazelas
Feb 27 '18 at 11:26
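The difference between the two formats can be shown with a minimal sketch, assuming GNU seq. It only matters once the file numbers reach seven digits, which is beyond the 15,000 files in the question.

```shell
# %g keeps six significant digits, so 1000000 is printed in exponent form;
# %.17g keeps enough digits to print the integer as written.
short=$(seq -f 'file_%g.pdb' 1000000 1000000)
exact=$(seq -f 'file_%.17g.pdb' 1000000 1000000)

printf '%s\n%s\n' "$short" "$exact"
```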
@StéphaneChazelas Laziness is a state of mind. Strange to say, but I feel more comfortable when I can see (and visually check) the output of a serialized command and only then redirect it to execution. That construction makes me think less than xargs does... but I understand it is personal and maybe related only to me.
– Hastur
Feb 27 '18 at 11:28
@StéphaneChazelas Gotcha, right... Fixed. Thanks. I tested only with the 15k files given by the OP, my bad.
– Hastur
Feb 27 '18 at 11:31
Another way to do it could be
(cat file_{1..499}.pdb; cat file_{500..999}.pdb; cat file_{1000..1499}.pdb; cat file_{1500..2000}.pdb) >> file_all.pdb
– Kusalananda (find, sort and xargs answer), answered Feb 26 '18 at 17:33, edited Apr 4 at 21:22
1
You don't need nonstandard null termination for this. These filenames are exceedingly boring and the POSIX tools are entirely capable of handling then.
– Kevin
Feb 27 '18 at 0:52
6
You could also write this more succinctly with the asker’s specification asprintf ‘file_%d.pdb’ 1..15000 | xargs -0 cat, or even with Kevin’s point,echo file_1..15000.pdb | xargs cat. Thefindsolution has considerably more overhead since it has to search the file system for those files, but it is more useful when some of the files may not exist.
– kojiro
Feb 27 '18 at 3:48
4
@Kevin while what you are saying is true, it's arguably better to have an answer that applies in more general circumstances. Of the next thousand people that have this question, it's likely that some of them will have spaces or whatever in their file names.
– msouth
Feb 27 '18 at 6:30
1
@chrylis A redirection is never part of a command's arguments, and it'sxargsrather thancatthat is redirected (eachcatinvocation will usexargsstandard output). If we had saidxargs -0 sh -c 'cat >all.pdb'then it would have made sense to use>>instead of>, if that's what you're hinting at.
– Kusalananda♦
Feb 27 '18 at 7:14
1
It looks likesort -n -k1.6would work (for the original,file_nnnfilenames, orsort -n -k1.5for the ones without the underscore).
– Scott
Feb 28 '18 at 8:32
With zsh (where that {1..15000} operator comes from):
autoload zargs # best in ~/.zshrc
zargs file_{1..15000}.pdb -- cat > file_all.pdb
Or for all file_<digits>.pdb files in numerical order:
zargs file_<->.pdb(n) -- cat > file_all.pdb
(where <x-y> is a glob operator that matches on decimal numbers x to y. With no x nor y, it's any decimal number. Equivalent to extendedglob's [0-9]## or kshglob's +([0-9]) (one or more digits)).
With ksh93, using its builtin cat command (so not affected by that limit of the execve() system call since there's no execution):
command /opt/ast/bin/cat file_{1..15000}.pdb > file_all.pdb
With bash/zsh/ksh93 (which support zsh's {x..y} and have printf builtin):
printf '%s\n' file_{1..15000}.pdb | xargs cat > file_all.pdb
On a GNU system or compatible, you could also use seq:
seq -f 'file_%.17g.pdb' 15000 | xargs cat > file_all.pdb
For the xargs-based solutions, special care would have to be taken for file names that contain blanks, single or double quotes or backslashes.
Like for -It's a trickier filename - 12.pdb, use:
seq -f "\"./-It's a trickier filename - %.17g.pdb\"" 15000 |
xargs cat > file_all.pdb
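For names with blanks or quotes, an alternative to quoting every line for xargs is to pass the names NUL-delimited. The following is only a minimal sketch under stated assumptions: it assumes an xargs that supports -0 (GNU and BSD implementations do) and a shell whose printf is a builtin. concat_files is a hypothetical helper name used for illustration.

```shell
# Hypothetical helper: write the named files, in the order given, to stdout.
# Assumes xargs supports -0 (GNU and BSD implementations do).
concat_files() {
    # printf is a shell builtin here, so a long "$@" never goes through
    # execve(); xargs splits the list into as many cat calls as needed.
    # "--" stops option parsing, so names starting with "-" are safe.
    printf '%s\0' "$@" | xargs -0 cat --
}

# Usage (brace expansion needs bash, zsh or ksh93):
# concat_files "-It's a trickier filename - "{1..15000}.pdb > file_all.pdb
```

Called with no arguments, the helper would hand cat a single empty name, so it is meant to be used with at least one file.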
edited Feb 27 '18 at 11:55
answered Feb 26 '18 at 17:52
Stéphane Chazelas
The seq -f | xargs cat > approach is the most elegant and effective solution (IMHO).
– Hastur
Feb 27 '18 at 11:38
Check the trickier filename... maybe '"./-It'\''s a trickier filename - %.17g.pdb"'?
– Hastur
Feb 27 '18 at 11:41
@Hastur, oops! Yes, thanks, I've changed it to an alternative quoting syntax. Yours would work as well.
– Stéphane Chazelas
Feb 27 '18 at 11:56
A for loop is possible, and very simple.
for i in file_{1..15000}.pdb; do cat "$i" >> file_all.pdb; done
The downside is that you invoke cat a hell of a lot of times. But if you can't remember exactly how to do the stuff with find and the invocation overhead isn't too bad in your situation, then it's worth keeping in mind.
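A variant of the same idea, as a minimal sketch: the output file is opened once for the whole loop instead of once per file, and a progress line goes to stderr so it cannot end up in the concatenated output. It uses a counter rather than brace expansion so that it also runs in a plain POSIX sh; concat_loop is a hypothetical helper name, not part of the answer above.

```shell
# Hypothetical helper: concat_loop FIRST LAST
# Writes file_FIRST.pdb ... file_LAST.pdb to stdout, one cat call per file.
concat_loop() {
    i=$1
    while [ "$i" -le "$2" ]; do
        printf '%s\n' "file_$i.pdb" >&2   # progress indicator on stderr
        cat "file_$i.pdb"
        i=$((i + 1))
    done
}

# The redirection applies to the whole loop, so the file is opened once:
# concat_loop 1 15000 > file_all.pdb
```

This still starts one cat process per file, so the overhead described above remains.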
answered Feb 26 '18 at 18:54
OmnipotentEntity
I often add an echo $i; in the loop body as a "progress indicator".
– Rolf
Mar 6 '18 at 7:31
seq 1 15000 | awk '{print "file_"$0".pdb"}' | xargs cat > file_all.pdb
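Since seq is not a standard command, the name list can also come from awk alone. A minimal sketch, assuming the file_N.pdb naming from the question; list_names is a hypothetical helper name:

```shell
# Hypothetical helper: list_names FIRST LAST prints file_N.pdb, one per line.
list_names() {
    awk -v first="$1" -v last="$2" 'BEGIN {
        # "+ 0" forces numeric comparison whatever the awk implementation.
        for (i = first + 0; i <= last + 0; i++) print "file_" i ".pdb"
    }'
}

# list_names 1 15000 | xargs cat > file_all.pdb
```

As with the pipeline above, plain xargs is only safe here because these names contain no blanks, quotes or backslashes.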
answered Feb 26 '18 at 20:12
LarryC
awk can do seq's job here and seq can do awk's job: seq -f file_%.10g.pdb 15000. Note that seq is not a standard command.
– Stéphane Chazelas
Feb 27 '18 at 9:31
Thanks Stéphane -- I think seq -f is a great way to do this; will remember that.
– LarryC
Feb 28 '18 at 17:53
Premise
You shouldn't run into that error with only 15k files in that specific name format [1,2].
If you are running that expansion from another directory and have to add the path to each file, your command line will be longer, and then of course the error can occur.
Solution: run the command from that directory.
(cd That/Directory ; cat file_{1..2000}.pdb >> file_all.pdb )
Best solution: if instead I guessed wrong and you are already running it from the directory that contains the files...
IMHO the best solution is Stéphane Chazelas's:
seq -f 'file_%.17g.pdb' 15000 | xargs cat > file_all.pdb
with printf or seq. Tested on 15k pre-cached files, each containing only its own number, it is even the fastest one (so far, and apart from the OP's own command run from the directory that contains the files).
A few more words
Your shell should be able to accept command lines longer than this one.
Your command line is 213914 characters long and contains 15003 words: echo cat file_{1..15000}.pdb " > file_all.pdb" | wc
Even adding 8 bytes for each word, that is 333938 bytes (0.3M), far below the 2097142 (2.1M) reported by ARG_MAX on a 3.13.0 kernel, or the slightly smaller 2088232 reported as "Maximum length of command we could actually use" by xargs --show-limits.
Have a look at the output of the following on your system:
getconf ARG_MAX
xargs --show-limits
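The same arithmetic can be reproduced for any argument list. A minimal sketch, assuming 8 bytes per argument pointer as in the estimate above; args_size is a hypothetical helper name, and ${#arg} counts characters, which equals bytes only for ASCII names:

```shell
# Hypothetical helper: args_size ARG... prints a rough byte count for passing
# ARG... to a command: each string plus its terminating NUL, plus 8 bytes per
# argument for the pointer (an assumption that fits 64-bit systems).
args_size() {
    total=0
    for arg in "$@"; do
        total=$((total + ${#arg} + 1 + 8))
    done
    printf '%s\n' "$total"
}

# Compare the estimate with the system limit:
# args_size cat file_{1..15000}.pdb   # brace expansion needs bash, zsh or ksh93
# getconf ARG_MAX
```

The limit applies to the arguments and the environment together, so the room actually available is somewhat smaller than ARG_MAX.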
Laziness guided solution
In cases like this I prefer to work in blocks, not least because that usually turns out to be time-efficient.
The logic (if any) is that I'm far too lazy to write 1..1000, 1001..2000, etc.
So I ask a script to do it for me.
Only after I've checked that the output is correct do I redirect it to a script.
... but Laziness is a state of mind.
Since I'm allergic to xargs (I really should have used xargs here) and I do not want to check how to use it, I invariably end up reinventing the wheel, as in the examples below (tl;dr).
Note that since the file names are controlled (no spaces, newlines...), you can easily get away with something like the script below.
tl;dr
Version 1: takes as optional parameters the first file number, the last file number, the block size and the output file.
#!/bin/bash
StartN=${1:-1}            # First file number
EndN=${2:-15000}          # Last file number
BlockN=${3:-100}          # Files in a block
OutFile=${4:-"all.pdb"}   # Output file name
CurrentStart=$StartN
for i in $(seq "$StartN" "$BlockN" "$EndN")
do
    CurrentEnd=$i
    cat $(seq -f file_%.17g.pdb "$CurrentStart" "$CurrentEnd") >> "$OutFile"
    CurrentStart=$(( CurrentEnd + 1 ))
done
# Here you may need to do a last iteration for the part cut from seq
[[ $EndN -ge $CurrentStart ]] &&
    cat $(seq -f file_%.17g.pdb "$CurrentStart" "$EndN") >> "$OutFile"
Version 2
Calling bash for the expansion (about 20% slower in my tests).
#!/bin/bash
StartN=${1:-1}            # First file number
EndN=${2:-15000}          # Last file number
BlockN=${3:-100}          # Files in a block
OutFile=${4:-"all.pdb"}   # Output file name
CurrentStart=$StartN
for i in $(seq "$StartN" "$BlockN" "$EndN")
do
    CurrentEnd=$i
    # The braces are printed literally here; the child bash expands them.
    echo cat file_{$CurrentStart..$CurrentEnd}.pdb | /bin/bash >> "$OutFile"
    CurrentStart=$(( CurrentEnd + 1 ))
done
# Here you may need to do a last iteration for the part cut from seq
[[ $EndN -ge $CurrentStart ]] &&
    echo cat file_{$CurrentStart..$EndN}.pdb | /bin/bash >> "$OutFile"
Of course you can go further and get rid of seq [3] (from coreutils) completely, working directly with the variables in bash, or use Python, or compile a C program to do it [4]...
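As an illustration of the seq-free route, here is a minimal sketch that builds each block with shell arithmetic only. concat_blocks and its three parameters are hypothetical names chosen for this example, and it assumes the same file_N.pdb naming as the scripts above.

```shell
# Hypothetical helper: concat_blocks FIRST LAST BLOCK
# Writes file_FIRST.pdb ... file_LAST.pdb to stdout, BLOCK files per cat call.
concat_blocks() {
    first=$1 last=$2 block=$3
    start=$first
    while [ "$start" -le "$last" ]; do
        end=$((start + block - 1))
        [ "$end" -gt "$last" ] && end=$last   # the last block may be shorter
        # Build this block's argument list in the positional parameters.
        set --
        n=$start
        while [ "$n" -le "$end" ]; do
            set -- "$@" "file_$n.pdb"
            n=$((n + 1))
        done
        cat "$@"
        start=$((end + 1))
    done
}

# concat_blocks 1 15000 100 > file_all.pdb
```

Because the last block is clamped to LAST inside the loop, no separate final iteration is needed.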
edited Feb 27 '18 at 13:11
answered Feb 27 '18 at 11:08
Hastur
Note that %g is short for %.6g. It would represent 1,000,000 as 1e+06 for instance.
– Stéphane Chazelas
Feb 27 '18 at 11:15
Really lazy people use the tools designed for the task of working around that E2BIG limitation, like xargs, zsh's zargs or ksh93's command -x.
– Stéphane Chazelas
Feb 27 '18 at 11:16
seq is not a bash builtin, it's a command from GNU coreutils. seq -f %g 1000000 1000000 outputs 1e+06 even in the latest version of coreutils.
– Stéphane Chazelas
Feb 27 '18 at 11:26
@StéphaneChazelas Laziness is a state of mind. Strange to say, but I feel more comfortable when I can see (and visually check) the output of a serialized command and only then redirect it to execution. That construction makes me think less than xargs does... but I understand it is personal and maybe applies only to me.
– Hastur
Feb 27 '18 at 11:28
@StéphaneChazelas Gotcha, right... Fixed. Thanks. I tested only with the 15k files given by the OP, my bad.
– Hastur
Feb 27 '18 at 11:31
Another way to do it could be
(cat file_{1..499}.pdb; cat file_{500..999}.pdb; cat file_{1000..1499}.pdb; cat file_{1500..2000}.pdb) >> file_all.pdb
answered Feb 27 '18 at 14:51
glglglglglgl
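The same batching idea could also be written as a loop, so that the chunk boundaries do not have to be typed by hand. This is only a sketch: the scratch directory, the total of 25 files and the chunk size of 10 are assumptions for the demo, and seq is an external command rather than part of the shell.

```shell
# Hypothetical demo setup: 25 small numbered files in a scratch directory.
work_dir=$(mktemp -d)
cd "$work_dir" || exit 1
i=1
while [ "$i" -le 25 ]; do
    printf 'record %d\n' "$i" > "file_$i.pdb"
    i=$((i + 1))
done

total=25   # highest file number (assumed)
chunk=10   # files per cat call (assumed; pick something well below the limit)
start=1
: > file_all.pdb   # start from an empty output file

while [ "$start" -le "$total" ]; do
    end=$((start + chunk - 1))
    if [ "$end" -gt "$total" ]; then
        end=$total
    fi
    # Build the names file_<start>.pdb .. file_<end>.pdb in numeric order
    # and append this chunk to the output.
    cat $(seq "$start" "$end" | sed 's/.*/file_&.pdb/') >> file_all.pdb
    start=$((end + 1))
done
```

The unquoted command substitution relies on the file names containing no whitespace, which holds for names of the form file_N.pdb.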
3
What is the tenth file named as? (Or any file with more than a single digit numbered ordering.)
– roaima
Feb 26 '18 at 17:33
I (now) have 15,000 of these files in a directory and your cat file_{1..15000}.pdb construct works fine for me.
– roaima
Feb 26 '18 at 17:36
11
What the limit is depends on the system. getconf ARG_MAX should tell.
– ilkkachu
Feb 26 '18 at 17:43
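A quick way to see what that means in practice is sketched below. The 15-byte figure is an assumption based on a name like file_15000.pdb (14 characters plus a terminating NUL), and the estimate ignores the environment and per-argument overhead, so the usable number is lower.

```shell
# Ask the system for its limit on the combined size of arguments and environment.
arg_max=$(getconf ARG_MAX)
echo "ARG_MAX: $arg_max bytes"

# Assumed size of one file name argument, including the terminating NUL.
name_bytes=15

# Very rough upper bound on how many such names fit on one command line.
echo "Rough upper bound on names per command: $((arg_max / name_bytes))"
```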
@ilkkachu new one to me - thank you
– roaima
Feb 26 '18 at 21:40
3
Consider changing your question to "thousands of" or "a very large number of" files. That might make the question easier to find for other people with a similar problem.
– msouth
Feb 27 '18 at 6:32