uniq -c Equivalent for Groups of Lines of Arbitrary Count
I've got a file of ~1-2 million lines that I'm trying to reduce down by counting duplicate groups of lines, preserving order.
uniq -c works okay:
$ perl -E 'say for (("foo") x 4, ("bar") x 4, "baz", ("foo", "bar", "baz") x 4)' | uniq -c
4 foo
4 bar
1 baz
1 foo
1 bar
1 baz
1 foo
1 bar
1 baz
1 foo
1 bar
1 baz
1 foo
1 bar
1 baz
In my use-case (but not in the following foo-bar-baz example), counting pairs of lines is ~20% more efficient, and looks like:
$ perl -E 'say for (("foo") x 4, ("bar") x 4, "baz", ("foo", "bar", "baz") x 4)' \
    | sed 's/^/__STARTOFSTRINGDELIMITER__/' \
    | paste - - \
    | uniq -c \
    | sed -r 's/__STARTOFSTRINGDELIMITER__//; s/__STARTOFSTRINGDELIMITER__/\n\t/;'
2 foo
  foo
2 bar
  bar
1 baz
  foo
1 bar
  baz
1 foo
  bar
1 baz
  foo
1 bar
  baz
1 foo
  bar
1 baz
(That format is acceptable to me.)
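The pair trick generalises to fixed-size chunks without a delimiter string. A minimal sketch, assuming a hypothetical helper named chunk_uniq (not part of any tool) that takes the chunk size as its first argument: awk joins every N input lines into one chunk and counts consecutive identical chunks. Like the paste version, it only sees repeats that happen to line up with chunk boundaries.

```shell
# chunk_uniq N: hypothetical helper, counts consecutive identical chunks of
# N lines read from stdin (N defaults to 2, like "paste - -").
chunk_uniq() {
  awk -v n="${1:-2}" '
    # Print the current run with its repeat count, if a run exists.
    function emit() { if (count) print count, prev }
    # A chunk is complete: extend the current run or start a new one.
    function finish() {
      if (count && chunk == prev) count++
      else { emit(); prev = chunk; count = 1 }
      chunk = ""; filled = 0
    }
    {
      # Join lines of one chunk with a newline plus a two-space indent.
      chunk = (filled++ ? chunk "\n  " : "") $0
      if (filled == n) finish()
    }
    END {
      if (filled) finish()   # trailing partial chunk
      emit()
    }'
}
```

With a chunk size of 2 this should print the same counts as the paste pipeline above, with the second line of each pair indented by two spaces.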
How can I reduce duplicate groups of arbitrary numbers of lines (well, keeping a sane buffer count like 2-10 lines) down to a single copy plus a count of them?
Following the above example, I would want output similar to:
4 foo
4 bar
1 baz
4 foo
  bar
  baz
awk perl uniq
That's similar to what some compression algorithms do. Maybe some avenue worth exploring.
– Stéphane Chazelas
2 days ago
The issue seems to be finding the groups of lines. Your output may as well say that the combination of foo followed by bar occurs 5 times.
– Kusalananda♦
yesterday
@Kusalananda Do you mean foo followed by bar 4 times? (The first two sets of four each.) You would be correct then, yes, and either output would be acceptable for me (either foo x4 then bar x4, or (foo, bar) x4). I assume it would depend on the buffer length: 10 lines of buffer would produce the latter, less than 8 lines of buffer would produce the former. It's not really an issue as you say, just a consideration.
– robut
yesterday
edited 2 days ago by Rui F Ribeiro
asked 2 days ago by robut
1 Answer
I don't have such a huge dataset for benchmarking. Give this a try:
$ perl -E 'say for (("foo") x 4, ("bar") x 4, "baz", ("foo", "bar", "baz") x 4)' | awk 'NR == 1 {word=$0; count=1; next} $0 != word {print count,word; word=$0; count=1; next} {count++} END {print count,word}'
4 foo
4 bar
1 baz
1 foo
1 bar
1 baz
1 foo
1 bar
1 baz
1 foo
1 bar
1 baz
1 foo
1 bar
1 baz
Using mawk instead of awk may improve performance.
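For repeated groups longer than one line, one possible approach is a greedy search over block lengths. The following is only a sketch under stated assumptions, not a benchmarked solution: group_uniq is a hypothetical helper name, its argument is the maximum block length to try (default 10), and the whole input is buffered in memory, which may or may not be acceptable for 1-2 million lines. At each position it picks the block length whose consecutive repeats cover the most lines, and it emits a single line on its own when starting one line later would cover at least as much.

```shell
# group_uniq MAX: hypothetical helper, collapses consecutively repeated blocks
# of up to MAX lines (default 10). Prints "COUNT first-line", then the
# remaining lines of the block indented by two spaces.
group_uniq() {
  awk -v max="${1:-10}" '
    # Count consecutive copies of the p-line block that starts at line i.
    function reps(i, p,    n, j, k) {
      n = 1
      while (1) {
        j = i + n * p                   # start of the next candidate copy
        if (j + p - 1 > NR) break       # not enough lines left for a copy
        for (k = 0; k < p; k++)
          if (line[i + k] != line[j + k]) return n
        n++
      }
      return n
    }
    # Find the block length (1..max) whose repeats cover the most lines
    # from line i. Sets globals bp and bn, returns the lines covered.
    function best(i,    p, n) {
      bp = 1; bn = 1
      for (p = 1; p <= max; p++) {
        n = reps(i, p)
        if (n > 1 && n * p > bn * bp) { bp = p; bn = n }
      }
      return bn * bp
    }
    { line[NR] = $0 "" }                # buffer input, force string compare
    END {
      for (i = 1; i <= NR; i += n * p) {
        cover = best(i); p = bp; n = bn
        # Starting one line later covers at least as much: emit this line alone.
        if (n > 1 && best(i + 1) >= cover) { p = 1; n = 1 }
        print n, line[i]
        for (k = 1; k < p; k++) print "  " line[i + k]
      }
    }'
}
```

Piping the foo-bar-baz example from the question through group_uniq 10 should give 4 foo, 4 bar, 1 baz, and then 4 copies of the foo/bar/baz block. The search cost grows with the maximum block length, so keeping it small is probably wise on large files.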
Can this be adapted to work with multi-word lines?
echo -e 'a b c \n a b c \n a b c' | awk 'NR == 1 {word=$1; count=1; next} $1 != word {print count,word; word=$1; count=1; next} {count++} END {print count,word}'
for example only counts a. Sorry that it wasn't clear in my original question that my actual lines are multi-word with non-alphanumeric characters too.
– robut
yesterday
Just replace the $1 with $0 to compare whole lines. I've edited my answer.
– finswimmer
yesterday
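The difference between $1 and $0 can be seen with a small sketch. count_runs is a hypothetical helper, and its argument is the awk field number used as the comparison key: 1 compares only the first word, 0 compares the whole line.

```shell
# count_runs FIELD: hypothetical helper, counts consecutive lines whose
# awk field number FIELD is identical (FIELD 0 means the whole line).
count_runs() {
  awk -v f="${1:-0}" '
    { key = $f "" }                                   # "" forces string compare
    NR > 1 && key != prev { print n, prev; n = 0 }    # run ended: report it
    { prev = key; n++ }
    END { if (NR) print n, prev }'                    # report the last run
}

# Three lines that share a first word but differ as whole lines.
sample='a b c
a b c
a x'
printf '%s\n' "$sample" | count_runs 1   # keyed on the first word only
printf '%s\n' "$sample" | count_runs 0   # keyed on the whole line
```

Keyed on field 1, all three sample lines fall into one run; keyed on field 0, the third line starts a new run.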
answered yesterday by finswimmer (edited yesterday)