uniq -c Equivalent for Groups of Lines of Arbitrary Count

I've got a file of ~1-2 million lines that I'm trying to reduce down by counting duplicate groups of lines, preserving order.

uniq -c works okay for runs of identical single lines:

$ perl -E 'say for (("foo") x 4, ("bar") x 4, "baz", ("foo", "bar", "baz") x 4)' | uniq -c
 4 foo
 4 bar
 1 baz
 1 foo
 1 bar
 1 baz
 1 foo
 1 bar
 1 baz
 1 foo
 1 bar
 1 baz
 1 foo
 1 bar
 1 baz

In my use-case (but not in the following foo-bar-baz example), counting pairs of lines is ~20% more efficient, and looks like:

$ perl -E 'say for (("foo") x 4, ("bar") x 4, "baz", ("foo", "bar", "baz") x 4)' \
 | sed 's/^/__STARTOFSTRINGDELIMITER__/' \
 | paste - - \
 | uniq -c \
 | sed -r 's/__STARTOFSTRINGDELIMITER__//; s/__STARTOFSTRINGDELIMITER__/\n\t/;'
 2 foo
   foo
 2 bar
   bar
 1 baz
   foo
 1 bar
   baz
 1 foo
   bar
 1 baz
   foo
 1 bar
   baz
 1 foo
   bar
 1 baz

(That format is acceptable to me.)

How can I reduce duplicate groups of arbitrary numbers of lines (well, keeping a sane buffer count like 2-10 lines) down to a single copy plus a count of them?

Following the above example, I would want output similar to:

4 foo
4 bar
1 baz
4 foo
  bar
  baz

asked 2 days ago by robut (edited 2 days ago by Rui F Ribeiro)

That's similar to what some compression algorithms do. Maybe some avenue worth exploring.

– Stéphane Chazelas
2 days ago

The issue seems to be finding the groups of lines. Your output may as well say that the combination of foo followed by bar occurs 5 times.

– Kusalananda♦
yesterday

@Kusalananda Do you mean foo followed by bar 4 times? (The first two sets of four each.) You would be correct then, yes, and either output would be acceptable for me (either foo x4 then bar x4, or (foo, bar) x4). I assume it would depend on the buffer length: 10 lines of buffer would produce the latter, fewer than 8 lines of buffer would produce the former. It's not really an issue as you say, just a consideration.

– robut
yesterday

Tags: awk, perl, uniq


1 Answer

I don't have such a huge dataset for benchmarking. Give this a try:

$ perl -E 'say for (("foo") x 4, ("bar") x 4, "baz", ("foo", "bar", "baz") x 4)' | awk 'NR == 1 {word=$0; count=1; next} $0 != word {print count,word; word=$0; count=1; next} {count++} END {print count,word}'
4 foo
4 bar
1 baz
1 foo
1 bar
1 baz
1 foo
1 bar
1 baz
1 foo
1 bar
1 baz
1 foo
1 bar
1 baz

Using mawk instead of awk may improve performance.

answered yesterday by finswimmer (edited yesterday)

Can this be adapted to work with multi-word lines? echo -e 'a b c \n a b c \n a b c' | awk 'NR == 1 {word=$1; count=1; next} $1 != word {print count,word; word=$1; count=1; next} {count++} END {print count,word}' for example only counts a. Sorry that it wasn't clear in my original question that my actual lines are multi-word, with non-alphanumeric characters too.

– robut
yesterday

Just replace the $1 with $0 to compare whole lines. I've edited my answer.

– finswimmer
yesterday
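
The answer above counts runs of single lines, like uniq -c. The following is only a rough sketch of how the same idea might be stretched to repeated blocks of up to N lines, as the question asks. It is not from the thread: the function name group_uniq and its max argument are made up for illustration. It assumes the whole input fits in memory (it buffers every line) and that a greedy left-to-right choice is acceptable: at each position it picks the block length that covers the most lines, preferring the shorter block on a tie.

```shell
#!/bin/sh
# group_uniq [MAX]  (hypothetical helper, not a standard tool)
# Reads lines on stdin. For each group it prints "count<TAB>first line",
# then "<TAB>line" for every further line of a multi-line block.
# MAX is the largest block length to try (default 10).
group_uniq() {
  awk -v max="${1:-10}" '
    # Buffer the whole input so blocks can be compared by index.
    { line[NR] = $0 }

    # reps(i, k): number of back to back copies of the k-line block at i.
    function reps(i, k,    r, j, ok) {
      r = 1
      while (i + (r + 1) * k - 1 <= NR) {
        ok = 1
        for (j = 0; j < k; j++)
          if (line[i + j] != line[i + r * k + j]) { ok = 0; break }
        if (!ok) break
        r++
      }
      return r
    }

    END {
      i = 1
      while (i <= NR) {
        bestk = 1; bestr = 1
        # Try every block length that leaves room for at least two copies.
        for (k = 1; k <= max && i + 2 * k - 1 <= NR; k++) {
          r = reps(i, k)
          # Keep the length covering the most lines; ties go to the shorter block.
          if (r > 1 && r * k > bestr * bestk) { bestk = k; bestr = r }
        }
        printf "%d\t%s\n", bestr, line[i]
        for (j = 1; j < bestk; j++) printf "\t%s\n", line[i + j]
        i += bestk * bestr
      }
    }'
}
```

Fed the sample input from the question through group_uniq 10, this reports 4 foo, 4 bar, then the three-line block baz, foo, bar 4 times, and a final single baz. That is a different split from the output requested in the question, but it describes the same sequence of lines; a greedy scan cannot know that leaving the first baz on its own would line up with the later foo, bar, baz blocks. Lines are compared whole, so multi-word lines work unchanged.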
