remove duplicated rows

I have a file with a bunch of rows. Here is what it looks like (just the head of the file):

 "chrom" "startA" "stopA" "genesA" "startB" "stopB" "genesB" "test"
 1 315121 317607 "gene2" 315521 317204 "gene3" 1684
 1 315521 317204 "gene3" 315121 317607 "gene2" 1684
 1 407644 408993 "gene4" 408421 409504 "gene5" 573
 1 407644 408993 "gene4" 408616 410013 "gene6" 378
 1 408421 409504 "gene5" 407644 408993 "gene4" 573
 1 408421 409504 "gene5" 408616 410013 "gene6" 889
 1 408616 410013 "gene6" 407644 408993 "gene4" 378
 1 408616 410013 "gene6" 408421 409504 "gene5" 889
 1 408616 410013 "gene6" 409682 411483 "gene7" 332
....

Some rows are duplicates: they contain the same pair of genes with the two start/stop/gene triplets in swapped order, and are otherwise exactly the same. I need to remove the repeated rows.
For example:

1 315121 317607 "gene2" 315521 317204 "gene3" 1684
1 315521 317204 "gene3" 315121 317607 "gene2" 1684

are the same: both describe the gene2/gene3 combination, just in a different order, and I want to remove one of them.

Here is my desired output:

"chrom" "startA" "stopA" "genesA" "startB" "stopB" "genesB" "test"
 1 315121 317607 "gene2" 315521 317204 "gene3" 1684
 1 407644 408993 "gene4" 408421 409504 "gene5" 573
 1 407644 408993 "gene4" 408616 410013 "gene6" 378
 1 408421 409504 "gene5" 408616 410013 "gene6" 889
 1 408616 410013 "gene6" 409682 411483 "gene7" 332

Any idea how I can do this? Thanks.

edited 2 days ago

asked 2 days ago

Anna1364


Are the duplicates always adjacent? How is the file sorted?

– choroba
2 days ago

This looks like genomic data. Should we assume that the amount of data is huge?

– Kusalananda♦
2 days ago

@Kusalananda, yes, but it is not super huge. I have approximately 300K rows in my file.

– Anna1364
2 days ago

@choroba, no they are not

– Anna1364
2 days ago

2

Would awk '!seen[$4"" < $7 ? $4 OFS $7 : $7 OFS $4]++' work, or do you need to look at columns other than the 4th and 7th?

– Stéphane Chazelas
2 days ago


text-processing sed

3 Answers

You might try:

awk '{key = $4 < $7 ? $4 SUBSEP $7 : $7 SUBSEP $4} !seen[key]++' file

That stores only the minimum needed to detect the duplicate records: one array entry per distinct pair of gene names.

!seen[key]++ is a "famous" awk idiom to print a record only for the first time "key" is seen.
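As a minimal sketch of how the idiom behaves on the sample from the question, the one-liner can be spelled out over several lines. The file names genes.txt and dedup.txt are assumptions made for illustration, and the explicit string concatenation ($4 "") is borrowed from the comment under the question to force a string comparison.

```shell
#!/bin/sh
# Sample input copied from the question (genes.txt is a hypothetical name).
cat > genes.txt <<'EOF'
"chrom" "startA" "stopA" "genesA" "startB" "stopB" "genesB" "test"
1 315121 317607 "gene2" 315521 317204 "gene3" 1684
1 315521 317204 "gene3" 315121 317607 "gene2" 1684
1 407644 408993 "gene4" 408421 409504 "gene5" 573
1 407644 408993 "gene4" 408616 410013 "gene6" 378
1 408421 409504 "gene5" 407644 408993 "gene4" 573
1 408421 409504 "gene5" 408616 410013 "gene6" 889
1 408616 410013 "gene6" 407644 408993 "gene4" 378
1 408616 410013 "gene6" 408421 409504 "gene5" 889
1 408616 410013 "gene6" 409682 411483 "gene7" 332
EOF

awk '
{
    # Build an order-independent key: the "smaller" gene name always goes first.
    key = (($4 "") < ($7 "")) ? $4 SUBSEP $7 : $7 SUBSEP $4
}
# seen[key]++ evaluates to 0 the first time a key appears, so the negation
# is true exactly once per key and the default action (print) runs once.
!seen[key]++
' genes.txt > dedup.txt

cat dedup.txt
```

On this sample the result is the header plus five rows, in input order, which matches the desired output in the question. Only columns 4 and 7 are compared, so this assumes a gene pair never appears with different coordinates or test values.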

answered 2 days ago

community wiki

glenn jackman

2

I swear I did this before I read Stéphane's comment...

– glenn jackman
2 days ago


You can order the two triplets (columns 2-3-4 and 5-6-7) by the value in the first column of each triplet, i.e. the start position:

perl -lane '@F[1,2,3,4,5,6] = @F[4,5,6,1,2,3] if $F[1] > $F[4]; print "@F"'

Then you can just run sort -u to remove the duplicates (but you need to special-case the column names).

answered 2 days ago

choroba


I have ~300K rows

– Anna1364
2 days ago

Is it too slow?

– choroba
2 days ago


Not optimal, but solves the problem:

#!/bin/bash

# Start from an empty result file so a rerun does not compare against old output.
: > result_genes
dupe_found=0

while read -r GENE_LINE
do
    # Current row, fields joined by single spaces.
    GL=$(echo $GENE_LINE | awk '{print $1" "$2" "$3" "$4" "$5" "$6" "$7" "$8}')
    while read -r RESULT_LINE
    do
        # Already-kept row with its two start/stop/gene triplets swapped.
        RL=$(echo $RESULT_LINE | awk '{print $1" "$5" "$6" "$7" "$2" "$3" "$4" "$8}')
        if [ "$GL" == "$RL" ]
        then
            dupe_found=1
            break
        fi
    done < result_genes

    if [ $dupe_found = 1 ]
    then
        dupe_found=0
    else
        echo $GENE_LINE >> result_genes
    fi

done < genes
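The same swapped-row comparison can also be sketched as a single awk pass, which avoids rereading result_genes and spawning two awk processes for every input row. This is not the author's script, only an illustration: the output name result_genes_awk is an assumption, and the sample is a shortened version of the question's data.

```shell
#!/bin/sh
# Shortened sample from the question, written to the input name used above.
cat > genes <<'EOF'
"chrom" "startA" "stopA" "genesA" "startB" "stopB" "genesB" "test"
1 315121 317607 "gene2" 315521 317204 "gene3" 1684
1 315521 317204 "gene3" 315121 317607 "gene2" 1684
1 407644 408993 "gene4" 408616 410013 "gene6" 378
1 408616 410013 "gene6" 407644 408993 "gene4" 378
EOF

awk '
{
    a = $2 " " $3 " " $4    # first start/stop/gene triplet
    b = $5 " " $6 " " $7    # second start/stop/gene triplet
    # Put the two triplets in a fixed (string) order so that a row and its
    # swapped counterpart produce the same key.
    pair = (a < b) ? a SUBSEP b : b SUBSEP a
    key = $1 SUBSEP pair SUBSEP $8
}
# Print a row only the first time its key is seen.
!seen[key]++
' genes > result_genes_awk

cat result_genes_awk
```

One difference to be aware of: this variant also drops rows that are repeated verbatim, whereas the loop above only detects a row whose triplets are swapped relative to an already-kept row.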

answered 2 days ago

haegor
