remove duplicated rows
I have a file with many rows. Here is what it looks like (just the head of the file):
"chrom" "startA" "stopA" "genesA" "startB" "stopB" "genesB" "test"
1 315121 317607 "gene2" 315521 317204 "gene3" 1684
1 315521 317204 "gene3" 315121 317607 "gene2" 1684
1 407644 408993 "gene4" 408421 409504 "gene5" 573
1 407644 408993 "gene4" 408616 410013 "gene6" 378
1 408421 409504 "gene5" 407644 408993 "gene4" 573
1 408421 409504 "gene5" 408616 410013 "gene6" 889
1 408616 410013 "gene6" 407644 408993 "gene4" 378
1 408616 410013 "gene6" 408421 409504 "gene5" 889
1 408616 410013 "gene6" 409682 411483 "gene7" 332
....
Some rows are duplicates of each other: they describe the same pair of genes, and only the order of the two genes (with their start and stop positions) differs. I need to remove the repeated rows.
For example:
1 315121 317607 "gene2" 315521 317204 "gene3" 1684
1 315521 317204 "gene3" 315121 317607 "gene2" 1684
are the same: both describe the combination of genes 2 and 3, just in a different order, and I want to remove one of them.
Here is my desired output:
"chrom" "startA" "stopA" "genesA" "startB" "stopB" "genesB" "test"
1 315121 317607 "gene2" 315521 317204 "gene3" 1684
1 407644 408993 "gene4" 408421 409504 "gene5" 573
1 407644 408993 "gene4" 408616 410013 "gene6" 378
1 408421 409504 "gene5" 408616 410013 "gene6" 889
1 408616 410013 "gene6" 409682 411483 "gene7" 332
Any idea how I can do this? Thanks.
text-processing sed
edited 2 days ago
asked 2 days ago
Anna1364
Are the duplicates always adjacent? How is the file sorted?
– choroba
2 days ago
This looks like genomic data. Should we assume that the amount of data is huge?
– Kusalananda♦
2 days ago
@Kusalananda, yes, but it is not super huge. I have approximately 300K rows in my file.
– Anna1364
2 days ago
@choroba, no they are not
– Anna1364
2 days ago
Would awk '!seen[$4"" < $7 ? $4 OFS $7 : $7 OFS $4]++' work, or do you need to look at columns other than the 4th and 7th?
– Stéphane Chazelas
2 days ago
3 Answers
You might try:
awk '{key = $4 < $7 ? $4 SUBSEP $7 : $7 SUBSEP $4} !seen[key]++' file
This stores only the minimum needed to detect the duplicate records: one array entry per unordered pair of gene names.
!seen[key]++ is a well-known awk idiom that prints a record only the first time key is seen.
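To make the idiom concrete, here is a minimal sketch that wraps the same awk program in a shell function and feeds it the sample rows from the question. The function name dedupe_pairs is made up for illustration and is not part of the answer. The sketch assumes the gene names in columns 4 and 7 are enough to identify a pair, and that they compare as strings, which holds here because the fields are quoted.

```shell
#!/bin/sh
# Hypothetical helper (the name is illustrative): keep only the first row
# seen for each unordered pair of gene names in columns 4 and 7.
dedupe_pairs() {
    # Build the key with the smaller name first, so that (gene2, gene3)
    # and (gene3, gene2) map to the same array index.
    awk '{ key = ($4 < $7) ? $4 SUBSEP $7 : $7 SUBSEP $4 } !seen[key]++' "$@"
}

# Sample rows from the question, passed on standard input.
dedupe_pairs <<'EOF'
"chrom" "startA" "stopA" "genesA" "startB" "stopB" "genesB" "test"
1 315121 317607 "gene2" 315521 317204 "gene3" 1684
1 315521 317204 "gene3" 315121 317607 "gene2" 1684
1 407644 408993 "gene4" 408421 409504 "gene5" 573
1 407644 408993 "gene4" 408616 410013 "gene6" 378
1 408421 409504 "gene5" 407644 408993 "gene4" 573
1 408421 409504 "gene5" 408616 410013 "gene6" 889
1 408616 410013 "gene6" 407644 408993 "gene4" 378
1 408616 410013 "gene6" 408421 409504 "gene5" 889
1 408616 410013 "gene6" 409682 411483 "gene7" 332
EOF
```

On this input the header and the five rows of the desired output are printed in their original order. The header needs no special handling because its key ("genesA" with "genesB") occurs only once.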
answered 2 days ago
community wiki
glenn jackman
I swear I did this before I read Stéphane's comment...
– glenn jackman
2 days ago
You can order the triplets of columns 2-3-4 and 5-6-7 by the value in the first column of each triplet (the start position):
perl -lane '@F[1,2,3,4,5,6] = @F[4,5,6,1,2,3] if $F[1] > $F[4]; print "@F"'
Then you can just run sort -u to remove the duplicates (but you need to special-case the header line with the column names).
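The header special-casing could be sketched as follows. This is one possible arrangement under stated assumptions, not the exact pipeline from the answer: it assumes the header is the first line of the file, it does the swap with awk instead of perl so that only POSIX tools are needed, and the function name normalize_and_sort is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (the name is illustrative): print the header unchanged,
# then the normalized, de-duplicated data rows.
normalize_and_sort() {
    file=$1
    # Pass the first line (assumed to be the header) through untouched.
    head -n 1 "$file"
    # For every other row, put the triplet with the smaller start position
    # first, then let sort -u drop the rows that became identical.
    tail -n +2 "$file" |
        awk '$2 + 0 > $5 + 0 {
                 t = $2; $2 = $5; $5 = t
                 t = $3; $3 = $6; $6 = t
                 t = $4; $4 = $7; $7 = t
             }
             # Rebuild every row with single spaces so equal rows compare equal.
             { $1 = $1; print }' |
        LC_ALL=C sort -u
}
```

It would be called as normalize_and_sort file. Unlike the awk approach, sort -u reorders the data rows, and rows whose two start positions are equal are not swapped, so such a pair would not be merged. The sample data contains no such rows.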
answered 2 days ago
choroba
I have ~300K rows
– Anna1364
2 days ago
Is it too slow?
– choroba
2 days ago
Not optimal, but it solves the problem:
#!/bin/bash
# Collect unique rows in result_genes. A row is skipped when its
# A/B-swapped form has already been written there.
touch result_genes
dupe_found=0
while read -r GENE_LINE
do
    GL=$(echo "$GENE_LINE" | awk '{print $1" "$2" "$3" "$4" "$5" "$6" "$7" "$8}')
    while read -r RESULT_LINE
    do
        # Same row with the two gene triplets swapped.
        RL=$(echo "$RESULT_LINE" | awk '{print $1" "$5" "$6" "$7" "$2" "$3" "$4" "$8}')
        if [ "$GL" == "$RL" ]
        then
            dupe_found=1
            break
        fi
    done < result_genes
    if [ "$dupe_found" = 1 ]
    then
        dupe_found=0
    else
        echo "$GENE_LINE" >> result_genes
    fi
done < genes
answered 2 days ago
haegor