Extracting Text between Two Strings in a Huge Ordered Text File
We have a huge text file containing millions of ordered, timestamped observations. Given a start point and an end point, we need a fast method to extract the observations in that period.
For instance, this could be part of the file:
"2018-04-05 12:53:00",28,13.6,7.961,1746,104.7878,102.2,9.78,29.1,0,2.432,76.12,955,38.25,249.9,362.4,281.1,0.04
"2018-04-05 12:54:00",29,13.59,7.915,1738,104.2898,102.2,10.01,29.53,0,1.45,200.3,952,40.63,249.3,361.4,281.1,0.043
"2018-04-05 12:55:00",30,13.59,7.907,1734,104.0326,102.2,10.33,28.79,0,2.457,164.1,948,41.39,249.8,361.3,281.1,0.044
"2018-04-05 12:56:00",31,13.59,7.937,1718,103.0523,102.2,10.72,31.42,0,1.545,8.22,941,42.06,249.4,361.1,281.1,0.045
"2018-04-05 12:57:00",32,13.59,7.975,1719,103.1556,102.2,10.68,29.26,0,2.541,0.018,940,41.95,249.1,360.1,281.1,0.045
"2018-04-05 12:58:00",33,13.59,8,1724,103.4344,102.2,10.35,29.58,0,1.908,329.8,942,42.65,249.5,361.4,281.1,0.045
"2018-04-05 12:59:00",34,13.59,8,1733,103.9831,102.2,10.23,30.17,0,2.59,333.1,948,42.21,250.2,362,281.2,0.045
"2018-04-05 13:00:00",35,13.59,7.98,1753,105.1546,102.2,10.17,29.06,0,3.306,332.4,960,42,250.4,362.7,281.1,0.044
"2018-04-05 13:01:00",36,13.59,7.964,1757,105.3951,102.2,10.24,30.75,0,2.452,0.012,962,42.03,250.4,362.4,281.1,0.044
"2018-04-05 13:02:00",37,13.59,7.953,1757,105.4047,102.2,10.31,31.66,0,3.907,2.997,961,41.1,250.6,362.4,281.1,0.043
"2018-04-05 13:03:00",38,13.59,7.923,1758,105.4588,102.2,10.28,29.64,0,4.336,50.19,962,40.85,250.3,362.6,281.1,0.042
"2018-04-05 13:04:00",39,13.59,7.893,1757,105.449,102.1,10.27,30.42,0,1.771,12.98,962,41.73,249.8,362.1,281.1,0.043
"2018-04-05 13:05:00",40,13.6,7.89,1757,105.4433,102.1,10.46,29.54,0,2.296,93.7,962,43.02,249.9,361.7,281,0.045
"2018-04-05 13:06:00",41,13.59,7.915,1756,105.3322,102.1,10.52,29.53,0,0.632,190.8,961,43.64,249.3,361.5,281,0.045
"2018-04-05 13:07:00",42,13.6,7.972,1758,105.4697,102.1,10.77,29.49,0,0.376,322.5,961,44.69,249.1,360.9,281.1,0.046
"2018-04-05 13:08:00",43,13.6,8.05,1754,105.233,102.1,11.26,28.66,0,0.493,216.8,959,44.8,248.4,360.1,281.2,0.047
If we want the datapoints between "2018-04-05 13:00:00" and "2018-04-05 13:05:00", the output should be:
"2018-04-05 13:00:00",35,13.59,7.98,1753,105.1546,102.2,10.17,29.06,0,3.306,332.4,960,42,250.4,362.7,281.1,0.044
"2018-04-05 13:01:00",36,13.59,7.964,1757,105.3951,102.2,10.24,30.75,0,2.452,0.012,962,42.03,250.4,362.4,281.1,0.044
"2018-04-05 13:02:00",37,13.59,7.953,1757,105.4047,102.2,10.31,31.66,0,3.907,2.997,961,41.1,250.6,362.4,281.1,0.043
"2018-04-05 13:03:00",38,13.59,7.923,1758,105.4588,102.2,10.28,29.64,0,4.336,50.19,962,40.85,250.3,362.6,281.1,0.042
"2018-04-05 13:04:00",39,13.59,7.893,1757,105.449,102.1,10.27,30.42,0,1.771,12.98,962,41.73,249.8,362.1,281.1,0.043
"2018-04-05 13:05:00",40,13.6,7.89,1757,105.4433,102.1,10.46,29.54,0,2.296,93.7,962,43.02,249.9,361.7,281,0.045
Regular tools like grep, sed, or awk are not optimized for searching sorted files, so they are not fast enough here. A tool that uses a binary search would be ideal for this type of problem.
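A minimal sketch of that binary-search idea in plain bash (illustrative only: it assumes the file is sorted by its leading quoted timestamp, no line is longer than 4096 bytes, GNU coreutils, and that both boundary timestamps actually occur in the file; the name data.csv is a placeholder):

file=data.csv
start='"2018-04-05 13:00:00"'
end='"2018-04-05 13:05:00"'

size=$(stat -c %s "$file")             # GNU stat; use `stat -f %z` on BSD/macOS
lo=0 hi=$size
while (( lo < hi )); do
    mid=$(( (lo + hi) / 2 ))
    # first complete line at or after byte offset $mid (record 1 may be partial)
    line=$(tail -c +"$((mid + 1))" "$file" | head -c 8192 | sed -n '2p')
    if [[ -z "$line" || "$line" < "$start" ]]; then
        lo=$(( mid + 1 ))              # the requested range starts further right
    else
        hi=$mid                        # the requested range starts at or before here
    fi
done

# back off a little so the boundary line cannot be cut in half, then let awk
# trim the edges exactly and quit as soon as the end line has been printed
skip=$(( lo > 8192 ? lo - 8192 : 0 ))
tail -c +"$((skip + 1))" "$file" |
  awk -v s="$start" -v e="$end" '
      index($0, s) == 1 { found = 1 }
      found
      index($0, e) == 1 { exit }'

Only a few dozen small reads are needed to locate the start, so the cost no longer grows with the size of the file.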
text-processing grep sort
asked Apr 5 at 0:10 by vahid-dan
2 Answers
For very large files, you could exploit the natural ordering of the timestamp prefix and use the look utility to perform a fast binary search for the largest common prefix of the start and end strings. This can then be followed by awk/sed post-processing to extract the lines of interest from look's output. In bash:
export start='"2018-04-05 13:00:00"'
export end='"2018-04-05 13:05:00"'
#determine common prefix ("2018-04-05 13:0 in this example)
common_prefix=$(awk 'BEGIN {
    start=ENVIRON["start"]; end=ENVIRON["end"];
    len=length(start) > length(end) ? length(end) : length(start);
    i=1;
    while (i <= len && substr(ENVIRON["start"], i, 1) == substr(ENVIRON["end"], i, 1))
        ++i
    print(substr(start, 1, i-1))
}' </dev/null
)
#the -b option to look forces binary search.
#My version of look on Ubuntu needs this flag to be passed,
#some other versions of look perform a binary search by default and do not support -b.
look -b "$common_prefix" file | awk '$0 ~ "^"ENVIRON["start"],$0 ~ "^"ENVIRON["end"]'
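A quick sanity check, not part of the answer: with the bounds exported above, the computed prefix is the one the comment mentions, so look hands awk only the small "13:0x" block instead of the whole file.

echo "$common_prefix"     # prints: "2018-04-05 13:0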
Thanks. It works but it is not fast enough for a huge file, let's say a 100GB+ file.
– vahid-dan
2 days ago
@vahid-dan, check the updated solution
– iruvar
2 days ago
Thanks, @iruvar! Worked like a charm. :-)
– vahid-dan
2 days ago
Print lines between "2018-04-05 13:00:00" and "2018-04-05 13:05:00":
sed -n '/2018-04-05 13:00:00/,/2018-04-05 13:05:00/p' file
or
sed -n /"2018-04-05 13:00:00"/,/"2018-04-05 13:05:00"/p file
Grep for the start date "2018-04-05 13:00:00" and output the next 5 lines (= 5 minutes); -m1 stops searching after the first match:
grep -m1 -A5 '2018-04-05 13:00:00' file
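A variation not in this answer: because the leading timestamps sort lexicographically, a plain awk comparison on the first field selects the same range even when the exact boundary timestamps are missing from the file (a case where the sed range above would print nothing or run on to the end), and it quits as soon as the range has been passed:

awk -F, -v s='"2018-04-05 13:00:00"' -v e='"2018-04-05 13:05:00"' '$1 > e { exit } $1 >= s' file

It still reads linearly from the top of the file, though, so it does not address the binary-search requirement.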
Thanks. It works but it is not fast enough for a huge file, let's say a 100GB+ file.
– vahid-dan
2 days ago
If you need to extract lots of lines it might make sense to split the file and note the first and last line of each part. Or feed a database with the timestamps as SQL TIMESTAMP or DATETIME and index (b-tree).
– Freddy
2 days ago
Thank you @freddy. Actually, the file is merged from several smaller files but I'm not sure working with several smaller files is faster than a single large file. We are looking for a general solution that can be applied to any size of data.
– vahid-dan
2 days ago
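A rough sketch of the database route suggested above, using sqlite3 (illustrative only; obs.db, the table name, and the generic column names are made up, and the 18 columns match the sample lines in the question):

sqlite3 obs.db <<'SQL'
CREATE TABLE obs(ts TEXT, c2, c3, c4, c5, c6, c7, c8, c9,
                 c10, c11, c12, c13, c14, c15, c16, c17, c18);
.mode csv
.import data.csv obs
CREATE INDEX obs_ts ON obs(ts);   -- the b-tree index suggested above
SELECT * FROM obs
WHERE ts BETWEEN '2018-04-05 13:00:00' AND '2018-04-05 13:05:00';
SQL

Once the data is loaded, every later range query uses the index instead of rescanning the raw file.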