Regex searches on text files

Paul R1
I've observed another behaviour regarding regex searches which I'd like to bring to your attention. I was finding that my searches were apparently just hanging. After some investigation I've found that it's not a hang but rather a search time that grows roughly with the square of the file size when carrying out a regex search on a text file. I found it on a very large CSV file, but the same is true for text files as they use the same handler (query.dll).

It seems to be caused by my use of a positive lookahead, so it may just be a fact of life, but I tried a search on a text file that I kept doubling in size, with the following results:

test 1 - 20 lines/7286 characters - positive lookahead search takes 4.815 seconds
test 2 - 40 lines/14572 characters - positive lookahead search takes 18.884 seconds
test 3 - 80 lines/29144 characters - positive lookahead search takes 75.012 seconds

So typically a doubling of the file size quadruples the search time. Note that a simple regex search on the 80-line file takes 0.14 seconds, compared to 0.13 seconds for a non-regex search, so the positive lookahead really kills it. My problem was that my regex search was hitting a file that was about 400 MB in size, hence the whole thing appeared to hang.
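To put a number on "doubling the size quadruples the time", here is a rough sketch in Python that estimates the growth exponent from the three timings above. It assumes the time follows a power law (time proportional to size**k), which is only a guess based on three data points; the function name is made up for illustration.

```python
import math

# (lines, seconds) taken from the three tests reported above
timings = [(20, 4.815), (40, 18.884), (80, 75.012)]


def growth_exponents(samples):
    """Return k for each consecutive pair, assuming time ~ size**k."""
    exponents = []
    for (n1, t1), (n2, t2) in zip(samples, samples[1:]):
        exponents.append(math.log(t2 / t1) / math.log(n2 / n1))
    return exponents


exponents = growth_exponents(timings)
print([round(k, 2) for k in exponents])
```

Both exponents come out close to 2, which points to quadratic growth rather than exponential growth.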

One other interesting thing I observed: when I ran such a search, the task manager on my laptop showed FileSeek consuming about 25% of my CPU cycles. I stopped the search, and the pause and stop buttons greyed out, seemingly indicating that the search had stopped, but FileSeek continued to consume 25% of the CPU. If I then restarted the search, FileSeek started to consume 50% of my CPU. Repeating the cycle added another 25% each time. So apparently stopping the search doesn't stop the thread doing the work, and if you do this a few times your computer grinds to a halt. However, closing FileSeek completely does resolve the problem.

FYI, I was using a number of positive lookaheads as I was trying to do the regex equivalent of an AND function.
Oct 24, 2014 (modified Oct 24, 2014)  • #1
RegEx lookaheads are really intensive, unfortunately. There's not much we can do there in terms of performance. Here's an interesting thread on RegEx performance: (not required reading, I just found it interesting :))

As for aborting the search, you're correct, at the moment we wait for it to finish the current file before terminating the thread. This is to avoid leaving a file lock open on the file that's being searched. We can certainly look into whether there's a way to work around that though.
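For illustration only, here is a minimal Python sketch of cooperative cancellation that still releases the file cleanly. It is not FileSeek's code; the function name and the line-by-line search are assumptions made for the example. The idea is to check a stop flag between small units of work and let the `with` block close the file on the way out.

```python
import threading


def count_matching_lines(path: str, needle: str, cancel: threading.Event):
    """Count lines containing needle; return None if cancelled part-way."""
    count = 0
    # The with-block closes the file (releasing any lock) on every exit path,
    # including the early return taken when cancellation is requested.
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if cancel.is_set():
                return None
            if needle in line:
                count += 1
    return count
```

A Stop button would call `cancel.set()` from the UI thread while a worker thread runs the function. Note the limitation: the flag is only checked between lines, so a single regex call that runs for a very long time on one large input could not be interrupted this way.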

Could you tell me the RegEx string that you're using? We'll test it out here with a huge file to see if we can improve the search stopping code :)

Oct 24, 2014 (modified Oct 24, 2014)  • #2
Paul R1
I.e. a regex version of oracle+audit, but using \b to delimit the words in order to prevent matches on other words like plaudits (actually audit*, hence no \b after audit).
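As a sketch of what such an AND search can look like, here is a Python version. The exact regex used in this thread is not shown, so the pattern below is a hypothetical reconstruction from the description (word boundaries around oracle, none after audit). It also shows the same test written as two independent searches, which avoids lookaheads altogether.

```python
import re

# Hypothetical pattern: the exact regex from the thread is not shown.
# Each lookahead asserts that one word occurs somewhere on the line.
LOOKAHEAD_AND = re.compile(r"^(?=.*\boracle\b)(?=.*\baudit)", re.IGNORECASE)

# The same test expressed as two independent searches.
ORACLE = re.compile(r"\boracle\b", re.IGNORECASE)
AUDIT = re.compile(r"\baudit", re.IGNORECASE)  # no trailing \b, so "auditing" matches


def matches_lookahead(line: str) -> bool:
    return LOOKAHEAD_AND.search(line) is not None


def matches_two_pass(line: str) -> bool:
    return ORACLE.search(line) is not None and AUDIT.search(line) is not None


sample_lines = [
    "oracle audit trail enabled",    # both words present
    "Auditing the Oracle instance",  # order does not matter
    "plaudits for the oracle team",  # "plaudits" must not count as "audit"
    "oracles and audit logs",        # "oracles" fails the trailing \b
]

for line in sample_lines:
    print(matches_lookahead(line), matches_two_pass(line), line)
```

Anchoring the lookaheads at the start of the line with `^` matters: without an anchor, an engine may retry the lookaheads from every position in the text, which is one plausible source of the quadratic slowdown.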

I thought it might be an unavoidable issue but now I know about it I can make some compromises and work around it.

It's probably worth making it clear that such searches can rapidly become a problem, as it looks to the user like the program has simply hung. Also, being able to cancel it would be good, because waiting for a really large text file to complete isn't going to be practical (extrapolating my results suggests my 400 MB file would have taken 600 years to complete and I'm in a bit more of a hurry than that ☺)
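The extrapolation can be reproduced roughly as follows. This is a back-of-the-envelope sketch in Python that assumes the quadratic trend from the 80-line test continues unchanged and that 400 MB means about 400 million single-byte characters; neither assumption is verified.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def extrapolate_seconds(known_chars, known_seconds, target_chars, exponent=2.0):
    """Scale a measured time to a larger input, assuming time ~ size**exponent."""
    return known_seconds * (target_chars / known_chars) ** exponent


target_chars = 400 * 1000 * 1000  # assumption: 400 MB of single-byte characters
years = extrapolate_seconds(29144, 75.012, target_chars) / SECONDS_PER_YEAR
print(f"roughly {years:.0f} years")
```

Under these assumptions the result is several hundred years, the same ballpark as the figure quoted above.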
Oct 24, 2014 (modified Oct 24, 2014)  • #3
Yeah, I don't think anyone has patience to wait that long :)

You should be able to do that search in FileSeek without RegEx. Could you try disabling the Query is RegEx option, and use the following instead? (including the quotes)


" oracle " +" audit "

The spaces before and after the words in the quotes should make sure that they only match them as their own words.
Oct 27, 2014  • #4
Paul R1
That only works if the words you are looking for are not at the start or end of a line of text.
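A small Python sketch of the problem, assuming the quoted query behaves like a plain substring test (an assumption, since FileSeek's matching rules are not described here). The helper names are made up for the example.

```python
import re


def padded_match(line: str, word: str) -> bool:
    """Substring test with a space on each side, like the quoted query."""
    return f" {word} " in line


def boundary_match(line: str, word: str) -> bool:
    """Regex test using \\b, which also matches at the start or end of a line."""
    return re.search(rf"\b{re.escape(word)}\b", line) is not None


lines = [
    "ask the oracle today",  # word in the middle of the line
    "oracle is down",        # word at the start of the line
    "ask the oracle",        # word at the end of the line
]

for line in lines:
    print(padded_match(line, "oracle"), boundary_match(line, "oracle"), line)
```

The space-padded test only finds the word in the first line, while the `\b` version finds it in all three.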

Anyway I suggest you close this one as I understand the limitation and can work round it.
Oct 27, 2014  • #5
Ok, no worries! One of our devs checked into this further, and here's his feedback:

"I saw his regex and it looks like he's causing this issue:

His regex is not specific enough. I wrote him a regex that could work for him, but I can only guess because I don't have a sample of his data:


Or, he could use our searching with this query

("oracle " " oracle " " oracle") +("audit " " audit " " audit")

I tested it on a document with 54038 characters and it did the regex search in 0.156s,
and the text query search in 0.145s"

Hope that helps!
Oct 28, 2014 (modified Oct 28, 2014)  • #6
