You are tasked with developing a utility that processes a potentially very large log file and identifies the most frequent keywords, excluding a provided list of stop words.
Problem Description
- The log file is a plain text file where each line represents a log entry. Each entry may contain a mix of alphanumeric characters and punctuation.
- Your utility should accept a file path to the log file and a separate list of stop words (these are words you will ignore when counting frequencies).
- Process the file in a memory-efficient manner (assume the file might be too large to load entirely into memory).
- Parse each log entry to extract words. Words are defined as sequences of alphanumeric characters (ignore case). Punctuation and other symbols should be treated as delimiters.
- Count the frequency of each word that is not in the stop words list.
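One way the extraction and counting steps could be sketched in Python (function and variable names are illustrative, not prescribed by the task): read the file line by line so only the running counts stay in memory, lowercase each line, and treat runs of alphanumeric characters as words.

```python
import re
from collections import Counter

# A "word" is a maximal run of alphanumeric characters; everything else delimits.
WORD_RE = re.compile(r"[a-z0-9]+")

def count_words(path, stop_words):
    """Stream the log file line by line; only the frequency table is kept in memory."""
    stops = {w.lower() for w in stop_words}
    counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            for word in WORD_RE.findall(line.lower()):
                if word not in stops:
                    counts[word] += 1
    return counts
```

Iterating over the file object directly (rather than calling `read()` or `readlines()`) is what keeps memory usage bounded by the longest line plus the table of distinct words.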
Requirements
- Input parameters:
  - Path to the log file.
  - List of stop words to ignore.
  - An optional parameter specifying how many of the most frequent words to output (if not provided, output all).
- Output:
  - A list of words along with their frequency counts, sorted in descending order of frequency; ties are broken alphabetically.
- Considerations:
  - The solution should be memory-efficient.
  - Handle errors that may occur during file reading (such as the file not existing or access being denied).
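The ranking and error-handling requirements might look like the following sketch (`counts` is a hypothetical frequency table; the exit behavior on failure is one reasonable choice, not mandated by the task):

```python
from collections import Counter

def top_words(counts, n=None):
    """Sort by descending frequency; ties break alphabetically."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked if n is None else ranked[:n]

def open_log(path):
    """Open the log file, turning common I/O failures into a clear message."""
    try:
        return open(path, encoding="utf-8")
    except (FileNotFoundError, PermissionError) as exc:
        raise SystemExit(f"Cannot read log file: {exc}")
```

The sort key `(-count, word)` handles both criteria in a single pass: negating the count gives descending frequency, and the word itself supplies the alphabetical tie-break.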
Example
Suppose your log file contains the following lines:
Error: Failed to connect to database.
Warning: Database connection slow.
Info: User login successful.
Error: Database timeout error.
And the provided stop words list is: ['to', 'the', 'and', 'is'].
Your program should output the words (after normalizing case) and their frequencies, excluding the stop words, sorted by frequency (and alphabetically for ties).
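Working the example through the rules above (lowercasing, alphanumeric-only words, stop words removed), 'database' and 'error' each occur three times, and the tie resolves alphabetically. A short sketch that reproduces this:

```python
import re
from collections import Counter

lines = [
    "Error: Failed to connect to database.",
    "Warning: Database connection slow.",
    "Info: User login successful.",
    "Error: Database timeout error.",
]
stops = {"to", "the", "and", "is"}

# Lowercase, split on non-alphanumerics, drop stop words, then count.
counts = Counter(
    w for line in lines
    for w in re.findall(r"[a-z0-9]+", line.lower())
    if w not in stops
)

# Descending frequency, alphabetical tie-break.
ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
# ranked begins [('database', 3), ('error', 3), ('connect', 1), ...];
# the remaining words each occur once and follow in alphabetical order.
```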
Deliverables
Write a complete program in the language of your choice that meets the above requirements. The program should be well-structured and commented where appropriate.