Oh, following my number 1 rule... Know what you don't know... We suspect this is a resource problem. So I would do the following:
- Using top, atop, or whatever you prefer, watch the 4 big resources: CPU, memory, disk IO, and network IO. These 4 are usually where the first real clue shows up. Resource consumption often has a cascade effect: when memory use is extreme, for example, disk IO climbs right behind it as swap gets heavily engaged. (There is a minimal monitoring sketch right after this list.)
- Watch dmesg... look for driver, device, or other component errors and warnings. Sometimes components of the OS, or even the hardware, do odd things, and that might be the case here, given that the issue consistently shows up a fairly consistent amount of time after the last restart, if I recall right. (A sketch for keeping a timestamped copy of dmesg output is below as well.)
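To make the resource watching concrete, here is a minimal sketch of the kind of snapshot I mean. It assumes a Linux box and the third-party psutil package; the 5 second interval and the output format are just illustrative, and top, atop, or vmstat give you the same numbers interactively. The point is to log them over time so you can line a spike up with the moment the problem starts.

    #!/usr/bin/env python3
    # Minimal resource watcher: CPU, memory, swap, disk IO, network IO.
    # Assumes Linux and the psutil package (pip install psutil).
    import time
    import psutil

    INTERVAL = 5  # seconds between samples; adjust to taste

    def main():
        psutil.cpu_percent(interval=None)          # prime the CPU counter
        prev_disk = psutil.disk_io_counters()
        prev_net = psutil.net_io_counters()
        while True:
            time.sleep(INTERVAL)
            cpu = psutil.cpu_percent(interval=None)  # % since last call
            mem = psutil.virtual_memory()
            swap = psutil.swap_memory()
            disk = psutil.disk_io_counters()
            net = psutil.net_io_counters()
            print(
                f"{time.strftime('%H:%M:%S')} "
                f"cpu={cpu:5.1f}% mem={mem.percent:5.1f}% swap={swap.percent:5.1f}% "
                f"disk_rd={(disk.read_bytes - prev_disk.read_bytes) / 1e6:7.1f}MB "
                f"disk_wr={(disk.write_bytes - prev_disk.write_bytes) / 1e6:7.1f}MB "
                f"net_rx={(net.bytes_recv - prev_net.bytes_recv) / 1e6:7.1f}MB "
                f"net_tx={(net.bytes_sent - prev_net.bytes_sent) / 1e6:7.1f}MB"
            )
            prev_disk, prev_net = disk, net

    if __name__ == "__main__":
        main()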
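And for the dmesg side, the same idea: keep a timestamped copy of new kernel messages on disk, so that when the problem hits after its usual interval you can see what the kernel logged right before. This sketch assumes util-linux dmesg with the --follow flag (which may need root on boxes with dmesg_restrict set); "journalctl -kf" is an alternative on systemd systems. The log file name is just a placeholder.

    #!/usr/bin/env python3
    # Follow dmesg and keep a timestamped copy on disk, so kernel
    # warnings can be matched against when the slowdown started.
    # Assumes util-linux dmesg with the --follow and --ctime flags.
    import subprocess
    import time

    LOGFILE = "dmesg_watch.log"   # placeholder output path, change as needed

    def main():
        proc = subprocess.Popen(
            ["dmesg", "--follow", "--ctime"],
            stdout=subprocess.PIPE,
            text=True,
        )
        with open(LOGFILE, "a") as log:
            for line in proc.stdout:
                stamped = f"{time.strftime('%Y-%m-%d %H:%M:%S')} {line}"
                log.write(stamped)
                log.flush()           # don't lose lines if the box wedges
                print(stamped, end="")

    if __name__ == "__main__":
        main()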
This situation reminds me of how we used to track down why a given virtual machine was not behaving the way we expected after we migrated applications from physical servers to virtual ones. Once we found the cause, it was sometimes pretty odd.
There was one time when I discovered a specific function in an application that worked fine on hardware, but the same lines of code in a virtual machine were slow beyond belief. The developer denied it was the code for months; I found it using the methods above in less than a day. The developer finally accepted the test results, reworked the function, and 2 days later the application was just about as fast on a virtual machine as on full hardware.
Months were wasted because the analysis team believed it was an OS or virtualized hardware/driver issue, and because the developer refused to do any timing validation. My peers got grief from me over it... because they did not follow the first rule... know what you don't know. They assumed the developer's declaration, that the code could not possibly run differently on a virtual machine than on hardware, was true.
You just don't know where the evidence is going to lead you.