Discussion:
tomee+activemq: JMS reading sometimes hangs with full queue
Emmanuel Touzery
2018-10-22 14:04:52 UTC
Hello,

    we have a tomee+ 7.0.3 installation with activemq, using kahadb as
persistent message storage. We have an activemq.xml, which we plugged in
through:

BrokerXmlConfig = xbean:file:/opt/app/tomee/conf/activemq.xml

    in the tomee.xml. The activemq broker runs within TOMEE:

ServerUrl       =  tcp://127.0.0.1:61616
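
    For reference, those two settings typically sit in a resource
adapter block of tomee.xml, roughly like this (a sketch following the
TomEE docs; the ids are illustrative, and our actual tomee.xml is linked
further down the thread):

<Resource id="MyJmsResourceAdapter" type="ActiveMQResourceAdapter">
    BrokerXmlConfig = xbean:file:/opt/app/tomee/conf/activemq.xml
    ServerUrl       = tcp://127.0.0.1:61616
</Resource>

<Resource id="MyJmsConnectionFactory" type="javax.jms.ConnectionFactory">
    ResourceAdapter = MyJmsResourceAdapter
</Resource>

<Container id="MyJmsMdbContainer" ctype="MESSAGE">
    ResourceAdapter = MyJmsResourceAdapter
</Container>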

    We have a prefetch of 2000:

<transportConnector name="nio"
uri="nio://0.0.0.0:61616?jms.prefetchPolicy.all=2000"/>

    We use mKahaDB. We disabled producer flow control.
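
    The relevant pieces of such an activemq.xml look roughly like this
(a sketch; the directory and journal settings are illustrative, and the
real file is linked further down the thread):

<broker xmlns="http://activemq.apache.org/schema/core">
  <persistenceAdapter>
    <mKahaDB directory="${activemq.data}/kahadb">
      <filteredPersistenceAdapters>
        <!-- one KahaDB journal per destination -->
        <filteredKahaDB perDestination="true">
          <persistenceAdapter>
            <kahaDB journalMaxFileLength="32mb"/>
          </persistenceAdapter>
        </filteredKahaDB>
      </filteredPersistenceAdapters>
    </mKahaDB>
  </persistenceAdapter>
  <destinationPolicy>
    <policyMap>
      <policyEntries>
        <!-- producer flow control disabled on all queues -->
        <policyEntry queue=">" producerFlowControl="false"/>
      </policyEntries>
    </policyMap>
  </destinationPolicy>
</broker>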

    To get everything working, we had to add a couple of JARs to the
TOMEE lib folder:

activemq-spring-5.14.3.jar
spring-beans-3.2.9.RELEASE.jar
spring-context-3.2.9.RELEASE.jar
spring-core-3.2.9.RELEASE.jar
spring-expression-3.2.9.RELEASE.jar
spring-web-3.2.9.RELEASE.jar
xbean-spring-3.9.jar

    We are "reading" from JMS through message-driven beans,
implementing MessageListener and with @MessageDriven annotations.
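
    A minimal sketch of such a bean (the class and destination names are
made up; ours differ):

import javax.ejb.ActivationConfigProperty;
import javax.ejb.MessageDriven;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.TextMessage;

@MessageDriven(activationConfig = {
    @ActivationConfigProperty(propertyName = "destinationType",
                              propertyValue = "javax.jms.Queue"),
    @ActivationConfigProperty(propertyName = "destination",
                              propertyValue = "incomingQueue")
})
public class IncomingDataConsumer implements MessageListener {
    @Override
    public void onMessage(Message message) {
        try {
            String payload = ((TextMessage) message).getText();
            // parse the JSON payload and persist it through Hibernate here
        } catch (Exception e) {
            // rethrow so the container rolls back and the message is redelivered
            throw new RuntimeException(e);
        }
    }
}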

    The application is pretty simple: receive data over HTTP/JSON and
store it in SQL (through Hibernate).

    Everything works fine as long as traffic is normal. However, when
there is a surge of incoming traffic, sometimes the JMS consumers stop
getting called and the queue only grows. The issue does not get fixed
until TOMEE is restarted, and we've then seen it re-appear maybe 40
minutes later. After a while, the server clears the queue and everything
is fine again.

    We took a jstack thread dump of the application when it's in that
"hung" state:
https://www.dropbox.com/s/p8wy7uz6inzsmlj/jstack.txt?dl=0

    What's interesting is that writes fall quite fast, and in steps: in
general not all at once, but not gradually either:
https://www.dropbox.com/s/nhm5s2zc7r9mk9z/graph_writes.png?dl=0

    After a restart things are fine again immediately.

    We're not sure what the cause is. From what we can tell from the
thread dump, the consumers are idle; they just don't get notified that
work is available. The server is certainly aware there are items in the
queue: we monitor the queue through JMX, and the queue size keeps growing
during these episodes. We don't see anything out of the ordinary in the
logs. We looked at the thread IDs of the consumers just before the issue;
it doesn't look like the consumers deadlock one after the other, for
instance. It seems like a bunch of them are still called in the last
minute before the dropoff. Also, during a blackout the JDBC pool usage is
at 0 according to our JMX monitoring, so it doesn't seem to be a
deadlocked JDBC connection.
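
    For the record, this is roughly how we poll the queue over JMX (a
sketch; the brokerName and destinationName are assumptions, while the
attributes are the standard ActiveMQ 5.x queue MBean ones -- InFlightCount
is what the broker believes consumers are currently working on):

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class QueueCheck {
    public static void main(String[] args) throws Exception {
        // the broker runs in-process, so the platform MBean server sees it
        MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
        ObjectName queue = new ObjectName(
            "org.apache.activemq:type=Broker,brokerName=localhost,"
            + "destinationType=Queue,destinationName=incomingQueue");
        System.out.println("QueueSize     = " + mbs.getAttribute(queue, "QueueSize"));
        System.out.println("InFlightCount = " + mbs.getAttribute(queue, "InFlightCount"));
        System.out.println("ConsumerCount = " + mbs.getAttribute(queue, "ConsumerCount"));
    }
}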

    We did notice the following activemq warnings in the log file, but
the timestamps don't match any particular events, and from what we found
out they don't seem particularly worrying or likely to be related to the
issue:

WARNING [ActiveMQ Journal Checkpoint Worker]
org.apache.activemq.store.kahadb.MessageDatabase.getNextLocationForAckForward
Failed to load next journal location: null

WARNING [ActiveMQ NIO Worker 6]
org.apache.activemq.broker.TransportConnection.serviceTransportException
Transport Connection to: tcp://127.0.0.1:37024 failed: java.io.EOFException

    Do you have any suggestions for fixing this issue (which we sadly
can't reproduce at will... and it only happens pretty rarely)? Or should
we rather ask on the activemq mailing list?

    Regards,

emmanuel
exabrial12
2018-10-22 14:33:10 UTC
Hello Emmanuel, can we see the full log of the TomEE startup, the full
tomee.xml, the full activemq.xml, and any other differences between your
runtime and the downloaded runtime? Thanks!



Emmanuel Touzery
2018-10-22 15:04:35 UTC
Hello,

    Sure! Regarding the runtime, we also have the PostgreSQL driver in
lib; I think that's all. Hibernate is in our WAR, for instance.

    here's the activemq.xml =>
https://www.dropbox.com/s/ukttiupouyiv779/activemq.xml?dl=0

    tomee.xml =>
https://www.dropbox.com/s/452s6d6vknp59a4/tomee.xml?dl=0

    startup log =>
https://www.dropbox.com/s/52uqfatlisodayg/catalina.2018-10-21.log?dl=0

    Regards,

Emmanuel
Emmanuel Touzery
2018-10-24 05:59:30 UTC
Hello,

    no one has any suggestions?

    Regards,

emmanuel
Romain Manni-Bucau
2018-10-24 06:39:22 UTC
Hello Emmanuel

It can be a lot of things: a network breakdown behind a proxy (so AMQ
does not see it in some cases and a restart recreates the connection),
some (exponential) backpressure, some disk issue, etc.

It can be interesting to check your config for healthchecks and batch
sizes, and to dump the threads in the server and client when hanging.
Also, testing with another backend than kahadb can be interesting,
depending on your workload.

Emmanuel Touzery
2018-10-24 07:23:35 UTC
Hello,

    thank you for the answer!

    In this case TOMEE and AMQ are in the same process, on the same
machine, communicating through 127.0.0.1, so the network between AMQ and
TOMEE shouldn't be an issue.
    In our case, writing to JMS keeps working, but consumers don't get
notified. I'm not sure whether there are two separate communication
channels for that?

    I'm not sure what you mean by backpressure, but we did disable flow
control (which should only affect writes though, not notifying
consumers) -- were you referring to something like that?
    We also don't know about a disk issue -- the persistent queue keeps
filling up on disk, and I see no exceptions in the logs.

    When you talk about batch size, do you mean acknowledgement
optimization? ("ActiveMQ can acknowledge receipt of messages back to the
broker in batches (to improve performance). The batch size is 65% of the
prefetch limit for the Consumer"). This sounds like it could be
related... If acknowledgement breaks down, AMQ would wait on the
consumers to complete, while the consumers did in fact complete and are
waiting for new messages. I already had the idea to check the JMX
"InFlightMessages" info during such an incident, to confirm whether AMQ
thinks the consumers are busy. But even if it turns out it does, that
doesn't really help me short-term.

    In this case client=server (we get messages over HTTP, write them to
a queue on the ActiveMQ which runs in the same process as TOMEE, and
consume them from the same TOMEE instance), so the thread dump I took
covers both client & server.

    Another backend than kahadb could be interesting, but there's a lot
of traffic, and validating configuration changes on the server is
expensive. I'm not sure that's really workable, especially since the
chances of this fixing the issue are not that high.

    Regards,

Emmanuel
Romain Manni-Bucau
2018-10-24 08:11:27 UTC
On Wed, 24 Oct 2018 at 09:23, Emmanuel Touzery <
Post by Emmanuel Touzery
Hello,
thank you for the answer!
In this case TOMEE and AMQ are in the same process, on the same
machine, communicating through 127.0.0.1 so network between AMQ and
TOMEE shouldn't be an issue.
You are a client of yourself? I was thinking more of client <-> server.

Typically, if you have

sender ------ some proxy ------- message-driven bean

then you can get these "fake states" at the network layer.
Post by Emmanuel Touzery
In our case, writing to JMS keeps working, but consumers don't get
notified. I'm not sure if there are two separate communication channels
for that?
Normally it is the same, but depending on the queue state it can block.

Did you check your DLQ too?
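
With default settings poison messages end up on the ActiveMQ.DLQ queue;
its depth can be read through the same JMX naming pattern as the normal
queues (broker name assumed here):

org.apache.activemq:type=Broker,brokerName=localhost,destinationType=Queue,destinationName=ActiveMQ.DLQ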
Post by Emmanuel Touzery
I'm not sure what you mean by backpressure, but we did disable flow
control (which should only affect writes though, not notifying
consumers) -- were you referring to something like that?
Yep
Post by Emmanuel Touzery
Also don't know about a disk issue -- the persistent queue keeps
filling up on disk, and I see no exceptions in the logs.
Maybe next time take a quick peek at the disk state, to see whether the
partition is full.
It can also be good to activate AMQ debug logs in such cases (if you can).
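
Since TomEE logs through java.util.logging (JULI), something like this in
conf/logging.properties should raise the broker verbosity (a sketch; FINE
roughly corresponds to debug):

org.apache.activemq.level = FINE
org.apache.activemq.store.kahadb.level = FINE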
Post by Emmanuel Touzery
When you talk about batch size, do you mean acknowledge
optimization? ("ActiveMQ can acknowledge receipt of messages back to the
broker in batches (to improve performance). The batch size is 65% of the
prefetch limit for the Consumer"). This sounds like it could be
related.. If acknowldge breaks down, AMQ would wait on consumers to
complete, while the consumers did complete and are waiting for new
messages. I already had the idea to check on the JMX "InFlightMessages"
info during such an incident to confirm whether AMQ thinks that
consumers are busy. But even if it turns out it does, that doesn't
really help me, short-term.
Yep, here you can test with a batch size of 1 (it will be "slow", but
each message is considered alone).
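
For instance (a sketch, assuming the jms.* client options are set on the
resource adapter's ServerUrl in tomee.xml; the & must be escaped there
since tomee.xml is XML):

ServerUrl = tcp://127.0.0.1:61616?jms.prefetchPolicy.all=1&amp;jms.optimizeAcknowledge=false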
Post by Emmanuel Touzery
In this case client=server (we get messages over HTTP, write them
in a queue on the activeMQ which runs in the same process as TOMEE, and
consume them from the same TOMEE instance) so the thread dump I did
covers both client & server.
That answers the first question, so if you don't have any dynamic network
(as you can have with Docker or so), it is likely not the issue.
Post by Emmanuel Touzery
Another backend than kahadb could be interesting, but there's a lot
of traffic and validation of configuration changes for the server is
expensive. I'm not sure that's really workable, especially since the
chances of this fixing the issue are not that high.
Well, first you need an environment that reproduces the issue; then you
can iterate to identify it:

(no particular order)

0. DLQ state (ensure it is monitored)
1. network (if relevant)
2. backend
3. (potentially) transport
4. AMQ version

...