How will relay behave if the storage server(s) become unavailable?

Asked by Jason Dixon

We're looking at scaling up our Graphite installation to use multiple storage servers with a cache-relay in front. However, each end is on a different provider's network. How will relay behave if the connection between the sites falls over? Will it slowly run out of memory as metrics are cached, or will it periodically store data to disk and resume the stream when the connection is re-established?

Thanks,
Jason

Question information

Language: English
Status: Answered
For: Graphite
Assignee: No assignee
Revision history for this message
Nicholas Leskiw (nleskiw) said :
#1

AFAIK, it buffers in memory only.

-Nick


chrismd (chrismd) said :
#2

The relay only queues datapoints in memory, up to MAX_QUEUE_SIZE datapoints per destination. Once a send queue (to any destination) fills up, the relay will stop accepting more datapoints from all clients (because there is no coupling between clients and destinations, it's impossible to selectively pause them). This is the default behavior when USE_FLOW_CONTROL is True; if it's False, the relay will simply drop datapoints when there is no room for them in the send queue.
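For reference, these are the two carbon.conf settings involved; the values shown here are illustrative, not recommendations:

```ini
# carbon.conf, [relay] section -- illustrative values only
MAX_QUEUE_SIZE = 10000      # per-destination send-queue cap, in datapoints
USE_FLOW_CONTROL = True     # True: pause all clients when any queue fills
                            # False: silently drop datapoints instead
```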

There are a lot of situations where this behavior sucks. A better alternative in many situations is to use carbon-client.py. You give it a list of destinations to send to (typically carbon-caches) and then write metrics in the usual plaintext format to its stdin. It can use relay-rules.conf or consistent hashing just like the relay (they share lots of code). One difference is that carbon-client.py always behaves as if USE_FLOW_CONTROL=True: it will block reading stdin and wait until data can be sent. The advantage, though, is that you are now blocking per client and you can control the client behavior (choose to time out and kill the carbon-client, do something else with the data like writing it to a file that carbon-client can later read, etc.), but you have to implement that behavior in the calling program.
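The "implement that behavior in the calling program" part might look something like this minimal Python sketch. The send/spool split is the point; `send_with_fallback` and the flaky sink are hypothetical stand-ins (e.g. for a write to carbon-client.py's stdin with a timeout), not part of carbon:

```python
import time

def send_with_fallback(lines, send, spool):
    """Hand each plaintext datapoint ("path value timestamp") to send();
    on failure (e.g. a blocked carbon-client.py pipe), append it to a
    local spool for later replay."""
    sent, spooled = 0, 0
    for line in lines:
        try:
            send(line)          # e.g. proc.stdin.write(...) with a timeout
            sent += 1
        except OSError:
            spool.append(line)  # replay later when the link is back up
            spooled += 1
    return sent, spooled

# Simulated sink that fails on every other datapoint
calls = {"n": 0}
def flaky_send(line):
    calls["n"] += 1
    if calls["n"] % 2 == 0:
        raise OSError("pipe full")

spool = []
now = int(time.time())
lines = [f"servers.web1.load 0.4{i} {now}" for i in range(4)]
sent, spooled = send_with_fallback(lines, flaky_send, spool)
print(sent, spooled)  # → 2 2
```

The caller decides what "spool" means: a file on disk, a bounded deque, or dropping on the floor after logging.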

Abe Stanway (abestanway) said :
#3

From the code, it looks like there is a send queue per destination. Is this accurate? It would seem that once a particular send queue is full, the relay would stop sending data to *that* destination but continue sending to all other destinations (assuming flow control is off). Or is there a single send queue for all destinations?

bhardy (bhardy) said :
#4

I know this is a dated post, but below are some facts I have come across regarding a carbon-relay setup in its default configuration. I hope this saves time and helps you plan your setup.

--The MAX_QUEUE_SIZE setting in carbon.conf for relay is per destination.
--If one of the destinations becomes unresponsive (network down, full cache, etc.) AND the queue fills up for that single destination, carbon-relay will stop sending to ALL destinations.
--The greater the volume going through a relay with complex regexes in relay-rules.conf, the higher CPU usage climbs, until it reaches 100% and causes issues.
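As one illustration of that last point, anchored patterns in relay-rules.conf fail fast on non-matching metric names, while unanchored ones scan the whole name. The section names, patterns, and addresses below are made up:

```ini
# relay-rules.conf -- hypothetical example
[web]
pattern = ^servers\.web\d+\.
destinations = 10.0.0.1:2004

[default]
default = true
destinations = 10.0.0.2:2004
```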

Abe Stanway (abestanway) said :
#5

@bhardy: After investigating, it only stops sending to all destinations if USE_FLOW_CONTROL is on, as per chrismd. Otherwise, it sends to all other destinations normally, and simply drops incoming datapoints for the disconnected destination only.

We experienced issues in production using large MAX_QUEUE_SIZE values and dealing with disconnecting/reconnecting listeners. Specifically, when a listener goes down, the queue fills up as it should, and relay stays alive. However, when the listener reconnects, relay dies as it tries to flush the queue all at once, because it blocks the IOLoop. Our solution was to use a very small MAX_QUEUE_SIZE (5000).
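The fix presumably has to drain the queue in bounded chunks rather than all at once, so the IOLoop can do other work between chunks. A rough Python sketch of that idea (not the actual carbon code or the pending patch):

```python
from collections import deque

def drain_in_chunks(queue, send, chunk_size=100):
    """Flush at most chunk_size datapoints per call, so an event loop
    can interleave other work between calls instead of blocking on a
    huge backlog right after a reconnect."""
    sent = 0
    while queue and sent < chunk_size:
        send(queue.popleft())
        sent += 1
    return sent

# Example: a 250-point backlog drains over three loop iterations
backlog = deque(range(250))
out = []
drained = []
while backlog:
    drained.append(drain_in_chunks(backlog, out.append))
print(drained)  # → [100, 100, 50]
```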

There is a patch on the way from @mleinart to fix this behavior.
