
Feb 17

[docker][cassandra] Reaching mixed load – 750,000 ops/sec

The cart goes nowhere because the swan wants to fly in the air, the pike wants to swim underwater and the crawfish wants to crawl backward.

Cassandra performance tuning – the challenge

Cassandra is one of the workhorses of the modern high-load landscape – a NoSQL database that can be plugged into Spark for distributed computing, into Titan for playing with graph data representations, and even into Elastic as a search backend.
And if you really care about raw write performance, it is the de-facto choice in the world of open-source solutions: production proven, with a big community that has already generated plenty of outdated, controversial answers on SO & the mailing lists.

 

No single point of failure, scalability, high availability, retention periods, … – but those marketing claims hide a few principal caveats. Actually, Cassandra has only a single drawback(*) – it will not reach its limits with default settings on your hardware. A lonely, single-node configuration is not a use case for Cassandra; it shines in a multi-node clustered setup.

If you really want to see full utilization of endless cores and crazy amounts of RAM, you have to use some virtualisation technology to carve up the hardware resources.

Let me start with some conclusions and recommendations, based on two months of extensive testing and on observing the trouble tickets after this approach was migrated to production. With those considerations in mind I managed to configure the cluster to tolerate 750k mixed operations per second. The load was generated for more than 8 hours to check pressure tolerance and emulate peak loads. It was a mixed execution of async inserts without future processing, synced inserts, and read requests.

Frankly speaking, I am sure it is still far from its limit.

Bear in mind that I am talking about Cassandra 2.1.*.

About disks:

  1. Use SSD disks as mapped volumes to the docker container. Single container = single dedicated disk.
    It is possible to use multiple disks per container, but it will cost you 5-15 % in slowdown.
  2. If you use an SSD you can map all cassandra directories to it (saved_caches, data, commitlog) and adjust cassandra.yaml with higher throughput values, in particular compaction_throughput_mb_per_sec and trickle_fsync (see the sketch after this list).
  3. It really depends on data distribution and your data model, but be ready for disk utilization to vary between nodes by up to 20%.
  4. Docker should be configured to NOT use the host’s root partition. Don’t be mean: allocate a separate drive for logs and choose a proper storage driver – devicemapper on top of LVM (direct-lvm), not loopback.
  5. In practice, the cluster starts struggling when any of the nodes runs out of space. Surprisingly, in my experiments it stayed stable even with 3% free, but in real life you’d better configure your monitoring system to alert at 15-20%.
  6. Choose compaction and compression strategies wisely when you design your db.
  7. Be careful with column naming – the name is stored with every god damn row!
  8. Do sizing when you think about the number of nodes (and disks).
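
As an illustration (image name, paths and numbers below are placeholders of mine, not values from this setup): one dedicated SSD mounted into the container, and the disk-related knobs in cassandra.yaml raised above their conservative defaults.

# one dedicated SSD per container, mapped as a volume
docker run -d --name cassandra-node1 \
  -v /mnt/ssd1/cassandra:/var/lib/cassandra \
  your-cassandra-2.1-image

# cassandra.yaml excerpt – all directories live on the same SSD, throughput raised:
#   data_file_directories:
#       - /var/lib/cassandra/data
#   commitlog_directory: /var/lib/cassandra/commitlog
#   saved_caches_directory: /var/lib/cassandra/saved_caches
#   compaction_throughput_mb_per_sec: 64     # default is 16
#   trickle_fsync: true
#   trickle_fsync_interval_in_kb: 10240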

About CPUs:

  1. More CPUs per node is NOT always better. Stick with 8 cores per node.
    I’ve experimented with a single fat supernode per physical server (1×48 cores), as well as 4×12 and 6×8.
    6 nodes with 8 cpu cores each outperformed all the others in 6 kinds of stress-load scenarios.
  2. If you play with the core count you have to adjust a few settings in cassandra.yaml to reflect that number: concurrent_compactors, concurrent_reads, concurrent_writes (see the sketch after this list).
  3. Cassandra in most cases ends up cpu-bound; don’t forget to leave 8-16 cores for the host system, and allocate cpus exclusively for containers using --cpuset-cpus.
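
A possible starting point for those settings on an 8-core container backed by SSD – the exact numbers depend on your disks and load, so treat them as assumptions to be measured, not as recommendations:

# cassandra.yaml excerpt (illustrative values for 8 cores on SSD):
#   concurrent_reads: 32           # often sized around 4 x cores for SSD-backed nodes
#   concurrent_writes: 64          # often sized around 8 x cores
#   concurrent_compactors: 4       # bounded by cores and disk throughput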

About RAM:

  1. cassandra-env.sh has a built-in calculation of free memory to adjust the jvm settings, based on parsing the output of the free command. Of course that logic knows nothing about a docker-based setup. Bear this in mind and tweak your startup scripts to substitute the values there (see the sketch after this list).
  2. Disable swap within docker using --memory-swappiness=0.
  3. Effectiveness of memory usage depends on the cpu count, on how effectively multithreaded compaction is implemented in Cassandra, and on the reader/writer/compactor settings in your cassandra.yaml – i.e. you can have hundreds of GB of RAM and still end up with OOM. But even with 8 GB of RAM per node you can already see benefits. More RAM means more memtables, a bigger key cache and more effective OS-based file caching. I would recommend 24 GB of RAM per node.
  4. Disable transparent huge pages at the host system, or at least tune your jvm settings:
 echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag 
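
A minimal sketch of working around the heap auto-sizing, assuming your startup script or image passes environment variables through to cassandra-env.sh (heap numbers are illustrative for a 24 GB node, not a recommendation):

# cassandra-env.sh skips its own free-based calculation when both values are set;
# pass them in explicitly and cap the container itself:
docker run -d \
  -e MAX_HEAP_SIZE=8G -e HEAP_NEWSIZE=2G \
  --memory=24g --memory-swappiness=0 \
  your-cassandra-2.1-image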

About network:

  1. It is mandatory to use the host OS network stack, via the docker flag --net=host (see the combined docker run sketch after this list).
  2. Most likely the network will not be the bottleneck for your load, so you can stick with virtual interfaces on top of a single real one.
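
Putting the docker-side flags from the sections above together into a single run command (image name, paths, core range and sizes are placeholders):

docker run -d --name cassandra-node1 \
  --net=host \
  --cpuset-cpus="8-15" \
  --memory=24g --memory-swappiness=0 \
  -v /mnt/ssd1/cassandra:/var/lib/cassandra \
  your-cassandra-2.1-image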

Testing:

  • 3 physical servers: each with 72 cores and 400 GB RAM
  • Cassandra 2.1.15
  • Docker: 1.10
  • Host OS: CentOS 7.5
  • Guest OS: CentOS 7.5
  • Java 8 from Oracle, with JNA

Cassandra 3.* is a completely different story – mainly, in my opinion, because of the change of the storage engine, but the full list of differences is huge.

DB overview:

  • A dozen keyspaces, each with up to 20(?) tables.
  • Few indexes – better yet, just do not use secondary indexes at all, design the schema properly instead.
  • Replication factor = 3, gossiping snitch configured through a property file.
  • Each physical server represents a dedicated rack within a single datacenter (see the config sketch after this list).
  • Row cache was disabled in cassandra.yaml, i.e. the first priority was to focus on a write-oriented workload.
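
For reference, a sketch of what that layout translates to in configuration terms – names and addresses are placeholders, and the snitch choice is my assumption of how "gossip through a property file" was wired:

# cassandra.yaml excerpt:
#   endpoint_snitch: GossipingPropertyFileSnitch
#   row_cache_size_in_mb: 0

# cassandra-rackdc.properties on a node running on physical server #1:
#   dc=DC1
#   rack=RACK1

# keyspace with replication factor 3 within that datacenter:
cqlsh 10.0.0.1 -e "CREATE KEYSPACE your_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};"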

Tools:

  1. datastax stress tool with an artificial table – very interesting, but useless: testing against your own schema is what matters.
  2. Datastax stress tool + your own table definition – nice, gives hints of production performance (a rough command shape is sketched after this list). But you are still testing a single table – usually that is not the case in real life.
  3. A self-written in-house stress tool that generates data according to our data model in a randomized fashion + a set of dedicated servers for the ddos, with the ability to switch between async inserts (just do not use batches) with and without acknowledgment.
    Once again: no batch inserts, as they should not be used in production.
  4. Probably, you could adapt the Yahoo! Cloud Serving Benchmark. I haven’t played with it.
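
The rough shape of option 2 with the stock datastax tool, as I understand it – the profile file, query name, op mix and node address are placeholders you would define yourself:

# your-table.yaml describes your own table plus the named queries to run against it
cassandra-stress user profile=./your-table.yaml "ops(insert=3,simple_select=1)" \
  n=10000000 -rate threads=200 -node 10.0.0.1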

 

That’s it folks; all the crap below is my working notes and bookmarks.

How to get C++11 on CentOS 7 for stress-tool compilation:

Install a recent version of the compiler on CentOS 7: devtoolset-4 (see the sketch below).
Updating gcc from version 4.8 on CentOS 6: https://gist.github.com/stephenturner/e3bc5cfacc2dc67eca8b

scl enable devtoolset-2 bash
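
A sketch of the CentOS 7 path, assuming the Software Collections repo (exact package names can differ between repo revisions):

yum install -y centos-release-scl
yum install -y devtoolset-4-gcc devtoolset-4-gcc-c++ devtoolset-4-binutils
scl enable devtoolset-4 bash    # drops you into a shell with the newer gcc/g++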

RAM & swap:

How to drop the OS buffer cache:

echo 3 > /proc/sys/vm/drop_caches

Check system-wide swappiness settings:

more /proc/sys/vm/swappiness
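
To actually lower it (a value of 1 is a common suggestion for database hosts – treat it as an assumption and adjust to your own policy):

sysctl -w vm.swappiness=1
echo "vm.swappiness=1" >> /etc/sysctl.conf    # persist across reboots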

Docker related:

If you are not brave enough to play with DC/OS or OpenStack, you can find docker-compose useful for manipulating a homogeneous set of containers, as in the sketch below.
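
A minimal sketch, assuming the official cassandra:2.1 image and its CASSANDRA_* environment variables (addresses and paths are placeholders; with host networking you would run one such project per physical host):

cat > docker-compose.yml <<'EOF'
node1:
  image: cassandra:2.1
  net: host
  volumes:
    - /mnt/ssd1/cassandra:/var/lib/cassandra
  environment:
    - CASSANDRA_CLUSTER_NAME=prod
    - CASSANDRA_SEEDS=10.0.0.1,10.0.0.2
EOF
docker-compose up -d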

Installation:

sudo rpm -iUvh http://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-5.noarch.rpm
rpm -qa | grep docker

If you fucked up with partition settings:

wipefs -a /dev/sda1

Docker and mapped volumes: http://container-solutions.com/understanding-volumes-docker/

If docker info | grep loopback shows you anything – you have already screwed up the storage driver configuration (a rough fix is sketched below).
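
The rough shape of the fix – point the devicemapper driver at a real LVM thin pool instead of the loopback files (the thin pool name is a placeholder, and where you put the daemon flags depends on your docker/systemd setup):

docker daemon --storage-driver=devicemapper \
  --storage-opt dm.thinpooldev=/dev/mapper/docker-thinpool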

How to check journal what is happening:

journalctl -u docker.service --since "2017-01-01 00:00:00"

Full flags described here.

Useful commands to check docker images and storage:

docker inspect
lvdisplay
lsblk
dmsetup info /dev/dm-13

Cassandra:

How to check off-heap memory consumption per node:

nodetool cfstats | grep 'off heap memory used' | awk 'NR > 3 {sum += $NF } END { print sum }'

How to check which neighbours we can see from a node:

nodetool -h <node_ip> ring

How to find processes that use swap:

for file in /proc/*/status ; do awk '/VmSwap|Name/{printf $2 " " $3}END{ print ""}' $file; done | awk '$2 {print $1 FS $2}'

Check how much disk space we use, from Cassandra’s perspective:

nodetool cfstats | grep 'Space used (total)' | awk '{s+=$NF} END{print s}'

Determine disk usage from the OS point of view:

du -ch /var/lib/cassandra/data/

Cassandra health check:

ssh <node_ip> nodetool status

Network:

How to open a port on CentOS 7:

firewall-cmd --zone=public --add-port 9160/tcp --permanent
firewall-cmd --zone=public --add-port 9042/tcp --permanent
firewall-cmd --zone=public --add-port 7200/tcp --permanent
firewall-cmd --zone=public --add-port 7000/tcp --permanent

Open ports for spark, master:

firewall-cmd --zone=public --add-port 7077/tcp --permanent
firewall-cmd --zone=public --add-port 8081/tcp --permanent

Apply changes in ip tables:

firewall-cmd --reload

or do this; useful in case NetworkManager behaves badly:

systemctl restart network.service
systemctl status network.service

And as a bonus –

How to migrate data from an old cluster to a brand new one

  1. sstableloader
  2. cassandra snapshots
  3. For a tiny dataset, to get a cql file with inserts: cassandradump

The first two approaches represent the standard way of doing data migration.
The limitation of the first is speed and the necessity to stop the old node.
The limitation of the second is the necessity to manually deal with the token ring on a per-node basis.
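
For completeness, the rough shape of an sstableloader run (node addresses and paths are placeholders; the path must point at the keyspace/table folder that holds the sstables):

sstableloader -d 10.0.1.1,10.0.1.2 \
  /old/cluster/cassandra-folder/data/your-keyspace/your-table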

If life has been really cruel to you, you can play with the data folders per node.

NOTE: if you can replicate the exact same setup – in terms of assigned IPs – it is enough to just copy cassandra.yaml from the old nodes to the new ones and to use the exact same folder mapping within docker as on the old cluster.

If not – you can still do it by copying the data folders following the steps below, but better just use sstableloader.

  1. In order to do it you have to run the following command on every node, to drain the node from the cluster and flush all data to the filesystem:

nodetool drain

NOTE: this is an unofficial, not recommended way to deal with data migration.
NOTE 1: it requires you to have a similar number of nodes in both clusters.
NOTE 2: there is no need for the same datacenter/rack config or IP addresses.

2. Deploy the docker-based setup according to the HW configuration. The total number of nodes should be equal to the total number of nodes in the old cluster. On the new cluster deploy the exact schema that was deployed on the old cluster.

3. Stop the new cluster.
Within the data folder of every node of the OLD cluster you will have the following folders:
system
system_traces

NOTE: do not touch system* tables.

4. Under the folder /your/cassandra/data-folder/your-keyspace
you should have a set of folders corresponding to the tables of that keyspace, under which the data is stored.

5. You have to copy the content of each such folder (*.db, *.sha1, *.txt) for every node from the OLD cluster to the corresponding folder of the NEW cluster node. The table UUID in the folder name WILL be different.
I.e. old cluster, node 1 to new cluster, node 2:
Data copy example:
scp /old/cluster/cassandra-folder/data/your-keyspace/your-table-e31522b0e2d511e6967a67ec03b4d2b5/*.* user@<new-node-ip>:/new/cluster/cassandra-folder/data/your-keyspace/your-table-c56f4dd0e61011e6af481f6740589611/

6. The migrated node of the OLD cluster must be stopped, OR you have to use `nodetool drain` on the node being processed, so that all data ends up in the sstables ~ the data folders.

Performance monitoring:

  • general system overview: atop or htop
  • Be sure that you understand memory reporting.
  • JMX based monitoring: jconsole
  • Jconsole connection string: service:jmx:rmi:///jndi/rmi://<node_ip>:<jmx_port>/jmxrmi
  • dstat – network & disk io
  • strace – shows every system call; slows things down; can attach to a running process
  • netstat -tunapl and lsof -i -P – network/ports per process
  • docker stats – reports cpu/mem/io for a container
  • perf + perf-map-agent for java monitoring:
    for example, cache misses:
perf stat -e L1-dcache-load-misses -p <java_pid>


* Cassandra has only a single drawback – it has no idea about your data model, whether you configured your schema correctly, or what your load patterns are. That is why you have to dive into a wonderland of controversial recommendations in blog posts like this one instead of thoroughly reading the documentation first.

 

P. S. do not believe anyone, measure!

 

 
