The cart goes nowhere because the swan wants to fly in the air, the pike wants to swim underwater and the crawfish wants to crawl backward.
Cassandra is one of the workhorses of the modern high-load landscape – a NoSQL database that can be plugged into Spark for distributed computing, into Titan for playing with graph data representations, and even into Elastic as a search backend.
And if you really care about pure write performance, it is the de-facto choice in the world of open-source solutions: production proven, with a big community that has already generated plenty of outdated and controversial answers at SO & mailing lists.
No single point of failure, scalability, high availability, retention periods, … – but those marketing claims hide a few principal caveats. Actually, Cassandra has only a single drawback(*): it will not reach its limits with default settings on your hardware. A lonely, single-node configuration is not a use case for Cassandra – it shines in a multi-node clustered setup.
If you really want to see full utilization of endless cores and crazy amounts of RAM, you have to use some virtualization technology to manage the hardware resources.
Let me start with some conclusions and recommendations, based on two months of extensive testing and on observing the trouble tickets after migrating this approach to production. With those considerations in mind I managed to configure it in such a way that it tolerated 750k mixed operations per second. That load was generated for more than 8 hours to check pressure tolerance and emulate peaks. It was a mixed execution of async inserts without future processing, synced inserts, and read requests.
Frankly speaking, I am sure it is still far from its limit.
Bear in mind that I am talking about Cassandra 2.1.*.
About disks:
- Use SSD disks as mapped volumes for the Docker containers. Single container = single dedicated disk. It is possible to use multiple disks per container, but that will lead to a 5-15% slowdown.
- If you use an SSD disk you can map all Cassandra directories to it (saved_caches, data, commitlog) and adjust cassandra.yaml with higher throughput values, in particular compaction_throughput_mb_per_sec and trickle_fsync (see the sketch after this list).
- It really depends on data distribution and your data model, but be ready that disk utilization will vary from one node to another by up to 20%.
- Docker should be configured to NOT use the host’s root partitions. Don’t be stingy – allocate a separate drive for logs, and choose a proper storage driver – devicemapper backed by LVM (direct-lvm).
- In practice, the cluster starts struggling when any of the nodes runs out of space. Surprisingly, in my experiments it was stable even with 3% free, but in real life it is better to configure your monitoring system to alert at 15-20%.
- Choose compaction and compression strategies wisely when you design your DB.
- Be careful with column naming – it will be stored with every god damn row!
- Do sizing when you think about the number of nodes (and disks).
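A minimal sketch of the disk mapping described above, assuming the dedicated SSD is mounted at /mnt/ssd1 on the host and the official cassandra:2.1 image is used (paths, container name and values are illustrative, not a recommendation):
docker run -d --name cassandra-node1 \
  -v /mnt/ssd1/cassandra:/var/lib/cassandra \
  cassandra:2.1
# data, commitlog and saved_caches all land on the dedicated SSD
# cassandra.yaml knobs worth raising on SSD-backed storage (tune for your drives):
# compaction_throughput_mb_per_sec: 128
# trickle_fsync: true
# trickle_fsync_interval_in_kb: 10240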
About CPUs:
- More CPUs per node is NOT always good. Stick with 8 cores per node.
I’ve experimented with a single fat supernode per physical server (1×48 cores), with 4×12, and with 6×8. Six nodes with 8 CPU cores each outperformed all the other layouts in 6 kinds of stress-load scenarios.
- If you play with the core count you have to adjust a few settings in cassandra.yaml to reflect that number: concurrent_compactors, concurrent_reads, concurrent_writes.
- Cassandra in most cases ends up being CPU-bound; don’t forget to leave 8-16 cores for the host system, and allocate CPUs exclusively for the containers using --cpuset-cpus.
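A minimal sketch of the pinning, assuming a 48-core server with 8-16 cores reserved for the host system and 8 cores per container (the core ranges and container names are made up):
docker run -d --name cassandra-node1 --cpuset-cpus="16-23" cassandra:2.1
docker run -d --name cassandra-node2 --cpuset-cpus="24-31" cassandra:2.1
# matching cassandra.yaml values for an 8-core node (illustrative starting points, not a recommendation):
# concurrent_compactors: 8
# concurrent_reads: 32
# concurrent_writes: 32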
About RAM:
- cassandra-env.sh has a built-in calculation of free memory that adjusts the JVM settings by parsing the output of the free command. Of course that does not work for a Docker-based setup. Bear this in mind and tweak your startup scripts to substitute the values there (see the sketch at the end of this list).
- Disable swap within Docker using --memory-swappiness=0.
- Effectiveness of memory usage depends on the CPU count, on how effectively multithreaded compaction is implemented in Cassandra, and on the reader/writer/compactor settings in your cassandra.yaml, i.e. you can have hundreds of GB of RAM and still end up with an OOM. But even with 8 GB of RAM per node you can already see benefits. More RAM means more memtables, a bigger key cache, and more effective OS-level file caching. I would recommend 24 GB of RAM per node.
- Disable transparent huge pages at the host system, or at least tune your JVM settings:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
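A minimal sketch of pinning the JVM heap explicitly instead of relying on the free-based auto-detection in cassandra-env.sh; the 24g/8G/2G values are illustrative, and I assume the official image passes MAX_HEAP_SIZE and HEAP_NEWSIZE through to cassandra-env.sh:
docker run -d --name cassandra-node1 \
  --memory=24g --memory-swappiness=0 \
  -e MAX_HEAP_SIZE=8G -e HEAP_NEWSIZE=2G \
  cassandra:2.1
# cassandra-env.sh skips its own calculation when both heap variables are already set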
About network:
- It is mandatory to use the host OS network stack, via the Docker flag --net=host.
- Most likely the network will not be the bottleneck for your load, so you can stick with virtual interfaces on top of a single real one.
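A minimal sketch of running several nodes on one host network stack: give each container its own secondary IP on top of the single physical interface and bind Cassandra to it. The interface name eth0, the addresses and the CASSANDRA_LISTEN_ADDRESS convenience variable of the official image are assumptions; you can always set listen_address in cassandra.yaml directly instead.
# secondary IP for the second node on this physical server
ip addr add 10.0.0.12/24 dev eth0 label eth0:1
docker run -d --name cassandra-node2 --net=host \
  -e CASSANDRA_LISTEN_ADDRESS=10.0.0.12 \
  cassandra:2.1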
Testing:
- 3 physical servers: each has 72 cores, 400 GB RAM
- Cassandra 2.1.15
- Docker: 1.10
- Host OS: CentOS 7.5
- Guest OS: CentOS 7.5
- Java 8 from Oracle, with JNA
Cassandra 3.* is a completely different story – in my opinion mainly because of the storage engine change, but the list of differences is huge.
DB overview:
- A dozen keyspaces, each with up to 20(?) tables.
- Few indexes – just do not use indexes; design your schema properly instead.
- Data replication factor = 3, gossip through a file (see the sketch after this list).
- Each physical server represents a dedicated rack within a single datacenter.
- Row cache was disabled in cassandra.yaml, i.e. the first priority was to focus on a write-oriented workload.
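A minimal sketch of that topology in CQL and in the snitch config, assuming that "gossip through a file" refers to GossipingPropertyFileSnitch; the keyspace, DC and rack names are made up:
-- replication factor 3 within the single datacenter
CREATE KEYSPACE your_keyspace
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};
# cassandra-rackdc.properties on a node living on physical server #2
# (with endpoint_snitch: GossipingPropertyFileSnitch in cassandra.yaml)
dc=DC1
rack=RACK2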
Tools:
- DataStax stress tool with an artificial table – very interesting, but useless; testing against your own schema is what matters.
- DataStax stress tool + your own table definition – nice, gives hints about production performance (see the sketch after this list). But you are still testing a single table – usually that is not the case in real life.
- A self-written in-house stress tool that generates data according to our data model in a randomized fashion, plus a set of dedicated servers for the DDoS, with the ability to switch between async inserts (just do not use batches) with and without acknowledgment.
Once again: no batch inserts, as they should not be used in production.
- Probably you can adapt the Yahoo! Cloud Serving Benchmark. I haven’t played with it.
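A minimal sketch of the second option with the Cassandra 2.1 stress tool in user-profile mode; the profile file name, the ops mix (insert plus a query named simple_select defined inside the profile), the node IP and the thread count are all made up:
cassandra-stress user profile=./your-schema.yaml "ops(insert=3,simple_select=1)" n=1000000 -node 10.0.0.11 -rate threads=200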
That’s it folks; all the crap below is my working notes and bookmarks.
How to get C++11 on CentOS 7 for stress tool compilation:
Install a recent version of the compiler on CentOS 7: devtoolset-4 (see the sketch below).
Update GCC 4.8 on CentOS 6: https://gist.github.com/stephenturner/e3bc5cfacc2dc67eca8b
scl enable devtoolset-2 bash
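A minimal sketch of the devtoolset-4 route on CentOS 7, assuming the Software Collections repository is available:
sudo yum install -y centos-release-scl
sudo yum install -y devtoolset-4-gcc devtoolset-4-gcc-c++
scl enable devtoolset-4 bash
g++ --version   # should now report GCC 5.x, which supports -std=c++11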
RAM & swap:
How to clean the OS buffer cache:
echo 3 > /proc/sys/vm/drop_caches
Check the system-wide swappiness setting:
more /proc/sys/vm/swappiness
Docker related:
If you are not brave enough to play with DC/OS or OpenStack, you may find docker-compose useful for manipulating a homogeneous set of containers (see the sketch below).
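A minimal sketch of driving such a set, assuming a docker-compose.yml with one cassandra service per node; the service name cassandra-node1 is made up:
docker-compose up -d                  # bring the whole homogeneous set up
docker-compose ps                     # check container states
docker-compose logs cassandra-node1   # look at a single node
docker-compose stop                   # stop the set, keep the data volumes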
Installation:
sudo rpm -iUvh http://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-5.noarch.rpm
rpm -qa | grep docker
If you fucked up with partition settings:
wipefs -a /dev/sda1
Docker and mapped volumes: http://container-solutions.com/understanding-volumes-docker/
If docker info | grep loopback shows you something – you have already screwed up the storage driver configuration.
How to check the journal for what is happening:
journalctl -u docker.service --since "2017-01-01 00:00:00"
Full flags described here.
Useful commands to check docker images and their storage mapping:
docker inspect
lvdisplay
lsblk
dmsetup info /dev/dm-13
Cassandra:
How to check off-heap memory consumption per node:
nodetool cfstats | grep 'off heap memory used' | awk 'NR > 3 {sum += $NF } END { print sum }'
How to check which neighbors we can see from a node:
nodetool -h <node_host> ring
How to find processes that use swap:
for file in /proc/*/status ; do awk '/VmSwap|Name/{printf $2 " " $3}END{ print ""}' $file; done | awk '$2 {print $1 FS $2}'
Check how much disk space we use, from Cassandra's perspective:
nodetool cfstats | grep 'Space used (total)' | awk '{s+=$NF} END{print s}'
Determine disk usage, from the OS point of view:
du -ch /var/lib/cassandra/data/
Check Cassandra health:
ssh <node> nodetool status
Network:
How to open ports on CentOS 7:
firewall-cmd --zone=public --add-port 9160/tcp --permanent
firewall-cmd --zone=public --add-port 9042/tcp --permanent
firewall-cmd --zone=public --add-port 7200/tcp --permanent
firewall-cmd --zone=public --add-port 7000/tcp --permanent
Open ports for Spark master:
firewall-cmd --zone=public --add-port 7077/tcp --permanent
firewall-cmd --zone=public --add-port 8081/tcp --permanent
Apply the changes to iptables:
firewall-cmd --reload
or do this, useful in case NetworkManager behaves badly:
systemctl restart network.service
systemctl status network.service
And as a bonus:
How to migrate data from an old cluster to a brand new one
- sstableloader
- cassandra snapshots
- For a tiny dataset, to get a CQL file with inserts: cassandradump
The first two approaches represent the standard ways of data migration.
The limitation of the first is speed and the necessity to stop the old node.
The limitation of the second is the necessity to manually deal with the token ring on a per-node basis.
If life has been really cruel to you, you can play with the data folders per node.
NOTE: if you can replicate the exact same setup – in terms of assigned IPs – it will be enough to just copy cassandra.yaml from the old nodes to the new ones, and use the exact same folder mapping within Docker as it was in the old cluster.
If not, you can still do it by copying the data folders following the steps below, but it is better to just use sstableloader (see the sketch below).
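A minimal sketch of the sstableloader route, assuming the table has been flushed on the old node first (nodetool flush or nodetool drain); the target IPs are made up and the path follows the same style as the scp example further down:
# run per table, pointing at one or more nodes of the NEW cluster
sstableloader -d 10.0.1.11,10.0.1.12 /old/cluster/cassandra-folder/data/your-keyspace/your-table-e31522b0e2d511e6967a67ec03b4d2b5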
1. In order to do it you have to run the following command on every node of the OLD cluster, to drain the node and flush all data to the filesystem:
nodetool drain
NOTE: this is an unofficial and not recommended way to deal with data migration.
NOTE 1: it requires you to have the same number of nodes in both clusters.
NOTE 2: no need for the same datacenter/rack config/IP addresses.
2. Deploy the Docker-based setup according to your HW configuration. The total number of nodes should be equal to the total number of nodes in the old cluster. On the new cluster, deploy the exact schema that was deployed on the old cluster.
3. Stop the new cluster.
Within every node's data folder of the OLD cluster you will have the following folders:
system
system_traces
…
NOTE: do not touch system* tables.
4. Under the folder /your/cassandra/data-folder/your-keyspace
you should have a set of table folders for that keyspace, under which the data is stored.
5. You have to copy the content of these folders (*.db, *.sha1, *.txt) for every node from the OLD cluster to the corresponding folder of the NEW cluster node. The table UUIDs WILL be different.
E.g. old cluster, node 1 to new cluster, node 2:
data copy example:
scp /old/cluster/cassandra-folder/data/your-keyspace/your-table-e31522b0e2d511e6967a67ec03b4d2b5/*.* user@ip:/new/cluster/cassandra-folder/data/your-keyspace/your-table-c56f4dd0e61011e6af481f6740589611/
6. The migrated node of the OLD cluster must be stopped, OR you have to use `nodetool drain` on the processed node, so that all the data ends up within SSTables, i.e. in the data folders.
Performance monitoring:
- general system overview: atop or htop
- Be sure that you understand memory reporting.
- JMX based monitoring: jconsole
- JConsole connection string: service:jmx:rmi:///jndi/rmi://<host>:<jmx_port>/jmxrmi
- dstat network & disk io
- strace – shows every system call; slows things down; can attach to a running process.
- netstat -tunapl or lsof -i -P – network/ports per process.
- docker stats – reports CPU/mem/IO per container.
- perf + perf-map-agent for Java monitoring; for example, cache misses:
perf stat -e L1-dcache-load-misses
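A minimal sketch of a concrete measurement, assuming a single CassandraDaemon process on the host; the 10-second window is arbitrary:
# count L1 data cache misses of the Cassandra JVM for 10 seconds
perf stat -e L1-dcache-load-misses -p "$(pgrep -f CassandraDaemon)" sleep 10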
Articles that I find useful:
- Brief explanation about docker: https://www.plesk.com/blog/docker-containers-explained/
- Cassandra internals, great and brief: https://uberdev.wordpress.com/2015/11/29/cassandra-developer-certification-study-notes-write-path/
- Cassandra, great intro: https://www.infoq.com/articles/cassandra-mythology
- Cassandra read performance: http://shareitexploreit.blogspot.ae/2012/09/cassandra-read-performance.html
- Nodetool, man: https://wiki.apache.org/cassandra/NodeTool
- Visualization of cassandra stress tool: http://thelastpickle.com/blog/2015/10/23/cassandra-stress-and-graphs.html
- Examples of configuration file for stress tool for custom schemas: https://gist.github.com/tjake/fb166a659e8fe4c8d4a3
- Cassandra performance tuning: http://www.tanzirmusabbir.com/2013/06/cassandra-performance-tuning.html
- Cassandra performance tuning: http://jonathanhui.com/cassandra-performance-tuning-and-monitoring
- Famous post from netflix’s experiments with cassandra: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
- Cassandra performance tuning: https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
- Famous million writes benchmarks: http://www.datastax.com/1-million-writes
- Cassandra performance tuning, comparison what is affect performance more and what is less: https://www.javacodegeeks.com/2015/02/apache-cassandra-low-latency-applications.html
- Sstable explained: http://wiki.apache.org/cassandra/ArchitectureSSTable
- How data is read from cassandra: https://docs.datastax.com/en/cassandra_win/3.0/cassandra/dml/dmlAboutReads.html
- Screen reference: http://aperiodic.net/screen/quick_reference
- which centos am I using: https://linuxconfig.org/how-to-check-centos-version
- docker misconfiguration – http://techblog.newsweaver.com/cleaning-up-device-mappings-after-docker-daemon-death/
- change location of docker var lib folder: https://linuxconfig.org/how-to-move-docker-s-default-var-lib-docker-to-another-directory-on-ubuntu-debian-linux
(*) Cassandra has only a single drawback – it has no idea about your data model, whether you configured your schema correctly, or what your load patterns are. That is why you have to dive into the wonderland of controversial recommendations in blog posts like this one, instead of thoroughly reading the documentation first.
P.S. Do not believe anyone – measure!