Architecture
Aerospike is organized into three layers:
- Client Layer
- Distribution Layer
- Cluster Management Module
- Data Migration Module
- Transaction Processing Module
- Sync/Async Replication
- Proxy
- Duplicate Resolution
- Data Storage Layer
- enhanced key-value store with a schemaless data model
- namespaces => similar to databases
- sets => tables
- records => rows
- bins => columns
Data Model
Namespaces
- Similar to a database, though the mapping is not necessarily one-to-one.
- The way you collect data in namespaces relates to how the data is stored and managed.
- A namespace contains:
- records
- indexes
- policies(Configuration)
- How data is stored: on DRAM or disk.
- How many replicas exist for a record.
- When records expire
Sets
Similar to a table in an RDBMS.
Specifying a set is optional: a record does not have to belong to a set, but it always belongs to a namespace. Some records in the namespace may not be within a set.
Sets inherit the policies defined by their namespace, and can define additional policies or operations specific to the set. For example, secondary indexes can be specified only on data for a particular set, or a scan operation can be done on a specific set.
Records
- Records can belong to a namespace or to a set within the namespace.
- A record consists of three parts:
- key
- Unique identifier. Records are addressed by a hash of their key, called the digest.
- Key types: integer, string, and bytes
- Internally, the key is hashed into a 160-bit (20-byte) digest
- metadata
- generation
- time-to-live (TTL)
- last-update-time (LUT)
- bins
- Bins are equivalent to fields in RDBMS
- Data type is defined by the value contained in the bin:
- Basic
- integer: 8 bytes
- string: 128 KB;
- bytes
- double: 8 bytes
- CDTs (Complex Data Types)
- list
- map
- GeoJSON (3.7.0+)
- native-language serialized types (blobs)
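The three parts of a record described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the `Record` class and `digest_of` helper are hypothetical names, and SHA-1 is used purely as a stand-in because it also produces a 160-bit value; these notes do not specify which hash function or key encoding the server actually uses.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


def digest_of(set_name: Optional[str], user_key: Any) -> bytes:
    """Return a 20-byte digest for a key (SHA-1 is a stand-in hash, an assumption)."""
    material = f"{set_name or ''}|{type(user_key).__name__}|{user_key!r}"
    return hashlib.sha1(material.encode("utf-8")).digest()


@dataclass
class Record:
    namespace: str                    # every record belongs to a namespace
    set_name: Optional[str]           # optional: a record may have no set
    user_key: Any                     # integer, string, or bytes
    bins: Dict[str, Any] = field(default_factory=dict)
    generation: int = 0               # metadata: generation
    ttl: int = 0                      # metadata: time-to-live, in seconds
    last_update_time: float = 0.0     # metadata: last-update-time

    @property
    def digest(self) -> bytes:
        return digest_of(self.set_name, self.user_key)


# The bin type is defined by the value it holds, so two records
# may store different types under the same bin name.
rec = Record("test", "cars", 42, bins={"price": 20000.0, "tags": ["suv", "used"]})
other = Record("test", None, "k1", bins={"price": "20 thousand"})
```

Here `rec.digest` is 20 bytes long, and the `price` bin holds a double in one record and a string in the other.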
NOTES & TIPS
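- Namespaces must be defined in advance, but sets and bins can be created dynamically.
- Unlike in an RDBMS, the same bin can hold values of different types in different records. For example, a price bin can hold "20000.00" in one record and "20 thousand" in another. This is best avoided, though, because it interferes with indexing.
- For performance, indexes (primary and secondary) are kept in memory. Data (values) can be stored in memory or on SSD.
- Regardless of where the data is stored, Aerospike relies on two mechanisms, the Smart Defragmenter and the Intelligent Evictor, to ensure that data (values) is not lost.
- Because the indexes live in memory, they must be rebuilt from the data when the system starts, so startup is relatively slow.
- The configurations of different namespaces are independent and isolated from one another.
- Although Aerospike does not use an LSM tree, it also adopts a log-structured file system (the Aerospike log structured file system) to address the random-write problem.
- Maps and lists can be nested to any depth, but only the first level can be indexed.
- Lists and maps are serialized with MessagePack in the underlying storage.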
- There is a limit of 32K unique bin names in use within a namespace
- The record size cannot exceed the write-block-size (usually 128KB for SSDs and 1MB for rotational) => earlier versions supported LDTs (Large Data Types), but they were removed in later versions.
Queries and Indexes
Query
With secondary indexes, Aerospike supports the following three kinds of conditional queries:
- Equal query against string or numeric indexes
- Range query against numeric indexes
- Point-In-Region or Region-Contain-Point query against geo indexes
TIPS
Query results can be post-processed with Aerospike Predicate Filtering (3.12+) or Aerospike UDFs (user-defined functions).
Primary Index
In Aerospike, the primary key index is a blend of distributed hash table technology with a distributed tree structure in each server.
The entire keyspace in the namespace (database) is partitioned using a robust hash function into partitions.
A total of 4096 partitions are equally distributed across cluster nodes. See data-distribution for details on hashing and partitioning.
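The primary index is purely in-memory and is not periodically persisted to disk. When the system restarts, it scans the data to rebuild the index. The Enterprise Edition supports fast startup (the Fast Restart feature) by keeping the index in a Linux shared memory segment.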
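To make the key-to-partition step concrete, here is a minimal sketch. It assumes SHA-1 as a stand-in 160-bit hash, takes the low 12 bits of the first two digest bytes as the partition ID, and assigns partitions to nodes round-robin; all three choices, and the names `partition_id` and `owner`, are illustrative assumptions rather than Aerospike's actual algorithm. Only the figure of 4096 partitions comes from the text.

```python
import hashlib
from collections import Counter

N_PARTITIONS = 4096  # from the text: fixed number of partitions


def partition_id(user_key: str) -> int:
    """Map a key to one of 4096 partitions via its digest (stand-in hash)."""
    digest = hashlib.sha1(user_key.encode("utf-8")).digest()  # 20 bytes
    # 4096 == 2**12, so 12 bits of the digest are enough to pick a partition.
    return int.from_bytes(digest[:2], "little") % N_PARTITIONS


def owner(pid: int, nodes: list) -> str:
    """Hypothetical round-robin assignment of partitions to nodes."""
    return nodes[pid % len(nodes)]


nodes = ["node-a", "node-b", "node-c", "node-d"]
per_node = Counter(owner(pid, nodes) for pid in range(N_PARTITIONS))
```

With four nodes, `per_node` shows 1024 partitions on each node, which is the "equally distributed" property the section describes.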
Secondary Index
- are stored in RAM for fast look-up.
- are built on every node in cluster and co-located with the primary index. Each secondary index entry only contains references to records local to the node.
- contain pointers to both master records and replicated records in a cluster.
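Any index on a field (bin) other than the primary key is a secondary index. The indexed key can only be one of the following three data types: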
- Integer
- String
- Geospatial
The field (bin) that holds the key, however, can be of type:
- Basic
- List
- MapKeys
- MapValues
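The relationship between the bin type and the indexed key can be sketched as a function that extracts the values an index would see from one bin value. This is a toy model, not server code: `index_entries` and the string labels are hypothetical, geospatial keys are left out for brevity, and the rule that nested elements are skipped reflects the first-level-only restriction noted elsewhere in these notes.

```python
def index_entries(bin_value, index_type="BASIC"):
    """Return the values a secondary index would see for one bin value."""
    if index_type == "BASIC":
        candidates = [bin_value]
    elif index_type == "LIST" and isinstance(bin_value, list):
        candidates = bin_value
    elif index_type == "MAPKEYS" and isinstance(bin_value, dict):
        candidates = list(bin_value.keys())
    elif index_type == "MAPVALUES" and isinstance(bin_value, dict):
        candidates = list(bin_value.values())
    else:
        candidates = []
    # Only first-level integer or string elements are kept;
    # nested lists/maps and other types are skipped.
    return [v for v in candidates
            if isinstance(v, (int, str)) and not isinstance(v, bool)]


profile = {"color": "red", "dims": {"w": 1}}
keys_indexed = index_entries(profile, "MAPKEYS")      # ['color', 'dims']
values_indexed = index_entries(profile, "MAPVALUES")  # ['red']
```

Note that the index granularity is "all keys" or "all values" of the map; there is no way in this model to index the value under one particular key.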
Limitations
- supports up to 256 secondary indexes per namespace => :(
- There is a limit of 32K unique bin names in use within a namespace => acceptable in practice
- Fast restart is not supported. On daemon restart, secondary indexes are rebuilt based on record data => :(
- Aerospike is tuned for queries using high selectivity secondary indexes
- For string data-type, only string size <= 2k can be indexed => not a big problem for exact matching
- RANGE result sets are inclusive (that is, both specified values are included in the results) => can be filtered at the application layer
- If no set-name is specified during index creation, then the index will only include records without a set name, but not all sets in the namespace.
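Two of the limitations above (inclusive ranges and set scoping) are easy to misread, so here is a toy in-memory model of a numeric secondary index. `NumericIndex` is a hypothetical class written for illustration only; it mimics the described behavior and is not how the server stores its indexes.

```python
from collections import defaultdict


class NumericIndex:
    """Toy in-memory secondary index on one integer bin."""

    def __init__(self, bin_name, set_name=None):
        self.bin_name = bin_name
        self.set_name = set_name          # None = only records without a set
        self.entries = defaultdict(set)   # indexed value -> record keys

    def add(self, key, set_name, bins):
        value = bins.get(self.bin_name)
        # Records from other sets, or with non-integer values, are not indexed.
        if set_name == self.set_name and isinstance(value, int):
            self.entries[value].add(key)

    def equal(self, value):
        return sorted(self.entries.get(value, ()))

    def between(self, low, high):
        # Inclusive on both ends, as described in the limitations above.
        return sorted((v, k) for v, keys in self.entries.items()
                      if low <= v <= high for k in keys)


rows = [("u1", "users", 20), ("u2", "users", 30), ("u3", "users", 40), ("x1", None, 30)]
idx = NumericIndex("age", set_name="users")
no_set_idx = NumericIndex("age")          # created without a set name
for key, set_name, age in rows:
    idx.add(key, set_name, {"age": age})
    no_set_idx.add(key, set_name, {"age": age})

inclusive = idx.between(20, 30)
# Application-layer filtering to get a half-open range [20, 30).
half_open = [(v, k) for v, k in inclusive if v < 30]
```

The index created without a set name sees only `x1`, the record that has no set, rather than every record in the namespace.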
Distribution
Features
- Automatic data location detection
- Automatic cluster balancing
- No single point of failure
Mechanisms
- Data Distribution: Robust partitioning ensures uniform data distribution, which avoids hot spots and automatically balances data without manual intervention.
- Clustering: The Aerospike clustered database automatically detects failures and heals.
- Replication: This Aerospike feature includes the following replication abilities to avoid a single point of failure:
- Intra Cluster Replication
- Rack Aware Replication
- Cross-Datacenter Replication
Transactions
- Single row ACID
UDF(User-Defined Functions)
- code written by a user that runs inside the Aerospike database server
- currently only supports Lua as the UDF language
- Record UDFs: execute on a single database record. They can create, update, or delete a record.
- Stream UDFs: perform read-only operations on a collection of records.
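Real UDFs are written in Lua and run inside the server, so the following Python is only a stand-in that mirrors the shape of a stream UDF: a read-only filter, map, and reduce pass over a collection of records. The function `stream_aggregate` and its parameter names are hypothetical and do not correspond to any Aerospike API.

```python
from functools import reduce


def stream_aggregate(records, keep, project, combine, initial):
    """Read-only filter -> map -> reduce pass over a collection of records."""
    filtered = (r for r in records if keep(r))
    mapped = (project(r) for r in filtered)
    return reduce(combine, mapped, initial)


records = [
    {"city": "Paris", "amount": 10},
    {"city": "Lyon", "amount": 5},
    {"city": "Paris", "amount": 7},
]

total_paris = stream_aggregate(
    records,
    keep=lambda r: r["city"] == "Paris",   # filter step
    project=lambda r: r["amount"],         # map step
    combine=lambda a, b: a + b,            # reduce step
    initial=0,
)
```

The records themselves are never modified, which matches the read-only nature of stream UDFs; a record UDF, by contrast, would create, update, or delete a single record.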
Client(API & SDK)
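Aerospike's server-side distributed architecture is a fully symmetric Shared-Nothing architecture, with no master and no metadata server. As a result, some functionality is pushed down to the client.
The client (that is, the driver) mainly handles: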
- cluster-status sensing
- efficient transaction routing
- network connection pooling
- failover protection
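The routing and failover duties listed above can be sketched with a client that keeps its own partition map. Everything here is an assumption made for illustration: `ToyClient`, the SHA-1 stand-in hash, the layout of the partition map, and the way failed nodes are tracked. A real driver learns the map from the cluster and handles many more cases.

```python
import hashlib

N_PARTITIONS = 4096


class ToyClient:
    def __init__(self, partition_map):
        # partition_map: partition id -> [master node, replica node, ...]
        self.partition_map = partition_map
        self.down = set()  # nodes the client currently believes are unreachable

    def partition_of(self, key):
        digest = hashlib.sha1(key.encode("utf-8")).digest()  # stand-in hash
        return int.from_bytes(digest[:2], "little") % N_PARTITIONS

    def route(self, key):
        # Go straight to the owning node; fall back to a replica on failure.
        for node in self.partition_map[self.partition_of(key)]:
            if node not in self.down:
                return node
        raise RuntimeError("no live node owns this partition")


nodes = ["a", "b", "c"]
pmap = {pid: [nodes[pid % 3], nodes[(pid + 1) % 3]] for pid in range(N_PARTITIONS)}
client = ToyClient(pmap)

first = client.route("user:1")    # master for this key's partition
client.down.add(first)            # simulate detecting a node failure
second = client.route("user:1")   # request is re-routed to the replica
```

Because the client computes the target node itself, no proxy or metadata server sits on the request path.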
API
- put()
- get()
- delete()
- CAS (safe read-modify-write) operations.
- In-database counters.
- Batch get() operations (batch write() is not supported)
- Scan operations.
- List and Map element operations:
- List
- append(), insert(), insert_items()
- get(), get_range(), get_range_from()
- set()
- pop(), pop_range()
- remove(), remove_range()
- trim()
- clear()
- size()
- Map
- set_type()
- add(), add_items(), increment(), decrement(), clear()
- remove_by_key(), remove_by_index(), remove_by_rank()
- remove_by_key_interval(), remove_by_index_range()
- remove_by_value_interval(), remove_by_rank_range(), remove_all_by_value()
- size()
- get_by_key(), get_by_index(), get_by_rank()
- get_by_key_interval(), get_by_index_range()
- get_by_value_interval(), get_by_rank_range(), get_all_by_value()
- Queries: Bin values are indexed and the database searched by equality or range.
- UDFs extend database processing by executing application code in Aerospike.
- Aggregation: Use UDFs on a collection of records to return aggregate values.
Each language binding exposes the API slightly differently; see the documentation under DEVELOPMENT - Client Libraries, for example the Java Client.
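The CAS (safe read-modify-write) operation in the list above relies on the record's generation metadata. The sketch below models that idea with an in-memory store; `ToyStore`, `GenerationMismatch`, and the `expected_generation` parameter are hypothetical names chosen for illustration and are not the client library's API.

```python
class GenerationMismatch(Exception):
    pass


class ToyStore:
    def __init__(self):
        self._data = {}  # key -> (bins, generation)

    def get(self, key):
        bins, generation = self._data.get(key, ({}, 0))
        return dict(bins), generation

    def put(self, key, bins, expected_generation=None):
        current_bins, generation = self._data.get(key, ({}, 0))
        # Reject the write if someone else modified the record since it was read.
        if expected_generation is not None and expected_generation != generation:
            raise GenerationMismatch(
                f"expected {expected_generation}, found {generation}")
        self._data[key] = ({**current_bins, **bins}, generation + 1)
        return generation + 1


store = ToyStore()
store.put("acct", {"balance": 100})

bins, gen = store.get("acct")                       # read
store.put("acct", {"balance": bins["balance"] - 30},
          expected_generation=gen)                  # modify + conditional write

try:
    store.put("acct", {"balance": 0}, expected_generation=gen)  # stale generation
    stale_write_rejected = False
except GenerationMismatch:
    stale_write_rejected = True
```

The second write succeeds because the generation still matches the value that was read; the third is rejected because the record has changed in the meantime.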
Pros & Cons
Pros
- Fast
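- Distributed
- Transactions
- A fairly rich set of data types (Integer, String, Double, List, Map, GeoJSON..) and matching operations (Increment, Append.., though fewer operations than Redis offers)
- Some index support (primary index, secondary indexes, Equality and Range filters)
- Namespaces
- UDF support
- Full scans of a namespace or set, which can be combined with UDFs
- Truncation of a namespace or set
- Some access control
- A fairly rich set of client SDKs and fairly complete documentation
- The aql command-line tool and the asinfo management tool
- An active project with a dedicated team behind it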
Cons
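- Indexes (both primary and secondary) are purely in-memory, which is costly. On restart the indexes must be rebuilt from the data, so startup takes a long time (the Enterprise Edition supports Fast Restart) => :(
- In pure in-memory mode, all data is lost when the service restarts; unlike Redis, there is no persistence for in-memory data.
- Persistent mode (storage-engine device) also has a problem in the Community Edition: Durable Delete is not supported, so deleted data reappears after the service restarts.
- There is an API to empty a set, but the set itself is never actually dropped (an empty set is left behind, and a namespace can hold only 1024 sets..)
- Map indexes cover only first-level entries, and the index granularity is keys or values (rather than the more usual value of a specific key) => :(
- List indexes cover only first-level elements => :(
- Sharding is random, which is unfavorable for graph partitioning
- A B+ Tree is used, so traversal based on index-based adjacency costs k log(n)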
- namespace limitations:
- 1024 sets
- 256 secondary indexes => :(
- 32K unique bin names
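- 4 billion objects per namespace per server (raised to 32 billion in 3.12, but only in the Enterprise Edition)
- Namespaces cannot be created dynamically; the only way is to edit the configuration file and restart the server (Aerospike plans to support this in the next release) => :(
- Record size is limited: <= 1M => somewhat small, but basically fine for our use case
- Bin name length: <= 14 chars => a single field name rarely exceeds this, but concatenated nested property names easily do :(
- Queries based on a secondary index do not support logical operators (AND, OR, NOT); only single-attribute queries are supported => :(
- 3.12 introduced the Predicate Filter, which can apply multi-condition filtering to records returned by a scan or a secondary index query. It cannot be used with aggregations.
- Range queries support only BETWEEN; there are no less-than or greater-than queries, and RANGE results are always inclusive
- Range queries support only integers, not floating-point numbers..
- Queries do not support pagination (no cursor or pagination..) => :(
- Queries do not support sorting (no order by..) => :(
- There are no built-in aggregation functions (Aggregations: count, max, min, sum, group by, etc.). They can be implemented with UDFs (queryAggregate), but this is awkward to use and not very efficient.
- Only batch reads are supported, not batch writes.. => :(
- If the where condition has no matching index, the query fails with an error instead of falling back to a full scan
- If no set name is specified, the query does not search the whole namespace; it searches only the records that have no set name. => :(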
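The bin-name length limit noted above is easy to hit when nested properties are flattened into single-level bins. The sketch below shows the problem and a simple client-side check. The helpers `flatten` and `too_long_bin_names`, and the choice of `_` as the separator, are hypothetical; the 14-character limit is the figure given in these notes.

```python
MAX_BIN_NAME_CHARS = 14  # limit quoted in the notes above


def flatten(nested, prefix=""):
    """Flatten nested dicts into single-level bins, joining names with '_'."""
    flat = {}
    for name, value in nested.items():
        full = f"{prefix}_{name}" if prefix else name
        if isinstance(value, dict):
            flat.update(flatten(value, full))
        else:
            flat[full] = value
    return flat


def too_long_bin_names(bins):
    """Return the bin names that exceed the length limit, sorted."""
    return sorted(n for n in bins if len(n) > MAX_BIN_NAME_CHARS)


bins = flatten({"id": 1, "address": {"city": "Paris", "postal_code": "75001"}})
rejected = too_long_bin_names(bins)
```

Here `address_city` fits within the limit, but `address_postal_code` does not, so a flattening scheme like this needs shorter names or a name-mapping table on the application side.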