背景
对系统运行中的一些情况进行统一上报,监控,统计和展示。允许业务自定义监控项。有助于了解系统的运行情况,如性能,访问频率,Cache命中率等。
实现
具体参见笔者写的另一篇文章:java服务端监控平台设计
使用
首先需要配置pom.xml,引入monitor-client包:
<dependency>
<groupId>com.baidu.global.mobile.server.monitor.client</groupId>
<artifactId>monitor-client</artifactId>
<version>1.0.0</version>
</dependency>
然后创建agent。这里有两种方式。第一种是通过Spring配置,这也是推荐的方式:
<!-- monitor agent -->
<bean id="agent" class="com.baidu.global.mobile.server.monitor.agent.AgentImpl">
<constructor-arg value="classpath:${env}_agent.xml" />
</bean>
其中agent.xml的配置如下:
<agent>
<region>HK</region>
<application>guanxing</application>
<threadPoolSize>2</threadPoolSize>
<loadBalancerName>WRR</loadBalancerName>
<!-- <servers>10.242.112.38,10.242.116.27,10.242.117.36</servers> -->
<endpoints>
<endpoint>
<host>10.242.112.38</host>
<port>4567</port>
<!-- <weight>1</weight> <maxFails>1</maxFails> <failTimeout>10</failTimeout>
<timeout>6</timeout> -->
</endpoint>
<endpoint>
<host>10.242.116.27</host>
<port>4567</port>
</endpoint>
<endpoint>
<host>10.242.117.36</host>
<port>4567</port>
</endpoint>
</endpoints>
</agent>
从配置文件可以大概知道我们agent上报的实现细节:
- 采用线程池异步上报
- 支持多个metric收集服务,客户端进行负载均衡。目前只支持Weighted Round Robin方式。从Nginx实现挖过来的。
- 目前不支持server的自动发现和注销。
也可以通过AgentFactory获取agent:
package com.baidu.global.mobile.server.monitor.agent;
/**
* <pre>
* 因为Agent的本身是有线程池和网络连接池,开销比较大,不适合于每次都New一个,最好是一个应用共用一个,所以提供这么一个单例工厂方法。
*
* 建议如果使用Spring的话,直接在Spring配置就可以了。默认就是单例,而且可以很方便的注入。
*
* <bean id="agent"
* class="com.baidu.global.mobile.server.monitor.agent.AgentImpl">
* <constructor-arg value="classpath:${env}_agent.xml" />
* </bean>
*
* </pre>
*
* @author argan
*/
public class AgentFactory {
private static volatile Agent instance = null;
public static Agent getInstance(String configLocation) {
if (instance == null) {
synchronized (AgentFactory.class) {
if (instance == null) {
instance = new AgentImpl(configLocation);
}
}
}
return instance;
}
public static Agent getInstance(AgentConfig config) {
if (instance == null) {
synchronized (AgentFactory.class) {
if (instance == null) {
instance = new AgentImpl(config);
}
}
}
return instance;
}
}
创建Agent之后,就可以调用Agent接口进行metric上报了:
package com.baidu.global.mobile.server.monitor.agent;
import java.util.Map;
public interface Agent {
// Counter
void increment(String name);
void increment(String name, long delta);
void increment(String category, String name);
void increment(String category, String name, long delta);
void increment(String category, String name, long delta, Map<String, String> tags);
// Guage
void guage(String name, double value);
void guage(String category, String name, double value);
void guage(String category, String name, double value, Map<String, String> tags);
// Label
void label(String name, String value);
void label(String category, String name, String value);
void label(String category, String name, String value, Map<String, String> tags);
void addMetrics(Metric...metrics);
}
Agent的接口很简单,其实就一个:void addMetrics(Metric… metrics); 其中Metric定义如下:
package com.baidu.global.mobile.server.monitor.agent;
import static com.baidu.global.mobile.server.monitor.agent.utils.Preconditions.checkNotNullOrEmpty;
import static com.google.common.base.Preconditions.checkArgument;
import static com.google.common.base.Preconditions.checkNotNull;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.DataFormatException;
import org.springframework.util.Assert;
import org.springframework.util.StringUtils;
import com.baidu.global.mobile.server.agent.common.MetricType;
import com.baidu.global.mobile.server.agent.common.Region;
import com.baidu.global.mobile.server.monitor.agent.utils.WebUtils;
/**
* <pre>
*
* Thrift生成的类型不好用。
*
* 1. 不支持Object类型。
* 2. 无法进行校验
* 3. 没有Builder模式。
* 所以定义了一个,再进行转换。
*
* 但是builder模式有个问题,就是不好处理NULL,所以这里仍然提供setter方法。
*
* </pre>
*
* @author argan
*
*/
public class Metric {
// 哪个应用产生的日志:比如Nantianmen, Guanxing, Spider,
// Nginx等。这个需要应用配置,由client发送给Server。
private String application;
// 服务器区域,目前主要有HK,ID。有这个可以进行区域范围的筛选,没有的话只能根据host判断区域。
private Region region = Region.All;
// 这里不采用灵活的Map<String, String> tags,而是每增加一个tag需要显式的定义一下。
// 日志从哪台机器打过来的。可以动态获取到。如果client没有指定,那么log server以client IP地址为准。
private String host;
// MetricType, 会影响后面的body格式。
private MetricType type = MetricType.Metric;
// Category,用于过滤筛选,比如要找出Top10
// 访问量的URL。定义成String类型,用户自己区别开,常见的有URI,Service,DataSource, Memory, Thread等。
// 因为底层TSDB不支持Key的模糊查询,所以这个还是很有必要的。
private String category = "All";
// metricName,建议是package格式,如:sys.cpu.nice
private String name;
// 如果没有上传,则为server收集到该日志的时间。
private long timestamp;
// 目前只支持Long(Counter)和Double(Guage)和String(lable)。
private Object value;
// custom tags,比如Creator等
private Map<String, String> tags = new HashMap<String, String>();
public Metric() {
}
public Metric(String name) {
this.name = checkNotNullOrEmpty(name);
}
public void setApplication(String application) {
this.application = application;
}
public void setRegion(Region region) {
this.region = region;
}
public void setHost(String host) {
this.host = host;
}
public void setType(MetricType type) {
this.type = type;
}
public void setCategory(String category) {
this.category = category;
}
public void setName(String name) {
this.name = name;
}
public void setTimestamp(long timestamp) {
this.timestamp = timestamp;
}
public void setValue(Object value) {
this.value = checkNotNull(value);
}
public void setTags(Map<String, String> tags) {
this.tags = tags;
}
/**
* Time when the data point was measured.
*
* @return time when the data point was measured
*/
public long getTimestamp() {
return timestamp;
}
public Object getValue() {
return value;
}
public String stringValue() {
return (String) value;
}
public long longValue() throws DataFormatException {
try {
return ((Number) value).longValue();
} catch (Exception e) {
throw new DataFormatException("Value is not a long");
}
}
public double doubleValue() throws DataFormatException {
try {
return ((Number) value).doubleValue();
} catch (Exception e) {
throw new DataFormatException("Value is not a double");
}
}
public boolean isDoubleValue() {
if (value instanceof Number) {
return !(((Number) value).doubleValue() == Math.floor(((Number) value).doubleValue()));
}
return false;
}
public boolean isIntegerValue() {
if (value instanceof Number) {
return ((Number) value).doubleValue() == Math.floor(((Number) value).doubleValue());
}
return false;
}
public boolean isStringValue() {
return value.getClass().equals(String.class);
}
public static void main(String[] args) {
long l = 122L;
int i = 10;
Double d = 12.50;
String s = "hello world";
Metric m = new Metric();
m.setValue(l);
System.out.println(m.getValue().getClass());
m.setValue(i);
System.out.println(m.getValue().getClass());
m.setValue(d);
System.out.println(m.getValue().getClass());
m.setValue(s);
System.out.println(m.getValue().getClass());
System.out.println(m.isStringValue());
}
/**
* Returns the metric name.
*
* @return metric name
*/
public String getName() {
return name;
}
/**
* Returns the tags associated with the data point.
*
* @return tag for the data point
*/
public Map<String, String> getTags() {
return Collections.unmodifiableMap(tags);
}
public String getApplication() {
return application;
}
public Region getRegion() {
return region;
}
public String getHost() {
return host;
}
public MetricType getType() {
return type;
}
public String getCategory() {
return category;
}
public void validate() {
if (this.timestamp == 0) {
this.timestamp = System.currentTimeMillis();
}
if (StringUtils.isEmpty(this.host)) {
this.setHost(WebUtils.getLocalNetWorkIp());
}
Assert.notNull(this.name, "metric name is required!");
Assert.notNull(this.application, "application is required!");
Assert.notNull(this.type, "type is required!");
Assert.notNull(this.category, "category is required!");
Assert.notNull(this.region, "region is required!");
Assert.notNull(this.host, "host is required!");
}
/**
* This method starts the building process for Metrics.
*/
public static Builder start() {
return new Builder();
}
public static class Builder {
private final Metric metric = new Metric();
public Builder application(String application) {
checkNotNullOrEmpty(application);
metric.application = application;
return this;
}
public Builder host(String host) {
checkNotNullOrEmpty(host);
metric.host = host;
return this;
}
public Builder category(String category) {
checkNotNullOrEmpty(category);
metric.category = category;
return this;
}
public Builder region(Region region) {
Assert.notNull(region);
metric.region = region;
return this;
}
public Builder type(MetricType type) {
Assert.notNull(type);
metric.type = type;
return this;
}
public Builder name(String name) {
checkNotNullOrEmpty(name);
metric.name = name;
return this;
}
public Builder timestamp(long timestamp) {
metric.timestamp = timestamp;
return this;
}
public Builder value(Object value) {
metric.value = checkNotNull(value);
return this;
}
/**
* Adds a tag to the data point.
*
* @param name tag identifier
* @param value tag value
* @return the metric the tag was added to
*/
public Builder addTag(String name, String value) {
checkNotNullOrEmpty(name);
checkNotNullOrEmpty(value);
metric.getTags().put(name, value);
return this;
}
/**
* Adds tags to the data point.
*
* @param tags map of tags
* @return the metric the tags were added to
*/
public Builder addTags(Map<String, String> tags) {
checkNotNull(tags);
metric.getTags().putAll(tags);
return this;
}
/**
* set the data point for the metric.
*
* @param timestamp when the measurement occurred
* @param value the measurement value
* @return the metric
*/
protected Builder innerAddDataPoint(long timestamp, Object value) {
checkArgument(timestamp > 0);
metric.timestamp = timestamp;
metric.value = checkNotNull(value);
return this;
}
/**
* Adds the data point to the metric with a timestamp of now.
*
* @param value the measurement value
* @return the metric
*/
public Builder setDataPoint(long value) {
return innerAddDataPoint(System.currentTimeMillis(), value);
}
public Builder setDataPoint(long timestamp, long value) {
return innerAddDataPoint(timestamp, value);
}
/**
* Adds the data point to the metric.
*
* @param timestamp when the measurement occurred
* @param value the measurement value
* @return the metric
*/
public Builder setDataPoint(long timestamp, double value) {
return innerAddDataPoint(timestamp, value);
}
/**
* Adds the data point to the metric with a timestamp of now.
*
* @param value the measurement value
* @return the metric
*/
public Builder setDataPoint(double value) {
return innerAddDataPoint(System.currentTimeMillis(), value);
}
/**
* Adds the data point to the metric.
*
* @param timestamp when the measurement occurred
* @param value the measurement value
* @return the metric
*/
public Builder setDataPoint(long timestamp, String value) {
return innerAddDataPoint(timestamp, value);
}
/**
* Adds the data point to the metric with a timestamp of now.
*
* @param value the measurement value
* @return the metric
*/
public Builder setDataPoint(String value) {
return innerAddDataPoint(System.currentTimeMillis(), value);
}
public Metric build() {
metric.validate();
return metric;
}
}
}
Counter和Guage和Label其实是特殊的Metric,这里只是提供了一个便利的接口。
比如,我们要统计接口的调用频率和耗时,那么可以在Spring MVC的Interceptor中这么做:
/**
* 每个方法处理时间监控。
*
* @author zhengzhibin
*
*/
public class PerformanceMonitorInterceptor extends HandlerInterceptorAdapter {
private static final Logger logger = Logger.getLogger(PerformanceMonitorInterceptor.class);
@Autowired(required = false)
private Agent agent;
@Override
public boolean preHandle(HttpServletRequest request,
HttpServletResponse response, Object handler) throws Exception {
long startTime = System.currentTimeMillis();
request.setAttribute("startTime", startTime);
return super.preHandle(request, response, handler);
}
public void postHandle(HttpServletRequest request,
HttpServletResponse response, Object handler,
ModelAndView modelAndView) throws Exception {
if (handler instanceof HandlerMethod) {
HandlerMethod method = (HandlerMethod) handler;
long startTime = (Long) request.getAttribute("startTime");
long processTime = System.currentTimeMillis() - startTime;
if (agent == null) {
logger.warn("Agent is null, metric report disabled!");
}
// send to monitor server
String controllerMethod = method.getBean().getClass()
.getCanonicalName()
+ "." + method.getMethod().getName();
Map<String, String> tags = getGeneralParams(request);
tags.put("controller_method", controllerMethod);
agent.guage("URI", request.getRequestURI(), processTime, tags);
}
}
private Map<String, String> getGeneralParams(HttpServletRequest request) {
Map<String, String> gp = new HashMap<String, String>();
putIfNotEmpty(gp, "language", request.getParameter("la"));
putIfNotEmpty(gp, "country", request.getParameter("co"));
putIfNotEmpty(gp, "uuid", request.getParameter("uuid"));
return gp;
}
public void putIfNotEmpty(Map<String, String> map, String key, String value) {
if (StringUtils.isNotEmpty(value)) {
map.put(key, value);
}
}
}
然后在监控平台就可以看到每一次调用的情况了。可以查看某个时间段的某个metric的统计数据。不过现在的数据展示方面还是比较弱的。metric之间没有关联性查询。
另外,尽管Agent上报方式其实蛮简单的,不过更简单的方式是注解上报,后面会考虑加上。
发挥你的想象力,把它用起来:)
感兴趣的同学可以参考这篇文章: Observability at Twitter。介绍了Twitter的监控体系,其中就有我们现在在做的metric监控平台。