SNMP Monitoring on AWS Lambda

This week I was assigned a task: monitor a Service Provider router that carries a critical service to a remote partner located behind a Direct Connect (DX).

This monitoring will help us understand the reach of a failure when we have a service interruption: “Is the problem limited to the partner?”, “Is the problem reaching the Service Provider?” or “Is the problem already occurring inside AWS at the DX level?”

Since our monitoring of this service is centralized on CloudWatch, I didn’t want to add an extra layer of management such as an EC2 instance or a container running an SNMP solution like ZABBIX, so I chose AWS Lambda to build an SNMP monitor in Python with a module called PySNMP (https://pypi.org/project/pysnmp/).

The scenario has some issues:

So I have to attach the Lambda function to the VPC, which gives the function an ENI with a private IP.

The subnets here use an IGW to access the Internet, but a VPC-attached Lambda function requires a NAT Gateway to reach public IP addresses, and that would incur some extra cost.

That would force me to use some kind of name resolution inside the Lambda function. I ran some successful tests with the dnspython module, but this would increase the complexity of the solution.

To put the monitoring information on CloudWatch we need to create some CloudWatch custom metrics. I have always created custom metrics from a Lambda function using put_metric_data from the Python SDK (boto3), but this time I have the issues I mentioned before: “To get to the AWS API I would need a NAT Gateway” and “To resolve the API URL I need DNS resolution”.

While I was running some tests, I found a feature of CloudWatch Logs that creates a custom metric when you write log entries in a specific JSON format (https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format_Specification.html).

This feature is great in this scenario, since every Lambda function already generates CloudWatch log entries, which eliminates the need for my Lambda function to reach the Internet.

So my only task was to get the information from the Service Provider device and do a simple print() with the information in the correct format.

Initially I polled the following OIDs from the Service Provider router:

ifInOctets on the Router’s SP side interface: 1.3.6.1.2.1.2.2.1.10.1

ifOutOctets on the Router’s SP side interface: 1.3.6.1.2.1.2.2.1.16.1

ifInOctets on the Router’s Partner side interface: 1.3.6.1.2.1.2.2.1.10.6

ifOutOctets on the Router’s Partner side interface: 1.3.6.1.2.1.2.2.1.16.6

bgpPeerState of the Router’s SP side interface: 1.3.6.1.2.1.15.3.1.2.198.51.100.4

bgpPeerState of the Router’s Partner side interface: 1.3.6.1.2.1.15.3.1.2.203.0.113.17
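Polling these OIDs could look roughly like the sketch below. It is not the original function: it assumes the synchronous high-level API of pysnmp 4.x (`pysnmp.hlapi.getCmd`) and SNMP v2c, and the names `snmp_get` and `OIDS`, the `public` community and the `192.0.2.1` address are hypothetical. Other pysnmp releases may expose a different API.

```python
# Hypothetical labels for the OIDs listed above.
OIDS = {
    "sp_ifInOctets": "1.3.6.1.2.1.2.2.1.10.1",
    "sp_ifOutOctets": "1.3.6.1.2.1.2.2.1.16.1",
    "partner_ifInOctets": "1.3.6.1.2.1.2.2.1.10.6",
    "partner_ifOutOctets": "1.3.6.1.2.1.2.2.1.16.6",
    "sp_bgpPeerState": "1.3.6.1.2.1.15.3.1.2.198.51.100.4",
    "partner_bgpPeerState": "1.3.6.1.2.1.15.3.1.2.203.0.113.17",
}


def snmp_get(community, host, oid, port=161, timeout=2, retries=1):
    """Return the integer value of a single OID using one SNMP GET."""
    # Assumption: pysnmp 4.x synchronous hlapi. Imported inside the function
    # so the module still loads where pysnmp is not packaged.
    from pysnmp.hlapi import (
        CommunityData,
        ContextData,
        ObjectIdentity,
        ObjectType,
        SnmpEngine,
        UdpTransportTarget,
        getCmd,
    )

    error_indication, error_status, _error_index, var_binds = next(
        getCmd(
            SnmpEngine(),
            CommunityData(community, mpModel=1),  # mpModel=1 selects SNMP v2c
            UdpTransportTarget((host, port), timeout=timeout, retries=retries),
            ContextData(),
            ObjectType(ObjectIdentity(oid)),
        )
    )
    if error_indication:  # transport-level problem, e.g. a timeout
        raise RuntimeError(str(error_indication))
    if error_status:  # error reported by the SNMP agent
        raise RuntimeError(error_status.prettyPrint())
    _name, value = var_binds[0]
    return int(value)


# Example call (placeholder community and address):
# lan_input = snmp_get("public", "192.0.2.1", OIDS["sp_ifInOctets"])
```

Raising on errors instead of returning a default keeps a failed poll from being published as a misleading zero.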

To get this function working correctly you have to upload a .zip package containing the code plus the pysnmp, pyasn1 and snmpclitools modules.

The code I used in this Lambda function is here; it has a Python function that is called with arguments such as the SNMP community, destination IP address, SNMP port and OID.

The function returns the value of the OID, which is then printed using the Embedded Metric Format:

print('{"_aws": {"Timestamp":' + str(curTime) + ',"CloudWatchMetrics": [{"Namespace": "service-mon","Dimensions": [["Link Usage"]],"Metrics": [{"Name":"Interface ifInOctets","Unit": "Bits"}]}]},"Link Usage": "SP Interface SNMP ifInOctets","Interface ifInOctets": ' + str(lanInput) + '}')
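As an alternative to concatenating strings, the same entry could be built as a dictionary and serialized with `json.dumps`, which avoids quoting mistakes. This is a minimal sketch, not the original code; `emf_record` and its parameters are hypothetical names, and the value 123456 is a placeholder. The Embedded Metric Format expects the timestamp in milliseconds since the epoch.

```python
import json
import time


def emf_record(namespace, dimension, dimension_value, metric, value,
               unit="None", timestamp_ms=None):
    """Build one Embedded Metric Format log line as a JSON string."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)  # EMF uses epoch milliseconds
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    "Dimensions": [[dimension]],
                    "Metrics": [{"Name": metric, "Unit": unit}],
                }
            ],
        },
        # The dimension and the metric are referenced by key at the top level.
        dimension: dimension_value,
        metric: value,
    }
    return json.dumps(record)


# Same namespace, dimension and metric names as the print() above.
print(emf_record("service-mon", "Link Usage", "SP Interface SNMP ifInOctets",
                 "Interface ifInOctets", 123456, unit="Bits"))
```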

That generates the following CloudWatch log entry:

Now we just need to create a CloudWatch Events trigger to call this function at the desired rate, e.g. every minute.
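If you prefer to script the trigger instead of using the console, a sketch along these lines might work. It assumes boto3 clients for `events` and `lambda` are created somewhere that can reach the AWS API (a workstation or pipeline, not the VPC-attached function itself). The function name `schedule_function`, the rule name and the target id are hypothetical.

```python
def schedule_function(events_client, lambda_client, function_arn,
                      rule_name="snmp-monitor-every-minute",
                      schedule="rate(1 minute)"):
    """Create a scheduled rule and point it at the Lambda function."""
    # 1. Create (or update) the scheduled rule.
    rule_arn = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=schedule,
        State="ENABLED",
    )["RuleArn"]

    # 2. Allow the rule to invoke the function. This call fails if a
    #    statement with the same id already exists on the function.
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=rule_name + "-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )

    # 3. Register the function as the rule's target.
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "snmp-monitor", "Arn": function_arn}],
    )
    return rule_arn


# Example wiring (requires boto3 and AWS credentials):
# import boto3
# schedule_function(boto3.client("events"), boto3.client("lambda"),
#                   "arn:aws:lambda:REGION:ACCOUNT_ID:function:FUNCTION_NAME")
```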

The print() to CloudWatch Logs will generate a custom metric as specified in the JSON format:

ifInOctets and ifOutOctets are Counter32 metrics, i.e. cumulative counters that only increase, so we need to do some math to get the traffic rate from the raw values.

Initially we have something like this:

So we can use “Add math expression” to apply a RATE(METRICS()) expression to these metrics and get the correct information.
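RATE returns the per-second change between consecutive datapoints. To make the arithmetic concrete, here is a small sketch of the same calculation done by hand. It adds two things that are not part of the original setup: handling of the Counter32 wrap (the counter restarts at zero after 2^32 - 1) and a conversion from octets to bits. The function names are hypothetical.

```python
COUNTER32_MODULUS = 2 ** 32  # a Counter32 wraps to 0 after 2**32 - 1


def counter32_delta(previous, current):
    """Octets counted between two samples, allowing for one counter wrap."""
    return (current - previous) % COUNTER32_MODULUS


def bits_per_second(previous, current, interval_seconds):
    """Average traffic rate between two samples, in bits per second."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return counter32_delta(previous, current) * 8 / interval_seconds


# Two samples taken 60 seconds apart: 1500 octets -> 12000 bits -> 200 bit/s.
print(bits_per_second(1000, 2500, 60))

# The counter wrapped between the samples: the delta is still 1000 octets.
print(counter32_delta(4294967000, 704))
```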

The bgpPeerState metric returns a value that depends on the state of the BGP session. The only healthy state is bgpPeerState=6 (established), so any other value should raise an alarm.

So, once more, I used a math expression to show a BGP status of “1” when bgpPeerState=6 and “0” for any other value:
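The mapping itself is simple. In CloudWatch metric math it could be written as something like IF(m1 == 6, 1, 0), assuming m1 is the id of the bgpPeerState metric (the exact expression used here is not shown). The same logic in Python, with the state names defined by the BGP4-MIB, might look like this; `bgp_status` and `BGP_PEER_STATES` are hypothetical names.

```python
# bgpPeerState values as defined in the BGP4-MIB.
BGP_PEER_STATES = {
    1: "idle",
    2: "connect",
    3: "active",
    4: "opensent",
    5: "openconfirm",
    6: "established",
}


def bgp_status(peer_state):
    """Return 1 when the session is established (state 6), otherwise 0."""
    return 1 if peer_state == 6 else 0


for state, name in BGP_PEER_STATES.items():
    print(state, name, bgp_status(state))
```

An alarm on this derived metric can then fire whenever the status is below 1.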

And we can put all this information on a dashboard for connectivity monitoring.

As you can see in the image above, the DX metric and the SNMP metric closely match each other.

Cloud, Network and DevOps Engineer, 7 x AWS Certified