Internet-Draft Network Incident Management October 2024
Hu, et al. Expires 13 April 2025
Workgroup:
NMOP Working Group
Internet-Draft:
draft-ietf-nmop-network-incident-yang-latest
Published:
Intended Status:
Standards Track
Expires:
Authors:
T. Hu
CMCC
L. M. C. Murillo
Telefonica I+D
Q. Wu
Huawei
N. Davis
Ciena
C. Feng

A YANG Data Model for Network Incident Management

Abstract

A network incident refers to an unexpected interruption of a network service, degradation of a network service quality, or sub-health of a network service. Different data sources, including alarms, metrics, and other anomaly information, can be aggregated into a small number of network incidents through data correlation analysis and service impact analysis.

This document defines a YANG module for network incident lifecycle management. The module is meant to provide a standard way to report, diagnose, and help resolve network incidents for the sake of network service health and root cause analysis.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 13 April 2025.

1. Introduction

[RFC8969] defines a framework for automating service and network management with YANG [RFC7950] across the full network management life cycle. A set of YANG data models has already been developed in the IETF for network performance monitoring and fault monitoring. For example, the YANG data model for alarm management [RFC8632] defines a standard interface for alarm management, and the data model for Network and VPN Service Performance Monitoring [RFC9375] defines a standard interface for network performance management. In addition, the distributed tracing mechanism defined in [W3C-Trace-Context] can be used to analyze and debug operations, such as configuration transactions, across multiple distributed systems.

However, these YANG data models for network maintenance are based on specific data sources and manage alarms and performance metrics separately, at different layers and in different management systems. In addition, the frequency and quantity of alarms and performance metrics reported to the Operations Support System (OSS) have increased dramatically (in many cases by multiple orders of magnitude) with the growth of service types and complexity, and greatly overwhelm OSS platforms. With the known dependency relations between metrics, alarms, and events at each layer (e.g., packet layer or optical layer), it is possible to compress a series of alarms into fewer network incidents, and many solutions in the market today do this to some degree. However, conventional solutions such as data compression are time-consuming and labor-intensive and usually rely on maintenance engineers' experience for data analysis, which in many cases results in low processing efficiency, inaccurate root cause identification, and duplicated tickets. It is also difficult to assess the impact of alarms, performance metrics, and other anomaly data on network services without known relations across the layers of the network topology or with other network topology data.

To address these challenges, a network-wide, incident-centric solution is specified to establish the dependency relations with both network services and network topology at different layers. It can be used at a specific layer in a domain and can also span layers for multi-layer network troubleshooting.

A network incident refers to an undesired occurrence such as an unexpected interruption of a network service, degradation of a network service quality, or sub-health of a network service [I-D.ietf-nmop-terminology][TMF724A]. Different data sources, including alarms, metrics, and other anomaly information, can be aggregated into one or a few network incidents, irrespective of layer, through correlation analysis and service impact analysis. For example, if a protocol-related interface fails to work properly, a large number of alarms may be reported to the upper-layer management system, since many network services may be affected by the interface, but only one aggregated network incident pertaining to the abnormal interface will be reported. A network incident may also be raised through the analysis of network performance metrics. For example, as described in SAIN [RFC9417], network services can be decomposed into several sub-services, and specific metrics are monitored for each sub-service. Symptoms occur if services or sub-services are unhealthy (after analyzing the metrics), and these symptoms may raise a network incident when they cause degradation of the network services.

In addition, Artificial Intelligence (AI) and Machine Learning (ML) are key technologies in the processing of large amounts of data with complex data correlations. For example, Neural Network Algorithm or Hierarchy Aggregation Algorithm can be used to replace manual alarm data correlation. Through online and offline self-learning, these algorithms can be continuously optimized to improve the efficiency of fault diagnosis.

This document defines a YANG data model for network incident lifecycle management, which improves troubleshooting efficiency, ensures network service quality, and improves network automation [RFC8969].

2. Conventions and Definitions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

The following terms are defined in [RFC8632], [I-D.ietf-nmop-terminology] and are not redefined here:

The following terms are defined in this document:

Network incident:

An undesired occurrence such as an unexpected interruption of a network service, degradation of a network service quality, or sub-health of a network service [TMF724A]. A network incident is a single unplanned event that causes network service interruption. A problem is one cause or potential cause of one or more network incidents. Repeated network incidents can be raised as a problem.

Incident management:

Lifecycle management of network incidents, including network incident identification, reporting, acknowledgement, diagnosis, and resolution. Unlike fault management, it takes various data sources, including alarms, metrics, and other anomaly information, and aggregates them into one or a few network incidents, irrespective of layer, through correlation analysis and service impact analysis. One fault on a network device can raise one network incident or cause multiple network incidents, e.g., multiple service offerings that depend on that device will go down and others may suffer increased latency as redundant routes become more congested.

Incident management system:

An entity which implements network incident management. It includes (but is not limited to) an incident server and an incident client.

Incident server:

An entity which is responsible for detecting and reporting network incidents and for performing network incident diagnosis, resolution, prediction, etc.

Incident client:

An entity which can manage network incidents. For example, it can receive network incident notifications, query the information of network incidents, and instruct an incident server to diagnose or help resolve them.

Incident handler:

An entity which can receive network incident notifications and store and query the information of network incidents for data analysis.

3. Sample Use Cases

3.1. Incident-Based Trouble Tickets Dispatching

Usually, the dispatching of trouble tickets in a network is mostly based on alarm data analysis and needs to involve operators' maintenance engineers. These engineers are responsible for monitoring, detecting, and correlating alarms, e.g., alarms at both endpoints of a specific tunnel, or at both the optical and IP layers, that are associated with the same network fault. They then correlate these alarms to the same trouble ticket, a process with a low degree of automation. As the number of alarms grows, the human cost of network maintenance increases accordingly.

Some operators preconfigure accept-lists and adopt coarse-grained data correlation rules for alarm management. This approach seems to improve fault management automation. However, some trouble tickets might be missed if the filtering conditions are too strict. If the filtering conditions are not strict enough, multiple trouble tickets might be dispatched for the same network fault. It is hard to achieve a good balance between network management automation and duplicated trouble tickets under conventional working conditions.

With the help of network incident management, massive numbers of alarms can be aggregated into a few network incidents based on service impact analysis, so the number of trouble tickets is reduced. At the same time, the efficiency of network troubleshooting can be largely improved, which addresses the pain point of traditional trouble ticket dispatching.

3.2. Incident Derivation from L3VPN Services Unavailability

The Service Attachment Points (SAPs) defined in [RFC9408] represent the network reference points where network services can be delivered or are being delivered to customers.

Service Level Objectives (SLOs) can be used to characterize the ability of a particular set of nodes to communicate according to certain measurable expectations [I-D.ietf-ippm-pam]. For example, a Service Level Agreement (SLA) might state that any given SLO applies to at least a certain percentage of packets, allowing a certain level of packet loss and of packets exceeding the delay threshold. For example, an SLA might establish a multi-tiered SLO of end-to-end latency as follows:

  • Not to exceed 30 ms for any packet.

  • Not to exceed 25 ms for 99.999% of packets.

  • Not to exceed 20 ms for 99% of packets.

This SLA information can be bound to two or more SAPs defined in [RFC9408], so that the service orchestration layer can use these interfaces to commit the delivery of a service on a specific point-to-point or point-to-multipoint service topology. When a specific threshold level of an SLO is violated, a specific network incident associated with, for example, an L3VPN service will be derived.
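
The multi-tiered SLO above can be checked mechanically against measured per-packet latencies. The following is a minimal sketch under stated assumptions and is not part of the data model defined in this document: the names SLO_TIERS and violated_tiers and the sample data are hypothetical, and a real system would derive an incident from the returned violations through its own rules.

```python
# Illustrative only: the tiers mirror the example SLO in the text.
# Each tier is (latency bound in ms, minimum fraction of packets within the bound).
SLO_TIERS = [(30.0, 1.0), (25.0, 0.99999), (20.0, 0.99)]


def violated_tiers(latencies_ms, tiers=SLO_TIERS):
    """Return the tiers whose required share of packets is not met."""
    if not latencies_ms:
        return []  # no measurements, nothing to evaluate
    total = len(latencies_ms)
    violated = []
    for bound_ms, min_fraction in tiers:
        within = sum(1 for value in latencies_ms if value <= bound_ms)
        if within / total < min_fraction:
            violated.append((bound_ms, min_fraction))
    return violated
```

A non-empty result would be the trigger for deriving a network incident for the affected service; an empty result means all tiers are met for the sampled packets.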

3.3. Multi-layer Fault Demarcation

When a fault occurs in a network that contains both packet-layer devices and optical-layer devices, it may cause correlated faults in both layers, i.e., the packet layer and the optical layer. Specifically, fault propagation can be classified into three typical types. First, a fault that occurs at a packet-layer device causes a further fault (e.g., a Wavelength Division Multiplexing (WDM) client fault) at an optical-layer device. Second, a fault that occurs at an optical-layer device causes a further fault (e.g., Layer 3 link down) at a packet-layer device. Third, a fault that occurs at the inter-layer link between a packet-layer device and an optical-layer device causes further faults at both devices. Multiple operation teams are usually needed to first analyze a huge number of alarms (triggered by the above-mentioned faults) from a single network layer (either the packet layer or the optical layer) independently, and then cooperate to locate the root cause by manually analyzing multi-layer topology data and service data. Fault demarcation is therefore more complex and time-consuming in a multi-layer scenario than in a single-layer scenario.

With the help of network incident management, the management systems first automatically analyze the root cause of the alarms at each single network layer and report the corresponding network incidents to the multi-layer, multi-domain management system. That management system then comprehensively analyzes the topology relationship and service relationship between the root causes of both layers. The inner relationship among the alarms is identified, and finally the root cause is located among the multiple layers. By cooperating with the Optical Time-Domain Reflectometer (OTDR) integrated in the network device, the target optical exchange station can be determined before site visits. Therefore, the overall fault demarcation process is simplified and automated, and the analysis result can be reported and visualized in time. In this case, operation teams only have to confirm the analysis result and dispatch site engineers to perform the relevant maintenance actions (e.g., splicing fiber) based on the root cause.

4. Network Incident Management Architecture

     +------------------------------------------+
     |                                          |
     |            Incident  Client              |
     |                                          |
     |                                          |
     +------------+---------+---------+---------+
        ^         |         |         |
        |Incident |Incident |Incident |Incident
        |Report   |Ack      |Diagnose |Resolve
        |         |         |         |
        |         V         V         V
     +--+----------------------------------------+
     |                                           |
     |                                           |
     |             Incident  Server              |
     |                                           |
     |                                           |
     |                                           |
     |                                           |
     +-------------------------------+-----------+
           ^       ^Abnormal         ^
           |Alarm  |Operations       |Metrics
           |Report |Report           |/Telemetry
           |       |                 V
+----------+-------+------------------------------------+
|                                                       |
|                     Network                           |
|                                                       |
+-------------------------------------------------------+

Figure 1: Network Incident Management Architecture

Figure 1 illustrates the network incident management architecture. The two key components for incident management are the incident client and the incident server.

The incident server can be deployed in network analytics platforms or controllers and provides functionalities such as network incident identification, reporting, diagnosis, resolution, and querying for network incident lifecycle management.

The incident client can be deployed either in the same network analytics platform or controller as the incident server within a single domain, or in an upper-layer network analytics platform or controller (e.g., a multi-domain controller). It invokes the functionalities provided by the incident server to meet the business requirements of fault management. The entire network incident lifecycle management can be independent of, or not under the control of, the network OSS or other business systems of operators.

A typical workflow of network incident management is as follows:

4.1. Interworking with Alarm Management

            +-----------------------------+
            |         OSS                 |
            |+-------+      +-----------+ |
            ||Alarm  |      | Incident  | |
            ||handler|      |  handler  | |
            |+-------+      +-----------+ |
            +---^---------------^---------+
                |               |
                |alarm          |incident
            +---|---------------|---------+
            |   |  controller   |         |
            |   |               |         |
            |+--+----+      +-----------+ |
            ||Alarm  |      |  Incident | |
            ||process+----->|   Process | |
            ||       |alarm |           | |
            |+-------+      +-----------+ |
            |   ^              ^          |
            +---|--------------|----------+
                |alarm         | metrics/trace/etc.
                |              |
            +---+--------------+----------+
            |         Network             |
            |                             |
            +-----------------------------+
Figure 2: Interworking with Alarm Management

A YANG model for alarm management [RFC8632] defines a standard interface to manage the lifecycle of alarms. Alarms represent the undesirable state of network resources. The alarm data model also defines root cause and impacted services fields, but lower-layer systems (mainly at the device level) may lack sufficient information to determine them, so alarms do not always tell the status of services or the root causes. As described in [RFC8632], alarm management acts as a starting point for high-level fault management. Network incident management, in contrast, often works at the network level, so it is possible to have enough information to perform correlation and service impact analysis. Alarms can serve as one of the data sources of network incident management and may be aggregated into a few network incidents by correlation analysis; the network service impact and root causes may be determined during incident processing.

A network incident also contains some related alarms. If needed, users can query the information of these alarms through the alarm management interface [RFC8632]. In some cases, e.g., a cutover scenario, the incident server may use the alarm management interface [RFC8632] to shelve some alarms.

Alarm management may keep its original process: alarms are reported from the network to the network controller or analytics platform and then reported to the upper-layer system (e.g., the alarm handler within the OSS).

Similarly, network incidents are reported from the network to the network controller or analytics platform and then reported to the upper-layer system (e.g., the incident handler within the OSS). The upper-layer system may store these network incidents and provide the information for fault analysis (e.g., deeper analysis based on network incidents).

Unlike alarm management, the incident process within the controller, which comprises both incident client and incident server functionalities, provides not only network incident reporting but also diagnosis and resolution functions. It can therefore support self-healing and may be helpful for single-domain closed-loop control.

Incident management is not a substitute for alarm management. Instead, they can work together to implement fault management.

4.2. Interworking with SAIN

SAIN [RFC9417] defines an architecture of network service assurance.

           +----------------+
           |Incident handler|
           +----------------+
                   ^
                   |incident
           +-------+--------+
           |Incident process|
           +----------------+
                   ^
                   |symptoms
           +-------+--------+
           |     SAIN       |
           |                |
           +----------------+
                   ^
                   |metrics
     +-------------+-------------+
     |                           |
     |         Network           |
     |                           |
     +---------------------------+

Figure 3: Interworking with SAIN

A network service can be decomposed into sub-services, and metrics can be monitored for each sub-service. For example, a tunnel service can be decomposed into peer tunnel interface sub-services and an IP connectivity sub-service. If metrics are evaluated as indicating that a specific sub-service is unhealthy, symptoms will be present. The incident process, comprising both incident client and incident server functionalities, may identify the network incident based on the symptoms and then report it to the incident handler within the Operations Support System (OSS). SAIN can thus be one way to identify network incidents: services, sub-services, and metrics can be preconfigured via the APIs defined by the service assurance YANG model [RFC9418], and a network incident will be reported if the symptoms match the condition of the network incident.

4.3. Relationship with RFC8969

[RFC8969] defines a framework for network automation using YANG. This framework breaks down YANG modules into three layers (service layer, network layer, and device layer) and covers service deployment, service optimization/assurance, and service diagnosis. Network incident management works at the network layer and aggregates alarms, metrics, and other information from the device layer, which helps to provide service assurance. Network incident diagnosis may be one way of performing service diagnosis.

4.4. Relationship with Trace Context

W3C defines a common trace context [W3C-Trace-Context] for distributed system tracing. [I-D.rogaglia-netconf-trace-ctx-extension] defines a NETCONF extension for [W3C-Trace-Context], and [I-D.quilbeuf-opsawg-configuration-tracing] defines a mechanism for configuration tracing. If errors occur while services are being deployed, distributed system tracing makes these errors easier to identify, and a network incident should be reported.

5. Functional Interface Requirements between the Client and the Server

5.1. Incident Identification

As depicted in Figure 4, multiple alarms, metrics, or a combination of both can be aggregated into a network incident after analysis.

            +--------------+
         +--|  Incident1   |
         |  +--+-----------+
         |     |  +-----------+
         |     +--+  alarm1   |
         |     |  +-----------+
         |     |
         |     |  +-----------+
         |     +--+  alarm2   |
         |     |  +-----------+
         |     |
         |     |  +-----------+
         |     +--+  alarm3   |
         |        +-----------+
         |  +--------------+
         +--|  Incident2   |
         |  +--+-----------+
         |     |  +-----------+
         |     +--+  metric1  |
         |     |  +-----------+
         |     |  +-----------+
         |     +--+  metric2  |
         |        +-----------+
         |
         |  +--------------+
         +--|  Incident3   |
            +--+-----------+
               |  +-----------+
               +--+ alarm1    |
               |  +-----------+
               |
               |  +-----------+
               +--+ metric1   |
                  +-----------+
Figure 4: Incident Identification

The network incident management server MUST be capable of identifying network incidents. Multiple alarms, metrics, and other information are reported to the incident server, and the server must analyze them and find the correlations among them. If the correlations match the network incident rules, a network incident is identified and reported to the client. Service impact analysis is performed if an incident is identified, and the content of the network incident is updated if impacted network services are detected.

AI/ML may be used to identify network incidents. Expert systems and online learning can help AI identify the correlation of alarms, metrics, and other information by means of time-based correlation algorithms, topology-based correlation algorithms, etc. For example, if an interface is down, many protocol alarms will be reported, and the AI will infer that these alarms are correlated. These correlations are put into a knowledge base, so that the network incident can be identified faster the next time.
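
As an illustration of the time-based correlation mentioned above, the following minimal sketch groups alarms that arrive close together in time. It is a simple heuristic, not an AI/ML algorithm and not a method defined by this document; the function name group_by_time_window, the 5-second default window, and the alarm names are assumptions made for the example.

```python
def group_by_time_window(alarms, window_seconds=5.0):
    """Group (timestamp, name) alarms into bursts.

    An alarm joins the current group when it arrives no later than
    window_seconds after the previous alarm; otherwise it starts a new group.
    """
    groups = []
    for timestamp, name in sorted(alarms):
        if groups and timestamp - groups[-1][-1][0] <= window_seconds:
            groups[-1].append((timestamp, name))
        else:
            groups.append([(timestamp, name)])
    return groups
```

Each resulting group is only a candidate correlation. Topology and service information would still be needed to confirm that the alarms in a group share a cause.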

As mentioned above, SAIN is another way to implement network incident identification. Trace context defined in [W3C-Trace-Context] may be helpful for network incident identification.

         +----------------------+
         |                      |
         |     Orchestrator     |
         |                      |
         +----+-----------------+
              ^VPN A Unavailable
              |
          +---+----------------+
          |                    |
          |     Controller     |
          |                    |
          |                    |
          +-+-+------------+---+
            ^ ^            ^
        IGP | |Interface   |IGP Peer
       Down | |Down        | Abnormal
            | |            |
VPN A       | |            |
+-----------+-+------------+------------------------+
| \  +---+       ++-++         +-+-+        +---+  /|
|  \ |   |       |   |         |   |        |   | / |
|   \|PE1+-------| P1+X--------|P2 +--------|PE2|/  |
|    +---+       +---+         +---+        +---+   |
+---------------------------------------------------+
Figure 5: Example 1 of Network Incident Identification

As described in Figure 5, VPN A is deployed from PE1 to PE2. If an interface of P1 goes down, many alarms are triggered, such as interface down, IGP down, and IGP peer abnormal from P2.

These alarms are aggregated and analyzed by the controller/incident management server, which then triggers the network incident 'VPN A unavailable'.
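
The aggregation in this example can be sketched as a lookup from alarming nodes to the services that traverse them. This is a simplified illustration under stated assumptions: SERVICE_PATHS, SERVICE_AFFECTING, and derive_incidents are hypothetical names, the alarm type strings are invented, and a real incident server would apply full service impact analysis rather than a static rule.

```python
# Hypothetical inventory: the nodes each service traverses (from Figure 5).
SERVICE_PATHS = {"VPN A": ["PE1", "P1", "P2", "PE2"]}

# Hypothetical rule: alarm types assumed to make a traversing service unavailable.
SERVICE_AFFECTING = {"interface-down", "igp-down", "igp-peer-abnormal"}


def derive_incidents(alarms, service_paths=SERVICE_PATHS):
    """Aggregate (node, alarm_type) alarms into one incident per impacted service."""
    incidents = {}
    for node, alarm_type in alarms:
        if alarm_type not in SERVICE_AFFECTING:
            continue  # ignore alarms the rule does not cover
        for service, path in service_paths.items():
            if node in path:
                incident = incidents.setdefault(
                    service, {"name": f"{service} unavailable", "events": []}
                )
                incident["events"].append((node, alarm_type))
    return list(incidents.values())
```

With the three alarms of Figure 5 as input, the sketch yields a single incident for VPN A that carries all three alarms as related events.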

Note that the incident management server can rely on data correlation technology, such as service impact analysis and data analytics components, to evaluate the real effect on the relevant service and understand whether a lower-level or device-level network anomaly, e.g., IGP down, has an impact on the service.

                +----------------------+
                |                      |
                |     Orchestrator     |
                |                      |
                +----+-----------------+
                         ^VPN A Degradation
                         |
                 +-------+------------+
                 |                    |
                 |     controller     |
                 |                    |
                 |                    |
                 +--+------------+----+
                    ^            ^
                    |Packet      |Path Delay
                    |Loss        |
                    |            |
VPN A               |            |
+-------------------+------------+-------------------+
| \  +---+       ++-++         +-+-+        +---+  / |
|  \ |   |       |   |         |   |        |   | /  |
|   \|PE1+-------|P1 +---------|P2 +--------|PE2|/   |
|    +---+       +---+         +---+        +---+    |
+----------------------------------------------------+
Figure 6: Example 2 of Network Incident Identification

As described in Figure 6, the controller collects network metrics from network elements. If it finds that the packet loss of P1 and the path delay of P2 exceed their thresholds, a network incident 'VPN A degradation' may be triggered after service impact analysis.
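
The metric-driven case can be sketched in the same spirit. The threshold values, metric names, and function names below are assumptions made for illustration only; this document does not define them, and the service impact analysis step is reduced here to "any violation on the path degrades the service".

```python
# Hypothetical thresholds; real values would come from the SLO of the service.
THRESHOLDS = {"packet-loss-percent": 1.0, "path-delay-ms": 20.0}


def find_threshold_violations(samples, thresholds=THRESHOLDS):
    """Return the (node, metric, value) samples that exceed their threshold."""
    return [
        (node, metric, value)
        for node, metric, value in samples
        if metric in thresholds and value > thresholds[metric]
    ]


def degradation_incident(service, samples):
    """Build one incident dict if any metric is out of range, else None."""
    violations = find_threshold_violations(samples)
    if not violations:
        return None
    return {"name": f"{service} degradation", "events": violations}
```

For the scenario of Figure 6, a packet loss sample from P1 and a path delay sample from P2 above their thresholds would produce one 'VPN A degradation' incident that lists both samples.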

5.2. Incident Diagnosis

After a network incident is reported to the network incident management client, the incident management client MAY diagnose the incident to determine the root cause. Some diagnosis operations may affect the running network services. The client can choose not to perform a diagnosis operation after determining that its impact is not trivial. The network incident management server can also perform self-diagnosis. However, the self-diagnosis MUST NOT affect the running network services. Possible diagnosis methods include link reachability detection, link quality detection, alarm/log analysis, and short-term fine-grained monitoring of network quality metrics.

5.3. Incident Resolution

After the root cause is diagnosed, the client MAY resolve the network incident. The client MAY choose to resolve the network incident by invoking other functions, such as a routing calculation function or a configuration function, by dispatching a ticket, or by asking the server to resolve it. Generally, the client would attempt to directly resolve the root cause. If the root cause cannot be resolved, an alternative solution SHOULD be used. For example, if a network incident is caused by a physical component failure that cannot be automatically resolved, the standby link can be used to bypass the faulty component.

The incident server monitors the status of the network incident. If the faults are fixed, the server updates the status of the network incident to 'cleared' and reports the updated network incident to the client.

Network incident resolution may affect the running network services. The client can choose not to perform those operations after determining the impact is not trivial.

6. Incident Data Model Concepts

6.1. Identifying the Incident Instance

An incident ID is used as the identifier of an incident instance. When an incident instance is identified, a new incident ID is created. The incident ID MUST be unique in the whole system.
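
One possible way to satisfy the uniqueness requirement is sketched below. The use of UUIDs and the class name IncidentIdAllocator are assumptions for illustration; this document only requires that the incident ID be a string that is unique in the whole system and does not mandate a format.

```python
import uuid


class IncidentIdAllocator:
    """Hands out incident IDs and never reuses one within this allocator."""

    def __init__(self):
        self._issued = set()

    def new_id(self):
        incident_id = str(uuid.uuid4())
        while incident_id in self._issued:  # a collision is extremely unlikely
            incident_id = str(uuid.uuid4())
        self._issued.add(incident_id)
        return incident_id
```

In a deployment with several incident servers, uniqueness across the whole system would additionally need coordination or a per-server prefix, which this sketch does not cover.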

6.2. The Incident Lifecycle

The incident model clearly separates the network incident instance lifecycle from the operator incident lifecycle:

  • Network incident instance lifecycle: The network incident instrumentation that controls when a network incident is raised, updated, and cleared.

  • Operator incident lifecycle: Operators acting upon the network incident with RPCs to acknowledge, diagnose, and resolve it.

6.2.1. Incident Instance Lifecycle

From a network incident instance perspective, a network incident can have the following lifecycle: 'raised', 'updated', 'cleared'. When a network incident instance is first generated, the status is 'raised'. If the status changes after the network incident instance is generated (for example, because of self-diagnosis, a diagnosis command issued by the client, or any other condition that causes the status to change but does not reach the 'cleared' level), the status changes to 'updated'. When a network incident is successfully resolved, the status changes to 'cleared'.
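
The instance lifecycle described above can be read as a small state machine. The following sketch is one interpretation, not normative behavior: the names ALLOWED and next_status are hypothetical, and treating 'cleared' as a final state is an assumption that the text does not state explicitly.

```python
# Allowed transitions between the three statuses named in the text.
ALLOWED = {
    "raised": {"updated", "cleared"},
    "updated": {"updated", "cleared"},
    "cleared": set(),  # assumption: a cleared instance is not reopened
}


def next_status(current, event_clears_incident):
    """Return the status after a change to an existing incident instance."""
    target = "cleared" if event_clears_incident else "updated"
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current!r} to {target!r}")
    return target
```

A newly generated instance starts in 'raised'; every later change goes through next_status, so an invalid transition is rejected instead of being silently recorded.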

6.2.2. Operator Incident Lifecycle

Operators can act upon network incidents with network incident RPCs. From an operator perspective, the lifecycle of a network incident instance includes 'acknowledged', 'diagnosed', and 'resolved'.

When a network incident instance is generated, the operator SHOULD acknowledge the network incident with the 'incident-acknowledge' RPC. The operator then attempts to diagnose the network incident with the 'incident-diagnose' RPC (for example, to find the root cause and affected components). Diagnosis is not mandatory: if the root cause and affected components are known when the network incident is generated, diagnosis is not required. After locating the root cause and affected components, the operator can try to resolve the network incident by invoking the 'incident-resolve' RPC.
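
The operator workflow above, including the optional diagnosis step, can be summarized by the following sketch. The function name plan_operator_rpcs is hypothetical; only the three RPC names come from this document, and the sketch plans the order of invocation without modeling the RPC inputs or outputs.

```python
def plan_operator_rpcs(root_cause_known):
    """Return the RPC names an operator would invoke, in order."""
    steps = ["incident-acknowledge"]
    if not root_cause_known:
        steps.append("incident-diagnose")  # skipped when the cause is already known
    steps.append("incident-resolve")
    return steps
```

When the root cause and affected components are already present in the reported incident, the plan contains only the acknowledge and resolve steps.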

7. Incident Data Model Design

7.1. Overview

There is one YANG module in the model, "ietf-incident", which defines a technology-independent abstraction of the network incident construct for alarms, logs, performance metrics, etc. The information reported in a network incident includes root cause, priority, impact, suggestion, etc.

At the top of the "ietf-incident" module is the network incident. Network incidents are represented as a list indexed by "incident-id". Each network incident is associated with a service instance, a domain, and sources. Under sources, there are one or more sources. Each source corresponds to a node defined in the network topology model and to network resources in the network device, e.g., an interface. In addition, "ietf-incident" supports one general notification to report network incident state changes and three RPCs to manage network incidents.

module: ietf-incident
  +--ro incidents
     +--ro incident* [name type incident-id]
        +--ro incident-no         uint64
        +--ro name                string
        +--ro type                identityref
        +--ro incident-id         string
        +--ro service-instance*   string
        +--ro domain              identityref
        +--ro priority            incident-priority
        +--ro status?             enumeration
        +--ro ack-status?         enumeration
        +--ro category            identityref
        +--ro detail?             string
        +--ro resolve-advice?     string
        +--ro sources
        |  ...
        +--ro root-causes
        |  ...
        +--ro root-events
        |  ...
        +--ro events
        |  ...
        +--ro raise-time?         yang:date-and-time
        +--ro occur-time?         yang:date-and-time
        +--ro clear-time?         yang:date-and-time
        +--ro ack-time?           yang:date-and-time
        +--ro last-updated?       yang:date-and-time

  rpcs:
    +---x incident-acknowledge
    |  ...
    +---x incident-diagnose
    |  ...
    +---x incident-resolve
       ...

  notifications:
    +---n incident-notification
       +--ro incident-no?
       |       -> /inc:incidents/inc:incident/inc:incident-no
       ...
       +--ro time?   yang:date-and-time

7.2. Incident Notifications

notifications:
  +---n incident-notification
         +--ro incident-no?
                         -> /inc:incidents/inc:incident/inc:incident-no
         +--ro name? string
         +--ro type? identityref
         +--ro incident-id? string
         +--ro service-instance* string
         +--ro domain? identityref
         +--ro priority? incident-priority
         +--ro status? enumeration
         +--ro ack-status? enumeration
         +--ro category? identityref
         +--ro detail? string
         +--ro resolve-advice? string
         +--ro sources
         |  +--ro source* [node-ref]
         |     +--ro node-ref  leafref
         |     +--ro network-ref?  -> /nw:networks/network/network-id
         |     +--ro resource* [name]
         |        +--ro name al:resource
         +--ro root-causes
         |  +--ro root-cause* [node-ref]
         |     +--ro node-ref  leafref
         |     +--ro network-ref?  -> /nw:networks/network/network-id
         |     +--ro resource* [name]
         |     |  +--ro name al:resource
         |     |  +--ro cause-name? string
         |     |  +--ro detail? string
         |     +--ro cause-name? string
         |     +--ro detail? string
         +--ro root-events
         |  +--ro root-event* [type event-id]
         |     +--ro type -> ../../../events/event/type
         |     +--ro event-id leafref
         +--ro events
         |  +--ro event* [type event-id]
         |     +--ro type identityref
         |     +--ro event-id string
         |     +--ro (event-type-info)?
         |        +--:(alarm)
         |        |  +--ro alarm
         |        |     +--ro resource? leafref
         |        |     +--ro alarm-type-id? leafref
         |        |     +--ro alarm-type-qualifier? leafref
         |        +--:(notification)
         |        +--:(log)
         |        +--:(KPI)
         |        +--:(unknown)
         +--ro time? yang:date-and-time

A general notification, incident-notification, is provided here. When a network incident instance is identified, the notification is sent. After that, if the network incident management server performs self-diagnosis, or the client uses the interfaces provided by the server to trigger diagnosis and resolution actions, an updated notification is sent, for example, when the root cause objects and affected objects are updated. When a network incident is successfully resolved, the status of the network incident is set to 'cleared'.
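On the client side, the notification behavior above can be sketched as merging each received incident-notification into a local view of the incidents. This is a minimal sketch, not a definitive client: it assumes that notifications have already been decoded into Python dicts keyed by leaf name and that "incident-no" is present (the leaf is optional in the module); the name apply_notification is hypothetical.

```python
def apply_notification(cache, notification):
    """Merge one decoded incident-notification into a local cache.

    'cache' maps incident-no to the latest known leaf values.
    Leaves absent from the notification keep their previous value.
    """
    incident_no = notification["incident-no"]  # assumed to be present
    entry = cache.setdefault(incident_no, {})
    entry.update(notification)
    return entry
```

With this approach a 'raised' notification creates the entry, later 'updated' notifications refresh individual leaves such as the root causes, and a 'cleared' notification leaves the entry in place with its final status.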

7.3. Incident Acknowledge

+---x incident-acknowledge
|  +---w input
|  |  +---w incident-no*
|  |          -> /inc:incidents/inc:incident/inc:incident-no

After an incident is generated, updated, or cleared, the operator needs to acknowledge the incident to confirm that the client is aware of it. In some scenarios where automatic diagnosis and resolution are supported, the status of an incident may be updated multiple times, or the incident may even be resolved automatically.

The incident-acknowledge RPC can acknowledge multiple incidents at a time.
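Because "incident-no" is a leaf-list in the RPC input, one request can carry several incident numbers. The following sketch builds such a NETCONF request with the Python standard library. The element names and the module namespace come from this document; the function name build_acknowledge_rpc, the message-id value, and the incident numbers are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

NETCONF_NS = "urn:ietf:params:xml:ns:netconf:base:1.0"
INCIDENT_NS = "urn:ietf:params:xml:ns:yang:ietf-incident"


def build_acknowledge_rpc(message_id, incident_numbers):
    """Return an <rpc> document that acknowledges several incidents."""
    rpc = ET.Element(f"{{{NETCONF_NS}}}rpc",
                     {"message-id": str(message_id)})
    ack = ET.SubElement(rpc, f"{{{INCIDENT_NS}}}incident-acknowledge")
    # One <incident-no> element per leaf-list entry.
    for number in incident_numbers:
        leaf = ET.SubElement(ack, f"{{{INCIDENT_NS}}}incident-no")
        leaf.text = str(number)
    return ET.tostring(rpc, encoding="unicode")


# Acknowledge two (hypothetical) incidents in a single request.
request = build_acknowledge_rpc(101, [7, 8])
```

The same pattern would apply to incident-diagnose and incident-resolve, whose inputs have the same shape.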

7.4. Incident Diagnose

+---x incident-diagnose
|  +---w input
|  |  +---w incident-no*
|  |          -> /inc:incidents/inc:incident/inc:incident-no

After a network incident is generated, the incident-diagnose RPC can be used to diagnose the network incident and locate the root causes. On-demand diagnosis can be carried out through detection tasks such as BFD detection, flow detection, telemetry collection, short-term threshold alarms, configuration error checks, or test packet injection.

After the on-demand diagnosis is performed successfully, a separate network incident update notification is triggered to report the latest status of the network incident asynchronously.

7.5. Incident Resolution

+---x incident-resolve
|  +---w input
|  |  +---w incident-no*
|  |          -> /inc:incidents/inc:incident/inc:incident-no

After the root causes and impacts are determined, the incident-resolve RPC can be used to resolve the incident (if the server can resolve it). How to resolve an incident instance is out of the scope of this document.

The incident-resolve RPC allows multiple network incident instances to be resolved at a time. If a network incident instance is successfully resolved, a separate notification is triggered to update the network incident status to 'cleared'. If the network incident content changes during this process, an update notification is triggered.

7.6. RPC Failure

If an RPC fails, the RPC error response MUST indicate the reason for the failure. The structures defined in this document MUST be used to encode the specific errors and MUST be inserted in the error response to indicate the reason for the failure.

The tree diagrams [RFC8340] for the structures are as follows:

  structure incident-acknowledge-error-info:
    +-- incident-acknowledge-error-info
       +-- incident-no?   incident-ref
       +-- reason?        identityref
       +-- description?   string
  structure incident-diagnose-error-info:
    +-- incident-diagnose-error-info
       +-- incident-no?   incident-ref
       +-- reason?        identityref
       +-- description?   string
  structure incident-resolve-error-info:
    +-- incident-resolve-error-info
       +-- incident-no?   incident-ref
       +-- reason?        identityref
       +-- description?   string

The valid errors for each structure defined in this document are as follows:

incident-acknowledge-error-info
-----------------------------------
repeated-acknowledge

incident-diagnose-error-info
-----------------------------------
root-cause-unlocated
permission-denied
operation-timeout
resource-unavailable

incident-resolve-error-info
-----------------------------------
root-cause-unresolved
permission-denied
operation-timeout
resource-unavailable
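The list above can be captured as a lookup table that a client or server uses to check that a 'reason' value matches the structure it is carried in. This is a minimal sketch: the structure and identity names are taken from this document, while VALID_REASONS and is_valid_reason are hypothetical names, and identities are compared as plain strings without namespace prefixes for simplicity.

```python
# Reason identities allowed in each error-info structure,
# as listed in this section.
VALID_REASONS = {
    "incident-acknowledge-error-info": {
        "repeated-acknowledge",
    },
    "incident-diagnose-error-info": {
        "root-cause-unlocated",
        "permission-denied",
        "operation-timeout",
        "resource-unavailable",
    },
    "incident-resolve-error-info": {
        "root-cause-unresolved",
        "permission-denied",
        "operation-timeout",
        "resource-unavailable",
    },
}


def is_valid_reason(structure, reason):
    """Return True if 'reason' may appear in the given structure."""
    return reason in VALID_REASONS.get(structure, set())
```

Note that this table covers only the identities defined in this document; identities that other modules derive from the same bases would need to be added to the table.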

8. Network Incident Management YANG Module

This module imports types defined in [RFC6991], [RFC8345], and [RFC8632].

<CODE BEGINS> file "ietf-incident@2024-06-06.yang"

module ietf-incident {
  yang-version 1.1;
  namespace "urn:ietf:params:xml:ns:yang:ietf-incident";
  prefix inc;

  import ietf-yang-types {
    prefix yang;
    reference
      "RFC 6991: Common YANG Data Types";
  }
  import ietf-alarms {
    prefix al;
    reference
      "RFC 8632: A YANG Data Model for Alarm Management";
  }
  import ietf-network {
    prefix nw;
    reference
      "RFC 8345: A YANG Data Model for Network Topologies";
  }
  import ietf-yang-structure-ext {
    prefix sx;
    reference
      "RFC 8791: YANG Data Structure Extensions";
  }
  organization
    "IETF NMOP Working Group";
  contact
    "WG Web:   <https://datatracker.ietf.org/wg/nmop/>
     WG List:  <mailto:nmop@ietf.org>

     Author:   Chong Feng
               <mailto:frank.fengchong@huawei.com>
     Author:   Tong Hu
               <mailto:hutong@cmhi.chinamobile.com>
     Author:   Luis Miguel Contreras Murillo
               <mailto:luismiguel.contrerasmurillo@telefonica.com>
     Author:   Qin Wu
               <mailto:bill.wu@huawei.com>
     Author:   Chaode Yu
               <mailto:yuchaode@huawei.com>
     Author:   Nigel Davis
               <mailto:ndavis@ciena.com>";
  description
    "This module defines the interfaces for incident management
     lifecycle.

     This module is intended for the following use cases:
     * incident lifecycle management:
       - incident report: report incident instance to client
                          when an incident instance is detected.
       - incident acknowledge: acknowledge an incident instance.
       - incident diagnose: diagnose an incident instance.
       - incident resolve: resolve an incident instance.

     Copyright (c) 2024 IETF Trust and the persons identified as
     authors of the code.  All rights reserved.

     Redistribution and use in source and binary forms, with or
     without modification, is permitted pursuant to, and subject
     to the license terms contained in, the Revised BSD License
     set forth in Section 4.c of the IETF Trust's Legal Provisions
     Relating to IETF Documents
     (https://trustee.ietf.org/license-info).

     This version of this YANG module is part of RFC XXXX
     (https://www.rfc-editor.org/info/rfcXXXX); see the RFC
     itself for full legal notices.

     The key words 'MUST', 'MUST NOT', 'REQUIRED', 'SHALL', 'SHALL
     NOT', 'SHOULD', 'SHOULD NOT', 'RECOMMENDED', 'NOT RECOMMENDED',
     'MAY', and 'OPTIONAL' in this document are to be interpreted as
     described in BCP 14 (RFC 2119) (RFC 8174) when, and only when,
     they appear in all capitals, as shown here.";

  revision 2024-06-06 {
    description
      "Merge incident yang with incident type yang
       and fix broken ref.";
    reference
      "RFC XXXX: A YANG Data Model for Network Incident Management";
  }

  //identities

  identity incident-domain {
    description
      "The abstract identity to indicate the domain of
       an incident.";
  }

  identity single-domain {
    base incident-domain;
    description
      "single domain.";
  }

  identity access {
    base single-domain;
    description
      "access domain.";
  }

  identity ran {
    base access;
    description
      "radio access network domain.";
  }

  identity transport {
    base single-domain;
    description
      "transport domain.";
  }

  identity otn {
    base transport;
    description
      "optical transport network domain.";
  }

  identity ip {
    base single-domain;
    description
      "ip domain.";
  }

  identity ptn {
    base ip;
    description
      "packet transport network domain.";
  }

  identity cross-domain {
    base incident-domain;
    description
      "cross domain.";
  }

  identity incident-category {
    description
      "The abstract identity for incident category.";
  }

  identity device {
    base incident-category;
    description
      "device category.";
  }

  identity power-environment {
    base device;
    description
      "power environment category.";
  }

  identity device-hardware {
    base device;
    description
      "hardware of device category.";
  }

  identity device-software {
    base device;
    description
      "software of device category.";
  }

  identity line {
    base device-hardware;
    description
      "line card category.";
  }

  identity maintenance {
    base incident-category;
    description
      "maintenance category.";
  }

  identity network {
    base incident-category;
    description
      "network category.";
  }

  identity protocol {
    base incident-category;
    description
      "protocol category.";
  }

  identity overlay {
    base incident-category;
    description
      "overlay category.";
  }

  identity vm {
    base incident-category;
    description
      "vm category.";
  }

  identity event-type {
    description
      "The abstract identity for event type.";
  }

  identity alarm {
    base event-type;
    description
      "alarm event type.";
  }

  identity notif {
    base event-type;
    description
      "Notification event type.";
  }

  identity log {
    base event-type;
    description
      "Log event type.";
  }

  identity KPI {
    base event-type;
    description
      "KPI event type.";
  }

  identity unknown {
    base event-type;
    description
      "Unknown event type.";
  }

  identity incident-class {
    description
      "The abstract identity for incident class.";
  }

  identity problem {
    base incident-class;
    description
      "It indicates the class of the incident is a problem
       (i.e., cause of the incident), for example, an interface
       fails to work.";
  }

  identity sla-violation {
    base incident-class;
    description
      "It indicates the class of the incident is an SLA
       violation, for example, high CPU rate may cause
       a fault in the future.";
  }

  identity acknowledge-error {
    description
      "Base identity for the problem found while attempting
       to fulfill an 'incident-acknowledge' RPC request.";
  }

  identity diagnose-error {
    description
      "Base identity for the problem found while attempting
       to fulfill an 'incident-diagnose' RPC request.";
  }

  identity resolve-error {
    description
      "Base identity for the problem found while attempting
       to fulfill an 'incident-resolve' RPC request.";
  }

  identity repeated-acknowledge {
    base acknowledge-error;
    description
      "The incident referred to has already been acknowledged.";
  }

  identity root-cause-unlocated {
    base diagnose-error;
    description
      "Fails to locate the root causes when performing the
       diagnosis operation. The detailed reason MUST be included
       in the 'description'.";
  }

  identity root-cause-unresolved {
    base resolve-error;
    description
      "Fails to resolve the root causes when performing the
       resolution operation. The detailed reason MUST be included
       in the 'description'.";
  }

  identity permission-denied {
    base diagnose-error;
    base resolve-error;
    description
      "The permission required for performing the specific
       detection/resolution task is not granted.";
  }

  identity operation-timeout {
    base diagnose-error;
    base resolve-error;
    description
      "The diagnosis/resolution time exceeds the preset time.";
  }

  identity resource-unavailable {
    base diagnose-error;
    base resolve-error;
    description
      "The resource is unavailable to perform
       the diagnosis/resolution operation.";
  }

  identity cause-name {
    description
      "Base identity for the cause name.";
  }

  //typedefs

  typedef incident-priority {
    type enumeration {
      enum critical {
        description
          "the incident MUST be handled immediately.";
      }
      enum high {
        description
          "the incident should be handled as soon as
           possible.";
      }
      enum medium {
        description
          "network services are not affected, or the
           services are slightly affected, but corrective
           measures need to be taken.";
      }
      enum low {
        description
          "potential or imminent service-affecting
           incidents are detected, but services are
           not affected currently.";
      }
    }
    description
      "define the priority of incident.";
  }

  typedef incident-ref {
    type leafref {
      path "/inc:incidents/inc:incident/inc:incident-no";
    }
    description
      "reference a network incident.";
  }

  //groupings

  grouping root-cause-info {
    description
      "The information of root cause.";
    leaf cause-name {
      type identityref {
        base cause-name;
      }
      description
        "the name of cause.";
    }
    leaf detail {
      type string;
      description
        "the detail information of the cause.";
    }
  }

  grouping resources-info {
    description
      "the grouping which defines the network
       resources of a node.";
    uses nw:node-ref;
    list resource {
      key "name";
      description
        "the resources of a network node.";
      leaf name {
        type al:resource;
        description
          "network resource name.";
      }
    }
  }

  grouping incident-time-info {
    description
      "the grouping defines incident time information.";
    leaf raise-time {
      type yang:date-and-time;
      description
        "the time when an incident instance is raised.";
    }
    leaf occur-time {
      type yang:date-and-time;
      description
        "the time when an incident instance occurs.
         It's the occur time of the first event during
         incident detection.";
    }
    leaf clear-time {
      type yang:date-and-time;
      description
        "the time when an incident instance is
         resolved.";
    }
    leaf ack-time {
      type yang:date-and-time;
      description
        "the time when an incident instance is
         acknowledged.";
    }
    leaf last-updated {
      type yang:date-and-time;
      description
        "the latest time when an incident instance is
         updated.";
    }
  }

  grouping incident-info {
    description
      "the grouping defines the information of an
       incident.";
    leaf name {
      type string;
      mandatory true;
      description
        "the name of an incident.";
    }
    leaf type {
      type identityref {
        base incident-class;
      }
      mandatory true;
      description
        "The type of an incident.";
    }
    leaf incident-id {
      type string;
      description
        "The unique qualifier of an incident instance type.
         This leaf is used when the 'type' leaf cannot
         uniquely identify the incident instance type.  Normally,
         this is not the case, and this leaf is the empty string.";
    }
    leaf-list service-instance {
      type string;
      description
        "the related network service instances of
         the incident instance.";
    }
    leaf domain {
      type identityref {
        base incident-domain;
      }
      mandatory true;
      description
        "the domain of an incident.";
    }
    leaf priority {
      type incident-priority;
      mandatory true;
      description
        "the priority of an incident instance.";
    }
    leaf status {
      type enumeration {
        enum raised {
          description
            "an incident instance is raised.";
        }
        enum updated {
          description
            "the information of an incident instance
             is updated.";
        }
        enum cleared {
          description
            "an incident is cleared.";
        }
      }
      default "raised";
      description
        "The status of an incident instance.";
    }
    leaf ack-status {
      type enumeration {
        enum acknowledged {
          description
            "The incident has been acknowledged by user.";
        }
        enum unacknowledged {
          description
            "The incident hasn't been acknowledged.";
        }
      }
      default "unacknowledged";
      description
        "the acknowledge status of an incident.";
    }
    leaf category {
      type identityref {
        base incident-category;
      }
      mandatory true;
      description
        "The category of an incident.";
    }
    leaf detail {
      type string;
      description
        "detail information of this incident.";
    }
    leaf resolve-advice {
      type string;
      description
        "The advice to resolve this incident.";
    }
    container sources {
      description
        "The source components.";
      list source {
        key "node-ref";
        min-elements 1;
        description
          "The source components of incident.";
        uses resources-info;
      }
    }
    container root-causes {
      description
        "The root cause objects.";
      list root-cause {
        key "node-ref";
        description
          "the root causes of incident.";
        uses resources-info {
          augment "resource" {
            description
              "augment root cause information.";
            //if root cause object is a resource of a node
            uses root-cause-info;
          }
        }
        //if root cause object is a node
        uses root-cause-info;
      }
    }
    container root-events {
      description
        "the root cause related events of the incident.";
      list root-event {
        key "type event-id";
        description
          "the root cause related event of the incident.";
        leaf type {
          type leafref {
            path "../../../events/event/type";
          }
          description
            "the event type.";
        }
        leaf event-id {
          type leafref {
            path "../../../events/event[type = current()/../type]"
               + "/event-id";
          }
          description
            "the event identifier, such as uuid,
             sequence number, etc.";
        }
      }
    }
    container events {
      description
        "related events.";
      list event {
        key "type event-id";
        description
          "related events.";
        leaf type {
          type identityref {
            base event-type;
          }
          description
            "event type.";
        }
        leaf event-id {
          type string;
          description
            "the event identifier, such as uuid,
             sequence number, etc.";
        }
        choice event-type-info {
          description
            "event type information.";
          case alarm {
            when "derived-from-or-self(type, 'alarm')" {
              description
                "Only applies when type is alarm.";
            }
            container alarm {
              description
                "alarm type event.";
              leaf resource {
                type leafref {
                  path "/al:alarms/al:alarm-list/al:alarm"
                     + "/al:resource";
                }
                description
                  "network resource.";
                reference
                  "RFC 8632: A YANG Data Model for Alarm
                   Management";
              }
              leaf alarm-type-id {
                type leafref {
                  path "/al:alarms/al:alarm-list/al:alarm"
                     + "[al:resource = current()/../resource]"
                     + "/al:alarm-type-id";
                }
                description
                  "alarm type id.";
                reference
                  "RFC 8632: A YANG Data Model for Alarm
                   Management";
              }
              leaf alarm-type-qualifier {
                type leafref {
                  path "/al:alarms/al:alarm-list/al:alarm"
                     + "[al:resource = current()/../resource]"
                     + "[al:alarm-type-id = current()/.."
                     + "/alarm-type-id]/al:alarm-type-qualifier";
                }
                description
                  "alarm type qualifier.";
                reference
                  "RFC 8632: A YANG Data Model for Alarm
                   Management";
              }
            }
          }
          case notification {
            //TODO
          }
          case log {
            //TODO
          }
          case KPI {
            //TODO
          }
          case unknown {
            //TODO
          }
        }
      }
    }
  }

  // rpcs

  rpc incident-acknowledge {
    description
      "This rpc can be used to acknowledge the specified
       incidents.";
    input {
      leaf-list incident-no {
        type incident-ref;
        description
          "the identifier of an incident instance.";
      }
    }
  }

  rpc incident-diagnose {
    description
      "This rpc can be used to diagnose the specified
       incidents. The result of diagnosis will be reported
       by incident notification.";
    input {
      leaf-list incident-no {
        type incident-ref;
        description
          "the identifier of an incident instance.";
      }
    }
  }

  rpc incident-resolve {
    description
      "This rpc can be used to resolve the specified
       incidents. The result of resolution will be reported
       by incident notification.";
    input {
      leaf-list incident-no {
        type incident-ref;
        description
          "the identifier of an incident instance.";
      }
    }
  }

  sx:structure incident-acknowledge-error-info {
    container incident-acknowledge-error-info {
      description
        "This structure data MAY be inserted in the RPC error
         response to indicate the reason for the
         incident acknowledge failure.";
      leaf incident-no {
        type incident-ref;
        description
          "Indicates the incident identifier that
           fails the operation.";
      }
      leaf reason {
        type identityref {
          base acknowledge-error;
        }
        description
          "Indicates the reason why the operation failed.";
      }
      leaf description {
        type string;
        description
          "Indicates the detailed description about the failure.";
      }
    }
  }
  sx:structure incident-diagnose-error-info {
    container incident-diagnose-error-info {
      description
        "This structure data MAY be inserted in the RPC error
         response to indicate the reason for the
         incident diagnose failure.";
      leaf incident-no {
        type incident-ref;
        description
          "Indicates the incident identifier that
           fails the operation.";
      }
      leaf reason {
        type identityref {
          base diagnose-error;
        }
        description
          "Indicates the reason why the operation failed.";
      }
      leaf description {
        type string;
        description
          "Indicates the detailed description about the failure.";
      }
    }
  }
  sx:structure incident-resolve-error-info {
    container incident-resolve-error-info {
      description
        "This structure data MAY be inserted in the RPC error
         response to indicate the reason for the
         incident resolution failure.";
      leaf incident-no {
        type incident-ref;
        description
          "Indicates the incident identifier that
           fails the operation.";
      }
      leaf reason {
        type identityref {
          base resolve-error;
        }
        description
          "Indicates the reason why the operation failed.";
      }
      leaf description {
        type string;
        description
          "Indicates the detailed description about the failure.";
      }
    }
  }

  // notifications

  notification incident-notification {
    description
      "incident notification. It will be triggered when
       the incident is raised, updated or cleared.";
    leaf incident-no {
      type incident-ref;
      description
        "the identifier of an incident instance.";
    }
    uses incident-info;
    leaf time {
      type yang:date-and-time;
      description
        "occur time of an incident instance.";
    }
  }

  //data definitions

  container incidents {
    config false;
    description
      "the information of incidents.";
    list incident {
      key "name type incident-id";
      description
        "the information of incident.";
      leaf incident-no {
        type uint64;
        mandatory true;
        description
          "The unique sequence number of the incident instance.";
      }
      uses incident-info;
      uses incident-time-info;
    }
  }
}

<CODE ENDS>

9. Security Considerations

The YANG module specified in this document defines a schema for data that is designed to be accessed via network management protocols such as NETCONF [RFC6241] or RESTCONF [RFC8040]. The lowest NETCONF layer is the secure transport layer, and the mandatory-to-implement secure transport is Secure Shell (SSH) [RFC6242]. The lowest RESTCONF layer is HTTPS, and the mandatory-to-implement secure transport is TLS [RFC8446].

The Network Configuration Access Control Model (NACM) [RFC8341] provides the means to restrict access for particular NETCONF or RESTCONF users to a preconfigured subset of all available NETCONF or RESTCONF protocol operations and content.

Some of the readable data nodes in this YANG module may be considered sensitive or vulnerable in some network environments. It is thus important to control read access (e.g., via get, get-config, or notification) to these data nodes. These are the subtrees and data nodes and their sensitivity/vulnerability:

'/incidents/incident': This list specifies the network incident entries. Unauthorized read access to this list can allow intruders to obtain network incident information and potentially get a picture of the broken state of the network. Intruders may then exploit the vulnerabilities of the network to cause further negative impact on the network. Care must be taken to ensure that this list is accessed only by authorized users.

Some of the RPC operations in this YANG module may be considered sensitive or vulnerable in some network environments. It is thus important to control access to these operations. These are the operations and their sensitivity/vulnerability:

"incident-diagnose": This RPC operation performs network incident diagnosis and root cause location. If a malicious or buggy client invokes this operation an unexpectedly large number of times, the result might be an excessive use of system resources on the server side as well as of network resources. Servers MUST ensure they have sufficient resources to fulfill this request; otherwise, they MUST reject the request.

"incident-resolve": This RPC operation is used to resolve the network incident. If a malicious or buggy client invokes this operation an unexpectedly large number of times, the result might be an excessive use of system resources on the server side as well as of network resources. Servers MUST ensure they have sufficient resources to fulfill this request; otherwise, they MUST reject the request.

10. IANA Considerations

10.1. The "IETF XML" Registry

This document requests IANA to register one XML namespace URN in the "ns" subregistry within the "IETF XML Registry" [RFC3688]:

URI: urn:ietf:params:xml:ns:yang:ietf-incident
Registrant Contact: The IESG.
XML: N/A; the requested URI is an XML namespace.

10.2. The "YANG Module Names" Registry

This document requests IANA to register one module name in the 'YANG Module Names' registry, defined in [RFC6020].

Name: ietf-incident
Maintained by IANA?  N
Namespace: urn:ietf:params:xml:ns:yang:ietf-incident
Prefix: inc
Reference:  RFC XXXX
// RFC Ed.: replace XXXX and remove this comment

Acknowledgments

The authors would like to thank Mohamed Boucadair, Robert Wilton, Benoit Claise, Oscar Gonzalez de Dios, Adrian Farrel, Mahesh Jethanandani, Balazs Lengyel, Dhruv Dhody, Bo Wu, Qiufang Ma, Haomian Zheng, YuanYao, Wei Wang, Peng Liu, Zongpeng Du, Zhengqiang Li, Andrew Liu, Joe Clark, Roland Scott, Alex Huang Feng, Kai Gao, Jensen Zhang, and Ziyang Xing for their valuable comments and great input to this work.

References

Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, <https://www.rfc-editor.org/rfc/rfc2119>.
[RFC3688]
Mealling, M., "The IETF XML Registry", BCP 81, RFC 3688, DOI 10.17487/RFC3688, <https://www.rfc-editor.org/rfc/rfc3688>.
[RFC6020]
Bjorklund, M., Ed., "YANG - A Data Modeling Language for the Network Configuration Protocol (NETCONF)", RFC 6020, DOI 10.17487/RFC6020, <https://www.rfc-editor.org/rfc/rfc6020>.
[RFC6241]
Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J., Ed., and A. Bierman, Ed., "Network Configuration Protocol (NETCONF)", RFC 6241, DOI 10.17487/RFC6241, <https://www.rfc-editor.org/rfc/rfc6241>.
[RFC6242]
Wasserman, M., "Using the NETCONF Protocol over Secure Shell (SSH)", RFC 6242, DOI 10.17487/RFC6242, <https://www.rfc-editor.org/rfc/rfc6242>.
[RFC6991]
Schoenwaelder, J., Ed., "Common YANG Data Types", RFC 6991, DOI 10.17487/RFC6991, <https://www.rfc-editor.org/rfc/rfc6991>.
[RFC8040]
Bierman, A., Bjorklund, M., and K. Watsen, "RESTCONF Protocol", RFC 8040, DOI 10.17487/RFC8040, <https://www.rfc-editor.org/rfc/rfc8040>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, <https://www.rfc-editor.org/rfc/rfc8174>.
[RFC8341]
Bierman, A. and M. Bjorklund, "Network Configuration Access Control Model", STD 91, RFC 8341, DOI 10.17487/RFC8341, <https://www.rfc-editor.org/rfc/rfc8341>.
[RFC8345]
Clemm, A., Medved, J., Varga, R., Bahadur, N., Ananthakrishnan, H., and X. Liu, "A YANG Data Model for Network Topologies", RFC 8345, DOI 10.17487/RFC8345, <https://www.rfc-editor.org/rfc/rfc8345>.
[RFC8446]
Rescorla, E., "The Transport Layer Security (TLS) Protocol Version 1.3", RFC 8446, DOI 10.17487/RFC8446, <https://www.rfc-editor.org/rfc/rfc8446>.
[RFC8632]
Vallin, S. and M. Bjorklund, "A YANG Data Model for Alarm Management", RFC 8632, DOI 10.17487/RFC8632, <https://www.rfc-editor.org/rfc/rfc8632>.

Informative References

[BERT]
"BERT (language model)", n.d., <https://en.wikipedia.org/wiki/BERT_(language_model)>.
[I-D.ietf-ippm-pam]
Mirsky, G., Halpern, J. M., Min, X., Clemm, A., Strassner, J., and J. François, "Precision Availability Metrics for Services Governed by Service Level Objectives (SLOs)", Work in Progress, Internet-Draft, draft-ietf-ippm-pam-09, <https://datatracker.ietf.org/doc/html/draft-ietf-ippm-pam-09>.
[I-D.ietf-nmop-terminology]
Davis, N., Farrel, A., Graf, T., Wu, Q., and C. Yu, "Some Key Terms for Network Fault and Problem Management", Work in Progress, Internet-Draft, draft-ietf-nmop-terminology-05, <https://datatracker.ietf.org/doc/html/draft-ietf-nmop-terminology-05>.
[I-D.quilbeuf-opsawg-configuration-tracing]
Quilbeuf, J., Claise, B., Graf, T., Lopez, D., and S. Qiong, "External Trace ID for Configuration Tracing", Work in Progress, Internet-Draft, draft-quilbeuf-opsawg-configuration-tracing-02, <https://datatracker.ietf.org/doc/html/draft-quilbeuf-opsawg-configuration-tracing-02>.
[I-D.rogaglia-netconf-trace-ctx-extension]
Gagliano, R., Larsson, K., and J. Lindblad, "NETCONF Extension to support Trace Context propagation", Work in Progress, Internet-Draft, draft-rogaglia-netconf-trace-ctx-extension-03, <https://datatracker.ietf.org/doc/html/draft-rogaglia-netconf-trace-ctx-extension-03>.
[RFC7950]
Bjorklund, M., Ed., "The YANG 1.1 Data Modeling Language", RFC 7950, DOI 10.17487/RFC7950, <https://www.rfc-editor.org/rfc/rfc7950>.
[RFC8969]
Wu, Q., Ed., Boucadair, M., Ed., Lopez, D., Xie, C., and L. Geng, "A Framework for Automating Service and Network Management with YANG", RFC 8969, DOI 10.17487/RFC8969, <https://www.rfc-editor.org/rfc/rfc8969>.
[RFC9375]
Wu, B., Ed., Wu, Q., Ed., Boucadair, M., Ed., Gonzalez de Dios, O., and B. Wen, "A YANG Data Model for Network and VPN Service Performance Monitoring", RFC 9375, DOI 10.17487/RFC9375, <https://www.rfc-editor.org/rfc/rfc9375>.
[RFC9408]
Boucadair, M., Ed., Gonzalez de Dios, O., Barguil, S., Wu, Q., and V. Lopez, "A YANG Network Data Model for Service Attachment Points (SAPs)", RFC 9408, DOI 10.17487/RFC9408, <https://www.rfc-editor.org/rfc/rfc9408>.
[RFC9417]
Claise, B., Quilbeuf, J., Lopez, D., Voyer, D., and T. Arumugam, "Service Assurance for Intent-Based Networking Architecture", RFC 9417, DOI 10.17487/RFC9417, <https://www.rfc-editor.org/rfc/rfc9417>.
[RFC9418]
Claise, B., Quilbeuf, J., Lucente, P., Fasano, P., and T. Arumugam, "A YANG Data Model for Service Assurance", RFC 9418, DOI 10.17487/RFC9418, <https://www.rfc-editor.org/rfc/rfc9418>.
[TMF724A]
"Incident Management API Profile v1.0.0", <https://www.tmforum.org/resources/standard/tmf724a-incident-management-api-profile-v1-0-0/>.
[W3C-Trace-Context]
"W3C Recommendation on Trace Context", <https://www.w3.org/TR/2021/REC-trace-context-1-20211123/>.

Appendix A. Examples

A.1. Network Incident Management with a Specific Network Topology and Network Service

{
  "incident-no": 56433218,
  "incident-id": "line fault",
  "service-instance": ["optical-svc-A"],
  "domain": "FAN",
  "priority": "critical",
  "occur-time": "2026-03-10T04:01:12Z",
  "clear-time": "2026-03-10T06:01:12Z",
  "ack-time": "2026-03-10T05:01:12Z",
  "last-updated": "2026-03-10T05:31:12Z",
  "status": "unacknowledged-and-uncleared",
  "category": "Line",
  "source": [
    {
      "node-ref": "example:D1",
      "network-ref": "example:L2-topo",
      "resource": [
        {
          "name": "7985e01a-5aad-11ea-b214-286ed488cf99"
        }
      ]
    }
  ],
  "root-causes": [
    {
      "name": "Feeder fiber great loss change",
      "detail-information": "The connector of the optical fiber is contaminated, or the optical fiber is bent too much.",
      "root-cause": {
        "network-ref": "example:L2-topo",
        "node-ref": "example:D1",
        "resource": [
          {
            "name": "7985e01a-5aad-11ea-b214-286ed488cf99",
            "cause-name": "ltp",
            "detail": "Frame=0, Slot=6, Subslot=65535, Port=7, ODF=ODF001, Level1Splitter=splitter0025"
          }
        ]
      }
    }
  ],
  "root-event": [
    {
      "event-id": "8921834",
      "type": "alarm"
    }
  ],
  "events": [
    {
      "event-id": "8921832",
      "type": "alarm"
    },
    {
      "event-id": "8921833",
      "type": "alarm"
    },
    {
      "event-id": "8921834",
      "type": "alarm"
    }
  ]
}
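
As a non-normative illustration, a client receiving an incident like the one above might run simple consistency checks before acting on it. The Python sketch below uses a trimmed copy of the example and assumes that each entry under "events" and "root-event" is keyed by "event-id". The function names and the two checks (every root event also appears in "events"; "ack-time" and "clear-time" do not precede "occur-time") are assumptions made for illustration, not requirements of the YANG module.

```python
from datetime import datetime


def parse_time(value):
    """Parse a UTC timestamp such as 2026-03-10T04:01:12Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def check_incident(incident):
    """Return a list of findings; an empty list means nothing was flagged."""
    findings = []
    event_ids = {e["event-id"] for e in incident.get("events", [])}
    # Assumed rule: each root event is one of the correlated events.
    for root in incident.get("root-event", []):
        if root["event-id"] not in event_ids:
            findings.append(f"root event {root['event-id']} is not in events")
    # Assumed rule: acknowledgment and clearing follow the occurrence.
    occur = parse_time(incident["occur-time"])
    for field in ("ack-time", "clear-time"):
        if field in incident and parse_time(incident[field]) < occur:
            findings.append(f"{field} precedes occur-time")
    return findings


# Trimmed copy of the example incident (hypothetical subset of fields).
incident = {
    "incident-no": 56433218,
    "occur-time": "2026-03-10T04:01:12Z",
    "ack-time": "2026-03-10T05:01:12Z",
    "clear-time": "2026-03-10T06:01:12Z",
    "root-event": [{"event-id": "8921834", "type": "alarm"}],
    "events": [
        {"event-id": "8921832", "type": "alarm"},
        {"event-id": "8921833", "type": "alarm"},
        {"event-id": "8921834", "type": "alarm"},
    ],
}

print(check_incident(incident))  # prints []
```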

Appendix B. Changes between Revisions

v01 - v02

v00 - v01

v00 (draft-ietf-nmop-network-incident-yang)

v03 - v04 (draft-feng-opsawg-incident-management)

v02 - v03 (draft-feng-opsawg-incident-management)

v01 - v02

v00 - v01 (draft-feng-opsawg-incident-management)

Contributors

Thomas Graf
Swisscom
Binzring 17
CH-8045 Zurich
Switzerland
Zhenqiang Li
CMCC
Yanlei Zheng
China Unicom
Yunbin Xu
CAICT
Xing Zhao
CAICT
Chaode Yu
Huawei
MingShuang Jin
Huawei Technologies
Chunchi Liu
Huawei Technologies
Aihua Guo
Futurewei Technologies
Zhidong Yin
Huawei
Guoxiang Liu
Huawei
Kaichun Wu
Huawei

Authors' Addresses

Tong Hu
CMCC
Building A01, 1600 Yuhangtang Road, Wuchang Street, Yuhang District
Hangzhou
311121
China
Luis Miguel Contreras Murillo
Telefonica I+D
Madrid
Spain
Qin Wu
Huawei
101 Software Avenue, Yuhua District
Nanjing
210012
China
Nigel Davis
Ciena
Chong Feng