Automatic and NLP methods for domain concept elicitation from text

by Soldatov Ilya, iasoldatov@edu.hse.ru

Domain-Driven Design (DDD) is a software development approach that emphasizes understanding the software's domain and the logic related to it. One of the keys to such analysis is text elicitation, which allows experts to extract the main domain terms, concepts, and actions automatically or semi-automatically, in order to understand the specific characteristics of the subject domain deeply enough to create effective and relevant software solutions. Without automatic methods, this analysis is usually carried out manually. The text used for extraction comes from various sources: corporate documents, technical specifications, and manuals, as well as interactions with domain experts and interviews with stakeholders. These materials often contain valuable knowledge about the domain, its key concepts, actors, and interactions. In the traditional manual approach, analysts and domain experts carefully study and analyze textual materials, highlighting the most important and relevant fragments. This process requires a deep understanding of the subject matter and a high degree of attention to detail, since even minor details can be significant for understanding the domain. However, this method can be slow, prone to errors, and dependent on the qualification of a specific expert.

Natural Language Processing, or NLP, is a field of research at the intersection of computer science, artificial intelligence, and linguistics aimed at understanding and interpreting human language using machines. Since the primary task in DDD is identifying and formalizing key domain concepts, NLP can become a powerful tool. Using NLP methods, one can automate the process of studying and analyzing large volumes of textual data, such as documentation, technical specifications, or even unstructured notes and discussions. Where analysts and domain experts used to spend a lot of time manually studying materials, today NLP algorithms can identify key concepts, dependencies, and patterns in the text, significantly speeding up and improving the domain understanding process.
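
To make this concrete, here is a minimal sketch of how such concept extraction might look in practice. It uses the spaCy library and its small English model to collect frequent noun phrases as candidate domain concepts; the library choice, the model name, and the toy sentence are illustrative assumptions, not something prescribed by the papers referenced below.

    from collections import Counter
    import spacy

    # Assumes spaCy and its small English model are installed:
    #   pip install spacy && python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    def candidate_concepts(text, top_n=10):
        """Return the most frequent noun phrases as candidate domain concepts."""
        doc = nlp(text)
        counts = Counter(
            # Lemmatize each noun phrase and drop leading determiners
            # so that "the order" and "an order" count as one concept.
            " ".join(tok.lemma_.lower() for tok in chunk if tok.pos_ != "DET")
            for chunk in doc.noun_chunks
            if not chunk.root.is_stop  # skip phrases headed by stop words
        )
        return counts.most_common(top_n)

    sample = ("The customer places an order. The order contains order lines, "
              "and each order line references a product from the catalogue.")
    print(candidate_concepts(sample))
    # e.g. [('order', 2), ('order line', 2), ('customer', 1), ...]

Even a crude frequency count like this gives an analyst a first vocabulary to discuss with domain experts; real pipelines would add filtering, synonym merging, and domain-specific tuning.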

Besides NLP, there are other automatic text elicitation methods. For example, statistical analysis methods determine term frequencies, correlations, and other indicators that can reveal key domain concepts; a small term-weighting sketch follows below. Semantic analysis technologies, like ontological modeling, provide means for representing and interpreting knowledge extracted from text: they allow the construction of complex models that reflect the structure of the subject domain, the relationships between its elements, and their properties. Data visualization methods can help represent complex domain structures in graphical form, making them more intuitive to understand.

There are also models closely related to NLP methods. The Recursive Object Model (ROM) is a tool for representing the syntactic structure of text. It was developed to capture the technical nuances of English text in software documents, particularly where only declarative statements are involved. The ROM uses a set of symbols to represent linguistic elements, such as nouns and verbs, and their relationships: solid-line boxes denote specific entities corresponding to nouns, while various arrows and lines represent relationships and actions between these entities. The ROM's strength lies in its ability to transform natural language into a structured diagram, which can then be stored in formats like XRD, an extension of XML. This structured representation captures the essence of requirements and provides a foundation for further analysis. Another kind of model is the Expert Comparable Contextual (ECC) model, which is extracted from domain-specific data models. ECC models provide domain-related information that enriches the syntactic analysis: they describe the structure and meaning of the data, serving as a standard for designing databases, and they assist in identifying the necessary elements of conceptual models due to their resemblance to data models.
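
As an illustration of the statistical side, the following sketch ranks terms by TF-IDF weight, a standard frequency-based measure that favors terms which are prominent in a document but not ubiquitous across the corpus. The use of scikit-learn and the three-sentence toy corpus are assumptions made purely for the example.

    # Minimal TF-IDF term ranking; assumes scikit-learn is installed.
    from sklearn.feature_extraction.text import TfidfVectorizer

    # An invented stand-in for real domain documents.
    docs = [
        "The warehouse ships each order after payment is confirmed.",
        "An order is split into shipments when items live in different warehouses.",
        "Payment confirmation triggers an invoice for the customer.",
    ]

    vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
    tfidf = vectorizer.fit_transform(docs)

    # Rank each term by its highest TF-IDF weight across the corpus.
    terms = vectorizer.get_feature_names_out()
    best = tfidf.max(axis=0).toarray().ravel()
    for term, score in sorted(zip(terms, best), key=lambda p: -p[1])[:10]:
        print(f"{term:25s} {score:.3f}")

On a realistic corpus, such a ranking tends to surface domain vocabulary (here terms like "order", "payment", "warehouse") that an analyst can then validate with experts.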

First and foremost, automated methods significantly accelerate the analysis of large volumes of text. Time is a valuable resource, and the speed of processing can save many person-hours that were previously spent on manual document examination. Manual analysis can also be subjective and inconsistent, whereas automated algorithms offer a level of objectivity, ensuring that the extraction process remains consistent across different datasets. Furthermore, automation allows us to move from simple information extraction to more complex interpretation. For example, NLP methods can determine not only the presence of specific terms in the text but also their context, their semantic relationships with other terms, and the emotional tone of mentions. This enables a deeper and more comprehensive understanding of the subject domain, which is key to successful Domain-Driven Design.
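
To illustrate that last point, the sketch below uses spaCy's dependency parse to extract rough subject-verb-object triples, capturing not just which terms occur but how they relate to each other. The extraction rules are deliberately simplistic, a starting point under the same spaCy assumptions as above rather than a production extractor.

    import spacy

    # Same assumption as before: the small English model is installed.
    nlp = spacy.load("en_core_web_sm")

    def svo_triples(text):
        """Extract rough (subject, verb, object) triples from declarative text."""
        triples = []
        for token in nlp(text):
            if token.pos_ == "VERB":
                # Look at the verb's direct children for subjects and objects.
                subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
                objects = [c for c in token.children if c.dep_ in ("dobj", "obj")]
                for s in subjects:
                    for o in objects:
                        triples.append((s.lemma_, token.lemma_, o.lemma_))
        return triples

    print(svo_triples("The customer places an order and the warehouse ships it."))
    # e.g. [('customer', 'place', 'order'), ('warehouse', 'ship', 'it')]

Triples of this kind are a natural bridge from raw text toward domain models: subjects and objects suggest entities, while the verbs suggest the actions and relationships between them.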

However, NLP is far from perfect at the moment. In natural language, words can have multiple meanings depending on their context. Although NLP models can be adapted for specific domains, the nuances of certain specialized fields might pose challenges: scientific jargon, niche terminologies, or emerging concepts may not always be captured accurately by generalized models. The efficiency of NLP methods is also tied to the quality of the data they are trained on; if the training data is biased, incomplete, or not representative of the domain, the extracted concepts may be unreliable. And with the efficiencies offered by automated methods comes a risk of over-reliance: it becomes tempting to automate the process entirely without further verification. The desire to identify domain concepts effectively opens up broader paths to the discovery and representation of knowledge. Assessing the progress made and recognizing the challenges ahead, it is clear that the fusion of human knowledge and technological progress will shape the future of this endeavor, promising a richer, deeper, and more holistic understanding of the world's diverse knowledge.

Automatic text elicitation methods, with NLP first among them, provide the ability to quickly and consistently analyze vast amounts of textual information, extracting key domain concepts and relationships. However, like any technology, they have limitations that require careful and conscious application. It is important to maintain a balance between automation and human intervention to ensure the accuracy and reliability of results. With progress in the fields of artificial intelligence and machine learning, we are likely to see even more innovations and improvements in this area. Yet the key to successful application of these technologies lies in the harmony between machine and human, their collaborative work, and mutual understanding.

Chetan Arora, Mehrdad Sabetzadeh, Lionel Briand. Extracting Domain Models from Natural-Language Requirements: Approach and Industrial Evaluation. https://people.svv.lu/sabetzadeh/pub/MODELS16.pdf

Shadi Moradi Seresht, Olga Ormandjieva. Automated Assistance for Use Cases Elicitation from User Requirements Text. https://www.researchgate.net/publication/264874836_Automated_Assistance_for_Use_Cases_Elicitation_from_User_Requirements_Text

Maryam Imtiaz Malik, Muddassar Azam Sindhu, Rabeeh Ayaz Abbasi. Extraction of use case diagram elements using natural language processing and network science. https://www.researchgate.net/publication/371809721_Extraction_of_use_case_diagram_elements_using_natural_language_processing_and_network_science