To view a C/C++ code-base in Structure101 Studio for Clang C/C++, you first need to generate a .cpa (C/C++ architecture) file containing all the raw data about the composition and inter-dependencies in your source code. Once you have this, create a Structure101 Studio project to point at that .cpa file.

The .cpa file generation is a two-stage process, which can be driven by the provided CPAGenerator class.

  1. Pre-compile each compilation unit in the project to output an AST file

  2. Read each AST file and convert the AST constructs to generate the cpa file

Quick Start

It is assumed Structure101 Studio or Build for Clang C/C++ has been installed into <Structure101 install path>.

  1. Download the Structure101 patched clang distribution (available here) and unpack to a suitable folder.
    eg C:\Tools\LLVM-13.0.1-win64

  2. Generate a compile_commands.json file for your project in the root folder of the source tree.

  3. Copy the compile_config.json and excludes.txt files from <Structure101 install path>/build-tools/default-config-files to your project's root folder.

  4. Edit compile_config.json and change clang in "compiler": "clang" so it references the Structure101 patched clang distribution.
    eg "C:\\Tools\\LLVM-13.0.1-win64\\bin\\clang.exe" or "/home/myuser/LLVM-13.0.1-win64/clang"

  5. In a shell or command prompt in your project's root folder, run:
    java -cp <Structure101 install path>/structure101-cpa-build.jar com.headway.parser.ast.CPAGenerator generate-asts-cpa -x -k -o my-project-model

  6. Check the console output or the cpa-parser.log file for any reported errors.

If there are errors, my-project-model will be incomplete. You must resolve all the errors to get a complete model of your code.

On subsequent runs:

Add -ocf true to force overwrite of the my-project-model.cpa file if it already exists.

Add -x to avoid re-generating existing AST files to improve performance of the AST generation phase.

Add -p 6 -cp 6 to improve performance of both phases.

This will utilise 6 threads each for the 'AST generation pool' and the 'AST converter pool' respectively.
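As a rough sketch of how these options combine, the Python helper below assembles the argument list for a later run. The function name and its parameters are illustrative and not part of Structure101; the jar name, class name and flags are copied from the steps above, and the install path is a placeholder you would supply.

```python
def build_command(install_path, output_name, threads=6, overwrite=True):
    """Assemble a generate-asts-cpa command line for a subsequent run.

    install_path is a placeholder for <Structure101 install path>.
    """
    command = [
        "java",
        "-cp", f"{install_path}/structure101-cpa-build.jar",
        "com.headway.parser.ast.CPAGenerator",
        "generate-asts-cpa",
        "-x",                    # keep existing AST files
        "-k",                    # keep going on clang parsing errors
        "-o", output_name,
        "-p", str(threads),      # AST generation pool size
        "-cp", str(threads),     # AST converter pool size
    ]
    if overwrite:
        command += ["-ocf", "true"]  # overwrite an existing .cpa file
    return command


if __name__ == "__main__":
    print(" ".join(build_command("/opt/structure101", "my-project-model")))
```

The resulting list could be handed to subprocess.run from the project's root folder, which avoids shell quoting issues with paths that contain spaces.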

Generating AST Files with Clang

The AST (Abstract Syntax Tree) file generation relies on the magic of Clang. However, the standard clang AST file does not output all the information needed to generate a structural model.

The most problematic issue was identifying macro expansions and their source location. The clang AST has two locations for an AST construct, the Spelling location, and the Expansion location. These two locations are very often the same. But for macro expansions, where a previously defined macro is used in a file, these two locations are different. The Spelling location is the source location of the macro definition. The Expansion location is the source location of the macro usage.

Unfortunately, in the standard AST, macro expansions are output with their Spelling location. This creates false dependencies to/from the file containing the macro definition to the files containing types that were used in the macro expansions. Those dependencies should be to/from the file in which the macro is expanded. The only way to resolve this issue was to modify the AST dump code to output the Expansion location instead of the Spelling location.

Therefore, the AST generation stage must use the Structure101 custom clang compiler.

A new option has been added to -ast-dump/-ast-dump-all to output the Structure101 format AST with the Expansion location:

-ast-dump-all=s101

In addition a new option has been added to generate a custom file containing the source locations of #includes and the macro defines and expansions. The AST and custom file are generated from a single pre-compile to maximise performance.

The new option is -HM:

-Xclang -HM=<path to output file>.inc

This writes the dependency hierarchy and the macro define/expansion locations into a .inc file.

The patched clang distribution is available from Structure101 here.

 

Below is a typical command line converted from a compile_commands.json file by the CPAGenerator class.

clang.exe -E -g -w -fno-color-diagnostics -Xclang -ast-dump-all=s101 -Xclang -HM=C:\development\cpptest\generated-asts/cpptest_Struct.c.inc -isystem <snip> "C:/development/cpptest/Struct.c"

Using the CPAGenerator class with a compile_commands.json file is the recommended approach.

 

The compilation database approach

A compilation database is a JSON file, which consists of an array of “command objects”, where each command object specifies one way a translation unit is compiled / parsed in the project.
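To make the shape of a command object concrete, here is a small Python sketch that parses a hypothetical two-entry database. The directory, command (or arguments) and file fields follow the Clang JSON compilation database format; the paths and flags are invented for illustration, and summarize is a helper written for this example only.

```python
import json
import shlex

# Hypothetical compilation database; paths and flags are invented.
DATABASE_TEXT = """
[
  {
    "directory": "/home/user/cpptest/build",
    "command": "clang++ -Iinclude -DNDEBUG -std=c++17 -c ../src/Class1.cpp -o Class1.o",
    "file": "../src/Class1.cpp"
  },
  {
    "directory": "/home/user/cpptest/build",
    "arguments": ["clang", "-Iinclude", "-c", "../src/Struct.c", "-o", "Struct.o"],
    "file": "../src/Struct.c"
  }
]
"""


def summarize(text):
    """Return (file, compiler, number of arguments) for each command object."""
    rows = []
    for entry in json.loads(text):
        # A command object carries either an "arguments" list or a "command" string.
        args = entry.get("arguments") or shlex.split(entry["command"])
        rows.append((entry["file"], args[0], len(args) - 1))
    return rows


if __name__ == "__main__":
    for row in summarize(DATABASE_TEXT):
        print(row)
```

The first element of each argument list is the compiler, which is the element that the compile configuration later substitutes.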

Compilation databases are now widely used and supported and can easily be generated from most build systems using a wide variety of open source tools such as: CMake, Bear, compiledb and YCM-Generator. There is also a free proprietary Visual Studio Addin from Sourcetrail to generate a compilation database from a Visual Studio solution.

The Structure101 CPAGenerator can take a JSON compilation database file as input, to produce a .cpa output file.

The Structure101 CPAGenerator

The generator is packaged in the structure101-cpa.jar file that can be found in your Structure101 installation directory.

The generator supports the following commands:

generate-asts-cpa

Runs all the commands in the specified JSON compilation database. Each command in the database should be configured (using a compilation configuration file) to generate a Clang AST file (.ast) for each compilation unit.

When all commands in the compilation database have been run, a recursive search for .ast files is carried out to generate a .cpa file.

generate-cpa

Generates a .cpa file from existing .ast files, skipping the AST generation phase of the generate-asts-cpa command. It will typically only be used in initial setup when there are issues with AST generation and you want to get some sense of the partial Structure101 model of your code.

merge-dbs

Merges the content of one CPA database into another CPA database. After merging, the generate-from-db command can be used to generate the cpa file. Typically used after merging several databases into one.

merge-dbs-cpa

Merges the content of one CPA database into another CPA database and generates a cpa file from the merged database.

import-edges

Queries an input CPA database for edges using a custom SQL statement and imports them into an output CPA database. After import, the generate-from-db command can be used to generate the cpa file. Typically used to import edges from several databases.

import-edges-cpa

Queries an input CPA database for edges using a custom SQL statement and imports them into an output CPA database and generates a cpa file from the merged database.

generate-from-db

Generates a cpa file from a previously populated CPA database.

run-compile-commands

Runs all the commands in compile-commands-file-name and generates .stdout and .stderr files to capture output.

clean

Removes all references in your .ast files to code entities referenced in your excludes file. The generate-asts-cpa command automatically runs clean over the generated .ast files. The clean command's excludes configuration file is also used during .cpa generation to remove unwanted system and third-party library includes from the model.

One of the above commands is required when invoking the Structure101 CPAGenerator.

Usage: java -cp <installation-dir>/structure101-cpa.jar com.headway.parser.ast.CPAGenerator <generate-asts-cpa | generate-cpa | clean> [optional command specific arguments]

Usage information can be output using the -help argument:

java -cp <installation-dir>/structure101-cpa.jar com.headway.parser.ast.CPAGenerator -help

Details of each command and their command line options follow.

 

The generate-asts-cpa command

Usage: generate-asts-cpa -c <compile-commands-file-name> -f <compile-configuration-file-name> -e <excludes-file-name> -d <ast-root-folder-name> -o <output-file-name> -pool-size <size> -converter-pool-size <size> -gzip-compress -keep-going -ignore-compilation-errors -dry-run

Runs all the commands in compile-commands-file-name. Each command in compile-commands-file-name should be configured to generate a .ast file for each compilation command. When all commands in compile-commands-file-name have been run, a recursive search of ast-root-folder-name is carried out to generate the output-file-name .cpa file from the generated .ast files found.

output-file-name can then be loaded into Structure101 Studio.

If the -c argument is not used, compile-commands-file-name defaults to compile_commands.json.

If the -f argument is not used, compile-configuration-file-name defaults to compile_config.json.

If the -e argument is not used, excludes-file-name defaults to excludes.txt.

If the -d argument is not used, ast-root-folder-name defaults to generated-asts.

If the -o argument is not used, output-file-name defaults to structure101-model.cpa.

The generate-asts-cpa command runs in two stages.

Stage 1 processes your compile-commands-file-name to run modified compilation steps that each generate a Clang AST file instead of an object file. The "standard" compilation is modified using compile-configuration-file-name.

Stage 2 processes the generated AST files to create a single Structure101 .cpa model.

The command input options are:

-c <compile-commands-file-name>

-f <compile-configuration-file-name>

-e <excludes-file-name>

The output options are:

-d <ast-root-folder-name> the folder into which the .ast files are written, defaults to generated-asts.

-o <output-file-name> is the cpa file to be generated, defaults to structure101-model.cpa.

The following options modify the behaviour of the command:

argument-short-form / argument-long-form

-rt / -root-folder

is the root file path that will be stripped from the model file paths. If not provided the root folder is set to the parent of the working folder.

The root folder can be set to the folder containing the compile_commands.json file using -root-folder const(COMPILE_COMMANDS_FILE)

It can be set to the parent of the working folder using -root-folder const(WORKING_FOLDER)

Note that AST files generated on one host can be converted successfully on another host by setting -root-folder appropriately. The absolute paths of the compile commands are captured in the .src files, one for each AST file.

For example, this .src file was created running generate-asts-cpa on Linux.

Class1.cpp.src

/home/mikesr/cpptest

Class1.cpp

The project folder is cpptest. These AST files can be successfully converted on any host provided -root-folder /home/mikesr/ is used.
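The effect of -root-folder on the example above can be pictured with a short Python sketch. This is an assumption about the behaviour, inferred from the description of stripping the root path from model file paths, and not the CPAGenerator's actual code; model_path is a hypothetical helper.

```python
import posixpath


def model_path(src_directory, src_file, root_folder):
    """Strip root_folder from the absolute path recorded in a .src file.

    Assumed behaviour: paths outside root_folder are left unchanged.
    """
    absolute = posixpath.normpath(posixpath.join(src_directory, src_file))
    root = posixpath.normpath(root_folder)
    if not absolute.startswith(root + "/"):
        return absolute
    return absolute[len(root) + 1:]


if __name__ == "__main__":
    # Values taken from the Class1.cpp.src example above.
    print(model_path("/home/mikesr/cpptest", "Class1.cpp", "/home/mikesr/"))
```

With the values from Class1.cpp.src and a root folder of /home/mikesr/, the sketch yields cpptest/Class1.cpp, so the model path starts at the project folder regardless of the host on which the conversion runs.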

-rf / -use-response-file

when true, creates a response file of arguments to avoid limitations on command line length. Defaults to true.

-z / -gzip-compress [true|false]

when true, will gzip all generated ast files to reduce disk space usage. Defaults to true.

-x / -keep-existing-files

doesn't clear the ast folder and skips any compile_commands for which the ast file exists. This saves time when configuring a new project as previously successful compile steps are not re-run.

-ocf / -overwrite-cpa-file [true|false]

when true, forces overwrite of an existing cpa file. Defaults to false.

-p / -pool-size n

set the thread pool size to n for parallel execution. Each pool thread spawns a new process which executes the compile command. The default size is 1. On a multi-core machine, increasing the thread pool size can reduce execution time by 75% or more.

-i / -ignore-compilation-errors

ignore any errors and save any AST output produced by clang to the resulting .ast file. If this option is not used an empty .ast file is created for each compilation unit that encounters a compilation error even though clang may have produced AST output. The AST is generated up to the point of the compilation failure and may be useful when diagnosing compilation issues.

NOTE that the partial AST is not processed during the conversion phase and no model elements will be created for the failed compilation unit.

-k / -keep-going

keep going even if clang parsing errors are encountered. By default, processing of compilation units stops on error.

-dry-run

prints the pre-compile commands to the console, but does not execute them.

-cp / -converter-pool-size n

set the number of concurrent threads to use when converting the AST files to the CPA model. The default size is 1.

-mapp / -max-asts-per-pool

specifies the maximum number of AST files to queue for conversion by the pool. The pool is refreshed before processing each sub-set of AST files which reduces the memory consumed during the conversion process. The default is 8000.

-pc / -preserve-case

The -pc preserve-case option determines if the case sensitivity of file and folder names is preserved or the names are converted to lower case. The default is true.

-idb / -initialise-db

when set false will not drop and re-create the tables of an existing CPA database file. Default is true.

-dbo / -db-options

will be appended to the CPA JDBC connect string. This is useful for enabling logging from CPA with -dbo TRACE_LEVEL_FILE=2 (info level or =3 debug). The log is written to a .mv.trace.db file.

-ncs / -name-col-size

specifies the length of the name column in the H2 database tables. Some deeply nested types (when using boost, for example) can be thousands of characters in length. Default is 30K.

NOTE that if any errors are detected during database activity, a .mv.trace.db file is automatically generated and the error logged. If this happens, the database data will be corrupt and it will not be possible to open the .cpa file generated from it. The cause of the error must be resolved and the AST generation run again.

-ps / -page-size

sets the page size of results read from the database during writing of the cpa file. Default is 100000.

Configuration

The compile commands must be modified to generate the AST file. The necessary changes are controlled by the compile-configuration-file-name file. Here is an example configuration file:

[
  {
    "//": "Change the compiler argument to point to the Structure101 patched clang",
    "compiler": "clang",
    "compile-commands-file-name": "compile_commands.json",
    "excludes-file-name": "excludes.txt",
    "ast-root-folder-name": "generated-asts",
    "output-file-name": "structure101-model.cpa",
    "compilerArguments": [
      {
        "//": "Required to execute clang pre-compile with warnings suppressed",
        "compilerArguments": [
          "-E",
          "-w"
        ]
      },
      {
        "//": "Instruct clang to emit an AST dump in Structure101 format with debug info and no color",
        "compilerArguments": [
          "-Xclang",
          "-ast-dump-all=s101",
          "-g",
          "-fno-color-diagnostics"
        ]
      },
      {
        "//": "Instruct clang to output a file of pre-processor directives, includes and macros",
        "compilerArguments": [
          "-Xclang",
          "-HM="
        ]
      },
      {
        "//": "Add any additional compiler arguments here",
        "compilerArguments": [
        ]
      }
    ],
    "argumentsToDelete": [
    ],
    "argumentsToReplace": [
      {
        "pattern": "/D(.*)",
        "replace": "-D$1"
      },
      {
        "pattern": "/I(.*)",
        "replace": "-I$1"
      }
    ],
    "argumentsToInclude": [
      {
        "pattern": "-I",
        "includeNextArg": "true"
      },
      {
        "pattern": "-I(.*)"
      },
      {
        "pattern": "-isystem",
        "includeNextArg": "true"
      },
      {
        "pattern": "-isystem(.*)"
      },
      {
        "pattern": "-U",
        "includeNextArg": "true"
      },
      {
        "pattern": "-U(.*)"
      },
      {
        "pattern": "-D",
        "includeNextArg": "true"
      },
      {
        "pattern": "-D(.*)"
      },
      {
        "pattern": "-std",
        "includeNextArg": "true"
      },
      {
        "pattern": "-std(.*)"
      }
    ],
    "compileCommandsToInclude": [
      {
        "pattern": "(.*).cpp"
      }
    ],
    "compileCommandsToExclude": [
      {
        "pattern": "(.*).S"
      }
    ]
  }
]

compile_config.json

 

The compiler element will not need altering if our patched Clang executable is already on your PATH environment variable. If necessary you can provide the absolute path to our patched clang.

Note that the first argument of each compile command (the compiler) is always removed and substituted with the compiler element. If you are already using your own custom clang implementation in the compile commands you can configure the CPAGenerator to use it with "compiler": "const(USE_COMMAND_COMPILER)"

NOTE: If you do this, you must patch your clang source to support the additional options described above. A patch file is available on request by emailing support@structure101.com.

The "-HM=" compiler argument will have an appropriate file name appended by the CPAGenerator as each compile command is processed.

The argumentsToDelete, argumentsToReplace and argumentsToInclude elements are used to modify the list of arguments in the compile command such that only those required for a successful pre-compile are passed to clang. All three can be used together, but the intention is that either argumentsToInclude or argumentsToDelete is used, with the other left empty as shown in the example compile_config.json above.

It is recommended that argumentsToInclude is used in preference to argumentsToDelete such that arguments are removed if they have not been matched by a replace or include pattern. We have found that fewer changes to the compile_config are required when using argumentsToInclude.

Note however that the argumentsToInclude in the default compile_config.json will cause any platform or compiler specific arguments to be removed and the clang pre-compile will use the local host running the parser as the target and the default libraries identified by clang.

If using argumentsToDelete with an empty argumentsToInclude, arguments are retained if they have not been matched by a delete or replace pattern.

argumentsToDelete, argumentsToReplace, argumentsToInclude, compileCommandsToInclude and compileCommandsToExclude all support regular expressions. Our implementation uses the Java matches method, so a pattern must match the whole string and any special characters must be escaped. For instance, Windows paths using '\' as the path separator may cause problems in the Java regex matching when the character following is one of the escape codes (b, f, n, r or t). In these cases the Windows \ is replaced with /. So any pattern including a Windows path with an escape code should similarly use / rather than \. For example,

{
  "pattern": "/IC:/references(.*)",
  "replace": "-IC:/references$1"
}

rather than /IC:\\\\references(.*) in the pattern.
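The whole-string matching rule is easy to overlook, so here is a small Python sketch of it. Python's re.fullmatch is used as a stand-in for Java's matches method, and the two helper functions are written for this illustration only; they are not the CPAGenerator's code. The patterns come from the example configuration file.

```python
import re


def matches(pattern, argument):
    """Whole-string match, analogous to Java's String.matches."""
    return re.fullmatch(pattern, argument) is not None


def replace_argument(pattern, replacement, argument):
    """Apply a Java-style "$1" replacement when the whole argument matches."""
    match = re.fullmatch(pattern, argument)
    if match is None:
        return argument
    # Substitute each $n reference with the text captured by group n.
    return re.sub(r"\$(\d+)",
                  lambda ref: match.group(int(ref.group(1))),
                  replacement)


if __name__ == "__main__":
    print(matches("-I", "-Iinclude"))       # False: "-I" is only a prefix
    print(matches("-I(.*)", "-Iinclude"))   # True
    print(replace_argument("/D(.*)", "-D$1", "/DNDEBUG"))  # -DNDEBUG
```

Because the match covers the whole argument, a bare "-I" pattern only matches the two-character argument -I, which is why the default configuration pairs it with "-I(.*)".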

Note that argumentsToInclude has an optional element "includeNextArg". This defaults to false when not provided. When true, the argument immediately following is also included. This is useful for cases where the option can be specified with or without a space before its value.

This requires two argumentsToInclude entries to cover both cases:

"argumentsToInclude": [
  {
    "pattern": "-D",
    "includeNextArg": "true"
  },
  {
    "pattern": "-D(.*)"
  }
]

The elements must be provided in this order since the second form matches both cases and would fail to include the following argument if their order was reversed.
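The ordering rule can be seen in a small simulation. The Python sketch below assumes that the first matching entry wins and that matching covers the whole argument; both assumptions are inferred from the text above rather than taken from the CPAGenerator source, and filter_arguments is a hypothetical helper.

```python
import re


def filter_arguments(arguments, include_rules):
    """Keep only arguments matched by an include rule (first match wins)."""
    kept = []
    index = 0
    while index < len(arguments):
        argument = arguments[index]
        for rule in include_rules:
            if re.fullmatch(rule["pattern"], argument):
                kept.append(argument)
                take_next = rule.get("includeNextArg") == "true"
                if take_next and index + 1 < len(arguments):
                    index += 1
                    kept.append(arguments[index])  # the option's value
                break
        index += 1
    return kept


ORDERED = [{"pattern": "-D", "includeNextArg": "true"}, {"pattern": "-D(.*)"}]
REVERSED = list(reversed(ORDERED))
ARGS = ["-D", "NDEBUG", "-DFOO=1", "-O2", "-c"]

if __name__ == "__main__":
    print(filter_arguments(ARGS, ORDERED))   # ['-D', 'NDEBUG', '-DFOO=1']
    print(filter_arguments(ARGS, REVERSED))  # ['-D', '-DFOO=1']
```

With the entries reversed, "-D(.*)" matches the bare -D first, so the value NDEBUG is never claimed and is dropped from the command.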

 

When using regular expressions in argumentsToReplace, $1 identifies the first group, $2 the second, and so on. For example:

"argumentsToReplace": [
  {
    "pattern": "/D(.*)",
    "replace": "-D$1"
  }
]

 

Note that argumentsToDelete has two optional elements, "deleteNextArg" and "deletePreviousArg". Both default to false when not provided. When true, the argument immediately following/preceding is also deleted. This is useful for cases where the option can be specified with or without a space before its value, as in the case of the /F option for MSVC. It can be specified as '/Fvalue' or '/F value'.

This requires two argumentsToDelete entries to cover both cases:

"argumentsToDelete": [
  {
    "pattern": "/FI",
    "deleteNextArg": "true"
  },
  {
    "pattern": "/FI(.*)"
  }
]

The elements must be provided in this order since the second form matches both cases and would fail to delete the following argument if their order was reversed.

The compileCommandsToInclude and compileCommandsToExclude can be used to select the project compile commands that you wish to process with the CPA parser. Typical uses are to exclude link and assembler compilation commands which are not appropriate for AST generation.

The patterns are matched first against the compile command's file element and, if there is no match, against the directory element.

The includes and excludes can be used together. The excludes take precedence over the includes.

Note that the format of the file element in the compile_commands.json depends on the tool used to generate the file. It may not include the full path to the file. Hence the check against the directory element to match file path patterns when the file element contains just the file name.

If compileCommandsToInclude is empty, the default is to include everything. If compileCommandsToExclude is empty, the default is to exclude nothing. In the default compile_config.json file both elements are empty.
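The selection rules described in this section can be summarised in a few lines of Python. This is a sketch of the stated rules only (file element first, then directory element; excludes take precedence; empty lists mean include everything and exclude nothing), with re.fullmatch standing in for Java's matches. The function names and sample entries are invented for the example.

```python
import re


def pattern_hits(patterns, entry):
    """True if any pattern matches the file element, else the directory element."""
    for pattern in patterns:
        if re.fullmatch(pattern, entry["file"]):
            return True
        if re.fullmatch(pattern, entry["directory"]):
            return True
    return False


def is_selected(entry, include_patterns, exclude_patterns):
    """Decide whether a compile command should be processed."""
    if pattern_hits(exclude_patterns, entry):
        return False  # excludes take precedence over includes
    # An empty include list means include everything.
    return not include_patterns or pattern_hits(include_patterns, entry)


if __name__ == "__main__":
    build = "/home/user/proj/build"  # hypothetical directory element
    commands = [
        {"file": "src/main.cpp", "directory": build},
        {"file": "boot.S", "directory": build},
        {"file": "util.c", "directory": build},
    ]
    for command in commands:
        print(command["file"], is_selected(command, ["(.*).cpp"], ["(.*).S"]))
```

With the include and exclude patterns from the example configuration file, only src/main.cpp is selected; with both lists empty, all three commands are.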

Command line argument configuration

compile-configuration-file-name also supports the AST generation phase command line options making it easier to keep the configuration under source control.

"ast-root-folder-name": "generated-ast-files",
"compile-commands-file-name": "compile_commands.json",
"excludes-file-name": "my_excludes.txt",
"output-file-name": "structure101-model",
"root-folder": "/home/user/",
"use-response-file": "true",
"gzip-compress": "true",
"keep-existing-files": "false",
"overwrite-cpa-file": "false",
"pool-size": "5",
"ignore-compilation-errors": "true",
"keep-going": "true",
"dry-run": "false"

Excludes configuration

excludes-file-name contains partial file paths, one per line:

#Linux            
/usr/            
/opt/            

#OS X            
/System            
/Applications/Xcode.app            

#Windows            
C:\Program            
C://Program

The paths are used to exclude commands from the processing and to strip unwanted lines from the generated AST files.

The compile command contains a file element. If a compile command's file string contains any of the exclude paths, the command is discarded and will not be executed.

AST files, by default, contain a significant number of lines that represent system library constructs. These are not needed for modelling your application structure. The lines representing these system constructs are stripped from the file as the generate-asts-cpa command calls the clean command, see below for more on the clean command. They are identified using the location string of the ast line. The ast entry is stripped if any of the exclude paths match the file path of the ast location string.

Note that both matches are case sensitive and currently do not support wildcards or regular expressions.
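The matching described above amounts to a plain, case-sensitive substring test, which the following Python sketch illustrates. It is a simulation and not the CPAGenerator's code; in particular, treating lines that start with # as comments is an assumption based on the sample file, and the helper names are invented.

```python
# Shortened copy of the sample excludes file above.
SAMPLE_EXCLUDES = r"""
#Linux
/usr/
/opt/

#Windows
C:\Program
"""


def load_excludes(text):
    """Return non-empty lines, skipping lines assumed to be # comments."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def is_excluded(path, excludes):
    """Case-sensitive substring test; no wildcards or regular expressions."""
    return any(exclude in path for exclude in excludes)


if __name__ == "__main__":
    excludes = load_excludes(SAMPLE_EXCLUDES)
    print(is_excluded("/usr/include/stdio.h", excludes))            # True
    print(is_excluded("/home/user/cpptest/Class1.cpp", excludes))   # False
    print(is_excluded(r"c:\program files\LLVM\x.h", excludes))      # False
```

The last call returns False because the comparison is case sensitive, so Windows paths that differ in case from the excludes entry would need their own line in the file.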

 

The generate-cpa command

Usage: generate-cpa -d <ast-root-folder-name> -o <output-file-name>

Generates a .cpa file (from Clang .ast files) which can then be loaded into Structure101.

The -d ast-root-folder-name will be recursively searched for .ast files.

The -o output-file-name is the .cpa file to be generated. If no output-file-name is provided, structure101-model.cpa will be created. When an output-file-name is provided with the -o argument but no -d argument, the ast-root-folder-name defaults to the current working directory.

The -cp converter-pool-size option specifies the number of concurrent threads to use when converting the AST files to the CPA model. The default size is 1.

generate-cpa will typically only be used in initial setup when there are some issues with AST generation and you want to get a sense of the partial Structure101 model of your code.

 

The run-compile-commands command

Usage: run-compile-commands -c <compile-commands-file-name> -f <compile-configuration-file-name> -e <excludes-file-name> -d <ast-root-folder-name>

Runs all the commands in compile-commands-file-name and generates .stdout and .stderr files to capture output.

This command is a useful diagnostic tool to help understand why the AST generation is failing.

A default diagnostic_compile_config.json file is provided for projects that already use clang. It uses -E -H -w to run a pre-compile and dump the header includes.

For compilers other than clang use appropriate options to run a pre-compile and dump include directives if possible.

The compile commands are run with the arguments as is except for the -E -H -w. This will confirm that the compile commands are valid and can run a successful pre-compile.

The .stdout file will contain the pre-compiler output and the .stderr file will contain the nested header includes.

These same compiler options can be used in a copy of your project's compile_config.json. Remove all other compilerArguments from the copied file and add -E -H -w.

The .stderr files will contain the nested include directives (and any errors) which can be compared with the output from the diagnostic_compile_config.json.

 

The clean command

Usage: clean <AST-file-name> <excludes-file-name>

Removes all references in AST-file-name (first argument) to code entities referenced in excludes-file-name (second argument).

Each line in the excludes-file-name should be a path to be excluded from the AST content. For example, to remove unwanted system include data, excludes.txt would contain /usr and/or c:\Program. As the focus of Structure101 is a model of your application code, system includes are typically excluded. In addition, AST files can be exceptionally large when they contain system includes.

The excludes-file-name defaults to excludes.txt.

If you expressly wish to keep all system and third-party library references in your model, simply use an empty excludes-file-name.

As the generate-asts-cpa command runs the clean command over each .ast file as it is created, it is unlikely you will need to use this command. However, if you do need to use clean, it is recommended to run it over each .ast file as it is created in order to minimise disk usage.

 

Database functionality licensed separately

The merge-dbs command

Usage: merge-dbs -in-db <input-db-name> -out-db <output-db-name>

Merges the content of the input CPA database into the output CPA database.

NOTE that the input-db-name and output-db-name must be existing CPA database files generated from previous runs of the CPA Parser.

 

The merge-dbs-cpa command

Usage: merge-dbs-cpa -in-db <input-db-name> -out-db <output-db-name> -o <output-file-name>

Merges the content of the input CPA database into the output CPA database and generates a .cpa file.

The -o output-file-name is the .cpa file to be generated. If no output filename is provided, structure101-model.cpa will be created.

NOTE that the input-db-name and output-db-name must be existing CPA database files generated from previous runs of the CPA Parser.

 

The import-edges command

Usage: import-edges -in-db <input-db-name> -out-db <output-db-name> -max-sql <max-results-sql> -res-sql <results-sql>

Queries the input CPA database for edges using the results-sql and imports them into the output CPA database.

NOTE that the max-results-sql should return the total count of rows that will be returned by the results-sql.

NOTE also that the input-db-name and output-db-name must be existing CPA database files generated from previous runs of the CPA Parser.


The import-edges-cpa command

Usage: import-edges-cpa -in-db <input-db-name> -out-db <output-db-name> -max-sql <max-results-sql> -res-sql <results-sql> -o <output-file-name>

Queries the input CPA database for edges using the results-sql and imports them into the output CPA database then generates a .cpa file.

The -o output-file-name is the .cpa file to be generated. If no output filename is provided, structure101-model.cpa will be created.

NOTE that the max-results-sql should return the total count of rows that will be returned by the results-sql.

NOTE also that the input-db-name and output-db-name must be existing CPA database files generated from previous runs of the CPA Parser.

The generate-from-db command

Usage: generate-from-db -in-db <input-db-name> -o <output-file-name>

Generates a .cpa file from the input CPA database.

The -o output-file-name is the .cpa file to be generated. If no output filename is provided, structure101-model.cpa will be created.

NOTE that the input-db-name must be an existing CPA database file generated from a previous run of the CPA Parser.