PyCLang Notes

Installing, Loading LibClang, Versions That Play Nice

A Basic Install

To install on Ubuntu try:

sudo apt-get update -y
sudo apt-get install -y libclang-dev

But beware: sometimes versions don't play nice together. For example, the Python bindings at version 6.0.0.2 seem to require at least libclang 8 (the libclang1-8 package).

If you see anything like the following, you may have a bindings vs. library version mismatch:

>>> import clang.cindex
>>> index = clang.cindex.Index.create()
Traceback (most recent call last):
   <snip>
AttributeError: /usr/lib/llvm-3.8/lib/libclang.so: undefined symbol: clang_CXXConstructor_isConvertingConstructor
                         ^^^^^^^^
                        Out-of-date libclang library!!!

To install a specific version use, for example:

sudo apt install libclang1-8
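
A quick way to check a particular libclang for this sort of mismatch, without going through the bindings at all, is to load the shared object with ctypes and probe for the symbols the bindings want. This is only a minimal sketch: the helper name missing_symbols is made up for illustration, and the library path in the usage comment is an assumption (adjust it to wherever your libclang lives). The symbol name is the one from the traceback above.

```python
import ctypes  # only needed for the usage shown in the comment below


def missing_symbols(library, symbols):
    """Return the names in `symbols` that `library` does not export.

    `library` is expected to be a ctypes.CDLL. Looking up a symbol the
    shared object lacks raises AttributeError, which hasattr() reports
    as False.
    """
    return [name for name in symbols if not hasattr(library, name)]


# Usage (hypothetical path, not run here):
#   lib = ctypes.CDLL("/usr/lib/llvm-8/lib/libclang.so.1")
#   print(missing_symbols(
#       lib, ["clang_CXXConstructor_isConvertingConstructor"]))
```

An empty list means the library exports everything you asked about; anything else points at an out-of-date libclang.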

Getting Python To Find LibClang

On some platforms Python doesn't seem to automatically find the LibClang library. You'll know it hasn't found the library when you see something like this:

>>> import clang.cindex
>>> index = clang.cindex.Index.create()
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/clang/cindex.py", line 4129, in get_cindex_library
    <snip>
OSError: libclang.so: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  <snip>
clang.cindex.LibclangError: libclang.so: cannot open shared object file: No such file or directory. To provide a path to libclang use Config.set_library_path() or Config.set_library_file().

To get it to find the library, set your LD_LIBRARY_PATH environment variable to include the path to that library's directory [Ref]. On macOS the equivalent variable is DYLD_LIBRARY_PATH, so you may need to set that too. For example:

export DYLD_LIBRARY_PATH=/usr/lib/llvm-8/lib/
export LD_LIBRARY_PATH=/usr/lib/llvm-8/lib/
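
The error message above also mentions Config.set_library_path() and Config.set_library_file(), which let you point the bindings at the library from inside Python instead of via the environment. Here is a minimal sketch of that route. The helper names find_libclang and configure_libclang, the candidate directories and the file name patterns are all assumptions for illustration; only Config.set_library_file() comes from the bindings.

```python
import glob
import os

# Assumed install locations - edit to suit your system.
CANDIDATE_DIRS = ("/usr/lib/llvm-8/lib", "/usr/lib/x86_64-linux-gnu")

# Tried in order; the last pattern insists on a digit so that unrelated
# libraries such as libclang-cpp are not picked up by mistake.
PATTERNS = ("libclang.so", "libclang.so.*", "libclang-[0-9]*.so*")


def find_libclang(directories=CANDIDATE_DIRS):
    """Return the first libclang shared object found, or None."""
    for directory in directories:
        for pattern in PATTERNS:
            matches = sorted(glob.glob(os.path.join(directory, pattern)))
            if matches:
                return matches[0]
    return None


def configure_libclang(directories=CANDIDATE_DIRS):
    """Tell the bindings which library file to load.

    Call this before the first Index.create(); the bindings refuse to
    change the library once it has been loaded.
    """
    import clang.cindex  # imported here so find_libclang() works without it

    path = find_libclang(directories)
    if path is None:
        raise RuntimeError("no libclang found in %r" % (directories,))
    clang.cindex.Config.set_library_file(path)
    return path
```

With this in place, calling configure_libclang() once at start-up should do the same job as the exports above.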

To see where Python is loading the library from, use:

LD_DEBUG=libs python3

Do your import and index creation as normal and look at the trace output to see how it is finding your libclang library.
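
On Linux you can get a similar answer from inside the running interpreter by looking at /proc/self/maps after the index has been created. A minimal sketch, assuming the usual maps layout (address, perms, offset, device, inode, then an optional path); the helper name mapped_libraries is made up for this example.

```python
import os


def mapped_libraries(maps_text, needle="libclang"):
    """Return the sorted, unique mapped file paths whose base name
    contains `needle`, given the text of /proc/<pid>/maps."""
    paths = set()
    for line in maps_text.splitlines():
        # Anonymous mappings have only five fields and are skipped.
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            if needle in os.path.basename(path):
                paths.add(path)
    return sorted(paths)


# Usage (Linux only, after clang.cindex.Index.create() has run):
#   with open("/proc/self/maps") as f:
#       print(mapped_libraries(f.read()))
```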

Debian Stretch

Getting the right version of libclang for the Python3 bindings that are auto installed using pip was a little challenging. So far I have this [Ref1][Ref2] (seems to work!):

sudo apt install software-properties-common
sudo apt update
sudo apt install lsb-release

# For latest version:
# bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"

# But I want 8, so
wget https://apt.llvm.org/llvm.sh
chmod +x llvm.sh
sudo ./llvm.sh 8

sudo ln -s /usr/lib/llvm-8/lib/libclang.so.1 /usr/lib/llvm-8/lib/libclang.so
export LD_LIBRARY_PATH=/usr/lib/llvm-8/lib
export DYLD_LIBRARY_PATH=/usr/lib/llvm-8/lib

But note that this installs the entire clang toolchain which, if you only want the libclang shared library, takes up a whole load more disk space than needed - gigs' worth! The installed tree can be pruned, however, to get rid of anything you don't need. There is probably an easier way! Sigh...

Some Examples / Playing

Here are some examples of playing around with pyclang...

I have version 6.0.0.2, installed using pip install clang (clang installed separately).

Poo, at the moment I can't see a way of getting the opcode of a binary operator using these bindings. There appears to be an accepted patch for this functionality, but it's been hanging around for over 4 years at the time of writing... so err... not holding my breath.

PyBee seems to have added this functionality in their fork called Sealang, which they say is an improved set of Python bindings for libclang, but unfortunately this project is no longer maintained. I tried testing it. Although it installed, the module could not be imported due to a missing symbol - I'm guessing it's too out of date to work with the later libclang versions :'(
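
In the meantime, one workaround is to read the operator out of the token stream: take the tokens of the BINARY_OPERATOR cursor and pick the first one that comes after the tokens of its left-hand child. This is a minimal sketch, not something the bindings provide. The function name is made up, and it assumes get_tokens() returns exactly the tokens inside the cursor's extent; macros will probably confuse it.

```python
def binary_operator_spelling(cursor):
    """Guess the operator of a BINARY_OPERATOR cursor from its tokens.

    Assumption: the operator is the first token after those covered by
    the left-hand operand. Returns None if that cannot be worked out.
    """
    children = list(cursor.get_children())
    if len(children) != 2:
        return None
    left_len = len(list(children[0].get_tokens()))
    tokens = list(cursor.get_tokens())
    if left_len >= len(tokens):
        return None
    return tokens[left_len].spelling
```

For a cursor with tokens ['a', '*', 'b', '*', 'c'] whose left child covers ['a', '*', 'b'], this picks out the second '*'.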

The Translation Unit - clang.cindex.TranslationUnit

The TranslationUnit seems to have the following useful properties:
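
As a rough illustration, here is a minimal sketch that parses a snippet held in a string and pulls a few things out of the resulting TranslationUnit: its spelling, its diagnostics and the top-level children of its cursor. The helper names parse_source and summarise, and the default file name snippet.c, are made up for this example.

```python
def parse_source(source, filename="snippet.c", args=()):
    """Parse C source held in a string into a TranslationUnit.

    The file need not exist on disk: its contents are passed to
    libclang through unsaved_files.
    """
    import clang.cindex  # needs a working libclang, see above

    index = clang.cindex.Index.create()
    return index.parse(filename, args=list(args),
                       unsaved_files=[(filename, source)])


def summarise(tu):
    """Collect a few TranslationUnit properties into a dict."""
    return {
        "spelling": tu.spelling,  # name of the main file
        "diagnostics": [d.spelling for d in tu.diagnostics],
        "top_level": [c.spelling for c in tu.cursor.get_children()],
    }


# Usage (not run here):
#   tu = parse_source("int f(void);\nint g;\n")
#   print(summarise(tu))
```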

The Cursor Abstraction - clang.cindex.Cursor

The cursor abstraction unifies the different kinds of entities in a program - declarations, statements, expressions, references to declarations, etc. - under a single "cursor" abstraction with a common set of operations. Common operations for a cursor include: getting the physical location in a source file where the cursor points, getting the name associated with a cursor, and retrieving cursors for any child nodes of a particular cursor [ref].

The cursor members that are useful for navigating the AST are get_children(), lexical_parent, semantic_parent and walk_preorder(). From the clang docs:

The lexical parent of a cursor is the cursor in which the given cursor was actually written. For many declarations, the lexical and semantic parents are equivalent (the semantic parent is returned by clang_getCursorSemanticParent()). They diverge when declarations or definitions are provided out-of-line. For example:

class C {
 void f();
};
void C::f() { }

In the out-of-line definition of C::f, the semantic parent is the class C, of which this function is a member. The lexical parent is the place where the declaration actually occurs in the source code; in this case, the definition occurs in the translation unit. In general, the lexical parent for a given entity can change without affecting the semantics of the program, and the lexical parent of different declarations of the same entity may be different. Changing the semantic parent of a declaration, on the other hand, can have a major impact on semantics, and redeclarations of a particular entity should all have the same semantic context.

In the example above, both declarations of C::f have C as their semantic context, while the lexical context of the first C::f is C and the lexical context of the second C::f is the translation unit.
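
To poke at this from Python, a small recursive dump built on get_children() goes a long way. The sketch below prints one line per node in roughly the style of the AST dumps on this page, plus a helper to compare the two parents of a cursor. The names dump_tree and parent_spellings are made up here, and the code only relies on kind.name, spelling, get_children(), lexical_parent and semantic_parent.

```python
def dump_tree(cursor, depth=0, lines=None):
    """Return one formatted line per cursor, children indented below
    their parent (a pre-order walk using get_children())."""
    if lines is None:
        lines = []
    lines.append("%s+-- NODE: CursorKind.%s spel = %r (len=%d)" % (
        "    " * depth, cursor.kind.name, cursor.spelling,
        len(cursor.spelling)))
    for child in cursor.get_children():
        dump_tree(child, depth + 1, lines)
    return lines


def parent_spellings(cursor):
    """Return (lexical parent spelling, semantic parent spelling).

    Either entry is None when the cursor has no such parent.
    """
    def spelling(parent):
        return parent.spelling if parent is not None else None

    return spelling(cursor.lexical_parent), spelling(cursor.semantic_parent)


# Usage (not run here), given a parsed translation unit `tu`:
#   print("\n".join(dump_tree(tu.cursor)))
```

For the out-of-line definition of C::f above, I'd expect parent_spellings() to give the translation unit's file name first and "C" second.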

The cursor abstraction has the following properties/functions of interest, some of which wrap up the C cursor manipulator functions [ref]:

Finding Enums

I wanted to find enums, whether they were anonymous or named, and for both cases if they were hidden behind a typedef. I was only interested in globally defined enums, not enums embedded in structs or local to functions, but I've included some examples here.

  1. An anonymous enum:
    // 1. Anonymous enum
    enum { ANON_ENUM_1, ... };
    • cursor.spelling = ""
    • cursor.type.spelling = "enum (anonymous at test_files/test1.c:1:1)"
    • cursor.is_anonymous() = True
    • The AST tree representing this is:
      +-- NODE: CursorKind.ENUM_DECL spel = '' (len=0)
          |   : cur.type.spelling: enum (anonymous at test_files/test1.c:1:1)
          |   : cur.type.kind: TypeKind.ENUM
          |   : cur.is_anonymous: True
          |   : cur.lexical_parent.spelling: test_files/test1.c
          |   : cur.semantic_parent.spelling: test_files/test1.c
          |   : cur.enum_type.spelling: int
          +-- NODE:  CursorKind.ENUM_CONSTANT_DECL spel = 'ANON_ENUM_1' (len=11)
          |       : cur.type.spelling: int
          |       : cur.enum_value: 0
          |       : cur.semantic_parent.type.spelling: (anonymous at test_files/test1.c:1:1)
          |       : cur.semantic_parent.kind.spelling: CursorKind.ENUM_DECL
          +-- NODE:  CursorKind.ENUM_CONSTANT_DECL spel = 'ANON_ENUM_2' (len=11)
          ...
          ...
          ...
  2. A named enum called Bare_Named_Enum.
    // 2. Named enum
    enum Bare_Named_Enum { BARE_NAMED_ENUM_1, ... };
    • There is only one enum decl.
    • cursor.spelling = "Bare_Named_Enum"
    • cursor.type.spelling = "enum Bare_Named_Enum"
    • cursor.is_anonymous() = False
    • The AST tree representing this:
      +-- NODE: CursorKind.ENUM_DECL spel = 'Bare_Named_Enum' (len=15)
          |   : cur.type.spelling: enum Bare_Named_Enum
          |   : cur.type.kind: TypeKind.ENUM
          |   : cur.is_anonymous: False
          |   : cur.lexical_parent.spelling: test_files/test1.c test_files/test1.c
          |   : cur.enum_type.spelling: int
          +-- NODE:  CursorKind.ENUM_CONSTANT_DECL spel = 'BARE_NAMED_ENUM_1' (len=17)
          |       : cur.type.spelling: int
          |       : cur.enum_value: 0
          |       : cur.semantic_parent.type.spelling: enum Bare_Named_Enum 
          |       : cur.semantic_parent.kind.spelling: CursorKind.ENUM_DECL
          +-- NODE:  CursorKind.ENUM_CONSTANT_DECL spel = 'BARE_NAMED_ENUM_2' (len=17)
          ...
          ...
          ...
  3. A typedef'ed anonymous enum.
    // 3. Typedef'd anonymous enum
    typedef enum { TYPEDEF_ANON_ENUM_1, ... } Typedef_Anonymouse_Enum_t;
    • cursor.spelling = ""
    • cursor.type.spelling = "Typedef_Anonymouse_Enum_t"
    • cursor.is_anonymous() = False. Presumably because it is referenced by the type created.
    • AST:
      +-- NODE: CursorKind.TYPEDEF_DECL spel = 'Typedef_Anonymouse_Enum_t' (len=25)
          |       : cur.type.spelling: Typedef_Anonymouse_Enum_t
          |       : cur.spelling: Typedef_Anonymouse_Enum_t
          |       : cur.underlying_typedef_type.spelling: enum Typedef_Anonymouse_Enum_t
          +-- NODE: CursorKind.ENUM_DECL spel = '' (len=0)
              |   : cur.type.spelling: Typedef_Anonymouse_Enum_t
              |   : cur.type.kind: TypeKind.ENUM
              |   : cur.is_anonymous(): False
              |   : cur.lexical_parent.type.spelling: test_files/test1.c
              |   : cur.semantic_parent.type.spelling: test_files/test1.c
              |   : cur.enum_type.spelling: int
              +-- NODE:  CursorKind.ENUM_CONSTANT_DECL spel = 'TYPEDEF_ANON_ENUM_1' (len=19)
              |       : cur.type.spelling: int
              |       : cur.enum_value: 0
              |       : cur.lexical_parent.type.spelling: enum MySecondTestEnum
              |       : cur.semantic_parent.kind.spelling: CursorKind.ENUM_DECL
              +-- NODE:  CursorKind.ENUM_CONSTANT_DECL spel = 'TYPEDEF_NAMED_ENUM_2' (len=20)
              ...
              ...
  4. A typedef'ed enum with a name.
    // 4. Typedef'd named enum
    typedef enum Typdef_Named_enum { TYPEDEF_NAMED_ENUM_1, ... } Typedef_Named_Enum_t;
    • There are two enum decls - one for the enum alone, and one as a child of the typedef.
    • cursor.spelling = "Typdef_Named_enum"
    • cursor.type.spelling = "enum Typdef_Named_enum"
    • cursor.is_anonymous() = False
    • AST:
      +-- NODE: CursorKind.TYPEDEF_DECL spel = 'Typedef_Named_Enum_t' (len=20)
          |   : cur.type.spelling: Typedef_Named_Enum_t
          |   : cur.spelling: Typedef_Named_Enum_t
          |   : cur.underlying_typedef_type.spelling: enum Typdef_Named_enum
          +-- NODE:  CursorKind.ENUM_DECL spel = 'Typdef_Named_enum' (len=17)
              |       : cur.type.spelling: enum Typdef_Named_enum
              |       : cur.type.kind: TypeKind.ENUM
              |       : cur.is_anonymous(): False
              |       : cur.lexical_parent.spelling: test_files/test1.c
              |       : cur.semantic_parent.spelling: test_files/test1.c
              |       : cur.enum_type.spelling: int
              +-- NODE:  CursorKind.ENUM_CONSTANT_DECL spel = 'TYPEDEF_NAMED_ENUM_1' (len=20)
              |           : cur.type.spelling: int
              |           : cur.enum_value: 0
              |           : cur.lexical_parent.type.spelling: enum Typdef_Named_enum 
              |           : cur.semantic_parent.type.spelling: enum Typdef_Named_enum
              |           : cur.semantic_parent.kind.spelling: CursorKind.ENUM_DECL
              +-- NODE:  CursorKind.ENUM_CONSTANT_DECL spel = 'TYPEDEF_NAMED_ENUM_2' (len=20)
              ...
              ...
           ...
  5. A named enum declared inside a structure.
    struct thestruct {
       enum enum_in_struct {
          ENUM_IN_STRUCT_1, ENUM_IN_STRUCT_2
       } val;
    };
    • The AST looks like this:
      +-- NODE:  CursorKind.STRUCT_DECL spel = 'thestruct' (len=9)
          +-- NODE:  CursorKind.ENUM_DECL spel = 'enum_in_struct' (len=14)
          |   |   : cur.type.spelling: enum enum_in_struct
          |   |   : cur.type.kind: TypeKind.ENUM
          |   |   : cur.is_anonymous(): False
          |   |   : cur.lexical_parent.spelling: thestruct test_files/test1.c
          |   |   : cur.semantic_parent.spelling: thestruct test_files/test1.c
          |   |   : cur.enum_type.spelling:  int
          |   +-- NODE:  CursorKind.ENUM_CONSTANT_DECL spel = 'ENUM_IN_STRUCT_1' (len=16)
          |   |       : cur.type.spelling: int
          |   |       : cur.enum_value: 0
          |   |       : cur.lexical_parent.type.spelling: enum enum_in_struct 
          |   |       : cur.semantic_parent.type.spelling: enum enum_in_struct
          |   |       : cur.semantic_parent.kind: CursorKind.ENUM_DECL
          |   +-- NODE:  CursorKind.ENUM_CONSTANT_DECL spel = 'ENUM_IN_STRUCT_2' (len=16)
          |           ...
          +-- NODE:  CursorKind.FIELD_DECL spel = 'val' (len=3)
              +-- NODE:  CursorKind.ENUM_DECL spel = 'enum_in_struct' (len=14)
                  |   : cur.type.spelling: enum enum_in_struct
                  |   : cur.type.kind: TypeKind.ENUM
                  |   : cur.is_anonymous(): False
                  |   : cur.lexical_parent.spelling: thestruct test_files/test1.c
                  |   : cur.semantic_parent.spelling: thestruct test_files/test1.c
                  |   : cur.enum_type.spelling:  int
                  +-- NODE:  CursorKind.ENUM_CONSTANT_DECL spel = 'ENUM_IN_STRUCT_1' (len=16)
                  |       : cur.type.spelling: int
                  |       : cur.enum_value: 0
                  |       : parents: enum enum_in_struct enum enum_in_struct CursorKind.ENUM_DECL
                  +-- NODE:  CursorKind.ENUM_CONSTANT_DECL spel = 'ENUM_IN_STRUCT_2' (len=16)
                          : cur.type.spelling: int
                          : cur.enum_value: 1
                          : parents: enum enum_in_struct enum enum_in_struct CursorKind.ENUM_DECL
  6. A typedef'd enum declared in a function:
    +-- NODE:  CursorKind.FUNCTION_DECL spel = 'func' (len=4)
        +-- NODE:  CursorKind.COMPOUND_STMT spel = '' (len=0)
            +-- NODE:  CursorKind.DECL_STMT spel = '' (len=0)
                +-- NODE:  CursorKind.ENUM_DECL spel = 'enum_in_func' (len=12)
                |   +-- NODE:  CursorKind.ENUM_CONSTANT_DECL spel = 'E_IN_FUNC_1' (len=11)
                |   +-- NODE:  CursorKind.ENUM_CONSTANT_DECL spel = 'E_IN_FUNC_2' (len=11)
                +-- NODE:  CursorKind.TYPEDEF_DECL spel = 'Enum_In_Func_t' (len=14)
                    +-- NODE:  CursorKind.ENUM_DECL spel = 'enum_in_func' (len=12)
                        +-- NODE:  CursorKind.ENUM_CONSTANT_DECL spel = 'E_IN_FUNC_1' (len=11)
                        +-- NODE:  CursorKind.ENUM_CONSTANT_DECL spel = 'E_IN_FUNC_2' (len=11)

To get the enums:
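
One way to do it, as a minimal sketch based on the dumps above: look only at the direct children of the translation unit's cursor (which skips enums buried in structs and functions), take every ENUM_DECL, and pair it with a typedef name when the same enum also shows up as the child of a TYPEDEF_DECL. The function name global_enums and the shape of the returned dicts are made up for this example. It assumes a typedef'd enum also appears as its own top-level ENUM_DECL, and that an unnamed enum has an empty spelling, as with the libclang version used on this page.

```python
def global_enums(tu_cursor):
    """Return a list of dicts describing the file-scope enums.

    Each dict has "name" (None for an unnamed enum), "typedef" (the
    typedef name, or None) and "constants" (constant name -> value).
    """
    top_level = list(tu_cursor.get_children())

    # Remember which enum cursors sit underneath a typedef.
    typedefs = []
    for node in top_level:
        if node.kind.name == "TYPEDEF_DECL":
            for child in node.get_children():
                if child.kind.name == "ENUM_DECL":
                    typedefs.append((child, node.spelling))

    enums = []
    for node in top_level:
        if node.kind.name != "ENUM_DECL":
            continue
        # Cursor equality compares the underlying declarations.
        alias = next((name for enum, name in typedefs if enum == node), None)
        constants = {}
        for child in node.get_children():
            if child.kind.name == "ENUM_CONSTANT_DECL":
                constants[child.spelling] = child.enum_value
        enums.append({
            "name": node.spelling or None,
            "typedef": alias,
            "constants": constants,
        })
    return enums


# Usage (not run here), given a parsed translation unit `tu`:
#   for enum in global_enums(tu.cursor):
#       print(enum)
```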

Functions

All functions are represented in the AST using CursorKind.FUNCTION_DECL nodes. To differentiate between declarations and definitions, the cursor function is_definition() is used.

To go from the declaration to the definition the cursor function get_definition() can be used.

When a function is called, it is represented in the AST using a CursorKind.CALL_EXPR node.
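
Putting those three points together, here is a minimal sketch that walks a tree with walk_preorder() and sorts what it finds into functions that are defined, functions that are only declared, and functions that are called. The name summarise_functions and the returned dict layout are made up for this example. It assumes get_definition() returns None when the translation unit holds no definition.

```python
def summarise_functions(root):
    """Classify FUNCTION_DECL and CALL_EXPR cursors found under `root`."""
    defined, declared_only, called = [], [], []
    for node in root.walk_preorder():
        kind = node.kind.name
        if kind == "FUNCTION_DECL":
            if node.is_definition():
                defined.append(node.spelling)
            elif node.get_definition() is None:
                # A prototype with no body anywhere in this translation unit.
                declared_only.append(node.spelling)
        elif kind == "CALL_EXPR":
            called.append(node.spelling)
    return {
        "defined": defined,
        "declared_only": declared_only,
        "called": called,
    }


# Usage (not run here), given a parsed translation unit `tu`:
#   print(summarise_functions(tu.cursor))
```

A prototype whose definition appears later in the same file is left out of declared_only, because get_definition() finds the body for it.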

typedef int NewType_t;

long func_with_params(char a, short b, NewType_t c)
{
   return a * b * c;
}
         
+-- NODE:  CursorKind.FUNCTION_DECL spel = 'func_with_params' (len=16)
    |   : cur.is_definition() True
    |   : cur.linkage: LinkageKind.EXTERNAL
    |   : cur.result_type.spelling: long
    |   : cur.get_arguments().type.spelling: ['char', 'short', 'NewType_t']
    +-- NODE:  CursorKind.PARM_DECL spel = 'a' (len=1)
    |       : cur.type.spelling: char
    +-- NODE:  CursorKind.PARM_DECL spel = 'b' (len=1)
    |       : cur.type.spelling: short
    +-- NODE:  CursorKind.PARM_DECL spel = 'c' (len=1)
    |       : cur.type.spelling: NewType_t
    +-- NODE:  CursorKind.COMPOUND_STMT spel = '' (len=0)
        +-- NODE:  CursorKind.RETURN_STMT spel = '' (len=0)
            +-- NODE:  CursorKind.UNEXPOSED_EXPR spel = '' (len=0)
                +-- NODE:  CursorKind.BINARY_OPERATOR spel = '' (len=0)
                    |   : tokens: ['a', '*', 'b', '*', 'c']
                    +-- NODE:  CursorKind.BINARY_OPERATOR spel = '' (len=0)
                    |   |   : tokens: ['a', '*', 'b']
                    |   +-- NODE:  CursorKind.UNEXPOSED_EXPR spel = 'a' (len=1)
                    |   |   +-- NODE:  CursorKind.UNEXPOSED_EXPR spel = 'a' (len=1)
                    |   |       +-- NODE:  CursorKind.DECL_REF_EXPR spel = 'a' (len=1)
                    |   |               : type char
                    |   |               : referenced type char
                    |   +-- NODE:  CursorKind.UNEXPOSED_EXPR spel = 'b' (len=1)
                    |       +-- NODE:  CursorKind.UNEXPOSED_EXPR spel = 'b' (len=1)
                    |           +-- NODE:  CursorKind.DECL_REF_EXPR spel = 'b' (len=1)
                    |                   : type short
                    |                   : referenced type short
                    +-- NODE:  CursorKind.UNEXPOSED_EXPR spel = 'c' (len=1)
                        +-- NODE:  CursorKind.DECL_REF_EXPR spel = 'c' (len=1)
                                : type NewType_t
                                : referenced type NewType_t
         

void call_func_with_params(void)
{
   long a;
   a = func_with_params('c', 10, 100);
}
         
+-- NODE:  CursorKind.FUNCTION_DECL spel = 'call_func_with_params' (len=21)
    |   : cur.is_definition() True
    |   : cur.get_definition().is_definition() True
    |   : cur.linkage: LinkageKind.EXTERNAL
    |   : cur.result_type.spelling: void
    +-- NODE:  CursorKind.COMPOUND_STMT spel = '' (len=0)
        +-- NODE:  CursorKind.DECL_STMT spel = '' (len=0)
        |   +-- NODE:  CursorKind.VAR_DECL spel = 'a' (len=1)
        +-- NODE:  CursorKind.BINARY_OPERATOR spel = '' (len=0)
            |   : tokens: ['a', '=', 'func_with_params', '(', "'c'", ',', '10', ',', '100', ')']
            +-- NODE:  CursorKind.DECL_REF_EXPR spel = 'a' (len=1)
            |       : type long
            |       : referenced type long
            +-- NODE:  CursorKind.CALL_EXPR spel = 'func_with_params' (len=16)
                |   : cur.type.spelling: long
                |   : cur.get_arguments().type.spelling: ['char', 'short', 'int']
                +-- NODE:  CursorKind.UNEXPOSED_EXPR spel = 'func_with_params' (len=16)
                |   +-- NODE:  CursorKind.DECL_REF_EXPR spel = 'func_with_params' (len=16)
                |           : type long (char, short, int)
                |           : referenced type long (char, short, int)
                +-- NODE:  CursorKind.UNEXPOSED_EXPR spel = '' (len=0)
                |   +-- NODE:  CursorKind.CHARACTER_LITERAL spel = '' (len=0)
                +-- NODE:  CursorKind.UNEXPOSED_EXPR spel = '' (len=0)
                |   +-- NODE:  CursorKind.INTEGER_LITERAL spel = '' (len=0)
                |           : tokens: ['10']
                +-- NODE:  CursorKind.INTEGER_LITERAL spel = '' (len=0)
                        : tokens: ['100']
         

void use_a_function_pointer(void)
{
   long (*ptr)(char a, short b, int c);

   ptr = &func_with_params;

   struct
   {
      long (*ptr)(char a, short b, int c);
   } s;

   s.ptr = &func_with_params;

   ptr(1, 2, 3);
   s.ptr(11, 12, 13);
}