Thursday, January 05, 2012

Code-completion strategies in PyDev

I believe one of PyDev's strong points is its code completion, so I thought I'd give some details on how it works :)

The main preference page for code completion is Window > Preferences > PyDev > Editor > Code Completion. My preferred configuration enables 'Request completions on all letter chars and '_'', so that completions appear automatically while typing; otherwise Ctrl+Space has to be used to request them. I actually considered making that the default, but decided against it to stay consistent with the other editors in Eclipse.

* Word completion (also called Hippie Completion):

This is probably the simplest one and is provided by Eclipse itself (through Alt+/). It provides a simple word-based completion which uses all the currently opened editors in Eclipse.
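To make the idea concrete, here is a minimal sketch of a word-based completion in Python. It is not Eclipse's implementation: the function name word_completions and the sample buffers are made up for illustration, and each string in the list stands in for the text of one open editor.

```python
import re

WORD_RE = re.compile(r"\w+")


def word_completions(prefix, buffers):
    """Return words from the given texts that start with prefix, in first-seen order."""
    seen = set()
    result = []
    for text in buffers:
        for word in WORD_RE.findall(text):
            # Skip the prefix itself and anything already proposed.
            if word.startswith(prefix) and word != prefix and word not in seen:
                seen.add(word)
                result.append(word)
    return result


# Each string stands in for the contents of one open editor (hypothetical data).
open_editors = [
    "def compute_total(items):\n    return sum(items)",
    "total = compute_total([1, 2, 3])\ncompleted = True",
]
print(word_completions("comp", open_editors))  # ['compute_total', 'completed']
```

Note that nothing here knows about Python at all: any word that was typed anywhere is a candidate, which is what makes this completion so simple.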

I've actually provided a patch for Eclipse to improve the speed of this completion (https://bugs.eclipse.org/bugs/show_bug.cgi?id=270385), which was added in Eclipse 3.6.

* Templates completion:

These are user-defined templates that may be configured at PyDev > Editor > Templates (most of the base for this completion is provided by Eclipse... PyDev uses a subclass: PyTemplateCompletionProcessor and some of the available variables may be defined in Jython code -- see: pytemplate_defaults.py for details).

* Common tokens completion:

When you start typing in PyDev, some common tokens (e.g. keywords, self, etc.) start appearing directly. Those can be configured in PyDev > Editor > Code Completion (ctx insensitive and common tokens).

Its implementation is pretty simple (it may be seen at: KeywordsSimpleAssist).
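As a rough illustration only (this is not the code in KeywordsSimpleAssist, and the token list here is an assumption: Python's keywords plus self and cls), such a completion boils down to filtering a fixed list by the typed prefix:

```python
import keyword

# Assumed token list for this sketch; in PyDev the list is user-configurable.
COMMON_TOKENS = sorted(set(keyword.kwlist) | {"self", "cls"})


def common_token_completions(prefix):
    """Return the common tokens that start with the typed prefix."""
    if not prefix:
        return []
    return [t for t in COMMON_TOKENS if t.startswith(prefix) and t != prefix]


print(common_token_completions("se"))  # ['self']
print(common_token_completions("cl"))
```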

* Context insensitive completion:

This completion goes through all the tokens available for a given project (which may need to consider project dependencies and which interpreter is being used) and shows those tokens as a completion (i.e.: top-level tokens such as classes or methods and the modules themselves).

If one of those is selected, the token will be completed and an import will be added for it too (if the preference in PyDev > Editor > Auto Imports > "Do auto import?" is marked as true -- in that same preferences page, the number of chars that need to be available in a word so that these completions start appearing may be specified).

Note that if the option was set not to do the auto-import, one could just add the token, let it be marked as an unrecognized variable by PyDev and later do an Organize Imports (Ctrl+O), or a Quick Fix in that line (Ctrl+1), to add the import.

The major issue in this completion isn't actually the completion per se (implemented in ImportsCompletionParticipant and CtxParticipant), but the structure that needs to be maintained to keep it fast and efficient.

Mainly, PyDev has a concept called 'AdditionalInfo' (this was done when PyDev Extensions was separated from the PyDev Open Source, so, the name is a bit strange now, but the general idea is that it was additional information related to a given project or interpreter), which keeps the following information:

- Two TreeMaps (AbstractAdditionalTokensInfo.topLevelInitialsToInfo and AbstractAdditionalTokensInfo.innerInitialsToInfo) which map token names to information of the places where the token may be found (i.e.: module and structure inside that module). Those are all kept in memory and are pretty fast to access (AbstractAdditionalTokensInfo.getTokensStartingWith is what's interesting for a code-completion and AbstractAdditionalTokensInfo.getTokensEqualTo is interesting when doing a quick fix or organize imports). This structure is also used in the global tokens browser (Ctrl+Shift+T).

- Note that it also has a structure (AbstractAdditionalDependencyInfo.completeIndex) which maps a module to all the available tokens in it. This structure is kept in memory only as a SoftHashMap (so it stays in memory only while there's enough space for it) and is persisted to disk. It's also created lazily, only on operations that need it. Currently only a project-wide rename refactoring or a find references (Ctrl+Shift+G) would use it, as it's basically a structure which is a bit faster for exact-match searches than actually doing a search in Eclipse, especially if the SoftHashMap is still in memory. So, if many find references are done in succession and there's enough memory, things should be fast from the 2nd attempt onwards.
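The first structure above (sorted maps from token names to the places where they're found) can be sketched in Python with a sorted list and the bisect module standing in for a Java TreeMap. This is only an illustration of the idea: the class TokenIndex and its method names are hypothetical and do not mirror the actual layout of AbstractAdditionalTokensInfo.

```python
import bisect
from collections import defaultdict


class TokenIndex:
    """Sorted token names plus, for each name, the places where it is defined."""

    def __init__(self):
        self._names = []                    # sorted list of distinct token names
        self._locations = defaultdict(set)  # name -> {(module, path inside module)}

    def add(self, name, module, path=""):
        if name not in self._locations:
            bisect.insort(self._names, name)
        self._locations[name].add((module, path))

    def tokens_starting_with(self, prefix):
        """Prefix search: what a code completion needs."""
        start = bisect.bisect_left(self._names, prefix)
        result = []
        for name in self._names[start:]:
            if not name.startswith(prefix):
                break  # names are sorted, so no later name can match
            for module, path in sorted(self._locations[name]):
                result.append((name, module, path))
        return result

    def tokens_equal_to(self, name):
        """Exact search: what a quick fix or organize imports needs."""
        return sorted(self._locations.get(name, ()))


index = TokenIndex()
index.add("TestCase", "unittest.case")
index.add("TestSuite", "unittest.suite")
index.add("TextIOWrapper", "io")
print(index.tokens_starting_with("Test"))
print(index.tokens_equal_to("TestCase"))
```

Because the names are kept sorted, a prefix lookup only touches the matching range instead of scanning every token in the project.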

On a project build, the tokens of the completeIndex are simply all removed (to be recalculated when some action that needs it is called). As for the maps, those are always kept up to date when a file is changed. The strategy for having it build fast is that the in-memory cache is updated directly (which is reasonably fast) and, instead of saving the whole map, only the delta information is saved. When restoring the info, those deltas are applied to bring it to the last state (and from time to time the whole structure is dumped and the deltas are removed). Also, it runs in a separate thread (not in the thread that's doing the build), and a singleton, RunnableAsJobsPoolThread, makes sure that only some of those, depending on the number of processors in your machine, are running at the same time, so if you change 200 files at once, your computer won't come to a halt.
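The delta-based persistence can be sketched as a snapshot file plus an append-only log of changes. Everything in this sketch is an assumption made for illustration (the class DeltaPersistedIndex, the file names, the JSON format and the max_deltas threshold); PyDev's on-disk format is different.

```python
import json
import os
import tempfile


class DeltaPersistedIndex:
    """In-memory map persisted as a full snapshot plus a log of deltas."""

    def __init__(self, directory, max_deltas=100):
        self.snapshot_path = os.path.join(directory, "index.json")
        self.delta_path = os.path.join(directory, "index.deltas")
        self.max_deltas = max_deltas
        self.tokens = {}    # module name -> list of token names
        self._pending = 0   # number of deltas currently in the log
        self._load()

    def _load(self):
        # Restore the last full dump, then replay the deltas on top of it.
        if os.path.exists(self.snapshot_path):
            with open(self.snapshot_path) as f:
                self.tokens = json.load(f)
        if os.path.exists(self.delta_path):
            with open(self.delta_path) as f:
                for line in f:
                    op, module, names = json.loads(line)
                    self._apply(op, module, names)
                    self._pending += 1

    def _apply(self, op, module, names):
        if op == "set":
            self.tokens[module] = names
        else:  # "remove"
            self.tokens.pop(module, None)

    def update(self, op, module, names=None):
        # Update memory directly and append only the change to disk.
        self._apply(op, module, names)
        with open(self.delta_path, "a") as f:
            f.write(json.dumps([op, module, names]) + "\n")
        self._pending += 1
        if self._pending >= self.max_deltas:
            self.compact()

    def compact(self):
        # From time to time, dump the whole structure and drop the deltas.
        with open(self.snapshot_path, "w") as f:
            json.dump(self.tokens, f)
        if os.path.exists(self.delta_path):
            os.remove(self.delta_path)
        self._pending = 0


with tempfile.TemporaryDirectory() as directory:
    idx = DeltaPersistedIndex(directory, max_deltas=3)
    idx.update("set", "mod_a", ["Foo", "bar"])
    idx.update("set", "mod_b", ["Baz"])
    idx.update("remove", "mod_a")        # third delta triggers a full dump
    idx.update("set", "mod_c", ["qux"])  # stays in the delta log
    restored = DeltaPersistedIndex(directory)
    print(restored.tokens)
```

The point of the design is that a single file change costs one small append rather than a rewrite of the whole map.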

* Context sensitive completion:

This is by far the most complex completion available as it analyzes the context where you're requesting a completion and provides tokens based on it. Basically, PyDev has an internal type-inference engine to do that (which is also used by actions such as find definition or TDD actions such as create method).

Internally it uses an LRU structure which maps module names to the module AST (Abstract Syntax Tree) and in a pretty recursive algorithm finds out about the available tokens needed for a given context and provides completions based on that (thankfully it has a huge amount of unit tests holding it all together). That process starts in PyCodeCompletion.getCodeCompletionProposals(ITextViewer, CompletionRequest) and the type inference engine main classes are: ASTManager and ProjectModulesManager.
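A much simplified sketch of the caching part, written in Python with the standard ast module: an LRU map from module names to parsed ASTs, plus a helper that collects top-level names from a tree. The names ModuleAstCache and top_level_tokens are hypothetical, the sources dictionary stands in for files on disk, and the real engine in ASTManager does far more (imports, inheritance, recursion into other modules).

```python
import ast
from collections import OrderedDict


class ModuleAstCache:
    """LRU cache mapping module names to their parsed AST."""

    def __init__(self, sources, capacity=2):
        self._sources = sources      # module name -> source text (stand-in for disk)
        self._capacity = capacity
        self._cache = OrderedDict()
        self.parse_count = 0         # only here to show when a parse really happens

    def get(self, module_name):
        if module_name in self._cache:
            self._cache.move_to_end(module_name)  # mark as most recently used
            return self._cache[module_name]
        tree = ast.parse(self._sources[module_name])
        self.parse_count += 1
        self._cache[module_name] = tree
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)       # evict the least recently used
        return tree


def top_level_tokens(tree):
    """Collect classes, functions and simple assignments at module level."""
    names = []
    for node in tree.body:
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            names.append(node.name)
        elif isinstance(node, ast.Assign):
            names.extend(t.id for t in node.targets if isinstance(t, ast.Name))
    return names


sources = {
    "shapes": "class Circle:\n    pass\n\ndef area(c):\n    return 0\n\nPI = 3.14\n",
    "util": "def helper():\n    pass\n",
}
cache = ModuleAstCache(sources)
print(top_level_tokens(cache.get("shapes")))  # ['Circle', 'area', 'PI']
```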

On some occasions some modules may be pretty hard to analyze, in which case PyDev resorts to launching a shell and querying it for the needed tokens (those are pre-specified in Window > Preferences > PyDev > Interpreter > Forced Builtins, and the communication happens on the Java side through the AbstractShell class) -- it's also probably one of the main causes of problems when configuring PyDev, as it's common to have a firewall blocking that communication (in which case PyDev wouldn't even be able to get common builtins such as len, object, etc).
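The general idea of asking a live interpreter for tokens can be sketched as below. Note the swapped-in technique: this sketch reads the answer from the child process's standard output, while PyDev talks to its shell over a connection that a firewall can block, as described above. The function name tokens_from_shell and the JSON exchange format are assumptions made for the example.

```python
import json
import subprocess
import sys


def tokens_from_shell(module_name, python=sys.executable):
    """Import a module in a separate interpreter and return the names it exposes."""
    code = (
        "import importlib, json, sys\n"
        "mod = importlib.import_module(sys.argv[1])\n"
        "print(json.dumps(sorted(dir(mod))))\n"
    )
    completed = subprocess.run(
        [python, "-c", code, module_name],
        capture_output=True,
        text=True,
        check=True,   # raise if the child interpreter fails to import the module
        timeout=30,
    )
    return json.loads(completed.stdout)


print("len" in tokens_from_shell("builtins"))
print("sqrt" in tokens_from_shell("math"))
```

Since the module is really imported, this works for compiled modules that can't be analyzed from source, at the cost of keeping another process around.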

On the good side, this also makes it possible for PyDev to analyze .pyd modules (although if you're developing such a module as a part of your project, you have to remember to call Ctrl+2, kill so that PyDev will kill those shells before you actually build it, otherwise that module will be locked and you won't be able to link it -- and tokens wouldn't be updated).