Indexing in ENSIME

Sam Halliday

Scala Sphere 2017

Introduction

Sam Halliday @fommil

  • Chartered Mathematician
    • DSP, optimisation, quantum, machine learning, etc
  • Libre Education and Software
    • 5mil textbooks in South Africa (Siyavula)
    • FSF Fellow (they fight for BSD / Apache too!)
    • netlib-java (underpinning Spark ML)
    • ENSIME core developer

I am really an applied mathematician by training, this software thing is really just a hobby that pays the bills.

Before the crash, I did industrial research in digital signal processing, multi-high-dimensional optimisation, quantum mechanics, machine learning and a bunch of other stuff that I could just talk about forever.

I’m really into Free or Libre education and software. When I was a student in Cape Town I was one of the founders of an initiative that eventually became Siyavula and has printed 5 million Free textbooks to students in South Africa.

I’m a fellow of the Free Software Foundation. I believe they do really great things and I’d encourage you to join up even if you don’t believe in the GPL. They are doing some great lobbying for all of us against legislation that seeks to undermine our right to use or write Free software, which includes the Apache 2.0 and BSD licenses.

My most used free software project is netlib-java, which I spoke about at last year’s Scala eXchange. It’s included in the Spark Machine Learning library.

But my favourite project is ENSIME, which is an alternative development environment to Eclipse and IntelliJ for Scala and now Java.

Editor Poll

  • IntelliJ
  • ScalaIDE
  • ENSIME
  • Other

What is ENSIME

  • language server for Scala (and Java)
  • build tool plugins (sbt, maven, etc)
  • text editor plugins (emacs, vim, etc)

What is an Indexer for?

  • import class at point
  • jump to source (classes, methods, fields)
  • CamelCase / substring search
  • find implementations
  • find references

Architecture

This is an architectural overview of the internals of the ensime-server, which is bounded here by the dotted lines.

The text editor communicates with the server via SWANK, a bidirectional TCP/IP sockets protocol using S-Expressions as the language, or JERKY which is JSON over WebSockets.

The server runs locally, so it also has direct access to the files on the disc and can watch for changes without needing to be told about them. This is typically used for detecting changes in the compiled files rather than looking for changes in source code.

And when the server is started, it needs to be given a .ensime file which defines the project layout. This is typically generated by the build tool.

Inside the server, everything goes via the central Project class which effectively just delegates to the relevant sub-component. The two big parts are the Search Service and the Analyzer:

  1. The Search Service indexes all the binaries related to the project, including third party jars. We use ASM to do the heavy lifting and we persist the results to H2 to enable various types of searches. We also build up an index in Lucene for advanced searching, such as camel case searching of a classname.
  2. The Analyzer is our layer that sits on top of the Scala Presentation Compiler, which is an interactive version of the Scala Compiler but is supposed to be quicker because it shortcuts various stages in order to be responsive. This is the same backend that is used by the Scala IDE, but it is released as part of the official Scala Compiler jar.
  3. We also have the ability to identify source code to binaries, e.g. to relate your third party source zip files to the jars that you’re including. This lets us implement the “jump to source” functionality beyond the user’s project files.
  4. Documentation is hosted via a Spray HTTP server and viewed in a normal web browser.
  5. A debug manager component allows interactive debugging sessions against a running JVM. It manages the state of the threads and allows stepping and inspection.

1.0

Index vs Database

  • Database
    • retrievable data
    • fully qualified name
  • Index
    • human search (CamelCase, simple classname)
    • not needed if you have the FQN

How we extract the data

ASM: ClassVisitor

import org.objectweb.asm._
import org.objectweb.asm.Opcodes._

val receiver = new ClassVisitor(ASM5) { ... }
val reader = new ClassReader(in)
reader.accept(receiver, ClassReader.SKIP_FRAMES)
receiver // mutated having been visited
override def visit(
  version: Int,
  access: Int,
  name: String,
  signature: String,
  superName: String,
  interfaces: Array[String]
): Unit
override def visitSource(
  filename: String,
  debug: String
): Unit

ASM: MethodVisitor

override def visitMethod(
  access: Int,
  region: String,
  desc: String,
  signature: String,
  exceptions: Array[String]
) = new MethodVisitor {
  override def visitMethodInsn(
    opcode: Int,
    owner: String,
    name: String,
    desc: String,
    itf: Boolean
  ): Unit

ASM: FieldVisitor

override def visitField(
  access: Int,
  name: String,
  desc: String,
  signature: String,
  value: AnyRef
) = new FieldVisitor {
  override def visitAnnotation(
    desc: String,
    visible: Boolean
   ): Unit

scalap

import scala.tools.scalap.scalax.rules.scalasig._

val classFile = ClassFileParser.parse(byteCode)
ScalaSigParser.parse(classFile): Option[ScalaSig]

FQNs and Descriptors

scala.collection.immutable.List
scala.collection.immutable.List$
scala.collection.immutable.$colon$colon
scala.collection.immutable.$colon$colon$
scala.collection.immutable.Nil
scala.collection.immutable.Nil$
scala.collection.convert.package$$anon$1
scala.collection.convert.package$$anon$2
scala.collection.convert.package$$anon$3
scala.collection.convert.package$$anon$4
scala.collection.convert.package$$anon$5
scala.collection.convert.package$
scala.collection.convert.package
java.nio.channels.FileChannel.write // OVERLOADING!
java.nio.channels.FileChannel.write(Ljava/nio/ByteBuffer;)I
java.nio.channels.FileChannel.write([Ljava/nio/ByteBuffer;II)J
java.nio.channels.FileChannel.write([Ljava/nio/ByteBuffer;)J
java.nio.channels.FileChannel.write(Ljava/nio/ByteBuffer;J)I

We hit a problem when we get to methods, because overloaded methods have the same FQN in bytecode.

We have a hack, and we include the descriptor in the FQN.

But the descriptor is another kind of string with its own format. e.g. slashes instead of dots, and a representation for arrays and primitives, and separating parameters from return values.

And remember, everything is erased… so an FQN does not include generic information.

Model

type S = String
sealed trait Desc { def desc: S }
case class ArrayDescriptor(fqn: Desc) extends Desc
case class Descriptor(params: List[Desc], ret: Desc)

sealed trait Fqn { def fqn: S}
case class Package(p: List[S]) extends Fqn
case class Class(p: Package, n: S) extends Fqn with Desc

sealed trait Member extends Fqn
case class Field(o: Class, n: S) extends Member
case class Method(o: Class, n: S, d: Descriptor) extends Member

This is how we model the core ADT, I’m taking some liberties here to let it fit onto one slide.

Note that a Desc is an internal bytecode string, and so is an Fqn.

I could probably have modelled this much cleaner with typeclasses.

Generics

  • Bytecode / Java Generics not Scala types
  • how hard can it be?

class Foo[T]
<X:Ljava/lang/Object;>Ljava/lang/Object;

<T::Lorg/ensime/TraitOne;:Lorg/ensime/TraitTwo;>Ljava/lang/Object;

<U:Ljava/lang/Object;V:Lorg/ensime/DummyParent<TU;>;>Lorg/ensime/DummyParent<TV;>;

// +- is upper/lower not covariance/contravariance
Dummy[java.util.List[_ >: Number]]
Lorg/ensime/Dummy<Ljava/util/List<-Ljava/lang/Number;>;>;
// contributed with SignatureParser by Adam Sznajder
case class GenericClass(
  params: Seq[GenericParam],
  supers: Seq[GenericClassName]
)

case class GenericParam(
  n: S,
  classes: Seq[Generic]
)
sealed trait Generic
case class GenericArray(className: Generic) extends Generic
case class GenericVar(name: S) extends Generic
case class GenericClassName(
  c: ClassName,
  args: Seq[GenericArg],
  inners: Seq[InnerClassName]
) extends Generic
case class GenericArg(bound: Option[BoundType], sig: Generic)

sealed trait BoundType
case object UpperBound extends BoundType
case object LowerBound extends BoundType

// e.g. Lorg/scalatest/SuperEngine<TT;>.Node;
// (kinda weird that it's a special case...)
case class InnerClassName(n: S, args: Seq[GenericArg])

ADT

case class RawClassfile(
  name: ClassName,
  generics: Option[GenericClass],
  superClass: Option[ClassName],
  interfaces: List[ClassName],
  access: Access,
  deprecated: Boolean,
  fields: Queue[RawField],
  methods: Queue[RawMethod],
  source: RawSource
)

case class RawSource(file: Option[S], line: Option[Int])
case class RawField(
  name: FieldName,
  clazz: DescriptorType,
  generics: Option[S],
  access: Access
)

case class RawMethod(
  name: MethodName,
  access: Access,
  generics: Option[S],
  line: Option[Int]
)

Scalap names

case class RawType(
  fqn: S,
  access: Access
)

case class RawScalaClass(
  javaName: ClassName,
  scalaName: S,
  typeSignature: S,
  access: Access,
  declaredAs: DeclaredAs,
  fields: Seq[RawScalaField],
  methods: Seq[RawScalaMethod]
)
case class RawScalaField(
  javaName: FieldName,
  scalaName: S,
  typeInfo: S,
  access: Access
)

case class RawScalaMethod(
  scalaName: S,
  signature: S,
  access: Access
)

What we store

final case class FileCheck(
    id: Option[Int],
    filename: String,
    timestamp: Timestamp
)
final case class FqnSymbol(
    id: Option[Int],
    file: String, // the underlying file (class or jar)
    path: String, // URL to the classfile
    fqn: String,  // should really be the primary key
    internal: Option[String], // FQN of a field's type
    source: Option[String], // URL to the source file
    line: Option[Int],
    offset: Option[Int] = None
)

This is the definition of the very simple SQL schema (we use slick to access H2).

We have a TABLE to keep track of the timestamp of every file that we are tracking.

And for every FQN in your project, we store the file that it is contained in, and the URL path to that file. There is probably a lot of wasted memory here because almost everything will have the same prefix.

The fqn should probably be the id field to be honest but we had some problems with duplicate FQNs in an earlier schema that we’ve since resolved (overloaded methods). This is the bytecode FQN plus the method signature if it’s a method.q

The internal column is for storing the FQN of a field’s type (if this FQN is a field). With this we can distinguish between fields, methods and classes.

The source is the full URL of the source code for this FQN, along with the line and offset (if we know it).

does it sound familiar?

Caching and Naming

“There are only two hard things in Computer Science”

– Phil Karlton

A rose by any other name

  • bytecode FQN
  • descriptor
  • FQN + descriptor
  • name with generics
  • scalap name
  • java language name
  • scala language name
  • scalac internal name

Workaround

  • convert everything to FQN + descriptor

About that caching…

  • one big batch index on startup
    • fast restart is a design requirement
  • incremental in response to filewatcher
    • not as important: structure rarely changes
    • debounce, don’t handle failures

Source Resolver

  • bytecode: Filename.scala:123
  • we know the path/to/the/Clazz.class
  • infer path/to/the/Filename.scala
  • and path/Filename.scala
  • but not some/other/place/Filename.scala
  • nor multiple sources in different jars

Lucene

class FqnAnalyzer extends DynamicSynonymAnalyzer {
  private def cases(s: String) = List(s, s.toLowerCase)
  override def synonyms(term: String): Set[String] = {
    val (path, name) = term.replaceAll("\\(.*", "").split('.').toList.initLast
    def camel(s: String) = s.filter(_.isUpper)
    def spacey(s: String) = List(s, s.replace('.', ' '))

    val shortPkg = path.map(_.charAt(0)).mkString("", ".", "." + name)
    val innerNames = name.split('$').flatMap(cases)

    Set(term, name, camel(name)) ++ innerNames ++ spacey(shortPkg)
  }.filter(_.size > 1).flatMap(cases) - term
}
  • do we really need Lucene?
    • probably not
    • one user, we can spend CPU cycles
  • could just index FQNs in-memory
    • e.g. Emacs `flx-ido`

Limitations

  • we are throwing away almost everything
    • references (i.e. find usages)
    • hierachy (i.e. find implementations)
    • scala type info (i.e. semantic search)
  • throwing stuff away
  • minimal use of scalap

It all started so well

for {
  checks <- db.knownFiles()
  stale = findStaleFileChecks(checks)
  deletes <- deleteReferences(stale)
  bases = findBases()
  added <- indexBases(bases, checks)
  _ <- commitIndex()
} yield (deletes, added)

Horror of the code

You know you’re in trouble when…

def global: ExecutionContext = null
val ec = dispatchers.lookup("magic-dispatcher")

Future {
  blocking {
    semaphore.acquire() // homebrew backpressure
// TODO: make up our mind out being a service or actor...
class IndexingQueueActor(searchService: SearchService)
  with ActorLogging {

    // FileObject.equals doesn't work, use toString as keys
    var todo = Map.empty[String, FileObject]
    // TODO: refactor this into DebounceActor
    var worker: Cancellable = _

    val advice =
      "If the problem persists, you may need to restart"

2.0

Google Summer of Code

Several key features are missing from ENSIME that are frequently requested by users: find usages and show implementations.

Graphpocalypse

New kinds of data

  • scalap attached to the nodes
  • scala.meta
  • runtime information (needs agent)

New kinds of features

  • find implementations
  • find usages
  • find dead code
  • coverage
  • compilation optimisation

Performance Considerations

Future

Needs Contributors

  • more granular API
  • perf improvements
  • more data sources
  • rewrite: FS2 / Swave / Monix / scala-async
  • natural search (type search)