
[Experience Sharing] Glusterfs hacker guide (1)


Posted on 2019-2-1 10:54:03
1  Translator 101 Lesson
1.1   Translator 101 Lesson 1: Setting the Stage
  This is the first post in a series that will explain some of the details of writing a GlusterFS translator, using some actual code to illustrate.
  Before we begin, a word about environments. GlusterFS is over 300K lines of code spread across a few hundred files. That’s no Linux kernel or anything, but you’re still going to be navigating through a lot of code in every code-editing session, so some kind of cross-referencing is essential. I use cscope with the vim bindings, and if I couldn’t do “ctrl-\ g” and such to jump between definitions all the time my productivity would be cut in half. You may prefer different tools, but as I go through these examples you’ll need something functionally similar to follow along. OK, on with the show.
  The first thing you need to know is that translators are not just bags of functions and variables. They need to have a very definite internal structure so that the translator-loading code can figure out where all the pieces are. The way it does this is to use dlsym to look for specific names within your shared-object file, as follows (from xlator.c):
  if (!(xl->fops = dlsym (handle, "fops"))) {
          gf_log ("xlator", GF_LOG_WARNING, "dlsym(fops) on %s", dlerror ());
          goto out;
  }
  if (!(xl->cbks = dlsym (handle, "cbks"))) {
          gf_log ("xlator", GF_LOG_WARNING, "dlsym(cbks) on %s", dlerror ());
          goto out;
  }
  if (!(xl->init = dlsym (handle, "init"))) {
          gf_log ("xlator", GF_LOG_WARNING, "dlsym(init) on %s", dlerror ());
          goto out;
  }
  if (!(xl->fini = dlsym (handle, "fini"))) {
          gf_log ("xlator", GF_LOG_WARNING, "dlsym(fini) on %s", dlerror ());
          goto out;
  }
  In this example, xl is a pointer to the in-memory object for the translator we’re loading. As you can see, it’s looking up various symbols by name in the shared object it just loaded, and storing pointers to those symbols. Some of them (e.g. init) are functions, while others (e.g. fops) are dispatch tables containing pointers to many functions. Together, these make up the translator’s public interface.
  Most of this glue or boilerplate can easily be found at the bottom of one of the source files that make up each translator. We’re going to use the rot-13 translator just for fun, so in this case you’d look in rot-13.c to see this:
  struct xlator_fops fops = {
          .readv  = rot13_readv,
          .writev = rot13_writev
  };

  struct xlator_cbks cbks = {};

  struct volume_options options[] = {
          { .key = {"encrypt-write"}, .type = GF_OPTION_TYPE_BOOL },
          { .key = {"decrypt-read"}, .type = GF_OPTION_TYPE_BOOL },
          { .key = {NULL} },
  };
  The fops table, defined in xlator.h, is one of the most important pieces. This table contains a pointer to each of the filesystem functions that your translator might implement – open, read, stat, chmod, and so on. There are 82 such functions in all, but don’t worry; any that you don’t specify here will be seen as null and filled in with defaults from defaults.c when your translator is loaded. In this particular example, since rot-13 is an exceptionally simple translator, we only fill in two entries, for readv and writev.
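  The "unspecified entries are filled with defaults" behavior can be sketched in a few lines of plain C. This is only an illustration under stated assumptions: the toy_* names, the four-entry table, and the return values are hypothetical stand-ins, not the real struct xlator_fops or the code in defaults.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much-reduced stand-in for struct xlator_fops. */
typedef int (*toy_fop_t)(const char *path);

enum { TOY_OPEN, TOY_READV, TOY_WRITEV, TOY_STAT, TOY_FOP_COUNT };

struct toy_fops {
    toy_fop_t fn[TOY_FOP_COUNT];
};

/* Default handler; a real default would pass the request on to the child. */
static int toy_default_fop(const char *path)
{
    (void)path;
    return 0;
}

/* A translator-specific handler, playing the role of rot13_readv. */
static int toy_custom_readv(const char *path)
{
    (void)path;
    return 13;
}

/* Fill every entry the translator left NULL with the default handler.
 * Returns the number of entries that were filled in. */
static int toy_fill_defaults(struct toy_fops *fops)
{
    int filled = 0;

    assert(fops != NULL);
    for (int i = 0; i < TOY_FOP_COUNT; i++) {
        if (fops->fn[i] == NULL) {
            fops->fn[i] = toy_default_fop;
            filled++;
        }
    }
    return filled;
}
```

  With a table that only sets the readv slot, toy_fill_defaults fills the other three entries and leaves the custom one alone, which is the same division of labor the loader gives you for the real table.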

  There are actually two other tables, also required to have predefined names, that are also used to find translator functions: cbks (which is empty in this snippet) and dumpops (which is missing entirely). The first of these specifies entry points for when inodes are forgotten or file descriptors are released.
  The last piece I’ll cover today is options. As you can see, this is a table of translator-specific option names and some information about their types. GlusterFS actually provides a pretty rich set of types (volume_option_type_t in options.h) which includes paths, translator names, percentages, and times in addition to the obvious integers and strings. Also, the volume_option_t structure can include information about default values, minimum and maximum values, permitted values, and descriptions, as in these two entries:
  { .key = {"data-self-heal-algorithm"},
    .type = GF_OPTION_TYPE_STR,
    .default_value = "",
    .description = "Select between \"full\", \"diff\". The "
                   "\"full\" algorithm copies the entire file from "
                   "source to sink. The \"diff\" algorithm copies to "
                   "sink only those blocks whose checksums don't match "
                   "with those of source.",
    .value = { "diff", "full", "" }
  },
  { .key = {"data-self-heal-window-size"},
    .type = GF_OPTION_TYPE_INT,
    .min = 1,
    .max = 1024,
    .default_value = "1",
    .description = "Maximum number blocks per file for which self-heal "
                   "process would be applied simultaneously."
  },
  When your translator is loaded, all of this information is used to parse the options actually provided in the volfile, and then the result is turned into a dictionary and stored as xl->options. This dictionary is then processed by your init function, which you can see being looked up in the first code fragment above. We’re only going to look at a small part of rot-13’s init for now.
  priv->decrypt_read = 1;
  priv->encrypt_write = 1;

  data = dict_get (this->options, "encrypt-write");
  if (data) {
          if (gf_string2boolean (data->data, &priv->encrypt_write) == -1) {
                  gf_log (this->name, GF_LOG_ERROR,
                          "encrypt-write takes only boolean options");
                  return -1;
          }
  }

  What we can see here is that we’re setting some defaults in our priv structure, then looking to see if an “encrypt-write” option was actually provided. If so, we convert and store it.
  So far we’ve covered the basics of how a translator gets loaded, how we find its various parts, and how we process its options. In my next Translator 101 post, we’ll go a little deeper into other things that init and its companion fini might do, and how some other fields in our xlator_t structure (commonly referred to as this) are commonly used.
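  The defaults-then-override pattern in the init fragment above can be tried outside GlusterFS with a small stand-alone sketch. Everything here is an assumption made for illustration: toy_option, toy_dict_get and toy_string2boolean are hypothetical stand-ins for the dictionary and for gf_string2boolean, and the set of accepted spellings is not taken from GlusterFS.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical key/value pair standing in for one entry of xl->options. */
struct toy_option {
    const char *key;
    const char *value;
};

/* Look up a key; returns NULL if the option was not supplied. */
static const char *toy_dict_get(const struct toy_option *opts, size_t n,
                                const char *key)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(opts[i].key, key) == 0)
            return opts[i].value;
    }
    return NULL;
}

/* Convert a string to 0/1. Returns 0 on success and -1 if the string is not
 * recognized, in which case *out is left untouched. */
static int toy_string2boolean(const char *s, int *out)
{
    static const char *const yes[] = { "1", "on", "yes", "true" };
    static const char *const no[]  = { "0", "off", "no", "false" };

    for (size_t i = 0; i < 4; i++) {
        if (strcmp(s, yes[i]) == 0) { *out = 1; return 0; }
        if (strcmp(s, no[i]) == 0)  { *out = 0; return 0; }
    }
    return -1;
}

/* Set the default first, then override it only if the option was given. */
static int toy_parse_encrypt_write(const struct toy_option *opts, size_t n,
                                   int *encrypt_write)
{
    const char *data;

    assert(encrypt_write != NULL);
    *encrypt_write = 1;                       /* default */
    data = toy_dict_get(opts, n, "encrypt-write");
    if (data && toy_string2boolean(data, encrypt_write) == -1)
        return -1;                            /* bad value: fail init */
    return 0;
}
```

  An absent option leaves the default in place, a valid one overrides it, and an unparseable one makes the function fail, mirroring the three outcomes of the rot-13 fragment.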
1.2   Translator 101 Lesson 2: init, fini, and private context
  In the previous Translator 101 post, we looked at some of the dispatch tables and options processing in a translator. This time we’re going to cover the rest of the “shell” of a translator – i.e. the other global parts not specific to handling a particular request.

  Let’s start by looking at the beginning of init.
  int32_t
  init (xlator_t *this)
  {
          data_t *data = NULL;
          rot_13_private_t *priv = NULL;

          if (!this->children || this->children->next) {
                  gf_log ("rot13", GF_LOG_ERROR,
                          "FATAL: rot13 should have exactly one child");
                  return -1;
          }

          if (!this->parents) {
                  gf_log (this->name, GF_LOG_WARNING,
                          "dangling volume. check volfile ");
          }

          priv = GF_CALLOC (sizeof (rot_13_private_t), 1, 0);
          if (!priv)
                  return -1;

  At the very top, we see the function signature – we get a pointer to the xlator_t object that we’re initializing, and we return an int32_t status. As with most functions in the translator API, this should be zero to indicate success. In this case it’s safe to return -1 for failure, but watch out: in dispatch-table functions, the return value means the status of the function call rather than the request. A request error should be reflected as a callback with a non-zero op_ret value, but the dispatch function itself should still return zero. In fact, the handling of a non-zero return from a dispatch function is not all that robust (we recently had a bug report in HekaFS related to this).
  The first thing this init function does is check that the translator is being set up in the right kind of environment. Translators are called by parents and in turn call children. Some translators are “initial” translators that inject requests into the system from elsewhere – e.g. mount/fuse injecting requests from the kernel, protocol/server injecting requests from the network. Those translators don’t need parents, but rot-13 does and so we check for that. Similarly, some translators are “final” translators that (from the perspective of the current process) terminate requests instead of passing them on – e.g. protocol/client passing them to another node, storage/posix passing them to a local filesystem. Other translators “multiplex” between multiple children – passing each parent request on to one (cluster/dht), some (cluster/stripe), or all (cluster/afr) of those children. Rot-13 fits into none of those categories either, so it checks that it has exactly one child. It might be more convenient or robust if translator shared libraries had standard variables describing these requirements, to be checked in a consistent way by the translator-loading infrastructure itself instead of by each separate init function, but this is the way translators work today.
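  The difference between the status of the call and the status of the request, described at the start of this discussion, is easy to get wrong, so here is a minimal sketch of the convention. It assumes nothing about the real call-frame machinery: toy_cbk_t, toy_result and toy_lookup are hypothetical names, and the "paths must start with a slash" rule exists only to produce a failing request.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback type: op_ret < 0 means the request failed. */
typedef void (*toy_cbk_t)(void *cookie, int op_ret, int op_errno);

/* Records what the callback was told so a caller can inspect it later. */
struct toy_result {
    int called;
    int op_ret;
    int op_errno;
};

static void toy_record_cbk(void *cookie, int op_ret, int op_errno)
{
    struct toy_result *res = cookie;

    res->called++;
    res->op_ret = op_ret;
    res->op_errno = op_errno;
}

/* Dispatch-style function: the request may fail, the call itself does not. */
static int toy_lookup(const char *path, toy_cbk_t cbk, void *cookie)
{
    assert(cbk != NULL);
    if (path == NULL || path[0] != '/')
        cbk(cookie, -1, ENOENT);   /* request error travels via the callback */
    else
        cbk(cookie, 0, 0);         /* request succeeded */
    return 0;                      /* the function call succeeded either way */
}
```

  In both cases the caller sees a zero return from toy_lookup and learns the real outcome only from op_ret and op_errno in the callback.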

  The last thing we see in this fragment is allocating our private data area. This can literally be anything we want; the infrastructure just provides the priv pointer as a convenience but takes no responsibility for how it’s used. In this case we’re using GF_CALLOC to allocate our own rot_13_private_t structure. This gets us all the benefits of GlusterFS’s memory-leak detection infrastructure, but the way we’re calling it is not quite ideal.
  To finish our tour of standard initialization/termination, let’s look at the end of init and the beginning of fini.
          this->private = priv;
          gf_log ("rot13", GF_LOG_DEBUG, "rot13 xlator loaded");
          return 0;
  }

  void
  fini (xlator_t *this)
  {
          rot_13_private_t *priv = this->private;

          if (!priv)
                  return;
          this->private = NULL;
          GF_FREE (priv);
  At the end of init we’re just storing our private-data pointer in the private field of our xlator_t, then returning zero to indicate that initialization succeeded. As is usually the case, our fini is even simpler. All it really has to do is GF_FREE our private-data pointer, which we do in a slightly roundabout way here. Notice how we don’t even have a return value here, since there’s nothing obvious and useful that the infrastructure could do if fini failed.
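  The whole private-data life cycle fits in a short stand-alone sketch. The types below are hypothetical reductions of xlator_t and rot_13_private_t, and plain calloc/free stand in for GF_CALLOC/GF_FREE, so this shows the shape of the pattern rather than the real API.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, reduced stand-ins for rot_13_private_t and xlator_t. */
typedef struct {
    int decrypt_read;
    int encrypt_write;
} toy_private_t;

typedef struct {
    const char *name;
    void *private;   /* owned by the translator, opaque to the framework */
} toy_xlator_t;

/* Allocate and attach the private area; 0 on success, -1 on failure. */
static int toy_init(toy_xlator_t *this)
{
    toy_private_t *priv;

    assert(this != NULL);
    priv = calloc(1, sizeof(*priv));   /* plain calloc, not GF_CALLOC */
    if (!priv)
        return -1;
    priv->decrypt_read = 1;
    priv->encrypt_write = 1;
    this->private = priv;
    return 0;
}

/* Detach and free the private area; there is nothing useful to return. */
static void toy_fini(toy_xlator_t *this)
{
    toy_private_t *priv = this->private;

    if (!priv)
        return;
    this->private = NULL;   /* clear first so a second fini is a no-op */
    free(priv);
}
```

  Clearing the pointer before freeing is the "slightly roundabout" step: it costs one line and makes calling toy_fini twice harmless.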
  That’s practically everything we need to know to get our translator through loading, initialization, options processing, and termination. If we had defined no dispatch functions, we could actually configure a daemon to use our translator and it would work as a basic pass-through from its parent to a single child. In the next post I’ll cover how to build the translator and configure a daemon to use it, so that we can actually step through it in a debugger and see how it all fits together before we actually start adding functionality.
1.3   Translator 101 Lesson 3: This Time For Real
  In the first two parts of this series, we learned how to write a basic translator skeleton that can get through loading, initialization, and option processing. This time we’ll cover how to build that translator, configure a volume to use it, and run the glusterfs daemon in debug mode.
  Unfortunately, there’s not much direct support for writing new translators. You can check out a GlusterFS tree and splice in your own translator directory, but that’s a bit painful because you’ll have to update multiple makefiles plus a bunch of autoconf garbage. As part of the HekaFS project, I basically reverse engineered the truly necessary parts of the translator-building process and then pestered one of the Fedora glusterfs package maintainers (thanks daMaestro!) to add a glusterfs-devel package with the required headers. Since then the complexity level in the HekaFS tree has crept back up a bit, but I still remember the simple method and still consider it the easiest way to get started on a new translator. For the sake of those not using Fedora, I’m going to describe a method that doesn’t depend on that header package. What it does depend on is a GlusterFS source tree, much as you might have cloned from GitHub or the Gluster review site. This tree doesn’t have to be fully built, but you do need to run autogen.sh and configure in it. Then you can take the following simple makefile and put it in a directory with your actual source.
  # Change these to match your source code.
  TARGET = rot-13.so
  OBJECTS = rot-13.o

  # Change these to match your environment.
  GLFS_SRC = /play/glusterfs
  GLFS_LIB = /opt/glusterfs/3git/lib64
  HOST_OS = GF_LINUX_HOST_OS

  # You shouldn't need to change anything below here.

  CFLAGS = -fPIC -Wall -O0 -g \
          -DHAVE_CONFIG_H -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D$(HOST_OS) \
          -I$(GLFS_SRC) -I$(GLFS_SRC)/libglusterfs/src \
          -I$(GLFS_SRC)/contrib/uuid
  LDFLAGS = -shared -nostartfiles -L$(GLFS_LIB) -lglusterfs -lpthread

  $(TARGET): $(OBJECTS)
          $(CC) $(OBJECTS) $(LDFLAGS) -o $(TARGET)
  Yes, it’s still Linux-specific. Mea culpa. As you can see, we’re sticking with the rot-13 example, so you can just copy the files from …/xlators/encryption/rot-13/src in your GlusterFS tree to follow along. Type “make” and you should be rewarded with a nice little .so file.
  [jeff@gfs-i8c-01 xlator_example]$ ls -l rot-13.so
  -rwxr-xr-x. 1 jeff jeff 40784 Nov 16 16:41 rot-13.so
  Notice that we’ve built with optimization level zero and debugging symbols included, which would not typically be the case for a packaged version of GlusterFS. Let’s put our version of rot-13.so into a slightly different file on our system, so that it doesn’t stomp on the installed version (not that you’d ever want to use that anyway).
  [root@gfs-i8c-01 xlator_example]# ls /opt/glusterfs/3git/lib64/glusterfs/3git/xlator/encryption/
  crypt.so  crypt.so.0  crypt.so.0.0.0  rot-13.so   rot-13.so.0  rot-13.so.0.0.0
  [root@gfs-i8c-01 xlator_example]# cp rot-13.so  /opt/glusterfs/3git/lib64/glusterfs/3git/xlator/encryption/my-rot-13.so
  These paths represent the current Gluster filesystem layout, which is likely to be deprecated in favor of the Fedora layout; your paths may vary. At this point we’re ready to configure a volume using our new translator. To do that, I’m going to suggest something that’s strongly discouraged except during development (the Gluster guys are going to hate me for this): write our own volfile. Here’s just about the simplest volfile you’ll ever see.
  volume my-posix
  type storage/posix
  option directory /play/export
  end-volume
  volume my-rot13
  type encryption/my-rot-13
  subvolumes my-posix
  end-volume
  All we have here is a basic brick using /play/export for its data, and then an instance of our translator layered on top – no client or server is necessary for what we’re doing, and the system will automatically push a mount/fuse translator on top if there’s no server translator. To try this out, all we need is the following command (assuming the directories involved already exist).
  [jeff@gfs-i8c-01 xlator_example]$ glusterfs --debug -f my.vol  /play/import
  You should be rewarded with a whole lot of log output, including the text of the volfile (this is very useful for debugging problems in the field). If you go to another window on the same machine, you can see that you have a new filesystem mounted.
  [jeff@gfs-i8c-01 ~]$ df /play/import
  Filesystem           1K-blocks      Used Available Use% Mounted on
  /play/xlator_example/my.vol
  114506240   2706176  105983488   3% /play/import
  Just for fun, write something into a file in /play/import, then look at the corresponding file in /play/export to see it all rot-13’ed for you.
  [jeff@gfs-i8c-01 ~]$ echo hello > /play/import/a_file
  [jeff@gfs-i8c-01 ~]$ cat /play/export/a_file
  uryyb
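  The transformation behind that "hello" to "uryyb" result is small enough to reproduce on its own. The sketch below is an assumption-laden stand-in, not the code from rot-13.c: the toy_* function names are hypothetical, and only the general shape (a per-buffer rot13 applied across a struct iovec array) follows what the translator is described as doing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>   /* struct iovec */

/* Rotate ASCII letters by 13 places in place; other bytes are untouched. */
static void toy_rot13(char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] >= 'a' && buf[i] <= 'z')
            buf[i] = (char)('a' + (buf[i] - 'a' + 13) % 26);
        else if (buf[i] >= 'A' && buf[i] <= 'Z')
            buf[i] = (char)('A' + (buf[i] - 'A' + 13) % 26);
    }
}

/* Apply the transform to every buffer of a scatter/gather vector. */
static void toy_rot13_iovec(struct iovec *vector, int count)
{
    assert(vector != NULL || count == 0);
    for (int i = 0; i < count; i++)
        toy_rot13(vector[i].iov_base, vector[i].iov_len);
}

/* Convenience wrapper for NUL-terminated text. */
static void toy_rot13_string(char *s)
{
    toy_rot13(s, strlen(s));
}
```

  Because rot-13 is its own inverse, applying toy_rot13_iovec twice restores the original bytes, which is why the same routine can serve both the write and the read path.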

  There you have it – functionality you control, implemented easily, layered on top of local storage. Now you could start adding functionality – real encryption, perhaps – and inevitably having to debug it. You could do that the old-school way, with gf_log (preferred) or even plain old printf, or you could run daemons under gdb instead.
1.4   Translator 101 Lesson 4: Debugging a Translator
  Now that we’ve learned what a translator looks like and how to build one, it’s time to run one and actually watch it work. The best way to do this is good old-fashioned gdb, as follows (using some of the examples from last time).
  [root@gfs-i8c-01 xlator_example]# gdb glusterfs
  GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.el6)
  ...
  (gdb) r --debug -f my.vol /play/import
  Starting program: /usr/sbin/glusterfs --debug -f my.vol /play/import
  ...
  [2011-11-23 11:23:16.495516] I [fuse-bridge.c:2971:fuse_init]  0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel  7.13
  If you get to this point, your glusterfs client process is already running. You can go to another window to see the mountpoint, do file operations, etc.
  [root@gfs-i8c-01 ~]# df /play/import
  Filesystem           1K-blocks      Used Available Use% Mounted on
  /root/xlator_example/my.vol
  114506240   2643968 106045568   3% /play/import
  [root@gfs-i8c-01 ~]# ls /play/import
  a_file
  [root@gfs-i8c-01 ~]# cat /play/import/a_file
  hello
  Now let’s interrupt the process and see where we are.
  ^C
  Program received signal SIGINT, Interrupt.
  0x0000003a0060b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from  /lib64/libpthread.so.0
  (gdb) info threads
  5 Thread 0x7fffeffff700 (LWP  27206)  0x0000003a002dd8c7 in readv ()
  from /lib64/libc.so.6
  4 Thread 0x7ffff50e3700 (LWP  27205)  0x0000003a0060b75b in  pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  3 Thread 0x7ffff5f02700 (LWP 27204)  0x0000003a0060b3dc in  pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  2 Thread 0x7ffff6903700 (LWP  27203)  0x0000003a0060f245 in sigwait  ()
  from /lib64/libpthread.so.0
  * 1 Thread 0x7ffff7957700 (LWP 27196)   0x0000003a0060b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from  /lib64/libpthread.so.0
  Like any non-toy server, this one has multiple threads. What are they all doing? Honestly, even I don’t know. Thread 1 turns out to be in event_dispatch_epoll, which means it’s the one handling all of our network I/O. Note that with the socket multi-threading patch this will change, with one thread in socket_poller per connection. Thread 2 is in glusterfs_sigwaiter, which means signals will be isolated to that thread. Thread 3 is in syncenv_task, so it’s a worker process for synchronous requests such as those used by the rebalance and repair code. Thread 4 is in janitor_get_next_fd, so it’s waiting for a chance to close no-longer-needed file descriptors on the local filesystem. (I admit I had to look that one up, BTW.) Lastly, thread 5 is in fuse_thread_proc, so it’s the one fetching requests from our FUSE interface. You’ll often see many more threads than this, but it’s a pretty good basic set. Now, let’s set a breakpoint so we can actually watch a request.
  (gdb) b rot13_writev
  Breakpoint 1 at 0x7ffff50e4f0b: file rot-13.c, line 119.
  (gdb) c
  Continuing.
  At this point we go into our other window and do something that will involve a write.
  [root@gfs-i8c-01 ~]# echo goodbye > /play/import/another_file
  (back to the first window)
  [Switching to Thread 0x7fffeffff700 (LWP 27206)]
  Breakpoint 1, rot13_writev (frame=0x7ffff6e4402c, this=0x638440,  fd=0x7ffff409802c,
  vector=0x7fffe8000cd8,  count=1, offset=0, iobref=0x7fffe8001070) at rot-13.c:119
  119             rot_13_private_t  *priv = (rot_13_private_t *)this->private;
  Remember how we built with debugging symbols enabled and no optimization? That will be pretty important for the next few steps. As you can see, we’re in rot13_writev, with several parameters.
  frame is our always-present frame pointer for this request. Also, frame->local will point to any local data we created and attached to the request ourselves.
  this is a pointer to our instance of the rot-13 translator. You can examine it if you like to see the name, type, options, parent/children, inode table, and other stuff associated with it.
  fd is a pointer to a file-descriptor object (fd_t, not just a file-descriptor index which is what most people use “fd” for). This in turn points to an inode object (inode_t) and we can associate our own rot-13-specific data with either of these.
  vector and count together describe the data buffers for this write, which we’ll get to in a moment.
  offset is the offset into the file at which we’re writing.
  iobref is a buffer-reference object, which is used to track the life cycle of buffers containing read/write data. If you look closely, you’ll notice that vector[0].iov_base points to the same address as iobref->iobrefs[0]->ptr, which should give you some idea of how vector and iobref are related.
  OK, now what about that vector? We can use it to examine the data being written, like this.
  (gdb) p vector[0]
  $2 = {iov_base = 0x7ffff7936000, iov_len = 8}
  (gdb) x/s 0x7ffff7936000
  0x7ffff7936000: "goodbye\n"
  It’s not always safe to view this data as a string, because it might just as well be binary data, but since we’re generating the write this time it’s safe and convenient. With that knowledge, let’s step through things a bit.
  (gdb) s
  120             if  (priv->encrypt_write)
  (gdb)
  121                     rot13_iovec  (vector, count);
  (gdb)
  rot13_iovec (vector=0x7fffe8000cd8, count=1) at rot-13.c:57
  57              for (i = 0; i <  count; i++) {
  (gdb)
  58                      rot13 (vector[i].iov_base, vector[i].iov_len);
  (gdb)
  rot13 (buf=0x7ffff7936000 "goodbye\n", len=8) at rot-13.c:45
  45              for (i = 0; i < len; i++) {
  (gdb)
  46                      if (buf[i] >= 'a' && buf[i]
  They weigh in at 224 and 229 lines respectively, with some of that taken up by licenses and white space. Each took less than a day to write. Please bear in mind, though, that these are only prototypes. They exist to teach and to make a point, not – in their current form – to be used in production. Making them suitable for real-world use would at least double their size.
  negative-lookup / negative.h
  #ifndef __NEGATIVE_H__
  #define __NEGATIVE_H__

  #ifndef _CONFIG_H
  #define _CONFIG_H
  #include "config.h"
  #endif

  #include "mem-types.h"
  #include "hashfn.h"

  #define GHOST_BUCKETS 64
  #define GHOST_HASH(x) (SuperFastHash(x,strlen(x)) % GHOST_BUCKETS)

  typedef struct _ghost {
          struct _ghost *next;
          char *path;
  } ghost_t;

  typedef struct {
          ghost_t *ghosts[GHOST_BUCKETS];
  } negative_private_t;

  enum gf_negative_mem_types_ {
          gf_negative_mt_priv = gf_common_mt_end + 1,
          gf_negative_mt_ghost,
          gf_negative_mt_end
  };

  #endif /* __NEGATIVE_H__ */
  negative-lookup / negative.c
  #include <errno.h>   /* ENOENT */
  #include <string.h>  /* strcmp, strlen */

  #ifndef _CONFIG_H
  #define _CONFIG_H
  #include "config.h"
  #endif

  #include "glusterfs.h"
  #include "xlator.h"
  #include "logging.h"

  #include "negative.h"

  /* Remove a path from the negative cache, if it is there. */
  void
  exorcise (xlator_t *this, char *spirit)
  {
          negative_private_t *priv = this->private;
          ghost_t *gp = NULL;
          ghost_t **gpp = NULL;
          uint32_t bucket = 0;

          bucket = GHOST_HASH(spirit);
          for (gpp = &priv->ghosts[bucket]; *gpp; gpp = &(*gpp)->next) {
                  gp = *gpp;
                  if (!strcmp(gp->path,spirit)) {
                          *gpp = gp->next;
                          GF_FREE(gp->path);
                          GF_FREE(gp);
                          gf_log(this->name,GF_LOG_DEBUG,"removed %s",spirit);
                          break;
                  }
          }
  }

  int32_t
  negative_lookup_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                       int32_t op_ret, int32_t op_errno, inode_t *inode,
                       struct iatt *buf, dict_t *dict, struct iatt *postparent)
  {
          negative_private_t *priv = this->private;
          ghost_t *gp = NULL;
          uint64_t ctx = 0;
          uint32_t bucket = 0;

          inode_ctx_get(inode,this,&ctx);
          if (op_ret < 0) {
                  gp = GF_CALLOC(1,sizeof(ghost_t),gf_negative_mt_ghost);
                  if (gp) {
                          gp->path = (char *)ctx;
                          bucket = GHOST_HASH(gp->path);
                          /* TBD: locking */
                          gp->next = priv->ghosts[bucket];
                          priv->ghosts[bucket] = gp;
                          gf_log(this->name,GF_LOG_DEBUG,"added %s",
                                 (char *)ctx);
                          goto unwind;
                  }
          }
          else {
                  gf_log(this->name,GF_LOG_DEBUG,"found %s", (char *)ctx);
                  exorcise(this,(char *)ctx);
          }

          /* Both positive result and allocation failure come here. */
          GF_FREE((void *)ctx);

  unwind:
          STACK_UNWIND_STRICT (lookup, frame, op_ret, op_errno, inode, buf,
                               dict, postparent);
          return 0;
  }

  int32_t
  negative_lookup (call_frame_t *frame, xlator_t *this, loc_t *loc,
                   dict_t *xattr_req)
  {
          negative_private_t *priv = this->private;
          ghost_t *gp = NULL;
          uint32_t bucket = 0;

          bucket = GHOST_HASH(loc->path);
          for (gp = priv->ghosts[bucket]; gp; gp = gp->next) {
                  if (!strcmp(gp->path,loc->path)) {
                          gf_log(this->name,GF_LOG_DEBUG,"%s (%p) => HIT",
                                 loc->path, loc->inode);
                          STACK_UNWIND_STRICT (lookup, frame, -1, ENOENT,
                                               NULL, NULL, NULL, NULL);
                          return 0;
                  }
          }

          gf_log(this->name,GF_LOG_DEBUG,"%s (%p) => MISS",
                 loc->path, loc->inode);
          inode_ctx_put(loc->inode,this,(uint64_t)gf_strdup(loc->path));
          STACK_WIND (frame, negative_lookup_cbk, FIRST_CHILD(this),
                      FIRST_CHILD(this)->fops->lookup, loc, xattr_req);
          return 0;
  }

  int32_t
  negative_create_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                       int32_t op_ret, int32_t op_errno, fd_t *fd,
                       inode_t *inode, struct iatt *buf,
                       struct iatt *preparent, struct iatt *postparent)
  {
          uint64_t ctx = 0;

          inode_ctx_get(inode,this,&ctx);
          exorcise(this,(char *)ctx);
          GF_FREE((void *)ctx);
          STACK_UNWIND_STRICT (create, frame, op_ret, op_errno, fd, inode, buf,
                               preparent, postparent);
          return 0;
  }

  int32_t
  negative_create (call_frame_t *frame, xlator_t *this, loc_t *loc,
                   int32_t flags, mode_t mode, fd_t *fd, dict_t *params)
  {
          inode_ctx_put(loc->inode,this,(uint64_t)gf_strdup(loc->path));
          STACK_WIND (frame, negative_create_cbk, FIRST_CHILD(this),
                      FIRST_CHILD(this)->fops->create, loc, flags, mode, fd,
                      params);
          return 0;
  }

  int32_t
  negative_mkdir_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                      int32_t op_ret, int32_t op_errno, inode_t *inode,
                      struct iatt *buf, struct iatt *preparent,
                      struct iatt *postparent)
  {
          uint64_t ctx = 0;

          inode_ctx_get(inode,this,&ctx);
          exorcise(this,(char *)ctx);
          GF_FREE((void *)ctx);
          STACK_UNWIND_STRICT (mkdir, frame, op_ret, op_errno, inode,
                               buf, preparent, postparent);
          return 0;
  }

  int
  negative_mkdir (call_frame_t *frame, xlator_t *this, loc_t *loc, mode_t mode,
                  dict_t *params)
  {
          inode_ctx_put(loc->inode,this,(uint64_t)gf_strdup(loc->path));
          STACK_WIND (frame, negative_mkdir_cbk, FIRST_CHILD(this),
                      FIRST_CHILD(this)->fops->mkdir, loc, mode, params);
          return 0;
  }

  int32_t
  init (xlator_t *this)
  {
          negative_private_t *priv = NULL;

          if (!this->children || this->children->next) {
                  gf_log ("negative", GF_LOG_ERROR,
                          "FATAL: negative should have exactly one child");
                  return -1;
          }

          if (!this->parents) {
                  gf_log (this->name, GF_LOG_WARNING,
                          "dangling volume. check volfile ");
          }

          priv = GF_CALLOC (1, sizeof (negative_private_t), gf_negative_mt_priv);
          if (!priv)
                  return -1;

          this->private = priv;
          gf_log ("negative", GF_LOG_DEBUG, "negative xlator loaded");
          return 0;
  }

  void
  fini (xlator_t *this)
  {
          negative_private_t *priv = this->private;

          if (!priv)
                  return;
          this->private = NULL;
          GF_FREE (priv);

          return;
  }

  struct xlator_fops fops = {
          .lookup = negative_lookup,
          .create = negative_create,
          .mkdir  = negative_mkdir,
  };

  struct xlator_cbks cbks = {
  };

  struct volume_options options[] = {
          { .key = {NULL} },
  };
  negative-lookup / Makefile
  # Change these to match your source code.
  TARGET = negative.so
  OBJECTS = negative.o

  # Change these to match your environment.
  GLFS_SRC = /root/glusterfs_patches
  GLFS_ROOT = /opt/glusterfs
  GLFS_VERS = 3git
  GLFS_LIB = $(GLFS_ROOT)/$(GLFS_VERS)/lib64
  HOST_OS = GF_LINUX_HOST_OS

  # You shouldn't need to change anything below here.

  CFLAGS = -fPIC -Wall -O0 -g \
          -DHAVE_CONFIG_H -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D$(HOST_OS) \
          -I$(GLFS_SRC) -I$(GLFS_SRC)/libglusterfs/src \
          -I$(GLFS_SRC)/contrib/uuid -I.
  LDFLAGS = -shared -nostartfiles -L$(GLFS_LIB) -lglusterfs -lpthread

  $(TARGET): $(OBJECTS)
          $(CC) $(CFLAGS) $(OBJECTS) $(LDFLAGS) -o $(TARGET)

  install: $(TARGET)
          cp $(TARGET) $(GLFS_LIB)/glusterfs/$(GLFS_VERS)/xlator/features

  clean:
          rm -f $(TARGET) $(OBJECTS)
  negative-lookup / README.md
  This is a very simple translator to cache "negative lookups" for workloads in which the same file is looked up many times in places where it doesn't exist. In particular, web script files with many includes/requires and long paths can generate hundreds of such lookups per front-end request. If we don't cache the negative results, this can mean hundreds of back-end network round trips per front-end request. So we cache. Very simple tests for this kind of workload on two machines connected via GigE show an approximately 3x performance improvement.
  This code is nowhere near ready for production use yet. It was originally developed as a pedagogical example, but one that could lead to something truly useful as well. Among other things, the following features need to be added.
  · Support for other namespace-modifying operations - link, symlink, mknod, rename, even funky xattr requests.
  · Time-based cache expiration to cover the case where another client creates a file that's in our cache because it wasn't there when we first looked it up. This might even include periodic pruning of entries that are already stale but will never be looked up (and therefore never reaped in-line) again.
  · Locking on the cache for when we're called concurrently.
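  As a rough idea of what the time-based expiration item could look like, here is a stand-alone sketch of a hashed negative cache whose entries go stale after a fixed lifetime. It is not part of the translator: every toy_* name, the ten-second TTL, and the djb2-style hash (used in place of SuperFastHash) are assumptions for illustration, plain malloc/free replace the GlusterFS allocators, and there is still no locking.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define TOY_BUCKETS 64
#define TOY_TTL_SECONDS 10   /* hypothetical lifetime of a negative entry */

typedef struct toy_ghost {
    struct toy_ghost *next;
    char *path;
    time_t added;            /* when the failed lookup was recorded */
} toy_ghost_t;

typedef struct {
    toy_ghost_t *ghosts[TOY_BUCKETS];
} toy_cache_t;

/* Simple djb2-style string hash, used here in place of SuperFastHash. */
static unsigned toy_hash(const char *s)
{
    unsigned h = 5381;

    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % TOY_BUCKETS;
}

/* Record a failed lookup. Returns 0 on success, -1 on allocation failure. */
static int toy_add(toy_cache_t *c, const char *path, time_t now)
{
    unsigned bucket = toy_hash(path);
    toy_ghost_t *gp = calloc(1, sizeof(*gp));

    if (!gp)
        return -1;
    gp->path = malloc(strlen(path) + 1);
    if (!gp->path) {
        free(gp);
        return -1;
    }
    strcpy(gp->path, path);
    gp->added = now;
    gp->next = c->ghosts[bucket];
    c->ghosts[bucket] = gp;
    return 0;
}

/* Returns 1 if path has a fresh negative entry, 0 otherwise. Any stale
 * entry met while walking the bucket is unlinked and freed in-line. */
static int toy_is_ghost(toy_cache_t *c, const char *path, time_t now)
{
    toy_ghost_t **gpp;

    assert(c != NULL);
    gpp = &c->ghosts[toy_hash(path)];
    while (*gpp) {
        toy_ghost_t *gp = *gpp;

        if (now - gp->added >= TOY_TTL_SECONDS) {
            *gpp = gp->next;     /* stale: reap and keep scanning */
            free(gp->path);
            free(gp);
            continue;
        }
        if (strcmp(gp->path, path) == 0)
            return 1;
        gpp = &gp->next;
    }
    return 0;
}
```

  Passing the current time in as a parameter keeps the sketch easy to exercise; in-line reaping only touches the bucket being searched, so entries that are never looked up again would still need the periodic pruning mentioned above.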
  This is intended to be a learning tool. I might not get back to this code myself for a long time, but I always have time to help anyone who's learning to write translators. If you want to help move it along, please fork and send me pull requests.
  For more information on writing GlusterFS translators, check out my "Translator 101" series:
  · http://hekafs.org/index.php/2011/11/translator-101-class-1-setting-the-stage/
  · http://hekafs.org/index.php/2011/11/translator-101-lesson-2-init-fini-and-private-context/
  · http://hekafs.org/index.php/2011/11/translator-101-lesson-3-this-time-for-real/
  · http://hekafs.org/index.php/2011/11/translator-101-lesson-4-debugging-a-translator/
  Module 2:
  bypass / bypass.h
  /*
   * Copyright (c) 2011 Red Hat
   */

  #ifndef __bypass_H__
  #define __bypass_H__

  #ifndef _CONFIG_H
  #define _CONFIG_H
  #include "config.h"
  #endif

  #include "mem-types.h"

  /* Deal with casts for 32-bit architectures. */
  #define CAST2INT(x) ((uint64_t)(long)(x))
  #define CAST2PTR(x) ((void *)(long)(x))

  typedef struct {
          xlator_t *target;
  } bypass_private_t;

  enum gf_bypass_mem_types_ {
          gf_bypass_mt_priv_t = gf_common_mt_end + 1,
          gf_by_mt_int32_t,
          gf_bypass_mt_end
  };

  #endif /* __bypass_H__ */
  bypass / bypass.c
  /*
   * Copyright (c) 2011 Red Hat
   */

  #include <arpa/inet.h>  /* byte-order conversion */
  #include <stdint.h>     /* int32_t, uint64_t */

  #ifndef _CONFIG_H
  #define _CONFIG_H
  #include "config.h"
  #endif

  #include "glusterfs.h"
  #include "call-stub.h"
  #include "defaults.h"
  #include "logging.h"
  #include "xlator.h"

  #include "bypass.h"

  /* Reads go straight to the target child instead of through AFR. */
  int32_t
  bypass_readv (call_frame_t *frame, xlator_t *this, fd_t *fd, size_t size,
                off_t offset)
  {
          bypass_private_t *priv = this->private;

          STACK_WIND (frame, default_readv_cbk, priv->target,
                      priv->target->fops->readv, fd, size, offset);
          return 0;
  }
  dict_t *
  get_pending_dict (xlator_t *this)
  {
  dict_t *dict = NULL;
  xlator_list_t *trav = NULL;
  char *key = NULL;
  int32_t *value = NULL;
  xlator_t  *afr = NULL;
  bypass_private_t  *priv = this->private;
  dict = dict_new();
  if (!dict) {
  gf_log (this->name, GF_LOG_WARNING, "failed  to allocate dict");
  return  NULL;
  }
  afr =  this->children->xlator;
  for (trav = afr->children;  trav; trav = trav->next) {
  if  (trav->xlator == priv->target) {
  continue;
  }
  if (gf_asprintf(&key,"trusted.afr.%s",trav->xlator->name)  < 0) {
  gf_log (this->name, GF_LOG_WARNING,
  "failed to allocate key");
  goto free_dict;
  }
  value = GF_CALLOC(3,sizeof(*value),gf_by_mt_int32_t);
  if (!value) {
  gf_log (this->name, GF_LOG_WARNING,
  "failed to allocate value");
  goto free_key;
  }
  /* Amazingly,  there's no constant for this. */
  value[0] = htonl(1);  /* each counter is a 32-bit value in network byte order */
  if (dict_set_dynptr(dict,key,value,3*sizeof(*value))  < 0) {
  gf_log (this->name, GF_LOG_WARNING,
  "failed to set up dict");
  goto free_value;
  }
  }
  return  dict;
  free_value:
  GF_FREE(value);
  free_key:
  GF_FREE(key);
  free_dict:
  dict_unref(dict);
  return  NULL;
  }
  int32_t
  bypass_set_pending_cbk (call_frame_t  *frame, void *cookie, xlator_t *this,
  int32_t op_ret, int32_t op_errno,  dict_t *dict)
  {
  if (op_ret < 0) {
  goto unwind;
  }
  call_resume(cookie);
  return 0;
  unwind:
  STACK_UNWIND_STRICT  (writev, frame, op_ret, op_errno, NULL, NULL);
  return  0;
  }
  int32_t
  bypass_writev_resume (call_frame_t  *frame, xlator_t *this, fd_t *fd,
  struct  iovec *vector, int32_t count, off_t off,
  struct  iobref *iobref)
  {
  bypass_private_t  *priv = this->private;
  STACK_WIND  (frame, default_writev_cbk, priv->target,
  priv->target->fops->writev,  fd, vector, count, off,
  iobref);
  return  0;
  }
  int32_t
  bypass_writev (call_frame_t  *frame, xlator_t *this, fd_t *fd,
  struct  iovec *vector, int32_t count, off_t off,
  struct  iobref *iobref)
  {
  dict_t *dict = NULL;
  call_stub_t *stub = NULL;
  bypass_private_t  *priv = this->private;
  /*
  * I wish we  could just create the stub pointing to the target's
  * writev  function, but then we'd get into another translator's code
  * with  "this" pointing to us.
  */
  stub = fop_writev_stub(frame,  bypass_writev_resume,
  fd, vector, count, off, iobref);
  if (!stub) {
  gf_log (this->name, GF_LOG_WARNING, "failed  to allocate stub");
  goto wind;
  }
  dict = get_pending_dict(this);
  if (!dict) {
  gf_log (this->name, GF_LOG_WARNING, "failed to allocate dict");
  goto free_stub;
  }
  STACK_WIND_COOKIE (frame, bypass_set_pending_cbk, stub,
  priv->target, priv->target->fops->fxattrop,
  fd, GF_XATTROP_ADD_ARRAY, dict);
  /* Drop our reference now that the call has been wound. */
  dict_unref(dict);
  return 0;
  free_stub:
  call_stub_destroy(stub);
  wind:
  /* dict is always NULL on this path, so there is nothing to unref. */
  STACK_WIND  (frame, default_writev_cbk, FIRST_CHILD(this),
  FIRST_CHILD(this)->fops->writev,  fd, vector, count, off,
  iobref);
  return  0;
  }
  /*
  * Even  applications that only read seem to call this, and it can force an
  * unwanted  self-heal.
  * TBD: there are  probably more like this - stat, open(O_RDONLY), etc.
  */
  int32_t
  bypass_fstat (call_frame_t  *frame, xlator_t *this, fd_t *fd)
  {
  bypass_private_t  *priv = this->private;
  STACK_WIND  (frame, default_fstat_cbk, priv->target,
  priv->target->fops->fstat,  fd);
  return  0;
  }
  int32_t
  init (xlator_t *this)
  {
  xlator_t *tgt_xl = NULL;
  bypass_private_t *priv = NULL;
  if (!this->children ||  this->children->next) {
  gf_log (this->name, GF_LOG_ERROR,
  "FATAL: bypass should have exactly one child");
  return -1;
  }
  tgt_xl = this->children->xlator;
  /* TBD: check for cluster/afr as well */
  if (strcmp(tgt_xl->type,"cluster/replicate")) {
  gf_log (this->name, GF_LOG_ERROR,
  "%s must be loaded above cluster/replicate",
  this->type);
  return -1;
  }
  /* TBD: pass  target-translator name as an option (instead of first) */
  tgt_xl = tgt_xl->children->xlator;

  priv = GF_CALLOC (1, sizeof (*priv), gf_bypass_mt_priv_t);
  if (!priv)
  return  -1;
  priv->target = tgt_xl;
  this->private = priv;
  gf_log (this->name, GF_LOG_DEBUG, "bypass  xlator loaded");
  return 0;
  }
  void
  fini (xlator_t *this)
  {
  bypass_private_t *priv = this->private;
  if  (!priv)
  return;
  this->private  = NULL;
  GF_FREE (priv);
  return;
  }
  struct xlator_fops fops = {
  .readv =  bypass_readv,
  .writev = bypass_writev,
  .fstat =  bypass_fstat
  };
  struct xlator_cbks cbks = {
  };
  struct volume_options options[] = {
  { .key = {NULL} },
  };
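  For every AFR child other than the target, get_pending_dict in the listing above builds a three-element int32_t array and sets its first slot, which is then applied through fxattrop with GF_XATTROP_ADD_ARRAY. The sketch below is a standalone model of that arithmetic, assuming the counters are 32-bit values kept in network byte order. The names pending_add, pending_get and PENDING_SLOTS are hypothetical and exist only for illustration; they are not GlusterFS APIs.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

#define PENDING_SLOTS 3   /* assumed: three counters per xattr value */

/* Hypothetical stand-in for an add-array operation: adds src to dest
 * element-wise, treating both as network-byte-order 32-bit counters. */
static void
pending_add (int32_t *dest, const int32_t *src, int count)
{
        assert (count >= 0);
        for (int i = 0; i < count; i++) {
                uint32_t sum = ntohl ((uint32_t) dest[i]) +
                               ntohl ((uint32_t) src[i]);
                dest[i] = (int32_t) htonl (sum);
        }
}

/* Decode one slot back to host byte order. */
static uint32_t
pending_get (const int32_t *array, int slot)
{
        return ntohl ((uint32_t) array[slot]);
}
```

  With delta[0] = htonl(1), each call to pending_add raises the decoded first counter by exactly one. A 16-bit htons(1) stored in the same 32-bit slot would decode as 65536 on a little-endian host, which is why the width of the conversion matters here.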
  bypass / Makefile
  # Change these to match your source code.
  TARGET = bypass.so
  OBJECTS = bypass.o
  # Change these to match your environment.
  GLFS_SRC = /root/glusterfs_patches
  GLFS_ROOT = /opt/glusterfs
  GLFS_VERS = 3git
  GLFS_LIB = `ls -d $(GLFS_ROOT)/$(GLFS_VERS)/lib*`
  HOST_OS = GF_LINUX_HOST_OS
  # You shouldn't need to change anything below here.
  CFLAGS = -fPIC -Wall -O0 -g \
  -DHAVE_CONFIG_H -D_FILE_OFFSET_BITS=64  -D_GNU_SOURCE -D$(HOST_OS) \
  -I$(GLFS_SRC) -I$(GLFS_SRC)/libglusterfs/src  \
  -I$(GLFS_SRC)/contrib/uuid -I.
  LDFLAGS = -shared -nostartfiles -L$(GLFS_LIB)  -lglusterfs -lpthread
  $(TARGET): $(OBJECTS)
  $(CC) $(CFLAGS) $(OBJECTS)  $(LDFLAGS) -o $(TARGET)
  install: $(TARGET)
  cp $(TARGET) $(GLFS_LIB)/glusterfs/$(GLFS_VERS)/xlator/features
  clean:
  rm -f $(TARGET) $(OBJECTS)
  bypass / bytest-fuse.vol
  volume bytest-posix-0
  type storage/posix
  option directory  /export/bytest1
  end-volume
  volume bytest-locks-0
  type features/locks
  subvolumes bytest-posix-0
  end-volume
  volume bytest-client-1
  type protocol/client
  option remote-host gfs1
  option remote-subvolume  /export/bytest2
  option transport-type tcp
  end-volume
  volume bytest-replicate-0
  type cluster/replicate
  subvolumes bytest-locks-0  bytest-client-1
  end-volume
  volume bytest-bypass
  type features/bypass
  subvolumes bytest-replicate-0
  end-volume
  volume bytest
  type debug/io-stats
  option latency-measurement off
  option count-fop-hits off
  subvolumes bytest-bypass
  end-volume
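  In this volfile, bytest-bypass has a single subvolume (bytest-replicate-0), and the first subvolume listed for the replicate volume is the local bytest-locks-0. That is the translator init() in bypass.c picks as its target. The sketch below models that walk with toy structures. The names toy_xlator, toy_list and pick_target are hypothetical; they only mimic the fields used in the listing and are not the real xlator_t definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_xlator;

/* Singly linked list of children, mimicking the children/next fields. */
struct toy_list {
        struct toy_xlator *xlator;
        struct toy_list   *next;
};

struct toy_xlator {
        const char      *name;
        const char      *type;
        struct toy_list *children;
};

/* Return the first child of the replicate translator directly below
 * "self", or NULL if the graph does not have the expected shape. */
static struct toy_xlator *
pick_target (const struct toy_xlator *self)
{
        const struct toy_xlator *below = NULL;

        assert (self != NULL);
        /* Exactly one child is required, as in init(). */
        if (!self->children || self->children->next)
                return NULL;
        below = self->children->xlator;
        if (strcmp (below->type, "cluster/replicate") != 0)
                return NULL;
        /* Extra guard that the listing does not have. */
        if (!below->children)
                return NULL;
        return below->children->xlator;
}
```

  Because the choice depends on subvolume order, swapping the two names on the replicate volume's subvolumes line would make the remote client the target instead of the local brick.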
  bypass / README.md

  This is a proof-of-concept translator for an [...]
  ·          If multiple clients try to write the same file with bypass turned on, you'll get massive split-brain problems. Solutions might include honoring AFR's quorum-enforcement rules, auto-issuing locks during open to prevent such concurrent access, or simply documenting the fact that users must do such locking themselves. The last might sound like a cop-out, but such locking is already common for the likely use case of serving virtual-machine images.
  ·          We only intercept readv, writev, and fstat. There are many other calls that can trigger self-heal, including plain old lookup. The only way to prevent a lookup self-heal would be to put another translator below AFR to intercept xattr requests and pretend everything's OK. Ick. Remember, though, that this is only a proof of concept. If we really wanted to get serious about this, we could implement the same technique within AFR and do all the necessary coordination there.
  ·          It would be nice if the AFR subvolume to use could be specified as an option (instead of just picking the first child), if bypass could be made selective, etc.
  The coolest direction to go here would be to put information about writes we've seen onto a queue, with a separate process listening on that queue to perform asynchronous but nearly immediate self-heal on just those files. As long as the other consistency issues are handled properly, this might be a really easy way to get near-local performance for virtual-machine-image use cases without introducing consistency/recovery nightmares.

